ICLR 2026 · Rio de Janeiro

Weight Teleportation for Attack-Resilient Unlearning Protocols

A plug-and-play symmetry defense that reshapes the geometry of approximate unlearning — reducing privacy leakage without retraining, added noise, or changing the unlearning algorithm.

Imperial College London · Dartmouth College

Approximate machine unlearning leaks: the difference Δθ = θu − θorg encodes a forget-set gradient that attackers can invert. WARP interleaves a loss-invariant teleportation step with unlearning — walking along a symmetry orbit of the network to shrink forget-set gradients and disperse parameters, all while keeping predictions on the retain set intact. Result: up to −64% black-box and −92% white-box adversary advantage across six unlearning methods, with utility preserved.

The privacy gap in unlearning

Unlearning is efficient — but privacy is not automatic.

An adversary who holds both the original model θorg and the unlearned model θu observes their difference, which to first order is a weighted forget-set gradient. That single quantity is enough to mount strong membership inference and even data reconstruction attacks.
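To see why, a one-line first-order sketch, assuming for illustration that unlearning takes a single gradient-ascent step of size η on the forget loss (multi-step methods add higher-order terms but keep the same leading term):

```latex
% Illustrative simplification: one gradient-ascent unlearning step on the
% forget loss L_f = \sum_x w_x \ell(\theta; x) with step size \eta.
\theta_u = \theta_{\mathrm{org}} + \eta\,\nabla_\theta L_f(\theta_{\mathrm{org}})
\;\;\Longrightarrow\;\;
\Delta\theta = \theta_u - \theta_{\mathrm{org}}
             = \eta \sum_{x \in \mathcal{D}_f} w_x\,\nabla_\theta \ell(\theta_{\mathrm{org}}; x)
% Reading \Delta\theta therefore reads a weighted forget-set gradient,
% up to the known scale \eta.
```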

Fig 1. Threat model. An adversary with white-box access to both θorg and θu exploits the parameter difference Δθ for both membership inference and data reconstruction.
Leak source · 01

Large forget-set gradient norms

Samples with high gradient magnitude leave a strong, detectable signature in Δθ, proportional to how far the unlearning update had to move the parameters.

Leak source · 02

Small parameter displacement

Retain-loss regularisation keeps θu close to θorg. The forget signal stays readable above the noise.
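These two sources combine into a simple white-box test: check how well Δθ aligns with a candidate sample's gradient. A minimal PyTorch sketch (the cosine-score form and the function names are our illustrative assumptions, not the paper's exact attack):

```python
import torch
import torch.nn.functional as F

def grad_diff_score(model_org, model_unl, loss_fn, x, y):
    """Illustrative white-box membership score: alignment between the
    observed parameter difference delta = theta_u - theta_org and the
    per-sample gradient at theta_org. High alignment suggests (x, y)
    was in the forget set; both leak sources above push it higher."""
    params = list(model_org.parameters())
    grads = torch.autograd.grad(loss_fn(model_org(x), y), params)
    g = torch.cat([gr.reshape(-1) for gr in grads])

    # Parameter difference the adversary observes.
    delta = torch.cat([
        (pu - po).reshape(-1)
        for pu, po in zip(model_unl.parameters(), params)
    ])
    return F.cosine_similarity(delta, g, dim=0).item()
```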

Gradient norm predicts privacy risk

The two knobs WARP turns.

Fig 2. Privacy risk grows with the gradient norm of a forgotten sample (U-LiRA, NGP on CIFAR-10). Samples with larger gradient magnitudes are measurably more exposed after unlearning.

WARP — symmetry & teleportation

Move along the loss-invariant manifold. Forget the path, keep the function.

Symmetry. A transformation g of the parameters θ that preserves the input-to-output map, and hence the loss: L(g·θ) = L(θ). Teleportation. The update θ ↦ g·θ along this manifold: the network computes the same function, but its parameters change, and so does the geometry of Δθ.
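A concrete instance: a two-layer ReLU network carries a positive-rescaling symmetry, so scaling the first layer up and the second layer down moves the weights without changing any output. A minimal numerical check (this toy rescaling is only illustrative; WARP's group 𝒢 is the retain null-space construction described below):

```python
import torch

torch.manual_seed(0)
W1, W2 = torch.randn(64, 32), torch.randn(10, 64)
x = torch.randn(8, 32)

def f(W1, W2, x):
    # Two-layer ReLU network: x -> relu(x W1^T) W2^T
    return torch.relu(x @ W1.T) @ W2.T

lam = 3.7  # any positive scale works
# Teleport g . (W1, W2) = (lam*W1, W2/lam): ReLU is positively homogeneous,
# so relu(lam*z) = lam*relu(z) and the scales cancel in the output.
print(torch.allclose(f(W1, W2, x), f(lam * W1, W2 / lam, x), atol=1e-4))
# True: identical function, different weights, different Delta-theta geometry.
```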

Fig 3. Vanilla unlearning stays close to θorg, so Δθ directly encodes the forget gradient. A teleportation step along the retain-preserving manifold yields an equivalent function with smaller forget gradients and larger parameter dispersion — the attack surface collapses while utility is retained.

WARP objective. After each unlearning step, pick a teleport g that shrinks forget-gradient energy, disperses parameters from the original, and preserves retain utility:

g* ∈ argmin_{g ∈ 𝒢}  Σ_{x ∈ 𝒟f} ‖∇θ L(g·θ; x)‖²  −  β ‖g·θ − θorg‖²    subject to    Lr(g·θ) ≤ Lr(θ) + ε

We instantiate 𝒢 via a retain null-space projection: from the SVD of the retain activations R = UΣVᵀ, we stack the top-k left singular vectors into B and form the projector Π = I − BBᵀ, then take each teleport step along the retain-orthogonal direction Πv. Predictions on 𝒟r drift only within numerical tolerance.
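A minimal sketch of the projector construction (variable names, the random stand-in for R, and the rank choice k are our assumptions; the paper's exact instantiation may differ):

```python
import torch

def retain_nullspace_projector(R, k):
    """R: (d, n) retain-set activations at a layer (one column per sample).
    B stacks the top-k left singular vectors of R; P = I - B B^T projects
    onto the retain null space, so moving weights along P @ v leaves
    retain activations (approximately) unchanged."""
    U, _, _ = torch.linalg.svd(R, full_matrices=False)
    B = U[:, :k]
    return torch.eye(R.shape[0]) - B @ B.T

d, n, k = 128, 512, 32
R = torch.randn(d, n)                 # stand-in for real retain activations
P = retain_nullspace_projector(R, k)
v = torch.randn(d)                    # candidate teleport direction
print((R.T @ (P @ v)).norm() / (R.T @ v).norm())
# < 1 here; near 0 in practice, where retain activations are low-rank
# and the top-k subspace captures almost all of their energy.
```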

Headline numbers

Six unlearning algorithms · three datasets · two architectures.

−64%
Black-box adversary advantage reduced (U-LiRA AUC, NGP)
−92%
White-box adversary advantage reduced (gradient-diff AUC, PGU)
+30%
PSNR gain of our reconstruction attack over prior gradient inversion

Membership inference — with & without WARP

WARP consistently reduces membership leakage under both attack models — most dramatically at stringent low-FPR operating points.

Method    | Black-box (U-LiRA)    | White-box (Gradient Diff) | Utility
          | AUC    TPR @ 1% FPR   | AUC    TPR @ 1% FPR       | Test Acc.
----------+-----------------------+---------------------------+----------
NGP       | 0.545  0.030          | 0.642  0.034              | 0.808
 + WARP   | 0.516  0.014          | 0.614  0.021              | 0.797
SCRUB     | 0.543  0.047          | 0.700  0.102              | 0.815
 + WARP   | 0.526  0.036          | 0.657  0.061              | 0.813
PGU       | 0.636  0.040          | 0.659  0.064              | 0.804
 + WARP   | 0.631  0.036          | 0.533  0.025              | 0.808
SalUn     | 0.572  0.062          | 0.721  0.069              | 0.802
 + WARP   | 0.565  0.059          | 0.705  0.062              | 0.803
SRF-ON    | 0.509  0.015          | 0.670  0.043              | 0.814
 + WARP   | 0.506  0.012          | 0.629  0.030              | 0.811
BadT.     | 0.725  0.177          | 0.938  0.346              | 0.816
 + WARP   | 0.661  0.137          | 0.907  0.279              | 0.818

CIFAR-10 / ResNet-18 · 64 shadow models × 10 forget sets. Full results at TPR @ {0.1, 5}% FPR, and for ViT-B/16 on Tiny-ImageNet, are in the paper.

Reconstructions collapse under WARP

Even a strong generative-prior attack fails to recover forgotten images.

Teleportation injects a component into Δθ that is nearly orthogonal to the true forget gradient gf, so the inverter collapses onto generic class priors: recoveries become label-consistent but semantically wrong.
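This collapse is measurable as the fraction of the attacker-visible update that survives along gf. A tiny helper (the name and this exact diagnostic are our illustration, not a metric from the paper):

```python
import torch

def forget_signal_fraction(delta, g_f):
    """Fraction of ||delta|| lying along the true forget gradient g_f.
    Vanilla unlearning keeps this near 1; a teleport step adds a large
    near-orthogonal component, starving the inverter of usable signal."""
    return (delta @ (g_f / g_f.norm())).abs().item() / delta.norm().item()
```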

Fig 4. ImageNet-1K · ResNet-18 · NGP. Left-to-right in each triplet: original forgotten image, NGP recovery, NGP+WARP recovery. WARP drops reconstruction PSNR by 3.4 dB and doubles test MSE.
Method    PSNR ↑   LPIPS ↓   SSIM ↑   Test MSE ↓   Feat MSE ↓
NGP       10.74    0.56      0.12     0.10          5.39
 + WARP    7.38    0.68      0.08     0.21         11.28

Three things to remember

i. Black-box audits are not enough

Unlearning methods that look safe under U-LiRA can still leak substantially under a white-box gradient-difference test.

ii. Reshape the geometry — don't inject noise

WARP reshapes Δθ along a loss-invariant manifold. No training-time statistics, no added DP noise, no change to the unlearning algorithm.

iii. Beats DP-Langevin at matched utility

WARP outperforms projected DP-Langevin unlearning at comparable retain accuracy — with an information-theoretic backing (see Appendix O).

If you find WARP useful

@inproceedings{maheri2026warp,
  title     = {Weight Teleportation for Attack-Resilient Unlearning Protocols},
  author    = {Maheri, Mohammad M. and Cadet, Xavier and Chin, Peter and Haddadi, Hamed},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2026}
}