A plug-and-play symmetry defense that reshapes the geometry of approximate unlearning — reducing privacy leakage without retraining, added noise, or changing the unlearning algorithm.
Approximate machine unlearning leaks: the difference Δθ = θu − θorg encodes a forget-set gradient that attackers can invert. WARP interleaves a loss-invariant teleportation step with unlearning, walking along a symmetry orbit of the network to shrink forget-set gradients and disperse parameters while keeping predictions on the retain set intact. Result: adversary advantage drops by up to 64% (black-box) and 92% (white-box) across six unlearning methods, with utility preserved.
Unlearning is efficient — but privacy is not automatic.
An adversary who holds both the original model θorg and the unlearned model θu observes their difference, which to first order is a weighted forget-set gradient. That single quantity is enough to mount strong membership inference and even data reconstruction attacks.
Samples with a high gradient magnitude leave a strong, detectable signature in Δθ, proportional to how far the unlearning step had to move the parameters.
Retain-loss regularisation keeps θu close to θorg. The forget signal stays readable above the noise.
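To make the threat concrete, here is a minimal sketch of a white-box gradient-difference membership score, assuming PyTorch models holding θorg and θu. The function name and the cosine score are our illustration, not the paper's exact attack.

```python
import torch
import torch.nn.functional as F

def grad_diff_score(model_org, model_unl, x, y):
    """Illustrative white-box membership score (hypothetical helper).

    Computes the alignment between Δθ = θu − θorg and the per-sample
    loss gradient at θorg. Unlearning pushes parameters against
    forget-set gradients, so strong (anti-)alignment for (x, y) is a
    membership signal. Assumes all parameters require grad.
    """
    # Δθ flattened into a single vector.
    delta = torch.cat([(pu - po).flatten() for pu, po in
                       zip(model_unl.parameters(), model_org.parameters())])
    # Per-sample gradient of the loss at the original parameters.
    loss = F.cross_entropy(model_org(x.unsqueeze(0)), y.unsqueeze(0))
    grads = torch.autograd.grad(loss, list(model_org.parameters()))
    g = torch.cat([gi.flatten() for gi in grads])
    # Cosine alignment between the model difference and the gradient.
    return F.cosine_similarity(delta, g, dim=0)
```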
The two knobs WARP turns.
Move along the loss-invariant manifold. Forget the path, keep the function.
Symmetry. A transformation g of parameters θ such that the input-to-output map is preserved: L(g·θ) = L(θ). Teleportation. Updating θ ↦ g·θ along this manifold — the network computes the same function, but its parameters change, and so does the geometry of Δθ.
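The textbook instance of such a symmetry is positive rescaling in ReLU networks; the sketch below demonstrates the invariance itself, not WARP's teleport operator:

```python
import torch

torch.manual_seed(0)
# Two-layer ReLU network: f(x) = W2 @ relu(W1 @ x).
# For any alpha > 0, relu(alpha * z) = alpha * relu(z), so the map
# (W1, W2) -> (alpha * W1, W2 / alpha) preserves the function exactly:
# a loss-invariant teleport g·θ.
W1, W2 = torch.randn(64, 32), torch.randn(10, 64)
x = torch.randn(32)

alpha = 2.5
W1_t, W2_t = alpha * W1, W2 / alpha  # teleported parameters g·θ

f_org = W2 @ torch.relu(W1 @ x)
f_tel = W2_t @ torch.relu(W1_t @ x)
print(torch.allclose(f_org, f_tel, atol=1e-5))  # True: same function,
# different parameters, hence a different geometry for Δθ.
```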
WARP objective. After each unlearning step, pick a teleport g⋆ that shrinks forget-gradient energy, disperses parameters from the original, and preserves retain utility.
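The page does not reproduce the paper's formula; one plausible form consistent with those three criteria (λ and the constraint style are our illustrative assumptions, not the paper's exact objective) is:

```latex
g^\star \;=\; \arg\min_{g \in \mathcal{G}}\;
\underbrace{\bigl\lVert \nabla_\theta \mathcal{L}(g \cdot \theta;\ \mathcal{D}_f) \bigr\rVert^2}_{\text{forget-gradient energy}}
\;-\;
\lambda \underbrace{\bigl\lVert g \cdot \theta - \theta_{\mathrm{org}} \bigr\rVert^2}_{\text{dispersion from } \theta_{\mathrm{org}}}
\qquad \text{s.t.} \qquad
\mathcal{L}(g \cdot \theta;\ \mathcal{D}_r) = \mathcal{L}(\theta;\ \mathcal{D}_r).
```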
We instantiate 𝒢 via a retain null-space projection: from the SVD of the layer-ℓ retain activations Rℓ = UℓΣℓVℓ⊤, we collect the top-k left singular vectors into Bℓ, form the projector Πℓ⊥ = I − BℓBℓ⊤, and project each teleport step through Πℓ⊥ so it stays orthogonal to the retain subspace. Predictions on 𝒟r drift only within numerical tolerance.
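A minimal sketch of that construction, assuming layer-ℓ retain activations are stacked as columns of a matrix (the helper name and the choice of k are ours):

```python
import torch

def retain_nullspace_projector(R, k):
    """Hypothetical helper: build Πℓ⊥ = I − Bℓ Bℓᵀ from retain activations.

    R: (d, n) matrix whose columns are layer-ℓ activations on the
       retain set 𝒟r.
    k: number of top left singular vectors to protect.

    Directions inside span(Bℓ) would perturb retain predictions, so each
    teleport step is projected through Πℓ⊥ to stay (numerically)
    orthogonal to them.
    """
    U, S, Vh = torch.linalg.svd(R, full_matrices=False)
    B = U[:, :k]                          # Bℓ: top-k left singular vectors
    return torch.eye(R.shape[0]) - B @ B.T

# Usage sketch: filter a raw teleport direction before applying it.
# step_safe = retain_nullspace_projector(R_layer, k=32) @ step_raw
```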
Six unlearning algorithms · three datasets · two architectures.
WARP consistently reduces membership leakage under both attack models — most dramatically at stringent low-FPR operating points.
| Method | Black-box AUC (U-LiRA) | Black-box TPR @ 1% FPR | White-box AUC (Grad. Diff.) | White-box TPR @ 1% FPR | Test Acc. |
|---|---|---|---|---|---|
| NGP | 0.545 | 0.030 | 0.642 | 0.034 | 0.808 |
| + WARP | 0.516 | 0.014 | 0.614 | 0.021 | 0.797 |
| SCRUB | 0.543 | 0.047 | 0.700 | 0.102 | 0.815 |
| + WARP | 0.526 | 0.036 | 0.657 | 0.061 | 0.813 |
| PGU | 0.636 | 0.040 | 0.659 | 0.064 | 0.804 |
| + WARP | 0.631 | 0.036 | 0.533 | 0.025 | 0.808 |
| SalUn | 0.572 | 0.062 | 0.721 | 0.069 | 0.802 |
| + WARP | 0.565 | 0.059 | 0.705 | 0.062 | 0.803 |
| SRF-ON | 0.509 | 0.015 | 0.670 | 0.043 | 0.814 |
| + WARP | 0.506 | 0.012 | 0.629 | 0.030 | 0.811 |
| BadT. | 0.725 | 0.177 | 0.938 | 0.346 | 0.816 |
| + WARP | 0.661 | 0.137 | 0.907 | 0.279 | 0.818 |
CIFAR-10 / ResNet-18 · 64 shadow models × 10 forget sets. Full TPR @ {0.1, 5}% FPR and ViT-B/16 / Tiny-ImageNet results are in the paper.
Even a strong generative-prior attack fails to recover forgotten images.
Teleportation injects a component into Δθ that is nearly orthogonal to the true forget gradient gf, so the inverter collapses onto generic class priors: recoveries become label-consistent but semantically wrong.
| Method | PSNR ↑ | LPIPS ↓ | SSIM ↑ | Test MSE ↓ | Feat MSE ↓ |
|---|---|---|---|---|---|
| NGP | 10.74 | 0.56 | 0.12 | 0.10 | 5.39 |
| + WARP | 7.38 | 0.68 | 0.08 | 0.21 | 11.28 |
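A hedged diagnostic for this claim: split Δθ into its component along the true forget gradient g_f and the orthogonal remainder (both flattened parameter vectors, as in the earlier sketch; the helper is ours, not the paper's).

```python
import torch

def forget_energy_fraction(delta, g_f):
    """Fraction of Δθ's energy that lies along the true forget gradient.

    delta: flattened Δθ = θu − θorg.
    g_f:   flattened forget-set gradient at θorg.

    The inverter can only exploit the parallel component; after WARP the
    teleport-injected, nearly orthogonal part dominates, so this fraction
    drops toward zero and recoveries collapse onto class priors.
    """
    g_hat = g_f / g_f.norm()
    parallel = torch.dot(delta, g_hat) * g_hat   # informative component
    return (parallel.norm() / delta.norm()) ** 2
```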
Unlearning methods that look safe under U-LiRA can still leak substantially under a white-box gradient-difference test (e.g., SRF-ON: 0.509 black-box AUC, near chance, but 0.670 white-box).
WARP reshapes Δθ along a loss-invariant manifold. No training-time statistics, no added DP noise, no change to the unlearning algorithm.
WARP outperforms projected DP-Langevin unlearning at comparable retain accuracy — with an information-theoretic backing (see Appendix O).