Residual RL & distillation

The residual stack is Cadenza’s answer to a hard deployment problem: you have a capable but frozen base policy (or VLA), the real robot’s dynamics don’t quite match what it was trained on, and you can’t afford to re-train the whole thing on-device. Instead you learn a small residual that nudges the base’s actions, govern it, and then distill the base+residual pair into a compact student that runs onboard without the base at all.

Residual RL and distillation need the rl extra: pip install -e ".[rl]" (installs torch). Without cadenza-lab they run on a deterministic proxy task (labelled as such); --real targets the gated cadenza-lab sim seam. The governed commands (train, eval, bench) require sign-in.

The control law

The residual never replaces the base — it corrects it:

a = clamp(a_base + α · gate · Δa)

The frozen base/VLA picks a_base; the small residual head emits Δa, scaled by α (action scale) and an optional gate. The base never receives a gradient.
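As a rough illustration of the control law above, here is a minimal sketch in plain Python. The function name `apply_residual`, the scalar gate, and the `[-1, 1]` action bounds are assumptions made for the example; the source does not specify the clamp range or how the gate is computed.

```python
from typing import List, Sequence


def apply_residual(
    a_base: Sequence[float],
    delta: Sequence[float],
    alpha: float = 0.15,
    gate: float = 1.0,
    low: float = -1.0,   # assumed lower action bound
    high: float = 1.0,   # assumed upper action bound
) -> List[float]:
    """Element-wise a = clamp(a_base + alpha * gate * delta)."""
    return [
        min(high, max(low, b + alpha * gate * d))
        for b, d in zip(a_base, delta)
    ]


# The base proposes an action; the residual only nudges it.
a = apply_residual([0.5, -0.95, 0.0], [1.0, -1.0, 0.2])

# With the gate closed (gate=0.0) the base action passes through unchanged.
passthrough = apply_residual([0.5, -0.95, 0.0], [1.0, -1.0, 0.2], gate=0.0)
```

Because α multiplies the residual before the clamp, a small α bounds how far the final action can drift from the base regardless of what the head emits.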

`env residual init`: scaffold + profile

Sets up the residual architecture and dry-run profiles the head (parameter count, per-step latency, peak memory) before you commit to a training run. No training happens here.

cadenza env residual init rescue-dog --alpha 0.15 --hidden-dim 256

| Flag | Default | Purpose |
| --- | --- | --- |
| `--hidden-dim <h>` | `256` | Residual MLP hidden width. |
| `--alpha <a>` | `0.15` | Action scale: how hard the residual is allowed to nudge. |
| `--obs-dim <n>` | base-derived | Observation width (auto-probed from the project’s encoder when available). |
| `--rate-limit <r>` | `0.5` | Per-step change limit on the residual output. |
| `--no-gate` | off | Drop the gate term (`a = clamp(a_base + α·Δa)`). |
| `--device cuda\|mps\|cpu` | auto | Profiling device (auto-selects cuda → mps → cpu). |

Writes rescue-dog/residual/residual.json.
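The `--rate-limit` flag caps how much the residual output may change from one step to the next. One plausible reading is an element-wise limit on the change in Δa, sketched below. The helper name `rate_limit` and the element-wise interpretation are assumptions; the actual implementation may limit a norm or apply the cap elsewhere.

```python
from typing import List, Sequence


def rate_limit(
    delta: Sequence[float],
    prev_delta: Sequence[float],
    limit: float = 0.5,
) -> List[float]:
    """Clamp each component's change since the previous step to [-limit, +limit]."""
    return [
        p + min(limit, max(-limit, d - p))
        for d, p in zip(delta, prev_delta)
    ]


# A head that jumps from 0 to full scale is slowed to `limit` per step.
limited = rate_limit([1.0, -1.0, 0.1], [0.0, 0.0, 0.0])
```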

`env residual train`: governed PPO on the frozen base

PPO-trains the residual head against the frozen base under a perturbation curriculum (sparse reward). The Cadenza API governs the run: it picks the hyperparameters, decides when to stop, and returns the verdict — the client only collects rollouts and runs the raw PPO step each round.

cadenza env residual train rescue-dog

On DEPLOY the policy is saved and promoted as the residual baseline; on BLOCK it rolls back to the previous residual. The trained head lands at rescue-dog/residual/residual_policy.pt.
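The promote-or-roll-back behaviour can be pictured as a small decision over the returned verdict. This is only a sketch: `Verdict` and `resolve_baseline` are hypothetical names, and treating NEEDS_DATA as "keep the current baseline" is an assumption, not documented behaviour.

```python
from enum import Enum
from typing import TypeVar

P = TypeVar("P")


class Verdict(Enum):
    DEPLOY = "DEPLOY"
    BLOCK = "BLOCK"
    NEEDS_DATA = "NEEDS_DATA"


def resolve_baseline(verdict: Verdict, candidate: P, baseline: P) -> P:
    """Return the policy that should be active after a governed run."""
    if verdict is Verdict.DEPLOY:
        return candidate  # promoted as the new residual baseline
    # BLOCK rolls back to the previous residual; NEEDS_DATA is assumed
    # to leave the baseline untouched as well.
    return baseline


active = resolve_baseline(Verdict.BLOCK, candidate="new.pt", baseline="prev.pt")
```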

`env residual eval`: govern the residual

Re-scores the trained residual on its own — success / collision / residual-sanity / regression — and returns DEPLOY | BLOCK | NEEDS_DATA. --promote sets it as the baseline if it passes. See Governance.

cadenza env residual eval rescue-dog --promote

`env residual bench`: residual vs full RL

Runs a head-to-head benchmark of the current full-RL stack against cadenza-cli’s residual RL, reporting compute, dollar cost, and accuracy, then returns a verdict on whether the residual arm wins.

cadenza env residual bench rescue-dog --steps 20000 --cost-per-gpu-hour 2.0

| Flag | Default | Purpose |
| --- | --- | --- |
| `--steps <n>` | config | Env steps per arm. |
| `--cost-per-gpu-hour <x>` | `2.0` | Dollar rate used to price each arm. |
| `--device <d>` | auto | Training device. |
| `--real` | off | Target the `cadenza-lab` sim instead of the proxy. |
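The dollar figure for each arm presumably follows from wall-clock time and the `--cost-per-gpu-hour` rate. A minimal sketch of that arithmetic follows. The function `arm_cost` and the `num_gpus` parameter are assumptions for illustration; the exact accounting used by the benchmark is not documented here.

```python
def arm_cost(
    wall_seconds: float,
    cost_per_gpu_hour: float = 2.0,
    num_gpus: int = 1,  # assumed: one device per arm
) -> float:
    """Price a training arm as GPU-hours multiplied by the hourly rate."""
    gpu_hours = wall_seconds / 3600.0 * num_gpus
    return gpu_hours * cost_per_gpu_hour


# Illustrative only: half an hour on one GPU at the default rate.
half_hour_cost = arm_cost(1800.0)
```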

Distillation: a base-free onboard student

Once a residual is deployed, env distill collects teacher (base + residual) rollouts and trains a compact student that reproduces the teacher’s behaviour without loading the base — so it runs on CPU/MPS at the control-loop rate, optionally quantized to int8.

cadenza env distill rescue-dog --epochs 60 --quantize

| Flag | Default | Purpose |
| --- | --- | --- |
| `--epochs <n>` | `60` | Distillation epochs. |
| `--quantize` | off | Export an int8 student for tighter onboard latency. |
| `--device <d>` | auto | Training device. |
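To make the teacher-to-student idea concrete, here is a toy sketch of distillation as behaviour cloning: collect (observation, teacher action) pairs, then fit a student that never calls the base. Everything in it is a stand-in: the one-dimensional `teacher`, the linear student, and the squared-error loss are assumptions for illustration, not Cadenza's actual models or objective.

```python
import random
from typing import Tuple


def teacher(obs: float) -> float:
    """Hypothetical teacher: frozen base plus scaled residual, clamped."""
    a_base = 0.8 * obs          # stand-in for the frozen base policy
    delta = -0.5 * obs          # stand-in for the trained residual head
    return max(-1.0, min(1.0, a_base + 0.15 * delta))


def distill(epochs: int = 60, lr: float = 0.1, n: int = 200, seed: int = 0) -> Tuple[float, float]:
    """Fit a linear student a = w * obs + b to teacher rollouts with SGD."""
    rng = random.Random(seed)
    observations = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    rollouts = [(o, teacher(o)) for o in observations]
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for obs, target in rollouts:
            err = (w * obs + b) - target   # gradient of 0.5 * err**2
            w -= lr * err * obs
            b -= lr * err
    return w, b


w, b = distill()
# At inference the student needs only (w, b); the base is never loaded.
student_action = w * 0.4 + b
```

In this toy case the teacher is itself linear, so the student can match it closely; a real student would trade some success rate for size, which is the gap the report measures.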

The report prints the teacher→student success gap, the student’s parameter count, and the onboard latency against the 50 Hz control budget (20 ms per step), and tells you whether the student meets or MISSES that budget. Artifacts land in rescue-dog/student/.
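A 50 Hz loop leaves 1000 / 50 = 20 ms per control step. The sketch below shows one way to time a policy step and compare it against that budget. The helper names `measure_latency_ms` and `meets_budget` are hypothetical, and the report's own measurement method may differ.

```python
import time
from typing import Callable, Sequence

CONTROL_HZ = 50
BUDGET_MS = 1000.0 / CONTROL_HZ  # 20 ms per step at 50 Hz


def measure_latency_ms(
    step_fn: Callable[[Sequence[float]], object],
    obs: Sequence[float],
    warmup: int = 10,
    iters: int = 100,
) -> float:
    """Mean wall-clock time of one step_fn call, in milliseconds."""
    for _ in range(warmup):          # discard cold-start effects
        step_fn(obs)
    start = time.perf_counter()
    for _ in range(iters):
        step_fn(obs)
    return (time.perf_counter() - start) * 1000.0 / iters


def meets_budget(latency_ms: float, budget_ms: float = BUDGET_MS) -> bool:
    """True when a step fits inside the control-loop budget."""
    return latency_ms <= budget_ms


# Stand-in student: any callable mapping an observation to an action.
latency = measure_latency_ms(lambda o: [0.1 * x for x in o], [0.0] * 32)
status = "meets" if meets_budget(latency) else "MISSES"
```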

Govern the student

cadenza env distill eval rescue-dog --promote

Scores the student on gap / success / regression and returns the same DEPLOY | BLOCK | NEEDS_DATA verdict, with rollback on BLOCK.

Full loop

# 1. scaffold + profile the residual head
cadenza env residual init rescue-dog --alpha 0.15

# 2. governed PPO on the frozen base (DEPLOY → promoted)
cadenza env residual train rescue-dog

# 3. prove it beats full RL on compute/cost/accuracy
cadenza env residual bench rescue-dog

# 4. distill into a base-free, int8 onboard student
cadenza env distill rescue-dog --quantize
cadenza env distill eval rescue-dog --promote

Every torch path here runs through Cadenza’s transparent acceleration layer — bf16, thread tuning, matmul precision, and torch.compile are applied automatically per hardware target.
