The control law
The residual never replaces the base; it corrects it. The frozen base emits a_base; the small residual head emits Δa, scaled by
α (action scale) and an optional gate. Without the gate, the law is a = clamp(a_base + α·Δa). The base never receives a gradient.
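A minimal sketch of the control law, for illustration only. It assumes the gate is a scalar multiplier in [0, 1] applied to the scaled residual and that actions are clamped to [-1, 1]; the function name, the gate form, and the bounds are assumptions, not taken from the tool.

```python
def clamp(x, lo, hi):
    """Clip a scalar to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def compose_action(a_base, delta_a, alpha=0.15, gate=1.0, lo=-1.0, hi=1.0):
    """Element-wise clamp(a_base + gate * alpha * delta_a).

    gate=1.0 reproduces the ungated law a = clamp(a_base + alpha * delta_a).
    The multiplicative gate and the [-1, 1] bounds are assumptions.
    """
    return [clamp(b + gate * alpha * d, lo, hi) for b, d in zip(a_base, delta_a)]


# The base proposes an action; the residual nudges it by at most alpha per unit of delta_a.
a_base = [0.20, -0.95]
delta_a = [1.0, -1.0]
print(compose_action(a_base, delta_a))            # second component saturates at the lower bound
print(compose_action(a_base, delta_a, gate=0.0))  # closed gate: the base action passes through
```

With the gate closed the output equals the base action, which is the sense in which the residual corrects the base rather than replacing it.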
env residual init: scaffold + profile
Establishes the residual architecture and dry-run profiles the head — params,
per-step latency, peak memory — before you commit to a training run. No training
happens here.
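To give a feel for the kind of number a dry-run profile reports, here is a rough parameter count for a residual head. It assumes a single hidden layer (observation → hidden → action); the real head's depth is not stated here, and `obs_dim=48` and `action_dim=12` are made-up example sizes.

```python
def mlp_param_count(obs_dim, hidden_dim, action_dim):
    """Weights plus biases for obs -> hidden -> action (one hidden layer, assumed)."""
    first = obs_dim * hidden_dim + hidden_dim
    second = hidden_dim * action_dim + action_dim
    return first + second


# 256 is the documented default hidden width; the other two sizes are illustrative.
params = mlp_param_count(obs_dim=48, hidden_dim=256, action_dim=12)
print(params, "parameters,", params * 4, "bytes of float32 weights")
```

Weight bytes are only a lower bound on peak memory, since activations and framework overhead come on top.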
| Flag | Default | Purpose |
|---|---|---|
| --hidden-dim <h> | 256 | Residual MLP hidden width. |
| --alpha <a> | 0.15 | Action scale: how hard the residual is allowed to nudge. |
| --obs-dim <n> | base-derived | Observation width (auto-probed from the project’s encoder when available). |
| --rate-limit <r> | 0.5 | Per-step change limit on the residual output. |
| --no-gate | off | Drop the gate term (a = clamp(a_base + α·Δa)). |
| --device cuda\|mps\|cpu | auto | Profiling device (auto-selects cuda → mps → cpu). |
Output lands at rescue-dog/residual/residual.json.
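The per-step change limit set by --rate-limit can be pictured as a slew-rate clamp on the residual output. The sketch below assumes the limit applies per component to the difference between consecutive residual outputs; whether the tool applies it before or after the α scaling is not stated here, and `rate_limit` is a hypothetical helper name.

```python
def rate_limit(prev, target, r=0.5):
    """Move each component from prev toward target by at most r per step."""
    return [p + max(-r, min(r, t - p)) for p, t in zip(prev, target)]


# A jump from 0.0 to 2.0 is spread over several control steps.
out = [0.0]
for step in range(5):
    out = rate_limit(out, [2.0], r=0.5)
    print(step, out)
```

A smaller r gives a smoother but slower correction; a larger r lets the residual react faster at the cost of sharper action changes.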
env residual train: governed PPO on the frozen base
PPO-trains the residual head against the frozen base under a perturbation
curriculum (sparse reward). The Cadenza API governs the run: it picks the
hyperparameters, decides when to stop, and returns the verdict — the client only
collects rollouts and runs the raw PPO step each round.
On DEPLOY the policy is saved and promoted as the residual baseline; on BLOCK
it rolls back to the previous residual. The trained head lands at
rescue-dog/residual/residual_policy.pt.
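The division of labour described above can be sketched as a loop in which the client only collects rollouts and applies the update, while a governor supplies hyperparameters, the stop decision, and the verdict. Everything here is a hypothetical stand-in: `StubGovernor`, its methods, the metrics, and the thresholds are invented for illustration and are not the Cadenza API.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    stop: bool
    verdict: str   # "DEPLOY", "BLOCK" or "NEEDS_DATA" once stop is True
    hparams: dict


class StubGovernor:
    """Hypothetical local stand-in for the remote governor."""

    def __init__(self, max_rounds=3, min_success=0.8):
        self.max_rounds, self.min_success, self.round = max_rounds, min_success, 0

    def initial_hparams(self):
        return {"lr": 3e-4, "clip": 0.2}

    def report(self, metrics):
        self.round += 1
        stop = self.round >= self.max_rounds
        verdict = ""
        if stop:
            verdict = "DEPLOY" if metrics["success"] >= self.min_success else "BLOCK"
        return Decision(stop, verdict, self.initial_hparams())


def collect_rollouts(policy):
    # Toy placeholder: success improves with every update applied so far.
    return {"success": min(1.0, 0.5 + 0.2 * policy["updates"])}


def ppo_step(policy, rollouts, hparams):
    # Toy placeholder for one raw PPO update on the residual head.
    return {"updates": policy["updates"] + 1}


def train(governor, baseline):
    policy, hparams = dict(baseline), governor.initial_hparams()
    while True:
        rollouts = collect_rollouts(policy)           # client side
        policy = ppo_step(policy, rollouts, hparams)  # client side
        decision = governor.report(rollouts)          # governor decides
        if decision.stop:
            break
        hparams = decision.hparams
    if decision.verdict == "DEPLOY":
        return policy, decision.verdict               # promote the new residual
    return dict(baseline), decision.verdict           # otherwise keep the previous residual


print(train(StubGovernor(), {"updates": 0}))
print(train(StubGovernor(max_rounds=1), {"updates": 0}))
```

The point of the shape is that the client never chooses when to stop or whether to promote; it only reacts to the decision it receives.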
env residual eval: govern the residual
Re-scores the trained residual on its own — success / collision / residual-sanity
/ regression — and returns DEPLOY | BLOCK | NEEDS_DATA. --promote sets it as
the baseline if it passes. See Governance.
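One way to picture how the four checks could combine into a three-way verdict is shown below. The metric names, the thresholds, and the rule that too few episodes yields NEEDS_DATA are all assumptions made for illustration; the actual scoring is defined by the governance service.

```python
def verdict(metrics, baseline, min_episodes=100, min_success=0.9,
            max_collision=0.02, max_residual_norm=1.0):
    """Toy gate: every threshold here is a hypothetical placeholder."""
    if metrics["episodes"] < min_episodes:
        return "NEEDS_DATA"  # not enough evidence to decide either way
    checks = [
        metrics["success"] >= min_success,               # success
        metrics["collision"] <= max_collision,           # collision
        metrics["residual_norm"] <= max_residual_norm,   # residual-sanity
        metrics["success"] >= baseline["success"],       # regression vs. baseline
    ]
    return "DEPLOY" if all(checks) else "BLOCK"


baseline = {"success": 0.90}
good = {"episodes": 200, "success": 0.95, "collision": 0.01, "residual_norm": 0.4}
print(verdict(good, baseline))
print(verdict({**good, "collision": 0.05}, baseline))
print(verdict({**good, "episodes": 10}, baseline))
```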
env residual bench: residual vs full RL
A head-to-head benchmark of the current full-RL stack against cadenza-cli’s
residual RL — reporting compute, dollar cost, and accuracy, then a verdict on
whether the residual arm wins.
| Flag | Default | Purpose |
|---|---|---|
| --steps <n> | config | Env steps per arm. |
| --cost-per-gpu-hour <x> | 2.0 | Dollar rate used to price each arm. |
| --device <d> | auto | Training device. |
| --real | off | Target the cadenza-lab sim instead of the proxy. |
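The dollar figure for each arm is presumably GPU time multiplied by the configured rate. The sketch below assumes cost = wall-clock hours × GPU count × rate; `arm_cost` is a hypothetical helper and the wall-clock times are made-up inputs, not benchmark results.

```python
def arm_cost(wall_seconds, n_gpus=1, cost_per_gpu_hour=2.0):
    """Price one benchmark arm from its wall-clock training time."""
    return wall_seconds / 3600.0 * n_gpus * cost_per_gpu_hour


# Illustrative inputs only: two arms with different training times on one GPU.
for name, seconds in [("full RL", 7200), ("residual", 900)]:
    print(name, "costs", arm_cost(seconds), "dollars at the default rate")
```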
Distillation: a base-free onboard student
Once a residual is deployed, env distill collects teacher (base + residual)
rollouts and trains a compact student that reproduces the teacher’s behaviour
without loading the base — so it runs on CPU/MPS at the control-loop rate,
optionally quantized to int8.
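Distillation of this kind is, in essence, supervised regression from observations to the teacher's actions. The toy below uses a one-dimensional linear student trained by SGD on a mean-squared-error loss; the teacher functions, learning rate, and data are invented for illustration and say nothing about the real student architecture.

```python
import random


def teacher(obs, alpha=0.15):
    """Toy teacher: a stand-in base plus a scaled stand-in residual, clamped to [-1, 1]."""
    a_base = 0.8 * obs   # hypothetical frozen base
    delta_a = -obs       # hypothetical residual head
    return max(-1.0, min(1.0, a_base + alpha * delta_a))


def distill(epochs=60, lr=0.1, n=200, seed=0):
    """Fit student(obs) = w * obs + b to teacher actions by per-sample SGD."""
    rng = random.Random(seed)
    observations = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    data = [(o, teacher(o)) for o in observations]  # teacher rollouts
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for obs, target in data:
            err = (w * obs + b) - target  # gradient of 0.5 * err**2 w.r.t. the prediction
            w -= lr * err * obs
            b -= lr * err
    return w, b


w, b = distill()
print("student weight and bias:", w, b)
```

After training, the student is evaluated without calling either teacher component, which is what lets it run without loading the base.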
| Flag | Default | Purpose |
|---|---|---|
| --epochs <n> | 60 | Distillation epochs. |
| --quantize | off | Export an int8 student for tighter onboard latency. |
| --device <d> | auto | Training device. |
meets or MISSES the floor). Artifacts land in rescue-dog/student/.
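For readers unfamiliar with int8 export, the following is a generic symmetric per-tensor quantization sketch. It is a textbook scheme chosen for illustration; the scheme env distill actually uses with --quantize is not stated here, and the weights are made-up values.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0  # fall back to 1.0 for all-zero input
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the int8 codes."""
    return [v * scale for v in q]


weights = [0.31, -1.0, 0.07, 0.64]  # illustrative values only
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q)
print(max(abs(a - b) for a, b in zip(weights, restored)))  # rounding error, at most scale / 2
```

The trade-off is a small, bounded rounding error per weight in exchange for a quarter of the float32 storage and cheaper integer arithmetic.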
Govern the student
Governing the student returns a DEPLOY | BLOCK | NEEDS_DATA verdict, with rollback on BLOCK.