Cadenza runs the same compute — residual RL, the GRD LoRA+RL loop, distillation,
LoRA fine-tuning, frozen-VLA inference — across very different hardware. Rather
than make you pass device flags, the CLI ships a single acceleration layer
that detects your hardware once and applies the right knobs transparently.
Nobody calls it directly; every training and inference path runs through it.
Targets
| Target | Detected as | Applied |
|---|---|---|
| Apple Silicon (local dev) | arm64 + Metal | MPS device, thread tuning, matmul precision |
| AWS Graviton (cloud CPU) | aarch64 Linux | oneDNN + ARM Compute Library, bf16, threads |
| NVIDIA Jetson Orin (deploy) | aarch64 + CUDA | CUDA, TF32/bf16, torch.compile, int8 |
| x86 + CUDA (cloud GPU) | x86_64 + CUDA | CUDA, TF32/bf16 autocast, torch.compile |
The layer chooses device, thread/interop counts, float32 matmul precision
(TF32/bf16), bf16 autocast, torch.compile, channels_last, and int8 for
deploy, and applies each technique only on the targets where it applies.
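
As a rough illustration of "each technique only where it applies", the table above can be read as a lookup from detected target to a set of knobs. This is a minimal sketch, not Cadenza's actual code: the target keys, the knob labels, and the `knobs_for` helper are all hypothetical names, and the empty result for an unknown target is an assumption standing in for plain eager fp32.

```python
from __future__ import annotations

# Hypothetical target keys and knob labels; the contents mirror the
# "Applied" column of the Targets table and nothing else.
KNOBS_BY_TARGET: dict[str, tuple[str, ...]] = {
    "apple_silicon": ("mps_device", "thread_tuning", "matmul_precision"),
    "graviton": ("onednn_acl", "bf16", "thread_tuning"),
    "jetson_orin": ("cuda", "tf32_bf16", "torch_compile", "int8"),
    "x86_cuda": ("cuda", "tf32_bf16_autocast", "torch_compile"),
}


def knobs_for(target: str) -> tuple[str, ...]:
    """Return the knobs listed for a target.

    Assumption: an unrecognized target gets no knobs at all, i.e. it
    runs as plain eager fp32.
    """
    return KNOBS_BY_TARGET.get(target, ())
```

In this reading, int8 shows up only for the deploy target, and a machine that matches no row is simply left untouched.
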
Safety first
Every aggressive step is guarded: if an accelerated step fails, the layer falls
back to eager fp32, so enabling acceleration should not break a run or leave it
slower than the unaccelerated baseline. The layer is also empirical about
traps: for example, bf16 on Apple-Silicon CPU is emulated (≈100× slower), so it
is never blindly assumed; the layer probes before committing.
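
The guard-and-probe pattern described above can be sketched in a few lines. This is an assumed shape rather than the layer's real implementation: `guarded` and `probe_is_worthwhile` are hypothetical helpers, the callables stand in for an accelerated path and the eager fp32 path, and the timing rule (best of a few repeats) is one simple choice among many.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def guarded(fast: Callable[[], T], safe: Callable[[], T]) -> T:
    """Try the accelerated path; on any exception, run the safe path instead."""
    try:
        return fast()
    except Exception:
        return safe()


def _best_time(fn: Callable[[], object], repeats: int) -> float:
    """Smallest wall-clock time over `repeats` calls of fn."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def probe_is_worthwhile(
    candidate: Callable[[], object],
    baseline: Callable[[], object],
    repeats: int = 3,
) -> bool:
    """Return True only if the candidate runs and is not slower than baseline.

    A candidate that raises is treated as not worthwhile, which keeps the
    caller on the baseline path.
    """
    try:
        return _best_time(candidate, repeats) <= _best_time(baseline, repeats)
    except Exception:
        return False
```

With a probe like this, an emulated dtype that runs far slower than fp32 is rejected by measurement instead of being trusted because the platform nominally supports it.
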
Kill-switches
Two environment variables make it debuggable:
| Variable | Effect |
|---|---|
| CADENZA_ACCEL=0 | Disable everything: pure eager fp32, base thread counts. |
| CADENZA_NO_COMPILE=1 | Keep dtype/thread tuning but skip torch.compile. |
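
One plausible way to read the two switches is shown below. Only the variable names and the documented values (`0` and `1`) come from the table; the helper names are hypothetical, and treating any other value as "unset" is an assumption of this sketch.

```python
import os
from typing import Mapping, Optional


def accel_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Acceleration is on unless CADENZA_ACCEL is exactly "0"."""
    env = os.environ if env is None else env
    return env.get("CADENZA_ACCEL", "1") != "0"


def compile_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """torch.compile is allowed only if acceleration is on and
    CADENZA_NO_COMPILE is not exactly "1"."""
    env = os.environ if env is None else env
    return accel_enabled(env) and env.get("CADENZA_NO_COMPILE", "0") != "1"
```

Passing the environment as a mapping keeps the check easy to test, and it makes the precedence explicit: CADENZA_ACCEL=0 turns off compilation too, whatever CADENZA_NO_COMPILE says.
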
Commands that train or profile print a one-line acceleration summary (e.g. in
env residual init and env residual bench), so you can see exactly what was
applied on your box.
Torch is imported lazily — the core CLI stays light, and only the rl / lora /
vla paths that already need torch trigger the acceleration layer. Hardware
detection uses the Python standard library only.