Cadenza runs the same compute — residual RL, the GRD LoRA+RL loop, distillation, LoRA fine-tuning, frozen-VLA inference — across very different hardware. Rather than make you pass device flags, the CLI ships a single acceleration layer that detects your hardware once and applies the right knobs transparently. Nobody calls it directly; every training and inference path runs through it.

Targets

| Target | Detected as | Applied |
| --- | --- | --- |
| Apple Silicon (local dev) | arm64 + Metal | MPS device, thread tuning, matmul precision |
| AWS Graviton (cloud CPU) | aarch64 Linux | oneDNN + ARM Compute Library, bf16, threads |
| NVIDIA Jetson Orin (deploy) | aarch64 + CUDA | CUDA, TF32/bf16, torch.compile, int8 |
| x86 + CUDA (cloud GPU) | x86_64 + CUDA | CUDA, TF32/bf16 autocast, torch.compile |

The layer chooses the device, thread and interop counts, float32 matmul precision (TF32/bf16), bf16 autocast, torch.compile, channels_last, and int8 for deploy, and applies each technique only on targets where it has something to act on.
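
To make the detection step concrete, here is a minimal sketch of how the four targets listed above could be told apart using only the standard library. It is an illustration, not Cadenza's actual code: the function names, the target labels, and the CUDA heuristics (an `nvidia-smi` binary on PATH, or the `/etc/nv_tegra_release` file found on Jetson images) are all assumptions.

```python
import os
import platform
import shutil


def classify_target(machine: str, system: str, has_cuda: bool) -> str:
    """Map raw platform facts onto one of the documented targets.

    Hypothetical helper: the returned labels are illustrative names only.
    """
    m = machine.lower()
    is_arm = m in ("arm64", "aarch64")
    if is_arm and system == "Darwin":
        return "apple-silicon"
    if is_arm and system == "Linux":
        # aarch64 Linux with CUDA is treated as Jetson, without CUDA as Graviton.
        return "jetson-orin" if has_cuda else "graviton"
    if m in ("x86_64", "amd64") and has_cuda:
        return "x86-cuda"
    return "generic-cpu"


def cuda_probably_present() -> bool:
    """Heuristic, stdlib-only CUDA check (an assumption, not a guarantee)."""
    return (
        shutil.which("nvidia-smi") is not None
        or os.path.exists("/etc/nv_tegra_release")
    )


def detect_target() -> str:
    """Detect the current machine without importing torch."""
    return classify_target(
        platform.machine(), platform.system(), cuda_probably_present()
    )


if __name__ == "__main__":
    print(detect_target())
```

Keeping the classification in a pure function that takes plain strings makes it easy to unit-test every target from a single development machine.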

Safety first

Every aggressive step is guarded: if anything fails, it falls back to eager fp32 so acceleration can never break or slow a run. The layer is also empirical about traps — for example, bf16 on Apple-Silicon CPU is emulated (≈100× slower), so it is never blindly assumed; the layer probes before committing.
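
The guard-and-probe pattern described above can be sketched in a few lines. This is a torch-free illustration under stated assumptions: `guarded` and `probe_is_faster` are hypothetical names, and the callables stand in for whatever the real layer runs (for instance a small bf16 matmul versus its fp32 counterpart).

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def guarded(fast: Callable[[], T], safe: Callable[[], T]) -> T:
    """Try the aggressive path; on any failure, run the safe path instead."""
    try:
        return fast()
    except Exception:
        return safe()


def probe_is_faster(
    candidate: Callable[[], object],
    baseline: Callable[[], object],
    repeats: int = 3,
) -> bool:
    """Accept the candidate only if it runs and its best time beats the baseline."""

    def best_time(fn: Callable[[], object]) -> float:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            fn()
            best = min(best, time.perf_counter() - start)
        return best

    try:
        return best_time(candidate) < best_time(baseline)
    except Exception:
        # A candidate that crashes is never selected.
        return False


if __name__ == "__main__":
    # The "fast" path raises, so the eager fallback result is returned.
    print(guarded(lambda: 1 / 0, lambda: "eager fp32"))
```

Measuring rather than assuming is what would catch a case like emulated bf16: the candidate runs correctly but loses the timing comparison, so the baseline is kept.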

Kill-switches

Two environment variables make it debuggable:

| Variable | Effect |
| --- | --- |
| CADENZA_ACCEL=0 | Disable everything: pure eager fp32, base thread counts. |
| CADENZA_NO_COMPILE=1 | Keep dtype/thread tuning but skip torch.compile. |

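
As a rough sketch of how the two switches could be interpreted, the snippet below reads them from the environment and returns a small plan. The function name and the returned keys are hypothetical, and it assumes that an unset variable leaves the feature enabled and that only the documented values (`0` and `1`) flip a switch.

```python
import os
from typing import Dict, Mapping, Optional


def read_kill_switches(env: Optional[Mapping[str, str]] = None) -> Dict[str, bool]:
    """Translate the two documented variables into on/off decisions.

    Hypothetical helper; passing `env` explicitly makes it easy to test.
    """
    env = os.environ if env is None else env
    # CADENZA_ACCEL=0 disables the whole layer.
    accelerate = env.get("CADENZA_ACCEL", "1") != "0"
    # CADENZA_NO_COMPILE=1 skips torch.compile but keeps the other tuning.
    use_compile = accelerate and env.get("CADENZA_NO_COMPILE", "0") != "1"
    return {"accelerate": accelerate, "compile": use_compile}


if __name__ == "__main__":
    print(read_kill_switches())
```

On the command line this corresponds to prefixing a command with the variable, for example `CADENZA_NO_COMPILE=1` before a training command, to rule out compilation as the cause of a problem.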
Commands that train or profile print a one-line acceleration summary (e.g. in env residual init and env residual bench), so you can see exactly what was applied on your box.
Torch is imported lazily — the core CLI stays light, and only the rl / lora / vla paths that already need torch trigger the acceleration layer. Hardware detection uses the Python standard library only.
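
One common way to get this kind of lazy import is to defer the `import` until a code path actually needs the module. The sketch below shows the general technique with the standard library's `importlib`; `load_if_needed` is a hypothetical name and this is not necessarily how Cadenza implements it.

```python
import importlib
import importlib.util
from types import ModuleType
from typing import Optional


def load_if_needed(name: str, needed: bool) -> Optional[ModuleType]:
    """Import `name` only when a code path needs it and it is installed.

    Returns None when the module is not needed or cannot be found, so
    lightweight commands never pay the import cost.
    """
    if not needed:
        return None
    # For a top-level name, find_spec locates the module without importing it.
    if importlib.util.find_spec(name) is None:
        return None
    return importlib.import_module(name)


if __name__ == "__main__":
    # A lightweight command would pass needed=False and skip the import.
    print(load_if_needed("torch", needed=False))
```

A training or inference entry point would call something like `load_if_needed("torch", needed=True)` at the moment it starts work, and only then hand the module to the acceleration layer.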