Hardware acceleration

Targets

| Target | Detected as | Applied |
| --- | --- | --- |
| Apple Silicon (local dev) | arm64 + Metal | MPS device, thread tuning, matmul precision |
| AWS Graviton (cloud CPU) | aarch64 Linux | oneDNN + ARM Compute Library, bf16, threads |
| NVIDIA Jetson Orin (deploy) | aarch64 + CUDA | CUDA, TF32/bf16, `torch.compile`, int8 |
| x86 + CUDA (cloud GPU) | x86_64 + CUDA | CUDA, TF32/bf16 autocast, `torch.compile` |

The layer chooses the device, thread and interop counts, float32 matmul precision (TF32/bf16), bf16 autocast, `torch.compile`, `channels_last`, and int8 for deploy. It applies each technique only on targets where that technique has an effect.
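As a rough illustration of how the table above could turn into a per-target plan, here is a minimal sketch. The names `AccelPlan`, `plan_for`, and `apply_plan` are hypothetical and not part of the tool, the `"high"` precision value is an assumed choice, and tying int8 to aarch64 + CUDA is only one reading of the Jetson row.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccelPlan:
    """Hypothetical container for the choices the layer makes."""
    device: str
    matmul_precision: str = "highest"  # argument for torch.set_float32_matmul_precision
    autocast_bf16: bool = False
    use_compile: bool = False
    int8: bool = False


def plan_for(arch: str, has_cuda: bool, has_mps: bool) -> AccelPlan:
    """Map a detected target to a plan, following the Targets table."""
    if has_cuda:
        # Jetson Orin (aarch64) and x86 cloud GPUs share the CUDA settings;
        # int8 is assumed to apply to the aarch64 deploy target only.
        return AccelPlan("cuda", matmul_precision="high", autocast_bf16=True,
                         use_compile=True, int8=(arch == "aarch64"))
    if has_mps:
        # Apple Silicon: MPS device plus matmul precision, no bf16 assumed.
        return AccelPlan("mps", matmul_precision="high")
    if arch == "aarch64":
        # Graviton-style CPU: bf16 on CPU, no torch.compile listed.
        return AccelPlan("cpu", autocast_bf16=True)
    return AccelPlan("cpu")  # unknown target: plain eager fp32


def apply_plan(plan: AccelPlan):
    """Apply the precision setting; torch is imported only when needed."""
    import torch
    torch.set_float32_matmul_precision(plan.matmul_precision)
    return torch.device(plan.device)
```

Keeping the decision in a pure function separate from the torch calls makes the plan easy to inspect without importing torch at all.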

Safety first

Every aggressive step is guarded: if a step fails, the layer falls back to eager fp32, so that acceleration does not break or slow a run. The layer is also empirical about performance traps. For example, bf16 on an Apple Silicon CPU is emulated (≈100× slower), so bf16 is never blindly assumed; the layer probes before committing.
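The guard-and-probe pattern described above can be sketched generically. This is an illustration under assumptions, not the layer's actual code: `guarded` and `is_faster` are hypothetical helpers, and the callables stand in for whatever fast and safe paths a real run would build.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def guarded(aggressive: Callable[[], T], baseline: Callable[[], T]) -> Tuple[T, bool]:
    """Try the aggressive path; on any exception use the baseline instead.

    The returned bool reports whether the aggressive path was used.
    """
    try:
        return aggressive(), True
    except Exception:
        return baseline(), False


def is_faster(candidate: Callable[[], object],
              baseline: Callable[[], object],
              repeats: int = 3) -> bool:
    """Probe: time both callables and report whether the candidate wins.

    A candidate that raises counts as "not faster".
    """
    def best_time(fn: Callable[[], object]) -> float:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            fn()
            best = min(best, time.perf_counter() - start)
        return best

    try:
        candidate_time = best_time(candidate)
    except Exception:
        return False
    return candidate_time < best_time(baseline)
```

In a torch setting, the candidate might be a small bf16 matmul and the baseline the same matmul in fp32, so that an emulated bf16 path is rejected by measurement rather than by assumption.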

Kill-switches

Two environment variables make it debuggable:

| Variable | Effect |
| --- | --- |
| `CADENZA_ACCEL=0` | Disable everything: pure eager fp32, base thread counts. |
| `CADENZA_NO_COMPILE=1` | Keep dtype/thread tuning but skip `torch.compile`. |
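One way such switches could be read is sketched below. Only the values `CADENZA_ACCEL=0` and `CADENZA_NO_COMPILE=1` are documented here; how other values are treated, and the `Switches` / `read_switches` names, are assumptions made for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Switches:
    """Hypothetical view of the two kill-switches."""
    accel_enabled: bool
    compile_enabled: bool


def read_switches(env: Optional[Mapping[str, str]] = None) -> Switches:
    """Read the kill-switches from the environment (or a supplied mapping)."""
    env = os.environ if env is None else env
    # Assumption: only the literal "0" disables acceleration.
    accel_enabled = env.get("CADENZA_ACCEL", "1").strip() != "0"
    # Assumption: only the literal "1" disables torch.compile.
    no_compile = env.get("CADENZA_NO_COMPILE", "0").strip() == "1"
    # With acceleration off entirely, compilation is off as well.
    return Switches(accel_enabled, accel_enabled and not no_compile)
```

Accepting a mapping argument keeps the parsing testable without touching the real process environment.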

Commands that train or profile print a one-line acceleration summary (e.g. in `env residual init` and `env residual bench`), so you can see exactly what was applied on your box.

Torch is imported lazily, so the core CLI stays light: only the `rl` / `lora` / `vla` paths that already need torch trigger the acceleration layer. Hardware detection uses the Python standard library only.
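A minimal sketch of stdlib-only detection combined with a lazy import might look like the following. The helper names are hypothetical, and the sketch covers only OS and CPU architecture; how the layer itself detects Metal or CUDA is not described here.

```python
import importlib
import importlib.util
import platform
from typing import Dict


def detect_host() -> Dict[str, str]:
    """Describe the host using only the standard library."""
    arch = platform.machine().lower()
    # Windows reports "AMD64"; fold it into the x86_64 name used above.
    # "arm64" (macOS) and "aarch64" (Linux) are kept distinct on purpose.
    if arch == "amd64":
        arch = "x86_64"
    return {"system": platform.system(), "arch": arch}


def torch_available() -> bool:
    """Check that torch can be found without importing it."""
    return importlib.util.find_spec("torch") is not None


def load_torch():
    """Import torch on first use; return None if it is not installed."""
    if not torch_available():
        return None
    return importlib.import_module("torch")
```

Because `find_spec` only locates the package, a CLI command that never calls `load_torch` avoids paying torch's import cost.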
