The core loop
Every RL setup is the same loop repeated millions of times:
- Observation: what the policy senses this tick.
- Action: what it decides to do.
- Reward: a scalar score for the result. Did it move toward the goal, stay upright, use less energy?
- Rollout (episode): one full run of observe → act → reward, from start to a terminal condition.
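
The four terms above can be sketched as a few lines of Python. This is a minimal illustration, not Cadenza's API: the `LineWorld` environment, the `random_policy` function, and the `rollout` helper are all hypothetical names invented for this example.

```python
import random


class LineWorld:
    """Hypothetical toy environment: an agent on a number line tries to reach `goal`."""

    def __init__(self, goal=5, max_steps=20):
        self.goal = goal
        self.max_steps = max_steps

    def reset(self):
        self.pos = 0
        self.steps = 0
        return self.pos  # observation: the current position

    def step(self, action):
        # action is -1 (step left) or +1 (step right)
        self.pos += action
        self.steps += 1
        reached = self.pos == self.goal
        reward = 1.0 if reached else 0.0  # scalar score for this tick
        done = reached or self.steps >= self.max_steps  # terminal condition
        return self.pos, reward, done


def random_policy(observation, rng):
    """An untrained policy: ignores the observation and acts at random."""
    return rng.choice([-1, 1])


def rollout(env, policy, rng):
    """Run one episode: observe -> act -> reward until a terminal condition."""
    observation = env.reset()
    total_reward = 0.0
    steps = 0
    done = False
    while not done:
        action = policy(observation, rng)
        observation, reward, done = env.step(action)
        total_reward += reward
        steps += 1
    return total_reward, steps


rng = random.Random(0)
env = LineWorld()
returns = [rollout(env, random_policy, rng)[0] for _ in range(1000)]
print(f"mean return of the untrained policy: {sum(returns) / len(returns):.3f}")
```

A training algorithm would sit around `rollout`, using the collected rewards to update the policy between episodes; that part is deliberately left out here.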
Reward shaping
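As a concrete picture of the idea in this section, the sketch below compares a sparse reward (paid only at the goal) with a shaped reward that adds a small progress term. The function names, the one-dimensional position, and the `progress_weight` value are assumptions made for illustration; they are not Cadenza's mission-spec format.

```python
def sparse_reward(pos, goal):
    """+1 only when the goal is reached, otherwise 0."""
    return 1.0 if pos == goal else 0.0


def shaped_reward(prev_pos, pos, goal, progress_weight=0.1):
    """Sparse reward plus a small bonus for getting closer to the goal."""
    # Positive when this step reduced the distance to the goal, negative when it grew.
    progress = abs(goal - prev_pos) - abs(goal - pos)
    return sparse_reward(pos, goal) + progress_weight * progress


goal = 5
path = [0, 1, 2, 1, 2, 3]  # a short trajectory that never reaches the goal
for prev_pos, pos in zip(path, path[1:]):
    print(
        f"{prev_pos} -> {pos}: "
        f"sparse={sparse_reward(pos, goal):.1f}, "
        f"shaped={shaped_reward(prev_pos, pos, goal):+.1f}"
    )
```

Along this trajectory the sparse reward is zero at every step, while the shaped reward tells the policy which steps helped. Rewarding the change in distance, rather than closeness itself, means that stepping back and forth sums to zero, so the policy gains nothing from oscillating.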
The reward is the most important design choice in RL. A reward that’s too sparse (“+1 if you reach the goal, else 0”) gives the policy almost nothing to learn from. Reward shaping adds intermediate signal (progress toward the goal, stability, smoothness) so learning is faster and more stable. In Cadenza, mission specs let you shape reward and set success/fail predicates per phase, so the signal matches each part of a multi-step task.
Why simulation is essential for RL
RL is hungry. It needs enormous numbers of rollouts, and early in training the policy is bad: it will fall, flail, and fail constantly. You cannot run that on a physical robot at scale:
- Volume: RL needs thousands to millions of episodes; real robots can’t produce them in time.
- Safety: a half-trained policy can damage hardware. In sim, failure is free.
- Speed: simulation can run faster than real time and in parallel.
- Coverage: you can spawn rare or dangerous scenarios on demand.
RL vs supervised learning
RL isn’t the only way to train a policy. When you already have good demonstrations, supervised learning (imitation) is simpler: just copy the expert. The two are complementary:

| | Reinforcement learning | Supervised / imitation |
|---|---|---|
| Signal | Reward from outcomes | Labeled expert actions |
| Needs | A reward function + many rollouts | A dataset of good behavior |
| Strength | Discovers new behavior, optimizes for the goal | Fast, stable, data-efficient |
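
The "Signal" and "Needs" rows can be illustrated from the imitation side with a tiny sketch: given labeled expert actions, training reduces to an ordinary supervised fit, with no reward function and no rollouts. The expert rule, the linear policy, and the dataset size here are all assumptions chosen to keep the example short.

```python
import random


def expert_action(obs):
    """Hypothetical expert: steer proportionally back toward zero."""
    return -0.5 * obs


# A dataset of good behavior: (observation, expert action) pairs.
rng = random.Random(0)
observations = [rng.uniform(-1.0, 1.0) for _ in range(200)]
dataset = [(obs, expert_action(obs)) for obs in observations]

# Fit action = gain * obs by least squares (closed form for a single parameter).
gain = sum(o * a for o, a in dataset) / sum(o * o for o, _ in dataset)


def cloned_policy(obs):
    """The imitation policy: copies the expert's mapping from observation to action."""
    return gain * obs


print(f"learned gain: {gain:.3f}")
```

The cloned policy can only be as good as the demonstrations it was fit to, which is the trade-off the table describes: imitation is fast and stable, while RL can discover behavior no expert has shown.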
Next: Fine-tuning & LoRA
Turn rollouts into a sharper, specialized policy.