Reinforcement learning (RL) is how a policy learns by doing. Rather than being shown the right answer, the policy acts, sees how well it did, and adjusts to do better next time. For robots — where the “right” action depends on physics, contact, and context — this trial-and-error loop is often the only way to learn behaviors that hold up in the real world.

The core loop

Every RL setup repeats the same loop, often millions of times. Its pieces are:
  • Observation — what the policy senses this tick.
  • Action — what it decides to do.
  • Reward — a scalar score for the result: did it move toward the goal, stay upright, use less energy?
  • Rollout (episode) — one full run of observe → act → reward, from start to a terminal condition.
The policy’s only job is to maximize cumulative reward over a rollout. Get the reward right and the behavior follows.
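As a rough illustration of the loop above, here is a minimal sketch in plain Python. `LineWorld`, `run_rollout`, and `always_right` are hypothetical names invented for this example; they are not Cadenza's API. Real Gym-style interfaces typically return more fields from `step` than the three shown here.

```python
class LineWorld:
    """Toy 1-D environment (hypothetical): start at 0, try to reach `goal`."""

    def __init__(self, goal=5, max_steps=20):
        self.goal = goal
        self.max_steps = max_steps
        self.pos = 0
        self.steps = 0

    def reset(self):
        """Start a new rollout and return the first observation."""
        self.pos = 0
        self.steps = 0
        return self.pos

    def step(self, action):
        """Apply an action (-1 = left, +1 = right) for one tick."""
        self.pos += action
        self.steps += 1
        reached = self.pos == self.goal
        reward = 1.0 if reached else 0.0                  # scalar score
        done = reached or self.steps >= self.max_steps    # terminal condition
        return self.pos, reward, done


def run_rollout(env, policy):
    """Run one episode of observe -> act -> reward; return cumulative reward."""
    obs = env.reset()
    total = 0.0
    done = False
    while not done:
        action = policy(obs)                   # policy maps observation to action
        obs, reward, done = env.step(action)
        total += reward
    return total


def always_right(obs):
    """A trivial hand-written policy: ignore the observation, move right."""
    return 1


if __name__ == "__main__":
    print(run_rollout(LineWorld(), always_right))
```

A learning algorithm would wrap `run_rollout` in an outer loop, using the returned total to decide how to change the policy.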

Reward shaping

The reward is the most important design choice in RL. A reward that’s too sparse (“+1 if you reach the goal, else 0”) gives the policy almost nothing to learn from. Reward shaping adds intermediate signal — progress toward the goal, stability, smoothness — so learning is faster and more stable. In Cadenza, mission specs let you shape reward and set success/fail predicates per phase, so the signal matches each part of a multi-step task.
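To make the difference between sparse and shaped reward concrete, here is a small sketch for a 1-D goal-reaching task. The function names and the `progress_weight` value are assumptions chosen for illustration; this is not how Cadenza mission specs express rewards.

```python
def sparse_reward(pos, goal):
    """+1 only when the goal is reached, else 0."""
    return 1.0 if pos == goal else 0.0


def shaped_reward(prev_pos, pos, goal, progress_weight=0.1):
    """Sparse reward plus a small bonus (or penalty) for progress.

    `progress` is positive when this step moved closer to the goal
    and negative when it moved away.
    """
    progress = abs(goal - prev_pos) - abs(goal - pos)
    return sparse_reward(pos, goal) + progress_weight * progress


if __name__ == "__main__":
    goal = 5
    trajectory = [0, 1, 2, 3, 4, 5]
    steps = list(zip(trajectory, trajectory[1:]))
    print("sparse:", [sparse_reward(new, goal) for _, new in steps])
    print("shaped:", [shaped_reward(old, new, goal) for old, new in steps])
```

With the sparse version, every step before the last one scores zero, so the policy gets no hint about direction. The shaped version scores each step toward the goal slightly above zero and each step away slightly below. The progress term here is a difference of a distance-based potential, a form of shaping generally considered safe because it does not change which behavior is optimal in the undiscounted case.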

Why simulation is essential for RL

RL is hungry. It needs enormous numbers of rollouts, and early in training the policy is bad — it will fall, flail, and fail constantly. You cannot run that on a physical robot at scale:
  • Volume: RL needs thousands to millions of episodes; real robots can’t produce them in time.
  • Safety: a half-trained policy can damage hardware. In sim, failure is free.
  • Speed: simulation can run faster than real time and across many parallel instances.
  • Coverage: you can spawn rare or dangerous scenarios on demand.
Cadenza’s Gym-style interface exposes exactly this loop, so you can run RL at the scale simulation makes possible.
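The volume argument can be sketched with a toy batch collector. It steps many copies of a 1-D random-walk environment in lockstep and counts the transitions gathered. Everything here (`collect_batch`, the random policy, the goal at position 5) is a hypothetical stand-in; a real simulator would spread the environments across processes or accelerators rather than a Python loop.

```python
import random


def collect_batch(num_envs, horizon, goal=5, seed=0):
    """Step `num_envs` toy environments for up to `horizon` ticks each.

    Returns (transitions collected, number of environments that reached the goal).
    """
    rng = random.Random(seed)
    positions = [0] * num_envs
    reached = [False] * num_envs
    transitions = 0
    for _ in range(horizon):
        for i in range(num_envs):
            if reached[i]:
                continue  # this episode has already terminated
            positions[i] += rng.choice([-1, 1])  # untrained, random policy
            transitions += 1
            if positions[i] == goal:
                reached[i] = True
    return transitions, sum(reached)


if __name__ == "__main__":
    transitions, successes = collect_batch(num_envs=1000, horizon=50)
    print(f"{transitions} transitions, {successes} of 1000 rollouts reached the goal")
```

Scaling `num_envs` costs only compute here, whereas on hardware each extra rollout costs robot time and risks a fall.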

RL vs supervised learning

RL isn’t the only way to train a policy. When you already have good demonstrations, supervised learning (imitation) is simpler — just copy the expert. The two are complementary:
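|          | Reinforcement learning                         | Supervised / imitation       |
|----------|------------------------------------------------|------------------------------|
| Signal   | Reward from outcomes                           | Labeled expert actions       |
| Needs    | A reward function + many rollouts              | A dataset of good behavior   |
| Strength | Discovers new behavior, optimizes for the goal | Fast, stable, data-efficient |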
In practice, teams often pretrain a model on demonstrations and then refine it with RL and fine-tuning — which is exactly the workflow Cadenza supports.
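The pretrain-then-refine pattern can be sketched on the smallest possible policy: a single parameter `theta` that sets the probability of choosing "right" over "left". This is an illustrative toy, not Cadenza's training code. The imitation step uses the log-likelihood gradient on labeled expert actions, and the RL step uses a basic REINFORCE-style update, chosen here only because it is short; all names and the learning rate are assumptions.

```python
import math
import random


def p_right(theta):
    """Probability that the policy picks action 1 (right)."""
    return 1.0 / (1.0 + math.exp(-theta))


def imitation_update(theta, expert_action, lr=0.5):
    """Supervised step: raise the log-probability of the expert's action.

    expert_action is 1 (right) or 0 (left); the signal is the label itself.
    """
    return theta + lr * (expert_action - p_right(theta))


def reinforce_update(theta, action, reward, lr=0.5):
    """RL step: same gradient, but for the sampled action, scaled by reward."""
    return theta + lr * reward * (action - p_right(theta))


def train(demos, rl_steps, seed=0):
    """Pretrain on demonstrations, then refine with reward from outcomes."""
    rng = random.Random(seed)
    theta = 0.0
    for expert_action in demos:
        theta = imitation_update(theta, expert_action)
    pretrained = theta
    for _ in range(rl_steps):
        action = 1 if rng.random() < p_right(theta) else 0
        reward = 1.0 if action == 1 else 0.0  # toy outcome: only "right" pays off
        theta = reinforce_update(theta, action, reward)
    return pretrained, theta


if __name__ == "__main__":
    pretrained, refined = train(demos=[1, 1, 1, 0], rl_steps=200)
    print(f"P(right) after imitation: {p_right(pretrained):.2f}")
    print(f"P(right) after RL:        {p_right(refined):.2f}")
```

The two update functions differ only in where the signal comes from: a labeled action in one case, a sampled action weighted by its reward in the other. That mirrors the Signal row of the comparison above.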

Next: Fine-tuning & LoRA

Turn rollouts into a sharper, specialized policy.