Reinforcement learning

The core loop

Every RL setup repeats the same loop, often millions of times. Four terms describe it:

Observation: what the policy senses at the current tick.

Action: what it decides to do in response.

Reward: a scalar score for the result. Did it move toward the goal, stay upright, use less energy?

Rollout (episode): one full run of the observe → act → reward loop, from the start state to a terminal condition.

Training adjusts the policy to maximize the cumulative reward it collects over a rollout. The policy optimizes exactly what the reward measures, so getting the reward right is what produces the intended behavior.
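The loop above can be sketched in a few lines of Python. This is a minimal illustration, not Cadenza's API: LineWorld, run_rollout, and always_right are hypothetical names, and the step method returns a simplified (observation, reward, done) triple rather than the exact tuple any particular Gym-style library uses.

```python
class LineWorld:
    """Hypothetical toy environment: an agent on a 1-D line tries to reach `goal`."""

    def __init__(self, goal=5, max_steps=20):
        self.goal = goal
        self.max_steps = max_steps
        self.pos = 0
        self.t = 0

    def reset(self):
        # Start a new rollout and return the first observation.
        self.pos = 0
        self.t = 0
        return self.pos

    def step(self, action):
        # action is -1 (move left) or +1 (move right).
        self.pos += action
        self.t += 1
        reached = self.pos == self.goal
        reward = 1.0 if reached else 0.0
        # Terminal condition: goal reached or step budget exhausted.
        done = reached or self.t >= self.max_steps
        return self.pos, reward, done


def run_rollout(env, policy):
    """Run one episode and return the cumulative reward."""
    obs = env.reset()
    total = 0.0
    done = False
    while not done:
        action = policy(obs)                   # observe -> act
        obs, reward, done = env.step(action)   # act -> reward + next observation
        total += reward
    return total


def always_right(obs):
    # A fixed, hand-written policy used only for illustration.
    return 1


print(run_rollout(LineWorld(), always_right))  # 1.0
```

A learning algorithm would call run_rollout many times and update the policy between calls. Here the policy is fixed so the loop itself stays visible.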

Reward shaping

The reward is the most important design choice in RL. A reward that’s too sparse (“+1 if you reach the goal, else 0”) gives the policy almost nothing to learn from, because an untrained policy rarely reaches the goal and almost every rollout scores zero. Reward shaping adds intermediate signal (progress toward the goal, stability, smoothness) so learning is typically faster and more stable.
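As a rough illustration of the difference, the sketch below compares a sparse reward with a shaped one on a 1-D task. The function names and the progress_weight parameter are assumptions made for this example. The shaped version adds a small bonus for each step that reduces the distance to the goal.

```python
def sparse_reward(pos, goal):
    # +1 only when the goal is reached, otherwise nothing.
    return 1.0 if pos == goal else 0.0


def shaped_reward(prev_pos, pos, goal, progress_weight=0.1):
    # Progress is positive when this step moved closer to the goal,
    # negative when it moved away.
    progress = abs(goal - prev_pos) - abs(goal - pos)
    return sparse_reward(pos, goal) + progress_weight * progress


# One step from position 0 to 1, with the goal at 5:
print(sparse_reward(1, goal=5))     # 0.0
print(shaped_reward(0, 1, goal=5))  # 0.1
```

With the sparse reward, a step toward the goal and a step away from it both score 0.0. With the shaped reward they score differently, which gives the policy something to learn from before it ever reaches the goal.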

In Cadenza, mission specs let you shape reward and set success/fail predicates per phase, so the signal matches each part of a multi-step task.
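The idea of per-phase reward and predicates can be pictured with a small data structure. The sketch below is not Cadenza's mission-spec format, which this page does not show. Phase, evaluate, and the state keys (distance, tilt, grip_force) are all hypothetical, chosen only to illustrate how each phase can carry its own reward and its own success/fail tests.

```python
from dataclasses import dataclass
from typing import Callable, Dict

State = Dict[str, float]


@dataclass
class Phase:
    """Hypothetical per-phase spec: a reward plus success/fail predicates."""
    name: str
    reward: Callable[[State], float]
    success: Callable[[State], bool]
    fail: Callable[[State], bool]


# Two illustrative phases of a pick-up task.
phases = [
    Phase(
        name="approach",
        reward=lambda s: -s["distance"],         # closer is better
        success=lambda s: s["distance"] < 0.05,
        fail=lambda s: s["tilt"] > 0.5,          # tipped over
    ),
    Phase(
        name="grasp",
        reward=lambda s: s["grip_force"],        # firmer grip is better
        success=lambda s: s["grip_force"] >= 1.0,
        fail=lambda s: s["distance"] > 0.2,      # drifted away from the object
    ),
]


def evaluate(phase_index, state):
    """Return (reward, next_phase_index, status) for the current phase."""
    phase = phases[phase_index]
    reward = phase.reward(state)
    if phase.fail(state):
        return reward, phase_index, "failed"
    if phase.success(state):
        nxt = phase_index + 1
        status = "done" if nxt == len(phases) else "advance"
        return reward, nxt, status
    return reward, phase_index, "running"
```

Because each phase owns its reward, the approach phase can score distance while the grasp phase scores grip force, without either term leaking into the other part of the task.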

Why simulation is essential for RL

RL is hungry. It needs enormous numbers of rollouts, and early in training the policy is bad — it will fall, flail, and fail constantly. You cannot run that on a physical robot at scale:

Volume: RL needs thousands to millions of episodes. A physical robot runs in real time, one episode at a time, so it cannot produce them in a practical timeframe.

Safety: a half-trained policy can damage or destroy hardware. In sim, failure is free.

Speed: simulation can run faster than real time and across many instances in parallel.

Coverage: you can spawn rare or dangerous scenarios on demand.

Cadenza’s Gym-style interface exposes exactly this loop, so you can run RL at the scale simulation makes possible.
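To make the volume and speed points concrete, the sketch below steps a batch of toy environments in lockstep and counts the completed episodes. FixedLengthEnv and count_episodes are hypothetical names, and the loop steps the environments one after another in a single process. A real simulator would typically run the instances in parallel, but the bookkeeping is the same.

```python
class FixedLengthEnv:
    """Hypothetical toy env whose episodes always last `episode_len` steps."""

    def __init__(self, episode_len=4):
        self.episode_len = episode_len
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        done = self.t >= self.episode_len
        return self.t, 0.0, done


def count_episodes(num_envs, num_ticks):
    """Step `num_envs` environments for `num_ticks` ticks each; count finished episodes."""
    envs = [FixedLengthEnv() for _ in range(num_envs)]
    for env in envs:
        env.reset()
    episodes = 0
    for _ in range(num_ticks):
        for env in envs:
            _, _, done = env.step(0)
            if done:
                episodes += 1
                env.reset()  # start the next episode immediately
    return episodes


print(count_episodes(num_envs=1, num_ticks=100))   # 25
print(count_episodes(num_envs=64, num_ticks=100))  # 1600
```

For the same number of ticks, the episode count grows linearly with the number of environment instances. That scaling is what a single physical robot cannot offer.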

RL vs supervised learning

RL isn’t the only way to train a policy. When you already have good demonstrations, supervised learning (imitation) is simpler — just copy the expert. The two are complementary:

	Reinforcement learning	Supervised / imitation
Signal	Reward from outcomes	Labeled expert actions
Needs	A reward function + many rollouts	A dataset of good behavior
Strength	Discovers new behavior, optimizes for the goal	Fast, stable, data-efficient

In practice, teams often pretrain a model on demonstrations and then refine it with RL and fine-tuning — which is exactly the workflow Cadenza supports.
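The pretrain-then-refine pattern can be sketched on a one-parameter policy. Everything here is an assumption made for illustration: the task (drive a state x toward zero with action w * x), the slightly suboptimal expert, and the refinement step, which is plain random-search hill climbing standing in for a real RL algorithm.

```python
import random

observations = [-2.0, -1.0, 0.5, 1.0, 2.0]

# Expert demonstrations: the expert corrects only 80% of the error.
demos = [(x, -0.8 * x) for x in observations]


def behavior_clone(demos):
    """Supervised step: least-squares fit of action = w * observation."""
    num = sum(x * a for x, a in demos)
    den = sum(x * x for x, _ in demos)
    return num / den


def average_reward(w):
    """Reward is highest (zero) when the action cancels the state exactly."""
    return sum(-(x + w * x) ** 2 for x in observations) / len(observations)


def refine(w, steps=200, noise=0.05, seed=0):
    """Reward-driven step: keep random perturbations of w that score higher."""
    rng = random.Random(seed)
    best, best_score = w, average_reward(w)
    for _ in range(steps):
        candidate = best + rng.gauss(0.0, noise)
        score = average_reward(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best


w_bc = behavior_clone(demos)  # copies the expert, including its shortfall
w_rl = refine(w_bc)           # improves on the expert using reward alone
print(average_reward(w_rl) >= average_reward(w_bc))  # True
```

Imitation gets the policy to the expert's level quickly from a handful of examples. The reward-driven step can then move past the expert, because it optimizes the outcome rather than the demonstrations.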

Next: Fine-tuning & LoRA

Turn rollouts into a sharper, specialized policy.
