What is a policy?
A policy is a function from what the robot senses to what it does:
- Observation: the robot’s view of the world this tick, such as joint states, camera frames, proprioception, and sometimes a language instruction.
- Action: the command the policy emits, such as target velocities, joint torques, or a higher-level action call.
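To make the "function from observation to action" idea concrete, here is a minimal Python sketch. The `Observation` and `Action` fields, the `Policy` protocol, and `proportional_policy` are all hypothetical names chosen for illustration; they are not Cadenza's API, and a real policy would usually be a learned model rather than a hand-written rule.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence


@dataclass(frozen=True)
class Observation:
    """What the robot senses on one tick (illustrative fields only)."""
    joint_positions: Sequence[float]
    joint_velocities: Sequence[float]
    instruction: str = ""


@dataclass(frozen=True)
class Action:
    """What the policy commands: here, one target velocity per joint."""
    target_velocities: Sequence[float]


class Policy(Protocol):
    """Anything callable as policy(observation) -> action."""
    def __call__(self, obs: Observation) -> Action: ...


def proportional_policy(obs: Observation, gain: float = 0.5) -> Action:
    """Toy hand-written policy: push every joint back toward position 0."""
    return Action(tuple(-gain * q for q in obs.joint_positions))


policy: Policy = proportional_policy
obs = Observation(joint_positions=(0.2, -0.4), joint_velocities=(0.0, 0.0))
action = policy(obs)
print(action.target_velocities)
```

Whatever sits behind the call (a rule, a neural network, a planner), the contract is the same: one observation in, one action out, once per tick.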
Vision-language-action (VLA) models
A VLA model is a policy that takes in pixels and language and outputs actions. It’s the robotics analogue of a multimodal LLM: instead of replying with text, it replies with motion. VLA models are powerful because they:
- generalize across tasks described in natural language (“pick up the red block”),
- fuse perception and intent in a single model,
- and start from broad pretraining, so you can specialize them with relatively little task-specific data.
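The input and output shape of a VLA model can be sketched without any machine learning at all. The function below is a hand-written stand-in, not a learned model and not Cadenza's API: `toy_vla`, `COLORS`, and the `(dx, dy)` action format are assumptions made for illustration. It only shows that the same image paired with a different instruction should yield a different action.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]   # (R, G, B), each 0-255
Image = List[List[Pixel]]      # rows of pixels

COLORS = {"red": (255, 0, 0), "green": (0, 255, 0), "blue": (0, 0, 255)}
BLACK = (0, 0, 0)


def toy_vla(image: Image, instruction: str) -> Tuple[float, float]:
    """Return a (dx, dy) offset from the image center toward the named color.

    A real VLA model learns this mapping end to end; this stand-in uses
    substring matching and exact pixel comparison purely to show the interface.
    """
    text = instruction.lower()
    target = next((rgb for name, rgb in COLORS.items() if name in text), None)
    if target is None:
        return (0.0, 0.0)  # instruction names nothing we recognize
    hits = [(x, y) for y, row in enumerate(image)
            for x, px in enumerate(row) if px == target]
    if not hits:
        return (0.0, 0.0)  # named object is not visible
    cx = sum(x for x, _ in hits) / len(hits)
    cy = sum(y for _, y in hits) / len(hits)
    height, width = len(image), len(image[0])
    return (cx - (width - 1) / 2, cy - (height - 1) / 2)


# One 3x3 image with a red pixel top-right and a blue pixel bottom-left.
image = [[BLACK] * 3 for _ in range(3)]
image[0][2] = COLORS["red"]
image[2][0] = COLORS["blue"]

print(toy_vla(image, "pick up the red block"))   # moves toward top-right
print(toy_vla(image, "pick up the blue block"))  # moves toward bottom-left
```

The pixels are identical in both calls; only the language changes, and the action changes with it. That coupling of perception and intent is what a trained VLA model provides in a single network.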
World models
A world model learns to predict what happens next, given the current state and a candidate action. Instead of mapping observation straight to action, a world model lets a system imagine outcomes and plan against them. World models are useful for:
- Planning: search over imagined rollouts before committing to an action.
- Sample efficiency: learn from predicted experience, not just real or simulated steps.
- Reasoning: chain predictions together for multi-step tasks.
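Planning over imagined rollouts can be sketched in a few lines. In the example below, `toy_world_model` is a hand-written stand-in for learned dynamics on a one-dimensional state, and `imagine` and `plan` are hypothetical helper names, not Cadenza functions. The planner exhaustively tries every short action sequence inside the model, scores each imagined trajectory by its summed distance to the goal, and commits only to the first action of the best one.

```python
from itertools import product
from typing import Callable, List, Sequence

State = float
Action = float
WorldModel = Callable[[State, Action], State]


def toy_world_model(state: State, action: Action) -> State:
    """Predict the next state. Stand-in for a learned dynamics model."""
    return state + action


def imagine(model: WorldModel, state: State,
            actions: Sequence[Action]) -> List[State]:
    """Chain one-step predictions into an imagined trajectory."""
    trajectory = []
    for action in actions:
        state = model(state, action)
        trajectory.append(state)
    return trajectory


def plan(model: WorldModel, state: State, goal: State,
         candidates: Sequence[Action] = (-1.0, 0.0, 1.0),
         horizon: int = 3) -> Action:
    """Search all action sequences of length `horizon`; return the first
    action of the sequence whose imagined trajectory stays closest to goal."""
    def cost(actions: Sequence[Action]) -> float:
        return sum(abs(s - goal) for s in imagine(model, state, actions))

    best_sequence = min(product(candidates, repeat=horizon), key=cost)
    return best_sequence[0]


# Closed loop: replan every tick, execute only the first planned action.
state, goal = 0.0, 2.0
for _ in range(4):
    action = plan(toy_world_model, state, goal)
    state = state + action  # the "real" environment step in this toy setup
print(state)
```

Exhaustive search only works for tiny action sets and short horizons; with larger spaces, the same structure is typically kept while the search is replaced by sampling or gradient-based optimization.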
How they fit together
These aren’t mutually exclusive. A real system often layers them.
In Cadenza, the inference stack is built to host exactly this: register your own VLA or world model, or compose them (sequential, chain-of-thought, or custom) behind one clean interface. The simulator provides the observations; your model provides the actions.
How policies learn: RL
Trial, error, and reward.
Specializing a model: fine-tuning
Adapt a base model to your robot with LoRA.