At the top of the stack sit the learned components. This page explains the kinds of models that drive a robot and how they relate to the policy — the thing you ultimately train and deploy.

What is a policy?

A policy is a function from what the robot senses to what it does:
observation  ->  policy  ->  action
Everything else in physical AI exists to produce, train, or run a good policy. A policy can be a small neural network, a large multimodal model, or a composition of several models working together.
  • Observation — the robot’s view of the world this tick: joint states, camera frames, proprioception, sometimes a language instruction.
  • Action — the command the policy emits: target velocities, joint torques, or a higher-level action call.
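To make the observation -> policy -> action loop concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `Observation` and `Action` dataclasses, the hand-written `proportional_policy`, and the toy integrator in `run_episode` are hypothetical and are not part of any particular robotics library.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Observation:
    """Hypothetical observation: what the robot senses this tick."""
    joint_positions: List[float]
    instruction: Optional[str] = None  # optional language instruction


@dataclass
class Action:
    """Hypothetical action: target joint velocities."""
    joint_velocities: List[float]


# A policy is just a function from observation to action.
Policy = Callable[[Observation], Action]


def proportional_policy(obs: Observation) -> Action:
    """Toy hand-written policy: drive every joint toward position zero."""
    gain = 0.5
    return Action([-gain * q for q in obs.joint_positions])


def run_episode(policy: Policy, obs: Observation, steps: int, dt: float = 0.1) -> Observation:
    """Toy control loop: query the policy, then integrate velocities (assumed dynamics)."""
    for _ in range(steps):
        action = policy(obs)
        new_positions = [
            q + v * dt for q, v in zip(obs.joint_positions, action.joint_velocities)
        ]
        obs = Observation(new_positions, obs.instruction)
    return obs
```

A learned policy would replace `proportional_policy` with a neural network, but the call shape stays the same: one observation in, one action out, once per tick.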

Vision-language-action (VLA) models

A VLA model is a policy that takes in pixels and language and outputs actions. It’s the robotics analogue of a multimodal LLM: instead of replying with text, it replies with motion. VLA models are powerful because they:
  • generalize across tasks described in natural language (“pick up the red block”),
  • fuse perception and intent in a single model,
  • and start from broad pretraining, so you can specialize them with relatively little task-specific data.
That last point is why fine-tuning matters so much: you rarely train a VLA from scratch — you adapt a capable base model to your robot and task.

World models

A world model learns to predict what happens next, given the current state and a candidate action. Instead of mapping observation straight to action, a world model lets a system imagine outcomes and plan against them. World models are useful for:
  • Planning — search over imagined rollouts before committing to an action.
  • Sample efficiency — learn from predicted experience, not just real or simulated steps.
  • Reasoning — chain predictions together for multi-step tasks.
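The planning use case can be sketched with random shooting, one simple way to search over imagined rollouts. This is a sketch under stated assumptions: `toy_world_model` is a hand-written stand-in for a learned model, the state and action are single floats, and `plan_first_action` is a hypothetical helper name.

```python
import random
from typing import Callable, Tuple

State = float
Action = float
WorldModel = Callable[[State, Action], State]


def toy_world_model(state: State, action: Action) -> State:
    """Stand-in for a learned model (assumption): the position shifts by the action."""
    return state + action


def plan_first_action(
    model: WorldModel,
    state: State,
    goal: State,
    horizon: int = 5,
    num_candidates: int = 200,
    seed: int = 0,
) -> Tuple[Action, float]:
    """Random shooting: sample action sequences, roll each out in imagination,
    and return the first action of the best sequence plus its predicted cost."""
    rng = random.Random(seed)
    best_cost = float("inf")
    best_first = 0.0
    for _ in range(num_candidates):
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        imagined = state
        for a in actions:
            imagined = model(imagined, a)  # chain predictions, no real steps taken
        cost = abs(imagined - goal)  # distance from the goal after the rollout
        if cost < best_cost:
            best_cost, best_first = cost, actions[0]
    return best_first, best_cost
```

Only the first action of the best rollout is returned; a typical loop would execute it, observe the real outcome, and plan again from the new state.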

How they fit together

These aren’t mutually exclusive, and a real system often layers them. In Cadenza, the inference stack is built to host exactly this: register your own VLA or world model, or compose them (sequential, chain-of-thought, or custom) behind one clean interface. The simulator provides the observations; your model provides the actions.
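As an illustration of layering, the sketch below wraps a proposal model and a world model behind a single policy-shaped callable. It does not use or represent Cadenza's API: `propose_actions`, `predict_next`, and `compose` are hypothetical names, and both models are toy stand-ins (a real system would put a VLA behind the proposer and a learned model behind the predictor).

```python
from typing import Callable, List

Observation = float  # toy observation: signed distance from the goal
Action = float

Proposer = Callable[[Observation], List[Action]]
Predictor = Callable[[Observation, Action], Observation]
Policy = Callable[[Observation], Action]


def propose_actions(obs: Observation) -> List[Action]:
    """Stand-in for a VLA that emits several candidate actions (assumption)."""
    return [-1.0, -0.5, 0.0, 0.5, 1.0]


def predict_next(obs: Observation, action: Action) -> Observation:
    """Stand-in for a world model (assumption): the distance shifts by the action."""
    return obs + action


def compose(proposer: Proposer, predictor: Predictor) -> Policy:
    """Return one policy that proposes candidates, imagines each outcome,
    and picks the action whose predicted outcome is closest to the goal."""
    def policy(obs: Observation) -> Action:
        candidates = proposer(obs)
        return min(candidates, key=lambda a: abs(predictor(obs, a)))
    return policy


layered_policy = compose(propose_actions, predict_next)
```

Callers see only `layered_policy(observation)`, so the composition can change without touching the code that feeds observations in and reads actions out.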

How policies learn: RL

Trial, error, and reward.

Specializing a model: fine-tuning

Adapt a base model to your robot with LoRA.