Models & policies

What is a policy?

A policy is a function from what the robot senses to what it does:

observation  ->  policy  ->  action

Everything else in physical AI exists to produce, train, or run a good policy. A policy can be a small neural network, a large multimodal model, or a composition of several models working together.

Observation: the robot’s view of the world at the current control tick, such as camera frames, joint states and other proprioception, and sometimes a language instruction.

Action: the command the policy emits, such as target velocities, joint torques, or a higher-level action call.
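To make the observation -> policy -> action contract concrete, here is a minimal sketch in Python. The `Observation` and `Action` dataclasses, the `hold_pose_policy` controller, and the `run` loop are all hypothetical names invented for illustration; they are not part of any particular library, and a real policy would usually be a trained model rather than a hand-written rule.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Observation:
    joint_positions: List[float]          # proprioception, in radians
    camera_frame: Optional[bytes] = None  # raw image bytes (unused in this toy)
    instruction: Optional[str] = None     # optional language command


@dataclass
class Action:
    joint_velocities: List[float]         # target velocities, in rad/s


# A policy is just a callable from observation to action.
Policy = Callable[[Observation], Action]


def hold_pose_policy(obs: Observation) -> Action:
    """Hand-written proportional rule that drives every joint toward zero."""
    gain = 0.5
    return Action(joint_velocities=[-gain * q for q in obs.joint_positions])


def run(policy: Policy, obs: Observation, steps: int, dt: float = 0.1) -> Observation:
    """Toy control loop: integrate the commanded velocities to get the next observation."""
    for _ in range(steps):
        action = policy(obs)
        obs = Observation(
            joint_positions=[
                q + v * dt
                for q, v in zip(obs.joint_positions, action.joint_velocities)
            ]
        )
    return obs


final = run(hold_pose_policy, Observation(joint_positions=[1.0, -2.0]), steps=50)
print(final.joint_positions)  # both joints end up much closer to zero
```

Because the loop only depends on the `Policy` signature, swapping the hand-written rule for a neural network changes nothing else in the loop.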

Vision-language-action (VLA) models

A VLA model is a policy that takes in pixels and language and outputs actions. It’s the robotics analogue of a multimodal LLM: instead of replying with text, it replies with motion.

VLA models are powerful because they:

generalize across tasks described in natural language (“pick up the red block”),

fuse perception and intent in a single model,

and start from broad pretraining, so you can specialize them with relatively little task-specific data.

That last point is why fine-tuning matters so much: you rarely train a VLA from scratch — you adapt a capable base model to your robot and task.
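The input/output contract of a VLA (pixels and language in, action out) can be sketched with a deliberately tiny stand-in. Everything below is an assumption made for illustration: the "image" is a grid of color names rather than real pixels, `toy_vla` is a hypothetical lookup rule rather than a learned model, and the "action" is just a grid cell to reach for.

```python
from typing import List, Tuple

Image = List[List[str]]  # toy "pixels": a grid of color names


def toy_vla(image: Image, instruction: str) -> Tuple[int, int]:
    """Return the (row, col) of the first cell, in row-major order,
    whose color is mentioned in the instruction."""
    words = instruction.lower().split()
    for r, row in enumerate(image):
        for c, color in enumerate(row):
            if color in words:
                return (r, c)
    raise ValueError("no object in view matches the instruction")


scene = [
    ["empty", "blue"],
    ["red", "empty"],
]

# Same scene, different instruction, different action.
print(toy_vla(scene, "pick up the red block"))
print(toy_vla(scene, "pick up the blue block"))
```

A real VLA replaces the lookup with a pretrained network, but the calling convention stays the same: one model receives both the camera input and the instruction, and returns the motion.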

World models

A world model learns to predict what happens next, given the current state and a candidate action. Instead of mapping observations straight to actions, it lets a system imagine outcomes and plan against them.

World models are useful for:

Planning — search over imagined rollouts before committing to an action.

Sample efficiency — learn from predicted experience, not just real or simulated steps.

Reasoning — chain predictions together for multi-step tasks.
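Planning over imagined rollouts can be sketched with random shooting, one of the simplest planners: sample many candidate action sequences, roll each one forward through the world model, and keep the sequence whose predicted end state is closest to the goal. This is a sketch under stated assumptions: `toy_world_model` is a hand-written stand-in for a learned model, the state is a single number, and the names `rollout` and `plan` are hypothetical.

```python
import random
from typing import Callable, List

State = float
Action = float
WorldModel = Callable[[State, Action], State]


def toy_world_model(state: State, action: Action) -> State:
    """Hand-written stand-in for a learned model: the action shifts the state."""
    return state + action


def rollout(model: WorldModel, state: State, actions: List[Action]) -> State:
    """Chain one-step predictions to imagine where an action sequence ends up."""
    for action in actions:
        state = model(state, action)
    return state


def plan(
    model: WorldModel,
    state: State,
    goal: State,
    horizon: int = 3,
    candidates: int = 500,
    seed: int = 0,
) -> List[Action]:
    """Random shooting: return the sampled sequence with the best imagined outcome."""
    rng = random.Random(seed)
    best_cost = float("inf")
    best_plan: List[Action] = []
    for _ in range(candidates):
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        cost = abs(goal - rollout(model, state, actions))  # no real step is taken
        if cost < best_cost:
            best_cost, best_plan = cost, actions
    return best_plan


best = plan(toy_world_model, state=0.0, goal=1.0)
print(best, rollout(toy_world_model, 0.0, best))
```

In a receding-horizon setup, the system would typically execute only the first action of the returned sequence, observe the result, and plan again.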

How they fit together

These aren’t mutually exclusive; a real system often layers them.

In Cadenza, the inference stack is built to host exactly this: register your own VLA or world model, or compose them — sequential, chain-of-thought, or custom — behind one clean interface. The simulator provides the observations; your model provides the actions.
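One way to picture a sequential composition is a proposer that suggests candidate actions (the VLA-like role) followed by a predictor that imagines each outcome (the world-model role), with the best-scoring candidate returned as the action. The sketch below is generic Python and is not Cadenza's API: `compose`, `propose`, `predict`, and `score` are hypothetical names, and the toy models are hand-written stand-ins.

```python
from typing import Callable, Sequence

Observation = float  # toy 1-D position standing in for a full observation
Action = float

Proposer = Callable[[Observation, str], Sequence[Action]]  # VLA-like role
Predictor = Callable[[Observation, Action], Observation]   # world-model role
Scorer = Callable[[Observation], float]                    # higher is better


def compose(
    propose: Proposer, predict: Predictor, score: Scorer, instruction: str
) -> Callable[[Observation], Action]:
    """Build one policy out of a proposer and a predictor run in sequence."""

    def policy(obs: Observation) -> Action:
        candidates = propose(obs, instruction)
        # Keep the candidate whose imagined outcome scores best.
        return max(candidates, key=lambda a: score(predict(obs, a)))

    return policy


def propose(obs: Observation, instruction: str) -> Sequence[Action]:
    """Toy proposer: the instruction picks the direction, not the step size."""
    return [0.5, 1.0] if "right" in instruction else [-0.5, -1.0]


def predict(obs: Observation, action: Action) -> Observation:
    """Toy predictor: the action shifts the position."""
    return obs + action


def score(predicted: Observation) -> float:
    """Prefer outcomes close to a target position of 0.6."""
    return -abs(0.6 - predicted)


policy = compose(propose, predict, score, instruction="move right")
print(policy(0.0))  # the smaller step lands closer to the target
```

The composed object still has the plain observation-to-action shape, which is what lets a host treat a single model and a pipeline of models the same way.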

How policies learn: RL

Trial, error, and reward.

Specializing a model: fine-tuning

Adapt a base model to your robot with LoRA.
