Cortex 2.0

Cortex 2.0 at a glance

  • Foresight. Cortex 2.0 adds planning to manipulation by predicting future outcomes before committing the best one to motion.
  • Fewer costly failures. Those futures are scored for progress, stability, and efficiency, so the robot avoids bad branches before they turn into retries or recoveries.
  • Video-first world modeling. The world model learns predictive dynamics in visual latent space, while action generation remains in the execution stack.

Introduction

Cortex 2.0 extends our original Cortex architecture by introducing a world model into the learning loop. Rather than replacing the existing VLA, the world model complements it, enabling reasoning and evaluation over future outcomes.

Robotic manipulation in real-world industrial environments presents a distinct set of challenges: actions are irreversible, failures are costly, and the consequences of decisions often unfold over long horizons. While recent Vision–Language–Action (VLA) models have demonstrated impressive generalization and flexibility, they remain fundamentally reactive systems, optimized to select the next action given the current observation.

In complex tasks like returns handling, the robot often faces cluttered bins, unknown object poses, and long-horizon failure modes like gradual slip, jams, or collisions that only emerge after several steps. Cortex 2.0 addresses this by shifting from try-and-see control to plan-and-try. From the current state, it generates a variable set of k candidate future trajectories in visual latent space, then scores each candidate according to expected success and efficiency. The score is fed forward as an advantage signal guiding action generation, biasing the policy toward motions associated with higher-quality predicted outcomes.

We train our world model in visual space because the data is cheap and far more abundant: there is effectively unlimited video on the internet, and in deployment, cameras provide a free multiplier: one robot produces a single action stream, but 10 cameras can yield roughly 10× more observational training signal for the world model.

This helps the world model learn the underlying semantics of the physical world: joint commands are numbers with weak semantic structure and strong embodiment dependence, while pixels encode rich, transferable regularities about objects, contact, and motion.

Cortex 2.0

Plan → Score → Execute

From Reactive Policies to Physical Reasoning

Cortex 1.0 is centered around a Mixture-of-Experts (MoE) VLA stack that turns multi-modal perception (RGB, depth/3D geometry, proprioception) and task context into a control signal. A VLM/MoE layer first produces a high-level, task-conditioned decision state (e.g., subgoal structure and grounded constraints), and the VLA then combines this context with the current observation to emit action chunks that route into embodiment-specific low-level controllers. This design generalizes across tasks and robots, but it remains fundamentally reactive: decisions are optimized for the next action given the current state, without explicitly evaluating potential futures.

Cortex 2.0 introduces a separation between planning and execution. The world model operates in latent space to plan future observations: given the current robot and environment state, it rolls out k candidate future states of the environment. These candidates are evaluated by our PRO module, which assigns a score to each candidate. We use this score as an advantage indicator for downstream control: the VLA is conditioned on the top-scored rollout and its associated advantage score. The world model provides visual foresight, PRO supplies an advantage signal over candidate futures, and the VLA translates this foresight into robust low-level actions.

Architecture Overview

Cortex 2.0 extends the original execution architecture of our previous model with a complementary planning module. At a high level, the current observation is encoded into a latent state z_t. A MoE reasoning module produces structured task context s_t, and our world model then predicts future observations as planned rollouts over a predefined horizon. The outcome heads (the PRO module) score these planned futures and select the most promising candidate, which is then realized by the VLA and executed on the robot.
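The encode → reason → roll out → score → execute dataflow can be sketched in a few lines. All components below are toy stand-ins: in Cortex 2.0 the encoder, MoE reasoning module, world model, PRO heads, and VLA are large learned networks, and the function names and shapes here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(observation):
    # Encode the current observation into a latent state z_t.
    return np.asarray(observation, dtype=float)

def reason(z_t, instruction):
    # MoE reasoning module: produce structured task context s_t.
    return {"z": z_t, "goal": instruction}

def world_model_rollout(s_t, horizon, k, rng):
    # Imagine k candidate futures, each a sequence of latent states
    # over `horizon` steps (here: random perturbations of z_t).
    return [s_t["z"] + rng.normal(0, 0.1, size=(horizon, len(s_t["z"])))
            for _ in range(k)]

def pro_score(rollout):
    # Toy outcome score: reward predicted progress, penalize jitter (risk).
    progress = rollout[-1].sum() - rollout[0].sum()
    risk = np.abs(np.diff(rollout, axis=0)).sum()
    return progress - 0.1 * risk

def plan_score_execute(observation, instruction, k=4, horizon=5, rng=rng):
    z_t = encode(observation)
    s_t = reason(z_t, instruction)
    rollouts = world_model_rollout(s_t, horizon, k, rng)
    scores = [pro_score(r) for r in rollouts]
    best = int(np.argmax(scores))
    # The best rollout and its score (advantage) would condition the VLA.
    return rollouts[best], scores[best]

best_rollout, advantage = plan_score_execute([0.2, -0.1, 0.5], "pick item")
```

The key structural point is that planning (the rollout loop) is separate from execution: the VLA only ever sees the winning rollout plus its advantage score.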

How Cortex 2.0 Works

Understand. We combine what the robot sees (vision embeddings), what it feels (robot state and contact force), and the task instruction (goal condition) to build a compact representation of the current scene and of the task to be executed.

Plan. Cortex 2.0 uses a world model to predict a set of future scene rollouts in video latent space. These rollouts are futures of the world—what the scene would likely look like over the next timesteps under different plausible choices—before committing to any real action.

Score. Each predicted rollout is scored for what matters in real operations using our PRO heads:

  • progress (are we moving toward task completion?)
  • risk (slips, collisions, jams, unstable placements)
  • efficiency (fewer retries, smoother execution)

Cortex 2.0 then selects the best-scoring plan.
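One simple way the three head outputs could be combined into a single plan score is a weighted sum; the weights and head values below are illustrative assumptions, not Cortex 2.0's actual scoring.

```python
def aggregate_pro(progress, risk, efficiency,
                  w_progress=1.0, w_risk=2.0, w_efficiency=0.5):
    # Risk is weighted most heavily: in industrial settings a collision
    # or jam costs far more than a slightly slower but safe trajectory.
    return w_progress * progress - w_risk * risk + w_efficiency * efficiency

candidates = {
    "fast_but_risky":  aggregate_pro(progress=0.9, risk=0.6, efficiency=0.8),
    "steady_and_safe": aggregate_pro(progress=0.8, risk=0.1, efficiency=0.6),
}
best_plan = max(candidates, key=candidates.get)  # -> "steady_and_safe"
```

With these weights, the slightly slower but low-risk plan wins (0.9 vs. 0.1), which is exactly the bias toward stable futures described above.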

Execute. The execution policy turns the chosen plan into real robot motion—biasing behavior toward actions that are more stable and less likely to trigger costly recoveries.

Across single-arm pick-and-place, dual-arm pick-and-place, towel folding, and returns handling, Cortex 2.0 runs the same loop: generate k visual-latent rollouts, score them for stability/risk/efficiency, and commit only to the best-scored trajectory. This breaks the common reactive pattern where a VLA repeats the same move after a miss—planning filters out bad futures first, then executes the most promising branch.

Because Cortex 2.0 evaluates plans in visual space, its planning generalizes across tasks and robot embodiments. Learning and transferring plans directly in action space is much harder, since action commands depend on a robot's specific kinematics and controllers; the same “good” behavior can require very different joint motions on different robots.

Early Results

We evaluate Cortex 2.0 against state-of-the-art open-source visuomotor policies on a bimanual manipulation platform with dual UR5e arms. Three tasks of increasing complexity test performance across diverse manipulation primitives.

Planning budget

A key advantage of Cortex 2.0 is that we can dial the amount of planning per decision by changing k, the number of imagined future rollouts sampled and scored before committing to an action.

Success rate (left axis) rises as k increases, while time per step (right axis) also increases, capturing the central design trade-off in Cortex 2.0: more foresight yields better decisions, at higher compute/latency cost. For the task evaluations below, we fix a low-latency setting of k=2. Because planning runs in visual latent space, we can adjust planning compute through (i) k and (ii) rollout quality (e.g., denoising steps), choosing higher budgets for costly failure modes (e.g., packing) and lower budgets when recovery is cheap (e.g., regrasping).
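The trade-off is the classic best-of-k selection effect: the expected maximum of k sampled plan scores rises with k, while planning latency grows linearly. A minimal simulation, with an assumed per-rollout latency and Gaussian score distribution purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
COST_PER_ROLLOUT_MS = 40.0  # assumed per-rollout planning latency (illustrative)

def plan_with_budget(k, rng):
    # Best-of-k selection: sample k candidate plan scores, keep the maximum.
    scores = rng.normal(loc=0.5, scale=0.2, size=k)
    return scores.max()

# Average over many episodes: expected best-of-k score rises with k,
# while latency grows linearly with k.
budget_curve = {
    k: (np.mean([plan_with_budget(k, rng) for _ in range(2000)]),
        k * COST_PER_ROLLOUT_MS)
    for k in (1, 2, 4, 8)
}
```

Doubling k roughly adds one more standard-deviation fraction of score but doubles planning time, which is why a cheap-recovery task can run at k=2 while packing justifies a larger budget.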

Shoebox (long-horizon sequential task)

Cortex 2.0 achieves the highest success rate while also being significantly faster than the baseline policies, and it completes rollouts without human intervention. Baselines either slow down due to repeated retries and late-stage deadlocks, or fail to complete the full sequence reliably.

Shoebox benchmark results

Sorting Screws (fine-grained precision)

Cortex 2.0 reaches near-perfect per-operation success on small, reflective screws, with the shortest average completion time and without requiring human intervention. Baselines struggle with accurate grasps and placement, leading to drops, misplacements, and periodic deadlocks that require intervention.

Sorting Screws benchmark results

Sorting Items & Trash (cluttered repeated pick-and-place)

Cortex 2.0 achieves the highest per-object success while maintaining the best throughput, completing rollouts end-to-end without human intervention. Baselines require interventions to finish and often slow down from repeated local replanning around failed grasps, frequently reaching the runtime limit without fully completing the task.

Where We Are Today

We are currently training Cortex 2.0 on an increasing volume of our real-world data. The initial focus is to validate the planning loop on a narrow set of high-impact workflows, beginning with tasks inside returns handling, where long-horizon failures are common and where better foresight directly reduces operational cost. While continuing to expand use cases and tasks, we can already observe a substantial impact from incorporating potential future outcomes into VLA policies.

Towards In-Context Learning

Making the policy video-aware and training a world model in conjunction with policy training is a step towards in-context learning for robotics: given a sequence of demonstrations in the form of a video, the robot can execute these exact steps without re-training. Today's LLMs exhibit in-context learning capabilities in many applications: agents can be conditioned to execute tasks through language alone. We are working towards similar capabilities for robots to unlock generalization for physical AI.

Cortex 2.0 in Action

Across these four tasks, the baseline policy often gets trapped in a loop: it misses once, retries the same motion, and compounds the failure. With Cortex 2.0, the robot evaluates a few future outcomes first and commits to the option that stays stable, so the same setups complete smoothly instead of spiraling into recovery.

Return Handling at Active Ants

Parcel Handling

Item Handling & Dispatch

Our world model helps by planning safe box opening and sensible sequencing to efficiently pick the item and avoid snags or occlusions. It also predicts which bin placement trajectory will lead to the most stable final state, reducing misplacements and the need for downstream recovery.

Kitting at Deltilog

Item Picking

Box Filling (Dual-Arm)


In picking, a miss is usually just a regrasp—recovery is cheap. In box filling, small errors compound: a slightly off approach can trigger collisions with the box walls or other items, cause snags, or create damage that's costly to undo. That's why we spend more planning budget (higher k) during packing: we're not only looking for a successful placement, but the safest one. PRO-style outcome heads score risk continuously—penalizing futures with high-speed contact, compression, edge impacts, or surface scraping—even if the item still ends up in the box.
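Scoring risk continuously rather than as a binary success label can be sketched as a severity-weighted penalty over predicted contact events. The event names and weights below are assumptions for illustration, not Cortex 2.0's actual risk head:

```python
# Illustrative penalty weights per predicted contact event type.
PENALTIES = {
    "high_speed_contact": 1.0,
    "compression": 0.7,
    "edge_impact": 0.5,
    "surface_scrape": 0.2,
}

def placement_risk(events):
    # events: list of (event_name, severity in [0, 1]) extracted from
    # a predicted rollout; risk accumulates even if placement "succeeds".
    return sum(PENALTIES[name] * severity for name, severity in events)

# A placement that ends up in the box but scrapes the wall still accrues risk:
risk = placement_risk([("surface_scrape", 0.6), ("edge_impact", 0.3)])
# -> 0.27  (0.2*0.6 + 0.5*0.3)
```

This is why two rollouts that both "succeed" can receive very different scores: the safest placement, not just a successful one, wins the planning budget.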

Parcel Closing at Arvato

Manual Process

Automated (Dual-Arm)

Our world model helps by predicting how the carton sheet and the items will move and settle during placement, so the robot can choose an order and alignment that keeps everything flat and ready to close the parcel.

Returns Putaway at Radial

Manual Process

Automated (Dual-Arm)


This workflow benefits from lookahead because putaway is a tightly timed handoff: planning helps keep the scan readable while avoiding arm–arm and bin-edge interference. It also favors placements that settle stably (no snagging, rebound, or re-occlusion), reducing downstream retries in high-throughput sorting.

We are also hiring! If you'd be interested in joining us, please get in touch.

For researchers interested in our work, collaborations, or other queries, please write to research@sereact.ai.