MARS Policy

Multimodality Only When It Matters

Jindou Jia*‡, Tuo An*, Yuxuan Hu*, Gen Li, Jingliang Li,
Bohan Hou, Xiangyu Chen, Jiaqi Bai, Bofan Lyu, Jianfei Yang

NTU MARS Lab

*Equal contribution Project lead Corresponding author

§ 01 — Motivation

Generative policies are expressive — but slow. Deterministic ones are fast — but mode-averaging.

Generative

Flow-matching and diffusion policies model behaviour as a stochastic process. They capture multi-modal action distributions, but every inference call pays for stochastic noise initialisation and an iterative denoising loop.

Deterministic

Action-to-action regressors are an order of magnitude cheaper, yet they collapse to the mean whenever the demonstration data covers several equally valid behaviours, silently degrading rollout quality.
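
The mean-collapse effect can be reproduced in a few lines. The sketch below is an illustration only, not code from the MARS project: it assumes a one-dimensional action with two equally valid demonstrated values (-1 and +1, both hypothetical) and computes the constant prediction that minimises squared error.

```python
import numpy as np

# Hypothetical bimodal demonstrations: from the same observation, half of
# the experts act "left" (-1) and half act "right" (+1).
rng = np.random.default_rng(0)
actions = np.concatenate([
    rng.normal(-1.0, 0.05, size=500),
    rng.normal(+1.0, 0.05, size=500),
])

# The constant that minimises mean squared error is the sample mean.
mse_optimal_action = actions.mean()

# Distance from the regressed action to the nearest demonstrated mode.
gap_to_nearest_mode = min(abs(mse_optimal_action - 1.0),
                          abs(mse_optimal_action + 1.0))

print(f"regressed action:    {mse_optimal_action:+.3f}")
print(f"gap to nearest mode: {gap_to_nearest_mode:.3f}")
```

The regressed action lands between the two modes, far from either demonstrated behaviour. In a navigation setting that midpoint can be exactly where the obstacle sits.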

MARS: Inject a proper amount of noise only at the proper time.

Concept figure: expert trajectories, generative policy, deterministic policy, MARS policy on a 2D navigation task
FIG. 01 · 2D Navigation

An illustrative 2D navigation task. (a) Expert trajectories exhibit two valid paths around the obstacle. (b) A generative policy retains both modes but pays a heavy denoising cost everywhere. (c) A deterministic policy collapses the two modes into a single, often colliding path. (d) MARS keeps multimodality only where it matters — the colour map shows the per-step multimodality degree rising near the branch point and falling to zero elsewhere.

§ 02 — Real-World

Four manipulation tasks. Two robots.

The four real-world tasks, each with two valid modes: Pick Cup (handle or rim), Pick Vegetable (daikon or carrot), Block Push (orange or purple), Push-T (upper or lower).
FIG. 02 · The four tasks

Each real-world task admits two valid modes.

Pick Cup

A coffee mug exposes two equally valid grasps — by the handle or by the rim. MARS samples each grasp in turn.

Galaxea R1 Lite
MARS (ours)
MARS commits to either grasp from the same start.
Flow Matching
Generative, but jitterier.
A2A Deterministic — commits to a single grasp every rollout.

Pick Vegetable

Carrot and daikon sit on opposite sides of a mango at every held-out evaluation; either pick is a success. MARS preserves the modal balance.

Galaxea R1 Lite
MARS (ours)
MARS commits to either pick, run to run.
Flow Matching
Generative, but jitterier.
A2A Deterministic — always the same pick.

Push-T

From a fixed start pose, the T-shaped block can be aligned by going around it from either side. MARS samples a different path on each rollout while still landing the T inside the target.

Franka
MARS · Upper route: sweeps over the upper edge of the T before nudging it in.
MARS · Lower route: same policy, same start; approaches from below and rotates it in.

Block Push

From an identical initial cube pose, two distinct goal regions are equally valid. The dataset is roughly balanced; only a policy that preserves modal balance can land in each.

Franka
MARS · Upper goal: pushes the cube toward the upper target region.
MARS · Lower goal: same start state, but picks the lower target instead. A2A would average both and miss.
Real-world success rates, modal balance γ, and inference latency analysis.
FIG. 03 · Real-world summary

Success, modal balance, and latency across all four tasks.

(a) Success rate (higher is better). (b) Modal balance γ (closer to 1 means both modes are used evenly). (c) Per-step inference latency on the Push-T task; MARS cuts latency by roughly 83% versus FM-DiT.
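
To make the two summary quantities concrete, here is a minimal sketch. The page does not give a formula for γ, so `modal_balance` uses an assumed definition (the count of the rarer mode divided by the count of the more common one) that merely matches the stated property that 1 means even usage. `speedup_from_reduction` is plain arithmetic that converts a fractional latency reduction into a speed-up factor. Both function names are hypothetical.

```python
def modal_balance(count_a: int, count_b: int) -> float:
    """ASSUMED definition of gamma: rarer-mode count / more-common-mode count."""
    if count_a < 0 or count_b < 0:
        raise ValueError("counts must be non-negative")
    if max(count_a, count_b) == 0:
        return 0.0  # no rollouts in either mode, so nothing to balance
    return min(count_a, count_b) / max(count_a, count_b)


def speedup_from_reduction(reduction: float) -> float:
    """Convert a fractional latency reduction into a speed-up factor."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


print(modal_balance(10, 10))   # both modes used equally often
print(modal_balance(20, 0))    # policy collapsed onto one mode
print(round(speedup_from_reduction(0.83), 2))
```

Under this reading, an 83% latency reduction corresponds to a speed-up of just under 6x.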

§ 03 — Simulation Benchmark

Eight tasks. Two environments.

We separate simulated tasks by whether the demonstrations are genuinely multimodal or essentially unimodal. MARS wins both — and on unimodal tasks it actually trains faster than the deterministic baseline by modelling residual action diversity that strict regression discards.

Push Cube benchmark — bimodal left/right push directions; MARS keeps both modes while the deterministic baseline collapses to one.
Grasp Eyeglass benchmark — bimodal grasp poses; flow-matching drifts toward a single pose while MARS stays expressive.
Collision Avoidance benchmark — bimodal slow/fast speeds; MARS sustains both strategies where baselines oscillate or collapse.
FIG. 04 · Multimodal benchmarks

Strategically multimodal tasks.

(a) Push Cube · bimodal directions. (b) Grasp Eyeglass · bimodal grasp poses. (c) Collision Avoidance · bimodal speeds. Each panel: success rate (top) and modal balance γ (bottom) vs training epochs.

Legend: Generative, Deterministic, MARS.
Learning curves on strategically unimodal simulated tasks.
FIG. 05 · Unimodal benchmarks

Strategically unimodal tasks.

(a) Close Box · RLBench. (b) Stack Cube · ManiSkill. (c) Pick Cube · ManiSkill. (d) Close Drawer · LIBERO. Even on tasks that look strategically unimodal but still contain nuanced trajectory variations, the MARS policy trains more efficiently than the deterministic one.

§ 04 — Method

How MARS works.

A small modal scheduling network reads the recent action history and predicts where multimodality is needed. Its output then governs both the source distribution of the flow and the number of denoising steps at inference.

MARS policy architecture diagram
FIG. 06 · Architecture

Observation encoder feeds a modal scheduling head that emits per-dimension weights w. Those weights interpolate the flow source between the action prior and Gaussian noise, then drive an adaptive ODE-step budget at inference.
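
The interpolated flow source described above can be sketched as follows. This is a hedged illustration rather than the authors' implementation. It assumes a linear per-dimension blend, since the caption only says the weights "interpolate" between the two endpoints. The function `hybrid_source` and the example values of the action prior and of w are hypothetical.

```python
import numpy as np


def hybrid_source(action_prior, w, rng):
    """Blend a deterministic action prior with Gaussian noise, per dimension.

    ASSUMPTION: linear interpolation, x0 = (1 - w) * prior + w * noise.
    """
    action_prior = np.asarray(action_prior, dtype=float)
    w = np.clip(np.asarray(w, dtype=float), 0.0, 1.0)
    noise = rng.standard_normal(action_prior.shape)
    return (1.0 - w) * action_prior + w * noise


rng = np.random.default_rng(0)
prior = np.array([0.3, -0.2, 0.5])   # hypothetical 3-D action prior
w = np.array([0.0, 0.0, 1.0])        # only the last dimension is flagged as multimodal

samples = np.stack([hybrid_source(prior, w, rng) for _ in range(1000)])
print(samples.std(axis=0))  # per-dimension spread of the sampled sources
```

With w = 0 a dimension starts exactly at the prior and carries no randomness. With w = 1 it starts from pure Gaussian noise, as in standard flow matching.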

  1. Modal scheduling

    A lightweight head predicts w from the recent action context, one weight per action dimension: high where demonstrations branch, near zero where they don't.

  2. Adaptive source flow

    Instead of pure Gaussian noise, the flow starts from a hybrid source: the weights w interpolate, per dimension, between the action prior and Gaussian noise.

  3. Diversity-aware loss & adaptive steps

    Training jointly minimises a flow-matching loss, a reconstruction loss and a diversity term that matches source spread to target spread. At inference the ODE step budget scales with w.
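
The adaptive step budget in step 3 can be sketched as below. The exact rule is not given on this page, so `ode_steps` assumes the budget grows linearly with the largest scheduling weight and never drops below one step. `integrate` is an ordinary fixed-step Euler solver, and the velocity field is a toy stand-in for a learned network. All names and constants are hypothetical.

```python
import math

import numpy as np


def ode_steps(w, max_steps=10, min_steps=1):
    """ASSUMED rule: step budget grows linearly with the largest weight in w."""
    peak = float(np.clip(np.max(w), 0.0, 1.0))
    return max(min_steps, math.ceil(peak * max_steps))


def integrate(x0, velocity, n_steps):
    """Fixed-step Euler integration of dx/dt = velocity(x, t) from t=0 to t=1."""
    x = np.asarray(x0, dtype=float).copy()
    dt = 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * velocity(x, k * dt)
    return x


# Toy velocity field (hypothetical): straight-line flow toward a fixed target.
target = np.array([1.0, -1.0])


def velocity(x, t):
    return (target - x) / (1.0 - t)


for w in (np.zeros(2), np.array([0.25, 0.0]), np.ones(2)):
    n = ode_steps(w)
    print(n, integrate(np.zeros(2), velocity, n))
```

In this toy field a single Euler step already lands on the target. That hints at why a near-deterministic flow (small w) may need far fewer steps than a flow that starts from pure noise.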

§ 05 — Additional Analysis

Qualitative trajectory comparison across policy architectures.
FIG. 07 · Qualitative landscape

Eight policies, one navigation task.

Trajectories from Expert, FM, IBC, BET, A2A, Noised-A2A, VITA, ACT, and MARS on the 2D navigation benchmark.

A2A policy failure modes on 2D navigation.
FIG. 08 · Architectural failure

Deterministic regression averages into walls.

A2A policy on 2D Navigation. A2A occasionally reaches the target by collapsing to a single mode (a), but frequently gets stuck near an obstacle (b). Successful runs are largely attributable to training-time stochasticity.

Reconstruction loss compromises multimodal expressivity.
FIG. 09 · Training-time failure

Adding reconstruction loss kills expressivity.

We visualise 100 rollouts of the flow-matching policy on 2D Navigation. (a) Standard flow matching distributes trajectories across all four valid passages (counts: 20/24/29/24). (b) Adding a weighted reconstruction loss biases the policy toward the two central passages (10/36/40/4), leaving the outer passages substantially under-explored. This motivates our per-dimension gating strategy, which adaptively suppresses the reconstruction penalty according to the recognised degree of multimodality.
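
One way to put a number on the shift between the two panels is the normalised Shannon entropy of the passage counts. This metric is a choice made for this illustration and does not come from the paper. A value of 1 means the four passages are used uniformly, and lower values mean the trajectories concentrate on fewer passages. The function name is hypothetical, and the counts are the ones quoted in the caption.

```python
import math


def normalised_entropy(counts):
    """Shannon entropy of the empirical distribution, scaled to [0, 1]."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(counts))


standard_fm = [20, 24, 29, 24]   # passage counts quoted for panel (a)
with_recon = [10, 36, 40, 4]     # passage counts quoted for panel (b)

# Probabilities are normalised by the sum of the quoted counts.
print(f"standard flow matching: {normalised_entropy(standard_fm):.3f}")
print(f"with reconstruction:    {normalised_entropy(with_recon):.3f}")
```

The first distribution is close to uniform, while the second scores markedly lower, consistent with the outer passages being under-explored.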

Effect of initial flow variance on optimisation and evaluation.
FIG. 10 · Source-variance sweep

Why pure Gaussian noise is the wrong default.

Settings: 30 training epochs, 100 demonstrations, and 50 evaluation rollouts. Left: training loss curves under different initial variances (0–10) of the flow source distribution; larger variance leads to slower and noisier convergence. Right: evaluation success rate under the same variances; excessively large variance degrades it.
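
One plausible contributor to the noisier convergence can be shown with a toy calculation. It assumes the common linear-path form of flow matching, in which the network regresses the velocity x1 - x0. The page does not state which path is used, so this is an assumption. Under that form, widening the source distribution directly inflates the variance of the regression target. The data and function name below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical 1-D action targets with a small spread around 0.5.
x1 = rng.normal(0.5, 0.1, size=n)


def target_velocity_variance(source_std):
    """Variance of the regression target x1 - x0 when x0 ~ N(0, source_std^2)."""
    x0 = rng.normal(0.0, source_std, size=n)
    return float(np.var(x1 - x0))


for std in (0.0, 1.0, 3.0):
    print(f"source std {std}: target variance {target_velocity_variance(std):.3f}")
```

Because x0 and x1 are drawn independently here, the target variance is approximately the data variance plus the source variance, so a wide source gives the regressor a much noisier signal to fit.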

Robustness analysis of modal balance under data scarcity and over-training.
FIG. 11 · Robustness on Pick Cup

Modal balance under stress.

Robustness analysis of modal balance under real-world data scarcity and overtraining (Pick Cup). The default configuration (“Adopted”) consists of 200 demonstrations and 400 training epochs. MARS maintains a resilient and stable modal balance (γ = 0.8) across both data-restricted (50 demonstrations) and overtrained regimes (800 epochs). In contrast, the generative baseline exhibits severe mode degradation.

§ 06 — Cite

BibTeX

@misc{jia2026marspolicymultimodalitymatters,
  title         = {MARS Policy: Multimodality Only When It Matters},
  author        = {Jindou Jia and Tuo An and Yuxuan Hu and Gen Li and
                   Jingliang Li and Bohan Hou and Xiangyu Chen and Jiaqi Bai and
                   Bofan Lyu and Jianfei Yang},
  year          = {2026},
  eprint        = {2605.29766},
  archivePrefix = {arXiv},
  primaryClass  = {cs.RO},
  url           = {https://arxiv.org/abs/2605.29766}
}