Multimodality Only When It Matters
* Equal contribution · ‡ Project lead · † Corresponding author
§ 01 — Motivation
Flow-matching and diffusion policies model behaviour as a stochastic process. They capture multi-modal action distributions, but every inference call pays for stochastic noise initialisation and an iterative denoising loop.
Action-to-action regressors are an order of magnitude cheaper, yet they collapse to the mean whenever the demonstration data covers several equally valid behaviours, silently degrading rollout quality.
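The mean collapse can be reproduced with a toy example. The sketch below is not taken from MARS: it assumes a hypothetical one-dimensional "push left or push right" dataset and uses the fact that the constant minimising mean squared error is the sample mean.

```python
import random
import statistics


def best_constant_mse(targets):
    """Return the constant that minimises mean squared error: the sample mean."""
    return statistics.fmean(targets)


random.seed(0)

# Hypothetical bimodal demonstrations: push left (-1) or right (+1), plus small noise.
demos = [random.choice((-1.0, 1.0)) + random.gauss(0.0, 0.05) for _ in range(1000)]

# A regressor with no way to tell the modes apart outputs (roughly) their average.
prediction = best_constant_mse(demos)

# Distance from the regressor's output to the nearest demonstrated mode.
gap = min(abs(prediction - (-1.0)), abs(prediction - 1.0))

print(f"prediction={prediction:+.3f}, distance to nearest mode={gap:.3f}")
```

The prediction lands near zero, an action that neither demonstrated mode contains, which is the failure the paragraph above describes.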
MARS: Inject a proper amount of noise only at the proper time.
vs. flow-matching at inference
average gain on real-world manipulation tasks
8 simulated · 4 real-world · Franka & Galaxea R1 Lite
§ 02 — Real-World
A coffee mug exposes two equally valid grasps — by the handle or by the rim. MARS samples each grasp in turn.
Carrot and daikon sit on opposite sides of a mango at every held-out evaluation; either pick is a success. MARS preserves the modal balance.
From a fixed start pose, the T-shaped block can be aligned by going around it from either side. MARS samples a different path on each rollout while still landing the T inside the target.
From an identical initial cube pose, two distinct goal regions are equally valid. The dataset is roughly balanced; only a policy that preserves modal balance can land in each.
§ 03 — Simulation Benchmark
We separate simulated tasks by whether the demonstrations are genuinely multimodal or essentially unimodal. MARS comes out ahead in both groups. On unimodal tasks it also trains faster than the deterministic baseline, because it models the residual action diversity that strict regression discards.
(a) Push Cube · bimodal directions. (b) Grasp Eyeglass · bimodal grasp poses. (c) Collision Avoidance · bimodal speeds. Each panel: success rate (top) and modal balance γ (bottom) vs training epochs.
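The page does not spell out how the modal balance γ is computed. As a rough illustration only, the sketch below assumes a hypothetical definition: the rollout count of the least-visited mode divided by that of the most-visited mode, so 1.0 means perfectly balanced and 0.0 means at least one mode is never produced. The function name and the mode labels are placeholders.

```python
from collections import Counter


def modal_balance(mode_labels, modes):
    """Hypothetical balance score in [0, 1].

    mode_labels: the mode reached by each rollout (e.g. "left" or "right").
    modes: every mode that counts as valid for the task.
    Returns least-visited count / most-visited count; 0.0 if no rollout hit a mode.
    """
    counts = Counter(mode_labels)
    per_mode = [counts.get(m, 0) for m in modes]
    most = max(per_mode)
    return 0.0 if most == 0 else min(per_mode) / most


rollouts = ["left", "right", "left", "left"]
print(modal_balance(rollouts, ["left", "right"]))
```

Under this assumed definition, a policy that collapses to a single mode scores 0.0 even if every rollout succeeds, which is why the balance is tracked separately from success rate.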
§ 04 — Method
A small modal scheduling network reads the recent action history and predicts where multimodality is needed. Its output then governs both the source distribution of the flow and the number of denoising steps at inference.
A lightweight head predicts a modal weight from the recent action context, one weight per action dimension. The weight is high where demonstrations branch and near zero where they don't.
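To make the shape of this head concrete, here is a minimal sketch under stated assumptions: a single linear layer followed by a sigmoid, applied to the flattened action history. The class name, the layer choice and the initialisation are hypothetical; the actual MARS architecture may differ, and the weights here are untrained.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class ModalSchedulingHead:
    """Hypothetical stand-in for the scheduling network: linear layer + sigmoid."""

    def __init__(self, history_len, action_dim, seed=0):
        rng = random.Random(seed)
        self.in_dim = history_len * action_dim
        self.action_dim = action_dim
        # One row of weights per action dimension (untrained, small random values).
        self.weights = [
            [rng.gauss(0.0, 0.1) for _ in range(self.in_dim)]
            for _ in range(action_dim)
        ]
        self.bias = [0.0] * action_dim

    def __call__(self, action_history):
        """Map a history of shape [history_len][action_dim] to one weight in (0, 1) per dimension."""
        flat = [a for step in action_history for a in step]
        if len(flat) != self.in_dim:
            raise ValueError(f"expected {self.in_dim} values, got {len(flat)}")
        return [
            sigmoid(sum(w * x for w, x in zip(row, flat)) + b)
            for row, b in zip(self.weights, self.bias)
        ]


head = ModalSchedulingHead(history_len=4, action_dim=7)
history = [[0.1 * t] * 7 for t in range(4)]  # placeholder action history
print(head(history))
```

The sigmoid keeps each weight between 0 and 1, which matches the description of a per-dimension weight that is high at branch points and near zero elsewhere.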
Instead of pure Gaussian noise, the flow starts from a hybrid:
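As a minimal sketch, assume the hybrid blends the previous action with Gaussian noise per dimension, `x0 = (1 - w) * a_prev + w * eps`. This particular form, and the names `prev_action` and `modal_weight`, are assumptions made for illustration, not the formula used by MARS.

```python
import random


def hybrid_source(prev_action, modal_weight, rng):
    """Assumed hybrid source sample.

    With weight 0 a dimension starts exactly at the previous action
    (action-to-action behaviour); with weight 1 it starts from pure
    standard Gaussian noise (ordinary flow matching).
    """
    return [
        (1.0 - w) * a + w * rng.gauss(0.0, 1.0)
        for a, w in zip(prev_action, modal_weight)
    ]


rng = random.Random(0)
prev_action = [0.2, -0.4, 0.9]
modal_weight = [0.0, 0.5, 1.0]  # deterministic, mixed, fully stochastic
print(hybrid_source(prev_action, modal_weight, rng))
```

Under this assumption the weight acts as a per-dimension dial between the two families contrasted in the Motivation section.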
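Training jointly minimises a flow-matching loss, a reconstruction loss and a diversity term that matches the spread of the source distribution to the spread of the target. At inference the ODE step budget scales with the weight predicted by the scheduling network.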
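A weight-dependent step budget could look like the sketch below. The rule used here (scale a maximum step count by the largest per-dimension weight, with a floor of one step) is an assumption, as are the function names, the Euler integrator and the toy velocity field, which stands in for the learned network.

```python
import math


def ode_steps(modal_weight, max_steps=10, min_steps=1):
    """Assumed budget rule: more steps when any dimension has a high modal weight."""
    peak = max(modal_weight)
    return max(min_steps, min(max_steps, math.ceil(peak * max_steps)))


def euler_integrate(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target):
    """Toy straight-line flow toward `target`; also counts how often it is called."""
    calls = {"n": 0}

    def velocity(x, t):
        calls["n"] += 1
        return [(g - xi) / (1.0 - t) for g, xi in zip(target, x)]

    return velocity, calls


velocity, calls = make_toy_velocity([1.0, -2.0])
steps = ode_steps([0.25, 0.1])
action = euler_integrate(velocity, [0.0, 0.0], steps)
print(steps, calls["n"], action)
```

Each Euler step costs one evaluation of the velocity function, so under this assumed rule a near-zero weight reduces inference to a single network call.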
§ 05 — Additional Analysis
§ 06 — Cite
@misc{jia2026marspolicymultimodalitymatters,
title = {MARS Policy: Multimodality Only When It Matters},
author = {Jindou Jia and Tuo An and Yuxuan Hu and Gen Li and
Jingliang Li and Bohan Hou and Xiangyu Chen and Jiaqi Bai and
Bofan Lyu and Jianfei Yang},
year = {2026},
eprint = {2605.29766},
archivePrefix = {arXiv},
primaryClass = {cs.RO},
url = {https://arxiv.org/abs/2605.29766}
}