Action-to-Action Flow Matching

Nanyang Technological University

Abstract

Diffusion-based policies have recently achieved remarkable success in robotics by formulating action prediction as a conditional denoising process. However, the standard practice of sampling from random Gaussian noise often requires multiple iterative steps to produce clean actions, leading to high inference latency that creates a major bottleneck for real-time control. In this paper, we challenge the necessity of uninformed noise sampling and propose Action-to-Action flow matching (A2A), a novel policy paradigm that shifts from random sampling to initialization informed by the previous action. Unlike existing methods that treat proprioceptive action feedback as static conditions, A2A leverages historical proprioceptive sequences, embedding them into a high-dimensional latent space as the starting point for action generation. This design bypasses costly iterative denoising while effectively capturing the robot's physical dynamics and temporal continuity. Extensive experiments demonstrate that A2A exhibits high training efficiency, fast inference speed, and improved generalization. Notably, A2A enables high-quality action generation with as little as a single inference step (0.56 ms latency), and exhibits superior robustness to visual perturbations and enhanced generalization to unseen configurations. Lastly, we also extend A2A to video generation, demonstrating its broader versatility in temporal modeling.

Key Insight: From Noise-to-Action to Action-to-Action

Traditional diffusion models were originally developed for high-fidelity image synthesis and video generation, where generation typically begins from uninformed noise due to the absence of meaningful priors. We observe that robots operate under a fundamentally different regime, one that makes action-to-action generation feasible (a minimal training sketch follows the framework figure below):

  1. Modern robotic systems have continuous proprioceptive feedback
  2. Adjacent action chunks exhibit inherent similarity due to physical consistency
  3. Historical actions can serve as a strong initialization signal for action generation
A2A Main Framework
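
To make the paradigm shift concrete, below is a minimal, hypothetical PyTorch sketch of what an action-to-action flow-matching training step could look like. It is not the authors' implementation: the module names (A2APolicy, history_encoder), the network sizes, and the simple MLP velocity network are illustrative assumptions. The point it encodes is that the flow's starting point x0 is the embedded historical action sequence rather than Gaussian noise.

```python
# Illustrative sketch (not the released A2A code); names and sizes are assumptions.
import torch
import torch.nn as nn


class A2APolicy(nn.Module):
    def __init__(self, action_dim=7, history_len=8, horizon=16, obs_dim=256):
        super().__init__()
        # Embed the historical proprioceptive action sequence into the same
        # space as the target action chunk; it replaces Gaussian noise as the
        # starting point x0 of the flow.
        self.history_encoder = nn.Sequential(
            nn.Flatten(),
            nn.Linear(history_len * action_dim, horizon * action_dim),
        )
        # Velocity network: predicts dx_t/dt given the interpolated sample,
        # the flow time t, and an observation embedding.
        self.velocity = nn.Sequential(
            nn.Linear(horizon * action_dim + 1 + obs_dim, 512),
            nn.ReLU(),
            nn.Linear(512, horizon * action_dim),
        )

    def forward(self, x_t, t, obs_emb):
        return self.velocity(torch.cat([x_t, t, obs_emb], dim=-1))


def a2a_loss(policy, history, target_actions, obs_emb):
    """Rectified-flow style objective where x0 is the embedded action
    history and x1 is the ground-truth future action chunk."""
    b = history.shape[0]
    x0 = policy.history_encoder(history)        # informed starting point
    x1 = target_actions.flatten(1)              # clean target action chunk
    t = torch.rand(b, 1, device=history.device)  # flow time in [0, 1]
    x_t = (1 - t) * x0 + t * x1                  # linear interpolation
    v_target = x1 - x0                           # constant target velocity
    v_pred = policy(x_t, t, obs_emb)
    return ((v_pred - v_target) ** 2).mean()
```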

Ingredient 1: High Training Efficiency

A2A converges to a high success rate quickly and maintains stable performance even with limited data.

  • Fast convergence: Reaches a stable 100% success rate within 40 training epochs.
  • High sample efficiency: Reaches and maintains a high performance ceiling as the number of demonstrations increases.
Training Efficiency Details

Quantitative Comparison

Success rates across 5 simulation tasks (100 demonstrations, 30 epochs)

Methods      Steps   Close Box   Pick Cube   Stack Cube   Open Drawer   Pick-Place Bowl
A2A (Ours)       6         92%         92%          86%           92%               90%
VITA             6         88%         88%          80%           90%               92%
FM-UNet         10         82%         70%          28%           34%               68%
FM-DiT          10         58%         88%          26%           28%               84%
DDPM-UNet      100         72%         60%          36%           64%               66%
DDPM-DiT       100         58%         58%          16%           14%               68%
DDIM-UNet       40         70%         56%          36%           64%               82%
Score-UNet     100         36%         36%          12%            0%                4%
ACT              1         82%         86%          32%           80%               60%

Real-world test

Pick the cube from different locations with only 10 training demonstrations. A2A demonstrates rapid adaptation and high success rates, showcasing its superior data efficiency in novel scenarios.

Ingredient 2: Fast Inference Speed

A2A can run in a single step with sub-millisecond latency while maintaining high performance; a minimal inference sketch appears after the analysis below.

  • Rapid performance peak: Success rates saturate at just 4 inference steps.
  • Single-step inference still works well: when limited to one inference step, the success rate rises above 90% after 32 training epochs.
  • Latency: mean inference latency is ~1 ms, and 0.56 ms in the single-step regime (benchmarked on an RTX 5090).
Inference Speed Analysis
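
Under the same assumptions as the training sketch above, few-step inference reduces to a handful of Euler updates of the learned velocity field, starting from the embedded action history instead of Gaussian noise. This is an illustrative sketch, not the released code; the uniform Euler schedule and the num_steps parameter are assumptions.

```python
# Illustrative few-step inference sketch, assuming the A2APolicy sketch above.
import torch


@torch.no_grad()
def a2a_infer(policy, history, obs_emb, num_steps=1):
    """Generate an action chunk in `num_steps` Euler steps
    (one step in the fastest regime reported above)."""
    x = policy.history_encoder(history)       # informed starting point x0
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = torch.full((x.shape[0], 1), k * dt, device=x.device)
        v = policy(x, t, obs_emb)              # predicted velocity at time t
        x = x + dt * v                         # Euler update toward x1
    return x                                   # flattened action chunk
```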

Real-World Inference Comparison

Side-by-side comparison of different methods performing the same task in real-world settings. With 30 demonstrations, DDPM-UNet and FM-UNet achieve lower success rates and exhibit significant hesitation during operation.

A2A (Ours)

DDPM-UNet

FM-UNet

Ingredient 3: Improved Generalization

A2A generalizes better than diffusion and flow baselines under heavy visual randomization and unseen configurations.

Visual Generalization

We evaluate A2A under progressively challenging visual perturbations. Level 0 is the original training environment, while Levels 1-3 incrementally add new backgrounds, lighting conditions, and viewpoints.

Level 0 Original
Level 1 + New Backgrounds
Level 2 + Lighting
Level 3 + Viewpoints

Visual Generalization Results

Success rates under different levels of visual perturbation

Methods      Steps   Level 0   Level 1   Level 2   Level 3
A2A (Ours)       6       92%       42%       40%       38%
A2A (Ours)       1       90%       36%       34%       32%
FM-UNet         10       70%       24%       20%       16%
DDPM-UNet      100       60%       18%       14%       10%

Real-world test

Real-World Generalization

(a) In-distribution evaluation: A2A achieves a 100% success rate on the standard Pick Cube task with only 30 training trajectories. (b) Out-of-distribution stress test: when the target cube is replaced with an unseen glowing variant, baseline methods (FM-UNet, DDPM-UNet) fail completely (0% success), while A2A maintains robust performance with an 80% success rate.

Pick Cube (In-Distribution)

Pick Cube (New Target)

Initial State Generalization

By injecting a small amount of Gaussian noise into the historical actions, A2A demonstrates strong generalization to diverse initial configurations. Even when objects are placed in unseen positions or orientations at the start of an episode, A2A successfully adapts and completes the task.
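
As an illustration of this trick, the snippet below perturbs the action history before it is embedded as the flow's starting point. The function name and the `noise_std` value are hypothetical; the exact noise scale is not specified here.

```python
# Illustrative sketch of the noise-injection trick; `noise_std` is an assumed value.
import torch


def perturb_history(history, noise_std=0.01):
    """Add small Gaussian noise to historical actions so the informed
    initialization does not overfit to the exact training trajectories."""
    return history + noise_std * torch.randn_like(history)


# Example usage with the (hypothetical) policy sketched earlier:
#   x0 = policy.history_encoder(perturb_history(history))
```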

Configuration 1

Configuration 2

Configuration 3

BibTeX

@article{jia2026a2a,
      title={Action-to-Action Flow Matching},
      author={Jindou Jia and Gen Li and Xiangyu Chen and Tuo An and Yuxuan Hu and Jingliang Li and Xinying Guo and Jianfei Yang},
      journal={arXiv preprint arXiv:2602.07322},
      year={2026}
}