Feedback World Model Enables
Precise Guidance of Diffusion Policy

Tuo An*, Jindou Jia*, Gen Li, Jingliang Li, Chuhao Zhou,
Pengfei Liu, Bofan Lyu, Jiaqi Bai, Xinying Guo, Geng Li, Jianfei Yang
MARS Lab, Nanyang Technological University, Singapore
*Equal contribution.  Corresponding author.

Overview of the Feedback-Guided Policy. During denoising, the feedback world model predicts the latent outcome of the current action trajectory, and an action-aware energy steers the policy toward expert-like, action-relevant states. After each executed action, the new observation updates the feedback state, which corrects subsequent predictions and forms an outer loop that suppresses prediction drift at inference time.

World models predict robot futures and steer policies—until the robot drifts out of distribution and the predictions go stale.

We turn each executed action into a free correction signal: the gap between the model's prediction and the robot's actual next observation is fed back online to keep the world model honest—with no extra training data, no parameter updates. Combined with action-aware guidance that focuses on the dimensions a robot can actually move, this closed-loop view forms our feedback world model.

↓ 76.4%
world-model prediction error
↑ 30%
OOD success rate

Real-World Demos

Three tasks on a physical arm, two policies side by side: Baseline diffusion policy (left) versus Ours with the feedback world model and action-aware guidance (right). The robot starts each rollout from an out-of-distribution pose; clips loop at 16× real time, shown from a top-down and a side camera.

Peach Pick-and-Place

Baseline: top-down and side views, 16× real time
Ours: top-down and side views, 16× real time

Drawer-Open

Baseline: top-down and side views, 16× real time
Ours: top-down and side views, 16× real time

Drawer-Open — Variation

Baseline: top-down and side views, 16× real time
Ours: top-down and side views, 16× real time

Closing the Loop at Inference Time

Most world-model guidance methods treat the model as a static open-loop predictor at deployment. New observations seed the next prediction but never correct the model's own state, so prediction error compounds — the longer the horizon and the further from the training distribution, the worse it gets.

Our feedback world model carries a small additional state that is refreshed after every environment step. The gap between the predicted and the actually observed transition becomes a corrective signal for subsequent predictions, with no extra training data and no parameter updates. From a control perspective the update reads as a latent-space observer; under a linear feedback formulation it admits convergence guarantees. Algorithm 1 below distils the full recipe: a one-time offline pass for action-aware weights, then a single online loop that wraps standard denoising with feedback correction.

Algorithm 1: Inference Pipeline
Require: Diffusion policy score $s_\phi$, encoder $\psi$, world model $f_\theta$, expert demos $\mathcal{D}_{\mathrm{expert}}$, feedback gain $L$, controllability strength $\beta$, guidance window $\tau_g \le T$.
Offline preprocessing
 1: $\mathcal{Z}^{E} \leftarrow \{\,\psi(O)\mid O\in\mathcal{D}_{\mathrm{expert}}\,\}$; compute action-aware weights $\{w_j^{(\beta)}\}_{j=1}^{D}$ via Eqs. (14)–(15).
Online deployment
 2: Observe $o_0$, encode $z_0 = \psi(o_0)$, initialise feedback state $\bar z_0 \leftarrow z_0$.
 3: for each timestep $t$ do
 4:     Sample $A_t^{(T)} \sim \mathcal{N}(0,\mathbf{I})$.
 5:     for denoising step $\tau = T,\dots,1$ do
 6:         if $\tau \le \tau_g$ then  ▷ guidance only on the last $\tau_g$ low-noise steps
 7:             Predict next latent with feedback-corrected dynamics: $z_{t+1}^{\mathrm{fb}}(A_t^{(\tau)})$ via Eq. (10).
 8:             Retrieve nearest expert latent $z_{i^\star}^{E}$ (Eq. 5) and form $E_{\mathrm{ctrl}}(A_t^{(\tau)})$ (Eq. 16).
 9:             $\tilde s \leftarrow s_\phi(A_t^{(\tau)},O_t,\ell,\tau) - \lambda\,\nabla_{A_t^{(\tau)}} E_{\mathrm{ctrl}}(A_t^{(\tau)})$.  ▷ Eq. (7)
10:         else
11:             $\tilde s \leftarrow s_\phi(A_t^{(\tau)},O_t,\ell,\tau)$.
12:         end if
13:         Denoise the action sequence from $A_t^{(\tau)}$ to $A_t^{(\tau-1)}$ using $\tilde s$.
14:     end for
15:     Execute the first $T_a$ actions of $A_t = A_t^{(0)}$ in $\mathcal{E}$.
16:     Advance feedback state $\bar z_{t+1} \leftarrow \bar z_t + \delta t \cdot \hat v(z_t, A_t^{(0)})$ via Eq. (11).
17:     Receive $o_{t+1}$, encode $z_{t+1} = \psi(o_{t+1})$.  ▷ form $e_{t+1} = z_{t+1} - \bar z_{t+1}$ via Eq. (12) for the next step
18: end for
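
The guided-score branch of Algorithm 1 (steps 6 to 11) can be illustrated with a small NumPy sketch. It is not the authors' implementation: the toy linear prediction `z_next = z + B @ a`, the quadratic weighted energy, the single action vector in place of an action sequence, and every function name below are assumptions made for illustration. The page does not reproduce Eqs. (5), (7), (10), or (16), so only the overall shape (retrieve the nearest expert latent, form a weighted energy, subtract its scaled gradient from the policy score inside the guidance window) is taken from the algorithm.

```python
import numpy as np

def nearest_expert(z, expert_latents):
    """Return the expert latent closest to z in Euclidean distance."""
    d2 = np.sum((expert_latents - z) ** 2, axis=1)
    return expert_latents[np.argmin(d2)]

def energy_and_grad(a, z, B, expert_latents, w):
    """Weighted energy of a toy prediction z_next = z + B @ a, and its gradient in a.

    Assumption: a linear stand-in for the world model, so the gradient is analytic.
    The retrieved expert latent is treated as a constant when differentiating.
    """
    z_next = z + B @ a
    diff = z_next - nearest_expert(z_next, expert_latents)
    energy = 0.5 * np.sum(w * diff ** 2)
    grad = B.T @ (w * diff)
    return energy, grad

def guided_score(base_score, a, tau, tau_g, lam, z, B, expert_latents, w):
    """Apply energy guidance only on the last tau_g (low-noise) denoising steps."""
    if tau > tau_g:
        return base_score
    _, grad = energy_and_grad(a, z, B, expert_latents, w)
    return base_score - lam * grad
```

Because the score is subtracted by `lam * grad`, a denoising update that follows the guided score moves the action toward lower energy, that is, toward predicted latents closer to the retrieved expert state along the heavily weighted dimensions.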

Insight. Deployment is already producing a free supervision signal: every executed action surfaces the gap between what was predicted and what actually happened. We close that loop online, turning a fixed open-loop predictor into an observer that self-corrects.
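
To make the observer reading concrete, here is a minimal sketch of a textbook Luenberger-style correction on a toy linear system. The matrices `A`, `B`, the constant model mismatch `d`, and the function name are all hypothetical, and the update shown is the standard linear observer form rather than the paper's Eqs. (10) to (12), which this page does not reproduce. Setting the gain `L` to zero recovers an open-loop predictor that rolls forward on its own predictions.

```python
import numpy as np

def rollout_errors(A, B, d, L, actions, z0):
    """Track ||z_t - zbar_t|| for a predictor with feedback gain L (L = 0 is open loop).

    True dynamics:  z_{t+1}    = A z_t + B a_t + d   (d is unknown to the model)
    Predictor:      zbar_{t+1} = A zbar_t + B a_t + L (z_t - zbar_t)
    """
    z, zbar = z0.copy(), z0.copy()
    errs = []
    for a in actions:
        e = z - zbar                      # gap between observation and feedback state
        zbar = A @ zbar + B @ a + L @ e   # model step plus feedback correction
        z = A @ z + B @ a + d             # what the environment actually does
        errs.append(np.linalg.norm(z - zbar))
    return np.array(errs)
```

In this toy setting the error obeys `e_{t+1} = (A - L) e_t + d`, so it stays bounded whenever the spectral radius of `A - L` is below one, and a larger stable gain shrinks the steady-state error. This is the kind of convergence argument a linear feedback formulation permits.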

Action-Aware Guidance

Better predictions are only useful if guidance can act on them. Action-aware guidance weights latent dimensions by how strongly each one responds to the candidate action — end-effector motion, object pose, contact moments — and downweights what the robot cannot move: background, lighting, texture.

Action-aware controllability visualization

Per-dimension action controllability is estimated once, offline, from expert rollouts. At inference, this weighting concentrates the gradient on directions the policy can actually influence, suppressing noise from action-irrelevant variation.
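
One plausible way to obtain such weights is sketched below, under stated assumptions: it fits a least-squares map from actions to latent changes and scores each latent dimension by the size of its response. The paper's actual definition lives in Eqs. (14)–(15), which are not shown on this page, so the estimator, the power-law use of the strength `beta`, the mean-one normalisation, and the function name are all hypothetical.

```python
import numpy as np

def action_aware_weights(latents, next_latents, actions, beta=1.0):
    """Hypothetical controllability weights from (z_t, a_t, z_{t+1}) triples.

    latents, next_latents: arrays of shape (N, D); actions: array of shape (N, action_dim).
    """
    dz = next_latents - latents                    # (N, D) latent changes
    # Least-squares fit dz ~ actions @ G, with G of shape (action_dim, D).
    G, *_ = np.linalg.lstsq(actions, dz, rcond=None)
    sensitivity = np.linalg.norm(G, axis=0)        # (D,) response of each latent dim
    w = sensitivity ** beta                        # beta = 0 gives uniform weights
    return w * len(w) / w.sum()                    # normalise to mean 1
```

With `beta = 0` every dimension is weighted equally; increasing `beta` concentrates the weight on dimensions that respond to actions and suppresses those that vary independently of them.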

Quantitative Results

We benchmark across four LIBERO-10 tasks from LIBERO-Plus, three representative Robomimic tasks, and two real-world manipulation tasks. Every task is perturbed in its initial robot pose to push the policy into out-of-distribution territory. Feedback correction cuts world-model prediction error by up to 76.4%, and the resulting policy lifts the average OOD success rate by 30%.

Latent prediction MSE on simulated and real-world OOD tasks

Latent prediction MSE under OOD perturbations. Base WM is the open-loop predictor; feedback correction lowers the error on every task.
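
For readers who want to reproduce this kind of comparison, the sketch below shows one common way to compute a latent prediction MSE and a relative error reduction. The helper names are hypothetical, and the page does not state the exact averaging or normalisation behind the reported numbers, so treat this as an assumed definition rather than the authors' evaluation code.

```python
import numpy as np

def latent_mse(predicted, observed):
    """Mean squared error between predicted and observed latents of equal shape."""
    predicted = np.asarray(predicted, dtype=float)
    observed = np.asarray(observed, dtype=float)
    return float(np.mean((predicted - observed) ** 2))

def relative_reduction(base_error, new_error):
    """Fraction by which new_error is lower than base_error (0.75 means 75% lower)."""
    return (base_error - new_error) / base_error
```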

Real-world OOD success-rate comparison

Real-world OOD success rate. Combining feedback correction with action-aware guidance yields the largest gains over the base diffusion policy.

Qualitative Results

Simulated tasks. Four LIBERO-Plus settings drawn from LIBERO-10, plus three Robomimic tasks — each evaluated with the robot's initial pose perturbed out of the training distribution.

Simulated task suite — LIBERO-Plus and Robomimic

Real-world deployment. Peach pick-and-place and drawer-open on a physical arm. The baseline drifts as soon as the initial pose moves; our closed-loop policy stays on task and the OOD success rate roughly doubles.

Real-world task rollouts
Real-world OOD success rate

Latent-space trajectories. Predicted and observed states are projected into the world-model latent space at every step. Without correction, the predicted trajectory peels away from the expert manifold; with feedback, it is pulled back in over time.

Latent-space rollout trajectories

Simulation OOD Scenes

To probe robustness under controlled distribution shift, we re-initialise each Robomimic task by sampling small joint-angle offsets on the robot arms; everything else (object placement, lighting, viewpoint) is held fixed. For each task, we show the in-distribution start (left) next to three perturbed OOD starts, each labelled with its joint-angle offset bound |Δ|, from two camera angles.
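
A perturbed start of this kind could be generated as in the sketch below. It assumes independent uniform offsets within a symmetric bound on each joint; the page does not state the sampling distribution, and the function name is hypothetical. Applying the sampled configuration to a simulator is left out because it depends on the environment API.

```python
import numpy as np

def perturb_initial_joints(q_nominal, max_offset_deg, rng):
    """Add an independent uniform offset within +/- max_offset_deg to each joint.

    q_nominal is in radians; returns the perturbed joints and the sampled offsets,
    both in radians.
    """
    q_nominal = np.asarray(q_nominal, dtype=float)
    bound = np.deg2rad(max_offset_deg)
    offsets = rng.uniform(-bound, bound, size=q_nominal.shape)
    return q_nominal + offsets, offsets
```

For a dual-arm task, calling the function once per arm with the same generator yields offsets that are sampled independently for the two arms.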

Square

In-Distribution: top view, front view
OOD #1 (|Δ| ≤ 11.0°): top view, front view
OOD #2 (|Δ| ≤ 17.4°): top view, front view
OOD #3 (|Δ| ≤ 11.4°): top view, front view

Top row: agentview camera  ·  Bottom row: front-view camera.

ToolHang

In-Distribution: side view, front view
OOD #1 (|Δ| ≤ 11.0°): side view, front view
OOD #2 (|Δ| ≤ 17.4°): side view, front view
OOD #3 (|Δ| ≤ 11.4°): side view, front view

Top row: side view  ·  Bottom row: front view.

Transport (dual-arm)

In-Distribution: shoulder camera, wrist camera
OOD #1 (|Δ| ≤ 17.2°): shoulder camera, wrist camera
OOD #2 (|Δ| ≤ 17.4°): shoulder camera, wrist camera
OOD #3 (|Δ| ≤ 17.9°): shoulder camera, wrist camera

Top row: shoulder camera 0  ·  Bottom row: robot-0 wrist camera. Arm-joint perturbations are sampled independently for the two arms.

BibTeX

@misc{an2026feedback,
  title         = {Feedback World Model Enables Precise Guidance of Diffusion Policy},
  author        = {An, Tuo and Jia, Jindou and Li, Gen and Li, Jingliang and
                   Zhou, Chuhao and Liu, Pengfei and Lyu, Bofan and Bai, Jiaqi and
                   Guo, Xinying and Li, Geng and Yang, Jianfei},
  year          = {2026},
  eprint        = {2605.15705},
  archivePrefix = {arXiv},
  primaryClass  = {cs.RO}
}