CompassAD

Intent-Driven 3D Affordance Grounding in Functionally Competing Objects

Jingliang Li, Jindou Jia, Tuo An, Chuhao Zhou,
Xiangyu Chen, Shilin Shan, Boyu Ma, Bofan Lyu, Gen Li, Jianfei Yang

MARS Lab, Nanyang Technological University, Singapore  ·  Corresponding Authors

Given a multi-object point cloud and a natural-language query describing an intended action, the goal is to predict a per-point affordance mask on the correct object. The same composition yields different targets depending on the query intent, and the model abstains when no object can fulfill it.

When told to "cut the cake," a robot must choose the knife over nearby scissors, even though both objects afford cutting.

№ 01 · Abstract

In real-world scenes, multiple objects may share identical affordances, yet only one is appropriate under the given task context. We call such cases confusing pairs. Existing 3D affordance methods largely sidestep this challenge by evaluating isolated single objects, often with explicit category names provided in the query.

We formalize Intent-Driven Confusable Affordance Grounding, a 3D setting that requires predicting a per-point affordance mask on the correct object within a multi-object point cloud, conditioned on implicit natural-language intent. To study this problem, we construct CompassAD, the first benchmark centered on implicit intent in confusing multi-object compositions.

We further propose CompassNet, a framework with two dedicated modules. Instance-bounded Cross Injection (ICI) constrains language–geometry alignment within object boundaries to prevent cross-object semantic leakage. Bi-Level Contrastive Refinement (BCR) enforces discrimination at both geometric-group and point levels, sharpening distinctions between target and confusable surfaces. Experiments demonstrate state-of-the-art results on both seen and unseen queries, and deployment on a robotic manipulator confirms effective real-world transfer.

30 confusing pairs · 16 affordance types · 6,422 multi-object compositions · 87,964 intent-driven queries · +4.02 aIoU gain vs. SOTA (Test-Seen) · +0.072 SIM gain vs. SOTA (Test-Seen)

№ 02 · The Task

Kitchens, workshops, and offices present multiple objects in close proximity. A knife and scissors both afford cutting. A skateboard and surfboard both afford riding. Shared affordance renders the target ambiguous from appearance alone — correct selection must be conditioned on task intent.

Given a multi-object point cloud and a natural-language query describing an intended action, our task is to predict a per-point affordance mask on the correct object, disambiguate confusable alternatives, and abstain when no object can fulfill the intent.

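Formulation
$\hat{y} = f_\theta(\mathcal{P}, \mathcal{Q}) \in [0,1]^N$
Point cloud $\mathcal{P}$ + query $\mathcal{Q}$ → per-point affordance probability $\hat{y}$ over the $N$ points.
Two queries $\mathcal{Q}_a \neq \mathcal{Q}_b$ over the same $\mathcal{P}$ yield different masks.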
Task teaser: same composition, different intents activate different objects
For a scissors–knife pair on a table: "prepare vegetable slices" activates the knife blade; "cut paper" selects the scissors. For "play music," no activation occurs.
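
One way to read a prediction at deployment time is to threshold the per-point probabilities and pick the instance that holds most of the activated points, abstaining when too few points fire. The sketch below is illustrative only: `select_target`, the 0.5 threshold, the `min_points` cutoff, and the availability of per-point instance labels are assumptions, not part of the method described on this page.

```python
from collections import Counter
from typing import Optional, Sequence


def select_target(
    probs: Sequence[float],
    instance_ids: Sequence[int],
    threshold: float = 0.5,  # assumed cutoff, not taken from the paper
    min_points: int = 3,     # assumed minimum support before committing
) -> Optional[int]:
    """Return the instance id holding most activated points, or None to abstain."""
    if len(probs) != len(instance_ids):
        raise ValueError("probs and instance_ids must have the same length")
    # Count, per instance, the points whose probability clears the threshold.
    votes = Counter(i for p, i in zip(probs, instance_ids) if p >= threshold)
    if not votes:
        return None
    best_id, count = votes.most_common(1)[0]
    return best_id if count >= min_points else None


# Toy composition: instance 0 = knife (points 0-3), instance 1 = scissors (points 4-7).
ids = [0, 0, 0, 0, 1, 1, 1, 1]
cut_vegetables = [0.9, 0.8, 0.7, 0.1, 0.2, 0.1, 0.0, 0.1]
cut_paper = [0.1, 0.0, 0.2, 0.1, 0.9, 0.9, 0.8, 0.6]
play_music = [0.05] * 8

print(select_target(cut_vegetables, ids))  # 0
print(select_target(cut_paper, ids))       # 1
print(select_target(play_music, ids))      # None
```

The same point cloud yields three different outcomes purely because the probabilities change with the query, which is the behavior the formulation asks for.
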
№ 03 · Method

CompassNet.

Instance-Bounded Cross Injection × Bi-Level Contrastive Refinement.

CompassNet architecture: instance-bounded grouping, region-language cross-attention with background token, gated propagation, and bi-level contrastive refinement
ICI confines region–language interaction within each instance via (i) instance-bounded grouping, (ii) region-language cross-attention with a learnable background token, and (iii) gated propagation back to points. BCR adds two training-only contrastive losses: TG-Softmax ranks the in-object region best matching the intent, and TP-HardNeg suppresses high-scoring negatives on confusable surfaces.
ICI Instance-Bounded Cross Injection

Cross-modal fusion is computed independently within each instance, so text cues cannot diffuse to competing surfaces.
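
A minimal sketch of the idea, assuming single-head dot-product attention over toy 2-D features. The function names, the feature values, and the choice to let the background token absorb attention mass without contributing a value are all assumptions made for illustration; the actual layer is a learned network. In this stripped-down form the sketch shows only two things: each instance is processed in isolation, and a region that matches no text token sends its attention to the background sink.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def _softmax(scores: Sequence[float]) -> List[float]:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_with_background(
    region: Vector, text_tokens: Sequence[Vector], background: Vector
) -> Tuple[List[float], float]:
    """Return (language-enhanced feature, attention mass on the background token)."""
    keys = list(text_tokens) + [background]
    scale = math.sqrt(len(region))
    weights = _softmax([sum(a * b for a, b in zip(region, k)) / scale for k in keys])
    # Only real text tokens contribute values; the background token is a pure
    # sink here (an assumption of this sketch). zip stops before its weight.
    fused = [
        sum(w * tok[c] for w, tok in zip(weights, text_tokens))
        for c in range(len(region))
    ]
    return fused, weights[-1]


def instance_bounded_injection(
    regions_by_instance: Dict[int, List[Vector]],
    text_tokens: Sequence[Vector],
    background: Vector,
) -> Dict[int, List[Tuple[List[float], float]]]:
    """Run the attention separately per instance; nothing is shared across instances."""
    return {
        inst: [attend_with_background(r, text_tokens, background) for r in regions]
        for inst, regions in regions_by_instance.items()
    }


text_tokens = [[1.0, 0.0]]  # toy embedding of a one-token query
background = [0.0, 1.0]     # learnable in a real model; fixed here
regions = {0: [[4.0, 0.0]], 1: [[0.0, 4.0]]}

result = instance_bounded_injection(regions, text_tokens, background)
bg_weight_match = result[0][0][1]  # region aligned with the query
bg_weight_other = result[1][0][1]  # region the query cannot explain
print(round(bg_weight_match, 3), round(bg_weight_other, 3))
```
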

  1. Step 1

    Instance-bounded grouping

    Radius-graph connected components assign each point an instance label; FPS selects region centers within each instance only.

  2. Step 2

    Region–language cross-attention

    Per-instance regions attend to text tokens augmented with a learnable background token — a sink for content no token can explain, enabling correct abstention.

  3. Step 3

    Gated region-to-point propagation

    Each point inverse-distance-pools from its kp=3 nearest enhanced regions; a multiplicative gate routes language back to points without injecting activations from scratch.
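
Steps 1 and 3 are mostly geometry, so they can be sketched without a deep-learning framework. The code below is a hedged illustration: the function names, the O(N²) neighbor search, the deterministic FPS seed, and the sigmoid form of the gate are assumptions. Only the radius-graph grouping, per-instance FPS, inverse-distance pooling over kp = 3 regions, and the multiplicative gate come from the description above.

```python
import math
from collections import deque
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def radius_components(points: Sequence[Point], radius: float) -> List[int]:
    """Step 1a: label points by connected component of the radius graph."""
    labels = [-1] * len(points)
    current = 0
    for seed in range(len(points)):
        if labels[seed] != -1:
            continue
        labels[seed] = current
        queue = deque([seed])
        while queue:  # breadth-first flood fill over the radius graph
            i = queue.popleft()
            for j in range(len(points)):
                if labels[j] == -1 and math.dist(points[i], points[j]) <= radius:
                    labels[j] = current
                    queue.append(j)
        current += 1
    return labels


def fps_within_instance(points, labels, instance: int, k: int) -> List[int]:
    """Step 1b: farthest point sampling restricted to one instance."""
    members = [i for i, lab in enumerate(labels) if lab == instance]
    if not members or k <= 0:
        return []
    chosen = [members[0]]  # deterministic seed point (assumption; often random)
    min_dist = {i: math.dist(points[i], points[chosen[0]]) for i in members}
    while len(chosen) < min(k, len(members)):
        nxt = max(members, key=lambda i: min_dist[i])
        chosen.append(nxt)
        for i in members:
            min_dist[i] = min(min_dist[i], math.dist(points[i], points[nxt]))
    return chosen


def _sigmoid(v: float) -> float:
    # Numerically stable on both sides of zero.
    return 1.0 / (1.0 + math.exp(-v)) if v >= 0 else math.exp(v) / (1.0 + math.exp(v))


def gated_propagate(point_xyz, point_feat, region_xyz, region_feat, kp=3, eps=1e-8):
    """Step 3: inverse-distance pool the kp nearest regions, then gate the point feature."""
    dists = [math.dist(point_xyz, c) for c in region_xyz]
    nearest = sorted(range(len(region_xyz)), key=lambda r: dists[r])[:kp]
    weights = [1.0 / (dists[r] + eps) for r in nearest]
    total = sum(weights)
    pooled = [
        sum(w * region_feat[r][c] for w, r in zip(weights, nearest)) / total
        for c in range(len(point_feat))
    ]
    # Multiplicative gate: a zero point feature stays zero.
    return [_sigmoid(v) * f for v, f in zip(pooled, point_feat)]


points = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (1.0, 0.0, 0.0),
          (10.0, 0.0, 0.0), (10.5, 0.0, 0.0)]
labels = radius_components(points, radius=0.6)
centers = fps_within_instance(points, labels, instance=0, k=2)
region_xyz = [points[i] for i in centers]
region_feat = [[2.0, -1.0], [0.5, 0.5]]  # made-up language-enhanced region features
out = gated_propagate(points[1], [0.0, 2.0], region_xyz, region_feat)
print(labels, centers, out)
```

Because the gate multiplies rather than adds, the first channel of `out` remains exactly zero: language can rescale what the point encoder already produced but cannot create an activation on its own.
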

BCR Bi-Level Contrastive Refinement

Two complementary contrastive losses sharpen object selection and point-level discrimination. Both are training-only — zero inference overhead.

  1. Region · TG-Softmax

    Rank in-object regions by intent

    Soft-label softmax ranks regions by affinity to the query; rewards regions whose points are most densely affordance-positive.

  2. Point · TP-HardNeg

    Suppress confident false positives

    Smooth-max hard-negative mining enforces a margin between affordance points and the hardest negative on the wrong-object surface or same-object non-affordance region.
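
The two losses can be sketched on plain Python lists. This is an interpretation under stated assumptions, not the paper's exact formulation: the temperature, margin, and sharpness values, the use of positive-point density as the soft label, the mean over positive scores, and the hinge form are all illustrative choices.

```python
import math
from typing import Sequence


def tg_softmax_loss(
    affinity: Sequence[float],          # region-to-query affinity, one per region
    positive_density: Sequence[float],  # fraction of affordance-positive points per region
    temperature: float = 0.1,           # assumed value
) -> float:
    """Soft-label cross-entropy: regions dense in positives should rank highest."""
    total = sum(positive_density)
    if total == 0:
        return 0.0  # no positive region (e.g. an abstention query); skipped here
    target = [d / total for d in positive_density]
    logits = [a / temperature for a in affinity]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return -sum(t * (l - log_z) for t, l in zip(target, logits))


def smooth_max(values: Sequence[float], beta: float = 10.0) -> float:
    """Log-sum-exp approximation of max; larger beta is closer to the hard max."""
    m = max(values)
    return m + math.log(sum(math.exp(beta * (v - m)) for v in values)) / beta


def tp_hardneg_loss(
    pos_scores: Sequence[float],
    neg_scores: Sequence[float],
    margin: float = 0.3,  # assumed value
    beta: float = 10.0,   # assumed value
) -> float:
    """Hinge on the gap between mean positive score and the smooth-max negative."""
    if not pos_scores or not neg_scores:
        return 0.0
    mean_pos = sum(pos_scores) / len(pos_scores)
    return max(0.0, margin + smooth_max(neg_scores, beta) - mean_pos)


density = [0.8, 0.1, 0.0]
loss_aligned = tg_softmax_loss([0.9, 0.1, 0.0], density)
loss_misaligned = tg_softmax_loss([0.0, 0.1, 0.9], density)
loss_clean = tp_hardneg_loss([0.9, 0.8], [0.1, 0.2])
loss_confused = tp_hardneg_loss([0.9, 0.8], [0.1, 0.95])
print(loss_aligned < loss_misaligned, loss_clean, loss_confused > 0)
```

A single high-scoring point on the wrong object is enough to make the point-level loss positive, since the smooth-max is dominated by the hardest negative rather than the average one.
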

No-Cost at Inference

BCR backpropagates through ICI's representations during training but is dropped at inference. Zero added parameters or computation in the deployed model.

№ 04 · Dataset

A Benchmark of Controlled Confusion.

30 confusing pairs · 16 affordance types · 6,422 compositions · 87,964 intent-driven queries.

Overview of the CompassAD benchmark: affordance distribution, category distribution, hierarchy of confusing pairs, source breakdown, and affordance–object confusion matrix
Overview of the CompassAD benchmark. (a) Affordance concept distribution. (b) Object category distribution. (c) Hierarchy of confusing pairs grouped by target affordance type. (d) Source breakdown of the collected 3D object instances. (e) Confusion matrix between affordance and object categories.
01 · Curation

Pairs by design

30 confusing pairs across 16 affordance types, sourced from synthetic CAD models and real-world scans. Each pair shares at least one target affordance and is functionally interchangeable.

02 · Compositions

Cluttered by construction

Each composition pairs one confusing pair with up to two distractor objects (2–4 instances total). Randomized placement and permuted slots prevent positional shortcuts.
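
As a rough illustration of the recipe, the sketch below draws zero to two distractors, shuffles the slot order, and assigns random planar offsets. Everything beyond those three ingredients is assumed: the function name, the uniform placement range, the object names, and the absence of collision checks are hypothetical and do not describe the actual CompassAD pipeline.

```python
import random
from typing import List, Sequence, Tuple

Placement = Tuple[str, Tuple[float, float]]


def build_composition(
    pair: Tuple[str, str],
    distractor_pool: Sequence[str],
    rng: random.Random,
    extent: float = 1.0,  # assumed half-width of the placement area
) -> List[Placement]:
    """Return 2-4 named objects with randomized slot order and planar offsets."""
    n_distractors = rng.randint(0, min(2, len(distractor_pool)))
    objects = list(pair) + rng.sample(list(distractor_pool), n_distractors)
    rng.shuffle(objects)  # permuted slots: the target has no fixed position
    return [
        (name, (rng.uniform(-extent, extent), rng.uniform(-extent, extent)))
        for name in objects
    ]


rng = random.Random(0)
scene = build_composition(("knife", "scissors"), ["cup", "hammer", "kettle"], rng)
for name, (x, y) in scene:
    print(f"{name:10s} x={x:+.2f} y={y:+.2f}")
```
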

03 · Queries

Intent-only language

GPT-generated intent-driven queries with no object names. Includes positive, negative (for abstention), and unseen splits to evaluate language generalization.

№ 05 · Results

State-of-the-art on seen and unseen splits.

+4.02 aIoU on Test-Seen · +3.54 aIoU on Test-Unseen vs. prior SOTA.

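| Method | Venue | aIoU ↑ | AUC ↑ | SIM ↑ | MAE ↓ |
| --- | --- | --- | --- | --- | --- |
| Test-Seen | | | | | |
| 3D-SPS | CVPR 2022 | 5.23 | 64.7 | 0.158 | 0.096 |
| ReferTrans | NeurIPS 2021 | 5.81 | 66.3 | 0.171 | 0.093 |
| ReLA | CVPR 2023 | 6.47 | 68.9 | 0.193 | 0.091 |
| IAGNet | ICCV 2023 | 7.64 | 72.4 | 0.214 | 0.086 |
| GREAT | CVPR 2025 | 9.23 | 75.7 | 0.237 | 0.082 |
| PointRefer | CVPR 2024 | 10.52 | 79.3 | 0.260 | 0.079 |
| GLANCE | ICCV 2025 | 14.18 | 87.5 | 0.296 | 0.077 |
| CompassNet | Ours | 18.20 | 89.2 | 0.368 | 0.061 |
| Test-Unseen | | | | | |
| 3D-SPS | CVPR 2022 | 4.12 | 61.8 | 0.136 | 0.099 |
| ReferTrans | NeurIPS 2021 | 4.58 | 63.5 | 0.148 | 0.096 |
| ReLA | CVPR 2023 | 5.06 | 65.7 | 0.167 | 0.094 |
| IAGNet | ICCV 2023 | 6.07 | 68.4 | 0.186 | 0.089 |
| GREAT | CVPR 2025 | 7.31 | 72.0 | 0.209 | 0.085 |
| PointRefer | CVPR 2024 | 8.47 | 75.8 | 0.232 | 0.082 |
| GLANCE | ICCV 2025 | 11.82 | 85.2 | 0.268 | 0.075 |
| CompassNet | Ours | 15.36 | 87.4 | 0.339 | 0.059 |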

All methods are trained and evaluated under identical settings. CompassNet is best and GLANCE second-best on every metric in both splits. aIoU is the primary metric.
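
This page does not define the metrics, so the sketch below follows definitions that are common in the 3D affordance literature and may differ from the paper's evaluation code. Assumed here: aIoU averages IoU over prediction thresholds 0.00 to 0.99 against a binary ground truth, SIM is the histogram intersection of the two sum-normalized maps, and MAE is the mean per-point absolute error. AUC is omitted for brevity. The function names and the empty-mask conventions are illustrative, and values are on a 0 to 1 scale rather than the 0 to 100 scale the aIoU column appears to use.

```python
from typing import Sequence


def aiou(pred: Sequence[float], gt: Sequence[float], steps: int = 100) -> float:
    """Mean IoU over thresholds 0/steps, 1/steps, ... (assumed sweep); gt is binary."""
    total = 0.0
    for s in range(steps):
        t = s / steps
        inter = sum(1 for p, g in zip(pred, gt) if p > t and g > 0.5)
        union = sum(1 for p, g in zip(pred, gt) if p > t or g > 0.5)
        # Both masks empty counts as a match (assumed convention for abstention).
        total += inter / union if union else 1.0
    return total / steps


def sim(pred: Sequence[float], gt: Sequence[float]) -> float:
    """Histogram intersection of the two maps after normalizing each to sum 1."""
    sp, sg = sum(pred), sum(gt)
    if sp == 0 or sg == 0:
        return 0.0  # assumed convention when either map is empty
    return sum(min(p / sp, g / sg) for p, g in zip(pred, gt))


def mae(pred: Sequence[float], gt: Sequence[float]) -> float:
    """Mean absolute error between predicted and ground-truth per-point scores."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(pred)


gt = [1.0, 1.0, 0.0, 0.0]
perfect = [1.0, 1.0, 0.0, 0.0]
wrong_object = [0.0, 0.0, 1.0, 1.0]

for name, pred in [("perfect", perfect), ("wrong object", wrong_object)]:
    print(name, aiou(pred, gt), sim(pred, gt), mae(pred, gt))
```

Activating the wrong object of a confusing pair scores zero on aIoU and SIM even when the predicted region is a plausible affordance part, which is why these metrics are sensitive to object selection and not only to mask shape.
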

Qualitative Comparison

Same composition, different intents → different activated objects and regions.

Qualitative comparison: each triplet shows GT, CompassNet, and GLANCE on various confusing compositions
Each triplet shows ground truth (GT), CompassNet (Ours), and GLANCE (SOTA). Left: same composition queried with different intents activates different objects/regions (chair seat vs. bed surface). Right: diverse confusing pairs (knife/scissors, skateboard/surfboard, kettle/cup). Red denotes higher affordance probability.
Ablation Study

Removing ICI drops aIoU by 2.96 (18.20 → 15.24); removing BCR drops it by 1.50 (18.20 → 16.70). Every sub-component contributes.

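| Configuration | aIoU ↑ | AUC ↑ | SIM ↑ | MAE ↓ |
| --- | --- | --- | --- | --- |
| CompassNet (Full) | 18.20 | 89.2 | 0.368 | 0.061 |
| Baseline (w/o ICI & BCR) | 13.80 | 87.2 | 0.290 | 0.079 |
| Ablation on ICI | | | | |
| w/o ICI | 15.24 | 87.8 | 0.311 | 0.073 |
| w/o Background token | 17.58 | 88.9 | 0.360 | 0.064 |
| w/o Group relevance loss | 17.26 | 88.6 | 0.352 | 0.066 |
| w/o Gated propagation | 16.93 | 88.5 | 0.343 | 0.068 |
| Ablation on BCR | | | | |
| w/o BCR | 16.70 | 88.4 | 0.353 | 0.066 |
| w/o TG-Softmax | 17.28 | 88.7 | 0.357 | 0.065 |
| w/o TP-HardNeg | 17.62 | 88.9 | 0.362 | 0.063 |
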
№ 06 · Robotics

From Benchmark to Robot Arm.

Real-world robotic grasping in confusing multi-object compositions
Each row shows the captured scene, CompassNet's affordance prediction on the reconstructed point cloud (red = high probability), and the executed grasp. Top: a cutting query selects the knife over scissors. Bottom: a hammering query selects the hammer over distractors.
№ 07 · Citation

Cite this work.

If you find CompassAD useful in your research, please consider citing.

@article{Li2026CompassAD,
  title          = {CompassAD: Intent-Driven 3D Affordance Grounding in Functionally Competing Objects},
  author         = {Li, Jingliang and Jia, Jindou and An, Tuo and Zhou, Chuhao and
                    Chen, Xiangyu and Shan, Shilin and Ma, Boyu and Lyu, Bofan and
                    Li, Gen and Yang, Jianfei},
  year           = {2026},
  eprint         = {2604.02060},
  archivePrefix  = {arXiv},
  primaryClass   = {cs.CV}
}