Multi-Modal Manipulation via Multi-Modal Policy Consensus

1University of Illinois Urbana-Champaign, 2Columbia University, 3Massachusetts Institute of Technology, 4Harvard University
* Equal contribution. Equal advising.

Retains Sparse but Important Signals

Each modality has its own expert that processes its inputs independently, preventing vision from dominating critical tactile information in contact-rich tasks

Modular Design for Incremental Learning

Train modality-specific policies independently and compose them without retraining the entire system

Robust to Corruption & Perturbations

Maintains performance under sensor corruption, occlusions, and physical perturbations during execution

Why This Approach?

Feature Concatenation (Traditional)

  • Vision dominates sparse tactile signals
  • Monolithic training: adding a sensor means retraining everything
  • Single point of failure

Policy Consensus (Ours)

  • Each expert preserves its modality's information
  • Modular—compose independently trained policies
  • Graceful degradation under sensor failures
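The consensus idea can be illustrated with a minimal sketch. This is a simplification, not the paper's exact mechanism (the actual method operates on diffusion policies); all names here are hypothetical, and consensus is reduced to a weighted combination of per-expert action predictions:

```python
import numpy as np

def expert_consensus(actions, weights):
    """Combine per-modality action predictions into one action.

    actions: dict mapping modality name -> predicted action vector
    weights: dict mapping modality name -> nonnegative reliability weight
    Each expert sees only its own modality, so sparse tactile signals
    are never drowned out by high-dimensional vision features.
    """
    total = sum(weights[m] for m in actions)
    return sum(weights[m] * actions[m] for m in actions) / total

# Hypothetical per-expert outputs for a 3-DoF end-effector delta.
actions = {
    "vision":  np.array([0.02, 0.00, -0.01]),
    "tactile": np.array([0.00, 0.01, -0.03]),
}
weights = {"vision": 1.0, "tactile": 1.0}
consensus = expert_consensus(actions, weights)
```

Because each expert is trained and queried on its own modality, downweighting or dropping one expert degrades the combined action gracefully instead of breaking a shared feature encoder.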

What You Gain

⏱️ Faster Iteration

Add new sensors without retraining from scratch, saving days of compute time

🎯 Better Performance

Significantly outperforms feature concatenation baselines on multimodal manipulation tasks

🛡️ Real-World Robustness

Continues working under sensor corruption and environmental perturbations

Audio Summary

Prefer to listen? Hear a summary of our paper.

Media Coverage & Demos

Teaser Video

Technical Video

Is Feature Concatenation the Policy Bottleneck?

Feature concatenation baseline vs. factorized MoE fusion vs. ours

Modality Importance Analysis

Perturbation-based analysis reveals dynamic shifts between modalities across task stages
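A perturbation-based importance measure can be sketched as follows. This is a generic illustration rather than the paper's exact protocol; `policy`, the noise scheme, and `toy_policy` are all hypothetical:

```python
import numpy as np

def modality_importance(policy, obs, modality, n_trials=8, noise=0.5, seed=0):
    """Estimate how much a policy relies on one modality by corrupting it.

    Importance = mean change in the predicted action when the given
    modality's observation is perturbed with noise, everything else fixed.
    """
    rng = np.random.default_rng(seed)
    base = policy(obs)
    deltas = []
    for _ in range(n_trials):
        corrupted = dict(obs)
        corrupted[modality] = obs[modality] + noise * rng.standard_normal(obs[modality].shape)
        deltas.append(np.linalg.norm(policy(corrupted) - base))
    return float(np.mean(deltas))

# Toy policy: relies twice as much on tactile as on vision.
def toy_policy(obs):
    return 1.0 * obs["vision"] + 2.0 * obs["tactile"]

obs = {"vision": np.zeros(3), "tactile": np.zeros(3)}
iv = modality_importance(toy_policy, obs, "vision")
it = modality_importance(toy_policy, obs, "tactile")
# For this toy policy, tactile importance exceeds vision importance.
```

Running such a probe at different stages of a task reveals which modality the policy is leaning on at each stage, e.g. vision during reaching and touch during insertion.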

Policy Adaptiveness Under Perturbations

Our policy maintains performance under runtime perturbations, object repositioning, and sensor corruptions

Runtime Perturbation
Object Repositioning
Sensor Corruption
Puzzle Perturbation
Repositioning + Sensor Corruption

Modular Policy Composition

Independently trained policies can be composed without retraining, enabling incremental integration
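Composition at inference time can be sketched as summing per-expert score estimates inside a shared denoising loop, under the assumption that each expert is a diffusion-style policy whose scores add like log-probability gradients. This is a simplified illustration with hypothetical experts, not the paper's implementation:

```python
import numpy as np

def compose_and_denoise(experts, shape, steps=50, step_size=0.1, seed=0):
    """Sample an action by iterating with the sum of per-expert scores.

    Each expert maps (action, t) -> score estimate. Because scores are
    additive, independently trained policies can be combined at
    inference time without any joint retraining.
    """
    rng = np.random.default_rng(seed)
    a = rng.standard_normal(shape)          # start from noise
    for t in range(steps, 0, -1):
        score = sum(e(a, t) for e in experts)
        a = a + step_size * score           # simple gradient-ascent update
    return a

# Hypothetical linear experts: each pulls the action toward its own target.
rgb_expert     = lambda a, t: (np.array([0.5, 0.0]) - a)
tactile_expert = lambda a, t: (np.array([0.5, 1.0]) - a)

action = compose_and_denoise([rgb_expert, tactile_expert], shape=(2,))
# Converges near the average of the two targets, [0.5, 0.5].
```

Adding a new sensor then amounts to training one new expert and appending it to the `experts` list, rather than retraining a monolithic fused policy.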

RGB Only → Task Failure
Compose RGB and Tactile → Task Success

Limitations and Failure Cases

Occasional failures under extreme sensor corruptions

Gets stuck in the bag
Fails to place the spoon in the target location

Explore Our Work

Access our code, dataset, and documentation.

BibTeX

@misc{chen2025multimodalmanipulationmultimodalpolicy,
  title={Multi-Modal Manipulation via Multi-Modal Policy Consensus},
  author={Haonan Chen and Jiaming Xu and Hongyu Chen and Kaiwen Hong and Binghao Huang and Chaoqi Liu and Jiayuan Mao and Yunzhu Li and Yilun Du and Katherine Driggs-Campbell},
  year={2025},
  eprint={2509.23468},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2509.23468}
}