Multi-Modal Manipulation via Multi-Modal Policy Consensus

1University of Illinois Urbana-Champaign, 2Columbia University, 3Massachusetts Institute of Technology, 4Harvard University
* Equal contribution. Equal advising.

Teaser Video

Technical Video

Abstract

Effectively integrating diverse sensory representations is crucial for robust robotic manipulation. However, the typical approach of feature concatenation is often suboptimal: dominant modalities such as vision can overwhelm sparse but critical signals like touch in contact-rich tasks, and monolithic architectures cannot flexibly incorporate new or missing modalities without retraining. Our method instead factorizes the policy into a set of diffusion models, each specialized for a single representation (e.g., vision or touch), and employs a router network that learns consensus weights to adaptively combine their contributions, enabling incremental integration of new representations. We evaluate our approach on real-world tasks such as occluded object picking, in-hand spoon reorientation, and puzzle insertion, as well as simulated manipulation tasks in RLBench, where it significantly outperforms feature-concatenation baselines on scenarios requiring multimodal reasoning. Our policy also demonstrates robustness to physical perturbations and sensor corruption. Finally, a perturbation-based importance analysis reveals adaptive shifts between modalities, for example shifting from vision toward multimodal reliance when entering occluded spaces.
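
To make the consensus mechanism concrete, the sketch below is a rough illustration (not the released implementation): per-modality denoisers each predict the noise on the action trajectory from their own observation features, and a router turns all observation features into softmax consensus weights that combine the experts' predictions. The class names ModalityExpert and ConsensusRouter, the MLP architectures, and the feature dimensions are assumptions made for illustration.

# Minimal sketch (assumption, not the authors' code) of consensus-weighted
# fusion of per-modality diffusion policies.
import torch
import torch.nn as nn

class ModalityExpert(nn.Module):
    """Hypothetical per-modality denoiser: obs feature + noisy action + timestep -> noise."""
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs_feat, noisy_action, t):
        x = torch.cat([obs_feat, noisy_action, t], dim=-1)
        return self.net(x)

class ConsensusRouter(nn.Module):
    """Hypothetical router: maps all observation features to softmax weights over experts."""
    def __init__(self, total_obs_dim, num_experts, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(total_obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_experts),
        )

    def forward(self, obs_feats):
        return torch.softmax(self.net(torch.cat(obs_feats, dim=-1)), dim=-1)

def consensus_noise(experts, router, obs_feats, noisy_action, t):
    """One denoising step: weighted combination of per-expert noise predictions."""
    weights = router(obs_feats)                           # (B, num_experts)
    preds = torch.stack(
        [e(f, noisy_action, t) for e, f in zip(experts, obs_feats)], dim=1
    )                                                     # (B, num_experts, act_dim)
    return (weights.unsqueeze(-1) * preds).sum(dim=1)     # (B, act_dim)

# Toy usage: a vision expert and a tactile expert on random features.
if __name__ == "__main__":
    B, act_dim = 4, 7
    vision = ModalityExpert(obs_dim=64, act_dim=act_dim)
    tactile = ModalityExpert(obs_dim=16, act_dim=act_dim)
    router = ConsensusRouter(total_obs_dim=64 + 16, num_experts=2)
    feats = [torch.randn(B, 64), torch.randn(B, 16)]
    eps = consensus_noise([vision, tactile], router,
                          feats, torch.randn(B, act_dim), torch.rand(B, 1))
    print(eps.shape)  # torch.Size([4, 7])

Because each expert conditions only on its own modality, a new representation can in principle be added by training one more expert and re-learning (or extending) the router, rather than retraining a monolithic policy.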

Is Feature Concatenation the Policy Bottleneck?

Feature concatenation baseline vs. factorized MoE fusion vs. ours

Modality Importance Analysis

Perturbation-based analysis reveals dynamic shifts between modalities across task stages
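
A minimal sketch of how such a perturbation-based importance score can be computed: corrupt one modality's features at a time and measure how far the predicted action moves. The Gaussian corruption, the policy_fn interface, and the normalization below are assumptions for illustration, not necessarily the exact procedure used in the paper.

# Sketch (assumption) of perturbation-based modality importance:
# corrupt one modality's features at a time and measure the shift in the
# policy's predicted action. Larger shifts indicate heavier reliance.
import torch

def modality_importance(policy_fn, obs_feats, noise_scale=1.0, n_samples=8):
    """obs_feats: list of per-modality feature tensors; policy_fn maps that list to actions."""
    base_action = policy_fn(obs_feats)
    scores = []
    for i in range(len(obs_feats)):
        shifts = []
        for _ in range(n_samples):
            perturbed = list(obs_feats)
            perturbed[i] = obs_feats[i] + noise_scale * torch.randn_like(obs_feats[i])
            shifts.append((policy_fn(perturbed) - base_action).norm(dim=-1).mean())
        scores.append(torch.stack(shifts).mean())
    scores = torch.stack(scores)
    return scores / scores.sum()  # normalized importance per modality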

Policy Adaptiveness Under Perturbations

Our policy maintains performance under runtime perturbations, object repositioning, and sensor corruptions

Runtime Perturbation
Object Repositioning
Sensor Corruption
Puzzle Perturbation
Repositioning + Sensor Corruption

Modular Policy Composition

Independently trained single-modality policies can be composed without retraining, enabling incremental integration of new representations; a minimal sketch of the composition follows the examples below.

RGB Only → Task Failure
Composing RGB and Tactile → Task Success
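
Under assumptions about the policy interface, inference-time composition can be sketched as combining the noise predictions of independently trained diffusion policies at each denoising step, with no retraining of either policy; the uniform weighting and the function names below are hypothetical, and in practice the learned router can supply the weights instead.

# Sketch (assumption) of composing independently trained diffusion policies
# at inference time: mix their noise predictions at each denoising step.
import torch

@torch.no_grad()
def composed_denoise_step(policies, obs_feats, noisy_action, t, weights=None):
    """policies[i] takes (obs_feats[i], noisy_action, t) and returns a noise prediction."""
    preds = [p(f, noisy_action, t) for p, f in zip(policies, obs_feats)]
    if weights is None:
        weights = [1.0 / len(preds)] * len(preds)  # uniform composition
    return sum(w * e for w, e in zip(weights, preds))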

Limitations and Failure Cases

Occasional failures under extreme sensor corruptions

Gets stuck in the bag
Fails to place the spoon at the target location

BibTeX

@misc{chen2025multimodalmanipulationmultimodalpolicy,
  title={Multi-Modal Manipulation via Multi-Modal Policy Consensus},
  author={Haonan Chen and Jiaming Xu and Hongyu Chen and Kaiwen Hong and Binghao Huang and Chaoqi Liu and Jiayuan Mao and Yunzhu Li and Yilun Du and Katherine Driggs-Campbell},
  year={2025},
  eprint={2509.23468},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2509.23468},
}