Effectively integrating diverse sensory representations is crucial for robust robotic manipulation. However, the typical approach of feature concatenation is often suboptimal: dominant modalities such as vision can overwhelm sparse but critical signals like touch in contact-rich tasks, and monolithic architectures cannot flexibly incorporate new or missing modalities without retraining. Our method factorizes the policy into a set of diffusion models, each specialized for a single representation (e.g., vision or touch), and employs a router network that learns consensus weights to adaptively combine their contributions, enabling incremental integration of new representations. We evaluate our approach on real-world tasks such as occluded object picking, in-hand spoon reorientation, and puzzle insertion, as well as simulated manipulation tasks in RLBench, where it significantly outperforms feature-concatenation baselines in scenarios requiring multimodal reasoning. Our policy also remains robust to physical perturbations and sensor corruption. Finally, a perturbation-based importance analysis reveals adaptive shifts between modalities, for example a transition from vision-dominant to multimodal reliance when entering occluded space.
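To make the factorized design concrete, below is a minimal sketch of the routed fusion, assuming per-representation feature encoders already exist. Module names (`ModalityDenoiser`, `ConsensusRouter`, `fused_noise_prediction`), dimensions, and activation choices are illustrative assumptions, not the released implementation.

```python
# Minimal sketch (PyTorch): per-representation diffusion experts combined by a
# learned router. All names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class ModalityDenoiser(nn.Module):
    """Noise-prediction network conditioned on a single representation."""

    def __init__(self, act_dim: int, feat_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(act_dim + feat_dim + 1, hidden), nn.Mish(),
            nn.Linear(hidden, hidden), nn.Mish(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, noisy_action, feat, t):
        # t is a normalized diffusion timestep of shape (B,).
        x = torch.cat([noisy_action, feat, t[:, None]], dim=-1)
        return self.net(x)  # predicted noise for this modality


class ConsensusRouter(nn.Module):
    """Maps concatenated modality features to softmax consensus weights."""

    def __init__(self, feat_dims: list[int], hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(sum(feat_dims), hidden), nn.Mish(),
            nn.Linear(hidden, len(feat_dims)),
        )

    def forward(self, feats: list[torch.Tensor]) -> torch.Tensor:
        return torch.softmax(self.net(torch.cat(feats, dim=-1)), dim=-1)


def fused_noise_prediction(experts, router, noisy_action, feats, t):
    """Weighted combination of per-expert noise predictions."""
    weights = router(feats)                           # (B, num_experts)
    eps = torch.stack([e(noisy_action, f, t)          # (num_experts, B, act_dim)
                       for e, f in zip(experts, feats)])
    return (weights.T[..., None] * eps).sum(dim=0)    # (B, act_dim)
```

Because each expert conditions only on its own representation, adding a new representation amounts to training one additional expert and extending the router, without touching the existing experts.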
Feature-concatenation baseline vs. factorized MoE fusion vs. ours
Perturbation-based analysis reveals dynamic shifts between modalities across task stages (see the importance-analysis sketch below)
Our policy maintains performance under runtime perturbations, object repositioning, and sensor corruptions
Independently trained policies can be composed without retraining, enabling incremental integration (see the composition sketch below)
Occasional failures under extreme sensor corruptions
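A hedged sketch of the perturbation-based importance analysis referenced above: each modality's features are corrupted in turn (here, zeroed), and the resulting deviation of the fused prediction is taken as that modality's importance. It reuses the hypothetical `fused_noise_prediction` helper from the earlier sketch; the zero-ablation choice is an assumption, not necessarily the corruption used in the paper.

```python
# Sketch: perturbation-based modality importance (builds on the fusion sketch above).
import torch


@torch.no_grad()
def modality_importance(experts, router, noisy_action, feats, t):
    """Return normalized importance scores, one per modality."""
    baseline = fused_noise_prediction(experts, router, noisy_action, feats, t)
    deviations = []
    for i in range(len(feats)):
        # Ablate one modality by zeroing its features (other corruptions work too).
        perturbed = [f if j != i else torch.zeros_like(f)
                     for j, f in enumerate(feats)]
        out = fused_noise_prediction(experts, router, noisy_action, perturbed, t)
        deviations.append((out - baseline).norm(dim=-1).mean())
    scores = torch.stack(deviations)
    return scores / scores.sum().clamp_min(1e-8)
```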
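And a sketch of the composition highlight: independently trained experts interact only through the weighted noise prediction inside the reverse-diffusion loop, so plugging in a newly trained expert is a matter of appending it to the list. The deterministic DDIM-style update and the normalized timestep conditioning are simplifying assumptions.

```python
# Sketch: composing independently trained experts in a simplified reverse-diffusion loop.
import torch


@torch.no_grad()
def sample_action(experts, router, feats, act_dim, alphas_cumprod):
    """Reverse diffusion using the fused noise prediction; experts never co-train."""
    B, device = feats[0].shape[0], feats[0].device
    action = torch.randn(B, act_dim, device=device)
    T = len(alphas_cumprod)
    for step in reversed(range(T)):
        t = torch.full((B,), step / T, device=device)
        eps = fused_noise_prediction(experts, router, action, feats, t)
        a_bar = alphas_cumprod[step]
        a_bar_prev = alphas_cumprod[step - 1] if step > 0 else alphas_cumprod.new_tensor(1.0)
        # Predict the clean action, then step to the previous noise level (DDIM-style, eta = 0).
        action_0 = (action - (1 - a_bar).sqrt() * eps) / a_bar.sqrt()
        action = a_bar_prev.sqrt() * action_0 + (1 - a_bar_prev).sqrt() * eps
    return action
```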