Each modality has its own expert that processes its inputs independently, preventing vision from dominating critical tactile information in contact-rich tasks (see the sketch after this list)
Train modality-specific policies independently and compose them without retraining the entire system
Maintains performance under sensor corruption, occlusions, and physical perturbations during execution
Add new sensors without retraining from scratch, saving days of compute time
Significantly outperforms feature concatenation baselines on multimodal manipulation tasks
Continues working under sensor corruption and environmental perturbations
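To make the factorized fusion concrete, below is a minimal PyTorch sketch of per-modality experts combined by a learned gate; the class names, layer sizes, and gating scheme are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn


class ModalityExpert(nn.Module):
    """Encodes one modality (e.g. vision or tactile) independently of the others."""

    def __init__(self, input_dim: int, latent_dim: int):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.encoder(x)


class FactorizedMoEPolicy(nn.Module):
    """One expert per modality plus a gate that scores each expert's latent,
    so no single modality can silently dominate the fused representation."""

    def __init__(self, modality_dims: dict[str, int], latent_dim: int, action_dim: int):
        super().__init__()
        self.experts = nn.ModuleDict(
            {name: ModalityExpert(dim, latent_dim) for name, dim in modality_dims.items()}
        )
        self.gate = nn.Linear(latent_dim, 1)   # scores each latent independently
        self.head = nn.Linear(latent_dim, action_dim)

    def forward(self, obs: dict[str, torch.Tensor]) -> torch.Tensor:
        latents = torch.stack(
            [self.experts[name](obs[name]) for name in self.experts], dim=1
        )                                                                # (B, M, latent_dim)
        weights = torch.softmax(self.gate(latents).squeeze(-1), dim=-1)  # (B, M)
        fused = (weights.unsqueeze(-1) * latents).sum(dim=1)             # (B, latent_dim)
        return self.head(fused)


# Toy usage with assumed input sizes (512-d vision features, 64-d tactile features).
policy = FactorizedMoEPolicy({"vision": 512, "tactile": 64}, latent_dim=128, action_dim=7)
batch = {"vision": torch.randn(8, 512), "tactile": torch.randn(8, 64)}
print(policy(batch).shape)  # torch.Size([8, 7])
```

Because the gate scores each expert's latent on its own, adding another expert does not change the shapes of the existing modules.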
Comparison of fusion strategies: feature concatenation baseline vs. factorized MoE fusion vs. ours
Perturbation-based analysis reveals how the policy's reliance shifts between modalities across task stages
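A perturbation probe of this kind can be as simple as masking one modality at a time and measuring how far the predicted action moves; the function below is our own illustrative version (reusing the FactorizedMoEPolicy sketch above), not the paper's exact analysis.

```python
import torch


@torch.no_grad()
def modality_importance(policy, obs: dict[str, torch.Tensor]) -> dict[str, float]:
    """Mean action deviation caused by zeroing out each modality in turn."""
    baseline = policy(obs)
    scores = {}
    for name in obs:
        perturbed = {k: v.clone() for k, v in obs.items()}
        perturbed[name] = torch.zeros_like(perturbed[name])  # mask this modality
        scores[name] = (policy(perturbed) - baseline).norm(dim=-1).mean().item()
    return scores


# e.g. modality_importance(policy, batch) -> {"vision": 0.4, "tactile": 1.2}
# (illustrative numbers); a larger score suggests the policy currently leans on
# that modality, and tracking the scores over a rollout exposes stage-wise shifts.
```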
Our policy maintains performance under runtime perturbations, object repositioning, and sensor corruptions
Independently trained policies can be composed without retraining, enabling incremental integration of new modalities
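One plausible way to realize such composition, assuming each per-modality policy outputs a Gaussian over actions, is a precision-weighted (product-of-Gaussians) combination; the paper's actual composition operator may differ, so treat this as a sketch of the idea rather than the method itself.

```python
import torch


def compose_gaussian_policies(means: list[torch.Tensor],
                              variances: list[torch.Tensor]) -> torch.Tensor:
    """Fuse per-modality Gaussian action predictions by precision weighting."""
    precisions = [1.0 / v for v in variances]   # confidence of each policy
    total_precision = sum(precisions)
    return sum(p * m for p, m in zip(precisions, means)) / total_precision


# Assumed 7-DoF action; the more confident (lower-variance) tactile policy dominates.
vision_mu, vision_var = torch.zeros(7), torch.full((7,), 0.5)
tactile_mu, tactile_var = torch.ones(7), torch.full((7,), 0.1)
print(compose_gaussian_policies([vision_mu, tactile_mu], [vision_var, tactile_var]))
```

Under this scheme, integrating a new sensor amounts to appending one more (mean, variance) pair; the already-trained policies are left untouched.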
Occasional failures under extreme sensor corruptions
Access our code, dataset, and documentation.