Worldscape-MoE: A Unified Mixture-of-Experts Architecture for Scalable Multi-Control Video Generation World Modeling
Jianjie Fang, Yongyan Xu, Ziyou Wang, Yuchao Huang, Zhaolu Wang, Rongze Tang, Mingyuan Jia, Baining Zhao, Weichen Zhang, Xin Zhang, Haisheng Su, Yu Shang, Chen Gao, Wei Wu, Xinlei Chen, Yong Li
The paper, arXiv preprint, and model weights will be released soon.
Worldscape-MoE is a unified world-model training framework for multi-control video generation. It introduces a Mixture-of-Experts design into Diffusion Transformers to learn from heterogeneous supervisory controls, including camera poses, robotic arms, and hand joints, within a single extensible world model.
By combining shared experts for cross-control world knowledge with modality-specific experts for control specialization, Worldscape-MoE aims to scale embodied and interactive world modeling beyond single-control supervision.
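Since the paper and weights are not yet released, the following is only a minimal sketch of the shared-plus-specific expert idea described above, not the actual Worldscape-MoE implementation. It assumes each token passes through one always-active shared expert and one expert chosen by the control type, and that the two outputs are added to a residual connection. The names `Expert`, `SharedPlusControlMoE`, and `CONTROLS`, the hard routing by control label, and all dimensions are hypothetical choices made for illustration.

```python
import random

# Control types named in the project description; the identifiers are hypothetical.
CONTROLS = ("camera", "robot_arm", "hand_joints")


def make_linear(dim_in, dim_out, rng):
    """Return a dim_out x dim_in weight matrix with small random entries."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim_in)] for _ in range(dim_out)]


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def relu(v):
    return [max(0.0, a) for a in v]


class Expert:
    """Two-layer feed-forward expert (a stand-in for a transformer FFN block)."""

    def __init__(self, dim, hidden, rng):
        self.w1 = make_linear(dim, hidden, rng)
        self.w2 = make_linear(hidden, dim, rng)

    def __call__(self, x):
        return matvec(self.w2, relu(matvec(self.w1, x)))


class SharedPlusControlMoE:
    """One shared expert plus one expert per control type (assumed layout)."""

    def __init__(self, dim=8, hidden=16, seed=0):
        rng = random.Random(seed)
        self.shared = Expert(dim, hidden, rng)
        self.specific = {c: Expert(dim, hidden, rng) for c in CONTROLS}

    def __call__(self, token, control):
        if control not in self.specific:
            raise KeyError(f"unknown control type: {control}")
        shared_out = self.shared(token)              # cross-control knowledge
        specific_out = self.specific[control](token)  # control specialization
        # Residual connection plus both expert outputs (an assumed combination rule).
        return [t + s + c for t, s, c in zip(token, shared_out, specific_out)]


layer = SharedPlusControlMoE()
token = [0.5] * 8
out_camera = layer(token, "camera")
out_hand = layer(token, "hand_joints")
```

In this sketch the same token yields different outputs for `"camera"` and `"hand_joints"` because only the control-specific expert changes, while the shared expert contributes identically to both. Adding a new control type would mean adding one entry to `specific`, which is one way to read the "extensible" claim above; a learned router or a different combination rule would be equally consistent with the description.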
If you find this project useful, please consider citing:
@misc{fang2026worldscapemoe,
title = {Worldscape-MoE: A Unified Mixture-of-Experts Architecture for Scalable Multi-Control Video Generation World Modeling},
author = {Fang, Jianjie and Xu, Yongyan and Wang, Ziyou and Huang, Yuchao and Wang, Zhaolu and Tang, Rongze and Jia, Mingyuan and Zhao, Baining and Zhang, Weichen and Zhang, Xin and Su, Haisheng and Shang, Yu and Gao, Chen and Wu, Wei and Chen, Xinlei and Li, Yong},
year = {2026},
note = {Project page: https://embodiedcity.github.io/Worldscape-MoE/}
}