Large language models (LLMs) have unified diverse linguistic tasks within a single framework, yet such unification remains unexplored in human motion generation. Existing methods are confined to isolated tasks, which limits their flexibility for free-form, omni-objective generation.
To address this, we propose OmniMoGen, a unified framework that enables versatile motion generation through interleaved text-motion instructions. Built on a compact RVQ-VAE and transformer architecture, OmniMoGen supports end-to-end, instruction-driven motion generation.
We construct X2Mo, a large-scale dataset of over 137K interleaved text-motion instructions, and introduce AnyContext, a benchmark for evaluating interleaved motion generation. Experiments show that OmniMoGen achieves state-of-the-art performance on text-to-motion, motion editing, and AnyContext, and exhibits emergent capabilities such as compositional editing, self-reflective generation, and knowledge-informed generation. These results mark a step toward more intelligent motion generation.
OmniMoGen introduces a unified paradigm for human motion generation by modeling interleaved text–motion instructions within a single autoregressive framework. Instead of designing task-specific architectures, OmniMoGen treats motion as a discrete language and learns to follow diverse motion-related instructions in a unified manner.
OmniMoGen employs an RVQ-VAE to tokenize human motion into discrete units and concatenates the resulting motion tokens with text tokens into a single sequence. A unified autoregressive transformer is trained to model this interleaved sequence, enabling omni-objective motion generation from instructions.
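To make this concrete, the minimal sketch below illustrates the two ideas in a hypothetical form: residual vector quantization of per-frame motion features into discrete token ids, and packing text and motion tokens into one interleaved sequence for autoregressive next-token training. The class names, dimensions, codebook settings, and the <som>/<eom> boundary ids are illustrative assumptions, not the released implementation.

# Minimal sketch (not the authors' released code) of residual vector
# quantization plus interleaved sequence packing. All sizes, codebook
# settings, and special-token ids below are assumptions.
import torch
import torch.nn as nn


class ResidualVQ(nn.Module):
    """Quantize each frame's latent with a stack of residual codebooks."""

    def __init__(self, dim=256, codebook_size=512, num_quantizers=4):
        super().__init__()
        self.codebooks = nn.ParameterList(
            [nn.Parameter(torch.randn(codebook_size, dim)) for _ in range(num_quantizers)]
        )

    def forward(self, z):                            # z: (T, dim) motion latents
        residual, codes = z, []
        for codebook in self.codebooks:
            dists = torch.cdist(residual, codebook)  # (T, codebook_size)
            idx = dists.argmin(dim=-1)               # nearest code per frame
            residual = residual - codebook[idx]      # quantize and keep the residual
            codes.append(idx)
        return torch.stack(codes, dim=-1)            # (T, num_quantizers) token ids


def interleave(text_ids, motion_ids, som_id=50000, eom_id=50001):
    """Concatenate text and motion tokens into one instruction sequence,
    wrapping the motion span with assumed <som>/<eom> boundary tokens."""
    return torch.cat([
        text_ids,
        torch.tensor([som_id]),
        motion_ids.flatten(),
        torch.tensor([eom_id]),
    ])


rvq = ResidualVQ()
motion_latents = torch.randn(60, 256)          # e.g. 60 frames from a motion encoder
motion_tokens = rvq(motion_latents)
text_tokens = torch.randint(0, 30000, (12,))   # placeholder tokenized instruction
sequence = interleave(text_tokens, motion_tokens)
print(sequence.shape)                          # one sequence for the autoregressive transformer

In this sketch, a single flattened token stream stands in for the interleaved text-motion instruction that the unified transformer models with a standard next-token objective.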
We construct X2Mo, a large-scale dataset consisting of interleaved text–motion instructions spanning in-context generation, motion editing, multi-turn editing, and reflection. X2Mo provides structured supervision for learning unified motion generation.
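As a rough illustration of what such structured supervision could look like, the record below sketches one hypothetical interleaved instruction; X2Mo's actual schema is not specified here, so every field name and the <motion_1> placeholder convention are assumptions.

# Hypothetical example of one interleaved text-motion instruction record.
# Field names, placeholders, and file paths are illustrative only.
example_record = {
    "task": "motion_editing",
    "instruction": "Take <motion_1>, but raise the left arm higher during the wave.",
    "context_motions": ["clips/wave_001.npy"],     # referenced as <motion_1>
    "target_motion": "clips/wave_001_edited.npy",
    "num_turns": 1,                                # multi-turn edits chain such records
}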
AnyContext evaluates motion generation under complex interleaved contexts, requiring models to selectively follow attributes from reference motions according to textual instructions.
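For intuition, a hypothetical AnyContext-style item might look like the following, where the textual instruction names which attribute to take from each reference motion; the format is illustrative, not the benchmark's actual schema.

# Hypothetical AnyContext-style query (illustrative format only): the model must
# follow the footwork of one reference and the arm style of another.
anycontext_item = {
    "instruction": "Perform the footwork of <motion_1> with the arm movements of <motion_2>.",
    "reference_motions": ["clips/boxer_footwork.npy", "clips/conductor_arms.npy"],
}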
Through unified training on interleaved instructions, OmniMoGen exhibits strong emergent capabilities beyond basic generation, including compositional editing, self-reflective refinement, and knowledge-informed motion synthesis.
@misc{bu2025omnimogenunifyinghumanmotion,
      title={OmniMoGen: Unifying Human Motion Generation via Learning from Interleaved Text-Motion Instructions},
      author={Wendong Bu and Kaihang Pan and Yuze Lin and Jiacheng Li and Kai Shen and Wenqiao Zhang and Juncheng Li and Jun Xiao and Siliang Tang},
      year={2025},
      eprint={2512.19159},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.19159},
}