OmniMoGen

Unifying Human Motion Generation via Learning from Interleaved Text–Motion Instructions

Zhejiang University, HiThink Research

Abstract

Large language models (LLMs) have unified diverse linguistic tasks within a single framework, yet such unification remains largely unexplored in human motion generation. Existing methods are confined to isolated tasks, which limits their flexibility for free-form, omni-objective generation.

To address this, we propose OmniMoGen, a unified framework that enables versatile motion generation through interleaved text–motion instructions. Built on a compact RVQ-VAE tokenizer coupled with an autoregressive transformer, OmniMoGen supports end-to-end, instruction-driven motion generation.

We construct X2Mo, a large-scale dataset of over 137K interleaved text–motion instructions, and introduce AnyContext, a benchmark for evaluating interleaved motion generation. Experiments show that OmniMoGen achieves state-of-the-art performance on text-to-motion, motion editing, and AnyContext, while exhibiting emerging capabilities such as compositional editing, self-reflective generation, and knowledge-informed generation. These results mark a step toward the next generation of intelligent motion synthesis.

Overview

OmniMoGen introduces a unified paradigm for human motion generation by modeling interleaved text–motion instructions within a single autoregressive framework. Instead of designing task-specific architectures, OmniMoGen treats motion as a discrete language and learns to follow diverse motion-related instructions in a unified manner.

Unified Architecture

OmniMoGen employs an RVQ-VAE to tokenize human motion into discrete units and concatenates motion tokens with text tokens in a single sequence. A unified autoregressive transformer is trained to model this interleaved sequence, enabling instruction-driven, omni-objective motion generation.
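
The sketch below is a minimal illustration of this token interleaving, assuming a shared vocabulary in which RVQ-VAE codebook indices are offset past the text vocabulary and wrapped in motion boundary markers. The vocabulary sizes, special tokens, and function names are illustrative assumptions, not the authors' released implementation.

from typing import List

TEXT_VOCAB_SIZE = 32000        # assumed text vocabulary size
MOTION_CODEBOOK_SIZE = 512     # assumed RVQ-VAE codebook size
SOM_ID = TEXT_VOCAB_SIZE + MOTION_CODEBOOK_SIZE   # assumed <start_of_motion> marker
EOM_ID = SOM_ID + 1                                # assumed <end_of_motion> marker

def motion_to_tokens(codebook_indices: List[int]) -> List[int]:
    """Map RVQ-VAE codebook indices into the shared vocabulary by offsetting
    them past the text vocabulary, and wrap them in motion boundary markers."""
    return [SOM_ID] + [TEXT_VOCAB_SIZE + i for i in codebook_indices] + [EOM_ID]

def build_interleaved_sequence(segments: List[dict]) -> List[int]:
    """Flatten alternating text / motion segments into one token sequence for
    left-to-right autoregressive modeling."""
    sequence: List[int] = []
    for seg in segments:
        if seg["type"] == "text":
            sequence.extend(seg["token_ids"])                  # pre-tokenized text IDs
        else:
            sequence.extend(motion_to_tokens(seg["codebook_indices"]))
    return sequence

# Example: a text instruction that edits a reference motion.
example = [
    {"type": "text", "token_ids": [101, 2042, 1996]},          # instruction prefix (placeholder IDs)
    {"type": "motion", "codebook_indices": [17, 301, 44, 9]},  # reference motion (placeholder IDs)
    {"type": "text", "token_ids": [2061, 2008, 8840]},         # instruction suffix (placeholder IDs)
]
print(build_interleaved_sequence(example))

In this layout a single next-token objective covers both text and motion, which is what lets one transformer serve every task that can be phrased as an interleaved instruction.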

X2Mo Dataset

We construct X2Mo, a large-scale dataset consisting of interleaved text–motion instructions spanning in-context generation, motion editing, multi-turn editing, and reflection. X2Mo provides structured supervision for learning unified motion generation.
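
As an illustration, a multi-turn editing sample could be laid out as below; the field names and token values are assumptions for exposition, not the released X2Mo schema.

# Hypothetical multi-turn editing sample; "motion" fields stand in for the
# discrete motion tokens produced by the RVQ-VAE tokenizer.
sample = {
    "task": "multi_turn_editing",
    "turns": [
        {"role": "user", "text": "A person walks forward slowly."},
        {"role": "assistant", "motion": [17, 301, 44, 9]},         # placeholder token IDs
        {"role": "user", "text": "Keep the walk, but raise both arms overhead."},
        {"role": "assistant", "motion": [17, 301, 58, 12]},        # placeholder token IDs
    ],
}

for turn in sample["turns"]:
    print(turn["role"], "->", turn.get("text", turn.get("motion")))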

AnyContext Benchmark

AnyContext evaluates motion generation under complex interleaved contexts, requiring models to selectively follow attributes from reference motions according to textual instructions.
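
For example, a test case in this setting might pair two reference motions with an instruction naming which attribute to take from each; the structure and values below are hypothetical, not drawn from the released benchmark.

# Hypothetical AnyContext-style test case.
test_case = {
    "references": [
        [5, 212, 98, 41],    # placeholder motion tokens for reference motion 1
        [77, 31, 400, 6],    # placeholder motion tokens for reference motion 2
    ],
    "instruction": ("Perform the action of the first reference motion while "
                    "following the trajectory of the second reference motion."),
}
print(test_case["instruction"])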

Emerging Capabilities

Through unified training on interleaved instructions, OmniMoGen exhibits strong emerging capabilities beyond basic generation, including compositional editing, self-reflective refinement, and knowledge-informed motion synthesis.

Qualitative Results

Text-to-Motion

Motion Editing

BibTeX

@misc{bu2025omnimogenunifyinghumanmotion,
      title={OmniMoGen: Unifying Human Motion Generation via Learning from Interleaved Text-Motion Instructions}, 
      author={Wendong Bu and Kaihang Pan and Yuze Lin and Jiacheng Li and Kai Shen and Wenqiao Zhang and Juncheng Li and Jun Xiao and Siliang Tang},
      year={2025},
      eprint={2512.19159},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.19159}, 
}