Recent advances in generative modeling and tokenization have driven significant progress in text-to-motion generation, leading to enhanced quality and realism in generated motions. However, effectively leveraging textual information for conditional motion generation remains an open challenge. We observe that current approaches, which primarily rely on fixed-length text embeddings (e.g., CLIP) for global semantic injection, struggle to capture the composite nature of human motion, resulting in suboptimal motion quality and controllability. To address this limitation, we propose the Composite Aware Semantic Injection Mechanism (CASIM), comprising a composite-aware text encoder and a text-motion aligner that learns the dynamic correspondence between text and motion tokens. Notably, CASIM is model- and representation-agnostic, readily integrating with both autoregressive and diffusion-based methods. Experiments on the HumanML3D and KIT benchmarks demonstrate that CASIM consistently improves motion quality, text-motion alignment, and retrieval scores across state-of-the-art methods. Qualitative analyses further highlight the superiority of our composite-aware approach over fixed-length semantic injection, enabling precise motion control from text prompts and stronger generalization to unseen text inputs.
CASIM consists of two major components: a Composite Aware Text Encoder (left) that extracts granular word-level embeddings, and a Text-Motion Aligner (middle) that aligns motion embeddings with relevant textual embeddings inside a motion generator. The Text-Motion Aligner can be integrated with three families of motion generation models (right).
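To make the aligner idea concrete, the sketch below shows one plausible realization of per-token text-motion alignment: each motion token attends over word-level text embeddings via scaled dot-product cross-attention, so different motion steps can focus on different words. The function name, shapes, and single-head formulation are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

def cross_attend(motion_tokens, text_tokens):
    """Hypothetical text-motion aligner sketch (not the paper's exact design):
    scaled dot-product cross-attention from motion tokens (queries) to
    word-level text embeddings (keys/values).

    motion_tokens: (T_motion, d) array of motion token embeddings.
    text_tokens:   (T_text, d) array of word-level text embeddings.
    Returns:       (T_motion, d) text context, one vector per motion token.
    """
    d = motion_tokens.shape[-1]
    # Similarity between every motion token and every word embedding.
    scores = motion_tokens @ text_tokens.T / np.sqrt(d)        # (T_motion, T_text)
    # Softmax over words: each motion token gets its own word weighting.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Convex combination of word embeddings per motion token.
    return weights @ text_tokens                               # (T_motion, d)

# Example: 8 motion tokens attend over 12 word embeddings of dimension 16.
rng = np.random.default_rng(0)
motion = rng.standard_normal((8, 16))
text = rng.standard_normal((12, 16))
context = cross_attend(motion, text)
print(context.shape)  # (8, 16)
```

Because the attention weights vary per motion token, this dynamic correspondence contrasts with fixed-length semantic injection, where a single pooled text vector conditions every motion step identically.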
@article{chang2025casim,
title={CASIM: Composite Aware Semantic Injection for Text to Motion Generation},
author={Chang, Che-Jui and Liu, Qingze Tony and Zhou, Honglu and Pavlovic, Vladimir and Kapadia, Mubbasir},
journal={arXiv preprint arXiv:2502.02063},
year={2025}
}