Semantically Consistent Text-to-Motion with Unsupervised Styles

Linjun Wu1, Xiangjun Tang1, Jingyuan Cong2, He Wang3, Bo Hu4, Xu Gong4, Songnan Li4, Yuchen Liao4, Yiqian Wu1, Chen Liu1, Xiaogang Jin1
1Zhejiang University, 2University of California San Diego, 3University College London, 4Tencent Technology Co., Ltd.
SIGGRAPH 2025
Teaser Image

A showcase of generated motions driven by the unsupervised style of bird gliding. Our method synthesizes motions by combining textual descriptions of desired motion content with unsupervised style reference motions.

Abstract

Text-to-stylized human motion generation leverages text descriptions to generate motion with fine-grained style control derived from a reference motion. However, existing approaches typically rely on supervised style learning with labeled datasets, constraining their adaptability and generalization to diverse styles. Additionally, they have not fully explored the temporal correlations between motion, textual descriptions, and style, making it challenging to generate semantically consistent motion with precise style alignment. To address these limitations, we introduce a novel method that integrates unsupervised style from arbitrary references into a text-driven diffusion model to generate semantically consistent stylized human motion. The core innovation lies in leveraging text as a mediator to capture the temporal correspondences between motion and style, enabling the seamless integration of temporally dynamic style into motion features. Specifically, we first train a diffusion model on a text-motion dataset to capture the correlation between motion and text semantics. A style adapter then extracts temporally dynamic style features from reference motions, and a novel Semantic-Aware Style Injection (SASI) module infuses these features into the diffusion model. The SASI module computes the semantic correlation between motion and style features based on the texts, selectively incorporating style features that align with the motion content to ensure semantic consistency and precise style alignment. Our style adapter does not require a labeled style dataset for training, enhancing the adaptability and generalization of style control. Extensive evaluations show that our method outperforms previous approaches in terms of semantic consistency and style expressivity.
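For readers who prefer code, the following is a minimal PyTorch-style sketch of the inference procedure the abstract describes: a text-conditioned diffusion model denoises random noise into a motion sequence while a style adapter supplies features from an unlabeled style reference. The module names (`text_enc`, `unet`, `style_adapter`, `scheduler`), shapes, and timestep interface are placeholders for illustration, not the released implementation.

```python
# Minimal sketch of text-to-stylized-motion inference; all modules are placeholders,
# and the feature dimensions (HumanML3D-style 263-dim features) are assumptions.
import torch

@torch.no_grad()
def generate_stylized_motion(text, style_motion, text_enc, unet, style_adapter,
                             scheduler, num_frames=196, feat_dim=263, steps=50):
    """Denoise Gaussian noise into a motion sequence that follows the text,
    while a style adapter injects features from an unlabeled style reference."""
    text_emb = text_enc(text)                     # (1, T_text, d) semantic condition
    style_feats = style_adapter(style_motion)     # (1, T_style, d) temporally dynamic style
    x = torch.randn(1, num_frames, feat_dim)      # start from pure noise

    for t in scheduler.timesteps(steps):          # reverse-diffusion timesteps (assumed API)
        # The denoising U-Net is conditioned on the text; its SASI layers add
        # style features that are semantically aligned with the motion content.
        pred = unet(x, t, text_emb, style_feats)
        x = scheduler.step(pred, t, x)            # one reverse-diffusion update
    return x
```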

Video

Pipeline

Our method takes text descriptions of motion content and unlabeled style reference motions as input, and generates stylized motions that are semantically consistent with the content texts while aligning with the reference style. To achieve this, we first train a text-conditioned diffusion model, which combines a text encoder with a denoising U-Net to enable motion generation from text prompts. Next, we train a style adapter, which uses a CNN style encoder to extract temporally dynamic style features from reference motions and injects these features into the U-Net layers through the Semantic-Aware Style Injection (SASI) module. The SASI module leverages text as a mediator to capture the temporal correspondences between the motion latents and the style features, so that only style features aligned with the motion content are injected into the denoising U-Net.
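Conceptually, the SASI module can be read as a text-mediated attention layer. The sketch below illustrates that reading: the motion latents and the style features are each grounded in the text embeddings, their correlation decides which style frames are relevant to which motion frames, and only the correlated style is added back. The projections, single-head attention, and zero-initialized gate are our own simplifying assumptions, not the paper's exact layer design.

```python
# A hedged sketch of the text-mediated style injection idea behind SASI:
# "correlate motion and style through text, then selectively add style".
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticAwareStyleInjection(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)   # shared query projection for motion and style
        self.t_proj = nn.Linear(dim, dim)   # key/value projection for the text mediator
        self.v_proj = nn.Linear(dim, dim)   # value projection for the injected style
        self.gate = nn.Parameter(torch.zeros(1))  # start training with no style injection

    def forward(self, motion, style, text):
        # motion: (B, T_m, d) noisy motion latents from a U-Net layer
        # style:  (B, T_s, d) temporally dynamic style features
        # text:   (B, T_t, d) text token embeddings (the semantic mediator)
        text_kv = self.t_proj(text)

        # Ground both streams in text semantics by attending each to the text tokens.
        sem_motion = F.scaled_dot_product_attention(self.q_proj(motion), text_kv, text_kv)
        sem_style = F.scaled_dot_product_attention(self.q_proj(style), text_kv, text_kv)

        # Their correlation says which style frames are relevant to which motion frames.
        attn = torch.softmax(
            sem_motion @ sem_style.transpose(1, 2) / sem_motion.shape[-1] ** 0.5, dim=-1)

        # Aggregate only the correlated style and inject it into the motion latents.
        return motion + torch.tanh(self.gate) * (attn @ self.v_proj(style))
```

The zero-initialized gate mirrors common adapter designs: when style-adapter training starts, the pretrained text-to-motion model is left unchanged, and style is blended in gradually.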
Pipeline Image

Comparison on unsupervised styles

We compare our model with three baseline methods: StableMoFusion+MCM_LDM, StableMoFusion+DecouplingContact, and SMooDi. When the motion involves multiple actions, our method seamlessly integrates the corresponding style characteristics into each action.

A man throws jabs and crouches to dodge, then stands up and steps back to escape.

Style Reference

StableMoFusion+MCM_LDM

StableMoFusion+DecouplingContact

SMooDi

Ours

A man walks to a chair and sits down, then he stands up and walks away.

Style Reference

StableMoFusion+MCM_LDM

StableMoFusion+DecouplingContact

SMooDi

Ours

Even when the content of the style reference diverges from the text, our method still produces compelling results.

A person trips, rolls forward and stands up.

Style Reference

StableMoFusion+MCM_LDM

StableMoFusion+DecouplingContact

SMooDi

Ours

A person strides forward, then jumps high in place.

Style Reference

StableMoFusion+MCM_LDM

StableMoFusion+DecouplingContact

SMooDi

Ours

Here we demonstrate a combination of content text and style reference that is out of distribution.

A person performs breakdancing (a dynamic street dance).

Style Reference

StableMoFusion+MCM_LDM

StableMoFusion+DecouplingContact

SMooDi

Ours


Additional Application: Style Transfer

Our approach also enables motion style transfer while preserving the content motion more faithfully than the baselines.

Content Reference

Style Reference

MCM_LDM

DecouplingContact

SMooDi

Ours

Additional Application: Stylized Motion In-Between

Our approach also supports stylized motion in-betweening. Orange frames denote keyframes, and purple frames denote generated results.

An old man walks forward while raising both hands.

MDM

Ours (+ style from keyframes)

Without explicit style control, previous diffusion methods fall back on the statistically most probable motions to reach the target keyframes, which disrupts the “old man” style. Our method, in contrast, can draw on the style of the keyframe sequences to create motion that retains the “old man” style (see the sketch at the end of this section).

A man walks forward while raising both hands.

Style Reference

Ours (+ style from reference)

Given a reference motion with a relaxed style, our approach enables a style transition from an “old man” pace to a relaxed pace.
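The in-betweening results above rely on keeping the generated frames consistent with the given keyframes. A common way to do this in diffusion-based motion editing, popularized by MDM-style inpainting, is to re-impose a noised copy of the keyframes at every denoising step. The sketch below illustrates that generic scheme with placeholder modules; it is not the authors' exact procedure, and the style features may be extracted from the keyframe sequence itself or from a separate reference, as in the two examples above.

```python
# A generic sketch of keyframe imputation for diffusion-based motion in-betweening:
# at every denoising step the known keyframes are re-imposed at the current noise
# level, so the model only fills in the missing frames. `unet` and `scheduler`
# are placeholders with an assumed interface.
import torch

@torch.no_grad()
def stylized_inbetween(keyframes, key_mask, text_emb, style_feats, unet, scheduler, steps=50):
    # keyframes: (1, T, d) motion tensor whose known frames hold the keyframe poses
    # key_mask:  (1, T, 1) 1 for keyframes, 0 for frames to be generated
    x = torch.randn_like(keyframes)
    for t in scheduler.timesteps(steps):
        # Replace the known frames with a copy of the keyframes noised to level t.
        noised_keys = scheduler.add_noise(keyframes, torch.randn_like(keyframes), t)
        x = key_mask * noised_keys + (1 - key_mask) * x
        pred = unet(x, t, text_emb, style_feats)   # text- and style-conditioned denoising
        x = scheduler.step(pred, t, x)
    # Keep the exact keyframes in the final result.
    return key_mask * keyframes + (1 - key_mask) * x
```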

Limitations

If the style characteristics conflict with the content texts, our method prioritizes the content.

A person is skipping rope.

Content Reference

Style Reference