Semantically Consistent Text-to-Motion with Unsupervised Styles

Linjun Wu1, Xiangjun Tang1, Jingyuan Cong2, He Wang3, Bo Hu4, Xu Gong4, Songnan Li4, Yuchen Liao4, Yiqian Wu1, Chen Liu1, Xiaogang Jin1
1Zhejiang University, 2University of California San Diego, 3University College London, 4Tencent Technology Co., Ltd.
SIGGRAPH 2025
Teaser Image

A showcase of generated motions driven by the unsupervised style of bird gliding. Our method synthesizes motions by combining textual descriptions of desired motion content with unsupervised style reference motions.

Abstract

Text-to-stylized human motion generation leverages text descriptions to generate motion with fine-grained style control derived from a reference motion. However, existing approaches typically rely on supervised style learning with labeled datasets, constraining their adaptability and generalization to diverse styles. Additionally, they have not fully explored the temporal correlations between motion, textual descriptions, and style, making it challenging to generate semantically consistent motion with precise style alignment. To address these limitations, we introduce a novel method that integrates unsupervised style from arbitrary references into a text-driven diffusion model to generate semantically consistent stylized human motion. The core innovation lies in leveraging text as a mediator to capture the temporal correspondences between motion and style, enabling the seamless integration of temporally dynamic style into motion features. Specifically, we first train a diffusion model on a text-motion dataset to capture the correlation between motion and text semantics. A style adapter then extracts temporally dynamic style features from reference motions, and a novel Semantic-Aware Style Injection (SASI) module infuses these features into the diffusion model. The SASI module computes the semantic correlation between motion and style features based on the texts, selectively incorporating style features that align with the motion content to ensure semantic consistency and precise style alignment. Our style adapter does not require a labeled style dataset for training, enhancing the adaptability and generalization of style control. Extensive evaluations show that our method outperforms previous approaches in terms of semantic consistency and style expressivity.
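For readers who prefer code, the following is a minimal PyTorch-style sketch of the inference procedure the abstract describes: a text-conditioned diffusion model denoises random noise into a motion sequence while a style adapter supplies features from an unlabeled style reference. The module names (`text_enc`, `unet`, `style_adapter`, `scheduler`), shapes, and timestep interface are placeholders for illustration, not the released implementation.

```python
# Minimal sketch of text-to-stylized-motion inference; all modules are placeholders,
# and the feature dimensions (HumanML3D-style 263-dim features) are assumptions.
import torch

@torch.no_grad()
def generate_stylized_motion(text, style_motion, text_enc, unet, style_adapter,
                             scheduler, num_frames=196, feat_dim=263, steps=50):
    """Denoise Gaussian noise into a motion sequence that follows the text,
    while a style adapter injects features from an unlabeled style reference."""
    text_emb = text_enc(text)                     # (1, T_text, d) semantic condition
    style_feats = style_adapter(style_motion)     # (1, T_style, d) temporally dynamic style
    x = torch.randn(1, num_frames, feat_dim)      # start from pure noise

    for t in scheduler.timesteps(steps):          # reverse-diffusion timesteps (assumed API)
        # The denoising U-Net is conditioned on the text; its SASI layers add
        # style features that are semantically aligned with the motion content.
        pred = unet(x, t, text_emb, style_feats)
        x = scheduler.step(pred, t, x)            # one reverse-diffusion update
    return x
```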

Video

Pipeline

Our method takes text descriptions of motion content and unlabeled style reference motions as input, and generates stylized motions that are semantically consistent with the content texts while aligning with the reference style. To achieve this, we first train a text-conditioned diffusion model, which combines a text encoder with a denoising U-Net to enable motion generation from text prompts. Next, we train a style adapter, which uses a CNN style encoder to extract temporally dynamic style features from reference motions and injects these features into the U-Net layers through the Semantic-Aware Style Injection (SASI) module. The SASI module leverages text as a mediator to capture the temporal correspondences between the motion latents and the style features, so that only style features aligned with the motion content are injected into the denoising U-Net.
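Conceptually, the SASI module can be read as a text-mediated attention layer. The sketch below illustrates that reading: the motion latents and the style features are each grounded in the text embeddings, their correlation decides which style frames are relevant to which motion frames, and only the correlated style is added back. The projections, single-head attention, and zero-initialized gate are our own simplifying assumptions, not the paper's exact layer design.

```python
# A hedged sketch of the text-mediated style injection idea behind SASI:
# "correlate motion and style through text, then selectively add style".
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticAwareStyleInjection(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)   # shared query projection for motion and style
        self.t_proj = nn.Linear(dim, dim)   # key/value projection for the text mediator
        self.v_proj = nn.Linear(dim, dim)   # value projection for the injected style
        self.gate = nn.Parameter(torch.zeros(1))  # start training with no style injection

    def forward(self, motion, style, text):
        # motion: (B, T_m, d) noisy motion latents from a U-Net layer
        # style:  (B, T_s, d) temporally dynamic style features
        # text:   (B, T_t, d) text token embeddings (the semantic mediator)
        text_kv = self.t_proj(text)

        # Ground both streams in text semantics by attending each to the text tokens.
        sem_motion = F.scaled_dot_product_attention(self.q_proj(motion), text_kv, text_kv)
        sem_style = F.scaled_dot_product_attention(self.q_proj(style), text_kv, text_kv)

        # Their correlation says which style frames are relevant to which motion frames.
        attn = torch.softmax(
            sem_motion @ sem_style.transpose(1, 2) / sem_motion.shape[-1] ** 0.5, dim=-1)

        # Aggregate only the correlated style and inject it into the motion latents.
        return motion + torch.tanh(self.gate) * (attn @ self.v_proj(style))
```

The zero-initialized gate mirrors common adapter designs: when style-adapter training starts, the pretrained text-to-motion model is left unchanged, and style is blended in gradually.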
Pipeline Image

Comparison on unsupervised styles

We compare our model with three baseline methods: StableMoFusion+MCM_LDM, StableMoFusion+DecouplingContact, and SMooDi. When the motion involves multiple actions, our method seamlessly integrates the corresponding style characteristics into each action.

A man throws jabs and crouches to dodge, then stands up and steps back to escape.

Style Reference

StableMoFusion+MCM_LDM

StableMoFusion+DecouplingContact

SMooDi

Ours

A man walks to a chair and sits down, then he stands up and walks away.

Style Reference

StableMoFusion+MCM_LDM

StableMoFusion+DecouplingContact

SMooDi

Ours

Even when the content of the style reference diverges from the text, our method still produces compelling results.

A person trips, rolls forward and stands up.

Style Reference

StableMoFusion+MCM_LDM

StableMoFusion+DecouplingContact

SMooDi

Ours

A person strides forward, then jumps high in place.

Style Reference

StableMoFusion+MCM_LDM

StableMoFusion+DecouplingContact

SMooDi

Ours

Here we demonstrate a combination of content text and style reference that is out of distribution.

A person performs breakdancing (a dynamic street dance).

Style Reference

StableMoFusion+MCM_LDM

StableMoFusion+DecouplingContact

SMooDi

Ours


Additional Application: Style Transfer

Our approach also enables motion style transfer while preserving the content motion more faithfully than the baselines.

Content Reference

Style Reference

MCM_LDM

DecouplingContact

SMooDi

Ours

Additional Application: Stylized Motion In-Between

Our approach also supports stylized motion in-betweening. Orange frames denote keyframes, and purple frames denote generated results.

An old man walks forward while raising both hands.

MDM

Ours (+ style from keyframes)

Without explicit style control, previous diffusion methods fall back on the statistically most probable motions to reach the target keyframes, which disrupts the “old man” style. Our method, in contrast, can draw on the style of the keyframe sequences to create motion that retains the “old man” style (see the sketch at the end of this section).

A man walks forward while raising both hands.

Style Reference

Ours (+ style from reference)

Given a reference motion with a relaxed style, our approach enables a style transition from an “old man” pace to a relaxed pace.
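The in-betweening results above rely on keeping the generated frames consistent with the given keyframes. A common way to do this in diffusion-based motion editing, popularized by MDM-style inpainting, is to re-impose a noised copy of the keyframes at every denoising step. The sketch below illustrates that generic scheme with placeholder modules; it is not the authors' exact procedure, and the style features may be extracted from the keyframe sequence itself or from a separate reference, as in the two examples above.

```python
# A generic sketch of keyframe imputation for diffusion-based motion in-betweening:
# at every denoising step the known keyframes are re-imposed at the current noise
# level, so the model only fills in the missing frames. `unet` and `scheduler`
# are placeholders with an assumed interface.
import torch

@torch.no_grad()
def stylized_inbetween(keyframes, key_mask, text_emb, style_feats, unet, scheduler, steps=50):
    # keyframes: (1, T, d) motion tensor whose known frames hold the keyframe poses
    # key_mask:  (1, T, 1) 1 for keyframes, 0 for frames to be generated
    x = torch.randn_like(keyframes)
    for t in scheduler.timesteps(steps):
        # Replace the known frames with a copy of the keyframes noised to level t.
        noised_keys = scheduler.add_noise(keyframes, torch.randn_like(keyframes), t)
        x = key_mask * noised_keys + (1 - key_mask) * x
        pred = unet(x, t, text_emb, style_feats)   # text- and style-conditioned denoising
        x = scheduler.step(pred, t, x)
    # Keep the exact keyframes in the final result.
    return key_mask * keyframes + (1 - key_mask) * x
```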

Limitations

If the style characteristics conflict with the content texts, our method prioritizes the content.

A person is skipping rope.

Content Reference

Style Reference