Disentangled Clothed Avatar Generation with Layered Representation

Weitian Zhang1
Sijing Wu1
Manwen Liao2
Yichao Yan1
1Shanghai Jiao Tong University 2The University of Hong Kong
arXiv
Code

We propose LayerAvatar to efficiently generate diverse clothed avatars with fully disentangled components. The generated avatars can be animated and rendered from novel views, and they can be decomposed into body, hair, and clothes for component transfer.

Abstract


Clothed avatar generation has wide applications in virtual and augmented reality, filmmaking, and more. Previous methods have succeeded in generating diverse digital avatars; however, generating avatars with disentangled components (e.g., body, hair, and clothes) has long been a challenge. In this paper, we propose LayerAvatar, the first feed-forward diffusion-based method for generating component-disentangled clothed avatars. To achieve this, we first propose a layered UV feature plane representation, in which components are distributed across different layers of a Gaussian-based UV feature plane with corresponding semantic labels. This representation supports high-resolution, real-time rendering, as well as expressive animation including controllable gestures and facial expressions. Moreover, we propose a semantic-aware compositional rendering strategy to facilitate the full disentanglement of each component. Based on this representation, we train a single-stage diffusion model and introduce constraint terms to address the severe occlusion of the innermost human body layer. Extensive experiments demonstrate the impressive performance of our method in generating disentangled clothed avatars, and we further explore its application to component transfer.
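For readers who want a concrete picture of the layered representation, the following PyTorch sketch builds a layered UV feature plane with one Gaussian-based plane per component and a semantic label per layer. The component set, channel counts, UV resolution, and 14-channel attribute layout are our own illustrative assumptions, not values taken from the paper.

import torch
import torch.nn as nn

LAYERS = ["body", "hair", "upper_clothes", "lower_clothes", "shoes"]  # assumed component set
C, H, W = 32, 256, 256  # assumed feature channels and UV resolution

class LayeredUVFeaturePlane(nn.Module):
    """One UV feature plane per component layer, each tagged with a semantic id."""
    def __init__(self):
        super().__init__()
        # learnable per-layer feature planes: (num_layers, C, H, W)
        self.planes = nn.Parameter(0.01 * torch.randn(len(LAYERS), C, H, W))
        # shared 1x1-conv decoder from features to per-texel Gaussian attributes:
        # 3 position offset + 4 rotation (quaternion) + 3 scale + 1 opacity + 3 color = 14
        self.decoder = nn.Conv2d(C, 14, kernel_size=1)

    def forward(self):
        attr = self.decoder(self.planes)  # (num_layers, 14, H, W)
        maps = {
            "offset":   attr[:, 0:3],
            "rotation": attr[:, 3:7],
            "scale":    attr[:, 7:10].exp(),       # keep scales positive
            "opacity":  attr[:, 10:11].sigmoid(),  # opacity in [0, 1]
            "color":    attr[:, 11:14].sigmoid(),  # RGB in [0, 1]
        }
        semantic_ids = torch.arange(len(LAYERS))   # one semantic label per layer
        return maps, semantic_ids

if __name__ == "__main__":
    maps, semantic_ids = LayeredUVFeaturePlane()()
    print({k: tuple(v.shape) for k, v in maps.items()}, semantic_ids.tolist())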

Full Video



Method Overview


LayerAvatar learns a feed-forward diffusion model to generate clothed avatars with each component disentangled. A clothed avatar is represented as a layered UV feature plane in which each component is represented separately. After decoding the feature plane into attribute maps, we extract 3D Gaussians from them through SMPL-X-based templates. The generated clothed avatars are then transformed into the target pose space for further supervision. Both a reconstruction loss and a constraint loss are used to facilitate disentanglement and handle the severe occlusion of the human body layer.
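To make the decoding and animation step more concrete, here is a hedged sketch of how decoded attribute maps could be turned into 3D Gaussians anchored on a template and skinned into a target pose via linear blend skinning. The inputs template_xyz_uv, lbs_weights, and joint_transforms are hypothetical stand-ins for the SMPL-X-based template and pose data used in the paper.

import torch

def gaussians_from_maps(maps, template_xyz_uv):
    """Place one Gaussian per UV texel: template position plus decoded offset.

    maps: dict of (num_layers, C_attr, H, W) tensors from the feature-plane decoder.
    template_xyz_uv: (num_layers, 3, H, W) canonical positions sampled from the template.
    """
    L = maps["offset"].shape[0]
    xyz = template_xyz_uv + maps["offset"]  # (num_layers, 3, H, W)
    flat = lambda t: t.permute(0, 2, 3, 1).reshape(L, -1, t.shape[1])
    return {"xyz": flat(xyz), "rotation": flat(maps["rotation"]),
            "scale": flat(maps["scale"]), "opacity": flat(maps["opacity"]),
            "color": flat(maps["color"])}

def skin_to_pose(xyz, lbs_weights, joint_transforms):
    """Linear blend skinning: move canonical Gaussian centers into the target pose.

    xyz: (N, 3) canonical positions, lbs_weights: (N, J), joint_transforms: (J, 4, 4).
    """
    T = torch.einsum("nj,jab->nab", lbs_weights, joint_transforms)  # (N, 4, 4)
    xyz_h = torch.cat([xyz, torch.ones_like(xyz[:, :1])], dim=-1)   # homogeneous coords
    return torch.einsum("nab,nb->na", T, xyz_h)[:, :3]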

Random Generation

Our method can generate fully disentangled avatars wearing diverse clothes. The generated digital avatars exhibit fine details such as distinct fingers and clothing wrinkles.

Novel Pose Animation

We demonstrate novel pose animation using pose sequences from AMASS and X-Avatar. Our method can also handle vivid gesture and facial expression control.

Component Transfer

We showcase the component transfer application of our method. With disentangled components, we can directly transfer hairstyles, clothes, and shoes between avatars to enable customization of digital avatars, as sketched below.
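The sketch below shows what component transfer amounts to once the layers are disentangled: swapping one component's Gaussian set between two generated avatars. The per-avatar dictionary layout is an illustrative assumption, not a released interface.

def transfer_component(target_avatar, source_avatar, component="hair"):
    """Return a copy of `target_avatar` wearing `component` taken from `source_avatar`.

    Both avatars are assumed to be dicts mapping a component name (e.g. "hair",
    "upper_clothes", "shoes") to that layer's 3D Gaussian set.
    """
    result = dict(target_avatar)                  # shallow copy of the layer dict
    result[component] = source_avatar[component]  # swap in the disentangled layer
    return result

# e.g. put avatar B's hairstyle on avatar A:
# customized = transfer_component(avatar_a, avatar_b, component="hair")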

Citation



@article{zhang2025layeravatar,
    title={Disentangled Clothed Avatar Generation via Layered Representation}, 
    author={Weitian Zhang and Sijing Wu and Manwen Liao and Yichao Yan},
    year={2025},
    journal={arXiv preprint arXiv:2501.04631},
}