We present Vid2Avatar-Pro, a method to create photorealistic and animatable 3D human avatars from monocular in-the-wild videos. Building a high-quality avatar that supports animation with diverse poses from a monocular video is challenging because the observed poses and viewpoints are inherently limited. The lack of pose variation typically leads to poor generalization to novel poses, and avatars can easily overfit to the limited input viewpoints, producing artifacts and distortions when rendered from other views. In this work, we address these limitations by leveraging a universal prior model (UPM) learned from a large corpus of multi-view clothed human performance capture data. We build our representation on top of expressive 3D Gaussians with canonical front and back maps shared across identities. Once the UPM is trained to accurately reproduce the large-scale multi-view human images, we fine-tune it on an in-the-wild video via inverse rendering to obtain a personalized photorealistic human avatar that can be faithfully animated with novel human motions and rendered from novel views. Experiments on publicly available datasets show that our approach, built on the learned universal prior, sets a new state of the art in monocular avatar reconstruction, substantially outperforming existing methods that rely only on heuristic regularization or a shape prior of minimally clothed bodies (e.g., SMPL).
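For context, animating the canonical Gaussian representation to novel motions can be expressed with the standard forward linear blend skinning (LBS) warp; the notation below is illustrative and not necessarily the paper's exact formulation:

\[
\mathbf{x}_p \;=\; \sum_{k=1}^{K} w_k(\mathbf{x}_c)\,\big(\mathbf{R}_k(\theta)\,\mathbf{x}_c + \mathbf{t}_k(\theta)\big),
\]

where \(\mathbf{x}_c\) is a canonical Gaussian center, \(w_k(\mathbf{x}_c)\) its skinning weight for joint \(k\), and \((\mathbf{R}_k(\theta), \mathbf{t}_k(\theta))\) the posed bone rotation and translation induced by the pose parameters \(\theta\).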
a) We employ a large corpus of multi-view dynamic clothed human performances to train a cross-identity universal prior model (UPM). During training, the UPM is conditioned on the normalized identity-specific texture map and takes the posed position map as input to predict Gaussian attributes. We extract the canonical 3D Gaussians and synthesize human renderings for the training pose/shape parameters by applying forward LBS and rasterization. We minimize the loss over the entire universal human corpus (a minimal training-step sketch is given after this overview).
b) Given a monocular in-the-wild video of an unseen identity, we track the human pose/shape parameters and reconstruct the canonical textured template. We further deploy a diffusion-based model tailored for canonical texture inpainting to complete the canonical texture map. We then fine-tune our pre-trained UPM on the monocular observations via inverse rendering to recover person-specific details (see the personalization sketch below).
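The two stages above can be summarized by the following minimal PyTorch-style sketches. All names here (upm, forward_lbs, splat_renderer, the batch keys, and the hyperparameters) are illustrative assumptions rather than the released implementation, and the actual objective in the paper likely includes additional loss terms.

import torch
import torch.nn.functional as F

def upm_training_step(upm, forward_lbs, splat_renderer, batch, optimizer):
    """One cross-identity UPM training step (hypothetical sketch)."""
    # Identity conditioning: normalized texture map in the shared canonical
    # front/back layout, shape (B, 3, H, W).
    id_texture = batch["normalized_texture_map"]
    # Pose-dependent input: posed position map in the same canonical layout.
    posed_position_map = batch["posed_position_map"]

    # The UPM predicts per-texel Gaussian attributes in canonical space
    # (positions, rotations, scales, opacities, colors).
    canonical_gaussians = upm(id_texture, posed_position_map)

    # Warp canonical Gaussians into posed space with forward LBS using the
    # tracked pose/shape parameters, then rasterize from the training camera.
    posed_gaussians = forward_lbs(canonical_gaussians, batch["pose"], batch["shape"])
    rendered = splat_renderer(posed_gaussians, batch["camera"])

    # Photometric reconstruction loss against the captured multi-view image.
    loss = F.l1_loss(rendered, batch["image"])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def personalize_avatar(upm, forward_lbs, splat_renderer, frames, tracked,
                       canonical_texture, num_steps=2000, lr=1e-4):
    """Fine-tune the pre-trained UPM on a monocular video (hypothetical sketch)."""
    optimizer = torch.optim.Adam(upm.parameters(), lr=lr)
    for step in range(num_steps):
        i = step % len(frames)
        frame, params = frames[i], tracked[i]

        # Condition on the inpainting-completed canonical texture of the new
        # identity; the posed position map comes from the tracked pose/shape.
        canonical_gaussians = upm(canonical_texture, params["posed_position_map"])
        posed_gaussians = forward_lbs(canonical_gaussians, params["pose"], params["shape"])
        rendered = splat_renderer(posed_gaussians, params["camera"])

        # Inverse rendering against the monocular frame recovers
        # person-specific appearance details.
        loss = F.l1_loss(rendered, frame)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return upm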
Our approach can create photorealistic avatars from monocular in-the-wild videos captured using an iPhone.
The created avatars can be animated with diverse novel motions.
This work was done when Chen Guo was an intern at Meta. We thank Yuliang Xiu, Ethan Weber, Yu Rong, and Artem Sevastopolsky for their feedback and discussions. We also thank Yuliang Xiu, Jiye Lee, and Jihyun Lee for helping us with the in-the-wild video capture.
@inproceedings{guo2025vid2avatarpro,
title = {Vid2Avatar-Pro: Authentic Avatar from Videos in the Wild via Universal Prior},
author = {Guo, Chen and Li, Junxuan and Kant, Yash and Sheikh, Yaser and Saito, Shunsuke and Cao, Chen},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2025},
}