Vid2Avatar: 3D Avatar Reconstruction from Videos in the Wild via Self-supervised Scene Decomposition

¹ETH Zürich, ²Max Planck Institute for Intelligent Systems, Tübingen
CVPR 2023

Vid2Avatar is a method that reconstructs detailed 3D avatars from monocular videos in the wild via self-supervised scene decomposition.


We present Vid2Avatar, a method to learn human avatars from monocular in-the-wild videos. Reconstructing humans that move naturally from monocular in-the-wild videos is difficult: it requires accurately separating humans from arbitrary backgrounds, and it requires reconstructing a detailed 3D surface from short video sequences, which makes the problem even more challenging. Despite these challenges, our method does not require any ground-truth supervision or priors extracted from large datasets of clothed human scans, nor does it rely on any external segmentation modules. Instead, it solves the tasks of scene decomposition and surface reconstruction directly in 3D by modeling both the human and the background in the scene jointly, parameterized via two separate neural fields. Specifically, we define a temporally consistent human representation in canonical space and formulate a global optimization over the background model, the canonical human shape and texture, and per-frame human pose parameters. A coarse-to-fine sampling strategy for volume rendering and novel objectives are introduced for a clean separation of dynamic human and static background, yielding detailed and robust 3D human geometry reconstructions. We evaluate our method on publicly available datasets and show improvements over prior art.
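The abstract mentions a coarse-to-fine sampling strategy for volume rendering but does not spell it out on this page. As a rough illustration only, the sketch below shows one common way such a strategy is realized: inverse-transform sampling of fine depths from the weights of a coarse pass. This is an assumption, not the authors' exact scheme, and the function name `sample_fine_depths` and its parameters are hypothetical.

```python
import bisect
import random


def sample_fine_depths(bin_edges, coarse_weights, n_fine, rng=None):
    """Draw fine depth samples along one ray, concentrated where the
    coarse pass assigned high rendering weight (inverse-transform sampling).

    bin_edges:      depths delimiting the coarse bins, length K + 1
    coarse_weights: non-negative weight per coarse bin, length K
    NOTE: illustrative sketch; not taken from the Vid2Avatar code base.
    """
    rng = rng or random.Random(0)
    near, far = bin_edges[0], bin_edges[-1]

    # Unnormalized cumulative distribution over the coarse bins.
    cdf, running = [], 0.0
    for w in coarse_weights:
        running += w
        cdf.append(running)

    # No evidence of a surface on this ray: fall back to uniform sampling.
    if running <= 0.0:
        return sorted(near + (far - near) * rng.random() for _ in range(n_fine))

    samples = []
    for _ in range(n_fine):
        u = rng.random() * running
        # First bin whose cumulative weight exceeds u.
        i = min(bisect.bisect_right(cdf, u), len(coarse_weights) - 1)
        prev = cdf[i - 1] if i > 0 else 0.0
        span = cdf[i] - prev
        frac = min((u - prev) / span, 1.0) if span > 0.0 else 0.5
        samples.append(bin_edges[i] + frac * (bin_edges[i + 1] - bin_edges[i]))
    return sorted(samples)
```

For example, with bins `[0, 1, 2, 3]` and coarse weights `[0, 1, 0]`, every fine sample falls in the middle bin, so the second rendering pass spends its budget near the estimated surface rather than in empty space.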



To reconstruct detailed geometry and appearance of implicit neural avatars from monocular videos in the wild, we solve the tasks of scene decomposition and surface reconstruction directly in 3D, in contrast to prior works that rely on off-the-shelf 2D segmentation tools or manually labeled masks. To achieve this, we model both the human and the background in the scene implicitly, parameterized via two separate neural fields that are learned jointly from images and composited to render the whole scene. To alleviate the ambiguity between body and scene parts that are in contact, and to better delineate the surfaces, we contribute novel objectives that leverage the dynamically updated human shape in canonical space to regularize the ray opacity.
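To make the idea of compositing two fields and regularizing ray opacity more concrete, here is a minimal per-ray sketch. It assumes the human samples lie in front of the background samples along the ray, and that the objectives take the form of an opacity penalty on rays that miss the current body estimate plus a binary-entropy term; both are assumptions made for illustration, and all function names are hypothetical rather than the authors' API.

```python
import math


def render_samples(alphas, colors):
    """Standard front-to-back alpha compositing of samples along one ray.
    Returns the accumulated RGB color and the accumulated opacity."""
    transmittance, opacity = 1.0, 0.0
    rgb = [0.0, 0.0, 0.0]
    for alpha, color in zip(alphas, colors):
        weight = transmittance * alpha
        rgb = [acc + weight * ch for acc, ch in zip(rgb, color)]
        opacity += weight
        transmittance *= 1.0 - alpha
    return rgb, opacity


def composite_ray(human_alphas, human_colors, bg_alphas, bg_colors):
    """Composite the human field in front of the background field.
    Assumption: the background only contributes where the human is transparent."""
    human_rgb, human_opacity = render_samples(human_alphas, human_colors)
    bg_rgb, _ = render_samples(bg_alphas, bg_colors)
    pixel = [h + (1.0 - human_opacity) * b for h, b in zip(human_rgb, bg_rgb)]
    return pixel, human_opacity


def opacity_sparsity_loss(human_opacity, ray_hits_body):
    """Hypothetical regularizer: rays that miss the current human shape
    estimate should carry no human opacity."""
    return 0.0 if ray_hits_body else abs(human_opacity)


def binary_entropy_loss(human_opacity, eps=1e-6):
    """Hypothetical regularizer: push per-ray human opacity toward 0 or 1
    so that each ray is assigned cleanly to the human or the background."""
    p = min(max(human_opacity, eps), 1.0 - eps)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))
```

In this sketch, a ray with half-transparent human samples blends the two fields and also incurs the largest entropy penalty, which is the kind of ambiguous assignment the opacity regularization is meant to discourage.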



Our approach outperforms existing state-of-the-art methods because it decouples humans from the background more cleanly via self-supervised scene decomposition.

Online Video


More Qualitative Results

Our method can generalize to different human shapes, garment styles and facial features even under challenging poses and complicated environments.

360° Visualization

The reconstructed 3D avatar can be viewed from any angle.

SynWild Dataset

We create a new dataset called SynWild to evaluate human surface reconstruction from monocular videos in the wild. More details and the download link for the dataset will be available soon.


Chen Guo was supported by a Microsoft Research Swiss JRC Grant. Xu Chen was supported by the Max Planck ETH Center for Learning Systems. We thank Manuel Kaufmann and Juan Zarate for proofreading. We also thank Doriano van Essen for helping us with the SynWild dataset. We sincerely thank Marquese Scott for his amazing dancing videos!


      title     = {Vid2Avatar: 3D Avatar Reconstruction from Videos in the Wild via Self-supervised Scene Decomposition},
      author    = {Guo, Chen and Jiang, Tianjian and Chen, Xu and Song, Jie and Hilliges, Otmar},
      booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
      month     = {June},
      year      = {2023},