We propose a relightable and articulated neural avatar for the photorealistic synthesis of humans under arbitrary viewpoints, body poses, and lighting.
We present a novel framework to model humans while disentangling their geometry, texture, and lighting environment from monocular RGB videos.
To simplify this otherwise ill-posed task, we first estimate the coarse geometry and texture of the person via SMPL+D model fitting and then learn an articulated neural representation for photorealistic image generation.
We require only a short video clip of the person to create the avatar and assume no knowledge of the lighting environment.
We also propose to pretrain our approach on synthetic images and demonstrate that this leads to better disentanglement between geometry and texture while also improving robustness to novel body poses.
Finally, we present a new photorealistic synthetic dataset, Relighting Humans, to quantitatively evaluate the performance of the proposed approach.
Authors
Umar Iqbal, Akin Caliskan, Koki Nagano, Sameh Khamis, Pavlo Molchanov, Jan Kautz
Neural avatars of humans have numerous applications across telepresence, animation, and visual content creation.
To enable widespread adoption, these neural avatars should be easy to generate, easy to animate under novel poses and viewpoints, able to render at photorealistic image quality, and easy to relight in novel environments.
Existing methods commonly aim to learn such neural avatars using monocular videos recorded in unknown environments.
While this enables photorealistic image quality and animation, the synthesized images are always limited to the lighting environment of the training video.
To create an avatar, we require only a short monocular video clip of the person captured in an unconstrained environment, with arbitrary clothing and body poses.
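From this clip, the coarse geometry and texture of the person are obtained via SMPL+D model fitting, i.e., the SMPL body model extended with per-vertex displacements. For reference, the standard SMPL+D formulation from prior work (not specific to this method) is

$$
T(\beta, \theta, D) = \bar{T} + B_S(\beta) + B_P(\theta) + D, \qquad
M(\beta, \theta, D) = W\!\left(T(\beta, \theta, D),\, J(\beta),\, \theta,\, \mathcal{W}\right),
$$

where $\bar{T}$ is the template mesh, $B_S$ and $B_P$ are the shape and pose blend shapes, $D$ are per-vertex displacements capturing clothing and hair, and $W$ denotes linear blend skinning with joints $J(\beta)$ and skinning weights $\mathcal{W}$.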
We then propose a convolutional neural network, trained on synthetic data, that removes the shading information from the coarse texture. This shading-free texture is passed to our neural avatar framework, which uses two separate convolutional networks to generate refined normal and albedo maps of the person under the target body pose.
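As an illustration of this two-network design, here is a minimal PyTorch sketch; the layer sizes, input channels, and conditioning signal are assumptions for illustration, not the actual architecture used by the method:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(in_ch, out_ch):
    """3x3 convolution followed by ReLU."""
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))

class MapPredictor(nn.Module):
    """Small fully convolutional network: pose-conditioned input -> 3-channel map."""
    def __init__(self, in_ch=6, out_ch=3, width=64):
        super().__init__()
        self.net = nn.Sequential(
            conv_block(in_ch, width),
            conv_block(width, width),
            nn.Conv2d(width, out_ch, 3, padding=1),
        )

    def forward(self, x):
        return self.net(x)

# Two separate networks: one for the refined normal map, one for the albedo map.
normal_net = MapPredictor()
albedo_net = MapPredictor()

# Hypothetical conditioning: shading-free coarse texture + coarse normals rendered
# in the target body pose (6 channels); the real inputs may differ.
x = torch.randn(1, 6, 256, 256)
normal_map = F.normalize(normal_net(x), dim=1)   # unit-length surface normals
albedo_map = torch.sigmoid(albedo_net(x))        # albedo values in [0, 1]
```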
Given the normal map, albedo map, and lighting information, we generate the final shaded image using spherical harmonics (SH) lighting.
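A minimal sketch of SH shading is shown below, assuming Lambertian reflectance and second-order SH lighting with 9 coefficients per color channel; the SH order and conventions are assumptions for illustration rather than details taken from the method:

```python
import torch

def sh_shade(albedo, normals, sh_coeffs):
    """Shade an image with second-order spherical harmonics lighting.

    albedo, normals: (B, 3, H, W); normals are unit vectors.
    sh_coeffs: (9, 3) SH lighting coefficients, one set per color channel.
    """
    x, y, z = normals[:, 0], normals[:, 1], normals[:, 2]
    basis = torch.stack([
        0.282095 * torch.ones_like(x),       # Y_00
        0.488603 * y,                        # Y_1-1
        0.488603 * z,                        # Y_10
        0.488603 * x,                        # Y_11
        1.092548 * x * y,                    # Y_2-2
        1.092548 * y * z,                    # Y_2-1
        0.315392 * (3.0 * z * z - 1.0),      # Y_20
        1.092548 * x * z,                    # Y_21
        0.546274 * (x * x - y * y),          # Y_22
    ], dim=-1)                               # (B, H, W, 9)
    shading = (basis @ sh_coeffs).permute(0, 3, 1, 2)  # per-pixel irradiance, (B, 3, H, W)
    return albedo * shading                             # final shaded image
```

With the predicted normal_map and albedo_map from the sketch above and an estimate of the lighting coefficients, sh_shade(albedo_map, normal_map, sh_coeffs) produces the relit image.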
Since the environment lighting is unknown during training, we optimize it jointly with the person's appearance and propose novel regularization terms to prevent lighting from leaking into the albedo texture.
This not only improves the generalization of the neural avatar to novel body poses but also helps decouple the texture and geometry information.
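A hedged sketch of this joint optimization is given below, reusing normal_net, albedo_net, and sh_shade from the sketches above; the learning rate, loss weight, and the albedo-consistency term are illustrative placeholders, not the paper's actual regularization terms:

```python
import torch
import torch.nn.functional as F

# The environment lighting is unknown, so its SH coefficients are free variables
# optimized jointly with the avatar networks.
sh_coeffs = torch.zeros(9, 3, requires_grad=True)
optimizer = torch.optim.Adam(
    list(normal_net.parameters()) + list(albedo_net.parameters()) + [sh_coeffs],
    lr=1e-4,
)

def training_step(x, coarse_albedo, target_image):
    normals = F.normalize(normal_net(x), dim=1)
    albedo = torch.sigmoid(albedo_net(x))
    rendered = sh_shade(albedo, normals, sh_coeffs)
    photo_loss = (rendered - target_image).abs().mean()
    # Hypothetical regularizer: keep the predicted albedo close to the shading-free
    # coarse texture so the jointly optimized lighting cannot leak into the albedo.
    albedo_reg = (albedo - coarse_albedo).abs().mean()
    loss = photo_loss + 0.1 * albedo_reg
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```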
Result
We present a novel framework for learning relightable and articulated neural avatars of humans from unconstrained RGB videos while disentangling their geometry, albedo texture, and environment lighting.
We show that it can generate photorealistic images of people under novel body poses, viewpoints, and lighting.
We also propose a new photorealistic synthetic dataset to quantitatively evaluate the performance of our method and believe it will be useful for furthering research in this direction.