MixVoxels: Fast voxel representation for dynamic scenes
Mixed Neural Voxels for Fast Multi-view Video Synthesis
Synthesizing high-fidelity videos from real-world multi-view inputs is challenging because of the complexity of real-world environments and highly dynamic motions.
The proposed MixVoxels represents a 4D dynamic scene as a mixture of static and dynamic voxels and processes them with different networks.
To separate the two kinds of voxels, we propose a novel variation field that estimates the temporal variance of each voxel, and we design an inner-product time query to efficiently query many timesteps at once, which is essential for recovering highly dynamic motions.
As a result, with only 15 minutes of training on dynamic scenes given 300-frame video inputs, the proposed method achieves better PSNR than previous methods.
Authors
Feng Wang, Sinan Tan, Xinghang Li, Zeyue Tian, Huaping Liu
Multi-view 3D video synthesis is a critical and challenging problem with many potential applications, such as interactive free-viewpoint control for movies, cinematic effects like freeze-frame, novel-view replays for sporting events, and various VR/AR applications.
To model more complex real-world dynamic scenes, a more practical solution is to use multi-view synchronized videos, which provide dense spatial-temporal supervision.
Although significant improvements have been achieved, challenges remain: (1) training and rendering require substantial time and computational resources, and (2) highly dynamic scenes with complex motions are still difficult to reconstruct.
In this paper, we design a novel method named MixVoxels to address these two challenges. We also design an efficient inner-product time query to simultaneously query a large number of time steps, which is essential for recovering sharp details of highly dynamic objects (such as a fast-moving hand).
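To make the inner-product time query concrete, here is a minimal PyTorch sketch. It assumes a per-voxel feature of dimension `feat_dim` and one learned embedding per timestep; the names `InnerProductTimeQuery` and `time_embeddings` are illustrative, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class InnerProductTimeQuery(nn.Module):
    """Query values for all timesteps with a single matrix product."""

    def __init__(self, feat_dim: int = 64, num_frames: int = 300):
        super().__init__()
        # One learned embedding per timestep (illustrative initialization).
        self.time_embeddings = nn.Parameter(0.1 * torch.randn(num_frames, feat_dim))

    def forward(self, voxel_features: torch.Tensor) -> torch.Tensor:
        # voxel_features: (N, feat_dim) features sampled from the dynamic voxel grid.
        # Returns (N, num_frames): one value per point per timestep, obtained with a
        # single inner product instead of num_frames separate network passes.
        return voxel_features @ self.time_embeddings.t()

# Example: query 300 frames for 4096 sampled points in one matmul.
query = InnerProductTimeQuery(feat_dim=64, num_frames=300)
values = query(torch.randn(4096, 64))  # shape: (4096, 300)
```

Because all timesteps are queried by a single matrix product, the cost of covering a whole 300-frame video grows with one matmul rather than with 300 network evaluations per point.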
In addition, we represent the dynamic scene with a mixed static-dynamic voxel-grid representation; voxel-grid representations have recently become popular due to their fast training and rendering speed on static scenes.
Specifically, the 3D space is split into static and dynamic voxels by our proposed variation field.
The two components are processed by different models to reduce redundant computation in the static space, as sketched below.
In principle, whenever a dynamic scene contains some static space, training speed benefits from the proposed mixed voxels.
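The following is a minimal PyTorch sketch of how a variation field could gate samples between a static and a dynamic branch. The threshold value, grid resolution, and the placeholder linear branches are assumptions made for illustration, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class MixedVoxelScene(nn.Module):
    """Route sampled points to a static or dynamic branch via a variation field."""

    def __init__(self, resolution=(128, 128, 128), variation_threshold=0.1):
        super().__init__()
        # Variation field: one scalar per voxel estimating its temporal variance.
        self.variation_field = nn.Parameter(torch.zeros(resolution))
        self.variation_threshold = variation_threshold
        # Placeholder branches; the real model uses voxel grids and small MLPs.
        self.static_branch = nn.Linear(3, 4)   # xyz -> (density, rgb), time-independent
        self.dynamic_branch = nn.Linear(3, 4)  # xyz -> time-dependent outputs in practice

    def dynamic_mask(self, voxel_indices: torch.Tensor) -> torch.Tensor:
        # voxel_indices: (N, 3) integer voxel coordinates of the sampled points.
        var = self.variation_field[voxel_indices[:, 0],
                                   voxel_indices[:, 1],
                                   voxel_indices[:, 2]]
        return var > self.variation_threshold

    def forward(self, points: torch.Tensor, voxel_indices: torch.Tensor) -> torch.Tensor:
        # Static voxels go through the cheaper time-independent branch, so the
        # dynamic branch only runs on voxels whose estimated variance is high.
        is_dynamic = self.dynamic_mask(voxel_indices)
        out = points.new_zeros(points.shape[0], 4)
        out[~is_dynamic] = self.static_branch(points[~is_dynamic])
        out[is_dynamic] = self.dynamic_branch(points[is_dynamic])
        return out
```

Thresholding the estimated temporal variance yields the static/dynamic split; since the static branch is time-independent, scenes with large static regions skip most of the per-frame computation.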
Result
This paper presents a new method named MixVoxels that efficiently reconstructs 4D dynamic scenes and synthesizes novel-view videos with only 15 minutes of training.
The core of our method is to split the 3D space into static and dynamic components with the proposed variation field and to process them with different branches.
The separation speeds up training and lets the dynamic branch focus on the dynamic parts, improving performance.
We also design an efficient dynamic voxel-grid representation with an inner-product time query.