Scalable Neural Video Representations with Learnable Positional Features
Succinct representation of complex signals using coordinate-based neural
representations (CNRs) has seen great progress, and several recent efforts
focus on extending them to handle videos. Here, the main challenge is to
(a) alleviate the compute inefficiency in training CNRs, (b) achieve
high-quality video encoding, and (c) maintain parameter efficiency. To
meet all three requirements simultaneously, we propose neural video
representations with learnable positional features (NVP), a novel CNR that
introduces "learnable positional features" to effectively amortize a video
as latent codes. Specifically, we first present a CNR architecture built on
2D latent keyframes that learn the common video content across each
spatio-temporal axis, which dramatically improves all three
requirements. Then, we propose to utilize existing powerful image and video
codecs as a compute- and memory-efficient compression procedure for the latent codes.
We demonstrate the superiority of NVP on the popular UVG benchmark; compared with
prior art, NVP not only trains 2 times faster (in less than 5 minutes) but also
improves the encoding quality from 34.07 to 34.57 (measured with the
PSNR metric), even while using $>$8 times fewer parameters. We also show intriguing
properties of NVP, e.g., video inpainting and video frame interpolation.
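
To make the idea of learnable positional features concrete, the sketch below is a minimal, hypothetical PyTorch rendering, not the authors' released code: three learnable 2D latent keyframes, one per spatio-temporal plane (x-y, x-t, y-t), are bilinearly interpolated at query coordinates and decoded to RGB by a small MLP. All module and parameter names are illustrative, and the full NVP architecture contains additional components (e.g., sparse positional features and modulation) omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentKeyframes(nn.Module):
    """Minimal sketch: three learnable 2D latent grids ("keyframes"), one per
    spatio-temporal plane, queried by bilinear interpolation."""
    def __init__(self, channels=8, resolution=64):
        super().__init__()
        # Learnable positional features stored as 2D grids of shape (1, C, R, R).
        self.planes = nn.ParameterDict({
            name: nn.Parameter(0.01 * torch.randn(1, channels, resolution, resolution))
            for name in ("xy", "xt", "yt")
        })

    def forward(self, coords):
        # coords: (N, 3) with (x, y, t) normalized to [-1, 1].
        x, y, t = coords[:, 0], coords[:, 1], coords[:, 2]
        feats = []
        for name, (u, v) in zip(("xy", "xt", "yt"), ((x, y), (x, t), (y, t))):
            grid = torch.stack([u, v], dim=-1).view(1, -1, 1, 2)   # (1, N, 1, 2)
            sampled = F.grid_sample(self.planes[name], grid,
                                    mode="bilinear", align_corners=True)
            feats.append(sampled.view(sampled.shape[1], -1).t())   # (N, C)
        return torch.cat(feats, dim=-1)                            # (N, 3*C)

class TinyNVP(nn.Module):
    """Latent keyframes followed by a small MLP mapping features to RGB."""
    def __init__(self, channels=8, resolution=64, hidden=64):
        super().__init__()
        self.keyframes = LatentKeyframes(channels, resolution)
        self.mlp = nn.Sequential(
            nn.Linear(3 * channels, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),
        )

    def forward(self, coords):
        return self.mlp(self.keyframes(coords))

# Usage: fit a random batch of (x, y, t) coordinates to RGB targets.
model = TinyNVP()
coords = torch.rand(1024, 3) * 2 - 1   # coordinates in [-1, 1]
targets = torch.rand(1024, 3)          # ground-truth pixel colors
loss = F.mse_loss(model(coords), targets)
loss.backward()
```

The design intuition, as we read it, is that storing shared content on 2D planes rather than a full 3D spatio-temporal grid keeps the latent-code footprint small, leaving the MLP to model only residual detail.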
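
The codec-based compression of latent codes can be pictured along the following lines. This is an assumption-laden sketch using Pillow's lossless PNG encoder as a stand-in for the stronger image/video codecs the abstract refers to, with hypothetical helper names: a learned latent plane is min-max quantized to 8 bits, written through the codec, and the quantization range is kept alongside for decoding.

```python
import numpy as np
from PIL import Image

def compress_latent_plane(plane: np.ndarray, path: str):
    """plane: (C, H, W) float array of learnable positional features.
    Quantize to 8 bits and save via an image codec (path should end in .png)."""
    lo, hi = plane.min(), plane.max()
    q = np.round((plane - lo) / (hi - lo + 1e-8) * 255).astype(np.uint8)
    # Stack channels vertically into one grayscale image for the codec.
    Image.fromarray(q.reshape(-1, q.shape[-1])).save(path)
    return lo, hi, plane.shape

def decompress_latent_plane(path: str, lo, hi, shape):
    """Invert the quantization using the stored range and original shape."""
    q = np.asarray(Image.open(path), dtype=np.float32).reshape(shape)
    return q / 255.0 * (hi - lo) + lo
```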