FactorMatte: Redefining Video Matting for Re-Composition Tasks
We propose factor matting, an alternative formulation of the video matting problem in terms of counterfactual video synthesis that is better suited for re-composition tasks.
The goal of factor matting is to separate the contents of video into independent components, each visualizing a counterfactual version of the scene where contents of other components have been removed.
We show that factor matting maps well to a more general Bayesian framing of the matting problem that accounts for complex conditional interactions between layers.
Based on this observation, we present a method for solving the factor matting problem that produces useful decompositions even for video with complex cross-layer interactions like splashes, shadows, and reflections.
Our method is trained per video and requires neither pre-training on large external datasets nor knowledge of the 3D structure of the scene.
We conduct extensive experiments and show that our method not only disentangles scenes with complex interactions but also outperforms top methods on existing tasks such as classical video matting and background subtraction.
Modern digital video matting functions much the same way as its analog predecessor, digitally mixing images according to the compositing equation.
This equation assumes that the content of a scene can be factored unambiguously into foreground and background layers, and re-compositing with these layers assumes that each can be manipulated independently of the other.
These assumptions place significant constraints on the source material being composited, which is especially problematic in the context of natural video matting, where the goal is to extract this source material from regular video for later re-composition.
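For reference, the compositing equation is the standard over operation: in the usual notation, each observed pixel I is an alpha-weighted blend of foreground F over background B,

\[
I = \alpha F + (1 - \alpha) B, \qquad \alpha \in [0, 1],
\]

and re-composition amounts to swapping F or B and re-applying this blend, which is only meaningful when the two layers can actually be manipulated independently.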
In this paper, we propose factor matting, an adjusted framing of the matting problem that factors video into more independent components for downstream editing tasks.
We formulate factor matting as counterfactual video synthesis and relate each counterfactual component to a conditional prior on the appearance of different scene content. We show that factor matting closely follows a Bayesian formulation of the matting problem in which common limiting assumptions about the independence of different layers have been removed.
Our solution to the factor matting problem offers a convenient framework for combining classical matting priors with conditional ones based on expected deformations in a scene.
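To make the Bayesian framing concrete, here is a sketch in illustrative notation (the symbols and factorization below are for exposition and are not the paper's exact formulation): the matte, foreground, and background are sought as

\[
\{F^{*}, B^{*}, \alpha^{*}\} \;=\; \arg\max_{F,\,B,\,\alpha}\; p(I \mid F, B, \alpha)\, p(F \mid B)\, p(B)\, p(\alpha).
\]

Classical matting implicitly assumes the layer priors factor as p(F, B) = p(F)p(B); retaining a conditional term such as p(F | B) is what allows content that depends on both layers, like shadows, splashes, and reflections, to be attributed to the appropriate component.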
Results
The skier sequence is a challenging re-composition case because of the complex overlaps between the foreground shadow, the background mountain trails, and the snow flung up as the skier lands on the mountainside.
For instance, in the figure the transparent water splash is both in front of and behind the child's foot. This case is particularly challenging because of the transparency of the water: the discriminator must learn the flow, color, and texture of the water from its surroundings and apply that knowledge to the region where the water and foot overlap. Our model separates not only the water surrounding the foot but also the water directly in front of it.
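As a rough illustration of how a patch discriminator could supply such a learned appearance prior for the water region, the sketch below scores whether reconstructed background patches look like unoccluded water; the class name, architecture, and loss are illustrative assumptions rather than the paper's implementation.

```python
# Illustrative sketch of a conditional appearance prior via a patch discriminator.
# The architecture and loss here are assumptions for exposition, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchDiscriminator(nn.Module):
    """Scores small patches for realism (e.g., does this patch look like real water?)."""
    def __init__(self, in_channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 1, 4, stride=1, padding=1),  # one realism logit per patch
        )

    def forward(self, patches):
        return self.net(patches)

def appearance_prior_loss(disc, reconstructed_patches):
    """Encourage background patches reconstructed under the foreground (e.g., water
    behind the foot) to be indistinguishable from visible, unoccluded water."""
    logits = disc(reconstructed_patches)
    return F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
```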
Our training can be expedited by providing a clean frame featuring the static background without foreground objects or conditional effects.
Acquiring this additional background image is often easy when users record the input video themselves, so we adopt this practice for videos sourced from cellphone recordings.
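A minimal sketch of how such a clean plate might be folded into per-video training, as an extra reconstruction target for the background layer (the function name, tensor shapes, and weighting are illustrative assumptions):

```python
# Illustrative sketch: use a clean background plate as extra supervision on the
# reconstructed background layer. Shapes and the weight are assumptions.
import torch.nn.functional as F

def clean_plate_loss(predicted_background, clean_plate, weight=1.0):
    """predicted_background, clean_plate: float tensors of shape (3, H, W)."""
    return weight * F.l1_loss(predicted_background, clean_plate)
```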
We test our method on videos with and without complex interactions.
To include more challenging test cases than those considered in prior work, we also collected clips from in-the-wild videos on YouTube and recorded additional videos with standard consumer cellphones.
For all the methods we evaluate, we use the provided segmentation masks, if any, or else an officially-documented method to automatically generate input masks.
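As one concrete example of automatic mask generation (not the specific tool used by any particular baseline), an off-the-shelf instance segmentation model such as torchvision's Mask R-CNN can produce a coarse binary input mask per frame:

```python
# Illustrative sketch: coarse input masks from an off-the-shelf instance
# segmentation model (torchvision Mask R-CNN, torchvision >= 0.13 weights API).
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

model = maskrcnn_resnet50_fpn(weights="DEFAULT").eval()

@torch.no_grad()
def coarse_mask(frame, score_thresh=0.7):
    """frame: float tensor of shape (3, H, W) with values in [0, 1]."""
    pred = model([frame])[0]
    keep = pred["scores"] > score_thresh
    if not keep.any():
        return torch.zeros(frame.shape[1:], dtype=torch.bool)
    masks = pred["masks"][keep, 0] > 0.5   # (N, H, W) boolean instance masks
    return masks.any(dim=0)                # union of detected instances
```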
Real-world videos featuring interacting elements lack ground-truth counterfactual components, so for these we provide qualitative evaluations. For quantitative comparisons, we use datasets and simulated videos that do have ground-truth decompositions.