3D convolutional networks are prevalent for video recognition. While
achieving excellent recognition performance on standard benchmarks, they
operate on a sequence of frames with 3D convolutions and t
Traditionally, 3D indoor scene reconstruction from posed images happens in
two phases: per-image depth estimation, followed by depth merging and surface
reconstruction. Recently, a family of methods h
We propose dilated cost volumes to capture small and large displacements simultaneously, allowing pixel-wise optical flow estimation without the sequential estimation strategy commonly adopted in state-of-the-art optical flow approaches.
To process the cost volume into pixel-wise optical flow, existing approaches employ 2D or separable 4D convolutions, which we show suffer from either high GPU memory consumption, inferior accuracy, or large model size.
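The sampling idea behind a dilated cost volume can be sketched as follows (a minimal NumPy illustration under our own assumptions, not the paper's implementation): reusing the same small per-axis window at several dilation rates covers a large displacement range with far fewer candidates than a dense window of the same reach.

```python
import numpy as np

def dilated_displacements(radius, dilations=(1, 2, 4)):
    """Candidate displacement offsets for a dilated cost volume: each
    dilation rate reuses the same (2*radius + 1) samples per axis, so the
    search range grows with the largest rate while the candidate count
    stays linear in the number of rates."""
    base = np.arange(-radius, radius + 1)
    offsets = []
    for rate in dilations:
        for dy in base * rate:
            for dx in base * rate:
                offsets.append((int(dy), int(dx)))
    # deduplicate offsets shared across rates (e.g. (0, 0))
    return sorted(set(offsets))
```

For example, `dilated_displacements(2, (1, 4))` reaches displacements up to ±8 pixels with 49 candidates, whereas a dense ±8 window would need 17 × 17 = 289.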
We present a novel 3D shape completion method that operates directly on unstructured point clouds, thus avoiding resource-intensive data structures like voxel grids. To this end, we introduce KAPLAN,
In this study, we bridge the gap between 2D and 3D convolutions by reinventing 2D convolutions.
We propose ACS (axial-coronal-sagittal) convolutions to perform natively 3D representation learning while utilizing weights pretrained on 2D datasets.
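The kernel-splitting idea can be illustrated roughly as follows (a sketch under the assumption that output channels are divided evenly across the three views; the function name is ours): a pretrained 2D kernel bank of shape `(C_out, C_in, k, k)` becomes three 3D kernel groups by inserting a singleton dimension along a different spatial axis for each anatomical view.

```python
import numpy as np

def acs_split(weight_2d):
    """Turn a pretrained 2D conv weight (C_out, C_in, k, k) into three
    3D kernel groups, one per anatomical view, by inserting a singleton
    spatial dimension at a different position for each group."""
    c_out = weight_2d.shape[0]
    # split output channels into three roughly equal groups
    sizes = [c_out // 3 + (1 if i < c_out % 3 else 0) for i in range(3)]
    g_a, g_c, g_s = np.split(weight_2d, np.cumsum(sizes)[:2], axis=0)
    axial    = g_a[:, :, np.newaxis, :, :]   # (.., 1, k, k): slides over axial slices
    coronal  = g_c[:, :, :, np.newaxis, :]   # (.., k, 1, k)
    sagittal = g_s[:, :, :, :, np.newaxis]   # (.., k, k, 1)
    return axial, coronal, sagittal
```

Each group can then be applied as an ordinary 3D convolution, so the total parameter count matches the original 2D layer.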
Motion, a defining characteristic of video, has been critical to the development of
video understanding models. Modern deep learning models leverage motion by
either executing spatio-temporal 3D convolution
We present an end-to-end network that we call Correlate-and-Excite (CoEx), which aggregates a cost volume computed from the input left and right images using 3D convolutions.
We show that simple channel excitation of the cost volume, guided by the image, can improve performance considerably, and we propose a novel method of applying top-k selection prior to soft-argmin disparity regression when computing the final disparity estimate.
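The top-k variant of soft-argmin regression can be sketched as follows (a minimal single-pixel NumPy illustration; the function name and the choice of k are ours, not the paper's): only the k lowest-cost disparity candidates contribute to the softmax-weighted average, which suppresses multi-modal cost distributions.

```python
import numpy as np

def topk_soft_argmin(cost, k=3):
    """Disparity regression: keep only the k lowest-cost candidates,
    then take a softmax-weighted average of their disparities.
    cost: (D,) matching cost per disparity candidate (lower = better)."""
    disparities = np.arange(len(cost), dtype=np.float64)
    # indices of the k smallest costs (partial sort, unordered within the k)
    idx = np.argpartition(cost, k - 1)[:k]
    # softmax over negated costs, restricted to the selected candidates
    logits = -cost[idx]
    logits -= logits.max()          # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum()
    return float((weights * disparities[idx]).sum())
```

With k equal to the number of candidates this reduces to the standard soft-argmin; smaller k trades smoothness for robustness to spurious cost minima.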
Multi-organ segmentation is one of the most successful applications of deep learning in medical image analysis.
In this work, we propose a new framework for combining 3D and 2D segmentation models, in which segmentation is realized through high-resolution 2D convolutions guided by spatial contextual information extracted from a low-resolution 3D model.
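One way such guidance could be wired up (a hypothetical fusion step in NumPy; the function, shapes, and nearest-neighbour upsampling are illustrative assumptions, not the paper's architecture) is to pick the slice-aligned low-resolution 3D context, upsample it to the 2D branch's resolution, and concatenate it with the high-resolution 2D features along the channel axis.

```python
import numpy as np

def fuse_3d_context(slice_feat, ctx_3d, z):
    """Concatenate low-res 3D context with high-res 2D slice features.
    slice_feat: (C1, H, W) features from the 2D branch for slice z.
    ctx_3d:     (C2, D, H//s, W//s) features from the low-res 3D model.
    Returns:    (C1 + C2, H, W) fused features."""
    _, H, W = slice_feat.shape
    _, _, h, w = ctx_3d.shape
    ctx = ctx_3d[:, z]                                   # context for slice z
    # nearest-neighbour upsampling to the 2D branch's resolution
    ctx = ctx.repeat(H // h, axis=1).repeat(W // w, axis=2)
    return np.concatenate([slice_feat, ctx], axis=0)
```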
We consider the problem of generating plausible and diverse video sequences,
when we are only given a start and an end frame. This task is also known as
inbetweening, and it belongs to the broader are
3D Convolutional Neural Networks (CNNs) have been widely applied to 3D scene
understanding, such as video analysis and volumetric image recognition.
However, 3D networks can easily lead to over-paramete
Neural Cellular Automata (NCAs) have been proven effective in simulating
morphogenetic processes, the continuous construction of complex structures from
very few starting cells. Recent developments in
3D point cloud interpretation is a challenging task due to the randomness and
sparsity of the component points. Many of the recently proposed methods like
PointNet and PointCNN have been focusing on l
Deep learning is attracting a growing audience in medical imaging
research. For the segmentation of medical images, we often rely on
volumetric data, and thus require the use of 3D
State-of-the-art 3D-aware generative models rely on coordinate-based MLPs to
parameterize 3D radiance fields. While demonstrating impressive results,
querying an MLP for every sample along each ray le
In recent years 3D object detection from LiDAR point clouds has made great
progress thanks to the development of deep learning technologies. Although
voxel- or point-based methods are popular in 3D obj
Video Instance Segmentation (VIS) is a task that simultaneously requires
classification, segmentation, and instance association in a video. Recent VIS
approaches rely on sophisticated pipelines to ach
Recent methods in stereo matching have continuously improved the accuracy
using deep models. This gain, however, is attained with a high increase in
computation cost, such that the network may not fit
This paper presents Volumetric Transformer Pose estimator (VTP), the first 3D
volumetric transformer framework for multi-view multi-person 3D human pose
estimation. VTP aggregates features from 2D key
Video-based person re-identification (re-ID) is an important technique in
visual surveillance systems which aims to match video snippets of people
captured by different cameras. Existing methods are m
Predicting future outcomes or reasoning about missing information
in a sequence is a key ability for agents to make intelligent
decisions. This requires strong temporally cohere