Seeing Beyond the Brain: Conditional Diffusion Model with Sparse Masked Modeling for Vision Decoding
Decoding visual stimuli from brain recordings aims to deepen our
understanding of the human visual system and to build a solid foundation for
bridging human and computer vision through brain-computer interfaces.
However, reconstructing high-quality images with correct semantics from brain
recordings is challenging due to the complex underlying representations of
brain signals and the scarcity of data annotations. In this
work, we present MinD-Vis: Sparse Masked Brain Modeling with Double-Conditioned
Latent Diffusion Model for Human Vision Decoding. First, we learn an
effective self-supervised representation of fMRI data using masked modeling in
a large latent space, inspired by the sparse coding of information in the
primary visual cortex (sketched below). Then, by augmenting a latent diffusion
model with double conditioning (also sketched below), we show that MinD-Vis
can reconstruct highly plausible images with semantically matching details
from brain recordings using very few paired annotations. We benchmarked our
model qualitatively and quantitatively; the experimental results indicate that
our method outperformed the previous state of the art in both semantic mapping
(100-way semantic classification) and generation quality (FID) by 66% and 41%,
respectively. An exhaustive ablation study was also conducted to analyze our
framework.
Authors
Zijiao Chen, Jiaxin Qing, Tiange Xiang, Wan Lin Yue, Juan Helen Zhou
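
Illustrative code sketches

The first stage, sparse masked brain modeling, follows the masked-autoencoder
recipe: most fMRI patches are dropped, a transformer encodes only the visible
ones in a large latent space, and a lightweight head reconstructs the masked
patches. Below is a minimal PyTorch sketch of this idea, assuming a simple 1D
patchification of the voxel vector; the class name, voxel count, dimensions,
and mask ratio are illustrative assumptions, not the released MinD-Vis
configuration.

    # Minimal MAE-style masked modeling on 1D fMRI signals (PyTorch).
    # All names, sizes, and the mask ratio below are illustrative assumptions.
    import torch
    import torch.nn as nn

    class SparseMaskedBrainModel(nn.Module):
        """Masks most fMRI patches and reconstructs them from the visible ones."""

        def __init__(self, num_voxels=4096, patch_size=16, embed_dim=1024,
                     depth=4, num_heads=8, mask_ratio=0.75):
            super().__init__()
            assert num_voxels % patch_size == 0
            self.num_patches = num_voxels // patch_size
            self.mask_ratio = mask_ratio
            # A large embed_dim relative to patch_size mirrors the "large
            # latent space" motivated by sparse coding in the visual cortex.
            self.patch_embed = nn.Linear(patch_size, embed_dim)
            self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))
            layer = nn.TransformerEncoderLayer(embed_dim, num_heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, depth)
            self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
            self.decoder = nn.Linear(embed_dim, patch_size)  # lightweight head

        def forward(self, fmri):  # fmri: (B, num_voxels), no labels needed
            B = fmri.size(0)
            patches = fmri.view(B, self.num_patches, -1)
            tokens = self.patch_embed(patches) + self.pos_embed
            # Randomly keep only a small visible subset of the patches.
            num_keep = int(self.num_patches * (1 - self.mask_ratio))
            ids_keep = torch.rand(B, self.num_patches,
                                  device=fmri.device).argsort(1)[:, :num_keep]
            idx = ids_keep.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
            visible = torch.gather(tokens, 1, idx)
            latent = self.encoder(visible)
            # Scatter encoded tokens back; masked slots receive the mask token.
            full = self.mask_token.repeat(B, self.num_patches, 1).scatter(1, idx, latent)
            pred = self.decoder(full)
            mask = torch.ones(B, self.num_patches,
                              device=fmri.device).scatter(1, ids_keep, 0.0)
            per_patch = ((pred - patches) ** 2).mean(-1)
            return (per_patch * mask).sum() / mask.sum()  # loss on masked patches only

    model = SparseMaskedBrainModel()
    loss = model(torch.randn(2, 4096))  # a batch of unlabeled fMRI vectors
    loss.backward()

Because the loss is computed only on masked patches from unlabeled recordings,
this stage needs no paired image annotations at all.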
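The second stage augments a latent diffusion model with double conditioning:
the fMRI representation steers the denoiser twice, once through the time-step
embedding and once through cross-attention. The following block is a minimal
sketch of one denoiser block under that reading; all names and dimensions are
assumptions for illustration, not the authors' implementation.

    # Minimal sketch of one double-conditioned denoiser block (PyTorch).
    # Names and dimensions are assumptions for illustration only.
    import torch
    import torch.nn as nn

    class DoubleConditionedBlock(nn.Module):
        """The fMRI embedding conditions this block through two pathways."""

        def __init__(self, dim=320, cond_dim=1024, num_heads=8):
            super().__init__()
            self.time_proj = nn.Linear(cond_dim, dim)  # pathway 1
            self.norm1 = nn.LayerNorm(dim)
            self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm2 = nn.LayerNorm(dim)
            self.cross_attn = nn.MultiheadAttention(dim, num_heads, kdim=cond_dim,
                                                    vdim=cond_dim,
                                                    batch_first=True)  # pathway 2
            self.mlp = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                     nn.GELU(), nn.Linear(4 * dim, dim))

        def forward(self, z, t_emb, fmri_emb):
            # z:        (B, L, dim)      noisy image latents, flattened spatially
            # t_emb:    (B, cond_dim)    diffusion time-step embedding
            # fmri_emb: (B, S, cond_dim) tokens from the pretrained fMRI encoder
            # Pathway 1: fold a global fMRI summary into the time embedding.
            h = z + self.time_proj(t_emb + fmri_emb.mean(1)).unsqueeze(1)
            q = self.norm1(h)
            h = h + self.self_attn(q, q, q)[0]
            # Pathway 2: cross-attend from image latents to fMRI tokens.
            h = h + self.cross_attn(self.norm2(h), fmri_emb, fmri_emb)[0]
            return h + self.mlp(h)

    block = DoubleConditionedBlock()
    z = torch.randn(2, 64, 320)             # an 8x8 latent grid, flattened
    t_emb = torch.randn(2, 1024)
    fmri_emb = torch.randn(2, 16, 1024)     # e.g. pooled fMRI encoder tokens
    print(block(z, t_emb, fmri_emb).shape)  # torch.Size([2, 64, 320])

Feeding the same fMRI embedding through two pathways gives the denoiser both a
global summary (via the time embedding) and token-level detail (via
cross-attention), which is one plausible way to preserve correct semantics
when only very few paired annotations are available for fine-tuning.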