MM-Pyramid: Multimodal Pyramid Attentional Network for Audio-Visual Event Localization and Video Parsing
Recognizing and localizing events in videos is a fundamental task for video understanding.
Most previous works analyze videos from a holistic perspective; however, they do not consider semantic information at multiple temporal scales, which makes it difficult for the model to localize events of various lengths.
In this paper, we present a Multimodal Pyramid Attentional Network (MM-Pyramid) that captures and integrates multi-level temporal features for audio-visual event localization and audio-visual video parsing.
Specifically, we first propose the attentive feature pyramid module, which captures temporal pyramid features via several stacked pyramid units, each of which is composed of a fixed-size attention block and a dilated convolution block.
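To make the unit structure concrete, the following is a minimal PyTorch-style sketch of one pyramid unit, assuming window-restricted self-attention for the fixed-size attention block and a residual dilated 1-D convolution for the convolution block. The class name `PyramidUnit`, the window size, and the dilation schedule are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class PyramidUnit(nn.Module):
    """Illustrative pyramid unit: fixed-size (windowed) attention followed
    by a dilated temporal convolution. Hypothetical sketch, not the
    paper's exact architecture."""

    def __init__(self, dim, window=4, dilation=1, heads=4):
        super().__init__()
        self.window = window
        # Fixed-size attention: self-attention restricted to local windows.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        # Dilated convolution: enlarges the temporal receptive field
        # while keeping the sequence length unchanged.
        self.conv = nn.Conv1d(dim, dim, kernel_size=3,
                              padding=dilation, dilation=dilation)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):  # x: (batch, time, dim)
        b, t, d = x.shape
        # Assumes t is divisible by the window size, for brevity.
        w = x.reshape(b * (t // self.window), self.window, d)
        w = w + self.attn(w, w, w, need_weights=False)[0]
        x = self.norm1(w.reshape(b, t, d))
        # Conv1d expects channels-first, so transpose around the conv.
        y = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return self.norm2(x + y)

# Stacking several units with growing dilation yields the pyramid:
pyramid = nn.Sequential(*[PyramidUnit(256, dilation=2 ** i) for i in range(4)])
feats = pyramid(torch.randn(2, 32, 256))  # (batch=2, time=32, dim=256)
```

Growing the dilation across stacked units is one plausible way to realize the multi-level temporal features the abstract describes, since each successive unit covers a longer temporal span.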
We also design an adaptive semantic fusion module, which leverages a unit-level attention block and a selective fusion block to integrate pyramid features interactively.
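In the same spirit, here is a hedged sketch of how unit-level attention and selective fusion could combine the stacked pyramid outputs: attention scores weigh the units per time step, and a sigmoid gate mixes the weighted features with the input representation. `AdaptiveSemanticFusion` and its gating form are hypothetical stand-ins for the blocks named above.

```python
import torch
import torch.nn as nn

class AdaptiveSemanticFusion(nn.Module):
    """Illustrative fusion module: unit-level attention over pyramid
    outputs, then a gated (selective) mix with the input features.
    Hypothetical sketch, not the paper's exact design."""

    def __init__(self, dim):
        super().__init__()
        # Unit-level attention: one score per pyramid unit per time step.
        self.unit_score = nn.Linear(dim, 1)
        # Selective fusion: sigmoid gate between fused and input features.
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, unit_feats, x):
        # unit_feats: (batch, units, time, dim), x: (batch, time, dim)
        scores = self.unit_score(unit_feats)        # (b, u, t, 1)
        weights = torch.softmax(scores, dim=1)      # attend over the units
        fused = (weights * unit_feats).sum(dim=1)   # (b, t, dim)
        g = torch.sigmoid(self.gate(torch.cat([fused, x], dim=-1)))
        return g * fused + (1 - g) * x              # selective mix

fusion = AdaptiveSemanticFusion(dim=256)
out = fusion(torch.randn(2, 4, 32, 256), torch.randn(2, 32, 256))
```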
Extensive experiments on audio-visual event localization and weakly-supervised audio-visual video parsing tasks verify the effectiveness of our approach.