ACCDOA: Activity-Coupled Cartesian Direction of Arrival Representation for Sound Event Localization and Detection
Kazuki Shimada, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Yuki Mitsufuji
Neural-network (NN)-based methods show high performance in sound event
localization and detection (SELD). Conventional NN-based methods use two
branches for a sound event detection (SED) target and a direction-of-arrival
(DOA) target. Training a single network on the two-branch representation
requires deciding how to balance the two objectives during optimization. Using
two networks, one dedicated to each task, increases system complexity and
network size. To address
these problems, we propose an activity-coupled Cartesian DOA (ACCDOA)
representation, which assigns a sound event activity to the length of a
corresponding Cartesian DOA vector. The ACCDOA representation enables us to
solve a SELD task with a single target and has two advantages: it removes the
need to balance the two objectives, and it avoids an increase in model size. In
experimental
evaluations with the DCASE 2020 Task 3 dataset, the ACCDOA representation
outperformed the two-branch representation in SELD metrics with a smaller
network size. The ACCDOA-based SELD system also performed better than
state-of-the-art SELD systems in terms of localization and location-dependent
detection.
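
The core idea, coupling activity to the length of a Cartesian DOA vector, can be illustrated with a minimal sketch for one event class in one time frame. The abstract specifies none of the following, so all of it is an assumption made for illustration: the function names `encode_accdoa` and `decode_accdoa`, the azimuth/elevation convention, and the 0.5 length threshold used to decide activity.

```python
import math


def encode_accdoa(active, azimuth_deg, elevation_deg):
    """Build a single ACCDOA-style target: a unit DOA vector scaled by activity.

    Assumed convention: azimuth is measured in the x-y plane from the x axis,
    elevation is measured from the x-y plane toward the z axis.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    scale = 1.0 if active else 0.0  # activity becomes the vector length
    return (
        scale * math.cos(el) * math.cos(az),
        scale * math.cos(el) * math.sin(az),
        scale * math.sin(el),
    )


def decode_accdoa(vector, threshold=0.5):
    """Recover (active, azimuth_deg, elevation_deg) from a predicted vector.

    The vector length is read as activity and the vector direction as the DOA.
    The threshold value is an illustrative assumption.
    """
    x, y, z = vector
    length = math.sqrt(x * x + y * y + z * z)
    if length <= threshold:
        return False, None, None  # too short: treat the event as inactive
    azimuth = math.degrees(math.atan2(y, x))
    # Clamp guards against tiny floating-point overshoot outside [-1, 1].
    elevation = math.degrees(math.asin(max(-1.0, min(1.0, z / length))))
    return True, azimuth, elevation


# Usage: an active event at azimuth 30 degrees, elevation 10 degrees.
target = encode_accdoa(True, 30.0, 10.0)
# A hypothetical imperfect prediction: same direction, shorter length.
prediction = tuple(0.8 * c for c in target)
print(decode_accdoa(prediction))
# An inactive event maps to the zero vector and decodes as inactive.
print(decode_accdoa(encode_accdoa(False, 30.0, 10.0)))
```

Because detection and localization are read from the same three numbers, a model trained on such a target needs only one output and one regression loss, which is the property the abstract highlights.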