Hierarchical Knowledge Distillation for Dialogue Sequence Labeling
This paper presents a novel knowledge distillation method for dialogue
sequence labeling. Dialogue sequence labeling is a supervised learning task
that estimates a label for each utterance in a target dialogue document and
is useful for many applications such as dialogue act estimation. Accurate
labeling is often realized by a hierarchically structured large model
consisting of utterance-level and dialogue-level networks that capture the
contexts within an utterance and between utterances, respectively. However,
such a large model cannot be deployed on
resource-constrained devices. To overcome this difficulty, we focus on
knowledge distillation, which trains a small model by distilling the knowledge
of a large, high-performance teacher model. Our key idea is to distill the
knowledge while keeping the complex contexts captured by the teacher model. To
this end, the proposed method, hierarchical knowledge distillation, distills
not only the probability distribution of the label classification but also the
utterance-level and dialogue-level contexts learned by the teacher model, by
training the small model to mimic the teacher model's output at each level.
Experiments on dialogue act estimation
and call scene segmentation demonstrate the effectiveness of the proposed
method.
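
As a rough illustration, the following is a minimal sketch of a training objective in the spirit of hierarchical knowledge distillation, assuming a PyTorch setting. The function name, the use of KL divergence for the label-level term, the mean-squared-error terms for the utterance-level and dialogue-level contexts, the loss weights, and all tensor shapes are illustrative assumptions rather than the paper's exact formulation (for example, projection layers may be needed if the teacher and student hidden sizes differ).

```python
import torch
import torch.nn.functional as F


def hierarchical_distillation_loss(
    student_logits, teacher_logits,   # (batch, num_utts, num_labels) label scores
    student_utt, teacher_utt,         # (batch, num_utts, d) utterance-level contexts
    student_dlg, teacher_dlg,         # (batch, num_utts, d) dialogue-level contexts
    temperature=2.0, alpha=1.0, beta=1.0,  # illustrative hyperparameters
):
    t = temperature
    num_labels = student_logits.size(-1)

    # (1) Label-level distillation: KL divergence between the softened
    #     teacher and student label distributions for every utterance.
    kd_label = F.kl_div(
        F.log_softmax(student_logits.reshape(-1, num_labels) / t, dim=-1),
        F.softmax(teacher_logits.reshape(-1, num_labels) / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)

    # (2) Utterance-level distillation: train the student's utterance-level
    #     context vectors to mimic the teacher's.
    kd_utt = F.mse_loss(student_utt, teacher_utt)

    # (3) Dialogue-level distillation: same idea for dialogue-level contexts.
    kd_dlg = F.mse_loss(student_dlg, teacher_dlg)

    return kd_label + alpha * kd_utt + beta * kd_dlg


# Toy usage with random tensors (in practice the teacher outputs are
# produced by the frozen teacher model).
B, U, L, D = 2, 5, 4, 16
loss = hierarchical_distillation_loss(
    torch.randn(B, U, L), torch.randn(B, U, L),
    torch.randn(B, U, D), torch.randn(B, U, D),
    torch.randn(B, U, D), torch.randn(B, U, D),
)
```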