Data augmentation has recently seen increased interest in NLP due to more
work in low-resource domains, new tasks, and the popularity of large-scale
neural networks that require large amounts of training data. Despite this
recent upsurge, the area is still relatively underexplored, perhaps due to the
challenges posed by the discrete nature of language data. In this paper, we
present a comprehensive and unifying survey of data augmentation for NLP by
summarizing the literature in a structured manner. We first introduce and
motivate data augmentation for NLP, and then discuss major, methodologically
representative approaches. Next, we highlight techniques that are used for
popular NLP applications and tasks. We conclude by outlining current challenges
and directions for future research. Overall, our paper aims to clarify the
landscape of existing literature in data augmentation for NLP and motivate
additional work in this area.
Authors
Steven Y. Feng, Varun Gangal, Jason Wei, Sarath Chandar, Soroush Vosoughi, Teruko Mitamura, Eduard Hovy