A Survey on Data Augmentation for Text Classification
Data augmentation, the artificial creation of training data for machine
learning by transformations, is a widely studied research field across machine
learning disciplines. While it is useful for increasing the generalization
capabilities of a model, it can also address many other challenges and
problems, from overcoming a limited amount of training data over regularizing
the objective to limiting the amount data used to protect privacy. Based on a
precise description of the goals and applications of data augmentation (C1) and
a taxonomy for existing works (C2), this survey is concerned with data
augmentation methods for textual classification and aims to achieve a concise
and comprehensive overview for researchers and practitioners (C3). Derived from
the taxonomy, we divided more than 100 methods into 12 different groupings and
provide state-of-the-art references expounding which methods are highly
promising (C4). Finally, research perspectives that may constitute a building
block for future work are given (C5).