Deep neural networks are becoming increasingly popular due to their
revolutionary success in diverse areas, such as computer vision, natural
language processing, and speech recognition. However, the decision-making
processes of these models are generally not interpretable to users. In various
domains, such as healthcare, finance, or law, it is critical to know the
reasons behind a decision made by an artificial intelligence system. Therefore,
several directions for explaining neural models have recently been explored.
In this thesis, I investigate two major directions for explaining deep neural
networks. The first direction consists of feature-based post-hoc explanatory
methods, that is, methods that aim to explain an already trained and fixed
model (post-hoc), and that provide explanations in terms of input features,
such as tokens for text and superpixels for images (feature-based). The second
direction consists of self-explanatory neural models that generate natural
language explanations, that is, models with a built-in module that
generates explanations for the model's predictions.
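To make the first direction concrete, the following is a minimal sketch of one feature-based post-hoc explanation technique, occlusion-based attribution over text tokens. It is an illustrative example only, not a method proposed in this thesis; the black-box `model` predictor, the `[MASK]` token, and the toy sentiment scorer are hypothetical stand-ins.

from typing import Callable, List

def occlusion_attribution(
    model: Callable[[List[str]], float],  # hypothetical black-box predictor
    tokens: List[str],
    mask_token: str = "[MASK]",
) -> List[float]:
    """Score each token by how much the prediction drops when that token is masked."""
    baseline = model(tokens)
    scores = []
    for i in range(len(tokens)):
        occluded = tokens[:i] + [mask_token] + tokens[i + 1:]
        scores.append(baseline - model(occluded))
    return scores

if __name__ == "__main__":
    # Toy stand-in "model": fraction of tokens that are positive sentiment words.
    positive = {"great", "excellent", "good"}
    toy_model = lambda toks: sum(t in positive for t in toks) / len(toks)
    sentence = "the movie was great".split()
    print(dict(zip(sentence, occlusion_attribution(toy_model, sentence))))

Tokens whose masking causes the largest drop in the predicted score receive the highest attribution, which is the sense in which such methods explain a fixed model in terms of its input features.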
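For the second direction, the sketch below shows, in schematic PyTorch, how a self-explanatory classifier could pair a prediction head with a built-in explanation decoder trained to generate natural language explanations. The layer sizes, module names, and GRU-based architecture are assumptions made for illustration and do not reflect the specific models studied in this thesis.

import torch
import torch.nn as nn

class SelfExplainingClassifier(nn.Module):
    """Schematic model: one encoder shared by a label head and an explanation decoder."""

    def __init__(self, vocab_size=10000, emb_dim=128, hid_dim=256, num_labels=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.label_head = nn.Linear(hid_dim, num_labels)           # predicts the label
        self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)  # generates the explanation
        self.token_head = nn.Linear(hid_dim, vocab_size)           # next-token logits

    def forward(self, input_ids, explanation_ids):
        # Encode the input; the final hidden state summarises it.
        _, summary = self.encoder(self.embed(input_ids))
        label_logits = self.label_head(summary.squeeze(0))
        # Decode the explanation conditioned on the same summary (teacher forcing).
        dec_out, _ = self.decoder(self.embed(explanation_ids), summary)
        explanation_logits = self.token_head(dec_out)
        return label_logits, explanation_logits

if __name__ == "__main__":
    model = SelfExplainingClassifier()
    x = torch.randint(0, 10000, (2, 12))     # batch of 2 inputs, 12 tokens each
    e = torch.randint(0, 10000, (2, 8))      # explanation tokens for teacher forcing
    label_logits, expl_logits = model(x, e)  # shapes: (2, 3) and (2, 8, 10000)

In this kind of setup, training would typically minimise a joint loss, cross-entropy on the label plus cross-entropy on the explanation tokens, so that the explanation module is learned together with the predictor rather than applied post hoc.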