Causal Attribution for the Interpretation of Black Box Predictive Models

A Causal Lens for Peeking into Black Box Predictive Models: Predictive Model Interpretation via Causal Attribution

Predictive models trained using machine learning are used across a wide range of high-stakes applications. We reduce the problem of interpreting a black box predictive model to that of estimating the causal effect of each model input on the model output, from observations of the model inputs and the corresponding outputs. We estimate these causal effects using variants of the potential outcomes framework for estimating causal effects from observational data. We show how the resulting causal attribution of responsibility for model output to the different model inputs can be used to interpret the predictive model and to explain its predictions. We present results of experiments that demonstrate the effectiveness of our approach in the case of deep neural network models trained on one synthetic data set (where the input variables that impact the output variable are known by design) and two real-world data sets.
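To make the reduction concrete, the following is a minimal sketch of causal attribution from input-output observations alone. It is not the estimator used in this work: it assumes a simple regression-adjustment estimator from the potential outcomes literature, a binary "treatment" defined as an input lying above its median, and a hypothetical stand-in function `black_box` in place of a trained deep neural network. All names are illustrative.

```python
import numpy as np


def black_box(X):
    # Hypothetical stand-in for a trained model that we may only query.
    # By construction, only inputs 0 and 1 influence the output.
    return 3.0 * X[:, 0] - 2.0 * X[:, 1]


def ate_regression_adjustment(X, y, j):
    """Estimate the average effect on y of input j being above its median.

    Assumption: the treatment is the binarized input j, and all other
    inputs are treated as covariates to adjust for.
    """
    t = X[:, j] > np.median(X[:, j])          # binary treatment indicator
    Z = np.delete(X, j, axis=1)               # remaining inputs as covariates
    Z1 = np.column_stack([np.ones(len(Z)), Z])  # add an intercept column

    # Fit one linear outcome model per treatment arm.
    beta_treated, *_ = np.linalg.lstsq(Z1[t], y[t], rcond=None)
    beta_control, *_ = np.linalg.lstsq(Z1[~t], y[~t], rcond=None)

    # Predict both potential outcomes for every unit and average the gap.
    return float(np.mean(Z1 @ beta_treated - Z1 @ beta_control))


rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 4))   # observed model inputs (synthetic)
y = black_box(X)                 # observed model outputs

# One estimated causal effect per input; larger magnitude suggests
# greater responsibility for the model output.
effects = [ate_regression_adjustment(X, y, j) for j in range(X.shape[1])]
for j, effect in enumerate(effects):
    print(f"input {j}: estimated effect {effect:+.3f}")
```

In this synthetic setting the inputs that drive the output are known by design, so the attribution can be checked directly: inputs 0 and 1 should receive effects of large magnitude and opposite sign, while inputs 2 and 3 should receive effects near zero. The linear outcome models and the median split are simplifying assumptions; correlated inputs or a nonlinear model would call for a more careful estimator.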