We define a set of desiderata that capture both the aspirations of the explainable artificial intelligence (XAI) community and the practical constraints of deep learning.
We describe an effective way to satisfy all the desiderata: train the AI system to build a causal model of itself.
We implement this method in a simulated 3D environment, and show how it enables agents to generate faithful and semantically meaningful explanations of their own behavior.
These learned models provide new ways of building semantic control interfaces to artificial intelligence systems.
We argue that providing explanations of a system's behavior is, in essence, a task of building a model of that system.
We thus train the base system to supply a self-model alongside its main representations.
The self-model's different form allows it to deliver supplementary utility that the base system does not, e.g., providing an interface to external users and enabling them to understand, predict, and control the base system in convenient ways.
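As a minimal sketch of this setup (the class, dimensions, message format, and loss weighting below are illustrative assumptions, not the paper's implementation), the base network can be given an auxiliary self-model head that emits a message about the agent's own state and is trained jointly with the main task:

```python
import torch
import torch.nn as nn

class AgentWithSelfModel(nn.Module):
    """Base policy plus an auxiliary self-model head (illustrative sketch)."""

    def __init__(self, obs_dim=64, hidden_dim=128, n_actions=8,
                 vocab_size=32, msg_len=4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden_dim), nn.ReLU())
        # Main head: action logits for the base task.
        self.policy_head = nn.Linear(hidden_dim, n_actions)
        # Self-model head: a short message describing the agent's own
        # belief state, one token-logit vector per message slot.
        self.self_model_head = nn.Linear(hidden_dim, msg_len * vocab_size)
        self.msg_len, self.vocab_size = msg_len, vocab_size

    def forward(self, obs):
        h = self.encoder(obs)
        action_logits = self.policy_head(h)
        msg_logits = self.self_model_head(h).view(-1, self.msg_len, self.vocab_size)
        return action_logits, msg_logits

def joint_loss(action_logits, msg_logits, action_target, msg_target, alpha=0.5):
    """Task loss plus a supervised loss on the self-model's message."""
    task_loss = nn.functional.cross_entropy(action_logits, action_target)
    msg_loss = nn.functional.cross_entropy(
        msg_logits.reshape(-1, msg_logits.size(-1)), msg_target.reshape(-1))
    return task_loss + alpha * msg_loss
```

The weighting term alpha (a hypothetical hyperparameter here) trades off task performance against the quality of the self-model.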
Results
We train agents to model their belief state using messages in a synthetic language.
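As a hypothetical example of such a synthetic language (the vocabulary and slot structure are assumptions for illustration, not the paper's actual message format), a belief state can be serialized as a fixed-slot token sequence:

```python
# Hypothetical synthetic-language encoding: each message is a fixed-slot
# token sequence such as ("TARGET", "IN", "ROOM_3").
OBJECTS = ["TARGET", "KEY", "DOOR"]
ROOMS = [f"ROOM_{i}" for i in range(4)]
VOCAB = OBJECTS + ["IN"] + ROOMS
TOKEN_ID = {tok: i for i, tok in enumerate(VOCAB)}

def encode_belief(obj: str, room: str) -> list[int]:
    """Serialize 'obj is in room' as token ids for the self-model target."""
    return [TOKEN_ID[obj], TOKEN_ID["IN"], TOKEN_ID[room]]

def decode_message(ids: list[int]) -> str:
    inv = {i: tok for tok, i in TOKEN_ID.items()}
    return " ".join(inv[i] for i in ids)

assert decode_message(encode_belief("TARGET", "ROOM_2")) == "TARGET IN ROOM_2"
```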
We focus on the faithfulness metrics introduced previously.
To measure this, we run evaluation episodes in which, at the start of each trial, we inject a message indicating that the instructed tag is in a randomly chosen room.
We find that the greatest gain of this strategy is its ability to imbue the self-model with causal faithfulness.
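A sketch of this injection protocol, assuming hypothetical agent.inject_message, env.reset, and env.rollout interfaces and reusing encode_belief from the sketch above, might look like:

```python
import random

def faithfulness_score(agent, env, rooms, n_trials=100):
    """Hypothetical injection test: at the start of each trial, overwrite
    the agent's self-model message to claim the instructed tag is in a
    random room, then check whether behavior follows the injected belief."""
    consistent = 0
    for _ in range(n_trials):
        injected_room = random.choice(rooms)
        env.reset()
        # Intervention: force the self-model's belief message.
        agent.inject_message(encode_belief("TARGET", injected_room))
        visited = env.rollout(agent)  # rooms visited under the agent's policy
        # A causally faithful self-model should send the agent to search
        # the room named in the injected message first.
        if visited and visited[0] == injected_room:
            consistent += 1
    return consistent / n_trials  # fraction of belief-consistent trials
```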