We study whether language models can evaluate the validity of their own claims and predict which questions they will be able to answer correctly.
We first show that larger models are well-calibrated on diverse multiple choice and true/false questions when the questions are provided in the right format.
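A minimal sketch of what "well-calibrated" means here, assuming we already have, for each question, the probability the model assigned to its chosen answer and whether that answer was correct; the binning scheme and names are illustrative, not the paper's exact evaluation code.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence and compare mean confidence to accuracy in each bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap  # weight each bin by its share of examples
    return ece

# A well-calibrated model's 70%-confidence answers are right roughly 70% of the time,
# so the gap in each bin (and hence the ECE) stays small.
print(expected_calibration_error([0.9, 0.7, 0.6, 0.3], [1, 1, 0, 0]))
```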
Performance at self-evaluation is further improved when we allow models to consider many of their own samples before predicting the validity of one specific possibility.
Thus we can approach self-evaluation on open-ended sampling tasks by asking models to first propose answers, and then to evaluate the probability "P(True)" that their answers are correct.
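A minimal sketch of the P(True) self-evaluation idea: show the model several of its own samples plus one proposed answer, then read off the probability it assigns to the "True" option. The prompt wording and the `token_probability` helper are assumptions for illustration, not the paper's exact template or API.

```python
def build_p_true_prompt(question, samples, proposed_answer):
    """Assemble a prompt that asks the model to judge one of its own answers."""
    sample_block = "\n".join(f"Possible Answer: {s}" for s in samples)
    return (
        f"Question: {question}\n"
        f"{sample_block}\n"
        f"Proposed Answer: {proposed_answer}\n"
        "Is the proposed answer:\n"
        " (A) True\n"
        " (B) False\n"
        "The proposed answer is:"
    )

def p_true(model, question, samples, proposed_answer):
    prompt = build_p_true_prompt(question, samples, proposed_answer)
    # `token_probability` is a hypothetical method returning the model's
    # next-token probability for the given continuation.
    return model.token_probability(prompt, " (A)")
```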
Next, we investigate whether models can be trained to predict "P(IK)", the probability that "I know" the answer to a question, without reference to any particular proposed answer.
Models perform well at predicting "P(IK)", and partially generalize across tasks, though they struggle with calibration of "P(IK)" on new tasks.
The predicted probabilities also increase appropriately in the presence of relevant source materials in the context, and in the presence of hints towards the solution of mathematical word problems.
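A minimal sketch of the P(IK) setup: label each question by whether the model's own sampled answers were correct, then train a predictor of that label from the question alone. Here a logistic regression over question embeddings stands in for the paper's approach of training the language model itself to make this prediction; `embed_question` and `model_answers_correctly` are hypothetical helpers.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_p_ik(questions, embed_question, model_answers_correctly):
    """Fit a probe mapping a question representation to P(the model knows the answer)."""
    X = np.stack([embed_question(q) for q in questions])
    y = np.array([int(model_answers_correctly(q)) for q in questions])
    probe = LogisticRegression(max_iter=1000).fit(X, y)
    return probe

# Usage: probe.predict_proba(embed_question(q).reshape(1, -1))[0, 1] estimates P(IK)
# for a new question q, with no proposed answer in sight.
```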
Authors
Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, Scott Johnston, Sheer El-Showk, Andy Jones, Nelson Elhage, Tristan Hume, Anna Chen, Yuntao Bai, Sam Bowman, Stanislav Fort, Deep Ganguli