FIB: A Benchmark for Factual Inconsistency in Large Language Models
Evaluating the Factual Consistency of Large Language Models Through Summarization
Large language models (LLMs) have proven effective on a wide variety of tasks, but they are also known to hallucinate information.
To measure whether an LLM prefers factually consistent continuations of its input, we propose a new benchmark called FIB (Factual Inconsistency Benchmark) that focuses on the task of summarization.
Specifically, our benchmark involves comparing the scores an LLM assigns to a factually consistent versus a factually inconsistent summary for an input news article.
We validate design choices in our benchmark, including the scoring method and the source of distractor summaries.
We find that existing large language models generally assign a higher score to factually consistent summaries than to factually inconsistent summaries.
However, when a factually inconsistent summary occurs verbatim in the document, existing LLMs assign it a higher score than the factually consistent summary.
Authors
Derek Tam, Anisha Mascarenhas, Shiyue Zhang, Sarah Kwan, Mohit Bansal, Colin Raffel
Factual inconsistency is a widespread problem in natural language generation tasks.
Prior work on measuring and mitigating it usually studies supervised summarization models that are either trained from scratch or fine-tuned from a pre-trained language model.
Recently, however, there has been a paradigm shift towards using large language models (LLMs) rather than supervised models.
In light of this new paradigm, our goal is to evaluate the factual consistency of large language models using text summarization as a testbed.
To achieve this goal, we propose the Factual Inconsistency Benchmark (FIB) to measure how often models prefer factually consistent summaries over factually inconsistent summaries.
The benchmark consists of over 3,500 pairs of summaries that were manually annotated as either factually consistent or factually inconsistent.
To explore the behavior of existing models on this benchmark, we evaluate 23 LLMs from 6 different model families, including BLOOM, OPT, GPT, and T0, ranging from 1B to 176B parameters.
We use accuracy on this binary classification task as a proxy for how factually consistent a model is.
To validate our choice of distractor summaries, we also evaluate these models on factually inconsistent summaries from three additional sources: (1) unedited reference summaries that we annotated as factually inconsistent, (2) summaries edited via FactCC, and (3) summaries produced by MFMA.
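To make the evaluation protocol concrete, the sketch below scores each candidate summary with a causal language model and reports the fraction of pairs in which the factually consistent summary receives the higher score. It is a minimal illustration, not the paper's exact setup: the prompt template, the use of GPT-2 as a stand-in model, and length-normalized log-likelihood scoring are all assumptions made for the example.

    # Minimal sketch of the pairwise evaluation; GPT-2, the prompt template, and
    # the length-normalized log-likelihood score are illustrative assumptions.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    def summary_score(document: str, summary: str) -> float:
        """Length-normalized log-likelihood of the summary given the document."""
        prompt = f"Document: {document}\nSummary:"
        prompt_ids = tokenizer(prompt, return_tensors="pt", truncation=True,
                               max_length=768).input_ids
        summary_ids = tokenizer(" " + summary, return_tensors="pt").input_ids
        input_ids = torch.cat([prompt_ids, summary_ids], dim=1)
        with torch.no_grad():
            logits = model(input_ids).logits
        # Log-probability of each next token under the model.
        log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
        targets = input_ids[0, 1:]
        token_log_probs = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
        # Keep only the summary tokens and average over their count.
        summary_log_probs = token_log_probs[prompt_ids.shape[1] - 1:]
        return summary_log_probs.mean().item()

    def pairwise_accuracy(examples) -> float:
        """Fraction of (document, consistent, inconsistent) triples in which
        the factually consistent summary gets the higher score."""
        correct = sum(
            summary_score(doc, consistent) > summary_score(doc, inconsistent)
            for doc, consistent, inconsistent in examples
        )
        return correct / len(examples)

Running pairwise_accuracy over the benchmark's document and summary pairs yields the binary-classification accuracy used as the factual-consistency proxy described above.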
Results
We present FIB, a new benchmark for evaluating the factual consistency of language models, and evaluate 23 LLMs on it.
Our takeaways are: (1) LLMs generally assign higher scores to factually consistent summaries than to factually inconsistent ones, but prefer factually inconsistent summaries that occur verbatim in the document; (2) length-normalized pointwise mutual information (PMI) enables models to most effectively detect factually inconsistent summaries; and (3) FactCC-generated summaries are often assigned high scores by zero-shot decoder-only models.
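For reference, the length-normalized PMI score in takeaway (2) can be viewed as the difference between the conditional and unconditional log-likelihood of the summary, averaged over the summary's tokens. The sketch below reuses the summary_score helper from the earlier example; treating an empty document as the unconditional context is an assumption made for illustration, not necessarily the paper's exact formulation.

    # Hedged sketch of length-normalized PMI scoring, built on the earlier
    # summary_score helper. Both terms are averaged over the same summary
    # tokens, so their difference equals the length-normalized PMI.
    def unconditional_score(summary: str) -> float:
        """Length-normalized log-likelihood of the summary with no document
        (an illustrative stand-in for the unconditional term)."""
        return summary_score(document="", summary=summary)

    def pmi_score(document: str, summary: str) -> float:
        """(log P(summary | document) - log P(summary)) / num_summary_tokens."""
        return summary_score(document, summary) - unconditional_score(summary)

Under this kind of scoring, a summary is preferred when conditioning on the document raises its likelihood the most, which penalizes candidates that are merely fluent or generic rather than grounded in the article.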