We present Holistic Evaluation of Language Models (HELM) to improve the transparency of language models.
First, we taxonomize the vast space of potential scenarios (i.e., use cases) and metrics (i.e., desiderata) that are of interest for language models. Then we select a broad subset based on coverage and feasibility, noting what is missing or underrepresented (e.g., question answering for neglected English dialects, metrics for trustworthiness).
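As a rough illustration of this "taxonomize, then select" step, the following minimal Python sketch selects a feasible subset of candidate scenarios and records what is left out. The Scenario record, its fields, and the selection rule are hypothetical simplifications, not HELM's actual data structures.

```python
# Illustrative sketch: describe each candidate scenario with a few attributes,
# select a feasible subset, and make explicit what remains uncovered.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class Scenario:
    name: str        # e.g. a particular question-answering use case
    task: str        # coarse task family used for coverage bookkeeping
    feasible: bool   # do we have the data and compute to run it?
    priority: int    # coverage-driven priority (lower = more important)

def select_scenarios(candidates: List[Scenario], budget: int) -> Tuple[List[Scenario], List[str]]:
    """Pick up to `budget` feasible scenarios by priority; report what is missing."""
    feasible = sorted((s for s in candidates if s.feasible), key=lambda s: s.priority)
    selected = feasible[:budget]
    missing = [s.name for s in candidates if s not in selected]
    return selected, missing
```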
Second, we adopt a multi-metric approach: we measure 7 metrics (accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency) for each of 16 core scenarios when possible (87.5% of the time). This ensures that metrics beyond accuracy do not fall by the wayside and that trade-offs are clearly exposed. We also perform 7 targeted evaluations, based on 26 targeted scenarios, to analyze specific aspects (e.g., reasoning, disinformation).
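To make the scenario-by-metric grid concrete, here is a minimal illustrative sketch, not HELM's actual API: the CORE_SCENARIOS and METRICS registries, the run_scenario callables, and the metric functions are all assumed placeholders. It measures every metric on every core scenario and tracks how much of the grid is actually measurable (the 87.5% figure above).

```python
# Illustrative sketch (placeholder registries, not HELM's real implementation):
# evaluate every metric for every core scenario and track grid coverage.

from typing import Callable, Dict, Tuple

CORE_SCENARIOS: Dict[str, Callable] = {}  # name -> function running the model on that scenario
METRICS: Dict[str, Callable] = {}         # accuracy, calibration, robustness, fairness, ...

def evaluate_model(model) -> Dict[Tuple[str, str], float]:
    """Return a (scenario, metric) -> score grid for a single model."""
    results: Dict[Tuple[str, str], float] = {}
    measured, total = 0, 0
    for scenario_name, run_scenario in CORE_SCENARIOS.items():
        predictions = run_scenario(model)  # model predictions on this scenario
        for metric_name, metric_fn in METRICS.items():
            total += 1
            try:
                results[(scenario_name, metric_name)] = metric_fn(predictions)
                measured += 1
            except NotImplementedError:
                # Some metrics are undefined for some scenarios, which is why
                # the grid is only partially (e.g. ~87.5%) filled in.
                continue
    print(f"grid coverage: {measured / max(total, 1):.1%}")
    return results
```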
Third, we conduct a large-scale evaluation of 30 prominent language models (spanning open, limited-access, and closed models) on all 42 scenarios, 21 of which were not previously used in mainstream language model evaluation.
Our evaluation surfaces 25 top-level findings.
Authors
Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher Ré, Diana Acosta-Navas, Drew A. Hudson, Eric Zelikman
The rapid development and rising impact of language models, combined with our inadequate understanding of them, demand that we benchmark language models holistically: the immense surface of model capabilities, limitations, and risks remains poorly understood.
We believe holistic evaluation involves three elements: given language models’ vast surface of capabilities and risks, we need to evaluate them over a broad range of scenarios.
And for each scenario, we may have a broad set of desiderata: models should be accurate, robust, fair, efficient, and so on.
Holistic evaluation should represent these plural desiderata, evaluating every desideratum for each scenario considered.
Our object of evaluation is the language model, not a scenario-specific system.
Therefore, holistic evaluation should provide a top-down taxonomy and make explicit all the major scenarios and metrics that are missing.
Overall, holistic evaluation builds transparency by assessing language models in their totality.
Results
We evaluate 30 language models on the 16 core scenarios.
Our evaluation offers an immense array of model predictions along with quantitative metrics for these predictions.
Here, we provide a succinct analysis, foregrounding unique aspects of our evaluation that are made possible by its broad, holistic, and systematic nature.
We use our comprehensive evaluation to answer important questions in the field, such as whether model accuracy correlates with scale or whether more robust models are less biased.
We encourage interactive exploration of these results: we believe that grounding and interrogating the various quantitative trends by mapping them to explicit model behaviors is necessary for the community to build a common understanding of these models.
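As one example of the kind of cross-cutting analysis this enables, the sketch below computes simple rank correlations across models. Here, results.csv and its column names (parameters, accuracy, robustness, bias) are hypothetical stand-ins for a per-model summary table, not a file we distribute in this form.

```python
# Illustrative sketch: probe whether accuracy tracks model scale and whether
# robustness and bias move together, using Spearman rank correlations.

import pandas as pd
from scipy.stats import spearmanr

# Hypothetical per-model summary table (one row per model).
df = pd.read_csv("results.csv")

rho_scale, _ = spearmanr(df["parameters"], df["accuracy"])
rho_rob_bias, _ = spearmanr(df["robustness"], df["bias"])

print(f"accuracy vs. scale (Spearman rho): {rho_scale:.2f}")
print(f"robustness vs. bias (Spearman rho): {rho_rob_bias:.2f}")
```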