Lila: A Unified Benchmark for Mathematical Reasoning
We propose a unified mathematical reasoning benchmark consisting of 23 diverse tasks along four dimensions: (i) mathematical abilities, e.g., arithmetic, calculus; (ii) language format, e.g., question-answering, fill-in-the-blanks; (iii) language diversity, e.g., no language, simple language; (iv) external knowledge, e.g., commonsense, physics.
We construct our benchmark by extending 20 existing datasets, collecting task instructions and solutions in the form of Python programs, thereby obtaining explainable solutions in addition to the correct answer.
We additionally introduce two evaluation datasets to measure out-of-distribution performance and robustness to language perturbation.
Importantly, we find that multi-tasking leads to significant improvements (an average relative improvement of 21.83% in F1 score over single-task models), while the best-performing model obtains only 60.40%, indicating room for improvement in general mathematical reasoning and understanding.
Authors
Swaroop Mishra, Matthew Finlayson, Pan Lu, Leonard Tang, Sean Welleck, Chitta Baral, Tanmay Rajpurohit, Oyvind Tafjord, Ashish Sabharwal, Peter Clark, Ashwin Kalyan
We introduce Lila, a unified mathematical reasoning benchmark that consists of 23 mathematical reasoning tasks.
Lila is constructed by extending 20 existing datasets spanning a wide range of topics in mathematics, varying degrees of linguistic complexity, and diverse question formats and background knowledge requirements.
Lila unifies various mathematical reasoning datasets under a single problem formulation: given an input problem in natural language, generate a Python program that, upon execution, returns the desired answer.
This formulation allows neural approaches to focus on the high-level aspects of mathematical problem solving (e.g., identifying potential solution strategies, decomposing the problem into simpler sub-problems), while leveraging external solvers (e.g., Python built-ins, SymPy) to perform precise operations like adding huge numbers or simplifying expressions.
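As a concrete illustration of this formulation, a model maps a natural-language question to a short Python program whose execution yields the answer. The question and program below are a minimal hypothetical sketch of ours, not an example drawn from the benchmark:

# Question (hypothetical): "Solve for x: 3*x + 7 = 22."
# The model's output is a program; running it produces the answer.
from sympy import Eq, solve, symbols

def solution():
    x = symbols("x")
    # Delegate the exact symbolic manipulation to an external solver (SymPy).
    roots = solve(Eq(3 * x + 7, 22), x)
    return roots[0]

print(solution())  # prints 5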
In addition to evaluating high-level problem solving, we also facilitate two other key ways to make a fair assessment of models on mathematical reasoning tasks: (1) we evaluate generalization, e.g., performance on out-of-distribution examples, and (2) we evaluate robustness to language perturbations.
Result
Neural language models are often able to generate a program that evaluates to the correct answer, even when the model cannot directly compute the answer itself.
We identify two common cases.
First, the model leverages standard Python as a calculator.
Second, the model is able to call external libraries that perform sophisticated computations.
For instance, this pattern is common in categories that involve evaluating arithmetic expressions.
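The two cases can be sketched as follows; these snippets are our own minimal illustrations, not actual model outputs from the benchmark:

from sympy import integrate, symbols

# Case 1 (hypothetical): plain Python serves as a calculator,
# e.g., exact arithmetic on huge integers.
def calculator_case():
    return 123456789123456789 * 987654321987654321

# Case 2 (hypothetical): an external library (SymPy) performs a
# sophisticated computation, here a symbolic integration.
def library_case():
    x = symbols("x")
    return integrate(x ** 2, x)  # exact result: x**3/3

print(calculator_case())
print(library_case())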
These findings suggest that the multi-task model is a better starting point for downstream fine-tuning than the vanilla pre-trained GPT-Neo-2.7B; accordingly, we fine-tune GPT-Neo-2.7B (Neo) and the multi-task model (Multi) on 100%, 40%, and 20% of the held-out data.
We find that the multi-task model (Multi) substantially improves upon the single-task models (Neo).
It achieves better average in-domain performance than the 23 individual per-task models (0.480 vs. 0.394 average score), suggesting that it leverages cross-task structure not present in a single task's training set.
Multi-task training substantially improves out-of-domain generalization (0.448 vs. 0.238), and in one case the multi-task model's OOD performance on held-out tasks is better than its IID performance (0.290 vs. 0.133 average score).
We also find that our multi-task model is robust to the linguistic perturbations in our robustness evaluation set.
This program-synthesis formulation decouples synthesis from computation, while opening directions for further study of either aspect.
Program synthesis improves over answer prediction in all but one math category, with the largest improvements in two of them; see the table for examples.
We even see benefits of program synthesis on NLI, a classification-based task.