Decomposing the natural language problem into runnable steps remains the only learning task for the LLM, while solving is delegated to the interpreter.
We experiment with 12 reasoning tasks from BIG-Bench Hard and other benchmarks, covering mathematical reasoning, symbolic reasoning, and algorithmic problems.
Across all of these natural language reasoning tasks, generating code with an LLM and reasoning with a Python interpreter yields more accurate results than much larger models, and we set new state-of-the-art results on all 12 benchmarks.
Authors
Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, Graham Neubig
We propose Program-Aided Language models (PAL): a novel prompting method that uses an LLM to read natural language problems and generate programs as intermediate reasoning steps, but offloads the solution step to a Python interpreter.
This offloading relies on an LLM that can decompose a natural language problem into programmatic steps, which is fortunately possible with contemporary state-of-the-art LLMs that are pre-trained on both natural language and programming languages.
We demonstrate the effectiveness of PAL across arithmetic and symbolic reasoning tasks.
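To make the idea concrete, here is a minimal sketch of a PAL-style output: the model writes its reasoning as commented Python rather than free-form text, so the final answer comes from running the code. The word problem and the variable names below are illustrative, not taken from the paper's actual prompts.

```python
# A PAL-style "reasoning as code" solution for an illustrative grade-school
# word problem:
#
#   "Roger has 5 tennis balls. He buys 2 cans of tennis balls.
#    Each can has 3 tennis balls. How many tennis balls does he have now?"

def solution():
    # Roger started with 5 tennis balls.
    tennis_balls = 5
    # He bought 2 cans of 3 tennis balls each.
    bought_balls = 2 * 3
    # The answer is the total.
    answer = tennis_balls + bought_balls
    return answer

print(solution())  # → 11
```

Each natural-language reasoning step survives as a comment, but the arithmetic itself is delegated to Python.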
Results
We present a new few-shot state-of-the-art for symbolic and non-symbolic reasoning tasks.
The language model offloads the computation to the Python interpreter, ensuring that any complex computation can be performed accurately, given a correctly generated program.
Symbolic and mathematical reasoning have long been among the most challenging problems in computer science. For language models, this is largely due to the complexity of the questions and to their inability to perform arithmetic.
A natural question is whether large numbers in the question affect the output that language models generate. We address this by comparing the outputs generated for two versions of the same question, with and without large numbers. In 16 of the 25 cases we analyze, the model generates nearly identical reasoning, indicating that large numbers do not have a significant impact on output generation, and that the primary failure mode is instead the inability to perform arithmetic.
Finally, we show that PAL not only provides better results on the standard benchmarks, but is also more robust to large and non-integer numbers.
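The robustness claim has a simple mechanical explanation: Python integers have arbitrary precision, so the same generated program stays exact when the question's numbers grow, whereas a model doing the arithmetic "in its head" degrades. The numbers below are made up for illustration.

```python
# Offloaded arithmetic is exact regardless of operand size, because Python
# ints are arbitrary-precision. Both products below are illustrative.

small = 35 * 12        # numbers typical of the original question
large = 3579 * 12834   # the "large numbers" variant of the same question

print(small, large)  # → 420 45932886
```

A language model asked to produce `45932886` directly as text would have to get every digit right; the interpreter gets it for free.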
In this paper, we introduce a new few-shot state-of-the-art for mathematical and symbolic reasoning tasks.