Accelerating Understanding of Scientific Experiments with End to End Symbolic Regression

Nikos Arechiga, Francine Chen, Yan-Ying Chen, Yanxia Zhang, Rumen Iliev, Heishiro Toyoda, Kent Lyons

We consider the problem of learning free-form symbolic expressions from raw
data, such as that produced by an experiment in any scientific domain. Accurate
and interpretable models of scientific phenomena are the cornerstone of
scientific research. Simple yet interpretable models, such as linear or
logistic regression and decision trees often lack predictive accuracy.
Conversely, accurate black-box models such as deep neural networks provide
high predictive accuracy, but do not readily admit human understanding in a way
that would enrich the scientific theory of the phenomenon. Many great
breakthroughs in science revolve around the development of parsimonious
equational models with high predictive accuracy, such as Newton's laws,
universal gravitation, and Maxwell's equations. Previous work on automating the
search for equational models from data combines domain-specific heuristics with
computationally expensive techniques, such as genetic programming and
Monte Carlo search. We develop a deep neural network (MACSYMA) to address the
symbolic regression problem as an end-to-end supervised learning problem.
MACSYMA can generate symbolic expressions that describe a dataset. The
computational complexity of the task is reduced to the feedforward computation
of a neural network. We train our neural network on a synthetic dataset
consisting of data tables of varying length and varying levels of noise, for
which the neural network must learn to produce the correct symbolic expression
token by token. Finally, we validate our technique by running on a public
dataset from behavioral science.
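
The synthetic training setup described above, in which each sample pairs a noisy data table with the token sequence of the expression that generated it, can be sketched as follows. This is an illustrative toy, not the paper's implementation: the vocabulary, expression sampler, and function names are all hypothetical assumptions.

```python
import random
import numpy as np

# Toy token vocabulary; the paper's actual grammar is not specified here.
TOKENS = ["x", "sin", "+", "*", "c"]

def sample_expression():
    """Sample a small expression as (token sequence, callable). Hypothetical."""
    if random.random() < 0.5:
        return ["sin", "x"], np.sin
    # Linear case with the constant fixed at 2.0 for simplicity.
    return ["*", "c", "x"], lambda x: 2.0 * x

def make_sample(n_rows=32, noise_std=0.1):
    """Build one (noisy data table, target token sequence) training pair."""
    tokens, f = sample_expression()
    x = np.random.uniform(-3.0, 3.0, size=n_rows)
    y = f(x) + np.random.normal(0.0, noise_std, size=n_rows)
    table = np.stack([x, y], axis=1)  # shape (n_rows, 2)
    return table, tokens

table, tokens = make_sample()
print(table.shape, tokens)
```

A sequence model trained on such pairs would then be supervised to emit the target tokens one at a time, conditioned on the table, which is how the abstract frames symbolic regression as end-to-end supervised learning.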