Accounting for Variance in Machine Learning Benchmarks
Strong empirical evidence that one machine-learning algorithm outperforms another ideally calls for multiple trials that account for variation across the whole learning pipeline. In practice, however, the evidence that one algorithm outperforms another is often inconclusive.
We model the whole benchmarking process, revealing that variance due to data sampling, parameter initialization, and hyperparameter choice markedly impacts the results.
We show a counter-intuitive result: adding more sources of variation to an imperfect estimator brings it closer to the ideal estimator, at a 51× reduction in compute cost.
Building on these results, we study the error rate of detecting improvements on five different deep-learning tasks/architectures.
Our study leads us to propose recommendations for performance comparisons.
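To make the idea concrete, the sketch below illustrates re-drawing every source of variation (data sampling, parameter initialization, hyperparameter choice) on each benchmark trial and then comparing two pipelines over the resulting score distributions. The `train_and_evaluate` function, the hyperparameter range, and the pairwise comparison statistic are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of benchmarking with multiple randomized sources of variation.
# Each trial re-draws the data-sampling seed, the initialization seed, and a
# hyperparameter, so the spread of scores reflects the variance of the whole
# learning pipeline rather than a single fixed configuration.
# `train_and_evaluate` is a hypothetical stand-in returning a test score.

import numpy as np


def run_trials(train_and_evaluate, n_trials=20, base_seed=0):
    """Run independent trials, re-sampling every source of variation."""
    rng = np.random.default_rng(base_seed)
    scores = []
    for _ in range(n_trials):
        trial_cfg = {
            "data_seed": int(rng.integers(2**31)),   # data sampling / split
            "init_seed": int(rng.integers(2**31)),   # parameter initialization
            "lr": float(10 ** rng.uniform(-4, -2)),  # hyperparameter choice
        }
        scores.append(train_and_evaluate(**trial_cfg))
    return np.asarray(scores)


def prob_a_beats_b(scores_a, scores_b):
    """Empirical probability that pipeline A outperforms pipeline B,
    averaged over all pairs of trials (higher score = better)."""
    return float(np.mean(scores_a[:, None] > scores_b[None, :]))


# Usage (with hypothetical training functions for pipelines A and B):
#   scores_a = run_trials(train_and_evaluate_a)
#   scores_b = run_trials(train_and_evaluate_b)
#   print(prob_a_beats_b(scores_a, scores_b))
```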
Authors
Xavier Bouthillier, Pierre Delaunay, Mirko Bronzi, Assya Trofimov, Brennan Nichyporuk, Justin Szeto, Naz Sepah, Edward Raff, Kanika Madan, Vikram Voleti, Samira Ebrahimi Kahou, Vincent Michalski, Dmitriy Serdyuk, Tal Arbel, Chris Pal, Gaël Varoquaux, Pascal Vincent