We study empirical scaling laws for transfer learning between distributions in an unsupervised, fine-tuning setting.
We find that the effective data transferred is described well in the low data regime by a power-law of parameter count and fine-tuning dataset size.
We believe the exponents in these power-laws correspond to measures of a model's generality and of the proximity between distributions (where proximity is directed rather than symmetric).
We find that pre-training effectively multiplies the fine-tuning dataset size.
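As a minimal sketch of the functional form described above (the symbols $D_T$ for effective data transferred, $D_F$ for fine-tuning dataset size, $N$ for parameter count, and the fitted constants $k$, $\alpha$, $\beta$ are our notation, not stated in this abstract), the low-data-regime fit can be written as

$$ D_T \approx k \, D_F^{\alpha} \, N^{\beta}. $$

Under this assumed form, the total effective dataset is roughly $D_F + D_T$, so pre-training acts as a multiplier of about $1 + D_T / D_F$ on the fine-tuning data.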
Authors
Danny Hernandez, Jared Kaplan, Tom Henighan, Sam McCandlish