Transcending Scaling Laws with 0.1% Extra Compute - 42Papers