
What do loss curves for LLMs look like?


This plot from @OpenAI's Scaling Laws for Neural Language Models is widely referred to when discussing training dynamics, but did you know it has almost no bearing on what massive LM training curves look like? (1/12) You can read this thread unrolled at: typefully.com/BlancheMinerva/mPialqw
Here is the same plot (as the one on the left) for GPT-NeoX-20B, for both training and validation loss. Why don't we see the same "burn-in" effect? 🤔🤔🤔 (2/12)
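For anyone who wants to make this kind of plot themselves: a minimal sketch, assuming you've exported the logged (step, loss) pairs from WandB to a CSV; the filename and column names here are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd

TOKENS_PER_STEP = 3.1e6  # GPT-NeoX-20B's batch size in tokens (see below)

# Hypothetical CSV export of the run's loss history.
df = pd.read_csv("train_loss.csv")

plt.plot(df["step"] * TOKENS_PER_STEP, df["loss"])
plt.xscale("log")  # the scaling-laws figure uses a log token axis
plt.xlabel("tokens processed")
plt.ylabel("training loss")
plt.show()
```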
I asked this same question in the #EleutherAI discord an hour ago and was blown away by what an eagle-eyed observer called thrasher pointed out: my plot starts too far to the right for the burn-in to show up! (3/12)
When looking at OpenAI's plot, it's easy to assume that it represents the training curves of large models. But 10^9 params is "only" 1B: my model is 20B params and was trained for 400B tokens. The burn-in period ends at around the 200M token mark. (4/12)
That's ~0.05% of the way through training! I logged my first loss value at 300 steps, thinking that was surely early enough to catch anything interesting. But 300 steps with a batch size of 3.1M tokens is nearly 1B tokens: my logging can only capture the tail end (5/12)
of the plot. The OpenAI plot shows two phase transitions, one just before 10^8 tokens and one just before 10^9. The first phase transition occurs in the first 0.05% and the second in the first 0.25% of the training of my 20B-parameter model. For Google's recent PaLM, (6/12)
those numbers drop to 0.025% and 0.125% respectively. More than 99.5% of training occurs in the "flatlined" regime after the second phase transition. Below are various evaluation benchmark curves. Note that the *entire curve* is to the right of the second phase transition. (7/12)
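These percentages are simple arithmetic; here is a short sketch that reproduces them, taking the transition points as roughly 2×10^8 and 10^9 tokens (read off the plot) and PaLM's reported 780B training tokens.

```python
TOKENS_PER_STEP = 3.1e6  # GPT-NeoX-20B batch size, in tokens
NEOX_TOKENS = 400e9      # GPT-NeoX-20B total training tokens
PALM_TOKENS = 780e9      # PaLM total training tokens (from the PaLM paper)

# The first logged loss value, at step 300, already sits ~1B tokens in:
first_log = 300 * TOKENS_PER_STEP
print(f"first log: {first_log / 1e9:.2f}B tokens, "
      f"{100 * first_log / NEOX_TOKENS:.2f}% of training")  # ~0.93B, ~0.23%

# Where the two phase transitions fall as a fraction of a full run:
for name, total in [("GPT-NeoX-20B", NEOX_TOKENS), ("PaLM", PALM_TOKENS)]:
    print(f"{name}: {100 * 2e8 / total:.3f}% and {100 * 1e9 / total:.3f}%")
# GPT-NeoX-20B: 0.050% and 0.250%
# PaLM:         0.026% and 0.128%
```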
So clearly there's a lot of interesting stuff happening that OpenAI's plot doesn't tell us about. Caveat: these curves don't run all the way to the end of training. The final eval numbers for the model are: LAMBADA: 0.720, HellaSwag: 0.535, PIQA: 0.779, WinoGrande: 0.661 (8/12)
(I don't have MathQA and PubMedQA numbers yet.) In my three-month training run we blew past everything the OpenAI plot shows within two days! That early in training, LLMs are still producing complete garbage: they haven't figured out spelling, let alone grammar or anything of substance. (9/12)
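For anyone who wants to reproduce zero-shot numbers like the ones above: a sketch using EleutherAI's lm-evaluation-harness via its current Python API (the 2022 interface differed, and the exact task names here are my assumptions).

```python
import lm_eval

# Zero-shot evaluation of GPT-NeoX-20B via the HuggingFace backend.
# Running a 20B model needs serious hardware; swap in a smaller
# pretrained= value to try the pipeline cheaply.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/gpt-neox-20b",
    tasks=["lambada_openai", "hellaswag", "piqa", "winogrande"],
    num_fewshot=0,  # all of the evals in this thread are 0-shot
)
for task, metrics in results["results"].items():
    print(task, metrics)
```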
WandB for GPT-NeoX: wandb.ai/eleutherai/gpt-thicc/reports/GPT-NeoX-20B-Pretraining--VmlldzoxMTg5MjY3 So what is going on during this rapid loss improvement if eval scores still suck? @AnthropicAI has been doing some research which may hold the answer: transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html They claim that few-shot (my evals are all 0-shot) (10/12)
performance improves substantially. Additionally, they claim that this is where "induction heads" begin to appear. I haven't looked into this personally yet, but it seems like a very interesting set of ideas. And it does make sense for 0-shot to develop after few-shot, IMO. (11/12)
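For the curious, the Anthropic paper quantifies in-context learning with a simple score: the loss of a late token in the context minus the loss of an early one (they use the 500th and 50th tokens). A minimal sketch of that measurement, assuming a HuggingFace causal LM and batches of tokenized sequences at least ~502 tokens long:

```python
import torch
from transformers import AutoModelForCausalLM

# A 20B model needs sharding/quantization in practice; any smaller
# causal LM works for trying this out.
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neox-20b")
model.eval()

def per_token_loss(input_ids: torch.Tensor) -> torch.Tensor:
    """Cross-entropy at each position; shape (batch, seq_len - 1)."""
    with torch.no_grad():
        logits = model(input_ids).logits
    # Position i predicts token i + 1, hence the shift.
    return torch.nn.functional.cross_entropy(
        logits[:, :-1].transpose(1, 2), input_ids[:, 1:], reduction="none"
    )

def icl_score(input_ids: torch.Tensor, early: int = 50, late: int = 500) -> float:
    """More negative = the model benefits more from a long context."""
    losses = per_token_loss(input_ids)
    return (losses[:, late].mean() - losses[:, early].mean()).item()
```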
As a reminder, intermediate checkpoints for GPT-NeoX-20B are available; DM me if you would like to experiment with them. Also DM me if you want to experiment on GPT-NeoX but lack the compute! EleutherAI is happy to provide free compute to anyone doing research on it. (12/12)

Stella Rose Biderman

@BlancheMinerva
