Quick comments on Llama3
Llama3-400B is a huge deal, for a couple of reasons:
- Early training checkpoints already score 86+% on MMLU, so the finished model will very likely surpass all other models, GPT-4 included.
- Llama3-8B (and 70B) were trained on 15T tokens, whereas the Chinchilla compute-optimal data budget for an 8B model would be around 200B tokens (see the rough check below).
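As a rough sanity check on that ~200B figure, here is a back-of-the-envelope sketch. It assumes the commonly cited Chinchilla rule of thumb of roughly 20 training tokens per parameter, which is my approximation and not a number from this post:

```python
# Back-of-the-envelope Chinchilla check.
# Assumption: compute-optimal data ~ 20 tokens per parameter (Chinchilla rule of thumb).
LLAMA3_TOKENS = 15e12  # 15T tokens actually used

for n_params in (8e9, 70e9, 400e9):
    optimal_tokens = 20 * n_params
    print(
        f"{n_params / 1e9:.0f}B params: ~{optimal_tokens / 1e9:.0f}B compute-optimal tokens, "
        f"trained on {LLAMA3_TOKENS / optimal_tokens:.0f}x that"
    )
```

For the 8B model this lands at roughly 160B tokens, i.e. the ~200B ballpark above; the 15T-token run overshoots it by almost two orders of magnitude.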
Meta is clearly not chasing a fast, compute-optimal go-to-market; it is chasing longer-term community adoption.
Why?
Because when you train such a model, you want to optimise performance (say MMLU score) given the number of FLOPs (compute budget) you are willing to spend.
If your compute budget is C, then you try to maximise perf(N, D) subject to your training FLOPs staying within C, where D is the number of training tokens and N is the number of model parameters.
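Written out (using the standard dense-transformer approximation FLOPs ≈ 6·N·D, which is my assumption and not something stated in the post), the objective looks roughly like:

```latex
\max_{N,\, D}\ \operatorname{perf}(N, D)
\quad \text{s.t.} \quad \operatorname{FLOPs}(N, D) \approx 6\, N\, D \;\le\; C
```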
If you train with the same N but on a larger D, you will spend more FLOPs, with diminishing returns on each additional FLOP.
However, the performance gain at the same size N is hugely beneficial for adoption: it mainly means you can deploy the model on smaller hardware.
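To put numbers on that trade-off, here is a minimal sketch, again assuming FLOPs ≈ 6·N·D and the ~20 tokens-per-parameter ratio, comparing a Chinchilla-optimal 8B run with the 15T-token run Meta actually did:

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate dense-transformer training compute (assumption: FLOPs ~ 6 * N * D)."""
    return 6 * n_params * n_tokens

n = 8e9                                   # Llama3-8B parameter count
chinchilla_run = train_flops(n, 20 * n)   # ~160B tokens, compute-optimal (assumed ratio)
llama3_run = train_flops(n, 15e12)        # 15T tokens, as reported for Llama3

print(f"Chinchilla-optimal 8B run: {chinchilla_run:.2e} FLOPs")
print(f"Llama3-8B run:             {llama3_run:.2e} FLOPs ({llama3_run / chinchilla_run:.0f}x more)")
```

The extra compute buys better weights at the same parameter count, and therefore the same inference cost, which is exactly the small-hardware deployment win described above.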
Anyways, all of this is to say: the 400B version is likely being trained nice and long on 15T tokens, which will almost guarantee a banger of a model unlike anything we have seen so far.