Quick comments on Llama3
Llama3-400B is a huge deal, for a couple of reasons:
- Early training checkpoints already score 86+% on MMLU, so the finished model will very likely surpass all other models, GPT-4 included.
- Llama3-8B (and 70B) were trained on 15T tokens, whereas the Chinchilla compute-optimal data budget for an 8B model would be around 200B tokens (see the rough check below).
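As a rough sanity check on that ~200B figure, here is a back-of-the-envelope sketch. It assumes the commonly cited Chinchilla rule of thumb of roughly 20 training tokens per parameter, which is my approximation and not a number from this post:

```python
# Back-of-the-envelope Chinchilla check.
# Assumption: compute-optimal data ~ 20 tokens per parameter (Chinchilla rule of thumb).
LLAMA3_TOKENS = 15e12  # 15T tokens actually used

for n_params in (8e9, 70e9, 400e9):
    optimal_tokens = 20 * n_params
    print(
        f"{n_params / 1e9:.0f}B params: ~{optimal_tokens / 1e9:.0f}B compute-optimal tokens, "
        f"trained on {LLAMA3_TOKENS / optimal_tokens:.0f}x that"
    )
```

For the 8B model this lands at roughly 160B tokens, i.e. the ~200B ballpark above; the 15T-token run overshoots it by almost two orders of magnitude.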
Meta is clearly not chasing a fast, compute-optimal go-to-market; it is chasing longer-term community adoption.
Why?
Because when you train such a model, you want to optimise performance (say MMLU score) given the number of FLOPs (compute budget) you are willing to spend.
If your compute budget is C, then you try to maximise perf(N, D) subject to your training FLOPs staying within C, where D is the number of training tokens and N is the number of model parameters.
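Written out (using the standard dense-transformer approximation FLOPs ≈ 6·N·D, which is my assumption and not something stated in the post), the objective looks roughly like:

```latex
\max_{N,\, D}\ \operatorname{perf}(N, D)
\quad \text{s.t.} \quad \operatorname{FLOPs}(N, D) \approx 6\, N\, D \;\le\; C
```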
If you train with the same N but on a larger D, you will spend more FLOPs, with diminishing returns on each additional FLOP.
However, the performance gain at the same size N is hugely beneficial for adoption: it mainly means you can deploy the model on smaller hardware.
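To put numbers on that trade-off, here is a minimal sketch, again assuming FLOPs ≈ 6·N·D and the ~20 tokens-per-parameter ratio, comparing a Chinchilla-optimal 8B run with the 15T-token run Meta actually did:

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate dense-transformer training compute (assumption: FLOPs ~ 6 * N * D)."""
    return 6 * n_params * n_tokens

n = 8e9                                   # Llama3-8B parameter count
chinchilla_run = train_flops(n, 20 * n)   # ~160B tokens, compute-optimal (assumed ratio)
llama3_run = train_flops(n, 15e12)        # 15T tokens, as reported for Llama3

print(f"Chinchilla-optimal 8B run: {chinchilla_run:.2e} FLOPs")
print(f"Llama3-8B run:             {llama3_run:.2e} FLOPs ({llama3_run / chinchilla_run:.0f}x more)")
```

The extra compute buys better weights at the same parameter count, and therefore the same inference cost, which is exactly the small-hardware deployment win described above.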
Anyways, all of this is to say: the 400B version is likely being trained nice and long on 15T tokens, which will almost guarantee a banger of a model unlike anything we have seen so far.