Scaling laws are a budget map, not a crystal ball for AI | Radar

Lilian Weng has published a long technical overview of scaling laws, the empirical relationships between model loss, parameter count, data and compute. Its value is that it pulls the popular "bigger is better" mantra back toward careful measurement.

Scaling laws turn training into estimates from small runs

The core idea is simple: as model size N, dataset size D and compute C grow, loss often decreases according to a power law. On a log-log plot, that relationship looks roughly like a straight line, which makes it possible to estimate what may happen at a larger scale.
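To make the log-log idea concrete, here is a minimal sketch of fitting a pure power law, loss ≈ a · N^(-b), as a straight line in log space. The data points, the constants 20.0 and 0.1, and the helper names are invented for illustration and do not come from Weng's overview or any real run. Fits in the literature often add an irreducible-loss offset, which this sketch leaves out.

```python
import numpy as np

def fit_power_law(x, loss):
    """Fit loss ≈ a * x**(-b) by least squares on log-log axes."""
    slope, intercept = np.polyfit(np.log(x), np.log(loss), 1)
    return float(np.exp(intercept)), float(-slope)

def predict(a, b, x):
    """Evaluate the fitted power law at scale x."""
    return a * x ** (-b)

# Synthetic, noise-free points (assumed values, not measurements)
n_params = np.array([1e7, 3e7, 1e8, 3e8])
loss = 20.0 * n_params ** (-0.1)

a, b = fit_power_law(n_params, loss)

# Extrapolate one step beyond the largest "run" in the synthetic data
extrapolated = predict(a, b, 1e9)
print(f"a={a:.3f}, b={b:.3f}, predicted loss at 1e9 params={extrapolated:.3f}")
```

Because the synthetic points lie exactly on a power law, the fit recovers the constants it was built from. Real runs are noisy, so the recovered exponent carries uncertainty that grows with the extrapolation distance.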

Weng walks through the common workflow: train several smaller models, fit a curve and use it to estimate how many tokens, parameters and FLOPs make sense for a bigger run. She also highlights the common approximation C ≈ 6ND, where C is training compute in FLOPs, N is the parameter count and D is the number of training tokens.
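The C ≈ 6ND approximation is easy to turn into back-of-the-envelope arithmetic. The sketch below uses hypothetical model sizes and token counts chosen only to keep the numbers round; the function names are illustrative, and the result is a rough estimate, not an accounting of real hardware utilization.

```python
def training_flops(n_params, n_tokens):
    """Approximate training compute using C ≈ 6 * N * D."""
    return 6.0 * n_params * n_tokens

def tokens_for_budget(flops_budget, n_params):
    """Invert C ≈ 6 * N * D to get the tokens a fixed budget can cover."""
    return flops_budget / (6.0 * n_params)

# Hypothetical run: 1e9 parameters trained on 2e10 tokens
flops = training_flops(1e9, 2e10)

# Same budget spent on a model twice as large covers half the tokens
tokens = tokens_for_budget(flops, 2e9)

print(f"compute: {flops:.2e} FLOPs, tokens at 2e9 params: {tokens:.2e}")
```

The second call shows the trade-off the approximation implies: at a fixed budget, doubling parameters halves the number of tokens the run can see.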

The overview moves from early learning curves through Kaplan et al. 2020 and Chinchilla in 2022. The important point is that the literature does not hand teams one magic constant. Results vary depending on how parameters, data, repetition and practical constraints are counted.

For model teams, this is financial discipline

Scaling laws are not academic decoration. In practice, they shape whether a team spends compute on a bigger model, longer training, more data or a different experiment. At frontier prices, a bad estimate is the difference between a plan and a burned budget.

The useful shift is from intuition to an experiment portfolio. A team does not have to believe that bigger is automatically better. It can let small runs show where another token or parameter stops buying enough improvement.

For product teams, the practical question is what is being optimized. Lower pretraining loss does not automatically mean better tool use, more reliable reasoning or cheaper inference. A scaling law is a training ledger, not a full product benchmark.

Extrapolation invites more precision than it owns

The main risk is false certainty. A curve can look clean while depending on a specific experimental range, metric and data mixture. Change architecture, data quality, multi-epoch training or inference constraints, and the old fit may lose force.
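A small constructed example shows how a clean-looking fit can mislead outside its range. Here the "true" loss has a floor of 1.5 that a pure power law cannot represent; the function `true_loss`, its constants and the chosen scales are all assumptions made up for this sketch, not results from any paper.

```python
import numpy as np

def true_loss(n):
    """Hypothetical ground truth: a power law plus an irreducible floor."""
    return 1.5 + 30.0 * n ** (-0.3)

# Fit a straight line in log-log space using only small-scale points
small = np.array([1e6, 3e6, 1e7, 3e7])
slope, intercept = np.polyfit(np.log(small), np.log(true_loss(small)), 1)

def fitted(n):
    """Pure power law implied by the small-scale fit."""
    return np.exp(intercept) * n ** slope

# Extrapolate far beyond the fitted range and compare with the truth
target = 1e10
predicted = float(fitted(target))
actual = float(true_loss(target))

print(f"predicted loss: {predicted:.3f}, actual loss: {actual:.3f}")
```

In this toy setup the extrapolated loss falls below the floor that the underlying function can never cross, so the fit promises an improvement that no amount of scale would deliver.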

That is why Weng's emphasis on "carefully" is right. Scaling laws help teams reason with discipline, but they should not replace evals on target tasks. A model can sit nicely on a loss curve and still fail in a product where latency, cost and behavior under tool use decide adoption.

The winners will measure beyond the loss chart

The next proof will not be a prettier equation. It will be how labs connect scaling predictions with post-training, inference economics and application evals. That is where compute-optimal training either becomes product-optimal or does not.

For smaller teams, the useful signal is the reverse: whether public scaling work helps them decide when training a foundation model makes no sense. Sometimes the best scaling law is the one that stops you from buying another GPU.

Lilith's verdict

Scaling laws are a ruler placed on a map, not a navigator that drives to the destination. Spend millions by the ruler without checking the terrain, and you can draw a beautiful straight line into a swamp.