
Communications of the ACM

Viewpoint

Please Report Your Compute


[Illustration: the word 'FLOP' on a graph-like background. Credit: Alicia Kubista / Andrij Borys Associates]

"The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin."

—Richard Sutton, The Bitter Lessona

During the last seven decades, the field of artificial intelligence (AI) has seen astonishing developments—recent AI systems have demonstrated impressive wide-ranging capabilities in areas ranging from mathematics to biology,4,14 and have the potential to influence large fractions of the U.S. economy.6

How have these developments been possible? When we look at the numbers, the story becomes clear—the amounts of compute, data, and parameters in deep learning models have all grown exponentially. On top of these massive increases in scale, researchers are constantly searching for novel ways to improve algorithmic and hardware efficiency.7

While all these factors play a crucial role in AI progress, our focus in this Viewpoint is on training compute—the number of floating point operations (FLOP) performed by computer chips when training a Machine Learning (ML) model. Training compute has grown an astonishing amount since the start of ML in the 1950s—first doubling every 20 months in tandem with Moore's Law, and then every six months after the advent of Deep Learning. For comparison, Rosenblatt's Perceptron in 195815 was trained with 7e5 FLOP, whereas we estimate the recently unveiled GPT-414 was trained with up to 3e25 FLOP—almost 20 orders of magnitude more compute (see Figure 1).

Figure 1. Training compute budgets of notable ML systems, 1950–2022. (Original graph from Sevilla et al.,17 updated March 2023.)
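To check the arithmetic: 3e25 / 7e5 ≈ 4e19, a factor of roughly 10^19.6, which is where the figure of almost 20 orders of magnitude comes from.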

These enormous increases in compute investments have been accompanied by impressive increases in AI capabilities. Research into neural scaling laws suggests training compute is among the best predictors of model performance.10,11 Furthermore, training compute requirements inform budget allocations in large tech companies, which are willing to spend millions of U.S. dollars to train a single model.5,10

Evidently, compute is a very important ingredient for AI. But there is a problem—despite recent studies indicating compute is an essential ingredient for high AI performance, we fail to report it with precision. Case in point: none of the award-winning papers (or papers receiving honorable mentions) in ICLR 2021, AAAI-21, and NeurIPS 2021 explicitly reported their compute usage in unambiguous terms, such as the number of FLOP performed during training.b In fairness to the authors, there are challenges to estimating compute, and a lack of norms for publishing FLOP numbers. This has to change.


Why You Should Report Your Compute Usage

The reasons for reporting compute usage in unambiguous terms go beyond acknowledging the importance of compute—adopting compute reporting norms would support the science of ML, forecasts of AI impacts and developments, and evidence-based AI policy. We discuss each of these factors in turn.

Supporting reproducibility. Publishing precise experimental details about data and model architectures is expected of any rigorous ML paper. Should compute not be held to the same standard as these other details? Understanding the required compute budget helps practitioners know what is needed—in terms of hardware or cloud-computing requirements, engineering effort, and so on—to reproduce or run a similar experiment.

Facilitating comparisons between models. Publishing compute figures allows researchers to better interpret the drivers of progress in ML. For instance, Erdil and Besiroglu7 investigate algorithmic improvements in image classifiers based on a dataset of training compute estimates.

Understanding who is able to train massive ML systems. We are entering a world in which only a few players have the resources to train competitive ML models.1 Transparency around this fact will help us better decide what we should do about it. In particular, we think that researchers reporting how much compute they used, and which providers supplied it, will help us understand access to computing infrastructure in AI research.

Studying the relation between scale and performance. Empirically derived scaling laws hint at a certain regularity in the relationship between compute and performance.10,11 More systematic reporting of compute and performance can help us better map how one translates to the other, and how this ultimately corresponds to the capabilities of AI systems.
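As one concrete example of such a mapping, Hoffmann et al.10 fit language model loss with a parametric form of the type L(N, D) = E + A/N^α + B/D^β, where N is the parameter count, D the number of training tokens, and E, A, B, α, β fitted constants. Combined with the common approximation that training compute C ≈ 6ND for dense transformers, a reported compute budget can be translated directly into a predicted loss.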


Anticipating the emergence of capabilities in AI. Armed with the map between compute and performance, we might be able to extrapolate compute trends to venture guesses about what new capabilities may emerge, and when they might pose previously unseen risks. Anticipating these issues allows us to take preemptive action. We can, for instance, provide an early warning to AI safety and ethics researchers, helping prioritize their efforts on what might become pressing future issues.3,18

Regulating the deployment of advanced AI systems. Models trained on new regimes of compute must be treated with caution. The emergence of side effects and unexpected behavior plagues our field, where we are being led by bold experiments rather than careful deliberation. Language models can in fact be too big2 and AI ingenuity can produce unintended consequences.12 Being transparent about the levels of compute used in training can help us understand better whether there is precedent for the scale deployed, or if special precautions are warranted.

While capabilities and risks are difficult to assess, especially before training and deployment, compute used in training is a legible lever for external and internal regulation. We can draft policies targeting models exceeding a certain compute budget, for example, requiring special review and safety assurance processes for models exceeding a threshold of training compute.

In short, reporting compute usage in unambiguous terms will help us hold results to greater scientific scrutiny and manage the deployment of advanced AI. This must become a widespread norm, so we can make wiser decisions about this increasingly important part of our technological society.


How to Measure Compute

In a sense, it is not surprising that reporting ML training compute is not already a widespread norm. Currently, there are few standard methods or tools for estimating compute usage, and the ones that exist are often unreliable.c

Developers of ML frameworks such as PyTorch, TensorFlow, or JAX should consider incorporating a way to automatically and consistently measure compute usage into their training toolkits. Such a change would facilitate accurate measurement of training compute, which would benefit the entire field.
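In the meantime, existing profilers offer a partial stopgap. The following minimal sketch uses PyTorch's experimental with_flops option to tally per-operator FLOP estimates for a single toy training step; as discussed in the footnotes, such counts only cover certain operators and can be unreliable,9 so treat the output as an approximation rather than a definitive figure.

    import torch
    from torch import nn

    # Toy model and batch standing in for a real training step.
    model = nn.Linear(1024, 1024)
    batch = torch.randn(64, 1024)

    # The profiler attaches FLOP estimates to the operators it recognizes
    # (chiefly matrix multiplications and convolutions).
    with torch.profiler.profile(record_shapes=True, with_flops=True) as prof:
        loss = model(batch).sum()
        loss.backward()

    # Sum the per-operator estimates; operators without an estimate are skipped.
    total_flops = sum(evt.flops for evt in prof.key_averages() if evt.flops)
    print(f"Estimated FLOP for one training step: {total_flops:.3g}")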


Meanwhile, we can apply approximate methods. One method involves estimating the number of arithmetic operations the ML model performs over training, based on its architecture. Alternatively, we can estimate training compute from information on hardware performance,d and the training time. In general, the first method yields more accurate results, whereas the second can often be determined more easily for complex architectures16 (see Figure 2).

Figure 2. Two ways of applying approximate methods.
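To make these two methods concrete, here is a minimal Python sketch of both estimates. The architecture-based method uses the common approximation for dense transformers of roughly 6 FLOP per parameter per training token; the hardware-based method multiplies the number of chips by peak throughput, training time, and an assumed utilization rate. The numeric inputs below are illustrative placeholders, not measurements.

    def flop_from_architecture(n_params, n_tokens):
        # Method 1: count arithmetic operations from the model itself.
        # For dense transformers, a forward pass costs roughly 2 * N * D FLOP
        # and the backward pass roughly twice that, giving about 6 * N * D in total.
        return 6 * n_params * n_tokens

    def flop_from_hardware(n_chips, peak_flop_per_s, training_days, utilization):
        # Method 2: infer FLOP from hardware specifications and wall-clock time.
        # Utilization (the fraction of peak throughput actually achieved) is
        # typically well below 1 and should be reported alongside the estimate.
        seconds = training_days * 24 * 3600
        return n_chips * peak_flop_per_s * seconds * utilization

    # Illustrative placeholders: a 70e9-parameter model trained on 1.4e12 tokens,
    # versus 1,000 accelerators at 3.12e14 FLOP/s peak, run for 30 days at 35% utilization.
    print(f"Method 1: {flop_from_architecture(70e9, 1.4e12):.2e} FLOP")
    print(f"Method 2: {flop_from_hardware(1000, 3.12e14, 30, 0.35):.2e} FLOP")

Whichever method is used, reporting the inputs (parameter count, token count, hardware, utilization) alongside the resulting FLOP figure lets readers check the arithmetic.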

Ultimately, we want a consistent and unambiguous way of measuring compute. The metric favored by previous work, and the one we recommend, is the total number of FLOPe used to train the system.f A calculatorg is available to help with estimating FLOP.17


Conclusion

Ever-growing compute usage has become a central part of ML development. To keep track of this increasingly important metric, we need a norm to consistently report compute. And while no tools exist yet to measure this automatically and precisely, we have explained how you can approximate the result using freely available online tools.h

In the AI field there exists a commitment to transparency, accountability, and oversight, which has resulted in norms such as publishing reproducible code, model cards,13 and data cards.8 Reporting compute is a necessary extension of that commitment.

We hope this will help shift the norms of research in ML toward more transparency and rigor. And you can help us normalize this by acquainting yourself with the methods we have outlined and including FLOP estimates in your writing. Please report your compute.


References

1. Ahmed, N. and Wahed, M. The De-Democratization of AI: Deep Learning and the Compute Divide in Artificial Intelligence Research. arXiv. (2020); https://doi.org/10.48550/ARXIV.2010.15581

2. Bender, E.M. et al. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT '21 (2021), ACM, NY; https://bit.ly/3n26u5J

3. Bommasani, R. et al. On the Opportunities and Risks of Foundation Models. arXiv. (2022); https://bit.ly/40LXLDi

4. Chowdhery, A. et al. PaLM: Scaling Language Modeling with Pathways. (2022); https://bit.ly/3mWWv1w

5. Cottier, B. Trends in the dollar training cost of machine learning systems. (2023); https://bit.ly/3KmVaKJ

6. Eloundou, T. et al. GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models. arXiv. (2023); https://arxiv.org/abs/2303.10130

7. Erdil, E. and Besiroglu, T. Algorithmic progress in computer vision. arXiv. (2022); https://arxiv.org/abs/2212.05153

8. Gebru, T. et al. Datasheets for Datasets (2018); https://bit.ly/3leCW4s

9. Hobbhahn, M. How to Measure FLOP/s for Neural Networks Empirically? (2021); https://bit.ly/3FxCBRj

10. Hoffmann, J. et al. Training Compute-Optimal Large Language Models. (2022); https://bit.ly/3Tqk4vR

11. Kaplan, J. et al. Scaling Laws for Neural Language Models. (2020); https://bit.ly/3n2LnQy

12. Krakovna, V. et al. Specification Gaming: The Flip Side of AI Ingenuity (2022); https://bit.ly/3n3h4t2

13. Mitchell, M. et al. Model cards for model reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency (2019); https://bit.ly/40aqu4c

14. OpenAI. GPT-4 Technical Report. (2023); https://bit.ly/3nEoNOQ

15. Rosenblatt, F. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review 65, 6 (1958); 386–408; https://bit.ly/3ZXuk0Z

16. Sevilla, J. Compute Trends Across Three Eras of Machine Learning. (2022); https://bit.ly/3mV2rYR

17. Sevilla, J. et al. Estimating Training Compute of Deep Learning Models. (2022); https://bit.ly/3Tn0PmG

18. Steinhardt, J. On The Risks of Emergent Behavior in Foundation Models. (2021); https://bit.ly/3Zyq6vE


Authors

Jaime Sevilla ([email protected]) is the director of Epoch Research, San Jose, CA, USA.

Anson Ho ([email protected]) is a staff researcher at Epoch Research, San Jose, CA, USA.

Tamay Besiroglu ([email protected]) is the associate director of Epoch Research, San Jose, CA, USA, and a member of the technical staff at MIT CSAIL, Cambridge, MA, USA.


Footnotes

a. See https://bit.ly/2Tb31xf

b. NeurIPS is notable for encouraging researchers to report their compute usage, but this is generally not strictly enforced: typically, only information about the hardware used and the training setup is required, rather than an explicit FLOP count.

c. Our team found this out the hard way: profilers—packages that measure certain quantities at runtime, such as CPU time, GPU time, and FLOP—sometimes under- and overcount the relevant quantities (see Hobbhahn9).

d. Accelerator peak performance varies depending on the number format used, for example, FP32, FP16, and so forth. For reproducibility and transparency purposes, we recommend reporting the number format used in training.

e. Alternative metrics include petaFLOP/s-days (approximately equal to 1e20 FLOP) and multiply-adds (in practice approximately equal to 2 FLOP each). Just reporting training time and hardware is not an acceptable substitute.

f. The primary metric we recommend is the FLOP used in the final training run of the reported model. Reporting the compute spent during previous experiments and hyperparameter tuning would be a more rigorous approach, though we acknowledge the additional difficulty.

g. See https://bit.ly/3Tn0PmG

h. See https://bit.ly/3Tn0PmG

The authors thank Lennart Heim, Pablo Villalobos, Moshe Y. Vardi, and Simeon Campos for their comments.


Copyright held by authors.


 
