🔬 Research summary by Abhishek Gupta (@atg_abhishek), our Founder, Director, and Principal Researcher.
[Original paper by Emma Strubell, Ananya Ganesh, and Andrew McCallum]
Overview: As we inch towards ever-larger AI models, we have entered an era where achieving state-of-the-art results has become a function of access to huge compute and data infrastructure in addition to fundamental research capabilities. This creates inequity and harms the environment because of the high energy consumption involved in training these systems. The paper provides recommendations for the NLP community to break this antipattern by making energy and policy considerations central to the research process.
Introduction
We’ve seen astonishing numbers detailing the size of recent large-scale language models: GPT-3 clocked in at 175 billion parameters and the Switch Transformer at 1.6 trillion, among many others. The environmental impact of training and serving these models has also been discussed widely, especially after the firing of Dr. Timnit Gebru from Google last year. In this paper, one of the foundational analyses of the environmental impact of AI, the researchers take a critical look at the energy consumption of the Transformer, ELMo, BERT, and GPT-2 by capturing the hardware each model was trained on, the power consumption of that hardware, the duration of training, and finally the resulting CO2eq emissions along with the financial cost of that training.
The researchers found that the enormous financial costs make this line of research increasingly inaccessible to those who don’t work at well-funded academic or industry research labs. They also found that the environmental impact is severe, and that the trend of relying on ever-larger models to achieve state-of-the-art results is exacerbating both problems.
GPU power consumption
Prior research has shown that computationally intensive models achieve high scores. Arriving at those results, though, requires iterating over different architectures and hyperparameter values, which multiplies this already high cost thousands of times over. For some large models, the resulting emissions rival the lifetime carbon footprint of several cars.
To calculate the power consumption of training large models on GPUs, the researchers use the manufacturer-provided system management interfaces, which report these values in real time. Total power consumption is estimated as the sum of the power drawn by the CPU, GPU(s), and DRAM, multiplied by a power usage effectiveness (PUE) factor that accounts for the additional energy consumed for auxiliary purposes such as cooling. These calculations are done for the Transformer, BERT, ELMo, and GPT-2 based on the hardware and training durations reported in the original papers by the authors of those models.
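As a rough illustration, here is a minimal sketch, not the authors' code, of the estimation approach described above. The average power draws, GPU count, and training duration below are hypothetical; the PUE of 1.58 and the conversion factor of 0.954 lbs of CO2eq per kWh are the constants cited in the original paper.

```python
# Minimal sketch of the paper's estimation approach: average the power draws
# reported by the hardware's management interfaces, scale by the PUE factor
# to account for overhead such as cooling, and convert the energy to CO2eq.

PUE = 1.58                 # power usage effectiveness factor cited in the paper
CO2_LBS_PER_KWH = 0.954    # average US grid emissions factor cited in the paper

def estimate_emissions(avg_cpu_w, avg_dram_w, avg_gpu_w, num_gpus, hours):
    """Estimate total energy (kWh) and CO2eq (lbs) for one training run."""
    combined_watts = avg_cpu_w + avg_dram_w + num_gpus * avg_gpu_w
    kwh = PUE * combined_watts * hours / 1000.0   # W * h -> kWh, with datacentre overhead
    return kwh, CO2_LBS_PER_KWH * kwh

# Hypothetical 8-GPU job running for 120 hours at assumed average power draws.
kwh, co2 = estimate_emissions(avg_cpu_w=100, avg_dram_w=30,
                              avg_gpu_w=250, num_gpus=8, hours=120)
print(f"~{kwh:.0f} kWh, ~{co2:.0f} lbs CO2eq")
```

At these example values the job works out to roughly 400 kWh and just under 400 lbs of CO2eq; the point is that the overhead factor, the number of accelerators, and the training duration all scale the footprint directly.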
While prior research has captured the energy and cost of training such models, it typically focuses on just the final configuration of the model rather than the journey taken to arrive at that configuration, which can be far more significant in its impact. Through their experiments, the authors find that TPUs are more energy-efficient than GPUs, especially when they are well-suited to the model being trained, as is the case for BERT.
Iteratively fine-tuning models
This process of refining a model through iterative searches over architectures and hyperparameter values adds up to massive financial and energy costs. As the paper shows, a single training run might cost only ~USD 200, yet the entire R&D process for arriving at the final model, which required ~4,800 runs, cost on the order of USD 450k, easily putting such work out of reach of those without access to significant resources.
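A back-of-the-envelope sketch makes the multiplicative effect clear. The per-run figures below are hypothetical round numbers chosen only to mirror the rough totals quoted above; they are not values from the paper.

```python
# Illustrative sketch of why the full R&D bill dwarfs the cost of the single
# final training run: the spend is dominated by the thousands of exploratory
# runs around it. All numbers here are hypothetical round figures.

final_run_cost = 200          # one full training run at the final configuration (~USD)
exploratory_runs = 4800       # architecture / hyperparameter search runs
avg_exploratory_cost = 90     # assumed average cost per search run (many are shorter)

total = final_run_cost + exploratory_runs * avg_exploratory_cost
print(f"Single final run: ~USD {final_run_cost}")
print(f"Full R&D process: ~USD {total:,}")   # ~USD 432,200 with these figures
```

Even with a modest assumed average cost per exploratory run, the total lands in the hundreds of thousands of dollars, which is why reporting only the final run understates both the monetary and the environmental cost.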
Thus, the researchers propose that when a model is intended to be fine-tuned downstream, its sensitivity to different hyperparameters should be reported to guide future developers. An emphasis on large-scale models also furthers inequity by promoting a rich-get-richer cycle: only organizations with abundant resources can do this kind of research, publish results, and thereby attract more funding, further entrenching their advantage. Tooling that enables more efficient architecture and hyperparameter searches sees limited adoption at the moment because of a lack of easy tutorials and of compatibility with the most popular deep learning libraries such as TensorFlow and PyTorch. A change on this front is also bound to improve the state of carbon accounting in the field of AI.
Between the lines
This paper kickstarted a reflection in the field of NLP on carbon accounting and on the overreliance on accuracy as the metric for judging the value of results in the AI research community. Subsequent efforts, such as workshops on efficient, carbon-aware NLP at various top-tier conferences, have further boosted awareness of these issues in the community. The hope is that this momentum will be sustained as we seek to build more eco-socially responsible AI systems. Follow-on research is needed, especially to make tooling more compatible with existing deep learning frameworks. Making such reporting a standard part of the research lifecycle will also help. Work done at the Montreal AI Ethics Institute, titled SECure: A Social and Environmental Certificate for AI systems, provides further recommendations on how we can do better when it comes to building eco-socially responsible AI systems.