🔬 Research Summary by Sofia Serrano, a Ph.D. candidate in computer science at the University of Washington (and incoming Assistant Professor at Lafayette College starting in Autumn 2024) who focuses on the interpretability of contemporary natural language processing models.
[Original paper by Sofia Serrano, Zander Brumbaugh, and Noah A. Smith]
Overview: Language models are seemingly everywhere in the news, but explanations of them are either very high-level or highly technical and geared toward experts. As natural language processing (NLP) researchers, we thought it would be helpful to write a guide for readers outside of NLP who want a more in-depth look at how language models work, the factors that have contributed to their recent development, and how they might continue to develop.
Introduction
Given the growing importance of AI literacy, we decided to write this tutorial on language models (LMs) to help narrow the gap between those who study LMs (the core technology underlying ChatGPT and similar products) and those who are intrigued by them and want to learn more. In short, we believe the perspective of researchers and educators can improve the public's understanding of these technologies beyond what's currently available, which tends to be either extremely technical material or promotional content put out about products by their purveyors.
Our approach teases apart the concept of a language model from the products built on LMs, from the behaviors attributed to or desired from those products, and from claims about their similarity to human cognition. As a starting point, we (1) offer a scientific viewpoint that focuses on questions amenable to study through experimentation; (2) situate language models as they are today in the context of the research that led to their development; and (3) describe the boundaries of what is known about the models as of this writing.
Key Insights
Tasks, Data, and Evaluation Methods
To understand the last few years’ developments around language models, it’s helpful to have some context about the research field that produced them. Therefore, we begin our guide by explaining how the field of NLP has typically approached building computer systems to work with text in the last couple of decades.
The first idea we discuss is how NLP researchers turn idealized things we'd like a computer to be able to do, like "have an understanding of grammar," "write coherently," or "translate between languages," into simplified problems we can begin to chip away at. These simplified problems are known as "tasks," and they turn a desired computer behavior like "translating between languages" into something more concrete, like "given an English sentence, translate it into French."
Crucially, there is a gap between an idealized computer behavior and the "task" it is simplified into. To use our translation example, anyone who has read the same book in two different languages can tell you that there is an art to how human translators balance faithfulness to the original work against the conventions of the work's new language to avoid stilted prose. That process often involves rearranging sentences, so the two versions of a book may not even contain the same number of sentences, and our distillation of "translating text between languages" into sentence-for-sentence translation obscures this. Still, making progress on that intermediate stepping stone of a task helps make progress toward the larger goal.
We then discuss how settling on a data source and an evaluation method for a given simplified task lends itself to training neural network-based models to perform that task.
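To make the task/data/evaluation pattern concrete, here is a toy sketch in Python; the data and the "system" are entirely made up for illustration. The point is only the shape of the setup: a task defines input-output pairs, a dataset supplies examples, and an evaluation method scores a system's outputs against reference answers.

```python
# Toy illustration of the task/data/evaluation pattern.
# The data and the "system" below are made up for illustration only.
examples = [("Hello", "Bonjour"), ("Thank you", "Merci"), ("Goodbye", "Au revoir")]

def evaluate(system, examples):
    """Score a system by exact-match accuracy against reference translations."""
    correct = sum(system(source) == reference for source, reference in examples)
    return correct / len(examples)

# A trivial lookup-table "system" that only knows one phrase.
toy_system = lambda s: {"Hello": "Bonjour"}.get(s, "")
print(evaluate(toy_system, examples))  # 0.333...
```

Real NLP evaluation is far more involved than exact-match accuracy, but once a task has agreed-upon data and a score, a neural network can be trained to push that score up.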
The “Language Modeling” Task: Next-Word Prediction
With all that said, what task have language models been trained to perform? As it turns out, their task is next-word prediction, which has already been known for many years as “language modeling” in NLP. In other words, given some text in progress, like “This document is about natural language _____,” a language model is trained to try to predict the next word. (For our example, “processing” would be a reasonable guess.)
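As a hands-on illustration, the minimal sketch below asks a language model to score every candidate next token for our example prompt. It assumes the Hugging Face transformers library and the small, publicly available gpt2 checkpoint; any causal (left-to-right) language model would behave analogously.

```python
# A minimal sketch of next-word prediction. Assumes the Hugging Face
# `transformers` library and the small "gpt2" checkpoint; any causal
# (left-to-right) language model works the same way.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "This document is about natural language"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, num_tokens, vocab_size)

# The scores at the last position rank every candidate next token.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(token_id)!r:>15}  {prob.item():.3f}")
```

(Strictly speaking, most modern LMs predict "tokens," which are often pieces of words rather than whole words, but the idea is the same.)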
While language models have been around in NLP for a long time, it was only recently that researchers began to recognize that past a certain point, to do really well on language modeling, a language model needed to pick up certain facts and world knowledge (for example, to do well at filling in the blanks for “The Declaration of Independence was signed by the Second Continental Congress in the year ____,” or “When the boy received a birthday gift from his friends, he felt ____”).
But even today, the training of language models is still based on optimizing for low “perplexity”—that is, the same measure of a language model’s word-by-word “surprise” at the true, revealed continuation of text-in-progress that we’ve been using in NLP for decades.
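For readers who want the definition in code: perplexity is the exponentiated average negative log-probability the model assigns to each true next token, so lower perplexity means less word-by-word "surprise." A minimal sketch, reusing the model and tokenizer from the snippet above:

```python
# A sketch of the perplexity computation, reusing `model` and `tokenizer`
# from the previous snippet. Lower perplexity = less word-by-word "surprise."
import torch
import torch.nn.functional as F

def perplexity(model, tokenizer, text):
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Each position's logits predict the *next* token, so shift by one.
    log_probs = F.log_softmax(logits[0, :-1], dim=-1)
    true_next = ids[0, 1:].unsqueeze(1)
    token_log_probs = log_probs.gather(1, true_next).squeeze(1)
    return torch.exp(-token_log_probs.mean()).item()

print(perplexity(model, tokenizer, "The cat sat on the mat."))
```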
Getting from Language Models to Today’s Large Language Models
While perplexity has remained our central quantity of interest for language models, that's not to say that nothing has changed in the last few years about how language models are developed. We discuss two key changes: a move toward training on far more data, and the adoption of a type of neural network called the "transformer," whose structure enables faster training on more data (provided a model developer has access to certain hardware, specifically GPUs, with a lot of memory).
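The detail that matters most for speed is that the transformer's core operation, self-attention, processes all positions in a sequence with a few large matrix multiplications rather than one step at a time, which is exactly the kind of work GPUs excel at. A stripped-down sketch, with illustrative dimensions and random weights rather than any particular model:

```python
# A minimal sketch of scaled dot-product self-attention, the core of the
# transformer. Every position attends to every earlier position in one
# batch of matrix multiplications, which parallelizes well on GPUs.
# Dimensions and weights here are illustrative, not from any real model.
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v           # project inputs
    scores = Q @ K.T / np.sqrt(K.shape[-1])        # all position pairs at once
    # Causal mask: each position may only attend to itself and earlier ones.
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)
    return weights @ V                             # weighted mix of values

seq_len, d = 6, 8
rng = np.random.default_rng(0)
X = rng.normal(size=(seq_len, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)  # (6, 8): one vector per position
```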
We then discuss a few of the impacts of those changes and of the resulting surge in performance on language modeling. For example, we discuss how language models are now commonly used to perform other "tasks" that would have involved separately trained models a few years ago, and how the move toward larger models has contributed to current NLP models' relative inscrutability. We also talk about how the rising cost of developing new language models has considerably narrowed the field of entities and companies that can afford to produce them, the strategies those entities currently use to adapt LMs for use as products, and how difficult it is to evaluate LMs.
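As a small illustration of the first point, a single language model can be pointed at a "task" like translation purely through its prompt, with no separately trained translation model. The sketch below reuses the model and tokenizer from the earlier snippets; a model as small as gpt2 follows such prompts only unreliably, but the mechanism is the same one larger models use.

```python
# Sketch: performing a task via prompting rather than a separately trained
# model. Reuses `model` and `tokenizer` from the earlier snippets; small
# models like gpt2 follow such prompts only unreliably.
prompt = "Translate English to French.\nEnglish: The book is on the table.\nFrench:"
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=12, do_sample=False,
                            pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output_ids[0][inputs.input_ids.shape[1]:]))
```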
Implications of How Language Models Work for Common Questions About Them
Based on our earlier discussion of how language models work, we address a few common questions about using them, including how much particular prompts matter and which kinds of things are essential to check in language model output. We also offer a bit of context for discussions around whether language models count as "intelligent," though for most people considering LMs this is largely a side question.
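On the first question, a quick way to see prompt sensitivity for yourself is to compare greedy continuations of two phrasings of the same request. The prompts below are illustrative only, again reusing the model and tokenizer from the earlier snippets:

```python
# Sketch: the same question phrased two ways can yield different
# continuations. Reuses `model` and `tokenizer` from the earlier snippets.
for prompt in ["The capital of France is",
               "Q: What is the capital of France?\nA:"]:
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=6, do_sample=False,
                         pad_token_id=tokenizer.eos_token_id)
    print(repr(tokenizer.decode(out[0][ids.shape[1]:])))
```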
Where Language Models are Headed
We close with some parting words about the difficulty of making projections about the future of LMs and about the developing regulatory landscape around them. Finally, we list a few helpful actions that readers of the guide can take to contribute to a healthy AI landscape.
Between the lines
Current language models are downright perplexing! By considering the trends in the research communities that produced them, we can understand why these models behave as they do. Keeping in mind the primary task these models have been trained to accomplish, i.e., next-word prediction, also helps us understand how they work.
Many open questions about these models remain—ranging from how to steer models away from generating incorrect information to how best to customize models for different use cases to which strategies to use to democratize their development. However, we hope our tutorial can provide some helpful guidance on using and assessing LMs.
Though determining how these technologies will continue to develop is difficult, there are helpful actions that each of us can take to push that development in a positive direction. By broadening the number and type of people involved in decisions about model development and engaging in broader conversations about the role of LMs and AI in society, we can all help shape AI systems into a positive force.