🔬 Research Summary by **Nouha Dziri**, a research scientist at the Allen Institute for AI working with Yejin Choi and the Mosaic team on understanding the inner workings of language models.

[Original paper by Nouha Dziri, Ximing Lu, Melanie Sclar, Xiang Lorraine Li, Liwei Jiang, Bill Yuchen Lin, Peter West, Chandra Bhagavatula, Ronan Le Bras, Jena D. Hwang, Soumya Sanyal, Sean Welleck, Xiang Ren, Allyson Ettinger, Zaid Harchaoui, Yejin Choi]

**Overview**: Transformer language models (LMs) such as GPT-4 and ChatGPT have taken the world by storm with their exceptional performance on tasks that demand complex multi-step reasoning. Yet, these models simultaneously show failures on surprisingly trivial problems. In this work, we empirically and theoretically investigate the limits of Transformers on compositional tasks and provide novel insights into *why and when* they succeed and fail.

**Introduction**

Large-scale Transformers such as ChatGPT and GPT-4 have taken the world by storm with their incredible capabilities, even being noted as “sparks of AGI.” In stark contrast, they often struggle with simple, intuitive tasks. For instance, humans can solve 3-digit by 3-digit multiplication after learning basic calculation rules. Yet, off-the-shelf ChatGPT and GPT-4 achieve only 55% and 59% accuracy on this task, respectively. This raises several questions: Under what conditions do Transformers succeed or fail, and why? What types of errors do they make? Can Transformers uncover implicit problem-solving rules or be taught to follow reasoning paths?

We propose two hypotheses. First, Transformers solve compositional tasks by reducing multi-step reasoning into pattern matching. Despite appearing complex, some tasks may lack inherent compositionality as their solutions can be easily extracted from the input-output sequences in the training data. Second, Transformers may have inherent limitations in solving high-complexity compositional tasks due to error propagation.

To investigate our hypotheses, we formulate compositional tasks as computation graphs. These graphs break down problem-solving into smaller, functional steps. We study three straightforward compositional tasks: multiplication, logic grid puzzles, and a classic dynamic programming problem. Our findings show that Transformers’ successes are heavily linked to having seen significant portions of the required computation graph during training, reducing multi-step reasoning into subgraph matching. Moreover, we provide theoretical evidence highlighting how Transformers’ performance will rapidly decay with increased task complexity. Errors in the early stages of the computational process can lead to substantial compounding errors in subsequent steps, preventing models from finding correct solutions.
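To make the idea of a computation graph concrete, here is a minimal Python sketch for multiplication. It is a simplified illustration, not the graph construction used in the paper: the `Node` class, the `multiplication_graph` and `depth` helpers, and the choice of one-digit products followed by a chain of additions are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    op: str                      # "input", "multiply", "shift", or "add"
    value: int                   # result of applying op to the parents
    parents: list = field(default_factory=list)  # indices of parent nodes


def multiplication_graph(x: int, y: int) -> list:
    """Build a simplified computation graph for x * y (non-negative ints)."""
    nodes = []

    def add_node(op, value, parents=()):
        nodes.append(Node(op, value, list(parents)))
        return len(nodes) - 1

    # Input digits, least significant first.
    x_digits = [add_node("input", int(d)) for d in reversed(str(x))]
    y_digits = [add_node("input", int(d)) for d in reversed(str(y))]

    # One-digit products, each shifted to its place value.
    terms = []
    for i, xi in enumerate(x_digits):
        for j, yj in enumerate(y_digits):
            prod = add_node("multiply", nodes[xi].value * nodes[yj].value, (xi, yj))
            terms.append(add_node("shift", nodes[prod].value * 10 ** (i + j), (prod,)))

    # Chain of additions that accumulates the shifted terms.
    total = terms[0]
    for t in terms[1:]:
        total = add_node("add", nodes[total].value + nodes[t].value, (total, t))
    return nodes


def depth(nodes, idx):
    """Length of the longest path from any input node to nodes[idx]."""
    if not nodes[idx].parents:
        return 0
    return 1 + max(depth(nodes, p) for p in nodes[idx].parents)


graph = multiplication_graph(123, 456)
# Final value, number of nodes, and depth of the output node.
print(graph[-1].value, len(graph), depth(graph, len(graph) - 1))
```

In this toy construction, both the number of nodes and the depth of the output node grow with the number of digits, which is the sense in which larger multiplications are more complex compositional problems.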

**Key Insights**

### Model Performance Decreases as Graph Complexity Increases

To investigate the inherent problem-solving capabilities of LLMs, we pushed Transformers to their limits through zero-shot, few-shot, and fine-tuning experiments. Across all tasks, we observed a significant decline in performance from nearly perfect to zero as complexity increased. For instance, GPT-4 manages only 59% accuracy on 3-digit x 3-digit multiplication, dropping to 4% for 4-digit x 4-digit multiplication.

What if we gave the models a step-by-step solution, also known as a “scratchpad”? A few-shot scratchpad notably boosted performance, for example increasing GPT-4’s accuracy on 3×3 multiplication from 59% to 92%. However, this improvement didn’t extend to highly complex problems, where accuracy remained near 0. This may be due to insufficient exposure to task-specific data during pretraining. To address this, we trained the models extensively on a large amount of task-specific data.

Even after exhaustive training on various tasks like multiplication, dynamic programming, and puzzle logic, the models couldn’t fully master the tasks. While they excelled in *in-domain* distribution cases, they utterly failed to generalize to out-of-distribution cases. These results reveal that the autoregressive nature of Transformers poses an inherent challenge that cannot be resolved by instructing the model to generate a step-by-step solution. Models fundamentally rely on a greedy process, predicting the next word without a comprehensive global understanding of the task.

### Information Gain Explains Where Transformers Partially Excel

When Transformers fail to provide the correct answer, we observed that they often still predict part of the response correctly. For instance, in multiplication, the model may correctly guess the first and last digits while getting the rest wrong.

To investigate this phenomenon, we use relative information gain to predict surface patterns likely learned by the model. The analysis shows a high correlation between the first digits of the output and input numbers, suggesting that the model is likely learning this spurious pattern.
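As a rough illustration of this kind of analysis, the sketch below computes a relative information gain on 2-digit multiplication. It assumes the common definition (H(Y) − H(Y|X)) / H(Y), which may differ in detail from the paper's, and the function names and the choice of 2-digit operands are ours. It compares how much the leading input digits versus the trailing input digits reveal about the first digit of the product.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    counts = Counter(values)
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def relative_information_gain(xs, ys):
    """(H(Y) - H(Y|X)) / H(Y), estimated from paired samples xs, ys."""
    h_y = entropy(ys)
    if h_y == 0:
        return 0.0  # Y is constant, so there is nothing to explain
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(x, []).append(y)
    n = len(ys)
    h_y_given_x = sum(len(g) / n * entropy(g) for g in groups.values())
    return (h_y - h_y_given_x) / h_y


# All 2-digit by 2-digit multiplications (an assumed toy setting).
pairs = [(a, b) for a in range(10, 100) for b in range(10, 100)]
first_out = [str(a * b)[0] for a, b in pairs]
lead_in = [(str(a)[0], str(b)[0]) for a, b in pairs]
last_in = [(str(a)[-1], str(b)[-1]) for a, b in pairs]

print("leading input digits :", relative_information_gain(lead_in, first_out))
print("trailing input digits:", relative_information_gain(last_in, first_out))
```

A feature with a high relative information gain is one a model could exploit as a surface pattern to guess part of the answer without carrying out the full computation.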

In conclusion, Transformers can make partial guesses without executing the whole multi-step reasoning required by the task. Task-specific nuances encourage transformers to adopt shortcut answers without using the necessary multi-step reasoning. There is nothing wrong with learning shortcuts, as humans commonly use them to deliver answers swiftly. However, the key difference lies in our ability to discern when and how to use shortcuts—a skill machines seem to lack.

### Transformers Reduce Multi-Step Compositional Reasoning into Linearized Subgraph Matching

We now explore whether models’ correct predictions on unseen test data are due to learning the underlying algorithm or are instead explainable by exposure to similar training examples. For each graph, we computed the average frequency with which the partial computations needed to solve the task appear in the training data, separately for correctly and incorrectly predicted examples.

We found that Transformers’ successes are heavily linked to having seen significant portions of the required computation graph during training — suggesting that compositional behaviors may truly be pattern matching. This type of learning could be very effective when the compositional complexity of tasks is low, but it becomes less efficient when tasks are increasingly complex.
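The frequency analysis can be sketched as follows. This is a toy stand-in under strong assumptions: the paper measures subgraphs of the computation graph, whereas this example treats each one-digit product as a "partial computation", and the helper names are hypothetical.

```python
from collections import Counter


def partial_computations(x, y):
    """Hypothetical linearization: one string per one-digit product in x * y."""
    return [f"{a}*{b}" for a in str(x) for b in str(y)]


def count_training_partials(train_examples):
    """Count how often each partial computation appears in the training set."""
    counts = Counter()
    for x, y in train_examples:
        counts.update(partial_computations(x, y))
    return counts


def average_seen_frequency(example, train_counts):
    """Average training frequency of the partial computations of one example."""
    parts = partial_computations(*example)
    return sum(train_counts[p] for p in parts) / len(parts)


train_counts = count_training_partials([(12, 34)])
for example in [(12, 34), (12, 39), (56, 78)]:
    print(example, average_seen_frequency(example, train_counts))
```

Comparing this average between correctly and incorrectly predicted test examples is what indicates whether success tracks exposure to the relevant partial computations during training.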

#### What Types of Errors Do Transformers Make at Different Reasoning Depths

To better understand where Transformers fall short, we analyze the types of errors that Transformers make for nodes at different layers in the computation graph. Our analyses show that models can correctly perform single-step reasoning, potentially due to memorizing such single-step operations during training, but fail to plan and compose several of these steps into overall correct reasoning.

#### Error Propagations: The Theoretical Limits

So far, we have highlighted the limitations of current Transformers in handling complex, multi-step reasoning tasks: errors rapidly escalate as the problem size grows. Compositional task algorithms often involve multiple independent applications of a function (width) and/or iterated applications of the same function (depth). In executing such algorithms, Transformers act as estimators for these functions. Specifically, Transformers struggle with problems featuring large graph width and depth. Our theoretical insights explain why these models perform significantly worse in compositional tasks as the problem size increases. We analyze the probability of Transformers reaching the correct answer as the problem size grows, demonstrating that, under reasonable assumptions, the probability of incorrect predictions converges exponentially to ≈ 1 for abstract compositional tasks.
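A simplified calculation shows why the decay is exponential. This is an illustrative sketch, not the paper's theorem; the symbols $n$ and $\varepsilon$ and the independence assumption are ours. Suppose a task requires $n$ sequential steps, each step is performed correctly with probability at most $1 - \varepsilon$ for some fixed $\varepsilon > 0$, independently of the others, and the final answer is correct only if every step is correct. Then:

$$
P(\text{correct answer}) \le (1 - \varepsilon)^{n} \le e^{-\varepsilon n},
\qquad
P(\text{incorrect answer}) \ge 1 - e^{-\varepsilon n} \xrightarrow[n \to \infty]{} 1 .
$$

Even a small per-step error rate therefore compounds quickly: under these assumptions, the chance of an error-free run shrinks geometrically with the number of steps.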

## Between the lines

### Collapsed Compositionality and Robustness Implications

Transformers today demonstrate undeniably powerful empirical results. Yet, our study suggests that Transformers may have fundamental weaknesses in certain intellectual tasks that require true multi-step compositional operations. Our careful study based on the computation graph and analyses demonstrates that Transformers can often solve multi-step compositional problems by collapsing the depth of the compositional operations via analogical pattern matching. More broadly, our findings suggest that the strong performance of Transformers should be taken with a grain of salt: despite initially appearing challenging, certain tasks may not possess the inherent compositionality they appear to have. This is because desired solutions could be readily derived from input-output sequences present in the training data, allowing shortcut pattern matching to produce acceptable solutions. However, our study shows that such an approach can ultimately result in poor generalization.

Building on these findings, we suggest several empirical strategies for harnessing the potential of Transformers. First, Transformers may be best suited for compositional tasks where evaluation metrics can afford some leniency, for example, finding approximate solutions that do not require executing the whole graph, such as identifying the most significant digit in a multiplication. Second, we suggest augmenting Transformers with planning modules as well as using refinement methods that can iteratively improve their generations.

### Call for Broad Participation in the Investigation of the Limitations

We acknowledge that due to our compute budget constraints and limited access to the largest language models, such as GPT-4, we cannot push the empirical limits of Transformers even further in terms of training data size and number of epochs. We invite the broader research community, particularly those with more extensive resources at their disposal, to investigate these possibilities further.