🔬 Research Summary by Alexandra Sasha Luccioni and Anna Rogers.
Dr. Sasha Luccioni is a Research Scientist and Climate Lead at Hugging Face; her work focuses on better understanding the societal and environmental impacts of AI models, datasets, and systems.
Dr. Anna Rogers is an assistant professor at IT University of Copenhagen, working on the analysis and interpretability of NLP models, NLP research methodology, and AI and society.
[Original paper by Alexandra Sasha Luccioni and Anna Rogers]
Overview: Much of the recent discourse within the AI research community has been centered around Large Language Models (LLMs), their functionality and potential. Not only do we not have a precise definition of LLMs, but much of what’s been said relies on claims and assumptions worth re-examining. We contribute a definition of LLMs and discuss some of the assumptions about them and the existing evidence.
Introduction
LLMs have been all over the news for the past year, both for their ability to tell knock-knock jokes and answer questions, and for their visible mistakes, like the AI-generated recipe for chlorine gas and the made-up legal cases that got a lawyer fined. They are presented as ‘general purpose technologies’ [1] with ‘emergent properties’ [2], but what are they, and how much of this is true?
Key Insights
What’s an LLM?
The very term ‘Large Language Model’ is ill-defined and often used interchangeably with other, equally ill-defined, terms like ‘foundation model’ [3] and ‘frontier model’ [4]. We propose a definition based on three criteria:
- LLMs model text and can be used to generate it based on input context: the text can be in any modality — as characters, audio waves, pixels, etc., and the output is generated by selecting the tokens that are the most likely, given the partial context provided as input.
- LLMs receive large-scale pretraining, where ‘large-scale’ refers to the pre-training data rather than the number of parameters. We propose setting this threshold to a billion tokens (inspired by Chelba et al. [5]).
- LLMs make inferences based on transfer learning: LLMs are meant to be adaptable to many tasks on the assumption that their pre-training encodes information that can then be leveraged in other tasks. This can be done in different ways, e.g., by fine-tuning, as in BERT [6], or prompting, as in GPT-3 [7].
The above criteria include models like BERT and the GPT series (and even their distilled versions) but exclude static representations like word2vec, which do not take context into account at inference time.
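To make the first criterion concrete, here is a minimal sketch using the Hugging Face transformers library (gpt2 is an arbitrary illustrative checkpoint, not a model singled out in the paper) of how a causal language model scores candidate next tokens given a partial context and generates text by repeatedly picking likely continuations:

```python
# Minimal sketch of criterion 1: a language model assigns probabilities to
# candidate next tokens given a partial context. "gpt2" is only an example.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

context = "The capital of France is"
inputs = tokenizer(context, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

# Probability distribution over the next token, given the partial context.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(token_id):>10s}  {prob:.3f}")

# Generation = repeatedly selecting likely tokens (greedy decoding here).
output = model.generate(**inputs, max_new_tokens=10, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```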
Fact-checking LLMs
LLMs are SOTA
LLM-based approaches are largely perceived as the current state of the art (SOTA) across NLP benchmarks, but such statements should come with a few footnotes. For example, OpenAI’s claim that GPT-4 outperforms unspecified fine-tuned models on several verbal reasoning tasks [8] was made without enough detail to verify it. Other works, such as PaLM [9] and LLaMA [10], report many results across multiple benchmarks without digging deeper into where the models fail and what this means. Looking at various task leaderboards on Papers with Code, we can see simple embedding-based approaches [11] succeeding at many of them. Finally, many benchmark datasets are present in LLM training data [12], which makes evaluation results on those benchmarks dubious.
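Non-LLM baselines can also be surprisingly competitive and cheap to run. The sketch below is a hedged illustration only: a TF-IDF plus logistic regression classifier on a toy scikit-learn dataset, not the embedding-concatenation approach of [11], just the kind of simple baseline that LLM papers often leave out of their comparisons.

```python
# Hedged sketch: a deliberately simple non-LLM baseline on a toy text
# classification task. Dataset and model choices are illustrative only.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

categories = ["sci.space", "rec.autos"]  # illustrative binary task
train = fetch_20newsgroups(subset="train", categories=categories)
test = fetch_20newsgroups(subset="test", categories=categories)

baseline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
baseline.fit(train.data, train.target)
print("baseline accuracy:", accuracy_score(test.target, baseline.predict(test.data)))
```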
Bigger is Better
Scaling has played a central role in the success of LLMs, starting with the ‘scaling laws’ paper [13]. However, it is hard to disentangle what exactly is responsible for the improvements: the size of the model, the number of training epochs, the amount of data, or its quality, a factor that was not explicitly discussed in the scaling laws paper but is now drawing much effort and attention. There is also increasing skepticism about the scaling hypothesis, starting with ‘efficient scaling’ proposals for Transformers [14] and further explored via initiatives like the Inverse Scaling Prize. For many tasks, such as logical reasoning and pattern matching, performance does not improve with model size.
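For reference, the ‘scaling laws’ of [13] model loss as a power law in quantities like parameter count. The sketch below fits such a power law to synthetic (made-up) loss numbers, purely to make the functional form, and the extrapolation step it licenses, concrete.

```python
# Hedged sketch: the power-law form L(N) ≈ (Nc / N)**alpha from the
# scaling-laws paper [13], fitted here to synthetic (made-up) numbers.
import numpy as np

# Hypothetical model sizes (parameters) and validation losses.
N = np.array([1e7, 1e8, 1e9, 1e10])
L = np.array([5.0, 4.1, 3.4, 2.8])  # illustrative values, not real measurements

# A power law is a straight line in log-log space: log L = intercept - alpha * log N.
slope, intercept = np.polyfit(np.log(N), np.log(L), deg=1)
alpha = -slope  # loss goes down as N goes up, so the slope is negative
print(f"fitted exponent alpha ≈ {alpha:.3f}")

# Extrapolating the fit to a larger model is exactly the step that
# inverse-scaling results call into question for some tasks.
predicted_loss = np.exp(intercept) * 1e11 ** (-alpha)
print(f"predicted loss at 1e11 params ≈ {predicted_loss:.2f}")
```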
LLMs are robust
When people talk about brittle systems, many remember the early symbolic AI programs, which were rule-based and hence could not process anything outside the scope of their pre-defined knowledge. Did deep learning systems overcome that? Yes, unfamiliar inputs no longer break them completely. But even the latest systems still make errors a human wouldn’t make [15-17]. We know that fine-tuned models may learn shortcuts [18-21]: undesirable spurious correlations picked up from the training data. We also know that slight variations in the phrasing of a prompt can lead to very different LLM output [22-24]: this phenomenon affected all 30 LLMs in a recent large-scale evaluation [25]. François Chollet [26] questions whether deep learning systems can ever overcome this kind of brittleness: according to him, they are “unable to make sense of situations that deviate slightly from their training data or the assumptions of their creators” (p.3).
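Prompt sensitivity in particular is easy to probe for yourself. The sketch below is a hedged illustration (google/flan-t5-small is just a small instruction-tuned model picked for convenience): it asks the same sentiment question in three paraphrased ways, and a robust system would give the same answer to all three.

```python
# Hedged sketch: checking prompt sensitivity by paraphrasing the same request.
# "google/flan-t5-small" is only an illustrative small instruction-tuned model.
from transformers import pipeline

generator = pipeline("text2text-generation", model="google/flan-t5-small")

review = "The plot was predictable, but the acting saved the film."
prompts = [
    f"Is the following review positive or negative? {review}",
    f"Review: {review}\nSentiment (positive/negative):",
    f"Decide whether this movie review is positive or negative.\n{review}",
]

# If the model were robust, paraphrases of the same question would agree.
for prompt in prompts:
    answer = generator(prompt, max_new_tokens=5)[0]["generated_text"]
    print(repr(answer))
```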
LLMs exhibit “emergent properties”
This is one of the claims that gets LLMs the most attention: if a learning system acquires something it was not supposed to, that can sound scary. But if “emergent properties” are defined as something the system acquires without explicit instruction [3], then we have to prove that by examining the training data. This has simply never been done, because the training datasets of LLMs are too big, and we do not even have the methodology to systematically check how similar the training evidence is to the test examples (beyond simple string matching). And for commercial LLMs such as GPT-4, we cannot discuss their “emergent properties” scientifically at all, because their training data is not disclosed. Checking whether a particular example exists on the internet is not good enough either, because of all the data previously submitted to OpenAI for evaluating GPT-3, for trying to trick ChatGPT, etc.
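Even the ‘simple string matching’ mentioned above is worth spelling out, since it marks the ceiling of what we can currently check. The hedged sketch below implements the basic idea, n-gram overlap between a test example and a chunk of training text; it catches verbatim leakage but says nothing about paraphrases, translations, or near-duplicates.

```python
# Hedged sketch: the "simple string matching" idea for contamination checks.
# An n-gram overlap test only detects (near-)verbatim leakage of a benchmark
# example into training text; paraphrases and translations slip through.
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(test_example: str, training_text: str, n: int = 8) -> float:
    """Fraction of the test example's n-grams that appear in the training text."""
    test_ngrams = ngrams(test_example, n)
    if not test_ngrams:
        return 0.0
    return len(test_ngrams & ngrams(training_text, n)) / len(test_ngrams)

# Toy usage with made-up strings; a real check would stream terabytes of text.
benchmark_item = "What is the boiling point of water at sea level in celsius?"
training_chunk = "Trivia: the boiling point of water at sea level in celsius is 100."
print(overlap_ratio(benchmark_item, training_chunk, n=5))
```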
Between the lines
In this position paper, we argue that LLMs are not well-defined as a technology, and that many of the claims about their capabilities are overblown or unsupported. The existing research suggests that they are not robust and not always state-of-the-art compared to other approaches, and that their success does not come purely from their scale. Furthermore, if ‘emergent abilities’ are defined as abilities the model was not explicitly taught, we do not even have the methodology to ascertain that.
References
[1] Eloundou, T., Manning, S., Mishkin, P., & Rock, D. (2023). GPTs are GPTs: An early look at the labor market impact potential of large language models. arXiv preprint arXiv:2303.10130.
[2] Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., … & Fedus, W. (2022). Emergent abilities of large language models. arXiv preprint arXiv:2206.07682
[3] Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., … & Liang, P. (2021). On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258.
[4] Anderljung, M., Barnhart, J., Leung, J., Korinek, A., O’Keefe, C., Whittlestone, J., … & Wolf, K. (2023). Frontier AI regulation: Managing emerging risks to public safety. arXiv preprint arXiv:2307.03718.
[5] Chelba, C., Mikolov, T., Schuster, M., Ge, Q., Brants, T., Koehn, P., & Robinson, T. (2013). One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005.
[6] Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
[7] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., … & Amodei, D. (2020). Language models are few-shot learners. Advances in neural information processing systems, 33, 1877-1901.
[8] OpenAI (2023). GPT-4 Technical Report. arXiv preprint arXiv:2303.08774
[9] Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., … & Fiedel, N. (2022). PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311.
[10] Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M. A., Lacroix, T., … & Lample, G. (2023). LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
[11] Wang, X., Jiang, Y., Bach, N., Wang, T., Huang, Z., Huang, F., & Tu, K. (2020). Automated concatenation of embeddings for structured prediction. arXiv preprint arXiv:2010.05006.
[12] Dodge, J., Sap, M., Marasović, A., Agnew, W., Ilharco, G., Groeneveld, D., … & Gardner, M. (2021). Documenting large webtext corpora: A case study on the colossal clean crawled corpus. arXiv preprint arXiv:2104.08758.
[13] Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., … & Amodei, D. (2020). Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
[14] Tay, Y., Dehghani, M., Rao, J., Fedus, W., Abnar, S., Chung, H. W., … & Metzler, D. (2021). Scale efficiently: Insights from pre-training and fine-tuning transformers. arXiv preprint arXiv:2109.10686.
[15] Chang, T. A., & Bergen, B. K. (2023). Language model behavior: A comprehensive survey. arXiv preprint arXiv:2303.11504.
[16] Lee, P., Bubeck, S., & Petro, J. (2023). Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine. New England Journal of Medicine, 388(13), 1233-1239.
[17] Gan, C., & Mori, T. (2023). Sensitivity and robustness of large language models to prompt template in Japanese text classification tasks.
[18] McCoy, R. T., Pavlick, E., & Linzen, T. (2019). Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. arXiv preprint arXiv:1902.01007.
[19] Rogers, A., Kovaleva, O., Downey, M., & Rumshisky, A. (2020). Getting closer to AI complete question answering: A set of prerequisite real tasks. In Proceedings of the AAAI Conference on Artificial Intelligence, 34, 8722-8731.
[20] Branco, R., Branco, A., Rodrigues, J., & Silva, J. (2021). Shortcutted commonsense: Data spuriousness in deep learning of commonsense reasoning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 1504-1521.
[21] Choudhury, S. R., Rogers, A., & Augenstein, I. (2022). Machine reading, fast and slow: When do models “understand” language? arXiv preprint arXiv:2209.07430.
[22] Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., & Neubig, G. (2023). Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 55(9), 1-35.
[23] Lu, Y., Bartolo, M., Moore, A., Riedel, S., & Stenetorp, P. (2021). Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. arXiv preprint arXiv:2104.08786.
[24] Zhao, Z., Wallace, E., Feng, S., Klein, D., & Singh, S. (2021). Calibrate before use: Improving few-shot performance of language models. In Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 12697-12706.
[25] Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., Zhang, Y., Narayanan, D., Wu, Y., Kumar, A., et al. (2022). Holistic evaluation of language models. arXiv preprint arXiv:2211.09110.
[26] Chollet, F. (2019). On the measure of intelligence.