🔬 Column by Julia Anderson, a writer and conversational UX designer exploring how technology can make us better humans.
Part of the ongoing Like Talking to a Person series
During conversations, people sometimes finish each other’s sentences. With the advent of LLMs, or large language models, that ability is now available to machines. Large language models are distinguished from other language models by their scale: they typically weigh in at several gigabytes and contain billions more parameters than their predecessors. By learning from colossal chunks of data, which could come as text, image, or video, these LLMs are poised to notice a thing or two about how people communicate.
Conversational AI products, such as chatbots and voice assistants, are the prime beneficiaries of this technology. OpenAI’s GPT-3, for example, can generate text or code from short prompts entered by users, and OpenAI recently released ChatGPT, a version optimized for dialogue. This is one of many models driving the field of generative AI, where the “text2anything” phenomenon lets people describe an image or idea in a few words and have AI output its best guesses. Using this capability, bots and assistants could generate creative, useful responses for anyone they converse with.
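To make that concrete, here is a minimal sketch of prompting GPT-3 for a chatbot-style reply. It assumes the pre-1.0 `openai` Python package and an API key set in the environment; the model name and prompt are purely illustrative, not a recommended setup.

```python
# Minimal sketch: asking GPT-3 to draft a chatbot reply from a short prompt.
# Assumes the pre-1.0 `openai` package and an OPENAI_API_KEY environment variable.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

response = openai.Completion.create(
    model="text-davinci-003",  # an instruction-following GPT-3 model (illustrative choice)
    prompt="A user writes: 'My order never arrived.' Draft a short, empathetic reply.",
    max_tokens=60,
    temperature=0.7,  # higher values produce more varied wording
)

print(response.choices[0].text.strip())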
However, LLMs have their faults. Beyond the lack of transparency in how these models are trained, the costs are typically exorbitant for all but massive enterprises. There are also several documented instances of them fabricating scientific knowledge and promoting discriminatory ideas. While this technology is promising, designers of conversational AI products must carefully assess what LLMs can do and whether that creates a beneficial user experience.
Knowledge Isn’t Everything
Possessing knowledge may make someone, or in this case something, more informed, but it does not make them an effective communicator. This is the primary problem with LLMs as they relate to conversational AI.
While LLMs could serve as a solid knowledge base for a chatbot, they must be fine-tuned on specific user data to give appropriate responses. Unlike prior language models like BERT, where an algorithm predicts words missing from a sentence, GPT-3 and other LLMs predict what comes next in a given context. That capability makes them excellent at generating responses.
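The difference is easy to see with the Hugging Face `transformers` pipelines; the sketch below uses small public checkpoints as stand-ins, and the example sentences are made up.

```python
# Sketch of the two prediction styles described above, using Hugging Face pipelines.
from transformers import pipeline

# BERT-style: fill in a word masked out of an existing sentence.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("I took my dog to the [MASK] this morning.")[0]["token_str"])

# GPT-style: continue the text, predicting what comes next in context.
generate = pipeline("text-generation", model="gpt2")
print(generate("I took my dog to the vet this morning and",
               max_new_tokens=20)[0]["generated_text"])
```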
Stanford research evaluated 30 prominent language models and found that fine-tuning them with human feedback produced more accurate and fairer responses. The findings also emphasize the importance of continual evaluation to understand how these models adapt once they are exposed to more user prompts and data. Without that tracking, the risk of generating misinformation increases.
Mislead and Misinform
When users think of systems as a reflection of human knowledge, especially ones that use speech as an interface, the systems have a responsibility to address people fairly.
The proliferation of social stereotypes and the ease of generating toxic statements are worrying side effects of LLMs. While Gopher, DeepMind’s language model, performs exceedingly well in summarizing biological concepts, it tends to respond with religious, racial, and occupational stereotypes.
Inferring false associations is a massive UX shortcoming. At best, a chatbot tells a user they seem like a cat person when they are a dog person. At worst, a chatbot suggests to a patient describing their medical problems that they should “kill themselves.” Disseminating these suggestions harms users seeking advice and poses a risk to anyone expecting factual information in medicine or law. And that is before touching the rise of fake personas or agents scamming unsuspecting users.
Then there is the concern of what knowledge is excluded from LLMs. English dominates the training data of most models, but an English-heavy corpus may capture only one interpretation of history. If models are trained primarily on research papers or published works rather than casual conversation, several dialects and colloquialisms may also be missing. When it comes to talking to a user, this limits how personal and relatable a bot can be, blemishing the experience itself.
Looking Into Language
Increased transparency through open-sourced LLMs could mitigate these risks. Not only does this open up the conversation to new perspectives and languages, but it also decreases the costs associated with training such large models.
BLOOM, an open-source model from BigScience, will be one of the biggest of its kind. Researchers hope that it will be trained on a wider variety of data than existing LLMs, many of whose inner workings remain unknown because they are not open-source. (Read the series hosted by MAIEI from the developers of BLOOM.)
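Because BLOOM’s weights are openly released, anyone can download a checkpoint and poke at it directly, which is part of what transparency buys. A small sketch, assuming the `transformers` library and the 560M-parameter variant (the full model is far larger and needs serious hardware):

```python
# Sketch: loading an openly released BLOOM checkpoint for local experimentation.
from transformers import pipeline

bloom = pipeline("text-generation", model="bigscience/bloom-560m")
print(bloom("Conversational AI should", max_new_tokens=25)[0]["generated_text"])
```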
Language is a complex mechanism, and LLMs are still testing its levers. Just as people often misinterpret a joke, these models carry similar blind spots. After all, they were trained on human experience, filled with grammar and words that confuse even native speakers. For LLMs to be helpful in conversational AI, they should understand not only how people talk but also how they think.