🔬 Research Summary by Stefanie Urchs, a Computer Science Ph.D. student at the Hochschule München University of Applied Sciences, deeply interested in interdisciplinary approaches to natural language processing.
[Original paper by Stefanie Urchs, Veronika Thurner, Matthias Aßenmacher, Christian Heumann, and Stephanie Thiemichen]
Overview: ChatGPT opened the world of large language models to non-IT users, who tend to use the system as an all-knowing chatbot without regard for the pitfalls of the technology. In this paper, the authors examine how gender-biased ChatGPT's responses are and what other pitfalls await an unprepared non-IT user.
Introduction
By introducing ChatGPT with its intuitive user interface (UI), OpenAI opened the world of state-of-the-art natural language processing to non-IT users. Users do not need a computer science background to interact with the system. Instead, they have a natural language conversation in the UI. Many users utilize the system to help with their daily work: writing texts, checking grammar and spelling, and even fact-checking their work. However, non-IT users tend to see the system as a “magical box” that knows all the answers, believing that because machines do not make mistakes, neither does ChatGPT. This lack of critical usage is problematic in everyday use.
We prompt ChatGPT in German and English from a neutral, female, and male perspective to examine the differences in responses. After broadly prompting the system to define the problem space, we inspect three prompts in depth. ChatGPT is a good tool for drafting texts. However, it still has problems with gender-neutral language and tends to overcorrect if a prompt mentions gender. In the end, we still need humans to check the work of machines.
Key Insights
Gender Bias in Large Language Models
What do we mean when discussing bias?
It is important to define the term bias properly in order to detect biases in text. In machine learning, specifically in classification tasks, bias is defined as the preference of a model towards a certain class. In natural language, however, bias has a different definition: “Biases are preconceived notions based on beliefs, attitudes, and/or stereotypes about people pertaining to certain social categories that can be implicit or explicit.” (Mateo et al., 2020). When writing text, humans tend to incorporate these notions, and our attitudes towards different biases change over time. For example, our understanding of the role of women in society or of the LGBTQIA+ community differs today from thirty years ago.
Where’s the problem?
One of the premises of machine learning is that more data is better. Therefore, large language models (LLMs) are trained on as much textual data as possible. OpenAI has not disclosed what data was used to train the model underlying ChatGPT. However, the training data likely includes a great deal of text from the web. This leads to two problems: first, the web is predominantly male, white, and US American. Second, as mentioned above, our values have evolved, yet we train our modern-day language models on texts that are, in part, several decades old. This is problematic because these models learn to choose the statistically most likely next word, conditioned on the words they have already seen (the prompt) or generated (the response so far). Consequently, if the model is trained on a lot of text in which women and housework appear in close proximity, it will learn that these concepts belong together. The model therefore reproduces all biases contained in its training data.
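This next-word mechanism is easy to observe with a small open model. The sketch below is our illustration, not part of the original study: it assumes the Hugging Face transformers library and the publicly available GPT-2 model (ChatGPT's own model and training data are not accessible this way) and compares the most probable continuations of a female- and a male-coded prompt.

```python
# A minimal sketch of how a language model picks the "statistically most
# likely next word", and how that can surface learned associations.
# Assumption: Hugging Face `transformers` + `torch` installed; GPT-2 is
# used as a stand-in, since ChatGPT's model is not publicly available.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def top_next_words(prompt: str, k: int = 5):
    """Return the k most probable next tokens for a prompt."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits          # (1, seq_len, vocab_size)
    probs = torch.softmax(logits[0, -1], dim=-1)  # distribution over next token
    top = torch.topk(probs, k)
    return [(tokenizer.decode(int(i)).strip(), round(p.item(), 3))
            for i, p in zip(top.indices, top.values)]

# If gendered contexts yield systematically different continuations,
# the model has absorbed those associations from its training data.
for prompt in ("The woman worked as a", "The man worked as a"):
    print(prompt, "->", top_next_words(prompt))
```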
Detecting Gender Bias in ChatGPT
Our Approach
We analyze ChatGPT responses from the point of view of a non-IT user working in university communications, prompting the system in German and English from a neutral, a female, and a male perspective. First, we probe the system with open-ended, neutrally formulated prompts to identify possibly problematic responses. Even a single occurrence of controversial behavior can be problematic for a user who does not check a response thoroughly before publishing it. Moreover, because the system is used very frequently by many users and thus generates a tremendous number of responses, problematic behavior is bound to recur. Subsequently, we choose two prompts to investigate further. In contrast to the first probing, we now repeat each prompt ten times to examine whether problems emerge at scale. We analyze these responses for the words they use, the frequency of female- and male-coded words, and text length, as sketched below.
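To make the analysis step concrete, here is a minimal sketch of the kind of aggregation described above. The word lists are short illustrative examples (inspired by gender-coded wordlists such as Gaucher et al., 2011), not the lists used in the paper, and the response strings are placeholders for texts collected from the ChatGPT UI.

```python
# Sketch: count female-/male-coded words and measure text length
# across repeated responses to the same prompt.
import re
from statistics import mean

# Illustrative word lists only; the paper's actual coded-word lists differ.
FEMALE_CODED = {"supportive", "collaborative", "nurturing", "together", "community"}
MALE_CODED = {"ambitious", "competitive", "leader", "dominant", "independent"}

def analyze(response: str) -> dict:
    """Tokenize a response and count length and gender-coded words."""
    words = re.findall(r"[a-zäöüß]+", response.lower())  # handles German umlauts
    return {
        "length": len(words),
        "female_coded": sum(w in FEMALE_CODED for w in words),
        "male_coded": sum(w in MALE_CODED for w in words),
    }

# Placeholder responses; in the study, each prompt is repeated ten times.
responses = [
    "A supportive and collaborative professor who builds community ...",
    "An ambitious leader in her field who values independent work ...",
]
stats = [analyze(r) for r in responses]
print("mean length:", mean(s["length"] for s in stats))
print("female-coded total:", sum(s["female_coded"] for s in stats))
print("male-coded total:", sum(s["male_coded"] for s in stats))
```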
Findings
During the first probing, we found that ChatGPT excels in English but defaults to US American English. German responses sometimes lack grammatical correctness; however, these mistakes are not obvious when skimming a response. Furthermore, ChatGPT struggles with the grammar of German gender-neutral language, whereas in English it can use the gender-neutral singular “they”. When gender is explicitly added to a prompt, ChatGPT tends to bring up topics such as fairness and equality, which are not mentioned in responses to neutral prompts. Unfortunately, ChatGPT has no concept of male and female and uses the same reasoning for both: women should become professors to elevate other women, and men should become professors so that young men know it is possible to excel in STEM (science, technology, engineering, and mathematics) fields.
When prompting the system more in-depth, we found that ChatGPT hallucinates information into generic prompts. Generating exclusively female professors (in both languages) for neutral prompts makes the system look biased toward female content. Furthermore, the system displays a bias toward STEM-related research fields, while the responses overall use relatively few gender-coded words and do not reinforce common language biases. German and English responses emphasized similar content, which makes the system suitable for bilingual text generation.
ChatGPT is useful for helping non-IT users draft texts for their daily work. However, it is crucial to thoroughly check the system’s responses for biases and syntactic and grammatical mistakes.
Between the lines
Because we analyze ChatGPT from the perspective of a non-IT user working in university communications, the scope of possible prompts was limited, which led to only subtle differences between the perspectives. To explore the differences between gendered responses further, more general prompts should be examined.
Our research, which is set in a formal context, yields an unexpectedly positive outcome regarding gender bias in ChatGPT. However, changing the context from formal university communications to a more informal one might lead to more biased responses.