🔬 Research Summary by Alejandro Cuevas Villalba, Ph.D. student in Computer Science at Carnegie Mellon University, focusing on measuring social influence and improving reputation systems.
[Original paper by Alejandro Cuevas Villalba, Eva M. Brown, Jennifer V. Scurrell, Jason Entenmann, and Madeleine I. G. Daepp]
Overview: Quantitative data collection methods (e.g., surveys) often stand at odds with qualitative methods (e.g., interviews). Tools such as surveys enable researchers to collect and analyze data at scale but can constrain the depth and breadth of participants’ answers. On the other hand, tools such as interviews facilitate rich and nuanced data collection, though at the expense of scale. Undoubtedly, the advent of large language models (LLMs) offers unique opportunities to develop new data collection methods. However, should we think of LLMs as survey enhancers? Or should we think of them as automated interviewers?
Introduction
Advances in technology have often ushered in new eras of data collection methodologies. Just as the adoption of telephones across households displaced mail-based surveys, the Internet made web surveys the norm. Today, the era of personal AI assistants is set to usher in a new wave of data collection tools.
While conversational agents (e.g., chatbots) have been around for several years, we may be finally reaching—and quickly surpassing—the point where these tools are actually a pleasure to interact with. Companies like OpenAI have managed to create remarkable user experiences and greatly reduced the burden of developing and deploying conversational agents.
Thus, a natural question emerged in the social sciences: how can we use AI assistants to facilitate data collection? We explored this question by designing and deploying a conversational agent to conduct a study with 399 participants. Our findings suggest that conversational agents offer numerous benefits over traditional surveys. Interestingly, these tools still fall short when compared to in-person interviews. Most surprising, however, was that managing participants’ expectations turned out to be a key element of the methodology.
Key Insights
The limitations of surveys, interviews, and chatbots
Surveys have become a popular tool for data collection, but their prevalence is not due to their ability to extract superior insights. Rather, the ease of analyzing the data after collection is what turned them into the de facto tools of human data collection. However, surveys are not great exploratory tools: composing questions, especially closed-ended ones, is difficult when we don’t yet know what to ask. This is where methods like interviews or focus groups excel, because they allow interviewers to probe participants (i.e., ask follow-up questions) on areas they find interesting. These methods are flexible yet hard to scale. With great effort, an interviewer can talk to eight participants a day, and the brunt of the work lies in the analysis, where each hour of interview can require more than three hours of analysis.
The gap between the two methods is, and has always been, quite broad. Researchers have long sought methods that offer more flexibility, either by allowing interviews to scale or by enhancing the richness of surveys. For quite some time, chatbots were seen as a way to bridge the gap. Nonetheless, the promise of chatbots was met with great disappointment. Whether it was the Q&A system for an airline or a robot receptionist answering a bank’s support line, most of us have experienced the frustration of interacting with these so-called conversational agents. Despite significant advances in machine learning, and neural networks in particular, a pleasant interaction with a chatbot remained elusive.
Then came the LLMs, with skyrocketing popularity thanks to OpenAI’s ChatGPT. This easy-to-use, know-it-all conversational agent captivated the world and restored confidence in chatbots. Not only are they easy to use, but recently they have become even easier to deploy. With the new functionalities announced by OpenAI, users can now set up their own custom chatbots with a few lines of code. Yes, this means you can deploy a chatbot that talks like you or about niche things that you care about; the sky (or compute power) is the limit.
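To make this concrete, here is a minimal sketch of such a topic-specific chatbot built on the OpenAI Python SDK; the model name, system prompt, and loop structure are illustrative assumptions rather than the exact setup used in the study.

```python
# Minimal sketch of a custom conversational agent using the OpenAI Python SDK
# (assumes openai>=1.0 and an OPENAI_API_KEY in the environment; the model
# name and system prompt are illustrative placeholders).
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You are an interviewer studying people's views on AI alignment. "
    "Ask one open-ended question at a time and follow up on interesting answers."
)

def chat():
    history = [{"role": "system", "content": SYSTEM_PROMPT}]
    while True:
        user_msg = input("You: ")
        if user_msg.lower() in {"quit", "exit"}:
            break
        history.append({"role": "user", "content": user_msg})
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=history,
        )
        reply = response.choices[0].message.content
        history.append({"role": "assistant", "content": reply})
        print(f"Bot: {reply}")

if __name__ == "__main__":
    chat()
```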
And yes, this also means we can use chatbots to create survey and interview questions, or have them conduct the surveys and interviews themselves. But can they really do these tasks well? This is what we set out to study. To do this, we designed three chatbots and recruited 399 participants for a study about AI alignment. We split participants into three groups, each interacting with a different chatbot, and asked them to complete a survey about their experience.
Our study approach
Our study had three stages. In the first, participants took a multiple-choice survey on AI alignment. The survey served two purposes: it primed participants on the topic at hand, and it provided a benchmark against which we could assess our interpretations of the conversations (more on this later). After the first survey, participants were randomly split into three groups, each assigned a different chatbot design. The chatbots were programmed to ask questions about AI alignment. Our baseline chatbot asked only hardcoded questions, whereas the other two relied on LLMs to behave more intelligently.  Lastly, participants completed an exit survey on their experience.
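The paper does not ship code, but the design difference between the conditions can be illustrated with a hypothetical sketch: the baseline walks through a fixed script, while an LLM-backed variant generates a follow-up probe from the participant’s previous answer. The prompts, model name, and function names below are assumptions for illustration only, not the authors’ implementation.

```python
# Hypothetical sketch contrasting a scripted baseline with an LLM-driven
# interviewer (illustrative only; assumes openai>=1.0 and an API key).
from openai import OpenAI

client = OpenAI()

SCRIPTED_QUESTIONS = [
    "What does 'AI alignment' mean to you?",
    "What risks of misaligned AI concern you most?",
]

def scripted_next_question(turn: int) -> str:
    """Baseline chatbot: return the next hardcoded question in the script."""
    return SCRIPTED_QUESTIONS[turn]

def llm_next_question(question: str, answer: str) -> str:
    """LLM chatbot: generate a follow-up probe from the previous answer."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": "You are an interviewer on AI alignment."},
            {"role": "user", "content": (
                f"Question: {question}\nAnswer: {answer}\n"
                "Write one short follow-up question that probes deeper."
            )},
        ],
    )
    return response.choices[0].message.content
```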
We found substantial evidence that researchers have much to gain from using chatbots in place of surveys. Interestingly, this was not the case when comparing chatbots to interviews. Compared to surveys, participants engaged more and rated their experience significantly higher. When comparing the chatbot to in-person interviews, however, participants still preferred a human-to-human interview.
Among the most interesting key insights was an accidental discovery in our methodology. In the survey that followed the chatbot interaction, we referred to the chatbot as an “AI interviewer.” This framing was particularly salient to participants in the baseline group, several of whom expressed frustration and disappointment that the chatbot did not seem intelligent. This effect was absent in the other two groups; rather, those participants said they much preferred their interaction with the chatbot over traditional surveys.
Between the lines
The ease of deployment and participants’ enjoyment bode well for using chatbots as data collection instruments in user studies. In the near term, however, we should consider them survey augmenters rather than replacements for in-person interviews. Furthermore, the missing piece of the puzzle in our work is scaling the analysis of the collected data. Recent work has shown that LLMs may also assist in analyzing the chatlogs. Although outside the scope of this paper, we found encouraging preliminary results when analyzing the collected data with ChatGPT.
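As an illustration of what such LLM-assisted analysis might look like, here is a minimal sketch that asks a model to extract themes from a single transcript; the prompt and model name are assumptions, not the analysis pipeline reported in the paper.

```python
# Minimal sketch of LLM-assisted analysis of interview chatlogs
# (illustrative prompt and model; assumes openai>=1.0 and an API key).
from openai import OpenAI

client = OpenAI()

def summarize_themes(transcript: str) -> str:
    """Ask the model to extract recurring themes from one participant's transcript."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": "You are a qualitative-research assistant."},
            {"role": "user", "content": (
                "List the main themes in this interview transcript, each with one "
                f"supporting quote:\n\n{transcript}"
            )},
        ],
    )
    return response.choices[0].message.content
```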
With careful management of user expectations, we could introduce a new tool for user studies: one that allows us to explore new phenomena at greater scale, more quickly and in greater depth.