🔬 Research summary by Victoria Heath (@victoria_heath7), our Associate Director of Governance & Strategy.
[Original paper by Nils Köbis, Luca D. Mossink]
Overview: Can we tell the difference between a machine-generated poem and a human-written one? Do we prefer one over the other? Researchers Nils Köbis and Luca D. Mossink examine these questions through two studies observing human behavioural reactions to a natural language generation algorithm, OpenAI's Generative Pre-trained Transformer 2 (GPT-2).
Introduction
*Science tells us that the essence of nature is empathy.*
Who wrote that? An algorithm or a human being? *Spoiler alert*: That philosophical sentence was generated by an open-source algorithm called the New Age Bullshit Generator, a rudimentary example of a natural language generation (NLG) algorithm. From autocompleting our emails to creating news stories, NLG systems are increasingly present. Not only can these systems be used to create new forms of plagiarism, but they can also be used to create and disseminate mis- and disinformation. Therefore, our ability to decipher what is created by an algorithm and what isn't is important.
While previous studies have tested the capabilities of algorithms to generate creative outputs, there is a gap in research observing people's reactions to these outputs. Researchers Nils Köbis and Luca D. Mossink sought to fill that gap by using poetry generated by OpenAI's advanced NLG model, the Generative Pre-trained Transformer 2 (GPT-2), to measure and analyze the following (a brief generation sketch follows this list):
1) People's ability to distinguish between artificial and human text
2) People's confidence levels regarding algorithmic detection
3) People's aversion to or appreciation for artificial creativity
4) How keeping humans "in the loop" affects points 1-3
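For readers unfamiliar with how such text is produced, here is a minimal sketch of generating poem continuations with GPT-2, assuming the Hugging Face transformers library. The prompt and sampling parameters are illustrative; this summary does not describe the researchers' exact tooling or settings.

```python
# A minimal sketch of generating poem continuations with GPT-2, assuming the
# Hugging Face "transformers" library. Prompt and sampling parameters are
# illustrative, not the study's actual settings.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "The essence of nature is"  # hypothetical opening line
candidates = generator(
    prompt,
    max_length=60,           # cap total length of prompt + continuation
    num_return_sequences=3,  # produce several candidates to choose among
    do_sample=True,          # sample tokens instead of greedy decoding
    temperature=0.9,         # >1.0 = more varied, <1.0 = more conservative
)

for i, out in enumerate(candidates, 1):
    print(f"--- candidate {i} ---")
    print(out["generated_text"])
```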
Studies 1 and 2: General methodology
Before diving into the results of Köbis and Mossink's research, let's take a quick look at their methodology. They ran two studies, Study 1 and Study 2, each with four parts:
- Part 1 entailed "creating pairs of human-AI poems." In Study 1, these were written by participants in an incentivized creative-writing task; in Study 2, they used professionally written poems. In Study 1, the researchers selected which creative outputs would be used for judging, something they refer to as keeping a human in the loop (HITL), while in Study 2 they tested HITL as well as what would happen if the poetry was randomly sampled, keeping the human out of the loop (HOTL); the sketch after this list illustrates the distinction. In both studies, they used GPT-2 to generate the algorithmic poems.
- Part 2 entailed "a judgement task" modelled after the Turing test, in which a judge tries to determine which of two participants is the machine and which is the human. In Köbis and Mossink's studies, participants acted as "third-party judges" tasked with indicating which creative text they preferred. The researchers also told some participants which poems were written by a human (i.e. transparency) but kept that information hidden from others (i.e. opacity).
- Part 3 entailed a financial incentive to "assess people's accuracy in identifying algorithm-generated creative text," an aspect that makes this study unique.
- Part 4 entailed the judges indicating "their confidence in identifying the correct poem." In Study 1, there was no incentive. In Study 2, however, judges received financial incentives for "correctly estimating their performance."
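The HITL/HOTL contrast above comes down to how a single machine-generated poem is chosen from a pool of candidates. Below is a minimal sketch of the two selection procedures, assuming a shared candidate pool; the scoring function is a purely hypothetical stand-in for a human curator's judgment, not anything from the paper.

```python
import random

def hypothetical_human_score(poem: str) -> float:
    """Placeholder for a human curator's preference; purely illustrative."""
    return len(set(poem.split()))  # e.g. reward lexical variety

def hitl_select(candidates: list[str]) -> str:
    """Human-in-the-loop (HITL): a person picks the candidate they judge best."""
    return max(candidates, key=hypothetical_human_score)

def hotl_select(candidates: list[str]) -> str:
    """Human-out-of-the-loop (HOTL): one candidate is drawn at random."""
    return random.choice(candidates)

candidates = ["First generated poem ...", "Second generated poem ...", "Third one ..."]
print("HITL pick:", hitl_select(candidates))
print("HOTL pick:", hotl_select(candidates))
```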
Studies 1 and 2: Results
In Study 1, judges showed a preference (57%) for human-written poems over the GPT-2-generated poems. Contrary to the researchers' hypothesis, "judges did not reveal a stronger preference for human-written poetry when they were informed about the origin of the poems." Results also showed that judges were able to accurately distinguish the poems only about 50% of the time, no better than chance, indicating that people are "not reliably able to identify human versus algorithmic creative content." On average, however, the judges were overconfident in their ability to identify the origin of the poems. "These results are the first to indicate that detecting artificial text is not a matter of incentives but ability," they conclude.
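Accuracy around 50% on a two-alternative judgment is exactly what guessing would produce. One way to check whether an observed detection rate differs from chance, sketched here with SciPy and hypothetical counts (not the study's actual data):

```python
# Test whether judges' detection accuracy differs from the 50% chance level.
# The counts below are hypothetical, for illustration only.
from scipy.stats import binomtest

correct = 102  # hypothetical number of correct identifications
trials = 200   # hypothetical number of judgments

result = binomtest(correct, trials, p=0.5)
print(f"observed accuracy: {correct / trials:.2%}")
print(f"p-value vs. chance: {result.pvalue:.3f}")  # large p => consistent with guessing
```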
While their findings from Study 1 were generally replicated in Study 2, the researchers observed that when the machine-generated poems were randomly sampled (HOTL) rather than selected by humans (HITL), there was a stronger preference for the human-written poems. There was also a notable increase in preference for algorithm-generated poems in the HITL group. Further, they found that people were more accurate in identifying the artificial poetry when the pieces were randomly chosen (HOTL).
Discussion: Should we fear the robot poet?
Nils Köbis and Luca D. Mossink's research generally affirms what other studies have shown: people have an "aversion" to algorithms, especially algorithmically generated content that people perceive as "emotional" rather than "mechanical." This could indicate that certain creative professions, like journalism, are more likely to be disrupted by AI in comparison to others, like poetry or music. Another significant finding of this research is the influence humans can have on perceptions of artificial content. "We provide some of the first behavioural insights into people's reactions to different HITL systems," they explain. This should inform discussions around algorithmic accountability. While keeping humans in the loop can help "monitor and adjust the system and its outcomes," Köbis and Mossink write, it also allows us to "crucially shape the conclusions drawn about the algorithm's performance."
Further research is required into humans' behavioural reactions to NLG algorithms. By using an incentivized version of the Turing test, they argue, we could learn more about the use of NLG algorithms in other creative domains, such as news or social media. Köbis and Mossink also argue that studies comparing HITL and HOTL are necessary to produce "reliable and reproducible findings on the nexus of human and machine behavior." They conclude the article by pointing out that although algorithms' ability to mimic human creative text is increasing, "the results do not indicate machines are 'creative.'" Creativity requires emotion, something machines don't possess (yet).
Between the lines
When it comes to AI and creativity, there are significant issues we must contend with and questions we must ask. For instance, how do AI-generated creative outputs fit within the realm of copyright and intellectual property law? Should the developer of an algorithm own the copyright of its creative outputs? Creative Commons (CC), the nonprofit organization behind the open CC licenses, says there isn't a straightforward answer. "It brings together technical, legal, and philosophical questions regarding 'creativity,' and whether machines can be considered 'authors' that produce 'original' works," wrote CC's Director of Policy Brigitte Vézina in 2020. Pending further research and debate, they argue, any outputs by an algorithm should be in the public domain.
While this research affirms that humans aren't fond of artificial text, it also indicates that we can't really tell the difference, a gap that will only widen as these systems become more sophisticated. Therefore, it's important that we prepare ourselves for a future in which these systems impact creative industries, both positively and negatively. Personally, I dread the day I'll be replaced by an algorithm that can more efficiently (and affordably) spit out witty tweets and attention-grabbing headlines. Thankfully, that day is not today.