Robust Distortion-free Watermarks for Language Models

December 6, 2023

🔬 Research Summary by Rohith Kuditipudi, a third-year Ph.D. student at Stanford University advised by John Duchi and Percy Liang.

[Original paper by Rohith Kuditipudi, John Thickstun, Tatsunori Hashimoto, and Percy Liang]


Overview: This paper proposes a technique for planting watermarks in text sampled from a language model that enables reliable attribution of the text to the model. The watermarks are robust to edits while exactly preserving the original text distribution up to a maximum sampling budget.


Introduction

Large language models (LLMs) like ChatGPT provoke new questions about the provenance of written documents. For example, the website StackOverflow has banned users from posting answers using OpenAI’s ChatGPT model to mitigate the spread of misinformation on the platform. However, enforcing a ban on text generated by models is challenging because, by design, these models produce text that appears human-like. A reliable forensic tool for attributing text to a particular language model would empower individuals—such as platform moderators and teachers—to enact and enforce policies on language model usage; it would also better enable model providers (e.g., OpenAI) to track the use or misuse of their models.

A watermark is a signal embedded within generated content, in this case synthetic text from a language model (LM), that encodes the source of that content. We propose a family of watermarking techniques for attributing text to a language model. Our watermarks can reliably distinguish human-written text from synthetic text given only a couple dozen words, even if more than half of the original words have been edited in an attempt to evade watermark detection.

Key Insights 

Watermarking protocol setup

In our setting, users access the LM through a trusted provider that embeds the watermark in the LM’s output. Voluntary commitments, regulatory compliance, or legal requirements could underwrite this trust. The user is an untrusted party (e.g., a student who hopes to cheat on a homework assignment) who requests generated text from the LM provider and may rewrite or paraphrase this text to remove the watermark. A detector can later check whether a piece of text contains the watermark to determine whether the text originated from the LM. The detector should be robust to shenanigans by the user: the watermark should remain detectable unless the user has rewritten the text to the extent that it is no longer meaningfully attributable to the LM.

We allow the LM provider and watermark detector to coordinate ahead of time by sharing a secret randomized key. The LM provider uses the key, which amounts to a large sequence of random bits, to sample text from the LM that correlates with the watermark key sequence; the detector can then robustly align a putative text with the known key sequence to detect the watermark.
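
As a minimal illustration of this shared-key setup (the seed value, key length, and function name below are illustrative choices, not the authors’ protocol), both parties can expand a short shared secret into the same long sequence of uniform random values:

```python
import numpy as np

def expand_key(seed: int, n: int) -> np.ndarray:
    """Expand a short secret seed into a long sequence of uniform values in [0, 1)."""
    return np.random.default_rng(seed).random(n)

secret_seed = 42        # shared ahead of time between LM provider and detector
key_length = 10_000     # the "large sequence of random bits" backing the watermark

provider_key = expand_key(secret_seed, key_length)  # used when sampling text
detector_key = expand_key(secret_seed, key_length)  # used when checking text

assert np.array_equal(provider_key, detector_key)   # both sides hold the same key
```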

Distortion-free watermarks

In contrast to prior work, our watermarks are distortion-free in the sense that—over the initial randomness of the watermark key sequence—watermarked text is indistinguishable in distribution from regular text sampled from the language model. Creating a distortion-free watermark might at first appear impossible: how could a watermark be detectable if text generated with the watermark is sampled from the same probability distribution as unwatermarked text?

To illustrate, let’s design a distortion-free watermark for the outcome of ten coin flips. One way to sample the outcome of a fair coin flip is to draw a number uniformly at random between zero and one and return “heads” if the number is at most 1/2. Thus, to watermark the outcome of ten coin flips, we could first “pre”-sample a sequence of ten uniform random variables and fix these random variables as the watermark key. To anyone who does not know the key, the outcome of the ten coin flips will appear truly random; to the watermark detector, however, it will be evident whether or not the coin flips have been watermarked: the probability that an independent sequence of ten coin flips aligns entirely with the watermark key is 2^(-10).
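
A short sketch of this coin-flip example in Python; the detection rule is just the all-flips-agree check described above:

```python
import numpy as np

# Watermark key: ten pre-sampled uniform random variables, shared with the detector.
key = np.random.default_rng(0).random(10)

def flips_from_key(key: np.ndarray) -> np.ndarray:
    """Watermarked generation: each flip is heads iff its key value is at most 1/2."""
    return key <= 0.5

watermarked = flips_from_key(key)                           # correlates with the key
independent = np.random.default_rng(1).random(10) <= 0.5    # ordinary fair coin flips

def aligns_with_key(flips: np.ndarray, key: np.ndarray) -> bool:
    """Detector: True iff every flip matches the outcome the key would produce."""
    return bool(np.all(flips == flips_from_key(key)))

print(aligns_with_key(watermarked, key))   # True
print(aligns_with_key(independent, key))   # True only with probability 2**(-10)
```

Note that, over the randomness of the key, each watermarked flip is still heads with probability 1/2, which is the distortion-free property in miniature.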

Generalizing this simple intuition to a large language model—whose vocabulary consists of tens of thousands of tokens rather than just “heads” or “tails”—requires some care; however, the main idea is still that the watermarked text will correlate with the watermark key sequence irrespective of the text distribution (i.e., no matter the form of the original language model). One additional wrinkle we incorporate—to avoid repeatedly producing the same watermarked text—is to generate watermarked text using random subsequences of the full watermark key sequence. Until we reuse an element of the key sequence, the distribution of watermarked text will remain distortion-free, even if a user queries the LM provider multiple times.
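
One simple way to make each sampled token correlate with a uniform key value while preserving the model’s distribution is inverse transform sampling: use the key value in place of the sampler’s random number. The sketch below only illustrates this general idea with a made-up five-token vocabulary; it is not the authors’ exact decoding scheme:

```python
import numpy as np

def sample_token(probs: np.ndarray, u: float) -> int:
    """Inverse transform sampling: return the token whose cumulative-probability
    bucket contains u. If u is uniform on [0, 1), the token follows probs exactly,
    so the language model's output distribution is unchanged."""
    idx = int(np.searchsorted(np.cumsum(probs), u, side="right"))
    return min(idx, len(probs) - 1)  # guard against floating-point round-off

key = np.random.default_rng(0).random(1_000)  # pre-sampled watermark key sequence
offset = 137                                  # random offset into the key, so repeated
                                              # queries do not reuse the same key values

# Toy "language model" step: fixed next-token probabilities over a 5-token vocabulary.
probs = np.array([0.1, 0.4, 0.2, 0.2, 0.1])

tokens = [sample_token(probs, key[offset + t]) for t in range(20)]
print(tokens)
```

The detector, which knows the full key sequence, can then scan for an alignment between the observed tokens and some subsequence of the key; making that alignment robust to insertions, deletions, and substitutions is what allows the watermark to survive substantial editing.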

Calibrated watermark detection

As the toy example of a coin flip illustrates, one important feature of watermarking is that the watermark detector can compute exact p-values for the null hypothesis that a particular key does not watermark a given text. Given the often high-stakes nature of content attribution, the availability of such p-values is critical. For example, a teacher may decide to penalize a student for plagiarizing with a watermarked language model only if the likelihood of the student’s homework under the null hypothesis is minimal (e.g., less than one in a billion).
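
As a hedged sketch of how such a p-value can be computed in practice (using a made-up alignment statistic over a uniform five-token vocabulary, not the paper’s actual test statistic), the detector can compare the text’s alignment score under the true key against its score under many freshly resampled keys:

```python
import numpy as np

VOCAB_SIZE = 5  # toy vocabulary, matching the earlier sketch

def alignment_score(tokens: np.ndarray, key: np.ndarray) -> float:
    """Toy alignment statistic: fraction of positions where the key value falls in
    the observed token's bucket (a stand-in for a real detector's alignment cost)."""
    buckets = np.floor(key[: len(tokens)] * VOCAB_SIZE).astype(int)
    return float(np.mean(buckets == tokens))

def p_value(tokens: np.ndarray, true_key: np.ndarray, n_resample: int = 9_999) -> float:
    """Monte Carlo p-value for the null hypothesis that true_key does not watermark
    the text: the rank of the true key's score among scores of resampled keys."""
    rng = np.random.default_rng(0)
    observed = alignment_score(tokens, true_key)
    exceed = sum(
        alignment_score(tokens, rng.random(len(true_key))) >= observed
        for _ in range(n_resample)
    )
    return (1 + exceed) / (1 + n_resample)  # valid p-value under the null

# Text that is perfectly aligned with the key (as unedited watermarked text would be):
true_key = np.random.default_rng(1).random(50)
tokens = np.floor(true_key[:30] * VOCAB_SIZE).astype(int)
print(p_value(tokens, true_key))  # 1e-4, the smallest value 9,999 resamples can resolve
```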

Between the lines

The two key strengths of our watermarks are that they are both distortion-free and robust to substantial editing of watermarked text. However, this robustness does not necessarily imply that evading detection is hard; for example, one effective way to attack our watermark would be first to generate text in a foreign language and later translate it to the desired language (e.g., English). Thus, there remains both considerable room for progress and a need for measured caution in avoiding overreliance on watermarking as a means of attribution.

One important limitation of watermarking is that it requires trusting the LM provider to faithfully execute the watermarking protocol when sampling text. An exciting direction for future work is the development of effective watermarking schemes for open-source LMs (e.g., by planting the watermark in the open-sourced weights). A watermark is also LM-specific; the detector cannot broadly test whether a given text is machine-generated, only whether it was generated by a particular LM that implements the watermark.
