Robust Distortion-free Watermarks for Language Models

December 6, 2023

🔬 Research Summary by Rohith Kuditipudi, a third-year Ph.D. student at Stanford University advised by John Duchi and Percy Liang.

[Original paper by Rohith Kuditipudi, John Thickstun, Tatsunori Hashimoto, and Percy Liang]


Overview: This paper proposes a technique for planting watermarks in text sampled from a language model that enables reliable attribution of the text to the model. The watermarks are robust to edits while exactly preserving the original text distribution up to a maximum sampling budget.


Introduction

Large language models (LLMs) like ChatGPT provoke new questions about the provenance of written documents. For example, the website StackOverflow has banned users from posting answers using OpenAI’s ChatGPT model to mitigate the spread of misinformation on the platform. However, enforcing a ban on text generated by models is challenging because, by design, these models produce text that appears human-like. A reliable forensic tool for attributing text to a particular language model would empower individuals—such as platform moderators and teachers—to enact and enforce policies on language model usage; it would also better enable model providers (e.g., OpenAI) to track the use or misuse of their models.

One way to establish provenance is a watermark: a signal embedded within generated content, in this case synthetic text from a language model (LM), that encodes the source of that content. We propose a family of watermarking techniques for attributing text to a language model. Our watermarks can reliably distinguish human-written from synthetic text given as few as a couple dozen words, even if more than half of the original words have been edited in an attempt to evade watermark detection.

Key Insights 

Watermarking protocol setup

In our setting, users access the LM through a trusted provider that embeds the watermark in the LM’s output. This trust could be underwritten by voluntary commitments, regulatory compliance, or legal requirements. The user is an untrusted party (e.g., a student who hopes to cheat on a homework assignment) who requests generated text from the LM provider and may rewrite or paraphrase this text to remove the watermark. A detector can later check whether a piece of text contains the watermark to determine whether the text originated from the LM. The detector should be robust to shenanigans by the user: the watermark should remain detectable unless the user has rewritten the text to the extent that it is no longer meaningfully attributable to the LM.

We allow the LM provider and watermark detector to coordinate ahead of time by sharing a secret randomized key. The LM provider uses the key, which amounts to a large sequence of random bits, to sample text from the LM that correlates with the watermark key sequence; the detector can then robustly align a putative text with the known key sequence to detect the watermark.
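As a rough sketch of this setup (the names below are ours for illustration, not from the paper’s released code), the key is generated once and handed to both the provider and the detector ahead of time; the sketches later in this summary fill in how each party can use it:

```python
import numpy as np

def generate_key(key_len: int, seed: int) -> np.ndarray:
    """Secret watermark key: a long sequence of uniform random numbers,
    shared ahead of time by the LM provider and the watermark detector."""
    return np.random.default_rng(seed).random(key_len)

key = generate_key(key_len=2**16, seed=42)
# The provider uses `key` to sample text whose tokens correlate with it; the
# user may then edit that text; the detector, holding the same `key`, checks
# how unlikely the observed correlation would be if the text had been written
# independently of the key.
```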

Distortion-free watermarks

In contrast to prior work, our watermarks are distortion-free in the sense that—over the initial randomness of the watermark key sequence—watermarked text is indistinguishable in distribution from regular text sampled from the language model. Creating a distortion-free watermark might at first appear impossible: how could a watermark be detectable if text generated with the watermark is sampled from the same probability distribution as unwatermarked text?

To illustrate, let’s design a distortion-free watermark for the outcome of ten coin flips. One way to sample the outcome of a fair coin flip is to draw a number uniformly at random between zero and one and return “heads” if the number is at most 1/2. Thus, to watermark the outcome of ten coin flips, we could first “pre”-sample a sequence of ten uniform random variables and fix these random variables as the watermark key. To anyone who does not know the key, the outcome of the ten coin flips will appear truly random; to the watermark detector, however, it will be evident whether or not the coin flips have been watermarked: the probability that an independent sequence of ten coin flips aligns with the watermark key in every position is 2^(-10).
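A minimal sketch of this coin-flip example (variable names are ours): the provider derives each flip by thresholding the corresponding key element, and the detector counts how many flips agree with the key, yielding an exact binomial p-value:

```python
import numpy as np
from math import comb

rng = np.random.default_rng(0)
key = rng.random(10)  # secret watermark key: ten uniform draws in [0, 1)

def watermarked_flips(key):
    # "Heads" (1) exactly when the key element is at most 1/2, so each flip
    # is still a fair coin to anyone who does not know the key.
    return (key <= 0.5).astype(int)

def p_value(flips, key):
    # Under the null hypothesis (flips generated independently of the key),
    # each flip agrees with the key with probability 1/2, so the number of
    # agreements is Binomial(10, 1/2); report the exact upper-tail probability.
    agreements = int(np.sum(flips == (key <= 0.5)))
    n = len(key)
    return sum(comb(n, k) for k in range(agreements, n + 1)) / 2**n

print(p_value(watermarked_flips(key), key))       # 2**-10, about 0.001
print(p_value(rng.integers(0, 2, size=10), key))  # typically much larger
```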

Generalizing this simple intuition to a large language model—whose vocabulary consists of tens of thousands of tokens rather than just “heads” or “tails”—requires some care; however, the main idea is still that the watermarked text will correlate with the watermark key sequence irrespective of the text distribution (i.e., no matter the form of the original language model). One additional wrinkle we incorporate—to avoid repeatedly producing the same watermarked text—is to generate watermarked text using random subsequences of the full watermark key sequence. Until we reuse an element of the key sequence, the distribution of watermarked text will remain distortion-free, even if a user queries the LM provider multiple times.
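To make the generalization concrete, here is a heavily simplified sketch in the spirit of the paper’s exponential-minimum (Gumbel-style) sampling scheme, with our own names and toy sizes: each generation step is coupled with one element of the key, which is now a sequence of uniform random vectors over the vocabulary, and a random offset into the key sequence is drawn per query so that repeated queries use different subsequences of the key:

```python
import numpy as np

VOCAB_SIZE = 1_000   # toy vocabulary; a real LM has tens of thousands of tokens
KEY_LEN = 256        # toy key length

rng = np.random.default_rng(0)
# Secret key: one uniform random vector over the vocabulary per key position.
key = rng.random((KEY_LEN, VOCAB_SIZE))

def watermarked_decode(next_token_probs, key, num_tokens, rng):
    """Couple generation step t with key row (offset + t) mod KEY_LEN.

    `next_token_probs(prefix)` stands in for a real LM and returns a
    probability vector over the vocabulary. Choosing the token that minimizes
    -log(xi) / p is the exponential-race (Gumbel) trick: when xi is uniform,
    the chosen token is marginally distributed exactly according to p, so the
    output is distortion-free over the randomness of the key.
    """
    offset = int(rng.integers(len(key)))  # random subsequence of the key
    tokens = []
    for t in range(num_tokens):
        p = next_token_probs(tokens)
        xi = key[(offset + t) % len(key)]
        tokens.append(int(np.argmin(-np.log(xi) / np.maximum(p, 1e-12))))
    return tokens

# Toy usage with a uniform "language model":
toy_lm = lambda prefix: np.full(VOCAB_SIZE, 1.0 / VOCAB_SIZE)
tokens = watermarked_decode(toy_lm, key, num_tokens=40, rng=rng)
```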

Calibrated watermark detection

As the toy example of a coin flip illustrates, one important feature of watermarking is that the watermark detector can compute exact p-values for the null hypothesis that a particular key does not watermark a given text. Given the often high-stakes nature of content attribution, the availability of such p-values is critical. For example, a teacher may decide to penalize a student for submitting homework generated by a watermarked language model only if the p-value of the homework under the null hypothesis is vanishingly small (e.g., less than one in a billion).
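Continuing the decoding sketch above, one way to obtain such a p-value is a permutation-style test: compare a detection statistic computed with the true key against the same statistic computed with freshly drawn keys. The statistic below only scans over starting offsets into the key; the paper’s detector additionally uses an edit-distance-style alignment so that insertions and deletions are tolerated, and all names here are ours:

```python
import numpy as np

def detection_statistic(tokens, key):
    # Smaller is more watermark-like: under the exponential-minimum decoder
    # above, each chosen token tends to sit where its key vector is large, so
    # -log(key value at the chosen token) tends to be small. The detector does
    # not know which offset the provider used, so scan all of them.
    costs = -np.log(np.maximum(key[:, tokens], 1e-12))  # (KEY_LEN, num_tokens)
    t = np.arange(len(tokens))
    scores = [costs[(off + t) % len(key), t].sum() for off in range(len(key))]
    return float(min(scores))

def detection_p_value(tokens, key, n_resamples=99, seed=1):
    """P-value for the null "this text was generated independently of `key`".

    Under the null, the true key is exchangeable with freshly drawn keys, so
    the rank of its statistic among them gives a valid p-value, with
    resolution 1 / (n_resamples + 1).
    """
    rng = np.random.default_rng(seed)
    observed = detection_statistic(tokens, key)
    as_extreme = sum(
        detection_statistic(tokens, rng.random(key.shape)) <= observed
        for _ in range(n_resamples)
    )
    return (1 + as_extreme) / (1 + n_resamples)

# For the watermarked `tokens` from the previous sketch, this reports 0.01,
# the smallest value resolvable with 99 resampled keys.
print(detection_p_value(tokens, key))
```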

Between the lines

The two key strengths of our watermarks are that they are both distortion-free and robust to substantial editing of watermarked text. However, this robustness does not necessarily imply that evading detection is hard; for example, one effective way to attack our watermark would be first to generate text in a foreign language and later translate it to the desired language (e.g., English). Thus, there remains both considerable room for progress and a need for measured caution in avoiding overreliance on watermarking as a means of attribution.

One important limitation of watermarking is that it requires trusting the LM provider to faithfully execute the watermarking protocol when sampling text. An exciting direction for future work is the development of effective watermarking schemes for open-source LMs (e.g., by planting the watermark in the open-sourced weights). A watermark is also LM-specific; the detector cannot broadly test whether a given text is machine-generated, only whether it was generated by a particular LM that implements the watermark.

