[Original paper by Rohith Kuditipudi, John Thickstun, Tatsunori Hashimoto, and Percy Liang]
Overview: This paper proposes a technique for planting watermarks in text sampled from a language model that enables reliable attribution of the text to the model. The watermarks are robust to edits while exactly preserving the original text distribution up to a maximum sampling budget.
Large language models (LLMs) like ChatGPT provoke new questions about the provenance of written documents. For example, the website StackOverflow has banned users from posting answers using OpenAI’s ChatGPT model to mitigate the spread of misinformation on the platform. However, enforcing a ban on text generated by models is challenging because, by design, these models produce text that appears human-like. A reliable forensic tool for attributing text to a particular language model would empower individuals—such as platform moderators and teachers—to enact and enforce policies on language model usage; it would also better enable model providers (e.g., OpenAI) to track the use or misuse of their models.
To achieve provenance, a watermark is a signal embedded within some generated content—in this case, synthetic text from a language model (LM)—that encodes the source of the content. We propose a family of watermarking techniques for attributing text to a language model. Our watermarks can reliably distinguish human-written text from synthetic text given only a couple dozen words, even if more than half the original words have been edited in an attempt to evade watermark detection.
Watermarking protocol setup
In our setting, users access the LM through a trusted provider that embeds the watermark in the LM’s output. This trust could be underwritten by voluntary commitments, regulatory compliance, or law. The user is an untrusted party (e.g., a student who hopes to cheat on a homework assignment) who requests generated text from the LM provider and may rewrite or paraphrase this text to remove the watermark. A detector can later check whether a piece of text contains the watermark to determine whether it originated from the LM. The detector should be robust to shenanigans by the user: the watermark should remain detectable unless the user has rewritten the text to the extent that it is no longer meaningfully attributable to the LM.
We allow the LM provider and watermark detector to coordinate ahead of time by sharing a secret randomized key. The LM provider uses the key, which amounts to a large sequence of random bits, to sample text from the LM that correlates with the watermark key sequence; the detector can then robustly align a putative text with the known key sequence to detect the watermark.
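The alignment step can be sketched as a minimal sliding-offset search. This is a simplification, with illustrative names: the paper's actual detector also tolerates insertions and deletions via an edit-distance-style alignment, whereas this version only slides the text along the key with wraparound.

```python
def alignment_cost(text_vals, key, match_cost):
    # Slide the text along the key sequence (with wraparound) and return
    # the cost of the best alignment. Lower cost means the text lines up
    # more closely with some subsequence of the key.
    m = len(key)
    return min(
        sum(match_cost(t, key[(off + i) % m]) for i, t in enumerate(text_vals))
        for off in range(m)
    )
```

Because the detector minimizes over all offsets, it does not need to know which subsequence of the key the provider used when generating the text.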
In contrast to prior work, our watermarks are distortion-free in the sense that—over the initial randomness of the watermark key sequence—watermarked text is indistinguishable in distribution from regular text sampled from the language model. Creating a distortion-free watermark might at first appear impossible: how could a watermark be detectable if text generated with the watermark is sampled from the same probability distribution as unwatermarked text?
To illustrate, let’s design a distortion-free watermark for the outcome of ten coin flips. One way to sample the outcome of a fair coin flip is to draw a number uniformly at random between zero and one and return “heads” if the number is at most 1/2. Thus, to watermark the outcome of ten coin flips, we could first “pre”-sample a sequence of ten uniform random variables and fix these random variables as the watermark key. To anyone who does not know the key, the outcome of the ten coin flips will appear truly random; however, to the watermark detector, it will be evident whether or not the coin flips have been watermarked: the probability that an independent sequence of ten coin flips aligns with the watermark key on all ten flips is 2^(-10).
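The toy scheme can be written out directly (the function names here are illustrative):

```python
import random

def watermarked_flips(key):
    # Flip i is "heads" exactly when key value i is at most 1/2, so each
    # flip is a fair coin, yet the outcome is fully determined by the key.
    return ["heads" if u <= 0.5 else "tails" for u in key]

def count_agreements(flips, key):
    # The detector counts how many flips align with the key. An
    # independent (unwatermarked) sequence agrees with each key value
    # with probability 1/2, so ten-for-ten agreement has p-value 2**-10.
    return sum((f == "heads") == (u <= 0.5) for f, u in zip(flips, key))

key = [random.random() for _ in range(10)]  # the secret watermark key
flips = watermarked_flips(key)
print(count_agreements(flips, key))         # always prints 10
```

Marginally over the random key, each flip is fair, so an observer without the key sees an ordinary sequence of coin flips.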
Generalizing this simple intuition to a large language model—whose vocabulary consists of tens of thousands of tokens rather than just “heads” or “tails”—requires some care; however, the main idea is still that the watermarked text will correlate with the watermark key sequence irrespective of the text distribution (i.e., no matter the form of the original language model). One additional wrinkle we incorporate—to avoid repeatedly producing the same watermarked text—is to generate watermarked text using random subsequences of the full watermark key sequence. Until we reuse an element of the key sequence, the distribution of watermarked text will remain distortion-free, even if a user queries the LM provider multiple times.
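One way to couple token sampling with uniform key values while leaving the output distribution unchanged is the exponential-minimum (Gumbel-style) trick, where each step of the key supplies one uniform value per vocabulary entry. The sketch below is illustrative rather than the paper's exact scheme, and the `model` and `key` interfaces are hypothetical:

```python
import random

def exp_min_sample(p, u):
    # Given next-token probabilities p and one uniform key value u[i]
    # per vocabulary entry, pick argmax_i u[i]**(1/p[i]). Marginalizing
    # over the uniform key values, token i is chosen with probability
    # exactly p[i], so the sampling is distortion-free.
    best, best_score = None, -1.0
    for i, (pi, ui) in enumerate(zip(p, u)):
        if pi > 0:
            score = ui ** (1.0 / pi)
            if score > best_score:
                best, best_score = i, score
    return best

def generate(model, key, n_tokens):
    # Start at a random offset into the key sequence so that repeated
    # queries use different key subsequences and therefore do not
    # repeatedly produce the same watermarked text.
    offset = random.randrange(len(key))
    tokens = []
    for t in range(n_tokens):
        p = model(tokens)                  # next-token probabilities
        u = key[(offset + t) % len(key)]   # one uniform per vocab entry
        tokens.append(exp_min_sample(p, u))
    return tokens
```

Because the sampled tokens are deterministic functions of the key values, generated text correlates with the key regardless of the model's distribution, which is what the detector later exploits.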
Calibrated watermark detection
As the toy example of a coin flip illustrates, one important feature of watermarking is that the watermark detector can compute exact p-values for the null hypothesis that a particular key does not watermark a given text. Given the often high-stakes nature of content attribution, the availability of such p-values is critical. For example, a teacher may decide to penalize a student for plagiarism using a watermarked language model only if the likelihood of their homework under the null hypothesis is minimal (e.g., less than one in a billion).
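A minimal sketch of such a p-value computation, assuming a test statistic where lower values indicate stronger alignment between text and key (names are illustrative):

```python
import random

def p_value(test_stat, text, key, n_resamples=999, seed=0):
    # Permutation-style p-value. Under the null hypothesis that the text
    # was written independently of the key, the true key's statistic is
    # exchangeable with statistics computed from freshly drawn random
    # keys, so the p-value below is exact (up to resampling resolution).
    rng = random.Random(seed)
    observed = test_stat(text, key)
    # Count resampled keys that align at least as well (lower cost).
    hits = sum(
        test_stat(text, [rng.random() for _ in key]) <= observed
        for _ in range(n_resamples)
    )
    return (1 + hits) / (1 + n_resamples)
```

Crucially, this guarantee holds no matter what the text distribution is: the null hypothesis concerns only the independence of the text from the key, so a teacher relying on the p-value needs no assumptions about the language model itself.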
Between the lines
The two key strengths of our watermarks are that they are both distortion-free and robust to substantial editing of watermarked text. However, this robustness does not necessarily imply that evading detection is hard; for example, one effective way to attack our watermark would be first to generate text in a foreign language and later translate it to the desired language (e.g., English). Thus, there remains both considerable room for progress and a need for measured caution in avoiding overreliance on watermarking as a means of attribution.
One important limitation of watermarking is that it requires trusting the LM provider to faithfully execute the watermarking protocol when sampling text. An exciting direction for future work is the development of effective watermarking schemes for open-source LMs (e.g., by planting the watermark in the open-sourced weights). A watermark is also LM-specific; the detector cannot broadly test whether a given text is machine-generated, but only whether it was generated by a particular LM that implements the watermark.