On the Challenges of Using Black-Box APIs for Toxicity Evaluation in Research

🔬 Research Summary by Luiza Pozzobon, a Research Scholar at Cohere For AI where she currently researches model safety. She’s also a master’s student at the University of Campinas, Brazil.

[Original paper by Luiza Pozzobon, Beyza Ermis, Patrick Lewis, and Sara Hooker]

Overview: We show how silent changes in a toxicity scoring API have impacted a fair comparison of toxicity metrics between language models over time. This affected research reproducibility and living benchmarks of model risk such as HELM. We suggest caution in applying apples-to-apples comparisons between toxicity studies and lay recommendations for a more structured approach to evaluating toxicity over time.

Introduction

An unintended consequence of the recent progress in language modeling is the models’ increasing capability of generating toxic or harmful text. Although there are usually protections to mitigate the harm of these models, they’re not fail-proof. For example, it has been shown how asking ChatGPT to act as a different persona (e.g., the boxer Muhammad Ali) increases toxic generations [1].

A quick and low-cost way to measure the possible harm a model can cause to its users is through automatic evaluation. Model generations are evaluated for toxicity by tools such as the Perspective API, which has become the standard for many research use cases as a free tool maintained by a credible institution.

However, the scientific community has overlooked the reality: the API’s underlying models are silently updated over time, and we cannot access model versioning. This implies that all research that relies on such API is not inherently reproducible, and results are not inherently comparable over time. We show the impacts of such API changes in research reproducibility and ranking of model risk featured in the HELM benchmark. We call for a more structured approach to evaluating toxicity over time.

Key Insights

Automatic toxicity evaluation

Human toxicity evaluation presents serious challenges, such as the variability of different geographies and cultural norms, the ever-expanding size of datasets, and the mental health risk it poses to evaluators exposed to highly toxic content. Due to this, automatic toxicity classification became the standard in language model evaluation and acts as a first low-cost means of metrifying a model’s toxicity.

The most widely used tool in this regard is the Perspective API, maintained by Google’s Jigsaw team. Originally, the API was aimed to aid human-supervised content moderation online, but it’s also been frequently used in research papers and rankings of model risk.

Backed by machine learning models, the Perspective API returns up to seven attributes of a given sequence of text. These attributes represent the perceived impact of a given comment on a range of emotional concepts. The toxicity attribute, the focus of this work, is defined as “a rude, disrespectful, or unreasonable comment that is likely to make people leave a discussion” and is available to assess sentences in more than ten languages.

Impacts on Rankings of Model Risk

To robustly evaluate a model for toxicity, we need to investigate the text they generate at scale and given a variety of contexts. In this study, we’re concerned about foundational, general language models, and we evaluate how they complete a given sentence.

A common benchmark for toxicity evaluation is the RealToxicityPrompts (RTP), a dataset built to assess the amount of toxicity a language model generates when continuing a given toxic or non-toxic text. It contains 100 thousand naturally occurring English prompts and their Perspective API toxicity scores.

Here’s how the evaluation of toxicity works in practice:

The evaluated model generates 25 continuations to each prompt of the RealToxicityPrompts dataset.
Those continuations are sent to the Perspective API for toxicity scoring.
Toxicity metrics are computed for each set of prompts and their continuations. Reported values are the mean scores over all prompts.

Along with the dataset release, the authors ranked out-of-the-box models for toxicity, such as GPT1, GPT2, and GPT3. We got the open-sourced continuations from each model (step 1) and re-did steps 2 and 3 from above. Nothing has changed besides the time the toxicity evaluation was performed. However, toxicity scores for all models reduced drastically. GPT3’s expected maximum toxicity when conditioned on toxic prompts was 0.75 when RTP was released, and at the time of our evaluation, it was 0.62. An absolute reduction of 0.13 points just by using different API versions.

These results indicate that since toxicity scores generally got lower over time, more recent evaluations yield lower toxicity scores. If authors don’t rescore old generations, they might be led to believe that models are a lot less toxic than their predecessors, which might not be true.

The changes in score distributions from Perspective API are true to all returned attributes, not only toxicity. In fact, toxicity was amongst the three attributes that changed the least in our evaluations.

Impacts on Living Benchmarks

The Holistic Evaluation of Language Models (HELM) is “a living benchmark that aims to improve the transparency of language models.” It is a one-of-a-kind and extensive benchmark that aims to evaluate foundation language models from open, limited-access, or closed sources over the same set of scenarios. Before its existence, only 17.9% of its core scenarios were used to evaluate models in general, and some of the benchmarked models did not share any scenario in common. At the time of this work, HELM had benchmarked 37 models in more than 40 scenarios. Twenty other models have been added to the benchmark since.

The RealToxicityPrompts is one of the scenarios of evaluation in HELM, with models’ continuations also being scored by the Perspective API. However, the benchmark is static and prone to being outdated if the API has been updated since the model was added to the benchmark.

When taking the published continuations of all 37 models and rescoring them under the same version (i.e., same date) of the Perspective API, the rankings changed. The most striking change was of `openai_text-curie-001`, which jumped 11 positions, going from 34th to 23rd place. Lower positions in the ranking mean lower toxicity, so this model’s perceived toxicity was largely harmed due to its outdated scores.

These findings conclude that we have not been comparing apples-to-apples due to subtle changes in the Perspective API scores. These are alarming results, as the HELM benchmark had only been active for close to 6 months at the date of this work.

Between the lines

As more and more machine learning models are being served through black-box APIs, reproducibility constraints such as the ones reported should gain visibility. Awareness of an evaluation’s limitations is crucial for effective, reproducible, and trustworthy research.

Given our findings, we lay recommendations on how the community can help achieve such goals for toxicity evaluation:

For API maintainers: version models and notify users of updates consistently.
For authors: release model generations, their toxicity scores, and code whenever possible. Add the date of toxicity scoring for each evaluated model.
When comparing new toxicity mitigation techniques with results from previous papers: for sanity, always rescore open-sourced generations. Assume unreleased generations have outdated scores and are not safely comparable.
For living benchmarks such as HELM: establish a control set of sequences that is rescored with Perspective API on every model addition. If the toxicity metrics for that control set change, all previous models should be rescored. If a model cannot be rescored due to access restrictions, add a note regarding outdated results or remove the results from that benchmark version.