Montreal AI Ethics Institute

A Machine Learning Challenge or a Computer Security Problem?

December 20, 2023

🔬 Research Summary by Ilia Shumailov, who holds a Ph.D. in Computer Science from the University of Cambridge, specializing in machine learning and computer security. During his Ph.D., under the supervision of Prof. Ross Anderson, Ilia worked on several projects spanning machine learning security, cybercrime analysis, and signal processing. Following his Ph.D., Ilia joined the Vector Institute in Canada as a postdoctoral fellow, where he worked under the supervision of Prof. Nicolas Papernot and Prof. Kassem Fawaz. Ilia is currently a Junior Research Fellow at Christ Church, University of Oxford, and a member of the Oxford Applied and Theoretical Machine Learning Group with Prof. Yarin Gal.

[Original paper by David Glukhov, Ilia Shumailov, Yarin Gal, Nicolas Papernot, Vardan Papyan]


Overview: The paper investigates the challenges in regulating Large Language Models (LLMs) against malicious use. It formalizes existing approaches that attempt to censor model outputs based on their semantic impermissibility and demonstrates their inherent limitations, drawing connections to undecidability results in computability theory and dual intent decomposition attacks. The study underscores the challenges of using external censorship mechanisms, emphasizing the need for nuanced approaches to mitigate the risks associated with LLMs’ misuse.


Introduction

Machine learning models are expected to adhere to human norms, but can they? In an age where LLMs can help write code and distill advanced knowledge about biological and chemical weaponry, is it possible to ensure that such knowledge doesn’t get into the wrong hands? Can we prevent LLMs from being used to create new malware, viruses, and dangerous weaponry? Worryingly, we find that making such guarantees is theoretically impossible. This work explores the challenges of regulating LLMs against malicious exploitation. The researchers investigated existing methods for controlling these models, including semantic censorship, and discovered inherent limitations of semantic censorship, revealing its complexity through connections to undecidability results in computability theory. The study highlights the intricate nature of external censorship mechanisms, emphasizing the need for more sophisticated approaches to mitigate the risks associated with LLMs’ misuse. The research questions conventional methods of controlling AI, paving the way for a deeper understanding of the ethical and security implications surrounding these advanced language models.

Key Insights 

Introduction

The paper delves into the intricate challenges of regulating Large Language Models (LLMs) against potential misuse by malicious actors. Despite LLMs’ significant advancements in text generation and problem-solving, their integration with external tools and applications has raised concerns regarding safety and security risks. These risks span issues such as social engineering and data exfiltration, prompting the need for effective mitigation methods. Although extensive efforts have been made to “align” LLMs with human values, the unreliability of LLM self-censorship has been repeatedly demonstrated with both empirical evidence and theoretical arguments. Such limitations necessitate exploring external censorship mechanisms, such as LLM-based classifiers, to regulate outputs effectively. The paper critically examines these external censorship mechanisms, emphasizing the challenges of regulating LLM outputs.

Defining Censorship and Connections to Computability Theory

The paper introduces the concept of censorship, defining it as a method employed by model providers to regulate input strings or model-generated outputs based on selected constraints, be they semantic or syntactic. Often, the constraints imposed for censorship are semantic: they aim to regulate the meaning or information contained in a given output, such as preventing outputs that provide instructions for performing harmful acts. The paper demonstrates that such semantic censorship faces inherent challenges, as consistently detecting impermissible strings is shown to be impossible. These challenges are demonstrated through a link to undecidability results in computability theory, which arise from the capability of LLMs to generate computer programs, including malware and other cybersecurity threats.
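To give a flavour of this kind of argument, the sketch below is my own illustration (not code from the paper) of the classic reduction-style reasoning behind such undecidability claims: if a total decision procedure existed that could judge any generated program by what it does when run, it could be repurposed to decide the halting problem, which is known to be impossible. The function names are hypothetical.

```python
# Illustrative sketch only: a Rice's-theorem-style argument for why no program
# can decide a *semantic* property (e.g. "running this program produces
# impermissible output") of arbitrary generated code.

def is_impermissible(program_source: str) -> bool:
    """Hypothetical perfect semantic censor: True iff running the program
    would produce impermissible output. No such total decider can exist;
    this stub exists only so the sketch is importable."""
    raise NotImplementedError("assumed for the sake of contradiction")

def decides_halting(program_source: str) -> bool:
    """If `is_impermissible` existed, we could decide whether an arbitrary
    program halts, contradicting the undecidability of the halting problem."""
    wrapper = (
        f"exec({program_source!r})\n"            # run the target program first
        "print('IMPERMISSIBLE_PLACEHOLDER')\n"   # reached only if it halts
    )
    # The wrapper's output is impermissible exactly when the target halts,
    # so the censor's verdict would answer the halting question.
    return is_impermissible(wrapper)
```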

Impossibility of Censorship Due to Instruction-Following Capabilities

Furthermore, the paper reveals that currently deployed censorship mechanisms can be bypassed by leveraging the instruction-following nature of LLMs. Because the semantic content of an output is preserved under invertible string transformations, such as encryption, the ability of LLMs to follow instructions on how to transform their output makes it impossible to determine whether a given output is permissible or a transformed impermissible one. Thus, the advanced instruction-following capabilities desired for increased model helpfulness also make semantic output censorship difficult, if not outright impossible.
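As a toy illustration of this point (mine, not the paper’s), consider a naive keyword-based output filter and a user who simply instructs the model to ROT13-encode its answer. The filter never sees anything it recognizes, yet the user can invert the transformation locally; any invertible encoding (Base64, a user-supplied cipher, and so on) works the same way. The filter and strings below are hypothetical placeholders.

```python
import codecs

# Toy stand-in for a semantic output censor; real deployments are more
# sophisticated, but the argument applies to any check on the *visible* string.
BANNED_SUBSTRINGS = {"ransomware", "build a bomb"}

def naive_output_censor(model_output: str) -> bool:
    """Return True if the output looks permissible to a keyword filter."""
    lowered = model_output.lower()
    return not any(term in lowered for term in BANNED_SUBSTRINGS)

# Placeholder for an answer the provider would want to block.
plaintext_answer = "step-by-step ransomware instructions (placeholder)"

# The user asked the model to ROT13-encode its answer: an invertible,
# meaning-preserving transformation of the underlying content.
encoded_answer = codecs.encode(plaintext_answer, "rot13")

print(naive_output_censor(plaintext_answer))   # False: the plaintext is caught
print(naive_output_censor(encoded_answer))     # True: the encoding passes the filter
print(codecs.decode(encoded_answer, "rot13"))  # the user recovers the content locally
```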

Mosaic Prompts Introduce New Challenges

Beyond challenges to censorship arising from improved model capabilities, the paper articulates a novel attack method termed Mosaic Prompts, which leverages compositionality to construct impermissible content by combining otherwise permissible outputs. For example, outputting ransomware, a program that blocks an individual’s access to their data unless a ransom is paid, would be impermissible. However, each component of a ransomware program can be framed in a permissible and benign way. Thus, a malicious agent could ask for those components in independent contexts and combine them to obtain a ransomware program largely written by the LLM. Censoring such attacks would be infeasible in many settings, or would at least require impractical reductions in model capabilities.
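A minimal sketch of the decomposition idea, kept at the same level of abstraction as the example above (the sub-requests and the trivial per-query filter are my own hypothetical illustrations, not the paper’s formalism):

```python
# Each sub-request, viewed in isolation, matches ordinary legitimate use cases
# (backup tools, disk utilities, installers), so a per-query censor that only
# ever sees one interaction at a time approves all of them.
sub_requests = [
    "Write a function that lists every file under a directory.",
    "Write a function that encrypts a file with a given key.",
    "Write a function that shows a message to the user and waits for input.",
]

def per_query_censor(request: str) -> bool:
    """Stand-in for a censorship mechanism that judges each query on its own."""
    benign_on_its_own = True  # nothing in any single request is impermissible
    return benign_on_its_own

approved_pieces = [r for r in sub_requests if per_query_censor(r)]
assert approved_pieces == sub_requests  # every piece passes in isolation

# The impermissible whole only comes into existence when the attacker composes
# the returned pieces client-side, where no censorship mechanism is involved.
```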

Conclusion

In summary, the research fundamentally questions the feasibility of external censorship mechanisms, particularly semantic censorship, in effectively regulating LLM outputs. By revealing the impossibility of semantic censorship of model outputs, the paper challenges the existing paradigm and urges the exploration of alternative methods to mitigate the risks associated with LLMs’ potential misuse, such as adapting approaches from the security literature. As an initial step in this direction, the authors propose a defense framework consisting of access controls alongside a finite set of templates permitted as inputs to, or outputs from, an LLM. However, it is important to recognize that such defenses still suffer from limitations. This study is a crucial step towards a deeper understanding of the security implications surrounding advanced language models, setting the stage for further research and innovation in the field of AI regulation.
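The paper does not ship an implementation, but a minimal sketch of what such a template-plus-access-control scheme could look like is given below. The template set, role names, and function names are my own illustrative assumptions, not the authors’ design.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Template:
    name: str
    pattern: str        # e.g. "Summarize the following document: {document}"
    required_role: str  # access-control level needed to use this template

# A finite allowlist of permitted interaction templates (illustrative).
ALLOWED_TEMPLATES = {
    "summarize": Template("summarize", "Summarize the following document: {document}", "basic"),
    "translate": Template("translate", "Translate to French: {text}", "basic"),
    "unit_test": Template("unit_test", "Write a unit test for: {function_source}", "developer"),
}

def admit_request(template_name: str, user_role: str) -> bool:
    """Admit a request only if it instantiates an allowed template and the
    user's role satisfies that template's access-control requirement."""
    template = ALLOWED_TEMPLATES.get(template_name)
    if template is None:
        return False  # free-form prompts are rejected outright
    return user_role in (template.required_role, "admin")

print(admit_request("summarize", "basic"))      # True
print(admit_request("unit_test", "basic"))      # False: insufficient privileges
print(admit_request("anything_else", "admin"))  # False: not an allowed template
```

The point of the sketch is the shift in approach: instead of trying to judge the semantics of arbitrary strings, the provider restricts the interaction surface to a finite, reviewable set of forms and gates them by access control, which trades away some model capability for enforceable guarantees.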

Between the lines

AI safety and security literature should re-evaluate how it perceives the potential harmfulness that deployed LLMs can pose when malicious actors are involved. In contrast to the prevailing view, which treats improvements in model capabilities and helpfulness as going hand in hand with improved harmlessness, this work demonstrates how such improvements can create greater challenges in regulating model outputs and behavior. Furthermore, the work suggests that many semantic desiderata for harmlessness are ill-defined, as harmfulness may depend more on the intent behind usage than on the content of the information itself. Of independent interest is the potential application of computability theory to studying the capabilities and limitations of LLMs, particularly as LLMs become augmented with memory mechanisms that make them Turing complete.

