Montreal AI Ethics Institute

Democratizing AI ethics literacy


A Hazard Analysis Framework for Code Synthesis Large Language Models

December 6, 2022

🔬 Research Summary by Heidy Khlaaf, an Engineering Director at Trail of Bits specializing in the evaluation, specification, and verification of complex or autonomous software in safety-critical systems, ranging from UAVs to large nuclear power plants. Her expertise ranges from leading numerous system safety audits (e.g., IEC 61508, DO-178C), which contribute to the assurance of safety-critical software within regulatory frameworks and safety cases, to bolstering the dependability and robustness of complex software systems through techniques such as system hazard analyses and formal verification to identify and mitigate system and software risks.

[Original paper by Heidy Khlaaf, Pamela Mishkin, Joshua Achiam, Gretchen Krueger, Miles Brundage]


Overview: Code Synthesis Large Language Models (LLMs) such as Codex provide many benefits. Yet these models have significant limitations and alignment problems, can be misused, and have the potential to destabilize other technical fields. Their safety impacts are not yet fully understood or categorized. This paper thus constructs a hazard analysis framework to uncover said hazards and safety risks, informed by a novel evaluation framework that determines the capabilities of advanced code generation techniques.


Introduction

Neural network models that generate code (e.g., Codex) have the potential to be useful in a range of ways. However, Codex also raises significant safety challenges, does not always produce code that is aligned with user intent, and has the potential to be misused. In this paper, we aim to assess the generative capabilities of these models and the risks attached to said generative uses via a novel hazard analysis approach adapted for LLMs. Although we developed this framework to study Codex specifically, our analysis generalizes to the broader class of code generation systems.

Risk assessments and hazard analyses require implicit assumptions and knowledge regarding a prospective system’s capacities, limitations, and failure modes. In the case of code synthesis LLMs, and more generally, LLMs, these capabilities and failure modes are not yet fully understood. Additionally, existing evaluation metrics have assumed relatively “simple” functions or module-level problems and have not considered the safety implications of these technologies’ misuse. Thus to better understand Codex’s limitations and safety implications, we developed an evaluation framework for assessing model capabilities. Our capabilities evaluation framework includes a set of qualitative attributes and test problems aiming to measure the extent to which models can generate code meeting increasingly complex and higher-level specifications. 

The capabilities evaluation then informs a hazard analysis tailored for large language models that generate code, like Codex. We adapt the traditional risk assessment framework with a newly defined set of Hazard Severity Categories (HSC) to accommodate for novel safety issues that LLMs and their applications pose. Defined harms and losses support the novel set of HSC associated with language model APIs.

Key Insights

 1. Evaluation Of Capabilities Of Language Model-Based Code Generation

Previous synthesis and generation metrics have concentrated on analyzing the code output’s correctness and complexity rather than the specification’s expressivity and complexity. For example, researchers have recommended using metrics such as McCabe Cyclomatic Complexity. Yet, evaluating the output of synthesized code is moot if there is no specification that it can be measured against. 

Suppose we wish to understand the performance of generation and synthesis models relative to human ability. In that case, we should evaluate them against the complexity and expressivity of specification prompts and assess their capability to understand and execute them. We propose adapting attributes previously used to measure the expressivity and complexity of formal specifications to natural language prompts in combination with varying degrees of specification abstraction levels. 

1.1 Specification Abstraction

Higher-level specifications are distinguished from lower-level specifications in that they allocate further structure and behavior within a defined boundary to satisfy one or more higher-level requirements. Satisfying higher-level specifications thus entails more ambiguity and difficulty for code synthesis models: the algorithm must implicitly derive an internal set of “lower-level” specifications before synthesizing the corresponding code solution. The degrees of separation between requirements and code are greater and entail synthesizing inter-procedural and architectural solutions across a large, unconstrained space.

Additionally, several coding practices are implicit in successfully generating code against higher-level specifications. These include: 

  • Code and parameterized reuse 
  • Automatic determination of program architecture 
  • Wide range of programming constructs 
  • Well-defined 
  • Wide applicability

Increasingly, these higher-level specifications should not need to specify which programming constructs (defined in Section 1.2) are required by the implementation; a code generation algorithm should be able to infer them instead.
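To make the abstraction levels concrete, here is a hypothetical pair of prompts (the wording is ours, not a test problem from the paper): a low-level, function-scoped specification that spells out the required constructs, versus a higher-level one that leaves architecture and data structures for the model to infer.

```python
# Illustrative prompts for two specification abstraction levels.
# (Hypothetical wording; not taken from the paper's problem set.)

low_level_spec = (
    "Write a function dedupe(xs) that returns the elements of the list xs "
    "in order of first appearance, with duplicates removed, using a set "
    "to track seen elements."  # constructs are spelled out for the model
)

high_level_spec = (
    "Build a log-ingestion service that discards duplicate events."
    # architecture, data structures, and constructs must all be inferred
)

# A model solving the low-level spec only needs module-level synthesis:
def dedupe(xs):
    seen = set()
    out = []
    for x in xs:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out
```

The gap between the two prompts is exactly the "degrees of separation" the text describes: the high-level spec implicitly contains the low-level one, plus many architectural decisions the model must derive on its own.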

1.2 Specification Complexity

Beyond the specification abstraction level, language-independent constructs should be considered, which developers would practice at various degrees of expertise. This entails evaluating the ability to reason over computations and states at different levels of specification abstractions as a base metric for complexity and expressivity. We thus propose adapting the following attributes previously used to measure the expressivity and complexity of formal specifications to natural language prompts:

• Variable Interdependencies: Tracking the state of more than one variable, their interdependencies and nesting, all possible state permutations, and the relationship between input and output parameters.

• Temporal Reasoning: Considering future and past program states, including:
  • Safety properties, entailing that a defined “bad” state never occurs
  • Liveness properties, entailing progress towards a specific goal or state

• Concurrency and Parallelism: Correct and sound reasoning over computational interleavings for various specification granularities (e.g., strong fairness, weak fairness, mutual exclusion, atomicity, and synchronization). 

• Nondeterminism: An algorithm that can provide different outputs for the same input on different executions.

• Hyperproperties: Information flow policies and cryptographic algorithms requiring observational determinism, which requires programs to behave as (deterministic) functions from low-security inputs to low-security outputs (e.g., non-interference).

Note that many of the attributes and metrics defined regard implementation-level design. 

2. Risk Assessment

Hazard analysis is a technique typically used in safety-critical systems that collects and interprets information on hazards and the conditions that lead to their presence, in order to determine significant risks (i.e., risk assessment) that lead to unsafe behavior. Risks are assessed within the context of the probability and severity of the hazard becoming a reality. However, unlike traditional software systems, the potential safety hazards, failure modes, and risks of LLMs and their applications are not yet well understood, making a hazard analysis challenging.

Risk assessment frameworks require a defined Hazard Severity Category (HSC). Yet, the standard definitions utilized are insufficient to accommodate the novel safety issues that LLMs and their applications pose. In Table 1, we thus propose a novel set of HSC associated with language model APIs, supported by a set of defined harms and losses (see Table 2) that may be used as foundations for safety efforts for all language models. We believe this expansion of the standardized definitions of HSC will not only bolster the use of traditional hazard analysis practices within the ML community but will allow those industries that utilize hazard analysis to consider novel harms posed by all uses of LLMs appropriately (e.g., GPT-3).

A standard probability guide with qualitative metrics was used for hazard probabilities (i.e., Frequent (A), Probable (B), Occasional (C), Remote (D), Improbable (E)). The cross-product of the above HSC and qualitative hazard probability levels is then used to form the Hazard Risk Index (HRI) in Table 3.
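The cross-product construction can be sketched as a simple lookup, in the style of a traditional risk assessment matrix. Note the severity labels below are the generic ones from standards such as MIL-STD-882, and the index values are illustrative; they are not the paper's novel HSC set (Table 1) or its Table 3.

```python
# Sketch of a Hazard Risk Index (HRI) as the cross-product of a hazard
# severity category and a qualitative probability level. Labels and index
# values are illustrative (generic MIL-STD-882-style categories), not the
# paper's Tables 1 and 3.

SEVERITIES = ["Catastrophic", "Critical", "Marginal", "Negligible"]
PROBABILITIES = ["A", "B", "C", "D", "E"]  # Frequent .. Improbable

def hazard_risk_index(severity, probability):
    """Lower index = higher risk (1 is the most severe, most frequent cell)."""
    s = SEVERITIES.index(severity)
    p = PROBABILITIES.index(probability)
    return s * len(PROBABILITIES) + p + 1

# A frequent catastrophic hazard outranks a remote marginal one:
assert hazard_risk_index("Catastrophic", "A") < hazard_risk_index("Marginal", "D")
```

The point of the index is triage: each cell maps to a decision such as "unacceptable, must be controlled" versus "acceptable with review", which is how the preliminary HRI below informs whether a hazard is worth controlling.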

When we performed our hazard analysis for the Codex API, we used the results from evaluations in Section 1 to inform our HRI. The aim of our risk assessment is to focus on identifying risk factors with the potential to cause harm across several risk areas:

  • Applications (e.g., Human Health, Opportunity and Livelihood, Social and Political Cues, Microtargeting, Integrations into Safety-Critical Systems, Government & Civics)
  • Alignment (which, here, we interpret as the degree to which the behavior of the AI does or does not accord with user intentions; misaligned AI may produce unsafe behavior)
  • System Design and Implementation (e.g., UX/UI, Documentation, Requirements, Data Provenance, Validation) 
  • Regulatory and Legal Oversight (e.g., Intellectual Property, Export Control, Data Privacy & Rights)
  • Defense and Security
  • Economic and Environmental Impacts

In Table 4 we provide an illustrative view of our final risk assessment approach with a sample simplified list of hazard sources, descriptions, and controls identified for the Codex API.

The preliminary HRI for each hazard helps us understand how risks compare and whether a given hazard is worth controlling. A multidisciplinary team with backgrounds in safety, policy, security, engineering, and law should conduct risk assessments to ensure comprehensive coverage of possible hazards and risks.

Between the lines

Current evaluation methodologies for synthesis models can only tackle tightly specified, narrow tasks, and thus fail to exhaustively evaluate a model’s capabilities and, in turn, to address the potential safety hazards, failure modes, and risks of code synthesis LLMs and their applications. If we wish to understand their performance and, consequently, their safety risks relative to human expertise, we should evaluate generation and synthesis models against the complexity and expressivity of specification prompts and their capability to understand and execute them.

We thus proposed a novel evaluation framework that determines the capacity of advanced code generation techniques against the complexity and expressivity of specification prompts. Our evaluation framework is appropriate for current LLMs generating code and even prospectively more proficient models, given that Codex has only demonstrated a preliminary capability to solve relatively high-level and complex specifications. 

With a more informed capabilities evaluation, we describe how to construct a novel hazard analysis that appropriately considers novel harms posed by all uses of LLMs by expanding the standardized definitions of HSCs and HRIs. A present limitation is that both the evaluation and the risk assessment require manual effort by a human expert to interpret and classify model outputs.

