🔬 Research Summary by Heidy Khlaaf, an Engineering Director at Trail of Bits specializing in the evaluation, specification, and verification of complex or autonomous software implementations in safety-critical systems, ranging from UAVs to large nuclear power plants. Her expertise ranges from leading numerous system safety audits (e.g., IEC 61508, DO-178C) that contribute to the assurance of safety-critical software within regulatory frameworks and safety cases, to bolstering the dependability and robustness of complex software systems through techniques such as system hazard analyses and formal verification to identify and mitigate system and software risks.
[Original paper by Heidy Khlaaf, Pamela Mishkin, Joshua Achiam, Gretchen Krueger, Miles Brundage]
Overview: Code Synthesis Large Language Models (LLMs) such as Codex provide many benefits. Yet these models have significant limitations, alignment problems, the potential to be misused, and the possibility of destabilizing other technical fields. Their safety impacts are not yet fully understood or categorized. This paper therefore constructs a hazard analysis framework to uncover these hazards and safety risks, informed by a novel evaluation framework that determines the capabilities of advanced code generation techniques.
Introduction
Neural network models that generate code (e.g., Codex) have the potential to be useful in a range of ways. However, Codex also raises significant safety challenges, does not always produce code that is aligned with user intent, and has the potential to be misused. In this paper, we aim to assess the generative capabilities of these models, and the risks attached to those generative uses, via a novel hazard analysis approach adapted for LLMs. Although we developed this framework to study Codex specifically, our analysis generalizes to the broader class of code generation systems.
Risk assessments and hazard analyses require implicit assumptions and knowledge regarding a prospective system’s capacities, limitations, and failure modes. In the case of code synthesis LLMs, and LLMs more generally, these capabilities and failure modes are not yet fully understood. Additionally, existing evaluation metrics have assumed relatively “simple” functions or module-level problems and have not considered the safety implications of these technologies’ misuse. Thus, to better understand Codex’s limitations and safety implications, we developed an evaluation framework for assessing model capabilities. Our capabilities evaluation framework includes a set of qualitative attributes and test problems aiming to measure the extent to which models can generate code meeting increasingly complex and higher-level specifications.
The capabilities evaluation then informs a hazard analysis tailored for large language models that generate code, like Codex. We adapt the traditional risk assessment framework with a newly defined set of Hazard Severity Categories (HSC) to accommodate the novel safety issues that LLMs and their applications pose. A set of defined harms and losses supports the new HSC associated with language model APIs.
Key Insights
1. Evaluation of Capabilities of Language Model-Based Code Generation
Previous synthesis and generation metrics have concentrated on analyzing the correctness and complexity of the code output rather than the expressivity and complexity of the specification. For example, researchers have recommended metrics such as McCabe Cyclomatic Complexity. Yet evaluating the output of synthesized code is moot if there is no specification it can be measured against.
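To make the contrast concrete: an output-side metric like cyclomatic complexity can be computed mechanically from the generated code alone, which is exactly why it says nothing about whether the code meets any particular specification. Below is a minimal sketch of such a metric, approximating McCabe complexity as the number of branch points plus one; the counting rules are a common simplification and are ours, not the paper's methodology.

```python
# Approximate McCabe Cyclomatic Complexity as (decision points) + 1.
# Counting each `if`/`for`/`while`/`except`/boolean operator node once is a
# simplification of the strict definition, but it illustrates the point:
# nothing here consults a specification.
import ast

BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                ast.BoolOp, ast.IfExp)

def cyclomatic_complexity(source: str) -> int:
    """Count branch-point AST nodes in a Python snippet and add one."""
    tree = ast.parse(source)
    decisions = sum(isinstance(node, BRANCH_NODES) for node in ast.walk(tree))
    return decisions + 1

snippet = """
def clamp(x, lo, hi):
    if x < lo:
        return lo
    if x > hi:
        return hi
    return x
"""
print(cyclomatic_complexity(snippet))  # 3: two `if`s plus the linear path
```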
If we wish to understand the performance of generation and synthesis models relative to human ability, we should evaluate them against the complexity and expressivity of specification prompts and assess their capability to understand and execute them. We propose adapting attributes previously used to measure the expressivity and complexity of formal specifications to natural language prompts, in combination with varying degrees of specification abstraction levels.
1.1 Specification Abstraction
Higher-level specifications are often distinguished from lower-level specifications by allocating further structure and behavior within a defined boundary to satisfy one or more higher-level requirements. Indeed, higher-level specifications are more ambiguous and more difficult for code synthesis models to satisfy: the algorithm would need to implicitly derive an internal set of “lower-level” specifications before synthesizing the corresponding code solution. The degrees of separation between requirements and code would be greater, entailing the synthesis of inter-procedural and architectural solutions across a large, unconstrained space.
Additionally, several coding practices are implicit in successfully generating code against higher-level specifications. These include:
- Code and parameterized reuse
- Automatic determination of program architecture
- Wide range of programming constructs
- Well-definedness
- Wide applicability
Increasingly, these higher-level specifications should not need to state which programming constructs (defined in Section 1.2) the implementation requires; a code generation algorithm should be able to infer them instead.
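To illustrate the difference in abstraction levels, consider the same sorting task posed as two prompts. Both are our own illustrative examples rather than test problems from the paper, and the function name and docstrings are hypothetical.

```python
# Lower-level specification: the prompt pins down the data structure, the
# algorithm, and the I/O contract, so the model only fills in one function.
def merge_sorted(a: list[int], b: list[int]) -> list[int]:
    """Merge two already-sorted lists into one sorted list in
    O(len(a) + len(b)) time, using a two-pointer walk and without
    calling sorted()."""
    ...

# Higher-level specification: the prompt states only an end goal, so the
# model must implicitly derive the lower-level pieces (parsing, data
# structures, ordering, output format) before writing any code:
#
#   "Write a command-line tool that reads newline-delimited integers from
#    any number of input files and prints them as a single sorted sequence."
```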
1.2 Specification Complexity
Beyond the abstraction level of a specification, we should consider language-independent constructs that developers exercise with varying degrees of expertise. This entails evaluating the ability to reason over computations and states at different levels of specification abstraction as a base metric for complexity and expressivity. We thus propose adapting the following attributes, previously used to measure the expressivity and complexity of formal specifications, to natural language prompts:
• Variable Interdependencies: Tracking the state of more than one variable, their interdependencies and nesting, all possible state permutations, and the relationship between input and output parameters.
• Temporal Reasoning: Considering future and past program states, including:
  – Safety properties, entailing that a defined “bad” state never occurs
  – Liveness properties, entailing progress towards a specific goal or state
• Concurrency and Parallelism: Correct and sound reasoning over computational interleavings for various specification granularities (e.g., strong fairness, weak fairness, mutual exclusion, atomicity, and synchronization).
• Nondeterminism: Reasoning about algorithms that can produce different outputs for the same input on different executions.
• Hyperproperties: Information-flow policies and cryptographic algorithms requiring observational determinism, i.e., that programs behave as (deterministic) functions from low-security inputs to low-security outputs (e.g., non-interference).
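As a concrete illustration of one attribute, the sketch below (our own hypothetical example, not one of the paper's test problems) shows how a temporal safety property, "the balance never goes negative," constrains every reachable program state rather than just the final output, which is what a model must reason about to satisfy such a prompt.

```python
# Temporal reasoning / safety property: "self.balance >= 0 after every call"
# is a property over all program states. A prompt asking for `withdraw`
# implicitly asks the model to reason about every reachable state of
# `balance`, not merely to return a plausible value.

class Account:
    def __init__(self, balance: int) -> None:
        assert balance >= 0
        self.balance = balance

    def withdraw(self, amount: int) -> bool:
        """Deduct `amount` if funds allow; otherwise leave state unchanged."""
        if amount < 0 or amount > self.balance:
            return False          # rejecting the request preserves the invariant
        self.balance -= amount
        assert self.balance >= 0  # the "bad" state is never reached
        return True

acct = Account(10)
assert acct.withdraw(4) and acct.balance == 6
assert not acct.withdraw(7) and acct.balance == 6  # refused, state preserved
```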
Note that many of the attributes and metrics defined here concern implementation-level design.
2. Risk Assessment
Hazard analysis is a technique, typically used for safety-critical systems, that collects and interprets information on hazards and the conditions that lead to their presence in order to determine the significant risks (i.e., risk assessment) that lead to unsafe behavior. Risks are assessed within the context of the probability and severity of the hazard becoming a reality. However, unlike with traditional software systems, the potential safety hazards, failure modes, and risks of LLMs and their applications are not yet well understood, making a hazard analysis challenging.
Risk assessment frameworks require a defined Hazard Severity Category (HSC). Yet the standard definitions are insufficient to accommodate the novel safety issues that LLMs and their applications pose. In Table 1, we thus propose a novel set of HSC associated with language model APIs, supported by a set of defined harms and losses (see Table 2) that may serve as foundations for safety efforts for all language models. We believe this expansion of the standardized HSC definitions will not only bolster the use of traditional hazard analysis practices within the ML community, but will also allow industries that utilize hazard analysis to appropriately consider the novel harms posed by all uses of LLMs (e.g., GPT-3).
A standard probability guide with qualitative metrics was used for hazard probabilities (i.e., Frequent (A), Probable (B), Occasional (C), Remote (D), Improbable (E)). The cross-product of the above HSC and these qualitative hazard probability levels is then used to form the Hazard Risk Index (HRI) in Table 3.
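A minimal sketch of how such a cross-product can be mechanized follows. The severity labels and acceptability thresholds below are illustrative placeholders in the style of MIL-STD-882; the paper's actual LLM-specific HSC and HRI assignments are defined in its Tables 1 and 3.

```python
# Hypothetical severity categories (the paper defines its own LLM-specific
# HSC in Table 1; here, category 1 is the most severe).
SEVERITY = {"Catastrophic": 1, "Critical": 2, "Marginal": 3, "Negligible": 4}

# Qualitative probability levels from the standard guide cited above.
PROBABILITY = {"Frequent": "A", "Probable": "B", "Occasional": "C",
               "Remote": "D", "Improbable": "E"}

def hazard_risk_index(severity: str, probability: str) -> str:
    """Form an HRI cell, e.g., ('Critical', 'Probable') -> '2B'."""
    return f"{SEVERITY[severity]}{PROBABILITY[probability]}"

# Illustrative decision rule for which HRI cells demand a control; the
# actual acceptability boundaries are set in the paper's Table 3.
UNACCEPTABLE = {"1A", "1B", "1C", "2A", "2B", "3A"}

hri = hazard_risk_index("Critical", "Probable")
print(hri, "-> control required" if hri in UNACCEPTABLE else "-> acceptable")
```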
When we performed our hazard analysis for the Codex API, we used the results of the evaluations in Section 1 to inform our HRI. Our risk assessment focuses on identifying risk factors with the potential to cause harm across several risk areas:
- Applications (e.g., Human Health, Opportunity and Livelihood, Social and Political Cues, Microtargeting, Integrations into Safety-Critical Systems, Government & Civics)
- Alignment (which, here, we interpret as the degree to which the behavior of the AI does or does not accord with user intentions; misaligned AI may produce unsafe behavior)
- System Design and Implementation (e.g., UX/UI, Documentation, Requirements, Data Provenance, Validation)
- Regulatory and Legal Oversight (e.g., Intellectual Property, Export Control, Data Privacy & Rights)
- Defense and Security
- Economic and Environmental Impacts
In Table 4, we provide an illustrative view of our final risk assessment approach, with a simplified sample list of hazard sources, descriptions, and controls identified for the Codex API.
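To show the shape such a hazard log row might take, here is a sketch using the same illustrative severity and probability placeholders as above; the fields and the example entry are our own framing, not content from the paper's Table 4.

```python
from dataclasses import dataclass, field

# Same illustrative placeholders as in the HRI sketch above.
SEVERITY = {"Catastrophic": 1, "Critical": 2, "Marginal": 3, "Negligible": 4}
PROBABILITY = {"Frequent": "A", "Probable": "B", "Occasional": "C",
               "Remote": "D", "Improbable": "E"}

@dataclass
class HazardEntry:
    """One row of an illustrative hazard log; the fields are our own framing."""
    source: str                       # originating risk area
    description: str                  # what can go wrong and how
    severity: str                     # Hazard Severity Category label
    probability: str                  # qualitative level, Frequent..Improbable
    controls: list[str] = field(default_factory=list)  # proposed mitigations

    @property
    def hri(self) -> str:
        """Preliminary HRI, e.g., 'Critical' x 'Occasional' -> '2C'."""
        return f"{SEVERITY[self.severity]}{PROBABILITY[self.probability]}"

# Hypothetical entry; not taken from the paper's Table 4.
entry = HazardEntry(
    source="Alignment",
    description="Generated code compiles but silently diverges from user intent",
    severity="Critical",
    probability="Occasional",
    controls=["human review before deployment", "test-based validation"],
)
print(entry.hri, "->", "; ".join(entry.controls))
```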
The preliminary HRI for each hazard helps us understand how risks compare and whether a given hazard is worth controlling. Risk assessments should be conducted by a multidisciplinary team with backgrounds in safety, policy, security, engineering, and law to ensure comprehensive coverage of possible hazards and risks.
Between the lines
Current evaluation methodologies for synthesis models can tackle only tightly specified or narrow tasks; they therefore fail to exhaustively evaluate a model’s capabilities, and with them the ability to address the potential safety hazards, failure modes, and risks of code synthesis LLMs and their applications. If we wish to understand these models’ performance, and consequently their safety risks relative to human expertise, we should evaluate generation and synthesis models against the complexity and expressivity of specification prompts and their capability to understand and execute them.
We thus proposed a novel evaluation framework that determines the capacity of advanced code generation techniques against the complexity and expressivity of specification prompts. Our evaluation framework is appropriate for current code-generating LLMs and even for prospective, more proficient models, given that Codex has so far demonstrated only a preliminary capability to solve relatively high-level and complex specifications.
With a more informed capabilities evaluation, we describe how to construct a novel hazard analysis that appropriately considers the harms posed by all uses of LLMs by expanding the standardized definitions of HSCs and HRIs. A present limitation is that both the evaluation and the risk assessment require manual effort by a human expert to interpret and classify model outputs.