🔬 Research Summary by Joe O’Brien, an Associate Researcher at the Institute for AI Policy and Strategy, focusing on corporate governance and accountability surrounding the development and deployment of frontier AI models.
[Original paper by Joe O’Brien, Shaun Ee, and Zoe Williams]
Overview: This report describes a toolkit that frontier AI developers can use to respond to risks discovered after the deployment of a model. It also provides a framework for AI developers to prepare and implement this toolkit.
Introduction
Recent history features plenty of cases where AI models have behaved or been used in unintended ways after deployment. As AI capabilities progress and the scale of adoption of AI systems grows, the impacts of model deployments may become increasingly significant, especially for leading AI developers such as OpenAI, Google DeepMind, Anthropic, Microsoft, Google, Amazon, and Meta. While AI developers can adopt several safety practices before deployment (such as red-teaming, risk assessment, and fine-tuning) to reduce the likelihood of incidents, these practices are unlikely to pre-empt all potential issues.
To manage this gap, this paper recommends that leading AI developers establish the capacity for “deployment corrections”: a set of tools to rapidly restrict access to a deployed model, covering all or part of its functionality and/or user base. This capacity would enable fast, appropriate responses to a) dangerous capabilities or behaviors identified in post-deployment risk assessment and monitoring and b) serious incidents. The paper also describes practices that can lower the barrier to making decisive, appropriate decisions on deployment corrections.
Key Insights
As a toolkit
Frontier AI developers that make their models available to downstream users via an interface (e.g., an API) rather than via open-sourcing have many tools at their disposal to limit access to the model. At a high level, this toolkit includes:
- User-based restrictions (such as blocklisting or allowlisting)
- Access frequency restrictions (such as throttling the number of prompts that can be submitted to a model in a time period)
- Capability restrictions (such as filtering harmful model outputs)
- Use case restrictions (such as prohibiting a model’s use in high-stakes applications)
- Full shutdown (such as decommissioning a model)
These tools can be used in a broad range of scenarios, from cases where risks from the model are fairly limited to scenarios where the harms are potentially severe and can arise even from proper use by a trusted user.
Restricting model access may be difficult in practice, as downstream users may become dependent on the capabilities of newly deployed models. To minimize these downstream harms and to lower the barrier for developers to institute deployment corrections as a precaution, the paper outlines a space of deployment-correction options that supports a scalable and targeted approach. AI developers can opt for combinations of restrictions and tailor these choices to respond effectively to specific incidents while minimizing downstream harms.
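To make the toolkit concrete, here is a minimal sketch of how these restrictions might be enforced at an API gateway. It is illustrative only: the class and field names (RestrictionPolicy, CorrectionGateway, and so on) are assumptions made for this summary, not part of any developer’s actual infrastructure or of the original paper.

```python
from __future__ import annotations

import time
from dataclasses import dataclass, field


@dataclass
class RestrictionPolicy:
    blocklisted_users: set[str] = field(default_factory=set)    # user-based restrictions
    max_requests_per_minute: int = 60                            # access frequency restrictions
    banned_output_terms: tuple[str, ...] = ()                    # capability restrictions
    prohibited_use_cases: set[str] = field(default_factory=set)  # use case restrictions
    model_decommissioned: bool = False                           # full shutdown


class CorrectionGateway:
    """Checks each request against the currently active deployment corrections."""

    def __init__(self, policy: RestrictionPolicy):
        self.policy = policy
        self._request_times: dict[str, list[float]] = {}

    def check_request(self, user_id: str, declared_use_case: str) -> None:
        """Raise PermissionError if the request violates an active restriction."""
        p = self.policy
        if p.model_decommissioned:
            raise PermissionError("Model has been decommissioned (full shutdown).")
        if user_id in p.blocklisted_users:
            raise PermissionError(f"User {user_id} is blocklisted.")
        if declared_use_case in p.prohibited_use_cases:
            raise PermissionError(f"Use case '{declared_use_case}' is prohibited.")
        now = time.time()
        recent = [t for t in self._request_times.get(user_id, []) if now - t < 60]
        if len(recent) >= p.max_requests_per_minute:
            raise PermissionError("Rate limit exceeded (access throttled).")
        self._request_times[user_id] = recent + [now]

    def filter_output(self, text: str) -> str:
        """Withhold outputs that match an active capability restriction."""
        if any(term.lower() in text.lower() for term in self.policy.banned_output_terms):
            return "[output withheld under an active capability restriction]"
        return text
```

In a sketch like this, tightening or relaxing a correction is a configuration change to the policy object rather than a code change, which is what makes a scalable, targeted response practical.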
Building organizational capacity
Tools alone are insufficient for action. To respond most effectively to incidents involving their deployed models, AI developers will need to develop procedures, roles, and responsibilities for managing decisions around deployment corrections. The paper recommends that AI developers focus on four stages of implementation: preparation, monitoring, execution, and post-incident follow-up.
Preparation refers to the act of building and adopting the tools and procedures that will allow an AI developer to act swiftly and effectively in response to an incident. It includes identifying and understanding possible threats, establishing triggers for deployment corrections, developing tools and procedures for incident response, and establishing decision-making authorities. Externally, it includes sharing insights on best practices with regulators and industry partners and defining fallback options for downstream users in the case of service interruption.
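As an illustration of what “establishing triggers” could look like in practice, the sketch below encodes a hypothetical mapping from monitored signals to pre-approved corrections and decision-makers. The signal names, thresholds, correction types, and role titles are all assumptions made for this summary, not prescriptions from the paper.

```python
# Hypothetical trigger definitions prepared in advance of deployment.
# Signal names, thresholds, corrections, and roles are illustrative assumptions.
TRIGGERS = [
    {
        "signal": "jailbreak_success_rate",      # share of red-team prompts that bypass safeguards
        "threshold": 0.05,
        "correction": "capability_restriction",  # e.g., tighten output filtering
        "decision_authority": "incident_response_lead",
    },
    {
        "signal": "confirmed_misuse_reports",    # verified reports of prohibited use
        "threshold": 1,
        "correction": "user_blocklist",
        "decision_authority": "trust_and_safety_on_call",
    },
    {
        "signal": "severe_incident_flag",        # serious harm plausibly traced to the model
        "threshold": 1,
        "correction": "full_shutdown",
        "decision_authority": "executive_risk_committee",
    },
]
```

Writing triggers down in some form like this, and agreeing on them before an incident occurs, is what allows the later monitoring and execution stages to escalate quickly rather than negotiating authority mid-crisis.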
Monitoring refers to the process of continuously gathering data on a model’s capabilities, behavior, and use (via a diverse range of sources), analyzing this data for anomalies, and escalating cases of concern to relevant decision-makers. AI developers should also feed relevant data back into the threat modeling process.
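Continuing the illustration, a monitoring pipeline might periodically compare the latest telemetry against trigger definitions like those sketched above and escalate anything that fires to the named decision-maker. This is again a minimal, assumed sketch: the stubbed telemetry and the `notify` callable stand in for a developer’s real monitoring and paging systems.

```python
# Minimal, hypothetical monitoring-and-escalation step. The trigger format
# matches the preparation sketch above; `notify` stands in for a real paging system.
def evaluate_signals(signals: dict[str, float], triggers: list[dict]) -> list[dict]:
    """Return the triggers whose thresholds the latest monitoring data meets or exceeds."""
    return [t for t in triggers if signals.get(t["signal"], 0.0) >= t["threshold"]]


def escalate(fired: list[dict], notify) -> None:
    """Route each fired trigger to the decision-maker named during preparation."""
    for t in fired:
        notify(
            recipient=t["decision_authority"],
            message=(f"Signal '{t['signal']}' crossed its threshold; "
                     f"pre-approved correction: {t['correction']}."),
        )


# Example with stubbed telemetry and a print-based notifier:
triggers = [{"signal": "jailbreak_success_rate", "threshold": 0.05,
             "correction": "capability_restriction",
             "decision_authority": "incident_response_lead"}]
latest = {"jailbreak_success_rate": 0.08, "confirmed_misuse_reports": 0}
escalate(evaluate_signals(latest, triggers),
         notify=lambda recipient, message: print(recipient, "<-", message))
```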
Execution refers to the decision to apply a deployment correction to a model and the procedures that follow this decision. This stage also includes alerting and coordinating with relevant regulatory authorities, implementing fallback systems for downstream users, and notifying customers of the situation.
Post-incident follow-up refers to the set of actions relevant to recovery, restoration, learning, and ongoing risk management in the wake of an incident. This stage involves repairing a model and restoring service, conducting after-action reviews, and feeding lessons back into the previous stages. In some cases, it may require significant involvement from external parties (such as when the incident is particularly severe and the underlying issue is likely to occur in models developed by other companies).
Between the lines
While some recently published standards and guidance have called out the need for AI developers to monitor deployed models for risks, and to be prepared to withdraw them when necessary, there is more work to be done. Policymakers and AI companies will need to coordinate on several capacity-building measures, including (but not limited to):
- Defining and sharing threat models and developing tools to parse data for signs of misuse or undesired model behavior.
- Developing a standardized framework for frontier AI incident response and sharing best practices.
- Establishing secure reporting lines for quickly communicating across industry and government in the case of an incident or discovered vulnerability.
Policymakers could also consider requiring frontier AI developers to take certain critical steps, such as maintaining control over model access or maintaining incident response plans and making such plans available to relevant agencies.
Finally, it is worth noting that the deployment corrections framework is not a silver bullet for managing AI risks. It is one small part of a larger conversation about building stronger governance mechanisms around frontier AI model development and deployment. This conversation has recently seen major advancements in the form of a US Executive Order and a flurry of publications of AI firms’ safety policies. While we look forward to seeing work that expands on our framework, we also look forward to work that fills important gaps in the broader project of governing frontier AI development and deployment.