
✍️ By Tejasvi Nallagundla.
Tejasvi is an Undergraduate Student in Computer Science, Artificial Intelligence, and Global Studies and an Undergraduate Affiliate at the Governance and Responsible AI Lab (GRAIL), Purdue University.
📌 Editor’s Note: This article is part of our AI Policy Corner series, a collaboration between the Montreal AI Ethics Institute (MAIEI) and the Governance and Responsible AI Lab (GRAIL) at Purdue University. The series provides concise insights into critical AI policy developments from the local to international levels, helping our readers stay informed about the evolving landscape of AI governance. This piece uses Anthropic’s Claude Constitution, Responsible Scaling Policy, and Claude Sonnet 4.6 System Card as a representative example to examine how layered corporate AI policy documents come together to define and shape the boundaries of model behavior across a governance stack.
AI Governance as a Stack
We often think of AI governance as a set of rules or clear guidelines: a “yes” or “no” on what a given implementation or tool can and cannot do. In practice, however, the companies building these tools must “govern” AI not only by complying with rules set out by governing bodies, but also in a corporate sense. This corporate governance comes not as a single “AI policy document,” but as a range of materials, from high-level principles to in-depth technical assessments. We can think about these materials as a stack: a layered set of documents, each serving a different purpose within the broader governance goals of a corporate AI company.
One way to unpack this stack and see it in practice is to trace a single question across several of a company’s documents. Here, I do so through the AI lab Anthropic and the question of how boundaries for model behavior are defined. Looking across three layers of Anthropic’s governance stack, we see that the boundary is not defined in any one place but constructed across all of them, with each layer shaping and presenting it for a different purpose, albeit toward the same overarching goal.
The Value Layer
Starting at the top of the stack is Claude’s Constitution, which Anthropic defines as “a detailed description of Anthropic’s intentions for Claude’s values and behavior.” From the perspective of our policy stack, we can think of this as the value layer. In this document, the company presents the boundaries of model behavior as anything but fixed. They explain their approach of favoring the cultivation of “good values and judgment over strict rules and decision procedures,” with Claude playing a role in determining its own behavior through holistic prioritization, balancing considerations such as helpfulness against guidelines and safety. At the value layer, then, the boundary definition is normative, describing what the model ought to do rather than fixing or operationalizing it.
The Risk Layer
Moving down the stack, the next layer is Anthropic’s Responsible Scaling Policy (RSP), which they define as a framework establishing how they “identify and evaluate risks” and make “decisions about AI development and deployment.” We can think of this as the risk layer. Returning to our question, the RSP presents the boundaries of model behavior in a more structured way, tying them to capability thresholds, risk analyses, and internal governance decisions, while still noting that there is “flexibility in how risk thresholds are evaluated.” At the risk layer, then, the boundary is framed in terms of evaluation rather than as a completely fixed line, echoing its treatment in the Constitution.
The Evaluation Layer
Now we turn to Anthropic’s System Cards, which we can think of as the evaluation layer of the governance stack. In their recent System Card for Claude Sonnet 4.6, they describe the document as outlining the model’s “characteristics, capabilities, and safety profile that [they] carried out before its public deployment.” At this layer, boundaries are presented through testing against thresholds and the associated evaluation results. One notable observation is that “confidently ruling out these thresholds is becoming increasingly difficult” as model capabilities evolve, highlighting that here, too, the boundary is not fixed but shaped through testing and acknowledged uncertainty.
Beyond the evaluation layer, a whole range of other materials add to Anthropic’s governance stack, from its Transparency Hub to government consultations, usage policies, and research publications. Taken together, the focus, content, and framing of these materials, on boundaries of model behavior and other topics, shape how corporate governance of these issues is constructed and distributed across the system.
Further Reading
- In Claude We Trust? Evaluating the New Constitution
- Claude’s New Constitution: AI Alignment, Ethics, and the Future of Model Governance
- Exclusive: Anthropic Drops Flagship Safety Pledge
- Responsible Scaling: Comparing Government Guidance and Company Policy
Image credit: Distant Writing by Fabrizio Matarese / Better Images of AI / CC BY 4.0
