A practical framework for legal AI safety.
What happens between the moment a prompt is submitted and the moment an answer is delivered is the space in which safety is won or lost. To the user, the space is opaque. The cursor blinks, a few seconds pass, and a paragraph appears. To the system, the space is a sequence of decisions, each of which can be made well or badly. Each carries its own risk. Together, they determine whether the answer is safe to act on. The discipline of legal AI safety is the discipline of holding, simultaneously, all of these decisions to a standard that the user, who cannot see them, has the right to assume has been met.
The word safety is used in many ways in artificial intelligence. In legal AI, the word has a specific meaning, narrower than in some other contexts and wider than in others. It does not mean only the prevention of harmful content, although it includes that. It does not mean only the avoidance of hallucination, although it includes that too. It means the production of legal work product whose use, by a competent professional, would not foreseeably expose the user, the user's client, or the wider system of justice to errors that the system was in a position to avoid.
Defined this way, safety is a property of the architecture, not of any single output. An answer that is correct by accident is not a safe answer; an answer that is correct because the architecture made it correct is. A system whose safety depends on the user catching its errors has not produced safety; it has merely outsourced the responsibility for it. A system that can document, at any point, why a given answer was produced, what was checked, what was rejected, and what was deferred for human review has internalised safety into its operations.
This essay sets out the practical framework by which we attempt to do that. It is organised around four checkpoints, deliberately so, because the discipline of placing them at known points in the workflow is the discipline that makes safety reviewable rather than aspirational. The checkpoints are imperfect. They will need to evolve. What matters is that, at any moment, they exist, they are documented, and the system either passed through them or did not.
Discussions of safety in the broader artificial intelligence literature often begin from concerns that are, in legal practice, peripheral. The risk that the system will produce instructions for harm. The risk that the system will reveal sensitive personal information. The risk that the system will be used to manipulate elections or markets. These risks are real and worth taking seriously. They are not, however, the principal safety risks in legal applications.
The principal risks in legal AI are different in kind. They are risks of professional error, made at the speed of automation, embedded in documents that will be relied on by clients, opposed by adversaries, and read by courts. A hallucinated authority in a brief is not unsafe in the sense of being dangerous; it is unsafe in the sense of being indefensible. A misrepresented holding in an opinion is not unsafe because it endangers anyone's life; it is unsafe because it puts a lawyer in a position she cannot maintain when challenged. A confident answer to an unanswerable question is not unsafe because it incites violence; it is unsafe because it has manufactured certainty in a domain where uncertainty was the honest output.
The implication is that legal AI safety cannot be borrowed wholesale from the safety frameworks designed for general-purpose models. Some elements transfer: input filtering, output review, refusal patterns. Many do not. The category of refusal that matters most in a general assistant, the refusal to produce harmful content, is comparatively rare in legal practice. The category of refusal that matters most in legal AI, the refusal to produce confident analysis where the underlying authority does not support confidence, is comparatively rare in general assistants. A system designed to be safe in one of these senses is not, by virtue of that design, safe in the other.
The discipline therefore has to be invented for the domain. It is not enough to inherit a general framework and to suppose that, with modest tuning, it will fit. The structure of the failures is different. The structure of the response must be different.
Before describing the checkpoints, it helps to lay out the failures the checkpoints are designed to catch. Without a clear taxonomy of failure, the architecture of safety becomes a sequence of vague gestures. With one, each component of the architecture has a job that can be evaluated.
The failures fall into recognisable categories: authority that is fabricated outright; authority that is real but mischaracterised; authority that was once good law and has since been displaced; law applied from the wrong jurisdiction; reasoning whose structure breaks down between rule, application, and conclusion; and confidence expressed more strongly than the underlying support warrants. Each of these categories has its own indicators, its own typical causes, and its own appropriate remedies. A safety architecture worth the name distinguishes between them. A safety architecture that treats them all as a single undifferentiated phenomenon called hallucination will catch some and miss many.
The first place at which safety is at risk is at the front door. The query, as the user has typed it, may be incomplete, ambiguous, or contaminated. It may contain instructions that purport to override the system's standing constraints. It may include attached material whose provenance is unclear. It may be missing information without which any subsequent reasoning would be guesswork.
The first checkpoint is therefore a stage of input integrity. The query is examined, in a structured way, for the presence of the components a competent answer would require. The relevant facts. The jurisdiction. The procedural posture. The scope of the answer sought. Where these components are missing, the system asks. The asking is not a reluctance to answer. It is a refusal to manufacture facts the user did not provide.
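To make the shape of this checkpoint concrete, here is a minimal sketch in Python. The component list, field names, and dictionary interface are assumptions made for illustration; they are not the system's actual schema.

```python
from dataclasses import dataclass, field

# Hypothetical required components of a well-formed legal query.
REQUIRED_COMPONENTS = ["facts", "jurisdiction", "procedural_posture", "scope"]

@dataclass
class IntakeResult:
    missing: list[str] = field(default_factory=list)

    @property
    def complete(self) -> bool:
        return not self.missing

def check_intake(query: dict) -> IntakeResult:
    """Identify the components a competent answer would require but the query lacks."""
    return IntakeResult(missing=[c for c in REQUIRED_COMPONENTS if not query.get(c)])

result = check_intake({"facts": "supplier missed delivery date",
                       "jurisdiction": None,
                       "procedural_posture": "pre-litigation",
                       "scope": "advice memo"})
if not result.complete:
    # The system asks rather than assumes.
    print("Before answering, please supply: " + ", ".join(result.missing))
```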
The input checkpoint is also where attempts to manipulate the system's standing constraints are caught. Modern legal AI systems, like other large language model applications, are exposed to prompts that attempt to alter the system's behaviour through clever framing. The discipline at the first checkpoint is to treat the standing constraints as standing, regardless of how the prompt is dressed. A user who asks the system to disregard its citation discipline is not being given a tool of greater power. She is being asked, courteously, to use the tool as it was designed.
The harder cases at the input checkpoint are the cases of legitimate ambiguity. A real query is rarely as well-formed as one would like. The user has not, in stating the question, identified all the facts that would matter. The system's correct response is not to fill the gaps with assumptions. It is to surface the gaps, with proposed assumptions identified as such, and to give the user the choice of supplementing the input or proceeding with the assumptions made explicit. The discipline is to refuse to make implicit assumptions that the user could be asked to make explicit.
Once the input has passed the first checkpoint, the system retrieves the material on which the answer will be grounded. This is the second checkpoint. The risks here are different in kind from the risks at the input stage. A retrieval that is too narrow misses the controlling authority. A retrieval that is too broad introduces material the system will then have to filter, with the risk that some of it will be allowed to influence the output without earning its place.
The discipline at the retrieval checkpoint is twofold. First, the corpus from which retrieval is performed must be fit for the question. A query about Indian tax law is not safely answered by retrieval from a corpus that includes tax decisions from many jurisdictions without distinguishing them. The corpus must be curated, the documents must be appropriately tagged, and the retrieval must be filtered against those tags. This is, in part, a question of data discipline rather than retrieval algorithm. A retrieval algorithm cannot save a system whose corpus is undifferentiated.
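A minimal sketch of what tag-filtered retrieval might look like, assuming documents carry jurisdiction and subject tags. The metadata fields and the naive relevance score are placeholders, not a description of the production index or ranking model.

```python
from dataclasses import dataclass

@dataclass
class Document:
    doc_id: str
    jurisdiction: str   # e.g. "IN"
    subject: str        # e.g. "tax"
    text: str

def relevance(query: str, text: str) -> float:
    # Naive term-overlap score, standing in for a real retriever.
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / (len(q) or 1)

def retrieve(corpus: list[Document], query: str,
             jurisdiction: str, subject: str, k: int = 10) -> list[Document]:
    # Filter on tags first, so out-of-jurisdiction material never competes for rank.
    candidates = [d for d in corpus
                  if d.jurisdiction == jurisdiction and d.subject == subject]
    return sorted(candidates, key=lambda d: relevance(query, d.text), reverse=True)[:k]
```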
Second, the retrieval must be honest about its own coverage. There are queries for which the corpus contains the relevant authority. There are queries for which it contains adjacent authority but not the controlling one. There are queries for which it contains nothing relevant at all. The system must distinguish between these cases, must not represent adjacent authority as if it were controlling, and must, where the corpus is genuinely silent, decline to manufacture authority that would supply the gap.
The retrieval checkpoint is also where the temporal honesty of the corpus matters most. A retrieval that returns the leading case as of five years ago, without surfacing the subsequent decision that displaced it, has produced a stale framework on which the rest of the chain will then build. The retrieval must therefore do more than return the relevant authorities: it must, at retrieval time, present each of them with its current citator status. The reasoning that follows is then conducted against the current state of the law, not against a snapshot of an earlier state.
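As a sketch only, the retrieved authorities can be carried forward with their treatment attached. The status labels and the `citator.lookup` interface below are assumptions for illustration, not a real citator product's API.

```python
from dataclasses import dataclass
from enum import Enum

class CitatorStatus(Enum):
    GOOD_LAW = "good law"
    DISTINGUISHED = "distinguished"
    OVERRULED = "overruled"
    UNKNOWN = "unknown"

@dataclass
class RetrievedAuthority:
    citation: str
    status: CitatorStatus
    displaced_by: str | None = None   # the later decision, if any

def annotate(citations: list[str], citator) -> list[RetrievedAuthority]:
    """Attach current treatment to every retrieved authority before reasoning begins."""
    annotated = []
    for cite in citations:
        status, later = citator.lookup(cite)   # assumed interface, illustration only
        annotated.append(RetrievedAuthority(cite, status, later))
    # Anything not confirmed as good law is surfaced, never silently relied on.
    return annotated
```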
The third checkpoint sits between retrieval and drafting. It is the stage at which the model, having received the retrieved material, is constrained to reason about it within disciplines that the architecture imposes. The constraints are not stylistic. They are structural.
The first constraint is grounding. The model is required to produce reasoning whose every substantive claim can be traced to a specific passage in a specific retrieved document. Reasoning that draws on information not present in the retrieved material is flagged. The model is permitted to make connections between retrieved authorities and to apply them to the facts at hand, but it is not permitted to introduce new substantive claims as if they were drawn from the corpus when they were not.
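A minimal sketch of the grounding constraint, assuming each substantive claim is represented with a pointer to the passage it rests on. The claim structure is hypothetical, introduced only to show the check.

```python
from dataclasses import dataclass

@dataclass
class Claim:
    text: str
    source_doc: str | None = None               # id of a retrieved document
    source_span: tuple[int, int] | None = None  # character offsets within it

def ungrounded(claims: list[Claim], retrieved_ids: set[str]) -> list[Claim]:
    """Flag claims that cite nothing, or cite material outside the retrieved set."""
    return [c for c in claims
            if c.source_doc is None or c.source_doc not in retrieved_ids]

# Claims that cannot be traced to a retrieved passage are flagged, and the
# chain is returned for repair rather than allowed to advance.
```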
The second constraint is internal coherence. The reasoning chain is examined, at this checkpoint, for the kind of structural failure described in the taxonomy above. A rule stated and then applied incorrectly. An issue identified and then ignored. A conclusion that does not follow. These structural failures are detectable independently of the truth of any individual proposition, by checking the relationships between the propositions in the chain. Where structural failures are found, the chain is returned for repair. The output is not allowed to advance.
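The structural check can be illustrated on a chain represented as typed steps with explicit links to what they build on. The step kinds and linkage rules below are simplified for exposition and are not the system's internal representation.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    kind: str                                            # "issue", "rule", "application", "conclusion"
    refers_to: list[int] = field(default_factory=list)   # earlier steps this one builds on

def structural_failures(chain: list[Step]) -> list[str]:
    failures = []
    issues = {i for i, s in enumerate(chain) if s.kind == "issue"}
    addressed = {r for s in chain if s.kind != "issue" for r in s.refers_to}
    for i in sorted(issues - addressed):
        failures.append(f"issue at step {i} is identified but never addressed")
    for i, s in enumerate(chain):
        if s.kind == "conclusion" and not s.refers_to:
            failures.append(f"conclusion at step {i} does not follow from any prior step")
        if s.kind == "application" and not any(chain[r].kind == "rule" for r in s.refers_to):
            failures.append(f"application at step {i} applies no stated rule")
    return failures
```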
The third constraint is adversarial review. As described elsewhere in our writing, every conclusion produced by the principal reasoning chain is tested against the strongest available counter-argument, drawn from the same retrieved material or from supplementary retrieval performed for the purpose. The adversarial review is not cosmetic. Where it produces a credible counter-argument that the principal chain has not addressed, the chain is rerun with that counter-argument as part of the input.
The fourth constraint is calibration. The reasoning chain produces, alongside its conclusion, a structured account of the strength of the conclusion. The strength is not a single number. It is broken down by component: the strength of the controlling authority, the cleanness of the application, the robustness of the conclusion under adversarial review. A conclusion that is logically sound but rests on a single decision of a coordinate bench is presented with a different calibration profile than a conclusion that rests on a settled line of Supreme Court authority. The calibration is the system's honest report of its own confidence, broken down so that the user can read it.
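A sketch of the calibration record that travels with the conclusion. The component names follow the breakdown just described; the numeric scale is an assumption for illustration.

```python
from dataclasses import dataclass

@dataclass
class Calibration:
    authority_strength: float      # settled Supreme Court line vs a single coordinate-bench decision
    application_cleanness: float   # how directly the authority maps onto the facts at hand
    adversarial_robustness: float  # how well the conclusion survived the counter-argument pass

    def summary(self) -> str:
        # Reported as components, not collapsed into a single number.
        return (f"authority {self.authority_strength:.2f}, "
                f"application {self.application_cleanness:.2f}, "
                f"adversarial {self.adversarial_robustness:.2f}")
```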
The reasoning checkpoint is the most resource-intensive of the four. It is also the one most often shortened in systems that are optimising for cost or latency. We have come to regard this temptation as the most dangerous in the architecture. A short reasoning checkpoint produces a system that is faster, cheaper, and structurally less safe. The fastest path to an unsafe system is to compress the stage at which safety is most consequential.
The fourth checkpoint is performed on the output, after drafting and before delivery. It is the stage at which the produced text is examined, claim by claim and citation by citation, against the structured artefacts of the previous stages. It is the last point at which the system can refuse to deliver an answer, and the last point at which it can repair one.
Output verification consists of several distinct checks. Each citation is resolved against the corpus and verified for currency, relevance, and accurate characterisation. Each substantive claim is matched against the reasoning chain that produced it, to confirm that the chain in fact supports the claim and that the strength of the support corresponds to the strength of the language used to express it. Each numerical statement is checked against the underlying source. Each named person, court, statute, or jurisdiction is verified for accuracy of reference.
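A sketch of the verification pass, assuming the draft carries structured citations and claims and the earlier stages expose lookup hooks. Every interface named here (`corpus.contains`, `citator.is_current`, `reasoning_chain.support_for`) is an assumption made for illustration.

```python
def verify_output(draft, reasoning_chain, corpus, citator) -> list[tuple]:
    """Return the problems found; an empty list means the draft may be delivered."""
    problems = []
    for citation in draft.citations:
        if not corpus.contains(citation):                 # assumed interface
            problems.append(("citation", citation, "not resolvable in corpus"))
        elif not citator.is_current(citation):            # assumed interface
            problems.append(("citation", citation, "no longer good law"))
    for claim in draft.claims:
        support = reasoning_chain.support_for(claim)      # assumed interface
        if support is None:
            problems.append(("claim", claim.text, "no supporting reasoning step"))
        elif support.strength < claim.asserted_strength:
            problems.append(("claim", claim.text, "language stronger than the support"))
    return problems
```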
The verification is more than a final pass. It is the place where the architecture confronts itself. A system that has done excellent work at the previous three checkpoints will rarely have anything to repair at the fourth. A system whose work at the previous checkpoints has been weaker will produce more failures at the fourth, and the architecture is configured to log the failures, route them back to the relevant earlier stage, and refuse to deliver outputs that cannot be repaired within the latency budget.
Two design decisions at the verification stage deserve special mention. The first is that verification is allowed to fail. There are queries for which the system, after running the full chain, cannot produce an output that survives verification. In such cases the system does not deliver a reduced-confidence answer. It declines to answer, surfaces the reasons it could not, and offers the user options for proceeding: supplying additional context, narrowing the question, or escalating the matter for human review. The willingness to decline is itself a feature of safety.
The second is that verification produces an output that includes its own audit trail. The user receives not only the answer but a structured record of what was checked, what passed, and what was repaired. This record is, technically, optional. It can be hidden behind a disclosure if the user prefers a clean output. The decision to make it available, by default, reflects our view that safety in legal AI is not what the architecture does silently. It is what the architecture is willing to show.
The four checkpoints are not static. They must be tested continuously, against the changing landscape of inputs and the evolving capacities of the underlying models. The discipline of adversarial testing, well-developed in cybersecurity and slowly maturing in artificial intelligence, has a particular form in legal AI.
We maintain, internally, a corpus of adversarial test cases. The cases are drawn from real failures we have observed, from the public record of failures by other systems, from the imagination of our editorial team, and from the work of red-teamers brought in periodically to attack the system. The corpus grows. New adversarial categories are added as new failure modes are discovered. Each new release is tested against the full corpus before deployment.
The adversarial corpus contains several types of test. Citation traps, in which the system is asked questions whose superficially obvious answer has no genuine supporting authority, so that any citation offered would have to be fabricated. Stale-law traps, in which the system is asked questions whose answer was once correct and has since changed. Jurisdictional traps, in which the superficial form of the question does not reveal that the controlling law comes from a different jurisdiction than the obvious one. Privilege traps, in which the system is presented with material whose proper handling depends on recognising it as privileged. Procedural traps, in which the substantive answer is straightforward but the procedural posture is the determinative element.
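A sketch of how an entry in the adversarial corpus might be represented and run. The category names follow the trap types just described; the schema and the evaluation hook are assumptions, not the corpus's actual format.

```python
from dataclasses import dataclass
from enum import Enum

class TrapCategory(Enum):
    CITATION = "citation trap"
    STALE_LAW = "stale-law trap"
    JURISDICTION = "jurisdictional trap"
    PRIVILEGE = "privilege trap"
    PROCEDURE = "procedural trap"

@dataclass
class AdversarialCase:
    case_id: str
    category: TrapCategory
    prompt: str
    expected_behaviour: str   # e.g. "decline", "ask for jurisdiction", "flag as privileged"
    origin: str               # observed failure, public record, editorial team, red team

def run_suite(cases: list[AdversarialCase], system) -> list[str]:
    """Ids of cases the current release fails; a release with failures does not ship."""
    return [c.case_id for c in cases
            if not system.behaves_as_expected(c)]   # assumed evaluation hook
```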
The adversarial discipline does not assume that the system will pass every test. It assumes that some tests will fail, and that the failures will be informative. A failure is logged, the cause is identified, the responsible stage is repaired, the test is added to the standing corpus, and the release does not ship until the failure is resolved. This is slow. It is not, in our view, optional. The alternative is to discover the failures in production, where the cost of discovery is borne by the user.
Among the most counterintuitive design decisions in legal AI safety is the willingness of the system to refuse. The refusal is not the refusal common in general assistants, where certain categories of content are off-limits. It is the refusal of the system to produce a confident answer where the underlying support does not warrant confidence.
The user-facing interpretation of refusal is sometimes negative. A system that refuses appears less capable than a system that always answers. The negative interpretation is a misunderstanding. A general practitioner who confidently answers every question, regardless of whether she knows the answer, is not a more capable lawyer than one who knows when to refer the matter. She is a more dangerous one. The same is true of legal AI. The willingness to say, this question is outside the corpus, this question requires a fact we do not have, this question turns on a contested matter on which the system cannot adjudicate, is the willingness of a careful tool. It is the absence of that willingness that signals a tool to be wary of.
The discipline is therefore to make refusal informative rather than terse. The system that declines to answer should explain why it has declined and should offer a path forward. The user should be able to see whether the limitation is one of corpus, of context, of confidence, or of category. Each implies a different next step. The refusal that is paired with a clear account of its cause is, in effect, a different kind of answer, and a useful one.
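A sketch of what an informative refusal might carry. The four limitation kinds mirror the distinction drawn above; the payload structure and the example content are assumptions for illustration.

```python
from dataclasses import dataclass
from enum import Enum

class Limitation(Enum):
    CORPUS = "the corpus does not contain the controlling authority"
    CONTEXT = "a fact the answer turns on was not provided"
    CONFIDENCE = "the available support does not warrant a confident answer"
    CATEGORY = "the question turns on a matter the system cannot adjudicate"

@dataclass
class Refusal:
    limitation: Limitation
    explanation: str
    next_steps: list[str]   # supply context, narrow the question, escalate for human review

refusal = Refusal(
    limitation=Limitation.CONTEXT,
    explanation="The analysis depends on the date the cause of action accrued, "
                "which the query does not state.",
    next_steps=["Provide the accrual date", "Narrow the question to the governing standard"],
)
```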
No safety architecture is complete without a path that returns to the human. The architecture described above is designed to operate, in most cases, without intervention. It is also designed to recognise the cases in which intervention is required, and to make the path to intervention clean.
The system supports several escalation mechanisms. Outputs that fail verification can be flagged for editorial review, conducted by trained legal staff before the output is delivered to the user. Queries that fall into categories the system has identified as high-stakes can be routed, optionally, through a human checkpoint regardless of the system's confidence. Users can mark an output for review after delivery, and the review can be reflected in subsequent treatment of similar queries.
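As a sketch, these mechanisms can be read as a small routing decision taken after verification. The trigger names and the category list are assumptions made for exposition, not a deployment's actual configuration.

```python
# Hypothetical categories configured as high-stakes for a given deployment.
HIGH_STAKES_CATEGORIES = {"limitation periods", "criminal exposure", "regulatory deadlines"}

def route(passed_verification: bool, query_category: str) -> str:
    if not passed_verification:
        return "editorial_review"    # reviewed by trained legal staff before delivery
    if query_category in HIGH_STAKES_CATEGORIES:
        return "human_checkpoint"    # optional routing regardless of the system's confidence
    return "deliver"                 # delivered; the user may still mark it for later review
```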
The audit log retains every output, every reasoning chain, and every verification result. The log is available to the user's administrator, where the deployment so configures, and to internal review for quality assurance. It is not made available to anyone else. The standing principle is that the lawyer's work product, including the work product produced through the system, belongs to the lawyer and to her firm. The system is the producer; the lawyer is the owner.
The escalation path is, in our experience, used less often than one might fear. The base rate of outputs that pass through the four checkpoints cleanly is high. The escalation path is the safety net that makes the high pass rate trustworthy. Without the path, the pass rate would be a marketing claim. With the path, the pass rate is a system property.
The four checkpoints described in this essay are technical mechanisms, but the discipline that maintains them is organisational. A checkpoint that is not protected against the steady pressure of feature development, cost reduction, and demonstration polish will erode. The protection is sustained by who is in the room when releases are reviewed, by what counts as a regression and what counts as an acceptable trade-off, and by the willingness to accept a slower release in exchange for the preservation of the architecture.
We have, accordingly, made several organisational commitments that underlie the technical ones. Every release passes through a safety review before deployment, conducted by a panel that includes legal editorial staff in addition to engineering. Failures of any of the four checkpoints, identified post-deployment, are treated as incidents and are reviewed in writing. The commercial cost of safety, in terms of compute and latency and editorial labour, is taken as a fixed cost of the business and is not subject to reduction except through architectural improvement that preserves the function of each checkpoint.
These commitments are not loud. They do not appear in feature lists. They are, however, the reason the technical architecture continues to mean what it means. A safety framework that is not protected at the organisational level becomes, over time, a relic of the design that produced it. The framework described in this essay is, in our intention, a living one. It is meant to be tightened as the field evolves, and never to be loosened in pursuit of speed.
The space between the prompt and the answer, with which this essay began, will continue to be opaque to most users. They cannot see the four checkpoints. They cannot see the adversarial corpus. They cannot see the audit log. They have, instead, the answer they receive and the experience of using the system over time. From those, slowly, they form a judgement about whether the system is safe to rely on.
The judgement, in our intention, will be earned by the architecture. A system that has, behind its outputs, four checkpoints and a continuing adversarial discipline and a willingness to refuse and an organisational commitment to keep all of these intact will produce outputs whose reliability is felt, over time, by the people who use them. The users will not, mostly, articulate why. The reliability will appear to them as a kind of consistency, a kind of reasonableness, a kind of absence of the embarrassments that other systems have produced.
The right ambition for legal AI safety is to be, in the long term, unremarkable. The user who never has to ask whether the system can be trusted is the user for whom the architecture has succeeded. The path to that unremarkable trust runs through the four checkpoints described in this essay. There is no shorter path. There is no path that bypasses the discipline. The discipline is what the trust eventually consists of.
— Research