Governance Doesn't End at Approval: Why Enforcement Is the Missing Half of Your AI Framework

Jun 17

Over the past several months, I’ve been deep in the enforcement side of AI governance, making sure a use case that’s already live is still performing the way it was approved to perform. Most frameworks are built to get a use case into production. Far fewer are built to keep watching it once it’s there. This post is about that gap, why the common workarounds don’t hold up, and what continuous enforcement actually looks like in practice.

The Problem With “Single Point in Time” Governance

I keep running into the same phrase from governance teams: “single point in time” governance. A use case gets built, it goes through the GRC team, legal weighs in, maybe privacy and security sign off too, and then it ships. The review was real, and it was thorough. But it was also a snapshot, taken on one day, of a system that keeps changing after that day.

The question I get asked constantly is some version of: “Okay, so when do we check back in?” And the honest answer is that most organizations don’t have a great one. I’ve seen re-review cadences set by team bandwidth (we’ll get to it when we have time), by the original risk tier of the use case (high-risk gets reviewed quarterly, low-risk gets reviewed annually), or simply by a flat renewal clock that starts ticking the day a use case hits production. Each of these is a reasonable-sounding policy. Each of them is also, at best, a half-measure.

What “Enforcement” Actually Means

When I say enforcement, I mean something fairly specific. It’s the mechanism that catches a use case the moment it goes sideways, whether that means producing biased or unfair outputs, hallucinating, or being pushed by a prompt injection into doing something it was never scoped to do, and gets it resolved immediately, not at the next scheduled check-in.

The only approach I’ve seen actually hold up under that definition is continuous monitoring: a process running against every model, in every live use case, measuring against the specific bad-output signals and metrics your organization has defined. When something crosses that line, the right people get notified, and the use case can be pulled offline while the issue gets resolved. Not next quarter. Not at the next scheduled touchpoint. Now.
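As a rough illustration of that loop, here is a minimal Python sketch. Everything in it is an assumption made for the example: the signal names, the threshold values, and the `UseCase` and `enforce` helpers are hypothetical, and a real program would page a person and call a deployment system rather than flip a flag in memory.

```python
from dataclasses import dataclass, field


@dataclass
class UseCase:
    name: str
    online: bool = True
    alerts: list = field(default_factory=list)


# Hypothetical limits: the maximum tolerated rate for each bad-output signal.
THRESHOLDS = {
    "hallucination_rate": 0.05,
    "bias_flag_rate": 0.02,
    "prompt_injection_rate": 0.0,  # any detected injection counts as a breach
}


def enforce(use_case, metrics, thresholds=THRESHOLDS):
    """Compare current metrics to the limits; alert and go offline on a breach."""
    breaches = {
        signal: value
        for signal, value in metrics.items()
        if signal in thresholds and value > thresholds[signal]
    }
    if breaches:
        # Stand-in for notifying the people who own this use case.
        use_case.alerts.append(f"{use_case.name}: breached {sorted(breaches)}")
        # Pull the use case offline now, not at the next scheduled review.
        use_case.online = False
    return breaches


chatbot = UseCase("support-chatbot")
enforce(chatbot, {"hallucination_rate": 0.01, "bias_flag_rate": 0.0})
print(chatbot.online)  # True: nothing crossed a limit
enforce(chatbot, {"hallucination_rate": 0.09, "bias_flag_rate": 0.0})
print(chatbot.online)  # False: taken offline on the breach
```

The point of the sketch is the shape, not the numbers: the check runs on every evaluation cycle, and a breach triggers both the notification and the takedown in the same step.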

This is the distinction worth sitting with: a review cadence tells you when someone is scheduled to look. Continuous monitoring tells you the moment something is actually wrong. A governance framework can have an excellent review process and still have a dangerous enforcement gap, because those are two different jobs.

A Worked Example: 88 Days of Exposure

Here’s a scenario that makes the stakes concrete. Say your re-review policy is every three months. You stand up a new use case running on Fable, your governance team reviews it, approves it, and sets the next check-in for three months out.

Two days later, a regulatory order halts the use of Fable globally, and your traffic gets routed to a replacement. Your approved use case is now running on an unapproved model, one your team has never reviewed, and nothing in your process is positioned to notice.

If your only checkpoint is the three-month re-review (roughly 90 days, 2 of which have already passed), that use case keeps running on an unapproved, unvetted model for 88 more days before anyone catches it. That’s 88 days of potential bad outputs. 88 days where hallucinations, prompt injections, bias, or fairness issues could be quietly accumulating. 88 days where a problem could go completely unreported, and where data could be exposed in a way that opens your organization up to real liability.

Exposure window under a quarterly re-review policy: 88 days

Detection window with continuous monitoring in place: same day

With active monitoring in place, that same scenario gets caught the same day. The use case comes offline, the model gets swapped, the approval process runs again, and it goes back into production, all without 88 days of unmonitored risk sitting in between.
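The arithmetic behind those two numbers is simple enough to sketch. The snippet below assumes "three months" means a flat 90 days and uses a made-up approval date; `exposure_days` is a hypothetical helper, not part of any tool.

```python
from datetime import date, timedelta

REVIEW_INTERVAL = timedelta(days=90)  # assumption: "three months" is about 90 days


def exposure_days(approved_on, incident_on, detected_on=None):
    """Days an incident stays live before it is caught.

    Without monitoring, detection defaults to the next scheduled re-review.
    """
    if detected_on is None:
        detected_on = approved_on + REVIEW_INTERVAL
    return (detected_on - incident_on).days


approved = date(2025, 1, 1)              # hypothetical approval date
incident = approved + timedelta(days=2)  # the model changes two days later

print(exposure_days(approved, incident))            # 88: quarterly re-review only
print(exposure_days(approved, incident, incident))  # 0: same-day detection
```

Swap in your own review interval and the exposure window scales with it: an annual cadence on a low-risk use case leaves up to a year of the same gap.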

What You’re Actually Monitoring For

“Continuous monitoring” is one of those phrases that sounds complete on its own but actually leaves out the part that matters most: monitoring for what, specifically? If you can’t name the signal, you don’t have a monitoring program; you have a dashboard nobody looks at. In practice, most enforcement programs are watching for some combination of three distinct kinds of drift, and they don’t show up the same way or get caught by the same checks.

Data drift is when the inputs flowing into your model stop resembling what it was trained or validated on. A resume-screening use case built around a certain applicant pool starts seeing a different mix of applicants after a market shift, and now the model is making decisions on data it was never really tested against.
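One common way to put a number on this is the Population Stability Index (PSI), which compares how a given input was distributed at validation time with how it is distributed now. The sketch below is a minimal pure-Python version for a categorical input. The applicant experience bands and counts are invented for illustration, and the cutoff you alert on is something to tune for your own data (values above roughly 0.25 are often treated as a significant shift, but that is a rule of thumb, not a standard).

```python
import math
from collections import Counter


def psi(baseline, current, eps=1e-6):
    """Population Stability Index between two samples of a categorical input."""
    categories = set(baseline) | set(current)
    base_counts, cur_counts = Counter(baseline), Counter(current)
    score = 0.0
    for cat in categories:
        # eps keeps the log finite when a category is missing from one sample.
        p = max(base_counts[cat] / len(baseline), eps)
        q = max(cur_counts[cat] / len(current), eps)
        score += (q - p) * math.log(q / p)
    return score


# Hypothetical applicant mix at validation time vs. after a market shift.
baseline = ["junior"] * 50 + ["mid"] * 30 + ["senior"] * 20
current = ["junior"] * 20 + ["mid"] * 30 + ["senior"] * 50

print(round(psi(baseline, baseline), 2))  # 0.0: identical distributions
print(round(psi(baseline, current), 2))   # 0.55: a large shift in the input mix
```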

Concept drift is sneakier, because the inputs can look identical while the meaning underneath them changes. A fraud-detection model that was tuned against last year’s fraud patterns is still scoring transactions the same way, but the actual fraud techniques have evolved, so the relationship between “what the model sees” and “what it should flag” has quietly shifted out from under it.
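Because the inputs look normal, the practical signal here is outcome accuracy: how often the model’s call matches what later turned out to be true. The sketch below tracks that over a rolling window. It assumes you get confirmed outcomes back (for example, chargebacks or investigator decisions), and the `OutcomeTracker` class, window size, and accuracy floor are all hypothetical.

```python
from collections import deque


class OutcomeTracker:
    """Rolling accuracy over the most recent confirmed outcomes."""

    def __init__(self, window=100, min_accuracy=0.9):
        self.results = deque(maxlen=window)
        self.min_accuracy = min_accuracy  # hypothetical floor set at approval

    def record(self, predicted, actual):
        self.results.append(predicted == actual)

    def accuracy(self):
        return sum(self.results) / len(self.results) if self.results else None

    def drifting(self):
        # Only judge once the window is full, to avoid noisy early alerts.
        full = len(self.results) == self.results.maxlen
        return full and self.accuracy() < self.min_accuracy


tracker = OutcomeTracker(window=10, min_accuracy=0.9)
for _ in range(10):
    tracker.record("fraud", "fraud")  # model and confirmed outcome agree
print(tracker.drifting())             # False

for _ in range(3):
    tracker.record("legit", "fraud")  # model misses newer fraud patterns
print(tracker.accuracy(), tracker.drifting())  # 0.7 True
```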

Model drift is the one most people picture instinctively: the underlying model itself changes. A vendor pushes an update, retrains on new data, or deprecates the version you approved and routes you to a new one without much fanfare. This is exactly what happened in the Fable example above. Nothing about your use case changed, but the thing powering it did.
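A version check is the simplest guard against this. The sketch below assumes two things that will vary by vendor: that you keep a record of the exact model and version governance approved, and that each response (or a metadata endpoint) reports which model actually served it. The registry contents, the use case name, and the version string are made up for the example.

```python
# Hypothetical registry of what governance approved for each use case.
APPROVED_MODELS = {
    "resume-screening": {"model": "fable", "version": "2025-01-15"},
}


def check_model(use_case, reported):
    """Return a list of mismatches between the approved and reported model."""
    approved = APPROVED_MODELS.get(use_case)
    if approved is None:
        return [f"{use_case}: no approved model on record"]
    return [
        f"{use_case}: {key} is {reported.get(key)!r}, approved {value!r}"
        for key, value in approved.items()
        if reported.get(key) != value
    ]


# Same model and version as approved: nothing to report.
print(check_model("resume-screening", {"model": "fable", "version": "2025-01-15"}))

# The vendor silently routed traffic elsewhere: one mismatch to alert on.
print(check_model("resume-screening", {"model": "other", "version": "2025-01-15"}))
```

Run on every request or on a short timer, a check like this turns a silent swap into a same-day alert.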

Each of these needs a different kind of check. Data drift gets caught by watching input distributions over time. Concept drift gets caught by tracking outcome accuracy and feedback loops, not just inputs. Model drift gets caught by pinning and verifying model versions, and by watching vendor changelogs and API behavior as closely as you watch your own outputs. A monitoring program that only covers one of these will catch one category of failure and miss the other two entirely, and most teams find out which one they were missing only after something has already gone wrong.

Monitoring Without an Empowered Human Is Just a Dashboard

I wrote about the HOTL model (Human on the Loop) in an earlier post on Agentic AI, and it’s worth connecting directly back to enforcement here, because monitoring on its own doesn’t actually do anything. A detection signal that fires into a Slack channel nobody’s watching, or an alert that lands in someone’s inbox three layers removed from anyone with the authority to act on it, isn’t enforcement. It’s just data collection with extra steps.

For monitoring to function as enforcement, three things have to be true at the same time. Someone has to be watching, or be reliably paged, in real time. That person needs an actual kill switch, meaning the practical ability to take the use case offline without needing a change-management ticket and a four-day approval cycle to do it. And the authority to use that kill switch needs to be established before the incident, not negotiated in the moment while a use case is actively misbehaving in production. Organizations that skip this step usually end up with a beautifully instrumented monitoring system that detects the problem in real time and then takes the same 88 days to actually do something about it, because nobody was clearly authorized to pull the plug.
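The "established before the incident" part can be made concrete by writing the authority down as data rather than leaving it to a conversation. The sketch below is one hypothetical way to do that: the role names, the `KILL_SWITCH_AUTHORITY` mapping, and the `KillSwitch` class are all assumptions, and a real version would call your deployment tooling instead of setting a flag.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical: who may take each use case offline, agreed before any incident.
KILL_SWITCH_AUTHORITY = {
    "support-chatbot": {"oncall-ml", "grc-lead"},
}


@dataclass
class KillSwitch:
    use_case: str
    online: bool = True
    audit_log: list = field(default_factory=list)

    def trip(self, actor, reason):
        """Take the use case offline if the actor was pre-authorized."""
        allowed = actor in KILL_SWITCH_AUTHORITY.get(self.use_case, set())
        if allowed:
            self.online = False  # no ticket or approval cycle in the way
        # Every attempt is logged, whether or not it succeeded.
        outcome = "tripped" if allowed else "denied"
        self.audit_log.append((datetime.now(timezone.utc), actor, reason, outcome))
        return allowed


switch = KillSwitch("support-chatbot")
print(switch.trip("oncall-ml", "hallucination spike"))  # True
print(switch.online)                                    # False
```

The audit log matters as much as the switch: it lets you review after the fact who acted, when, and why, without slowing down the action itself.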

Where the Standards Land on This

This isn’t just a best practice I’ve picked up from being in the weeds on this. It’s increasingly baked into the major frameworks organizations are being asked to align with. NIST’s AI Risk Management Framework has a dedicated function, Measure, that’s explicitly about analyzing and monitoring AI risk on an ongoing basis, not just at launch. ISO/IEC 42001, the international standard for AI management systems, treats monitoring the same way ISO 27001 treats information security controls: as a recurring obligation baked into the management system itself, not a one-time gate you pass through on your way to production.

The EU AI Act adds a harder deadline to all of this. High-risk AI systems, the kind used in employment decisions, credit scoring, critical infrastructure, and similar categories, face fully binding obligations starting August 2026. That’s not far off, and post-deployment monitoring is part of what those obligations require, not an optional add-on layered on top of an initial conformity assessment.

Here’s the part worth being honest about, though: none of these frameworks were actually built with agentic AI in mind. NIST’s framework predates the current wave of autonomous, tool-using agents. ISO 42001 was published before multi-agent orchestration was a mainstream deployment pattern. The EU AI Act’s risk categories were drafted around models that produce an output, not models that take actions, chain tool calls, and operate with persistent memory across sessions. That gap means the standards give you a strong floor, but they don’t give you a complete answer for agentic use cases. The org chart, swim lanes, and the HITL/HOTL/HOOTL classifications I’ve written about elsewhere exist precisely to fill in what the standards haven’t caught up to yet.

Why “Good Enough” Cadences Aren’t Good Enough

None of the common cadence-based approaches are unreasonable on their face. Reviewing based on team bandwidth is realistic given resourcing. Reviewing based on initial risk tier makes intuitive sense. A flat renewal clock is at least predictable and easy to communicate. The problem isn’t that these approaches are lazy; it’s that they’re built around a fixed calendar, and risk doesn’t wait for the calendar. A model provider can be ordered offline, a data source can get compromised, a regulation can shift, a prompt injection vulnerability can get discovered and exploited, all on a random Tuesday that has nothing to do with your review schedule.

Enforcement has to be an active, ongoing engagement, not a once-a-quarter dart throw where the plan is “three months sounds about right, we’ll reassess then.” Without some form of continuous monitoring sitting underneath your governance framework, you’re carrying open-ended liability for however long the gap between reviews happens to be.

Agentic AI, Continuous Monitoring, Enforcement

Nicholas Baker