
Your AI is going to make things up — not occasionally, routinely. CRAFT’s quality assurance framework catches most hallucinations, scores confidence, and flags drift before it reaches your deliverables.
Last week we showed how CRAFT’s device-switch handoff makes your project location-independent. This week: that mobility is only worth something if you can trust what your AI tells you when you arrive.
The Hallucination Problem
AI confabulation isn’t a fringe edge case. It’s a baseline behavior of every large language model — including the one in your Cowork session right now. The model will confidently report file contents it didn’t actually read, recipe versions that don’t exist, decisions you never made. The output looks fluent. It cites specifics. It feels right.
That’s the part that makes it dangerous. A confident wrong answer is harder to catch than an obvious one, because nothing about the response signals doubt.
This compounds in long sessions. Around the 70 percent token mark, model reliability degrades — not because the model “gets tired,” but because context becomes too dense to track precisely. Late-session outputs drift further from grounded fact and closer to plausible reconstruction. You don’t notice the moment it happens. You notice three sessions later, when something that was supposed to be true turns out not to be.
CRAFT ranks evidence sources in a fixed hierarchy: CRAFT files first, then observed behavior, then design intent, then reasoning, then external sources. Every factual claim gets graded against this hierarchy — the lower the source rung, the lower the confidence score.
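To make the ranking concrete, here is a minimal sketch of that hierarchy as an ordered type. This is illustrative only; the name `EvidenceSource` and the numeric ranks are assumptions, not part of the CRAFT spec.

```python
from enum import IntEnum

class EvidenceSource(IntEnum):
    """Hypothetical model of CRAFT's evidence hierarchy.
    Lower value = higher rung = more trustworthy."""
    CRAFT_FILE = 1         # read directly from a project file
    OBSERVED_BEHAVIOR = 2  # seen in tool output this session
    DESIGN_INTENT = 3      # inferred from documentation
    REASONING = 4          # derived with no source to point at
    EXTERNAL = 5           # outside the project entirely

# Ordered values let downstream logic compare rungs directly:
assert EvidenceSource.CRAFT_FILE < EvidenceSource.REASONING
```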
The Verification Layer
CRAFT treats verification as a built-in concern, not an afterthought. It runs at four levels — individual claims, recipe execution, file integrity, and cross-session consistency — and every level has a recipe behind it. The core idea is simple: every claim the AI makes needs to be traceable to evidence. If it can’t be traced, it’s flagged.
The heart of the framework is a four-gate verification pattern that any recipe can call before reporting a result. It’s packaged as a reusable sub-routine: RCP-CWK-024 — Verification Gate Sub-Routine.
Gate 1: Traceability. Can the claim be traced to a specific file or location? “The recipe says X” needs to point at the recipe. “The handoff says Y” needs to point at the handoff. Claims with no file pointer are flagged as ungrounded.
Gate 2: Freshness. Was the data actually read in this session, or is the AI reconstructing it from memory? Reconstructed data is flagged. The fix is to re-read the file. The cost is a few tokens. The benefit is catching the failure mode that does the most damage.
Gate 3: Consistency. Does the claim conflict with any documented lesson learned? CRAFT’s LL file is the project’s accumulated truth. If a current claim contradicts a known LL entry, the gate stops and asks for resolution before continuing.
Gate 4: Assumption surfacing. Is this verified or assumed? If the AI is reasoning from assumption rather than evidence, the assumption gets surfaced as a flagged item — visible to you, not hidden inside the answer.
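Taken together, the four gates amount to a short checklist a recipe can run before reporting a result. Here is a minimal sketch of that checklist in Python, assuming a simple claim record; the field names and flag strings are hypothetical, and the real RCP-CWK-024 interface may look nothing like this.

```python
from dataclasses import dataclass

@dataclass
class Claim:
    """A hypothetical factual claim awaiting verification."""
    text: str
    source_file: str | None = None   # Gate 1: file pointer, if any
    read_this_session: bool = False  # Gate 2: fresh read vs. memory
    contradicts_ll: bool = False     # Gate 3: lessons-learned conflict
    is_assumption: bool = False      # Gate 4: reasoning vs. evidence

def verification_gate(claim: Claim) -> list[str]:
    """Sketch of a four-gate check in the spirit of RCP-CWK-024.
    Returns flags instead of silently passing the claim through."""
    flags = []
    if claim.source_file is None:
        flags.append("UNGROUNDED: no file pointer")
    if not claim.read_this_session:
        flags.append("RECONSTRUCTED: re-read the source file")
    if claim.contradicts_ll:
        flags.append("LL CONFLICT: resolve before continuing")
    if claim.is_assumption:
        flags.append("ASSUMPTION: surfaced, not hidden")
    return flags
```

The design point is the return value: the gate emits flags for you to see rather than silently passing or blocking the claim.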
Confidence Scoring with Decay
Beyond the gates, every factual claim gets a 0–100 confidence score. Evidence read directly from files scores 80–100. Behavioral observation of tool output scores 50–79. Design intent inferred from documentation scores 30–49. Pure reasoning with no source scores 0–29.
A 10-point penalty kicks in once the session passes 70 percent token usage — the late-session decay correction. The number isn’t there to make you suspicious of the AI; it’s there to give you a calibrated read on whether to trust this specific claim. A 92 needs no second look. A 38 does.
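The bands and the decay penalty are mechanical enough to sketch. The function below reuses the `EvidenceSource` enum from the earlier sketch; the within-band base values and the clamping at zero are assumptions, not published scoring rules.

```python
def confidence_score(source: EvidenceSource, token_usage: float) -> int:
    """Illustrative scorer for CRAFT's published bands, plus the
    10-point late-session decay correction. token_usage is the
    fraction of the context window consumed (0.0-1.0)."""
    # Top of each published band as a base; how a claim lands
    # within its band is an assumption here.
    bands = {
        EvidenceSource.CRAFT_FILE: 100,        # 80-100: read from file
        EvidenceSource.OBSERVED_BEHAVIOR: 79,  # 50-79: observed tool output
        EvidenceSource.DESIGN_INTENT: 49,      # 30-49: inferred design intent
        EvidenceSource.REASONING: 29,          # 0-29: pure reasoning
    }
    base = bands.get(source, 0)  # external sources: below all bands (assumption)
    if token_usage > 0.70:
        base -= 10  # late-session decay correction past the 70% mark
    return max(base, 0)
```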
A confidence score isn’t a verdict on the AI’s honesty. It’s a measurement of how grounded a specific claim is. High-score claims earn the trust they signal. Low-score claims earn a closer look.
The Evidence
The QA framework caught a real licensing error in this campaign’s own content. Earlier in development, nine Week 1 content files referenced CRAFT as “open source” — incorrect, because CRAFT actually uses a dual license (Business Source License 1.1 for the spec, proprietary for content). A factual claim validation pass flagged the language as inconsistent with the documented license. All nine files were corrected before publication.
That’s a small example, but it’s the right shape: an error a human reviewer might miss on the tenth read got caught on the first automated pass, before any of it shipped.
The framework’s project cleanup recipe — a longitudinal cross-file audit run every five to ten sessions — has caught version drift in tracking files at a rate of 40 percent. Four in ten “current state” tables turn out to be stale by the time they’re audited. That drift would otherwise get inherited into every subsequent session as silent ground truth.
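That kind of drift is cheap to detect mechanically. Below is one possible shape for such a cross-file check; the version-field format, file layout, and report strings are all hypothetical rather than the recipe's actual logic.

```python
import re
from pathlib import Path

VERSION_RE = re.compile(r"version:\s*([\d.]+)")  # hypothetical field format

def audit_version_drift(authoritative: Path,
                        tracking_files: list[Path]) -> list[str]:
    """Flag tracking files whose recorded version has drifted
    from the authoritative source. Sketch only."""
    truth = VERSION_RE.search(authoritative.read_text())
    if truth is None:
        return [f"{authoritative}: no version field found"]
    stale = []
    for f in tracking_files:
        m = VERSION_RE.search(f.read_text())
        if m is None or m.group(1) != truth.group(1):
            found = m.group(1) if m else "none"
            stale.append(f"{f}: records {found}, expected {truth.group(1)}")
    return stale
```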
Why This Matters
The intelligence in your project is only valuable if it’s reliable. Memory that drifts, claims that confabulate, scores that don’t track confidence — those don’t add up to a working environment, no matter how impressive the surface output looks.
CRAFT’s quality assurance framework gives you something most AI workflows don’t have: a measurable answer to the question “should I trust this?” The gates are objective. The scoring is calibrated. The audits are reproducible. You’re not guessing about AI reliability — you have a record.
This is Week 5 of our 8-week capability spotlight. Each week we go deep on one part of CRAFT — how it works, what problems it solves, and how to use it. Follow along as we build the case for structured AI working environments.
