
Your AI is going to make things up — not occasionally, routinely. CRAFT’s quality assurance framework catches most hallucinations, scores confidence, and flags drift before it reaches your deliverables.
Last week we showed how CRAFT’s device-switch handoff makes your project location-independent. This week: that mobility is only worth something if you can trust what your AI tells you when you arrive.
The Hallucination Problem
AI confabulation isn’t a fringe edge case. It’s a baseline behavior of every large language model — including the one in your Cowork session right now. The model will confidently report file contents it didn’t actually read, recipe versions that don’t exist, decisions you never made. The output looks fluent. It cites specifics. It feels right.
That’s the part that makes it dangerous. A confident wrong answer is harder to catch than an obvious one, because nothing about the response signals doubt.
This compounds in long sessions. Around the 70 percent token mark, model reliability degrades — not because the model “gets tired,” but because context becomes too dense to track precisely. Late-session outputs drift further from grounded fact and closer to plausible reconstruction. You don’t notice the moment it happens. You notice three sessions later, when something that was supposed to be true turns out not to be.
CRAFT ranks evidence sources in a fixed hierarchy: CRAFT files first, then observed behavior, then design intent, then reasoning, then external sources. Every factual claim gets graded against this hierarchy — the lower the source rung, the lower the confidence score.
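To make the ranking concrete, here is a minimal sketch of that hierarchy as an ordered type. This is illustrative only; the name `EvidenceSource` and the numeric ranks are assumptions, not part of the CRAFT spec.

```python
from enum import IntEnum

class EvidenceSource(IntEnum):
    """Hypothetical model of CRAFT's evidence hierarchy.
    Lower value = higher rung = more trustworthy."""
    CRAFT_FILE = 1         # read directly from a project file
    OBSERVED_BEHAVIOR = 2  # seen in tool output this session
    DESIGN_INTENT = 3      # inferred from documentation
    REASONING = 4          # derived with no source to point at
    EXTERNAL = 5           # outside the project entirely

# Ordered values let downstream logic compare rungs directly:
assert EvidenceSource.CRAFT_FILE < EvidenceSource.REASONING
```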
The Verification Layer
CRAFT treats verification as a built-in concern, not an afterthought. It runs at four levels — individual claims, recipe execution, file integrity, and cross-session consistency — and every level has a recipe behind it. The core idea is simple: every claim the AI makes needs to be traceable to evidence. If it can’t be traced, it’s flagged.
The heart of the framework is a four-gate verification pattern that any recipe can call before reporting a result. It’s packaged as a reusable sub-routine: RCP-CWK-024 — Verification Gate Sub-Routine.
Gate 1: Traceability. Can the claim be traced to a specific file or location? “The recipe says X” needs to point at the recipe. “The handoff says Y” needs to point at the handoff. Claims with no file pointer are flagged as ungrounded.
Gate 2: Freshness. Was the data actually read in this session, or is the AI reconstructing it from memory? Reconstructed data is flagged. The fix is to re-read the file. The cost is a few tokens. The benefit is catching the failure mode that does the most damage.
Gate 3: Consistency. Does the claim conflict with any documented lesson learned? CRAFT’s LL file is the project’s accumulated truth. If a current claim contradicts a known LL entry, the gate stops and asks for resolution before continuing.
Gate 4: Assumption surfacing. Is this verified or assumed? If the AI is reasoning from assumption rather than evidence, the assumption gets surfaced as a flagged item — visible to you, not hidden inside the answer.
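Taken together, the four gates amount to a short checklist a recipe can run before reporting a result. Here is a minimal sketch of that checklist in Python, assuming a simple claim record; the field names and flag strings are hypothetical, and the real RCP-CWK-024 interface may look nothing like this.

```python
from dataclasses import dataclass

@dataclass
class Claim:
    """A hypothetical factual claim awaiting verification."""
    text: str
    source_file: str | None = None   # Gate 1: file pointer, if any
    read_this_session: bool = False  # Gate 2: fresh read vs. memory
    contradicts_ll: bool = False     # Gate 3: lessons-learned conflict
    is_assumption: bool = False      # Gate 4: reasoning vs. evidence

def verification_gate(claim: Claim) -> list[str]:
    """Sketch of a four-gate check in the spirit of RCP-CWK-024.
    Returns flags instead of silently passing the claim through."""
    flags = []
    if claim.source_file is None:
        flags.append("UNGROUNDED: no file pointer")
    if not claim.read_this_session:
        flags.append("RECONSTRUCTED: re-read the source file")
    if claim.contradicts_ll:
        flags.append("LL CONFLICT: resolve before continuing")
    if claim.is_assumption:
        flags.append("ASSUMPTION: surfaced, not hidden")
    return flags
```

The design point is the return value: the gate emits flags for you to see rather than silently passing or blocking the claim.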
Confidence Scoring with Decay
Beyond the gates, every factual claim gets a 0–100 confidence score. Evidence read directly from files scores 80–100. Behavioral observation of tool output scores 50–79. Design intent inferred from documentation scores 30–49. Pure reasoning with no source scores 0–29.
A 10-point penalty kicks in once the session passes 70 percent token usage — the late-session decay correction. The number isn’t there to make you suspicious of the AI; it’s there to give you a calibrated read on whether to trust this specific claim. A 92 needs no second look. A 38 does.
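The bands and the decay penalty are mechanical enough to sketch. The function below reuses the `EvidenceSource` enum from the earlier sketch; the within-band base values and the clamping at zero are assumptions, not published scoring rules.

```python
def confidence_score(source: EvidenceSource, token_usage: float) -> int:
    """Illustrative scorer for CRAFT's published bands, plus the
    10-point late-session decay correction. token_usage is the
    fraction of the context window consumed (0.0-1.0)."""
    # Top of each published band as a base; how a claim lands
    # within its band is an assumption here.
    bands = {
        EvidenceSource.CRAFT_FILE: 100,        # 80-100: read from file
        EvidenceSource.OBSERVED_BEHAVIOR: 79,  # 50-79: observed tool output
        EvidenceSource.DESIGN_INTENT: 49,      # 30-49: inferred design intent
        EvidenceSource.REASONING: 29,          # 0-29: pure reasoning
    }
    base = bands.get(source, 0)  # external sources: below all bands (assumption)
    if token_usage > 0.70:
        base -= 10  # late-session decay correction past the 70% mark
    return max(base, 0)
```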
A confidence score isn’t a verdict on the AI’s honesty. It’s a measurement of how grounded a specific claim is. High-score claims earn the trust they signal. Low-score claims earn a closer look.
The Evidence
The QA framework caught a real licensing error in this campaign’s own content. Earlier in development, nine Week 1 content files referenced CRAFT as “open source” — incorrect, because CRAFT actually uses a dual license (Business Source License 1.1 for the spec, proprietary for content). A factual claim validation pass flagged the language as inconsistent with the documented license. All nine files were corrected before publication.
That’s a small example, but it’s the right shape: an error a human reviewer might miss on the tenth read got caught on the first automated pass, before any of it shipped.
The framework’s project cleanup recipe — a longitudinal cross-file audit run every five to ten sessions — has caught version drift in tracking files at a rate of 40 percent. Four in ten “current state” tables turn out to be stale by the time they’re audited. That drift would otherwise get inherited into every subsequent session as silent ground truth.
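That kind of drift is cheap to detect mechanically. Below is one possible shape for such a cross-file check; the version-field format, file layout, and report strings are all hypothetical rather than the recipe's actual logic.

```python
import re
from pathlib import Path

VERSION_RE = re.compile(r"version:\s*([\d.]+)")  # hypothetical field format

def audit_version_drift(authoritative: Path,
                        tracking_files: list[Path]) -> list[str]:
    """Flag tracking files whose recorded version has drifted
    from the authoritative source. Sketch only."""
    truth = VERSION_RE.search(authoritative.read_text())
    if truth is None:
        return [f"{authoritative}: no version field found"]
    stale = []
    for f in tracking_files:
        m = VERSION_RE.search(f.read_text())
        if m is None or m.group(1) != truth.group(1):
            found = m.group(1) if m else "none"
            stale.append(f"{f}: records {found}, expected {truth.group(1)}")
    return stale
```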
Why This Matters
The intelligence in your project is only valuable if it’s reliable. Memory that drifts, claims that confabulate, scores that don’t track confidence — those don’t add up to a working environment, no matter how impressive the surface output looks.
CRAFT’s quality assurance framework gives you something most AI workflows don’t have: a measurable answer to the question “should I trust this?” The gates are objective. The scoring is calibrated. The audits are reproducible. You’re not guessing about AI reliability — you have a record.
This is Week 5 of our 8-week capability spotlight. Each week we go deep on one part of CRAFT — how it works, what problems it solves, and how to use it. Follow along as we build the case for structured AI working environments.
