The pattern
I recently built two AI-assisted drafting systems for very different domains: one produces weekly customer health reports for enterprise accounts, the other produces performance feedback and self-reflections for a company review cycle. Different data sources, different outputs, different audiences. But the same architecture appeared in both, not because one inherited from the other, but because both encode the same underlying approach to research and analysis.
This post describes that architecture. It's not about the specific tools or outputs. It's about encoding a reasoning pattern into a system that AI agents can follow. The architecture works across domains because it isn't domain-specific engineering. It's how I think, made structural.
Why this matters
The default approach to AI-assisted writing is conversational: "help me write X." The AI searches, synthesizes, and drafts in a single pass. This works for casual tasks. It fails for anything where accuracy matters, because editorial judgment gets applied during collection. The AI compresses and filters evidence before you've reviewed what was found. The result: confident-sounding output with silent gaps.
The pattern described here is a structural alternative. It separates collection from interpretation, inserts verification gates between phases, and treats AI output as untrusted input at every step. The tradeoff is more steps, more files, and more interaction points, but every claim in the final output is traceable to evidence and every judgment call is explicit.
The reasoning model
The architecture didn't emerge from studying AI workflows. It encodes how I naturally approach any research-intensive task:
- See everything before deciding anything. I don't form conclusions while gathering data. I collect broadly, note what I have and what I don't, and only start interpreting once I'm confident the picture is complete.
- Know where each piece of information came from. Before I trust a claim, I ask: who said this, how do they know, and can I verify it somewhere else? One source is a data point. Two sources are corroboration.
- Understand why something matters before writing about it. Listing facts is easy. Connecting them to outcomes requires a separate step. If I can't explain why a data point matters, I'm not ready to include it.
- Apply the lens after seeing the data, not before. Context about a subject (their history, their sensitivities, the narrative frame) is valuable for interpretation but distorting for collection. I look at what the data says first, then calibrate through the lens of context.
- Check your work against the source material before presenting it. Memory drifts, framing shifts meaning, and confidence isn't accuracy. Go back to the raw evidence.
- Know what you're least sure about. Every conclusion has a confidence gradient. Making that gradient explicit, for yourself and for your reviewer, is more useful than presenting everything with equal certainty.
Each layer of the architecture below is one of these principles made into a structural constraint that the AI can't skip.
The architecture
Six layers, from load-bearing structure to domain-specific adapters.
Layer 1: Two-Phase Evidence Pipeline
See everything before deciding anything.
The foundation. Every workflow built on this pattern has two distinct phases with a hard gate between them.
Phase 1: Collection. Raw, exhaustive evidence gathering with no editorial lens. The AI searches all available sources, captures everything with full context, and tags each item by origin and reliability. No summarization, no interpretation, no filtering. The output is a structured evidence file organized by source.
Phase 2: Analysis & Drafting. Editorial judgment activated. The AI extracts themes, traces impact, flags contradictions, applies editorial config, and drafts. But only after the user has reviewed the raw evidence and approved the transition.
The evidence gate sits between the phases. The AI presents:
- What it found, organized by theme, with confidence scores
- What it didn't find (data gaps, classified by type)
- Which claims rest on a single source
- A recommendation: proceed or gather more data
The user must approve before interpretation begins. This prevents evidence gaps from becoming blind spots in the draft.
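To make the gate concrete, here's a minimal sketch in Python. Everything in it is an illustrative assumption rather than either system's actual schema: the `EvidenceItem` and `GateReport` shapes, the field names, and the two-source threshold.

```python
from dataclasses import dataclass, field

@dataclass
class EvidenceItem:
    claim: str
    theme: str
    sources: list[str]   # provenance tags attached during Phase 1
    confidence: float    # 0.0-1.0, assigned at collection time

@dataclass
class GateReport:
    themes: dict[str, list[EvidenceItem]] = field(default_factory=dict)
    gaps: list[str] = field(default_factory=list)
    single_source: list[EvidenceItem] = field(default_factory=list)
    recommendation: str = "proceed"

def build_gate_report(evidence: list[EvidenceItem], gaps: list[str]) -> GateReport:
    report = GateReport(gaps=list(gaps))
    for item in evidence:
        report.themes.setdefault(item.theme, []).append(item)
        if len(item.sources) < 2:
            report.single_source.append(item)
    # Recommend more collection when gaps or single-source claims remain.
    if report.gaps or report.single_source:
        report.recommendation = "gather more data"
    return report

def evidence_gate(report: GateReport) -> bool:
    """Phase 2 runs only after the user explicitly approves the report."""
    print("Themes found:", list(report.themes))
    print("Data gaps:", report.gaps or "none")
    print("Single-source claims:", [i.claim for i in report.single_source] or "none")
    print("Recommendation:", report.recommendation)
    return input("Proceed to analysis? [y/N] ").strip().lower() == "y"
```

The point isn't the specific code; it's that the gate is a structural checkpoint with an explicit approval, not a prompt instruction the AI can drift past.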
Why it works: In both systems, the most consequential errors caught at the gate were claims where the AI had high confidence but only one source, or data gaps about to be silently carried into the draft. The gate forces the AI to show its work before applying judgment.
Layer 2: Source-Aware Evidence Quality
Know where each piece of information came from.
Every piece of collected data gets a source tag. The specific tags differ by domain, but the principle is identical: the provenance of a claim determines how it can be used.
The key rule: claims supported by only one source are flagged and cannot support high-stakes conclusions without user verification. In the customer health system, rating changes require two independent sources. In the feedback system, specific attributions ("X said Y at event Z") require cross-source validation.
This is a structural constraint, not a suggestion. The AI can't route around it because tagging happens during collection (Phase 1), and the gate checks for single-source claims before analysis begins (Phase 2).
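A sketch of what that constraint looks like as data validation, assuming hypothetical source tags (`crm`, `slack`, and so on; the real tags differ per domain):

```python
from dataclasses import dataclass

# Illustrative provenance tags; the real tags differ per domain.
SOURCE_TAGS = {"crm", "support_tickets", "slack", "email", "manual_note"}

@dataclass
class Claim:
    text: str
    sources: set[str]
    high_stakes: bool = False  # e.g. a rating change or a strong attribution

def validate_claim(claim: Claim) -> list[str]:
    """Return flags the evidence gate must surface before analysis begins."""
    flags = []
    unknown = claim.sources - SOURCE_TAGS
    if unknown:
        flags.append(f"untagged sources: {sorted(unknown)}")
    if len(claim.sources) < 2:
        flags.append("single-source: cannot support a high-stakes conclusion"
                     if claim.high_stakes else
                     "single-source: needs user verification")
    return flags
```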
Data gaps are classified, not ignored:
- Dead-end: No evidence the signal continues. May mean the situation resolved.
- Restricted: Evidence points to a source the AI can't access. Flag as high-priority for the user. These often contain the strongest signal.
Layer 3: Impact Extraction
Understand why something matters before writing about it.
Raw data says what happened. Impact extraction asks: why does it matter?
For each significant evidence entry, the AI produces a structured triple:
- What: The fact
- Why it matters: The business context, the risk, the decision it informs
- Impact/outcome: What changed, was prevented, or was decided
If the AI can't articulate "why it matters," it flags the entry as incomplete rather than filling in a vague summary. An impact completeness audit catches these gaps before they reach the draft.
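A sketch of the triple and the audit; the `ImpactEntry` shape and the example entries are illustrative, but the rule is the one above: a blank "why it matters" is flagged, never papered over.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ImpactEntry:
    what: str                      # the fact, from raw evidence
    why_it_matters: Optional[str]  # business context, risk, decision informed
    impact: Optional[str]          # what changed, was prevented, or was decided

def audit_impact_completeness(entries: list[ImpactEntry]) -> list[ImpactEntry]:
    """Return entries that must be flagged as incomplete, not summarized around."""
    return [e for e in entries if not e.why_it_matters or not e.impact]

entries = [
    ImpactEntry("Ticket volume doubled in March",
                "Signals frustration on a renewal-critical account",
                "Escalated to CSM; exec sponsor looped in"),
    ImpactEntry("Shipped the reporting migration", None, None),  # incomplete: flag it
]
for gap in audit_impact_completeness(entries):
    print(f"INCOMPLETE: {gap.what!r} has no articulated impact")
```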
This step was discovered through a specific failure. Early versions of both systems produced output that listed facts without connecting them to outcomes. The customer health system described ticket counts without explaining what they meant for the relationship. The feedback system described tasks without connecting them to business impact. Impact extraction is the structural fix in both cases.
Layer 4: Config-as-Interpretation-Layer
Apply the lens after seeing the data, not before.
Each subject has an editorial config that defines:
- How to weight different data sources for this subject
- Sensitivity rules (what to scrub, what terminology to use)
- Anchoring criteria (what would justify a rating change or a strong assertion)
- Voice and framing guidance
The critical design decision: config is read after theme extraction, not before. The AI extracts themes from raw evidence with no editorial lens, then applies the config to calibrate interpretation. If you read the config first, its narrative frame biases what themes the AI sees.
This was independently validated in both systems. Reading a customer config that says "cautious tone, trust recovery" before analyzing the data made the AI emphasize recovery signals and downweight contradictory evidence. Reading a feedback config that defined a "direct" working relationship made the AI overweight thin evidence to match the expected depth.
The sequence that works: raw evidence → unbiased themes → config as calibration lens.
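In code, the sequence is just argument ordering. A sketch, where `extract_themes` and `calibrate` are trivial stand-ins for what would be LLM calls:

```python
def extract_themes(raw_evidence: list[str]) -> list[str]:
    """Stand-in for an LLM call that clusters raw evidence into themes.
    Deliberately takes no config: themes come from the data alone."""
    return sorted(set(raw_evidence))

def calibrate(themes: list[str], config: dict) -> list[str]:
    """Stand-in for applying the editorial config as a lens after the fact:
    weighting sources, scrubbing terms, checking anchoring criteria."""
    scrub = set(config.get("scrub_terms", []))
    return [t for t in themes if t not in scrub]

def run_analysis(raw_evidence: list[str], config: dict) -> list[str]:
    themes = extract_themes(raw_evidence)   # step 1: no lens, no bias
    return calibrate(themes, config)        # step 2: config as calibration only
```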
Layer 5: Pre-Draft Verification
Check your work against the source material, and know what you're least sure about.
Before any draft is presented, a sequential checklist runs:
- Analysis enforcement: Does the draft follow the structural blueprint from the analysis?
- Fact verification: Spot-check names, attributions, dates, and causal claims against raw evidence
- Domain-specific quality check: Sensitivity (customer reports) or credibility (feedback)
- Anchoring check: Does the conclusion meet the explicit criteria in the config?
- Freshness check: Is any source past its freshness threshold?
- Self-check disclosure: Identify the 2-3 claims with the lowest confidence and present them alongside the draft
The precedence rule: accuracy outranks structure. If the analysis recommended a framing that fact verification flags as unsupported, fact verification wins.
The self-check disclosure is the most operationally useful step. When processing multiple subjects per cycle, the user can't review every line equally. Surfacing the highest-risk sections focuses review time where it matters most.
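A sketch of the checklist runner with the precedence rule encoded; `fact_verification` is a stub standing in for the spot-check against raw evidence, and the dictionary shapes are assumptions for illustration:

```python
def fact_verification(draft: dict) -> set[str]:
    """Stub: return the framings whose claims don't survive a spot-check of
    names, attributions, dates, and causal claims against raw evidence."""
    return set(draft.get("unsupported_framings", []))

def run_checklist(draft: dict, blueprint: list[str]) -> dict:
    unsupported = fact_verification(draft)
    # Precedence rule: accuracy outranks structure. A framing the analysis
    # blueprint recommends is dropped when fact verification flags it.
    enforced = [f for f in blueprint if f not in unsupported]
    # Self-check disclosure: surface the 2-3 lowest-confidence claims so
    # limited review time goes to the highest-risk sections.
    review_first = sorted(draft.get("claims", []),
                          key=lambda c: c["confidence"])[:3]
    return {"blueprint": enforced,
            "dropped_for_accuracy": sorted(unsupported & set(blueprint)),
            "review_first": review_first}
```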
Layer 6: Session Recovery & Cross-Cycle Continuity
Each time you revisit a subject, you should know more than last time.
Workflows processing multiple subjects need state tracking:
- Status markers in evidence files (in-progress / complete)
- Per-source completion tracking
- Checkpoint after each subject is finalized
- Overwrite guard: never silently replace files from a previous session
- Resume, reconcile, or start-fresh options when a new session detects existing work
Cross-cycle continuity is the compounding effect. Interrupted sessions are the norm when processing multiple subjects, and without persistent state, cross-cycle context is lost entirely. As long as sessions run from a consistent workspace, per-subject history files, decision logs, and editorial configs accumulate institutional knowledge across cycles. Each run starts with richer context than the last: prior ratings, previous report content, settled editorial decisions. The architecture gets more valuable over time, not less.
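A sketch of the overwrite guard and resume detection, assuming evidence files live as JSON under a hypothetical `workspace/` directory with a simple status marker:

```python
from pathlib import Path
import json

WORKSPACE = Path("workspace")  # hypothetical; any consistent directory works

def save_evidence(subject: str, data: dict, status: str = "in-progress") -> None:
    WORKSPACE.mkdir(exist_ok=True)
    path = WORKSPACE / f"{subject}.evidence.json"
    # Overwrite guard: never silently replace completed work from a prior session.
    if path.exists() and json.loads(path.read_text()).get("status") == "complete":
        raise FileExistsError(
            f"{path} is complete from a previous session; "
            "choose resume, reconcile, or start-fresh explicitly")
    path.write_text(json.dumps({"status": status, "data": data}, indent=2))

def detect_existing_work(subject: str) -> str:
    """Return 'fresh', 'resume', or 'reconcile' for a new session."""
    path = WORKSPACE / f"{subject}.evidence.json"
    if not path.exists():
        return "fresh"
    status = json.loads(path.read_text()).get("status")
    return "resume" if status == "in-progress" else "reconcile"
```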
What's invariant vs. what's interchangeable
The six layers above are the invariant structure. Everything else is a domain-specific adapter:
| Component | How it adapts per domain |
|---|---|
| Data sources | Different APIs, databases, manual inputs |
| Source tags | Different provenance categories |
| Per-subject config | Different sensitivity rules, anchoring criteria |
| Quality filter | Different thresholds for what requires verification |
| Output format | Different templates, different audiences |
| Batch strategy | Bulk collection vs. sequential, depending on data source characteristics |
A new domain (incident postmortems, competitive analysis, quarterly reviews) would plug in different sources, configs, templates, and quality filters. The pipeline, gate, impact extraction, config timing, and verification checklist stay the same.
Design constraints
Sequential judgment bottleneck. Even with bulk data collection, the evidence gate and review steps are serial per subject. You can front-load mechanical work, but the judgment work is inherently one-at-a-time. Batch mode makes each review faster (all data already collected), but the serial structure remains.
Overhead scales with subjects, not complexity. The full pipeline for all 11 subjects in the customer health system takes about 15-20 minutes total in batch mode. The pipeline also works well for quick ad-hoc checks when the user has limited context. The automated collection and evidence gate still add value even in a lightweight pass. The overhead concern is real only for trivially simple tasks where the answer is already known.
The architecture surfaces thin data; it doesn't create data. When a subject has almost no information available anywhere, the pipeline correctly reports there isn't enough to work with. But this is rarely the actual situation. Sources like Slack often provide links and references that expand search visibility into other platforms, and cross-cycle continuity means each cycle accumulates more context than the last.
Applying this to your own workflows
You don't need a specific toolchain; the principles work with any AI assistant, and a minimal skeleton follows the list:
- Separate collection from interpretation. Run all research first, save the raw output, then start a new phase for analysis. Don't let the AI summarize while it's searching.
- Gate the transition. Before interpreting, review what was found and what wasn't. Make the approval explicit.
- Tag provenance. Note where each piece of data came from. Single-source claims get flagged, not trusted.
- Extract impact before synthesizing. For each data point: What / Why it matters / Impact. If "why it matters" is blank, that's a gap, not something to skip.
- Read editorial context after extracting themes. Let the data speak first; apply the lens second.
- Verify before presenting. Run claims against raw evidence. Surface the 2-3 things you're least confident about. Let the reviewer focus on the highest-risk sections.
- Persist context across cycles. Each run should start with more knowledge than the last. Save decisions, history, and editorial config to disk so the next cycle builds on institutional knowledge rather than starting from scratch.
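Pulled together, the whole loop fits in a short skeleton. This is a sketch, not a prescription: `assistant` is any callable that sends a prompt and returns text, and every prompt string here is illustrative.

```python
from typing import Callable

def run_cycle(subject: str, config: str, assistant: Callable[[str], str]) -> None:
    """One cycle of the pattern with any AI assistant."""
    # 1. Collection only: explicitly forbid summarizing while searching.
    raw = assistant(f"Collect all available evidence about {subject}. "
                    "Tag each item's source. Do not summarize, filter, or interpret.")
    # 2. Gate: a human reviews the raw output before interpretation begins.
    print(raw)
    if input("Evidence reviewed -- proceed to analysis? [y/N] ").lower() != "y":
        return  # gather more data instead
    # 3. Impact before synthesis: a blank 'why it matters' is a gap to flag.
    impacts = assistant("For each item give What / Why it matters / Impact, "
                        f"and flag any item you can't complete:\n{raw}")
    # 4. Themes from raw evidence first; editorial context applied second.
    themes = assistant(f"Extract themes from this evidence only:\n{raw}")
    draft = assistant("Calibrate these themes with this editorial context and "
                      f"draft:\nCONFIG:\n{config}\nTHEMES:\n{themes}\n"
                      f"IMPACTS:\n{impacts}")
    # 5. Verify against source material and disclose the weakest claims.
    print(assistant("Check the draft against the raw evidence and list the "
                    f"2-3 claims you are least confident in:\nDRAFT:\n{draft}\n"
                    f"EVIDENCE:\n{raw}"))
```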
For recurring workflows where accuracy matters and the output influences decisions, the tradeoff is worth it.