The Problem: When Your LLM Confidently Lies
The project was an evidence generation pipeline for FedRAMP. Input: hundreds of NIST 800-53 controls. Output: structured evidence an auditor could walk from paper straight to the running system.
First pass was embarrassing in hindsight. Hand Claude the control, ask it to classify and extract evidence, move on.
It worked until it didn't. A compliance analyst flagged outputs where fields that were supposed to hold abstract resource classes had "EC2" and "S3" sitting in them. The model was stamping concrete AWS names into abstract slots with full confidence. Worse, it was throwing high confidence scores on controls where the metadata flat out contradicted the description.
No amount of prompt tuning was going to fix that. It was an architecture problem.
What Naive Prompting Actually Produces
Real example. Control titled "Automated Inventory." Summary: "Use authoritative sources to automatically maintain real time inventories." The naive prompt gave me this:
{
  "requirement_type": "technical_configuration",
  "candidate_subjects": ["AWS Config", "EC2 instances", "S3 buckets"],
  "technical_split": 0.9,
  "process_split": 0.1,
  "enrichment_confidence": "high"
}

Three failure modes in one output. Classified off the prose. Leaked AWS names. Confidence inflated past what the metadata supports.
The Insight: LLMs Should Interpret, Not Decide
The turning point was realizing the model was doing too many jobs. Classifying, extracting entities, scoring confidence, deciding routing. All of it in one prompt. Every responsibility is another surface where the model can make something up.
Strategy Pattern is what unlocked the redesign. One job per stage. The LLM only runs where language understanding is actually needed. Everywhere else is deterministic.
Routing, validation, mapping, escape hatches, all of it lives in code. The model never touches those decisions.
Constraint 1: Structured Input, Not Raw Prose
First fix was also the cheapest. The model was grabbing nouns out of free text. "Automated Inventory" has the word "automated" in it, so it classified as fully technical. Meanwhile the metadata said automation_status was "No" and validation_method was "Manual." The title was lying.
Fix: never hand the model raw prose. Package everything into structured JSON where the metadata fields sit right next to the description. That way the model can't ignore contradicting signals just because the description sounds confident.
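A sketch of that packaging. The builder function and the control_id key are mine; automation_status and validation_method are the metadata fields from the examples in this post.

import json

def build_classifier_input(control: dict) -> str:
    """Bundle prose and metadata into one JSON payload so the model
    sees contradicting signals side by side."""
    payload = {
        "control_id": control["id"],
        "title": control["title"],
        "summary": control["summary"],
        "metadata": {
            "automation_status": control["automation_status"],  # e.g. "No"
            "validation_method": control["validation_method"],  # e.g. "Manual"
        },
    }
    return json.dumps(payload, indent=2)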
That one change killed an entire class of misclassification.
Constraint 2: The 22-Service Blocklist
Even with structured input, AWS names kept leaking into candidate_subjects. "Compute instances" mutated into "EC2 instances." "Object storage" became "S3 buckets." The model was doing a downstream job at this stage, resolving abstract concepts to concrete services, which was not its job at all.
A 22-service blocklist, whole-word regex: ec2, s3, iam, vpc, lambda, cloudwatch, cloudtrail, config, guardduty, inspector, macie, securityhub, cloudformation, systems_manager, kms, secrets_manager, waf, shield, route53, elb, rds, dynamodb.
If any of these show up in the output, the response gets rejected. Not flagged. Not logged. Rejected. The pipeline runs classification again or falls back to a deterministic default.
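A minimal sketch of the enforcement. The function name is mine; the exception is what the pipeline catches to trigger the re-run or the deterministic fallback.

import re

AWS_BLOCKLIST = [
    "ec2", "s3", "iam", "vpc", "lambda", "cloudwatch", "cloudtrail",
    "config", "guardduty", "inspector", "macie", "securityhub",
    "cloudformation", "systems_manager", "kms", "secrets_manager",
    "waf", "shield", "route53", "elb", "rds", "dynamodb",
]

# Whole-word match: "asset discovery configurations" must not trip on "config".
BLOCKLIST_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(s) for s in AWS_BLOCKLIST) + r")\b",
    re.IGNORECASE,
)

def reject_aws_leaks(output: dict) -> None:
    """Hard-reject any output whose candidate_subjects name a concrete service."""
    for subject in output.get("candidate_subjects", []):
        if BLOCKLIST_RE.search(subject):
            raise ValueError(f"AWS service name leaked into output: {subject!r}")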
This is the point of Strategy Pattern in LLM work. You don't politely ask the model to follow rules. You enforce rules around it, in code the model can't see.
Constraint 3: Mathematical Invariants
Every classification produces a technical_split and a process_split. They represent how much of the control is machine testable versus how much needs a human to attest to it. The invariant is trivial. They sum to 1.0. That's it.
The validator does round(technical_split + process_split, 10) != 1.0. If the math is off, the output is invalid. Done.
Sounds trivial, catches a real pattern. The model likes to hedge by inflating both numbers. technical_split: 0.8 and process_split: 0.7 is a common one. It's saying "I'm not sure" in the only way it knows. Problem is, the result is mathematically impossible. The constraint forces the model to commit to a real distribution instead of weaseling.
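The whole check, as a sketch (the function name is mine):

def validate_splits(output: dict) -> None:
    """Enforce the invariant: the two splits sum to exactly 1.0."""
    total = round(output["technical_split"] + output["process_split"], 10)
    if total != 1.0:
        raise ValueError(
            f"splits must sum to 1.0, got {output['technical_split']} "
            f"+ {output['process_split']} = {total}"
        )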
Constraint 4: Deterministic Escape Hatches
Some controls never need the LLM. If automation_status is "No" and validation_method is "Manual," the answer is already written. candidate_subjects empty, technical_split 0.0, process_split 1.0, downstream action "do_not_component_map."
Those escape hatches run before any API call. The model doesn't get to weigh in on cases where the metadata already told us the answer.
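As a sketch, the hatch is just a guard that runs before the prompt is ever built. The function name is mine; the field values are the ones listed above.

def try_escape_hatch(control: dict) -> dict | None:
    """Return a deterministic result when metadata already decides the
    answer; None means fall through to the LLM classifier."""
    if (control["automation_status"] == "No"
            and control["validation_method"] == "Manual"):
        return {
            "candidate_subjects": [],
            "technical_split": 0.0,
            "process_split": 1.0,
            "downstream_action": "do_not_component_map",
        }
    return None  # no hatch fired; the LLM gets the control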
Another Strategy Pattern principle. Route to the simplest thing that can handle the case. Only reach for the LLM when deterministic logic can't close it out.
Constraint 5: Confidence Gates the Queue
Last constraint handles confidence inflation. When enrichment_confidence is "low," ambiguity_notes has to contain something. Validated in code, not prompted for. The model can't hand wave past uncertainty with a blank field.
Low confidence also flips requires_human_review to true, which halts automation for that control. It goes into a human review queue instead of moving forward.
Net effect: the model is allowed to say it's unsure. It just has to show its work, and being unsure actually costs something in the pipeline.
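A minimal version of that gate. The helper name and exception are mine; the fields are from the output schema.

def gate_on_confidence(output: dict) -> dict:
    """Unsure is allowed, but it must be explained and it must cost something."""
    if output["enrichment_confidence"] == "low":
        if not output.get("ambiguity_notes", "").strip():
            raise ValueError("low confidence requires non-empty ambiguity_notes")
        output["requires_human_review"] = True  # pulls the control out of automation
    return output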
The Constrained Output
With the five constraints in place, the same "Automated Inventory" control produces this:
{
  "requirement_type": "hybrid",
  "candidate_subjects": [
    "resource inventory records",
    "asset discovery configurations",
    "inventory source authorities"
  ],
  "technical_split": 0.4,
  "process_split": 0.6,
  "enrichment_confidence": "medium",
  "ambiguity_notes": "Title/summary describe automated capability but source metadata (automation_status: No, validation_method: Manual) indicates manual validation. Source metadata takes precedence."
}

Metadata beats prose. Abstract subjects where AWS names used to leak. Confidence honestly set to medium, with a note explaining why. Math works: 0.4 + 0.6 = 1.0.
Four Phases of Validation
Constraints block most hallucinations. They don't block all of them. The validation pipeline catches what leaks through. Four phases. Every phase has to pass before output is allowed into production.
Phase A is schema validation. Structural errors, split arithmetic, escape hatch violations. All blocking. If the JSON isn't valid, it stops here.
Phase B is the gold set. Ten test cases written by analysts, with thresholds: 8 of 10 type matches, 10 of 10 clean on AWS names, 2 of 2 escape hatches fired correctly. This is the regression gate whenever the prompt changes.
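The gate itself is tiny. A sketch with those thresholds hard-coded; the results structure is assumed.

GOLD_THRESHOLDS = {
    "type_matches": 8,          # of 10 gold cases
    "aws_name_clean": 10,       # of 10 gold cases
    "escape_hatches_fired": 2,  # of 2 gold cases
}

def gold_set_passes(results: dict) -> bool:
    """Block the prompt change unless every threshold holds."""
    return all(results[key] >= minimum for key, minimum in GOLD_THRESHOLDS.items())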
Phase C is divergence analysis. Every gap between expected and actual gets categorized into one of eight buckets. type_error, scope_narrowing, aws_name_leak, confidence_inflation, split_deviation, escape_hatch_miss, subject_hallucination, justification_gap. Each bucket has a predefined fix.
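The buckets are a closed set, so they live in code too. A sketch:

from enum import Enum

class Divergence(Enum):
    TYPE_ERROR = "type_error"
    SCOPE_NARROWING = "scope_narrowing"
    AWS_NAME_LEAK = "aws_name_leak"
    CONFIDENCE_INFLATION = "confidence_inflation"
    SPLIT_DEVIATION = "split_deviation"
    ESCAPE_HATCH_MISS = "escape_hatch_miss"
    SUBJECT_HALLUCINATION = "subject_hallucination"
    JUSTIFICATION_GAP = "justification_gap"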
Phase D is the full run. End to end counts across every control, traceability checks between layers, human review queue sizing. This phase catches systemic drift that single test cases never see.
The Strategy Pattern in LLM Pipelines
Classic Strategy Pattern swaps algorithms at runtime behind a common interface. In LLM pipelines the same idea lives one level up. You're swapping between deterministic strategies and LLM strategies depending on the input.
Escape hatches are the deterministic strategy. The LLM classifier is the model strategy. The constraint and validation layers are the context that picks which strategy runs and checks the result.
That inversion is what makes the pipeline actively reject hallucinations instead of just tolerating them. The LLM doesn't decide what it does. The pipeline decides what the LLM does, scopes the job tightly, and validates the output against invariants the model can't override.
Net result. The LLM handles semantic interpretation, which is the thing it's actually good at. Everything else is enforced by code, and code doesn't hallucinate.
What I'd Do Differently
The blocklist works, but it's brittle. A 23rd AWS service means a code change. Next iteration I'd swap it for a semantic similarity check against an embedding of "abstract resource class." Still deterministic at inference, way less maintenance.
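A sketch of what that might look like, assuming a sentence-embedding model. The model choice and threshold here are illustrative, not tuned.

from sentence_transformers import SentenceTransformer
import numpy as np

_model = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model would do
_anchor = _model.encode("abstract resource class", normalize_embeddings=True)

def is_abstract_enough(subject: str, threshold: float = 0.5) -> bool:
    """Still deterministic at inference: cosine similarity against a
    fixed anchor, with no per-service list to maintain."""
    vec = _model.encode(subject, normalize_embeddings=True)
    return float(np.dot(vec, _anchor)) >= threshold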
Ten gold cases is a starting point. Small. I'd invest in automated gold set expansion. Every time a reviewer corrects a classification, that correction becomes a new test case. The validation suite should grow with the system, not stay frozen at launch.
The escape hatch logic is if/else chains right now. As control types grow, I'd refactor into a proper Strategy registry: a dict mapping metadata patterns to deterministic classifiers, with the LLM as the default fallback.
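Roughly this shape. The pattern keys and function names are illustrative.

def classify_fully_manual(control: dict) -> dict:
    return {
        "candidate_subjects": [],
        "technical_split": 0.0,
        "process_split": 1.0,
        "downstream_action": "do_not_component_map",
    }

def classify_with_llm(control: dict) -> dict:
    ...  # the constrained prompt + validation path from above

# Metadata patterns map to deterministic strategies; the LLM is the default.
ESCAPE_HATCHES = {
    ("No", "Manual"): classify_fully_manual,
}

def classify(control: dict) -> dict:
    key = (control["automation_status"], control["validation_method"])
    return ESCAPE_HATCHES.get(key, classify_with_llm)(control)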
Key Takeaway
If you're building LLM pipelines for anything that matters, stop chasing better prompts. Start writing smaller ones. Scope the model down to the minimum semantic task. Enforce every other constraint in code. Validate outputs against invariants the model can't negotiate with.
The LLM is a powerful tool. It's a tool. Not the architect. Not the validator. Definitely not the source of truth. The pipeline is all three.