Multi-Pipeline LLM Automation Platform
An AI platform that handles the grinding parts of FedRAMP documentation and evidence work. Control mapping, evidence narratives, AWS CLI evidence commands, appendix planning, inventory reports. The stuff that used to take weeks.
Problem
FedRAMP authorization needs hundreds of control mappings, evidence narratives, inventory artifacts, and audit-ready outputs. Doing that by hand is slow, inconsistent, and expensive.
Solution
Built a platform that runs multiple pipelines over LLMs, retrieval, prompt libraries, and structured validation. It connects source documents, vector search, control metadata, and reusable workflows into one repeatable system, and it generates grounded compliance output at real volume.
Impact
- Cut manual compliance authoring and evidence prep from days or weeks down to batch jobs you rerun on demand
- Generates output for hundreds of NIST 800-53 control parts automatically
- Produces reusable outputs that feed other compliance docs and audit workflows downstream
Architecture
1. Google Drive documents feed into vector and file-search workflows
2. The component mapping pipeline identifies which services implement each control
3. Downstream generators produce evidence narratives and AWS CLI command sets
4. Inventory and appendix workflows produce artifacts ready for submission
5. Shared prompt libraries, app bootstrap, and RAG services sit underneath the whole platform
Capabilities
- Control-to-service component mapping
- Evidence narrative generation
- Read-only AWS CLI evidence command generation
- Appendix planning and document support
- Checkpointing and resumable batch execution
- Prompt profile management
- Validation and fallback logic
Technical Deep Dive
Architecture internals and annotated code from the production system.
Architecture Overview
The Control-to-Service Mapper is a layered pipeline with three resolution stages for the many-to-many NIST-to-AWS relationship. Instead of hiding that complexity, I broke it down into a chain of one-to-few lookups, each with its own confidence score. Ambiguity gets tracked, not buried.
Key Architectural Decisions
Taxonomy Registry
A static lookup dict of roughly 15 abstract evidence classes. Things like "network-boundary-control" and "identity-principal." Each one comes pre-wired with its aws_resource_classes, supporting_resource_classes, observable_properties, and relationships. That's the indirection layer. NIST intent on one side, AWS specifics on the other, and nothing in between depends on both.
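A minimal sketch of what one registry entry might look like. The four field names come from the description above and the `subjects` key from the scoring excerpt later on this page. The specific resource types, property names, and relationship strings are illustrative assumptions, not the production registry.

```python
from typing import TypedDict


class TaxonomyEntry(TypedDict):
    subjects: list[str]                     # keywords matched against abstract subjects
    aws_resource_classes: list[str]         # primary CloudFormation resource types
    supporting_resource_classes: list[str]  # resources that give the primary ones context
    observable_properties: list[str]        # what evidence can be read off the resource
    relationships: list[str]                # how the resources connect to each other


# Illustrative entry only: every value below is an assumption for this sketch.
TAXONOMY: dict[str, TaxonomyEntry] = {
    "network-boundary-control": {
        "subjects": ["network boundary", "firewall rules"],
        "aws_resource_classes": ["AWS::EC2::SecurityGroup", "AWS::EC2::NetworkAcl"],
        "supporting_resource_classes": ["AWS::EC2::VPC"],
        "observable_properties": ["ingress_rules", "egress_rules"],
        "relationships": ["attached_to:AWS::EC2::VPC"],
    },
}


def resource_types_for(class_name: str) -> list[str]:
    """Resolve an abstract evidence class to its concrete AWS resource types."""
    entry = TAXONOMY[class_name]
    return entry["aws_resource_classes"] + entry["supporting_resource_classes"]
```

Callers on the NIST side only ever name the abstract class, and callers on the AWS side only ever see resource types, which is the indirection the registry exists to provide.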
Confidence-Gated Routing
A Strategy pattern dispatcher sorts records into three paths before mapping logic runs at all. Passthrough. Review queue. Full mapping. That routing decision is cheap and deterministic, and it keeps the expensive logic off records that don't need it.
Domain and Detection Hints
Two tie-breaking registries, DOMAIN_HINTS and DETECTION_CLASS_HINTS, bias ambiguous subject resolution toward the right taxonomy class based on the control family. Without them, the scorer guesses on close calls. With them, the control family tells the scorer which class usually wins.
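As a rough illustration, the two registries could be plain dicts that turn a control family and a detection class into sets of preferred taxonomy classes. Only the names `DOMAIN_HINTS` and `DETECTION_CLASS_HINTS` come from the system. The keys, the values, and the `preferred_classes` helper are hypothetical.

```python
# Hypothetical contents: only the two registry names are taken from the source.
DOMAIN_HINTS: dict[str, set[str]] = {
    "IAM": {"identity-principal", "access-policy"},
    "CNA": {"network-boundary-control"},
}
DETECTION_CLASS_HINTS: dict[str, set[str]] = {
    "configuration": {"configuration-enforcement-control"},
    "cryptography": {"cryptographic-key-store"},
}


def preferred_classes(domain: str, detection_class: str) -> tuple[set[str], set[str]]:
    """Return (domain_preferred, dc_preferred). Unknown keys yield empty sets,
    so a missing hint never blocks scoring; it just adds no bonus."""
    return (
        DOMAIN_HINTS.get(domain, set()),
        DETECTION_CLASS_HINTS.get(detection_class, set()),
    )
```

The scorer can then test membership in each set and add a fixed bonus, which keeps the hint data separate from the scoring arithmetic.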
Crosswalk Index Precomputation
The many-to-many KSI-to-domain relationship gets materialized once at startup into a fast lookup index. Refined entries win over legacy entries by construction. The lookup is O(1), and no code path later needs to rebuild that relationship on the fly.
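One way to get "refined wins by construction" is to build two dicts and merge them in a fixed order. This is a sketch under assumptions: the field names `ksi_id`, `domain`, and `source` are hypothetical, and it assumes a refined entry replaces all legacy domains for the same KSI rather than merging with them.

```python
from collections import defaultdict


def build_crosswalk_index(entries: list[dict]) -> dict[str, list[str]]:
    """Materialize KSI -> domains once, so later lookups are a single dict access."""
    legacy: dict[str, list[str]] = defaultdict(list)
    refined: dict[str, list[str]] = defaultdict(list)
    for entry in entries:
        bucket = refined if entry["source"] == "refined" else legacy
        domains = bucket[entry["ksi_id"]]
        if entry["domain"] not in domains:  # keep order, drop duplicates
            domains.append(entry["domain"])
    # Keys from the second mapping overwrite the first, so refined always wins.
    return {**legacy, **refined}


CROSSWALK = build_crosswalk_index([
    {"ksi_id": "KSI-A", "domain": "d1", "source": "legacy"},
    {"ksi_id": "KSI-A", "domain": "d2", "source": "refined"},
    {"ksi_id": "KSI-B", "domain": "d3", "source": "legacy"},
])
```

Because precedence is decided by merge order, no lookup site has to remember to check for a refined entry first.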
Code Showcase 1
Strategy Pattern Entry Point
map_record() is the router. 30 lines that cleanly split routing from mapping. Each layer2_action dispatches to its own builder. Unknown actions and Layer 1 inconsistencies go to the review queue. Nothing gets silently dropped.

    import logging
    from typing import Any

    log = logging.getLogger(__name__)

    # _build_passthrough, _build_review_record, and _build_mapping are defined
    # elsewhere in the module.


    def map_record(record: dict[str, Any]) -> tuple[dict[str, Any] | None, dict[str, Any] | None]:
        """
        Map one Layer 1 record to Layer 2 output.

        Returns (mapped_record, review_record). Exactly one is non-None.
        """
        req_id = record.get("requirement_id", "<unknown>")
        layer2_action = record.get("layer2_action", "")

        if layer2_action == "do_not_component_map":
            log.debug("Passthrough (process_attestation): %s", req_id)
            return _build_passthrough(record), None

        if layer2_action == "human_review_required":
            log.debug("Routing to review queue (Layer 1 decision): %s", req_id)
            return None, _build_review_record(record, reason="human_review_required_by_layer1")

        if layer2_action == "map_components":
            # Safety: Layer 1 inconsistency
            if record.get("requires_human_review"):
                log.warning(
                    "Layer 1 inconsistency for %s: layer2_action=map_components "
                    "but requires_human_review=true; routing to review queue", req_id
                )
                return None, _build_review_record(record, reason="layer1_inconsistency")
            log.debug("Mapping components: %s", req_id)
            return _build_mapping(record), None

        # Anything else is unknown: never drop it, always send it to review.
        log.warning("Unknown layer2_action %r for %s; routing to review queue", layer2_action, req_id)
        return None, _build_review_record(record, reason=f"unknown_layer2_action_{layer2_action!r}")

| Property | Detail |
|---|---|
| Pattern | Strategy. Each layer2_action dispatches to a dedicated builder (_build_passthrough, _build_review_record, _build_mapping). |
| Tuple Discriminator | Returns (mapped, None) or (None, review). Exactly one is non-None, enforced by contract. |
| Defensive Routing | Unknown actions and Layer 1 inconsistencies both route to the review queue. No record gets dropped silently. |
| Separation of Concerns | Zero mapping logic in the router. It decides which strategy runs, not how. |
| Count Invariant | The runner verifies mapped + review == total. The tuple contract guarantees no records go missing. |
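
The count invariant in the last row can be sketched as a small runner. This is an assumption about how the check might look, not the production runner: `run_layer2` is a hypothetical name, and the router is passed in as `map_fn` so the block stands alone.

```python
from typing import Any, Callable, Iterable, Optional

Record = dict[str, Any]
MapFn = Callable[[Record], tuple[Optional[Record], Optional[Record]]]


def run_layer2(records: Iterable[Record], map_fn: MapFn) -> tuple[list[Record], list[Record]]:
    """Route every record and fail loudly if any record goes missing."""
    mapped: list[Record] = []
    review: list[Record] = []
    total = 0
    for record in records:
        total += 1
        mapped_rec, review_rec = map_fn(record)
        # Tuple contract: exactly one side must be non-None.
        if (mapped_rec is None) == (review_rec is None):
            raise ValueError(f"tuple contract violated for {record.get('requirement_id')!r}")
        if mapped_rec is not None:
            mapped.append(mapped_rec)
        else:
            review.append(review_rec)
    # Count invariant: a second, independent check on top of the contract.
    if len(mapped) + len(review) != total:
        raise RuntimeError(f"count invariant failed: {len(mapped)} + {len(review)} != {total}")
    return mapped, review
```

Checking both the per-record contract and the final count is redundant on purpose: a later refactor that breaks one check is still caught by the other.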
Code Showcase 2
Many-to-Many Taxonomy Expansion
_match_subject() resolves ambiguous abstract subjects to concrete taxonomy classes. Jaccard-like keyword overlap, plus domain hints, plus detection-class hints, plus a monitoring demotion. "Encryption key configurations" could plausibly map to cryptographic-key-store, configuration-enforcement-control, or access-policy. This scoring stack picks one deterministically. No LLM at this layer.

    for class_name, entry in TAXONOMY.items():
        score = 0.0
        for kw in entry["subjects"]:
            kw_norm = _normalize(kw)
            kw_words = set(kw_norm.split())
            subj_words = set(norm.split())
            overlap = kw_words & subj_words
            if not overlap:
                continue
            union = kw_words | subj_words
            score = max(score, len(overlap) / len(union) * 3.0)
        if score == 0:
            class_words = set(class_name.replace("-", " ").split())
            subj_words = set(norm.split())
            overlap = class_words & subj_words
            if overlap:
                score = len(overlap) / len(class_words | subj_words) * 1.5
        if score == 0:
            continue
        # Apply bonuses
        if class_name in dc_preferred:
            score += 2.0
        if class_name in domain_preferred:
            score += 1.0
        if demote_monitoring and class_name == "monitoring-infrastructure":
            score = max(score - 1.5, 0.01)
        scores[class_name] = score

| Property | Detail |
|---|---|
| Scoring Stack | Keyword overlap (Jaccard x3.0), then class-name fallback (x1.5), then detection-class boost (+2.0), then domain boost (+1.0), then monitoring demotion (-1.5). |
| No AI at Layer 2 | Disambiguation is fully deterministic, so the result is reproducible and auditable. |
| Tie-Breaking | Domain and detection-class hints from the control family break ties toward the right taxonomy class. |
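
To try the scoring stack end to end, here is a self-contained sketch that wraps the same arithmetic in a function. The three-entry taxonomy, the `_normalize` implementation, the public name `match_subject`, and the alphabetical tie-break on exact score ties are all assumptions made for this example. Only the weights and the order of the steps come from the excerpt above.

```python
import re

# Tiny illustrative taxonomy; only the "subjects" key mirrors the excerpt.
TAXONOMY = {
    "cryptographic-key-store": {"subjects": ["encryption key", "key management"]},
    "configuration-enforcement-control": {"subjects": ["configuration baseline"]},
    "monitoring-infrastructure": {"subjects": ["log monitoring"]},
}


def _normalize(text: str) -> str:
    """Lowercase and collapse non-alphanumerics to single spaces (assumed behavior)."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def match_subject(subject, dc_preferred=frozenset(), domain_preferred=frozenset(),
                  demote_monitoring=False):
    """Return the best-scoring taxonomy class for a subject, or None if nothing overlaps."""
    subj_words = set(_normalize(subject).split())
    scores = {}
    for class_name, entry in TAXONOMY.items():
        score = 0.0
        # Step 1: Jaccard overlap against each subject keyword, weighted x3.0.
        for kw in entry["subjects"]:
            kw_words = set(_normalize(kw).split())
            overlap = kw_words & subj_words
            if overlap:
                score = max(score, len(overlap) / len(kw_words | subj_words) * 3.0)
        # Step 2: fall back to the class name itself, weighted x1.5.
        if score == 0:
            class_words = set(class_name.replace("-", " ").split())
            overlap = class_words & subj_words
            if overlap:
                score = len(overlap) / len(class_words | subj_words) * 1.5
        if score == 0:
            continue
        # Step 3: hint bonuses and the monitoring demotion.
        if class_name in dc_preferred:
            score += 2.0
        if class_name in domain_preferred:
            score += 1.0
        if demote_monitoring and class_name == "monitoring-infrastructure":
            score = max(score - 1.5, 0.01)
        scores[class_name] = score
    if not scores:
        return None
    # Highest score wins; exact ties fall back to class name (an assumption here).
    return min(scores, key=lambda name: (-scores[name], name))
```

With this toy taxonomy, "key configuration" scores 1.0 for both the key-store and configuration classes, so a detection-class hint for `cryptographic-key-store` is what decides the result.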
Data Lifecycle
End-to-end flow of a single compliance check through the pipeline. Every arrow is a single NDJSON file. Every stage enforces a schema gate and count invariant before writing its output.
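A stage boundary of this kind might look like the sketch below. It reads one NDJSON file, applies a transform, and refuses to write unless the schema gate and count invariant both pass. `run_stage`, `REQUIRED_FIELDS`, and the one-field schema are assumptions for illustration; the real gates are presumably richer.

```python
import json
from pathlib import Path
from typing import Any, Callable

Record = dict[str, Any]
REQUIRED_FIELDS = {"requirement_id"}  # assumed minimal schema for this sketch


def run_stage(in_path: Path, out_path: Path,
              transform: Callable[[list[Record]], list[Record]]) -> int:
    """Read NDJSON, transform, gate on schema and count, then write NDJSON."""
    with in_path.open(encoding="utf-8") as fh:
        records = [json.loads(line) for line in fh if line.strip()]
    outputs = transform(records)

    # Schema gate: every output record must carry the required fields.
    for out in outputs:
        missing = REQUIRED_FIELDS - out.keys()
        if missing:
            raise ValueError(f"schema gate failed, missing fields: {sorted(missing)}")

    # Count invariant: a stage may not add or lose records.
    if len(outputs) != len(records):
        raise RuntimeError(f"count invariant failed: {len(records)} in, {len(outputs)} out")

    # Only write once both gates have passed.
    with out_path.open("w", encoding="utf-8") as fh:
        for out in outputs:
            fh.write(json.dumps(out) + "\n")
    return len(outputs)
```

Running the gates before opening the output file means a failed stage leaves no partial file for the next stage to pick up.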
Ingest + Normalize + Enrich
Four FedRAMP 20x source files get loaded and merged into one canonical requirement record. An escape hatch classifier catches obvious process attestation records without an LLM call. Everything else goes to Claude (via Bedrock or direct API) for semantic enrichment. The LLM returns requirement_type, validation_intent, candidate_subjects, candidate_test_modes, and enrichment_confidence. That's the full set.
AWS Component Mapping
Each enriched requirement's abstract candidate_subjects resolve to concrete AWS CloudFormation resource types via the evidence taxonomy. The strategy router sorts records into passthrough, review queue, or full mapping. The scoring engine does the disambiguation: keyword overlap, domain hints, detection class hints.
Vanta Coverage Comparison
The Layer 2 AWS resource map gets compared against live Vanta exports across four dimensions. Component availability, test existence, inventory presence, and test precision. Out the other end, every requirement has a coverage_assessment.
Gap Confirmation + Test Design
Every coverage gap gets confirmed and classified into a specific gap type. Automation feasibility, priority signal, and a layer5_action that decides whether to generate a custom test. Controls that don't need a custom test never make it to the test generation stage.
Backlog Generation + Prioritization
Confirmed gap candidates turn into prioritized backlog items. Complexity tier, readiness, priority score, prerequisite links. This is the output an engineering team actually works from.
Orchestration + Manifest
The orchestrator sequences Layers 1 through 5, manages checkpoint state so a failed run can resume instead of restart, and writes the authoritative run manifest. This is the control plane. It doesn't transform data. It runs the things that do.
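Resume-instead-of-restart can be sketched with a checkpoint file that records which layers have finished. The function name, the JSON checkpoint format, and the zero-argument layer callables are assumptions for this example, not the orchestrator's actual interface.

```python
import json
from pathlib import Path
from typing import Callable


def run_layers(layers: list[tuple[str, Callable[[], None]]], checkpoint: Path) -> list[str]:
    """Run layers in order, skipping any the checkpoint already marks complete.

    Returns the names of the layers executed during this call.
    """
    done: list[str] = json.loads(checkpoint.read_text()) if checkpoint.exists() else []
    executed: list[str] = []
    for name, layer_fn in layers:
        if name in done:
            continue  # finished in an earlier run
        layer_fn()  # if this raises, the checkpoint still reflects the last good layer
        done.append(name)
        checkpoint.write_text(json.dumps(done))  # persist after every completed layer
        executed.append(name)
    return executed
```

Writing the checkpoint after each layer, rather than once at the end, is what lets a failure in Layer 4 resume at Layer 4.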
Impact Analysis
Time savings backed by real pipeline runs. 65 FedRAMP 20x KSIs, all 5 layers, actual output counts. No hand-waving.
Pipeline Scale
| Metric | Count | Source |
|---|---|---|
| FedRAMP 20x KSIs in scope | 65 | Layer 1 audit |
| Control-part evidence mappings generated | 148 | PROMPT_1 (116) + PROMPT_2 (32) |
| Directly covered by Vanta (no gap) | 36 | Layer 3 assessment |
| Confirmed gaps requiring remediation | 29 | Layer 4 classification |
| Backlog candidates auto-prioritized | 20 | Layer 5 output |
| Individual evidence extraction prompts | 116 | Custom test prompts |
Manual Baseline Per Control
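For a FedRAMP Moderate SSP, a compliance analyst working a single control by hand does roughly the work below. The rest of this analysis uses ~6 hours per control as the working baseline, which sits toward the low end of the itemized range.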
| Manual Task | Estimated |
|---|---|
| Read requirement, parse Rev5 crosswalk, understand scope | 0.5-1 hr |
| Map requirement to actual infrastructure (which AWS services?) | 1-2 hrs |
| Determine what evidence is needed and from which tools | 1-2 hrs |
| Collect evidence (screenshots, CLI output, configs) | 1-3 hrs |
| Write the implementation narrative for the SSP | 1-2 hrs |
| Cross-reference with related controls for consistency | 0.5-1 hr |
| Total per control | 5-11 hrs |
What the Pipeline Replaces
| Task | Manual | Pipeline | Time w/ Pipeline |
|---|---|---|---|
| Requirement analysis & classification | 0.5-1 hr | Layer 1 enrichment. Seconds | ~0 min |
| Infrastructure mapping | 1-2 hrs | Layer 2 AWS component mapping. Seconds | ~0 min |
| Evidence determination | 1-2 hrs | Layer 3 coverage + Layer 4 gap analysis. Seconds | ~0 min |
| Evidence collection guidance | 1-3 hrs | 116 pre-built evidence prompts with exact artifacts listed | ~15-30 min (human still collects) |
| Narrative writing | 1-2 hrs | Structured output with validation_intent field | ~15-30 min (human reviews) |
| Cross-referencing | 0.5-1 hr | Cross-layer traceability checks (PP.3) | ~5 min (automated) |
| Total per control | ~6 hrs | | ~35-65 min |
Time Savings
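Your Scope (65 KSIs)
About 390 analyst-hours by hand versus about 54 with the pipeline, an 86% reduction.
Full Moderate Baseline (325 Controls)
Roughly 1,680 analyst-hours saved per SSP cycle at the same per-control rates.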
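The headline figures can be reproduced with a few lines of arithmetic. This sketch assumes the ~6 hour manual baseline and takes the midpoint of the ~35-65 minute pipeline range (50 minutes). That midpoint is an assumption about how the totals were derived, though it does land on the 54-hour, 86%, and 1,680-hour figures quoted in the executive summary.

```python
MANUAL_HOURS_PER_CONTROL = 6.0                 # working manual baseline
PIPELINE_MINUTES_PER_CONTROL = (35 + 65) / 2   # assumed: midpoint of the ~35-65 min range


def savings(controls: int) -> tuple[float, float, float, float]:
    """Return (manual_hours, pipeline_hours, hours_saved, fraction_saved)."""
    manual = controls * MANUAL_HOURS_PER_CONTROL
    pipeline = controls * PIPELINE_MINUTES_PER_CONTROL / 60
    saved = manual - pipeline
    return manual, pipeline, saved, saved / manual


for scope in (65, 325):
    manual, pipeline, saved, fraction = savings(scope)
    print(f"{scope} controls: {manual:.0f} h manual, {pipeline:.0f} h pipeline, "
          f"{saved:.0f} h saved ({fraction:.0%})")
```

The percentage is the same at both scopes because both inputs are per-control rates; only the absolute hours scale.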
Savings by Control Type
The 86% figure moves around by control type. Here are the honest numbers.
| Control Type | Coverage | Savings |
|---|---|---|
| Directly covered (36 KSIs) | Full pipeline: classify, map, confirm coverage. Done | ~90%. Human just reviews |
| Confirmed gaps (29 KSIs) | Pipeline identifies gap + generates prioritized backlog with complexity tier | ~75%. Human still builds the fix |
| Process attestation (escape hatch) | Pipeline routes to do_not_component_map, skips technical mapping entirely | ~80%. Saved from wasting time on false mapping |
| Human review queue (4 KSIs) | Pipeline flags uncertainty, blocks propagation | ~50%. Analyst still does the hard thinking |
Secondary Wins
Keeping Hallucination Out of Evidence Claims
Manual SSPs are full of vague filler like "we use AWS services to ensure compliance." The pipeline forces abstract resource-class language, enforces technical_split + process_split = 1.0, requires ambiguity notes whenever confidence dips, and validates everything against a 10-sample gold set. Every claim an auditor reads traces back to a specific validation gate.
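Two of those gates are simple enough to sketch directly. The field names `technical_split`, `process_split`, and `enrichment_confidence` appear on this page. The `ambiguity_notes` field name and the 0.7 confidence floor are assumptions made for the example.

```python
import math
from typing import Any

CONFIDENCE_FLOOR = 0.7  # assumed threshold; the real cutoff is not stated


def validate_record(record: dict[str, Any]) -> list[str]:
    """Return a list of gate failures; an empty list means the record passes."""
    errors: list[str] = []

    # Gate 1: the technical/process split must account for the whole requirement.
    total = record.get("technical_split", 0.0) + record.get("process_split", 0.0)
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        errors.append(f"technical_split + process_split = {total}, expected 1.0")

    # Gate 2: low confidence is allowed only if the ambiguity is written down.
    low_confidence = record.get("enrichment_confidence", 0.0) < CONFIDENCE_FLOOR
    if low_confidence and not record.get("ambiguity_notes"):
        errors.append("ambiguity_notes required when confidence is below the floor")

    return errors
```

Using `math.isclose` instead of `==` avoids rejecting valid splits such as 0.1 + 0.9 over floating-point rounding.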
Deterministic Routing Kills Analyst Errors
Escape hatches stop analysts from burning hours mapping process-only controls like KSI-RPL-01 (Recovery Objectives) to AWS infrastructure. The pipeline catches those in milliseconds. automation_status: No plus validation_method: Manual resolves to candidate_subjects: [] and layer2_action: do_not_component_map. Nobody wastes a morning on it.
Signal Conflict Detection
KSI-PIY-01's title says "Automated Inventory." Its metadata says automation_status: No. A human skimming the SSP template classifies it as technical nine times out of ten. The pipeline's rule, source metadata wins over prose, catches the contradiction every time. This kind of error scales linearly with analyst fatigue and faster than linearly with control count.
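The escape hatch and the metadata-over-prose rule can be sketched together, because the classifier never reads the title when deciding. The function name, the `title` and `signal_conflict` fields, and the keyword check are hypothetical. The metadata values and the resulting `candidate_subjects` and `layer2_action` follow the description above.

```python
from typing import Any, Optional


def escape_hatch(source: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Classify obvious process attestations from metadata alone.

    Returns None when the record should go on to LLM enrichment instead.
    """
    is_process_only = (
        source.get("automation_status") == "No"
        and source.get("validation_method") == "Manual"
    )
    if not is_process_only:
        return None
    return {
        "requirement_id": source.get("requirement_id"),
        "candidate_subjects": [],
        "layer2_action": "do_not_component_map",
        # Hypothetical flag: the title reads as technical but the metadata disagrees.
        # It is recorded for reviewers and never changes the routing decision.
        "signal_conflict": "automated" in source.get("title", "").lower(),
    }
```

Keeping the title out of the decision and in a separate flag is what makes "source metadata wins over prose" a structural property rather than a convention.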
Faster Audit Cycles from Pre-Built Traceability
Cross-layer consistency checks verify every requirement_id appears exactly once across all layers. When the 3PAO asks "show me how you determined this control is covered," the answer is a traceable chain. Layer 1 classification, Layer 2 AWS mapping, Layer 3 Vanta comparison, Layer 4 gap confirmation. Pre-structured evidence like this is estimated to cut audit cycle time by 40 to 60%.
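An exactly-once check of this kind might look like the sketch below. It assumes "exactly once" means once per layer output and that every layer carries the full set of requirement IDs. Both are interpretations, and `check_exactly_once` is a hypothetical name.

```python
from collections import Counter
from typing import Any


def check_exactly_once(expected_ids: set[str],
                       layers: dict[str, list[dict[str, Any]]]) -> list[str]:
    """Return human-readable problems; an empty list means the chain is intact."""
    problems: list[str] = []
    for layer_name, records in layers.items():
        counts = Counter(r["requirement_id"] for r in records)
        for req_id in sorted(expected_ids | counts.keys()):
            seen = counts.get(req_id, 0)
            if req_id not in expected_ids:
                problems.append(f"{layer_name}: unexpected {req_id}")
            elif seen != 1:
                problems.append(f"{layer_name}: {req_id} appears {seen} times")
    return problems
```

Returning every problem instead of raising on the first one gives a reviewer the full picture of a broken run in a single pass.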
Consistent Prioritization
Layer 5 priority scoring is a formula, not a vote. layer4_signal + technical_weight + gap_type + coverage_depth + reuse, minus complexity and confidence_penalty. That means remediation order isn't driven by whoever shouted loudest. POA&M auto-generation maps severity to priority (P0: 30 days, P1: 90 days, P2: 180 days, P3: 365 days), which gives auditors the exact timeline structure they already expect to see.
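The formula and the POA&M timelines can be written down directly. This sketch assumes each term has already been reduced to a number, since the actual scales and weights are not given here. The day counts per priority tier are the ones listed above.

```python
from datetime import date, timedelta

# Remediation windows per priority tier, as listed in the text.
POAM_DAYS = {"P0": 30, "P1": 90, "P2": 180, "P3": 365}


def priority_score(layer4_signal: float, technical_weight: float, gap_type: float,
                   coverage_depth: float, reuse: float,
                   complexity: float, confidence_penalty: float) -> float:
    """Additive score: five positive terms minus two penalties (numeric inputs assumed)."""
    return (layer4_signal + technical_weight + gap_type + coverage_depth + reuse
            - complexity - confidence_penalty)


def poam_due_date(priority: str, opened: date) -> date:
    """Turn a priority tier into a POA&M due date."""
    return opened + timedelta(days=POAM_DAYS[priority])
```

For example, a P1 item opened on 2025-01-01 comes due on 2025-04-01, 90 days later.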
Executive Summary
For 65 FedRAMP 20x KSIs, the pipeline cuts classification, mapping, and evidence determination from about 390 analyst-hours down to about 54. That's an 86% reduction. Scaled to the full 325-control Moderate baseline, it saves roughly 1,680 hours per SSP cycle. Secondary benefits: evidence claims are enforced by programmatic validation gates instead of trust, deterministic routing removes the misclassification errors that fatigued analysts make, and pre-built audit traceability knocks an estimated 40 to 60% off 3PAO cycle time.