Multi-Pipeline LLM Automation Platform
An AI-powered platform that automates major parts of the FedRAMP documentation and evidence workflow, including control-to-service mapping, evidence narrative generation, AWS CLI evidence commands, appendix planning, and inventory reporting.
Problem
FedRAMP authorization requires hundreds of control mappings, evidence narratives, inventory artifacts, and audit-ready outputs. Producing these manually is slow, inconsistent, and expensive.
Solution
Built a multi-pipeline platform that uses LLMs, retrieval, prompt libraries, and structured validation to generate grounded compliance outputs at scale. The platform connects source documents, vector search, control metadata, and reusable workflows into a repeatable system.
Impact
- →Reduced manual compliance authoring and evidence-prep effort from days or weeks to repeatable batch workflows
- →Automated generation for hundreds of NIST 800-53 control parts
- →Created reusable outputs that feed multiple downstream compliance documents and audit workflows
Architecture
- 01Google Drive documents are ingested into vector/file-search workflows
- 02Component mapping pipeline identifies which services implement each control
- 03Downstream generators produce evidence narratives and AWS CLI command sets
- 04Supporting inventory and appendix workflows produce submission-ready artifacts
- 05Shared prompt libraries, app bootstrap, and RAG services support the full platform
Capabilities
- ·Control-to-service component mapping
- ·Evidence narrative generation
- ·Read-only AWS CLI evidence command generation
- ·Appendix planning and document support
- ·Progress checkpointing and resumable batch execution
- ·Prompt profile management
- ·Validation and fallback logic
Stack
Technical Deep Dive
Architecture internals and annotated code from the production system.
Architecture Overview
The Control-to-Service Mapper follows a layered pipeline architecture with three distinct resolution stages for the many-to-many NIST-to-AWS relationship. The many-to-many complexity is managed by decomposing it into a chain of one-to-few lookups, each with its own confidence score, so ambiguity is tracked rather than hidden.
Key Architectural Decisions
Taxonomy Registry
A static lookup dict of ~15 abstract evidence classes (e.g., "network-boundary-control", "identity-principal"), each pre-wired with aws_resource_classes, supporting_resource_classes, observable_properties, and relationships. This is the indirection layer that decouples NIST intent from AWS specifics.
Confidence-Gated Routing
A Strategy-pattern dispatcher that gates records into three paths before any mapping logic runs — passthrough, review queue, or full mapping.
Domain / Detection Hints
Tie-breaking registries (DOMAIN_HINTS, DETECTION_CLASS_HINTS) that bias ambiguous subject resolution toward the correct taxonomy class based on the control family.
Crosswalk Index Pre-Computation
The many-to-many KSI-to-domain relationship is materialized once at startup into a fast lookup index with refined vs. legacy discrimination.
Code Showcase 1
Strategy-Pattern Entry Point
The map_record() function is a Strategy Pattern router — 30 lines that cleanly separate routing logic from mapping logic. Each layer2_action dispatches to a dedicated builder, with unknown actions and Layer 1 inconsistencies funneled to the review queue rather than silently dropping records.
def map_record(record: dict[str, Any]) -> tuple[dict[str, Any] | None, dict[str, Any] | None]:
"""
Map one Layer 1 record to Layer 2 output.
Returns (mapped_record, review_record). Exactly one is non-None.
"""
req_id = record.get("requirement_id", "<unknown>")
layer2_action = record.get("layer2_action", "")
if layer2_action == "do_not_component_map":
log.debug("Passthrough (process_attestation): %s", req_id)
return _build_passthrough(record), None
if layer2_action == "human_review_required":
log.debug("Routing to review queue (Layer 1 decision): %s", req_id)
return None, _build_review_record(record, reason="human_review_required_by_layer1")
if layer2_action == "map_components":
# Safety: Layer 1 inconsistency
if record.get("requires_human_review"):
log.warning(
"Layer 1 inconsistency for %s: layer2_action=map_components "
"but requires_human_review=true — routing to review queue", req_id
)
return None, _build_review_record(record, reason="layer1_inconsistency")
log.debug("Mapping components: %s", req_id)
return _build_mapping(record), None
log.warning("Unknown layer2_action %r for %s — routing to review queue", layer2_action, req_id)
return None, _build_review_record(record, reason=f"unknown_layer2_action_{layer2_action!r}")| Property | Detail |
|---|---|
| Pattern | Strategy — each layer2_action dispatches to a dedicated builder (_build_passthrough, _build_review_record, _build_mapping) |
| Tuple Discriminator | Returns (mapped, None) or (None, review) — exactly one is non-None, enforced by contract |
| Defensive Routing | Unknown actions and Layer 1 inconsistencies both funnel to the review queue rather than silently dropping records |
| Separation of Concerns | Zero mapping logic here — this function only decides which strategy runs, not how it runs |
| Count Invariant | The runner verifies mapped + review == total, so the tuple contract guarantees no records are lost |
Code Showcase 2
Many-to-Many Taxonomy Expansion
The _match_subject() scoring loop resolves ambiguous abstract subjects to concrete taxonomy classes using Jaccard-like keyword overlap + domain hints + detection-class hints + monitoring demotion. A single abstract subject like "encryption key configurations" could map to cryptographic-key-store, configuration-enforcement-control, or access-policy — this scoring stack resolves it deterministically without AI inference at Layer 2.
for class_name, entry in TAXONOMY.items():
score = 0.0
for kw in entry["subjects"]:
kw_norm = _normalize(kw)
kw_words = set(kw_norm.split())
subj_words = set(norm.split())
overlap = kw_words & subj_words
if not overlap:
continue
union = kw_words | subj_words
score = max(score, len(overlap) / len(union) * 3.0)
if score == 0:
class_words = set(class_name.replace("-", " ").split())
subj_words = set(norm.split())
overlap = class_words & subj_words
if overlap:
score = len(overlap) / len(class_words | subj_words) * 1.5
if score == 0:
continue
# Apply bonuses
if class_name in dc_preferred:
score += 2.0
if class_name in domain_preferred:
score += 1.0
if demote_monitoring and class_name == "monitoring-infrastructure":
score = max(score - 1.5, 0.01)
scores[class_name] = score| Property | Detail |
|---|---|
| Scoring Stack | Keyword overlap (Jaccard ×3.0) → class-name fallback (×1.5) → detection-class boost (+2.0) → domain boost (+1.0) → monitoring demotion (−1.5) |
| No AI at Layer 2 | Disambiguation is fully deterministic — no LLM calls, ensuring reproducibility and auditability |
| Tie-Breaking | Domain and detection-class hints from the control family bias ambiguous matches toward the correct taxonomy class |
Data Lifecycle
End-to-end flow of a single compliance check through the pipeline. Every arrow is a single NDJSON file. Every stage enforces a schema gate and count invariant before writing its output.
Ingest + Normalize + Enrich
Four FedRAMP 20x source files are loaded and merged into a canonical requirement record. An escape-hatch classifier catches obvious process-attestation records without calling the LLM. Everything else gets sent to Claude (via Bedrock or direct API) for semantic enrichment — the LLM determines requirement_type, validation_intent, candidate_subjects, candidate_test_modes, and enrichment_confidence.
AWS Component Mapping
Each enriched requirement's abstract candidate_subjects are resolved to concrete AWS CloudFormation resource types via the evidence taxonomy. The strategy router (map_record) dispatches records into passthrough, review queue, or full mapping. The scoring engine (_match_subject) resolves ambiguous subjects using keyword overlap + domain hints + detection-class hints.
Vanta Coverage Comparison
The Layer 2 AWS resource map is compared against live Vanta platform exports across four dimensions — component availability, test existence, inventory presence, and test precision. This produces a coverage_assessment state for each requirement.
Gap Confirmation + Test Design
Each coverage gap is confirmed and classified into a specific gap type. Automation feasibility is assessed, priority signals are assigned, and records are tagged with a layer5_action that determines whether a custom test should be generated.
Backlog Generation + Prioritization
Confirmed gap candidates are shaped into prioritized backlog items with implementation complexity tiers, readiness assessments, priority scores, and prerequisite linkage. This is the actionable output an engineering team uses.
Orchestration + Manifest
The top-level orchestrator sequences Layers 1-5, manages checkpoint state for resumability, and writes the authoritative run manifest. This is the control plane, not a data transformation.
Impact Analysis
Time savings grounded in actual pipeline data — 65 FedRAMP 20x KSIs processed through all 5 layers with real output counts.
Pipeline Scale
| Metric | Count | Source |
|---|---|---|
| FedRAMP 20x KSIs in scope | 65 | Layer 1 audit |
| Control-part evidence mappings generated | 148 | PROMPT_1 (116) + PROMPT_2 (32) |
| Directly covered by Vanta (no gap) | 36 | Layer 3 assessment |
| Confirmed gaps requiring remediation | 29 | Layer 4 classification |
| Backlog candidates auto-prioritized | 20 | Layer 5 output |
| Individual evidence extraction prompts | 116 | Custom test prompts |
Manual Baseline Per Control
For a FedRAMP Moderate SSP, a compliance analyst doing a single control manually performs roughly this work. Midpoint: ~6 hours per control.
| Manual Task | Estimated |
|---|---|
| Read requirement, parse Rev5 crosswalk, understand scope | 0.5-1 hr |
| Map requirement to actual infrastructure (which AWS services?) | 1-2 hrs |
| Determine what evidence is needed and from which tools | 1-2 hrs |
| Collect evidence (screenshots, CLI output, configs) | 1-3 hrs |
| Write the implementation narrative for the SSP | 1-2 hrs |
| Cross-reference with related controls for consistency | 0.5-1 hr |
| Total per control | 4-10 hrs |
What the Pipeline Replaces
| Task | Manual | Pipeline | Time w/ Pipeline |
|---|---|---|---|
| Requirement analysis & classification | 0.5-1 hr | Layer 1 enrichment — seconds | ~0 min |
| Infrastructure mapping | 1-2 hrs | Layer 2 AWS component mapping — seconds | ~0 min |
| Evidence determination | 1-2 hrs | Layer 3 coverage + Layer 4 gap analysis — seconds | ~0 min |
| Evidence collection guidance | 1-3 hrs | 116 pre-built evidence prompts with exact artifacts listed | ~15-30 min (human still collects) |
| Narrative writing | 1-2 hrs | Structured output with validation_intent field | ~15-30 min (human reviews) |
| Cross-referencing | 0.5-1 hr | Cross-layer traceability checks (PP.3) | ~5 min (automated) |
| Total per control | ~6 hrs | ~35-65 min |
Time Savings
Your Scope (65 KSIs)
Full Moderate Baseline (325 Controls)
Savings by Control Type
The 86% figure varies by control type — here are the honest numbers.
| Control Type | Coverage | Savings |
|---|---|---|
| Directly covered (36 KSIs) | Full pipeline: classify, map, confirm coverage — done | ~90% — human just reviews |
| Confirmed gaps (29 KSIs) | Pipeline identifies gap + generates prioritized backlog with complexity tier | ~75% — human still builds the fix |
| Process attestation (escape hatch) | Pipeline routes to do_not_component_map, skips technical mapping entirely | ~80% — saved from wasting time on false mapping |
| Human review queue (4 KSIs) | Pipeline flags uncertainty, blocks propagation | ~50% — analyst still does the hard thinking |
Secondary Wins
Eliminated Hallucination in Evidence Claims
Manual SSPs are full of vague narratives like 'we use AWS services to ensure compliance.' The pipeline enforces abstract resource-class language, mathematical constraints (technical_split + process_split = 1.0), mandatory ambiguity disclosure, and a 10-sample gold validation set. An auditor can trace every claim to a specific validation gate.
Reduced Human Error Through Deterministic Routing
The escape hatch mechanism prevents analysts from spending hours mapping process-only controls like KSI-RPL-01 (Recovery Objectives) to AWS infrastructure. The pipeline catches this in milliseconds: automation_status: No + validation_method: Manual = candidate_subjects: [], layer2_action: do_not_component_map.
Signal Conflict Detection
KSI-PIY-01's title says 'Automated Inventory' but its metadata says automation_status: No. A human skimming the SSP template would likely classify this as technical. The pipeline's constraint — 'source metadata must win over prose' — catches the contradiction. This class of error scales linearly with analyst fatigue and nonlinearly with control count.
Faster Audit Cycles Through Pre-Built Traceability
Cross-layer consistency checks verify that every requirement_id appears exactly once across all layers. When the 3PAO asks 'show me how you determined this control is covered,' the answer is a traceable chain: Layer 1 classification, Layer 2 AWS mapping, Layer 3 Vanta comparison, Layer 4 gap confirmation. Industry consensus: audit cycle time drops 40-60% with pre-structured evidence.
Consistent Prioritization
Layer 5 priority scoring (layer4_signal + technical_weight + gap_type + coverage_depth + reuse - complexity - confidence_penalty) means remediation order isn't driven by whoever shouted loudest. POA&M auto-generation with severity-to-priority mapping (P0: 30 days, P1: 90 days, P2: 180 days, P3: 365 days) gives auditors exactly the timeline structure they expect.
Executive Summary
For 65 FedRAMP 20x KSIs: reduced from ~390 analyst-hours to ~54, an 86% reduction in classification, mapping, and evidence determination effort. At scale (325 Moderate baseline controls), this saves ~1,680 hours per SSP cycle. Secondary benefits include zero-hallucination evidence claims enforced by programmatic validation gates, deterministic routing that eliminates common analyst misclassification errors, and pre-built audit traceability that reduces 3PAO cycle time by an estimated 40-60%.