← Back to projects
AI / Compliance Automation / GovTechJan 2025

Multi-Pipeline LLM Automation Platform

An AI-powered platform that automates major parts of the FedRAMP documentation and evidence workflow, including control-to-service mapping, evidence narrative generation, AWS CLI evidence commands, appendix planning, and inventory reporting.

Problem

FedRAMP authorization requires hundreds of control mappings, evidence narratives, inventory artifacts, and audit-ready outputs. Producing these manually is slow, inconsistent, and expensive.

Solution

Built a multi-pipeline platform that uses LLMs, retrieval, prompt libraries, and structured validation to generate grounded compliance outputs at scale. The platform connects source documents, vector search, control metadata, and reusable workflows into a repeatable system.

Impact

  • Reduced manual compliance authoring and evidence-prep effort from days or weeks to repeatable batch workflows
  • Automated generation for hundreds of NIST 800-53 control parts
  • Created reusable outputs that feed multiple downstream compliance documents and audit workflows

Architecture

  1. 01Google Drive documents are ingested into vector/file-search workflows
  2. 02Component mapping pipeline identifies which services implement each control
  3. 03Downstream generators produce evidence narratives and AWS CLI command sets
  4. 04Supporting inventory and appendix workflows produce submission-ready artifacts
  5. 05Shared prompt libraries, app bootstrap, and RAG services support the full platform

Capabilities

  • ·Control-to-service component mapping
  • ·Evidence narrative generation
  • ·Read-only AWS CLI evidence command generation
  • ·Appendix planning and document support
  • ·Progress checkpointing and resumable batch execution
  • ·Prompt profile management
  • ·Validation and fallback logic

Stack

PythonOpenAI GPT-4.x / Assistants APIOpenSearchAmazon BedrockpandasAWSGoogle Drive APIJSON prompt libraries

Technical Deep Dive

Architecture internals and annotated code from the production system.

Architecture Overview

The Control-to-Service Mapper follows a layered pipeline architecture with three distinct resolution stages for the many-to-many NIST-to-AWS relationship. The many-to-many complexity is managed by decomposing it into a chain of one-to-few lookups, each with its own confidence score, so ambiguity is tracked rather than hidden.

 NIST SP 800-53 Control
[Crosswalk Index] KSI / OSCAL ref — confidence-weighted, per-domain
[Layer 1] candidate_subjects + test_modes — abstract intent
[Layer 2] taxonomy class → AWS CloudFormation resource types
[Output] candidate_aws_resource_classes + relationships + observable_properties

Key Architectural Decisions

01

Taxonomy Registry

A static lookup dict of ~15 abstract evidence classes (e.g., "network-boundary-control", "identity-principal"), each pre-wired with aws_resource_classes, supporting_resource_classes, observable_properties, and relationships. This is the indirection layer that decouples NIST intent from AWS specifics.

02

Confidence-Gated Routing

A Strategy-pattern dispatcher that gates records into three paths before any mapping logic runs — passthrough, review queue, or full mapping.

03

Domain / Detection Hints

Tie-breaking registries (DOMAIN_HINTS, DETECTION_CLASS_HINTS) that bias ambiguous subject resolution toward the correct taxonomy class based on the control family.

04

Crosswalk Index Pre-Computation

The many-to-many KSI-to-domain relationship is materialized once at startup into a fast lookup index with refined vs. legacy discrimination.

Code Showcase 1

Strategy-Pattern Entry Point

The map_record() function is a Strategy Pattern router — 30 lines that cleanly separate routing logic from mapping logic. Each layer2_action dispatches to a dedicated builder, with unknown actions and Layer 1 inconsistencies funneled to the review queue rather than silently dropping records.

python
def map_record(record: dict[str, Any]) -> tuple[dict[str, Any] | None, dict[str, Any] | None]:
    """
    Map one Layer 1 record to Layer 2 output.

    Returns (mapped_record, review_record). Exactly one is non-None.
    """
    req_id = record.get("requirement_id", "<unknown>")
    layer2_action = record.get("layer2_action", "")

    if layer2_action == "do_not_component_map":
        log.debug("Passthrough (process_attestation): %s", req_id)
        return _build_passthrough(record), None

    if layer2_action == "human_review_required":
        log.debug("Routing to review queue (Layer 1 decision): %s", req_id)
        return None, _build_review_record(record, reason="human_review_required_by_layer1")

    if layer2_action == "map_components":
        # Safety: Layer 1 inconsistency
        if record.get("requires_human_review"):
            log.warning(
                "Layer 1 inconsistency for %s: layer2_action=map_components "
                "but requires_human_review=true — routing to review queue", req_id
            )
            return None, _build_review_record(record, reason="layer1_inconsistency")
        log.debug("Mapping components: %s", req_id)
        return _build_mapping(record), None

    log.warning("Unknown layer2_action %r for %s — routing to review queue", layer2_action, req_id)
    return None, _build_review_record(record, reason=f"unknown_layer2_action_{layer2_action!r}")
PropertyDetail
PatternStrategy — each layer2_action dispatches to a dedicated builder (_build_passthrough, _build_review_record, _build_mapping)
Tuple DiscriminatorReturns (mapped, None) or (None, review) — exactly one is non-None, enforced by contract
Defensive RoutingUnknown actions and Layer 1 inconsistencies both funnel to the review queue rather than silently dropping records
Separation of ConcernsZero mapping logic here — this function only decides which strategy runs, not how it runs
Count InvariantThe runner verifies mapped + review == total, so the tuple contract guarantees no records are lost

Code Showcase 2

Many-to-Many Taxonomy Expansion

The _match_subject() scoring loop resolves ambiguous abstract subjects to concrete taxonomy classes using Jaccard-like keyword overlap + domain hints + detection-class hints + monitoring demotion. A single abstract subject like "encryption key configurations" could map to cryptographic-key-store, configuration-enforcement-control, or access-policy — this scoring stack resolves it deterministically without AI inference at Layer 2.

python
for class_name, entry in TAXONOMY.items():
    score = 0.0
    for kw in entry["subjects"]:
        kw_norm = _normalize(kw)
        kw_words = set(kw_norm.split())
        subj_words = set(norm.split())
        overlap = kw_words & subj_words
        if not overlap:
            continue
        union = kw_words | subj_words
        score = max(score, len(overlap) / len(union) * 3.0)

    if score == 0:
        class_words = set(class_name.replace("-", " ").split())
        subj_words = set(norm.split())
        overlap = class_words & subj_words
        if overlap:
            score = len(overlap) / len(class_words | subj_words) * 1.5

    if score == 0:
        continue

    # Apply bonuses
    if class_name in dc_preferred:
        score += 2.0
    if class_name in domain_preferred:
        score += 1.0

    if demote_monitoring and class_name == "monitoring-infrastructure":
        score = max(score - 1.5, 0.01)

    scores[class_name] = score
PropertyDetail
Scoring StackKeyword overlap (Jaccard ×3.0) → class-name fallback (×1.5) → detection-class boost (+2.0) → domain boost (+1.0) → monitoring demotion (−1.5)
No AI at Layer 2Disambiguation is fully deterministic — no LLM calls, ensuring reproducibility and auditability
Tie-BreakingDomain and detection-class hints from the control family bias ambiguous matches toward the correct taxonomy class

Data Lifecycle

End-to-end flow of a single compliance check through the pipeline. Every arrow is a single NDJSON file. Every stage enforces a schema gate and count invariant before writing its output.

L1Ingest + Normalize + Enrich
|fedramp_layer1_handoff.ndjson
L2AWS Component Mapping
|fedramp_layer2_mapped.ndjson
L3Vanta Coverage Comparison
|fedramp_layer3_gaps.ndjson
L4Gap Confirmation + Test Design
|fedramp_layer4_candidates.ndjson
L5Backlog Generation + Prioritization
|fedramp_layer5_backlog.ndjson
L6Orchestration + Manifest
|run_manifest.json
Stage 1

Ingest + Normalize + Enrich

Four FedRAMP 20x source files are loaded and merged into a canonical requirement record. An escape-hatch classifier catches obvious process-attestation records without calling the LLM. Everything else gets sent to Claude (via Bedrock or direct API) for semantic enrichment — the LLM determines requirement_type, validation_intent, candidate_subjects, candidate_test_modes, and enrichment_confidence.

Inputfedramp_20x_ksi.json, fedramp_20x_oscal_catalog.json, fedramp_20x_moderate_baseline_profile.json, fedramp_20x_rev5_crosswalk.json
Processingingestion.py loads + merges → normalizer.py builds canonical record → _is_clear_process_attestation() escape hatch (no LLM) → enricher.py + backend.py calls Claude → normalizer.apply_post_enrichment_fields() derives layer2_action
Outputfedramp_layer1_handoff.ndjson — each record carries: requirement_id, requirement_type, validation_intent, candidate_subjects[], candidate_test_modes[], technical_split/process_split, enrichment_confidence, layer2_action
Quality GateSchema validation >= 95%, count_delta == 0, gold set escape-hatch + AWS-name-check blocking gates
Key Filelayer1/runner.py:63-404 — S1.0 through S1.10
Stage 2

AWS Component Mapping

Each enriched requirement's abstract candidate_subjects are resolved to concrete AWS CloudFormation resource types via the evidence taxonomy. The strategy router (map_record) dispatches records into passthrough, review queue, or full mapping. The scoring engine (_match_subject) resolves ambiguous subjects using keyword overlap + domain hints + detection-class hints.

Inputfedramp_layer1_handoff.ndjson
Processingmapper.map_record() routes by layer2_action → _build_mapping() iterates candidate_subjects → _match_subject() scores against TAXONOMY → expands to aws_resource_classes, relationships, observable_properties
Outputfedramp_layer2_mapped.ndjson — each record gains: direct_evidence_subjects[], supporting_evidence_subjects[], candidate_aws_resource_classes[], candidate_relationships[], candidate_observable_properties[], component_mapping_confidence, layer3_action
Quality GateSchema validation >= 95%, count invariant (mapped + review == L1 total)
Key Filelayer2/mapper.py:34-63 (strategy router), layer2/taxonomy.py:24-569 (registry)
Stage 3

Vanta Coverage Comparison

The Layer 2 AWS resource map is compared against live Vanta platform exports across four dimensions — component availability, test existence, inventory presence, and test precision. This produces a coverage_assessment state for each requirement.

Inputfedramp_layer2_mapped.ndjson + Vanta exports (vanta_components.json, vanta_tests.json, vanta_inventory.json, vanta_controls.json, vanta_control_information.json)
Processingcatalog.build_catalogs() normalizes Vanta data → comparator.compare_record() runs 4-dimension comparison → derives coverage_assessment (directly_covered / indirectly_covered / partially_covered / not_covered / unclear)
Outputfedramp_layer3_gaps.ndjson — each record gains: coverage_assessment, coverage_dimensions{component, test, inventory, precision}, coverage_gaps[], precision_gaps[], relationship_gaps[], layer4_action
Quality GateSchema validation >= 95%, count invariant (assessed + review == L2 mapped count)
Key Filelayer3/comparator.py (4-dimension engine), layer3/catalog.py (Vanta normalization)
Stage 4

Gap Confirmation + Test Design

Each coverage gap is confirmed and classified into a specific gap type. Automation feasibility is assessed, priority signals are assigned, and records are tagged with a layer5_action that determines whether a custom test should be generated.

Inputfedramp_layer3_gaps.ndjson
Processingconfirmer.confirm_record() applies decision rules RS1-RS9 → classifies confirmed_gap_type (non_a_gap / structural / missing_test / precision / relationship / indirect / advisory / requires_human_review) → assesses automation_feasibility (feasible / uncertain / blocked) → assigns priority_signal (critical / high / medium / low)
Outputfedramp_layer4_candidates.ndjson — each record gains: confirmed_gap_type, automation_feasibility, priority_signal, custom_test_candidate (bool), candidate_custom_test_scope, candidate_custom_test_intent, candidate_evidence_source, layer5_action
Quality GateSchema validation >= 95%, count invariant (candidates == L3 assessed count)
Key Filelayer4/confirmer.py (gap decision rules)
Stage 5

Backlog Generation + Prioritization

Confirmed gap candidates are shaped into prioritized backlog items with implementation complexity tiers, readiness assessments, priority scores, and prerequisite linkage. This is the actionable output an engineering team uses.

Inputfedramp_layer4_candidates.ndjson
Processingbuilder.process_records() routes by layer5_action → assigns backlog_bucket, implementation_complexity (T1-T6), implementation_readiness (ready / needs_review / blocked / deferred) → calculates priority_score (0-100) → links prerequisites[] across candidates
Outputfedramp_layer5_backlog.ndjson — each record gains: backlog_bucket, implementation_complexity, implementation_readiness, priority_score, prerequisites[], custom_resource_required, custom_resource_definition_hint
Quality GateSchema validation >= 95%, count invariant (backlog + review + skipped == L4 candidates)
Key Filelayer5/builder.py (backlog shaping + priority scoring)
Stage 6

Orchestration + Manifest

The top-level orchestrator sequences Layers 1-5, manages checkpoint state for resumability, and writes the authoritative run manifest. This is the control plane, not a data transformation.

Inputpipeline_config.json + all source files
Processingpipeline/runner.py:start() generates run_id → checkpoint.RunState loads/saves .run_state.json → sequences run_layer1() through run_layer5() → catches LayerFatalError to halt on hard gate failure → manifest.build_and_write() produces final record
Outputrun_manifest.json (run_id, started_at, completed_at, per-layer results, validation_summary, releasable flag, release_blocking_reasons), pipeline.log, per-layer *_summary.md + *_validation_report.json
Quality Gatereleasable = all_hard_gates_passed AND all schema rates >= 95%
Key Filepipeline/runner.py:36-190

Impact Analysis

Time savings grounded in actual pipeline data — 65 FedRAMP 20x KSIs processed through all 5 layers with real output counts.

Pipeline Scale

MetricCountSource
FedRAMP 20x KSIs in scope65Layer 1 audit
Control-part evidence mappings generated148PROMPT_1 (116) + PROMPT_2 (32)
Directly covered by Vanta (no gap)36Layer 3 assessment
Confirmed gaps requiring remediation29Layer 4 classification
Backlog candidates auto-prioritized20Layer 5 output
Individual evidence extraction prompts116Custom test prompts

Manual Baseline Per Control

For a FedRAMP Moderate SSP, a compliance analyst doing a single control manually performs roughly this work. Midpoint: ~6 hours per control.

Manual TaskEstimated
Read requirement, parse Rev5 crosswalk, understand scope0.5-1 hr
Map requirement to actual infrastructure (which AWS services?)1-2 hrs
Determine what evidence is needed and from which tools1-2 hrs
Collect evidence (screenshots, CLI output, configs)1-3 hrs
Write the implementation narrative for the SSP1-2 hrs
Cross-reference with related controls for consistency0.5-1 hr
Total per control4-10 hrs

What the Pipeline Replaces

TaskManualPipelineTime w/ Pipeline
Requirement analysis & classification0.5-1 hrLayer 1 enrichment — seconds~0 min
Infrastructure mapping1-2 hrsLayer 2 AWS component mapping — seconds~0 min
Evidence determination1-2 hrsLayer 3 coverage + Layer 4 gap analysis — seconds~0 min
Evidence collection guidance1-3 hrs116 pre-built evidence prompts with exact artifacts listed~15-30 min (human still collects)
Narrative writing1-2 hrsStructured output with validation_intent field~15-30 min (human reviews)
Cross-referencing0.5-1 hrCross-layer traceability checks (PP.3)~5 min (automated)
Total per control~6 hrs~35-65 min

Time Savings

Your Scope (65 KSIs)

Manual effort390 analyst-hours (~9.75 analyst-weeks)
With pipeline~54 analyst-hours (~1.35 analyst-weeks)
Saved~86%

Full Moderate Baseline (325 Controls)

Manual effort1,950 hrs (~49 analyst-weeks)
With pipeline~270 hrs (~6.75 analyst-weeks)
Saved~1,680 hrs saved (~86%)

Savings by Control Type

The 86% figure varies by control type — here are the honest numbers.

Control TypeCoverageSavings
Directly covered (36 KSIs)Full pipeline: classify, map, confirm coverage — done~90% — human just reviews
Confirmed gaps (29 KSIs)Pipeline identifies gap + generates prioritized backlog with complexity tier~75% — human still builds the fix
Process attestation (escape hatch)Pipeline routes to do_not_component_map, skips technical mapping entirely~80% — saved from wasting time on false mapping
Human review queue (4 KSIs)Pipeline flags uncertainty, blocks propagation~50% — analyst still does the hard thinking

Secondary Wins

01

Eliminated Hallucination in Evidence Claims

Manual SSPs are full of vague narratives like 'we use AWS services to ensure compliance.' The pipeline enforces abstract resource-class language, mathematical constraints (technical_split + process_split = 1.0), mandatory ambiguity disclosure, and a 10-sample gold validation set. An auditor can trace every claim to a specific validation gate.

02

Reduced Human Error Through Deterministic Routing

The escape hatch mechanism prevents analysts from spending hours mapping process-only controls like KSI-RPL-01 (Recovery Objectives) to AWS infrastructure. The pipeline catches this in milliseconds: automation_status: No + validation_method: Manual = candidate_subjects: [], layer2_action: do_not_component_map.

03

Signal Conflict Detection

KSI-PIY-01's title says 'Automated Inventory' but its metadata says automation_status: No. A human skimming the SSP template would likely classify this as technical. The pipeline's constraint — 'source metadata must win over prose' — catches the contradiction. This class of error scales linearly with analyst fatigue and nonlinearly with control count.

04

Faster Audit Cycles Through Pre-Built Traceability

Cross-layer consistency checks verify that every requirement_id appears exactly once across all layers. When the 3PAO asks 'show me how you determined this control is covered,' the answer is a traceable chain: Layer 1 classification, Layer 2 AWS mapping, Layer 3 Vanta comparison, Layer 4 gap confirmation. Industry consensus: audit cycle time drops 40-60% with pre-structured evidence.

05

Consistent Prioritization

Layer 5 priority scoring (layer4_signal + technical_weight + gap_type + coverage_depth + reuse - complexity - confidence_penalty) means remediation order isn't driven by whoever shouted loudest. POA&M auto-generation with severity-to-priority mapping (P0: 30 days, P1: 90 days, P2: 180 days, P3: 365 days) gives auditors exactly the timeline structure they expect.

Executive Summary

For 65 FedRAMP 20x KSIs: reduced from ~390 analyst-hours to ~54, an 86% reduction in classification, mapping, and evidence determination effort. At scale (325 Moderate baseline controls), this saves ~1,680 hours per SSP cycle. Secondary benefits include zero-hallucination evidence claims enforced by programmatic validation gates, deterministic routing that eliminates common analyst misclassification errors, and pre-built audit traceability that reduces 3PAO cycle time by an estimated 40-60%.