AI / Compliance Automation / GovTech

Jan 2025

Multi Pipeline LLM Automation Platform

An AI platform that handles the grinding parts of FedRAMP documentation and evidence work. Control mapping, evidence narratives, AWS CLI evidence commands, appendix planning, inventory reports. The stuff that used to take weeks.

Problem

FedRAMP authorization needs hundreds of control mappings, evidence narratives, inventory artifacts, and audit-ready outputs. Doing that by hand is slow, inconsistent, and expensive.

Solution

Built a platform that runs multiple pipelines over LLMs, retrieval, prompt libraries, and structured validation. It connects source documents, vector search, control metadata, and reusable workflows into one repeatable system, and it generates grounded compliance output at real volume.

Impact

→ Cut manual compliance authoring and evidence prep from days or weeks down to batch jobs you rerun on demand
→ Generates output for hundreds of NIST 800-53 control parts automatically
→ Produces reusable outputs that feed other compliance docs and audit workflows downstream

Architecture

01 Google Drive documents feed into vector and file-search workflows
02 The component mapping pipeline identifies which services implement each control
03 Downstream generators produce evidence narratives and AWS CLI command sets
04 Inventory and appendix workflows produce artifacts ready for submission
05 Shared prompt libraries, app bootstrap, and RAG services sit underneath the whole platform

Capabilities

· Control-to-service component mapping
· Evidence narrative generation
· Read-only AWS CLI evidence command generation
· Appendix planning and document support
· Checkpointing and resumable batch execution
· Prompt profile management
· Validation and fallback logic

Stack

Python, OpenAI GPT-4.x / Assistants API, OpenSearch, Amazon Bedrock, pandas, AWS, Google Drive API, JSON prompt libraries

Technical Deep Dive

Architecture internals and annotated code from the production system.

Architecture Overview

The Control to Service Mapper is a layered pipeline with three resolution stages for the many-to-many NIST-to-AWS relationship. Instead of hiding that complexity, I broke it down into a chain of one-to-few lookups, each with its own confidence score. Ambiguity gets tracked, not buried.

NIST SP 800-53 Control

→ [Crosswalk Index] KSI and OSCAL references, confidence weighted per domain

→ [Layer 1] candidate_subjects and test_modes capture the abstract intent

→ [Layer 2] taxonomy class maps to AWS CloudFormation resource types

→ [Output] candidate_aws_resource_classes, relationships, and observable_properties

Key Architectural Decisions

Taxonomy Registry

A static lookup dict of roughly 15 abstract evidence classes. Things like "network-boundary-control" and "identity-principal." Each one comes pre-wired with its aws_resource_classes, supporting_resource_classes, observable_properties, and relationships. That's the indirection layer. NIST intent on one side, AWS specifics on the other, and nothing in between depends on both.
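To make the shape of the registry concrete, here is a minimal sketch of two entries. The class names and field names come from the description above. The `subjects` keyword lists, the property and relationship strings, and the choice of which CloudFormation types belong to which class are illustrative assumptions, as are the names `TaxonomyEntry`, `TAXONOMY_SKETCH`, and `resource_classes_for`.

```python
from typing import TypedDict


class TaxonomyEntry(TypedDict):
    subjects: list[str]                    # keywords matched against candidate_subjects
    aws_resource_classes: list[str]        # direct evidence resources
    supporting_resource_classes: list[str]
    observable_properties: list[str]
    relationships: list[str]


# Illustrative entries only. The CloudFormation type names are real,
# but their assignment to each class is an assumption.
TAXONOMY_SKETCH: dict[str, TaxonomyEntry] = {
    "network-boundary-control": {
        "subjects": ["network boundary", "firewall rules"],
        "aws_resource_classes": ["AWS::EC2::SecurityGroup", "AWS::EC2::NetworkAcl"],
        "supporting_resource_classes": ["AWS::EC2::VPC"],
        "observable_properties": ["ingress_rules", "egress_rules"],
        "relationships": ["attached_to_vpc"],
    },
    "identity-principal": {
        "subjects": ["user accounts", "service roles"],
        "aws_resource_classes": ["AWS::IAM::Role", "AWS::IAM::User"],
        "supporting_resource_classes": ["AWS::IAM::ManagedPolicy"],
        "observable_properties": ["attached_policies"],
        "relationships": ["assumes_role"],
    },
}


def resource_classes_for(class_name: str) -> list[str]:
    """Return direct, then supporting, AWS resource classes for one taxonomy class."""
    entry = TAXONOMY_SKETCH[class_name]
    return entry["aws_resource_classes"] + entry["supporting_resource_classes"]
```

Callers only ever name a taxonomy class, so swapping the AWS types behind a class does not touch anything on the NIST side.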

Confidence Gated Routing

A Strategy pattern dispatcher sorts records into three paths before mapping logic runs at all. Passthrough. Review queue. Full mapping. That routing decision is cheap and deterministic, and it keeps the expensive logic off records that don't need it.

Domain and Detection Hints

Two tie-breaking registries, DOMAIN_HINTS and DETECTION_CLASS_HINTS, bias ambiguous subject resolution toward the right taxonomy class based on the control family. Without them, the scorer guesses on close calls. With them, the control family tells the scorer which class usually wins.
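A rough sketch of how the two registries could break a tie. The registry keys and contents here are placeholders, and `pick_class` is a hypothetical helper, not the production scorer. The +2.0 and +1.0 bonuses match the values shown in the scoring excerpt later in this write-up.

```python
# Hypothetical registry contents; keys and values are placeholders.
DOMAIN_HINTS: dict[str, set[str]] = {
    "IAM": {"identity-principal", "access-policy"},
    "CNA": {"network-boundary-control"},
}
DETECTION_CLASS_HINTS: dict[str, set[str]] = {
    "cryptography": {"cryptographic-key-store"},
    "configuration": {"configuration-enforcement-control"},
}


def pick_class(base_scores: dict[str, float], domain: str, detection_class: str) -> str:
    """Apply hint bonuses to lexical scores and return the winning class name."""
    domain_preferred = DOMAIN_HINTS.get(domain, set())
    dc_preferred = DETECTION_CLASS_HINTS.get(detection_class, set())

    boosted: dict[str, float] = {}
    for class_name, score in base_scores.items():
        if class_name in dc_preferred:
            score += 2.0
        if class_name in domain_preferred:
            score += 1.0
        boosted[class_name] = score

    # Sort names first so an exact tie still resolves the same way on every run.
    return max(sorted(boosted), key=lambda name: boosted[name])


close_call = {"cryptographic-key-store": 1.0, "access-policy": 1.0}
winner = pick_class(close_call, domain="IAM", detection_class="cryptography")
```

In this example both classes start at 1.0, and the detection-class hint settles it in favor of cryptographic-key-store.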

Crosswalk Index Precomputation

The many-to-many KSI-to-domain relationship gets materialized once at startup into a fast lookup index. Refined entries win over legacy entries by construction. The lookup is O(1), and no code path later needs to rebuild that relationship on the fly.
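One way "refined wins by construction" can work is plain write order: load legacy entries first, then let refined entries overwrite them. The sketch below assumes each crosswalk entry is a dict with `ksi_id`, `domain`, and `confidence` keys; those field names and the function name are illustrative, not the production schema.

```python
from collections import defaultdict
from typing import Any, Iterable


def build_crosswalk_index(
    legacy: Iterable[dict[str, Any]],
    refined: Iterable[dict[str, Any]],
) -> dict[str, dict[str, float]]:
    """Materialize KSI -> {domain: confidence} once, so later lookups are dict reads."""
    index: dict[str, dict[str, float]] = defaultdict(dict)
    # Legacy first, refined second: a refined entry for the same
    # (ksi_id, domain) pair overwrites the legacy one.
    for source in (legacy, refined):
        for entry in source:
            index[entry["ksi_id"]][entry["domain"]] = entry["confidence"]
    return dict(index)


index = build_crosswalk_index(
    legacy=[
        {"ksi_id": "KSI-IAM-01", "domain": "AC", "confidence": 0.4},
        {"ksi_id": "KSI-IAM-01", "domain": "IA", "confidence": 0.5},
    ],
    refined=[{"ksi_id": "KSI-IAM-01", "domain": "AC", "confidence": 0.9}],
)
```

After the build, `index["KSI-IAM-01"]` holds both domains, with the refined confidence for AC.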

Code Showcase 1

Strategy Pattern Entry Point

map_record() is the router. 30 lines that cleanly split routing from mapping. Each layer2_action dispatches to its own builder. Unknown actions and Layer 1 inconsistencies go to the review queue. Nothing gets silently dropped.

python

import logging
from typing import Any

log = logging.getLogger(__name__)


def map_record(record: dict[str, Any]) -> tuple[dict[str, Any] | None, dict[str, Any] | None]:
    """
    Map one Layer 1 record to Layer 2 output.

    Returns (mapped_record, review_record). Exactly one is non-None.
    """
    req_id = record.get("requirement_id", "<unknown>")
    layer2_action = record.get("layer2_action", "")

    if layer2_action == "do_not_component_map":
        log.debug("Passthrough (process_attestation): %s", req_id)
        return _build_passthrough(record), None

    if layer2_action == "human_review_required":
        log.debug("Routing to review queue (Layer 1 decision): %s", req_id)
        return None, _build_review_record(record, reason="human_review_required_by_layer1")

    if layer2_action == "map_components":
        # Safety: Layer 1 inconsistency
        if record.get("requires_human_review"):
            log.warning(
                "Layer 1 inconsistency for %s: layer2_action=map_components "
                "but requires_human_review=true — routing to review queue", req_id
            )
            return None, _build_review_record(record, reason="layer1_inconsistency")
        log.debug("Mapping components: %s", req_id)
        return _build_mapping(record), None

    log.warning("Unknown layer2_action %r for %s — routing to review queue", layer2_action, req_id)
    return None, _build_review_record(record, reason=f"unknown_layer2_action_{layer2_action!r}")

Property	Detail
Pattern	Strategy. Each layer2_action dispatches to a dedicated builder (_build_passthrough, _build_review_record, _build_mapping).
Tuple Discriminator	Returns (mapped, None) or (None, review). Exactly one is non-None, enforced by contract.
Defensive Routing	Unknown actions and Layer 1 inconsistencies both route to the review queue. No record gets dropped silently.
Separation of Concerns	Zero mapping logic in the router. It decides which strategy runs, not how.
Count Invariant	The runner verifies mapped + review == total. The tuple contract guarantees no records go missing.
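The count invariant in the last row can be checked by a small runner around the router. This is a sketch under assumptions: `run_mapping` is a hypothetical name, and it accepts any function with the same return contract as map_record() so the example stays self-contained.

```python
from typing import Any, Callable, Optional

Record = dict[str, Any]
MapFn = Callable[[Record], tuple[Optional[Record], Optional[Record]]]


def run_mapping(records: list[Record], map_fn: MapFn) -> tuple[list[Record], list[Record]]:
    """Route every record, enforcing the tuple contract and the count invariant."""
    mapped: list[Record] = []
    review: list[Record] = []
    for record in records:
        mapped_record, review_record = map_fn(record)
        # Exactly one side of the tuple must be populated.
        if (mapped_record is None) == (review_record is None):
            raise ValueError(f"tuple contract violated for {record.get('requirement_id')!r}")
        if mapped_record is not None:
            mapped.append(mapped_record)
        else:
            review.append(review_record)

    # Count invariant: nothing dropped, nothing duplicated.
    if len(mapped) + len(review) != len(records):
        raise RuntimeError("count invariant failed: mapped + review != total")
    return mapped, review
```

Checking the contract per record means a builder that returns two values, or none, fails loudly at the record that caused it rather than at the final count.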

Code Showcase 2

Many to Many Taxonomy Expansion

_match_subject() resolves ambiguous abstract subjects to concrete taxonomy classes. Jaccard-like keyword overlap, plus domain hints, plus detection-class hints, plus a monitoring demotion. "Encryption key configurations" could plausibly map to cryptographic-key-store, configuration-enforcement-control, or access-policy. This scoring stack picks one deterministically. No LLM at this layer.

python

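# Excerpt from _match_subject(). norm, dc_preferred, domain_preferred,
# demote_monitoring, and scores are assumed to be set up earlier in the function.
subj_words = set(norm.split())  # the subject's words are the same for every class

for class_name, entry in TAXONOMY.items():
    score = 0.0

    # Primary signal: best Jaccard overlap across the class's subject keywords.
    for kw in entry["subjects"]:
        kw_words = set(_normalize(kw).split())
        overlap = kw_words & subj_words
        if not overlap:
            continue
        union = kw_words | subj_words
        score = max(score, len(overlap) / len(union) * 3.0)

    # Fallback: match against the words of the class name at a lower weight.
    if score == 0:
        class_words = set(class_name.replace("-", " ").split())
        overlap = class_words & subj_words
        if overlap:
            score = len(overlap) / len(class_words | subj_words) * 1.5

    # No lexical signal at all: this class is not a candidate.
    if score == 0:
        continue

    # Apply bonuses
    if class_name in dc_preferred:
        score += 2.0
    if class_name in domain_preferred:
        score += 1.0

    # Demote monitoring, but keep it above zero so it stays a candidate.
    if demote_monitoring and class_name == "monitoring-infrastructure":
        score = max(score - 1.5, 0.01)

    scores[class_name] = score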

Property	Detail
Scoring Stack	Keyword overlap (Jaccard × 3.0), then class-name fallback (× 1.5), then detection-class boost (+2.0), then domain boost (+1.0), then monitoring demotion (-1.5).
No AI at Layer 2	Disambiguation is fully deterministic, so the result is reproducible and auditable.
Tie-Breaking	Domain and detection-class hints from the control family break ties toward the right taxonomy class.
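A worked example of the keyword-overlap step for the "Encryption key configurations" subject mentioned above. The keyword lists are invented for illustration, `_normalize` here is a simple stand-in (the production normalizer is not shown), and `keyword_score` is a hypothetical helper that mirrors only the Jaccard × 3.0 part of the stack.

```python
def _normalize(text: str) -> str:
    # Stand-in normalizer: lowercase and collapse whitespace.
    return " ".join(text.lower().split())


def keyword_score(subject: str, keywords: list[str], weight: float = 3.0) -> float:
    """Best weighted Jaccard overlap between the subject and any one keyword."""
    subj_words = set(_normalize(subject).split())
    best = 0.0
    for kw in keywords:
        kw_words = set(_normalize(kw).split())
        overlap = kw_words & subj_words
        if overlap:
            best = max(best, len(overlap) / len(kw_words | subj_words) * weight)
    return best


subject = "Encryption key configurations"

# Hypothetical keyword lists for two competing taxonomy classes.
key_store = keyword_score(subject, ["encryption key", "kms key"])
config_control = keyword_score(subject, ["resource configurations"])

# "encryption key" shares 2 of 3 distinct words with the subject: 2/3 * 3.0.
# "resource configurations" shares 1 of 4 distinct words: 1/4 * 3.0.
config_with_hint = config_control + 2.0  # detection-class boost, if the hint applies
```

On lexical overlap alone the key-store class leads. If the control's detection class prefers configuration-enforcement-control, the +2.0 boost is large enough to flip the result, which is the point of putting hints above raw overlap.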

Data Lifecycle

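End-to-end flow of a single compliance check through the pipeline. Every arrow is a single file handed from one stage to the next: NDJSON between Layers 1 through 5, and a JSON manifest out of the orchestrator. Every stage enforces a schema gate and count invariant before writing its output.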

L1 Ingest + Normalize + Enrich

| fedramp_layer1_handoff.ndjson ▼

L2 AWS Component Mapping

| fedramp_layer2_mapped.ndjson ▼

L3 Vanta Coverage Comparison

| fedramp_layer3_gaps.ndjson ▼

L4 Gap Confirmation + Test Design

| fedramp_layer4_candidates.ndjson ▼

L5 Backlog Generation + Prioritization

| fedramp_layer5_backlog.ndjson ▼

L6 Orchestration + Manifest

| run_manifest.json ▼
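The "gate before write" rule for each handoff file can be sketched as below. The 95% threshold and the count check come from the quality gates listed for each stage. The required-field set, the function names, and the choice of exception are assumptions made for the example.

```python
import io
import json
from typing import Any, TextIO

# Assumed minimal schema: a record is valid if it carries these fields.
REQUIRED_FIELDS = frozenset({"requirement_id", "layer2_action"})


def read_ndjson(stream: TextIO) -> list[dict[str, Any]]:
    """One JSON object per non-empty line."""
    return [json.loads(line) for line in stream if line.strip()]


def schema_pass_rate(records: list[dict[str, Any]]) -> float:
    if not records:
        return 1.0
    return sum(REQUIRED_FIELDS.issubset(r) for r in records) / len(records)


def gate_and_write(
    records: list[dict[str, Any]],
    expected_count: int,
    stream: TextIO,
    min_rate: float = 0.95,
) -> float:
    """Write the handoff file only if the count invariant and schema gate both pass."""
    if len(records) != expected_count:
        raise ValueError(f"count invariant failed: {len(records)} != {expected_count}")
    rate = schema_pass_rate(records)
    if rate < min_rate:
        raise ValueError(f"schema gate failed: {rate:.1%} < {min_rate:.0%}")
    for record in records:
        stream.write(json.dumps(record) + "\n")
    return rate


# Usage: in-memory buffer standing in for fedramp_layer1_handoff.ndjson.
buffer = io.StringIO()
rate = gate_and_write(
    [
        {"requirement_id": "KSI-RPL-01", "layer2_action": "do_not_component_map"},
        {"requirement_id": "KSI-PIY-01", "layer2_action": "do_not_component_map"},
    ],
    expected_count=2,
    stream=buffer,
)
round_trip = read_ndjson(io.StringIO(buffer.getvalue()))
```

Because the checks run before the first line is written, a failed gate never leaves a partial file for the next stage to pick up.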

Stage 1

Ingest + Normalize + Enrich

Four FedRAMP 20x source files get loaded and merged into one canonical requirement record. An escape hatch classifier catches obvious process attestation records without an LLM call. Everything else goes to Claude (via Bedrock or direct API) for semantic enrichment. The LLM returns requirement_type, validation_intent, candidate_subjects, candidate_test_modes, and enrichment_confidence. That's the full set.

Input: fedramp_20x_ksi.json, fedramp_20x_oscal_catalog.json, fedramp_20x_moderate_baseline_profile.json, fedramp_20x_rev5_crosswalk.json

Processing: ingestion.py loads + merges → normalizer.py builds canonical record → _is_clear_process_attestation() escape hatch (no LLM) → enricher.py + backend.py calls Claude → normalizer.apply_post_enrichment_fields() derives layer2_action

Output: fedramp_layer1_handoff.ndjson. Each record carries: requirement_id, requirement_type, validation_intent, candidate_subjects[], candidate_test_modes[], technical_split/process_split, enrichment_confidence, layer2_action

Quality Gate: Schema validation >= 95%, count_delta == 0, gold set escape-hatch + AWS-name-check blocking gates

Key File: layer1/runner.py:63-404, S1.0 through S1.10
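A minimal sketch of what an escape-hatch check like this could look like. The rule used here (automation_status of No plus validation_method of Manual) is the one cited for KSI-RPL-01 later in this write-up. The production _is_clear_process_attestation() may apply more conditions, and `apply_escape_hatch` is a hypothetical wrapper.

```python
from typing import Any, Optional


def is_clear_process_attestation(record: dict[str, Any]) -> bool:
    """Metadata-only check, so no LLM call is needed for obvious process controls."""
    return (
        record.get("automation_status") == "No"
        and record.get("validation_method") == "Manual"
    )


def apply_escape_hatch(record: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Return a finished Layer 1 record for clear process attestations, else None."""
    if not is_clear_process_attestation(record):
        return None  # caller sends the record on to LLM enrichment
    return {
        **record,
        "candidate_subjects": [],
        "layer2_action": "do_not_component_map",
    }


recovery = apply_escape_hatch(
    {"requirement_id": "KSI-RPL-01", "automation_status": "No", "validation_method": "Manual"}
)
```

The check reads only source metadata, never the title or prose, which is what keeps it cheap and deterministic.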

Stage 2

AWS Component Mapping

Each enriched requirement's abstract candidate_subjects resolve to concrete AWS CloudFormation resource types via the evidence taxonomy. The strategy router routes records into passthrough, review queue, or full mapping. The scoring engine does the disambiguation: keyword overlap, domain hints, detection class hints.

Input: fedramp_layer1_handoff.ndjson

Processing: mapper.map_record() routes by layer2_action → _build_mapping() iterates candidate_subjects → _match_subject() scores against TAXONOMY → expands to aws_resource_classes, relationships, observable_properties

Output: fedramp_layer2_mapped.ndjson. Each record gains: direct_evidence_subjects[], supporting_evidence_subjects[], candidate_aws_resource_classes[], candidate_relationships[], candidate_observable_properties[], component_mapping_confidence, layer3_action

Quality Gate: Schema validation >= 95%, count invariant (mapped + review == L1 total)

Key File: layer2/mapper.py:34-63 (strategy router), layer2/taxonomy.py:24-569 (registry)
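The expansion step at the end of that chain can be pictured as a merge across the taxonomy classes a record matched. This is a sketch, not the production _build_mapping(): `expand_classes` and the tiny taxonomy passed to it are hypothetical, and only the output field names are taken from the record description above.

```python
def expand_classes(
    matched_classes: list[str],
    taxonomy: dict[str, dict[str, list[str]]],
) -> dict[str, list[str]]:
    """Merge taxonomy entries for all matched classes into Layer 2 output fields."""
    field_map = {
        "candidate_aws_resource_classes": "aws_resource_classes",
        "candidate_relationships": "relationships",
        "candidate_observable_properties": "observable_properties",
    }
    expanded: dict[str, list[str]] = {}
    for out_field, taxonomy_field in field_map.items():
        values = [
            value
            for class_name in matched_classes
            for value in taxonomy[class_name][taxonomy_field]
        ]
        # dict.fromkeys removes duplicates while keeping first-seen order.
        expanded[out_field] = list(dict.fromkeys(values))
    return expanded


# Illustrative two-class taxonomy that shares one resource type.
mini_taxonomy = {
    "cryptographic-key-store": {
        "aws_resource_classes": ["AWS::KMS::Key"],
        "relationships": ["encrypts"],
        "observable_properties": ["key_rotation_enabled"],
    },
    "access-policy": {
        "aws_resource_classes": ["AWS::IAM::ManagedPolicy", "AWS::KMS::Key"],
        "relationships": ["grants_access_to"],
        "observable_properties": ["policy_document"],
    },
}
expanded = expand_classes(["cryptographic-key-store", "access-policy"], mini_taxonomy)
```

Order-preserving deduplication keeps the output stable between runs, which matters when downstream layers diff NDJSON files.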

Stage 3

Vanta Coverage Comparison

The Layer 2 AWS resource map gets compared against live Vanta exports across four dimensions. Component availability, test existence, inventory presence, and test precision. Out the other end, every requirement has a coverage_assessment.

Input: fedramp_layer2_mapped.ndjson + Vanta exports (vanta_components.json, vanta_tests.json, vanta_inventory.json, vanta_controls.json, vanta_control_information.json)

Processing: catalog.build_catalogs() normalizes Vanta data → comparator.compare_record() runs 4-dimension comparison → derives coverage_assessment (directly_covered / indirectly_covered / partially_covered / not_covered / unclear)

Output: fedramp_layer3_gaps.ndjson. Each record gains: coverage_assessment, coverage_dimensions{component, test, inventory, precision}, coverage_gaps[], precision_gaps[], relationship_gaps[], layer4_action

Quality Gate: Schema validation >= 95%, count invariant (assessed + review == L2 mapped count)

Key File: layer3/comparator.py (4-dimension engine), layer3/catalog.py (Vanta normalization)

Stage 4

Gap Confirmation + Test Design

Every coverage gap gets confirmed and classified into a specific gap type. Automation feasibility, priority signal, and a layer5_action that decides whether to generate a custom test. Controls that don't need a custom test never make it to the test generation stage.

Input: fedramp_layer3_gaps.ndjson

Processing: confirmer.confirm_record() applies decision rules RS1-RS9 → classifies confirmed_gap_type (non_a_gap / structural / missing_test / precision / relationship / indirect / advisory / requires_human_review) → assesses automation_feasibility (feasible / uncertain / blocked) → assigns priority_signal (critical / high / medium / low)

Output: fedramp_layer4_candidates.ndjson. Each record gains: confirmed_gap_type, automation_feasibility, priority_signal, custom_test_candidate (bool), candidate_custom_test_scope, candidate_custom_test_intent, candidate_evidence_source, layer5_action

Quality Gate: Schema validation >= 95%, count invariant (candidates == L3 assessed count)

Key File: layer4/confirmer.py (gap decision rules)

Stage 5

Backlog Generation + Prioritization

Confirmed gap candidates turn into prioritized backlog items. Complexity tier, readiness, priority score, prerequisite links. This is the output an engineering team actually works from.

Input: fedramp_layer4_candidates.ndjson

Processing: builder.process_records() routes by layer5_action → assigns backlog_bucket, implementation_complexity (T1-T6), implementation_readiness (ready / needs_review / blocked / deferred) → calculates priority_score (0-100) → links prerequisites[] across candidates

Output: fedramp_layer5_backlog.ndjson. Each record gains: backlog_bucket, implementation_complexity, implementation_readiness, priority_score, prerequisites[], custom_resource_required, custom_resource_definition_hint

Quality Gate: Schema validation >= 95%, count invariant (backlog + review + skipped == L4 candidates)

Key File: layer5/builder.py (backlog shaping + priority scoring)

Stage 6

Orchestration + Manifest

The orchestrator sequences Layers 1 through 5, manages checkpoint state so a failed run can resume instead of restart, and writes the authoritative run manifest. This is the control plane. It doesn't transform data. It runs the things that do.

Input: pipeline_config.json + all source files

Processing: pipeline/runner.py:start() generates run_id → checkpoint.RunState loads/saves .run_state.json → sequences run_layer1() through run_layer5() → catches LayerFatalError to halt on hard gate failure → manifest.build_and_write() produces final record

Output: run_manifest.json (run_id, started_at, completed_at, per-layer results, validation_summary, releasable flag, release_blocking_reasons), pipeline.log, per-layer *_summary.md + *_validation_report.json

Quality Gate: releasable = all_hard_gates_passed AND all schema rates >= 95%

Key File: pipeline/runner.py:36-190
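Resume-instead-of-restart comes down to persisting which layers finished. The sketch below is an assumption about how that could be done: the `RunState` fields, `run_layers`, and `demo` are illustrative, and a plain RuntimeError stands in for LayerFatalError. The `is_releasable` rule is the quality gate stated above.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Callable


@dataclass
class RunState:
    run_id: str
    completed_layers: list[str] = field(default_factory=list)

    @classmethod
    def load(cls, path: Path, run_id: str) -> "RunState":
        # Reuse the checkpoint only if it belongs to the same run.
        if path.exists():
            data = json.loads(path.read_text())
            if data.get("run_id") == run_id:
                return cls(run_id=run_id, completed_layers=data.get("completed_layers", []))
        return cls(run_id=run_id)

    def save(self, path: Path) -> None:
        path.write_text(json.dumps(asdict(self)))


def run_layers(layers: dict[str, Callable[[], None]], path: Path, run_id: str) -> list[str]:
    """Run layers in order, skipping any that a previous attempt already finished."""
    state = RunState.load(path, run_id)
    executed: list[str] = []
    for name, run_layer in layers.items():
        if name in state.completed_layers:
            continue
        run_layer()  # an exception here leaves earlier checkpoints intact
        state.completed_layers.append(name)
        state.save(path)
        executed.append(name)
    return executed


def is_releasable(all_hard_gates_passed: bool, schema_rates: dict[str, float]) -> bool:
    return all_hard_gates_passed and all(rate >= 0.95 for rate in schema_rates.values())


def demo() -> tuple[list[str], list[str]]:
    """First attempt fails in layer2; the second attempt resumes from layer2."""
    attempts = {"layer2": 0}

    def flaky_layer2() -> None:
        attempts["layer2"] += 1
        if attempts["layer2"] == 1:
            raise RuntimeError("simulated hard gate failure")

    layers = {"layer1": lambda: None, "layer2": flaky_layer2, "layer3": lambda: None}
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / ".run_state.json"
        try:
            run_layers(layers, path, run_id="run-001")
        except RuntimeError:
            pass
        checkpointed = RunState.load(path, "run-001").completed_layers
        resumed = run_layers(layers, path, run_id="run-001")
    return checkpointed, resumed
```

Saving after every layer, rather than once at the end, is what makes the failure case cheap: the rerun pays only for the layer that failed and whatever follows it.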

Impact Analysis

Time savings backed by real pipeline runs. 65 FedRAMP 20x KSIs, all 5 layers, actual output counts. No hand-waving.

Pipeline Scale

Metric	Count	Source
FedRAMP 20x KSIs in scope	65	Layer 1 audit
Control-part evidence mappings generated	148	PROMPT_1 (116) + PROMPT_2 (32)
Directly covered by Vanta (no gap)	36	Layer 3 assessment
Confirmed gaps requiring remediation	29	Layer 4 classification
Backlog candidates auto-prioritized	20	Layer 5 output
Individual evidence extraction prompts	116	Custom test prompts

Manual Baseline Per Control

For a FedRAMP Moderate SSP, a compliance analyst doing a single control manually performs roughly this work. The estimates that follow use ~6 hours per control as the working figure, which sits toward the low end of the range below.

Manual Task	Estimated
Read requirement, parse Rev5 crosswalk, understand scope	0.5-1 hr
Map requirement to actual infrastructure (which AWS services?)	1-2 hrs
Determine what evidence is needed and from which tools	1-2 hrs
Collect evidence (screenshots, CLI output, configs)	1-3 hrs
Write the implementation narrative for the SSP	1-2 hrs
Cross-reference with related controls for consistency	0.5-1 hr
Total per control	5-11 hrs

What the Pipeline Replaces

Task	Manual	Pipeline	Time w/ Pipeline
Requirement analysis & classification	0.5-1 hr	Layer 1 enrichment. Seconds	~0 min
Infrastructure mapping	1-2 hrs	Layer 2 AWS component mapping. Seconds	~0 min
Evidence determination	1-2 hrs	Layer 3 coverage + Layer 4 gap analysis. Seconds	~0 min
Evidence collection guidance	1-3 hrs	116 pre-built evidence prompts with exact artifacts listed	~15-30 min (human still collects)
Narrative writing	1-2 hrs	Structured output with validation_intent field	~15-30 min (human reviews)
Cross-referencing	0.5-1 hr	Cross-layer traceability checks (PP.3)	~5 min (automated)
Total per control	~6 hrs		~35-65 min

Time Savings

Your Scope (65 KSIs)

Manual effort: 390 analyst-hours (~9.75 analyst-weeks)

With pipeline: ~54 analyst-hours (~1.35 analyst-weeks)

Saved: ~86%

Full Moderate Baseline (325 Controls)

Manual effort: 1,950 hrs (~49 analyst-weeks)

With pipeline: ~270 hrs (~6.75 analyst-weeks)

Saved: ~1,680 hrs (~86%)
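The figures above follow from three inputs stated in this section: ~6 manual hours per control, a 35 to 65 minute range with the pipeline, and the control count. The 40-hour analyst-week is inferred from 390 hours being quoted as 9.75 weeks. A short sketch of the arithmetic, with `estimate_savings` as an illustrative helper name:

```python
HOURS_PER_ANALYST_WEEK = 40  # inferred: 390 hrs is quoted as ~9.75 analyst-weeks


def estimate_savings(
    controls: int,
    manual_hours_per_control: float = 6.0,
    pipeline_minutes_range: tuple[int, int] = (35, 65),
) -> dict[str, float]:
    """Manual vs. pipeline effort, using the midpoint of the pipeline range."""
    manual_hours = controls * manual_hours_per_control
    midpoint_minutes = sum(pipeline_minutes_range) / 2
    pipeline_hours = controls * midpoint_minutes / 60
    return {
        "manual_hours": manual_hours,
        "pipeline_hours": pipeline_hours,
        "saved_hours": manual_hours - pipeline_hours,
        "saved_fraction": 1 - pipeline_hours / manual_hours,
        "manual_weeks": manual_hours / HOURS_PER_ANALYST_WEEK,
    }


ksi_scope = estimate_savings(65)       # 65 FedRAMP 20x KSIs
full_baseline = estimate_savings(325)  # full Moderate baseline
```

The saved fraction depends only on the per-control figures, not on the control count, which is why both scopes land on the same percentage.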

Savings by Control Type

The 86% figure moves around by control type. Here are the honest numbers.

Control Type	Coverage	Savings
Directly covered (36 KSIs)	Full pipeline: classify, map, confirm coverage. Done	~90%. Human just reviews
Confirmed gaps (29 KSIs)	Pipeline identifies gap + generates prioritized backlog with complexity tier	~75%. Human still builds the fix
Process attestation (escape hatch)	Pipeline routes to do_not_component_map, skips technical mapping entirely	~80%. Saved from wasting time on false mapping
Human review queue (4 KSIs)	Pipeline flags uncertainty, blocks propagation	~50%. Analyst still does the hard thinking

Secondary Wins

Hallucination Kept Out of Evidence Claims

Manual SSPs are full of vague filler like "we use AWS services to ensure compliance." The pipeline forces abstract resource-class language, enforces technical_split + process_split = 1.0, requires ambiguity notes whenever confidence dips, and validates everything against a 10-sample gold set. Every claim an auditor reads traces back to a specific validation gate.
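Two of those gates are simple enough to sketch. The split rule is as stated above. The confidence threshold of 0.7 and the `ambiguity_notes` field name are assumptions for illustration, as is the function name.

```python
import math
from typing import Any


def validate_enrichment(record: dict[str, Any], min_confidence: float = 0.7) -> list[str]:
    """Return a list of gate violations; an empty list means the record passes."""
    errors: list[str] = []

    technical = record.get("technical_split")
    process = record.get("process_split")
    if not isinstance(technical, (int, float)) or not isinstance(process, (int, float)):
        return ["technical_split and process_split must both be numeric"]
    if not (0 <= technical <= 1 and 0 <= process <= 1):
        errors.append("splits must each be between 0 and 1")
    # Tolerance absorbs float rounding such as 0.7 + 0.3.
    if not math.isclose(technical + process, 1.0, abs_tol=1e-6):
        errors.append("technical_split + process_split must equal 1.0")

    # Assumed rule: low confidence is only acceptable with an explanation attached.
    confidence = record.get("enrichment_confidence", 0.0)
    if confidence < min_confidence and not record.get("ambiguity_notes"):
        errors.append("low enrichment_confidence requires ambiguity notes")
    return errors
```

Returning every violation at once, instead of stopping at the first, makes the validation report more useful when a prompt change breaks several fields together.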

Deterministic Routing Kills Analyst Errors

Escape hatches stop analysts from burning hours mapping process-only controls like KSI-RPL-01 (Recovery Objectives) to AWS infrastructure. The pipeline catches those in milliseconds. automation_status: No plus validation_method: Manual resolves to candidate_subjects: [] and layer2_action: do_not_component_map. Nobody wastes a morning on it.

Signal Conflict Detection

KSI-PIY-01's title says "Automated Inventory." Its metadata says automation_status: No. A human skimming the SSP template is likely to classify it as technical. The pipeline's rule, source metadata wins over prose, catches the contradiction every time. This kind of error becomes more likely as analysts tire and as the control count grows.

Faster Audit Cycles from Pre-Built Traceability

Cross-layer consistency checks verify every requirement_id appears exactly once across all layers. When the 3PAO asks "show me how you determined this control is covered," the answer is a traceable chain. Layer 1 classification, Layer 2 AWS mapping, Layer 3 Vanta comparison, Layer 4 gap confirmation. The working estimate is that audit cycle time drops 40 to 60% with pre-structured evidence like this.

Consistent Prioritization

Layer 5 priority scoring is a formula, not a vote. layer4_signal + technical_weight + gap_type + coverage_depth + reuse, minus complexity and confidence_penalty. That means remediation order isn't driven by whoever shouted loudest. POA&M auto-generation maps severity to priority (P0: 30 days, P1: 90 days, P2: 180 days, P3: 365 days), which gives auditors the exact timeline structure they already expect to see.
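The formula and the POA&M timeline can be sketched as follows. The term names, the 0-100 range, and the day counts are from this write-up. How each term is weighted in production is not shown here, so the function simply takes already-weighted numbers. `priority_score` and `poam_due_date` are illustrative names.

```python
from datetime import date, timedelta

# Remediation windows per priority, as listed above.
POAM_DAYS = {"P0": 30, "P1": 90, "P2": 180, "P3": 365}


def priority_score(
    layer4_signal: float,
    technical_weight: float,
    gap_type: float,
    coverage_depth: float,
    reuse: float,
    complexity: float,
    confidence_penalty: float,
) -> float:
    """Sum the positive terms, subtract the penalties, clamp to the 0-100 range."""
    raw = (
        layer4_signal + technical_weight + gap_type + coverage_depth + reuse
        - complexity - confidence_penalty
    )
    return max(0.0, min(100.0, raw))


def poam_due_date(priority: str, opened: date) -> date:
    """Due date for a POA&M item opened on the given day."""
    return opened + timedelta(days=POAM_DAYS[priority])


# Example with made-up term values.
score = priority_score(40, 20, 15, 10, 5, complexity=10, confidence_penalty=5)
due = poam_due_date("P1", date(2025, 1, 1))
```

Because the score is a pure function of recorded fields, two people rerunning it on the same backlog get the same order.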

Executive Summary

For 65 FedRAMP 20x KSIs, the pipeline cuts classification, mapping, and evidence determination from about 390 analyst-hours down to about 54. That's an 86% reduction. Scale it to the full 325-control Moderate baseline and it saves roughly 1,680 hours per SSP cycle. Secondary benefits: evidence claims are enforced by programmatic validation gates instead of trust. Deterministic routing kills the misclassification errors analysts burn out making. And pre-built audit traceability knocks an estimated 40 to 60% off 3PAO cycle time.