DR Posture Summary

Status: Public (share freely)
Version: 3.0
Date: May 2026
Audience: Engineering leads pre-scoping
Download as PDF (8 pages, 290 KB) →

Or just read it inline below — same content, no gate, no email required.

This page is the actual artifact — not a teaser for a PDF you can’t download. It describes the disaster recovery standard and process applied to every CPLT engagement, and demonstrates the documentation quality, recovery posture, and operational rigor we bring to client deployments.

Every engagement is custom-scoped. The components, container counts, and specific tools vary. What doesn’t vary: the documentation standard, the recovery methodology, and the handover process. The full 54-page playbook contains detailed recovery procedures, verification commands, and operational runbooks from a reference implementation. It is shared under NDA during scoping conversations.

Engagement model: custom-scoped, standard-documented

CPLT deployments are not off-the-shelf products. Each engagement is custom-scoped to client requirements (document types, volume, integrations, compliance needs), custom-built with client-specific components (MCP servers, databases, inference models, application layer), and documented to the same standard regardless of component selection.

What varies by engagement

  • Container count and composition
  • Specific MCP servers or tool integrations
  • Database tier (which databases, sizing, schema)
  • Inference layer (which models, local vs. provider mix)
  • Application layer (LibreChat, custom UI, internal tools, etc.)

What doesn’t vary

  • Documentation standard (operator-grade runbooks, decision records, verification tooling)
  • Recovery methodology (decision trees, not checklists; verification at every stage)
  • Backup architecture (layered strategy with independent failure domains)
  • Handover process (full source code, IaC, runbooks; zero retention after engagement)

Reference implementation

The full playbook documents a reference deployment that demonstrates the CPLT standard:

  • Orchestration: Docker Compose (not Kubernetes — right scale for on-prem)
  • Inference layer: LiteLLM routing 6+ providers; local model serving via llama.cpp
  • Tool integration: 13 MCP servers with host-network isolation
  • Application: LibreChat (self-hosted) with custom patches
  • Database tier: MongoDB, PostgreSQL + pgvector, Redis, Meilisearch

Every container, service, and configuration is documented with: purpose and responsibility, dependencies (what it needs, what needs it), verification method (how to confirm it’s working), and recovery procedure (how to fix it when it breaks). The specific container count, tool selections, and configurations vary by engagement. The documentation standard does not.
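As a minimal sketch of what one such record could look like in machine-readable form, the four documented fields map naturally onto a small data structure. The `ServiceRecord` type, the helper names, and the sample entries below are hypothetical illustrations, not the playbook's actual schema. Only "what it needs" is declared; "what needs it" is derived by inverting the declared dependencies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceRecord:
    """One documented service; the fields mirror the standard described above."""
    name: str
    purpose: str
    depends_on: tuple[str, ...]   # what it needs
    verification: str             # how to confirm it's working
    recovery: str                 # how to fix it when it breaks


def dependents_of(name: str, records: list[ServiceRecord]) -> list[str]:
    """Derive 'what needs it' by inverting the declared dependencies."""
    return sorted(r.name for r in records if name in r.depends_on)


def incomplete(records: list[ServiceRecord]) -> list[str]:
    """Names of records missing a purpose, verification, or recovery entry."""
    return sorted(
        r.name for r in records
        if not (r.purpose.strip() and r.verification.strip() and r.recovery.strip())
    )


# Illustrative entries only; the wiring and wording are assumptions.
RECORDS = [
    ServiceRecord("mongodb", "Chat history store", (),
                  "accepts a connection", "restart, then restore if needed"),
    ServiceRecord("litellm", "Routes inference requests", (),
                  "answers a health request", "restart"),
    ServiceRecord("librechat", "Chat UI", ("mongodb", "litellm"),
                  "UI loads", ""),  # recovery left blank on purpose
]

print(dependents_of("mongodb", RECORDS))  # services that need mongodb
print(incomplete(RECORDS))                # records that fail the standard
```

A check like `incomplete` could run in CI so that a service cannot be added without all four fields filled in.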

Recovery time objectives

RTOs are measured from detection to verified recovery. Verification is mandatory — recovery is not complete until automated checks pass.

Scenario                                     Target RTO      Tested
Single container failure                     < 5 minutes
Multi-container service degradation          < 15 minutes
Full application stack restart               < 30 minutes
Data drive failure (restore from backup)     < 2 hours
Complete bare-metal rebuild                  < 2 hours

The full application stack restart target (< 30 minutes) assumes a healthy system being restarted in order, with manual verification at each stage. The bare-metal rebuild target (< 2 hours) assumes starting from nothing with automated scripts and verified backups; it is faster than it sounds because the rebuild is scripted end-to-end.
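The measurement rule (detection to verified recovery, with unverified recoveries not counting) can be sketched as follows. The scenario keys and the `rto_met` function are hypothetical names chosen for illustration; the target durations are the ones listed in the table.

```python
from datetime import datetime, timedelta
from typing import Optional

# Targets from the RTO table; the dictionary keys are illustrative names.
RTO_TARGETS = {
    "single_container_failure": timedelta(minutes=5),
    "multi_container_degradation": timedelta(minutes=15),
    "full_stack_restart": timedelta(minutes=30),
    "data_drive_failure": timedelta(hours=2),
    "bare_metal_rebuild": timedelta(hours=2),
}


def rto_met(scenario: str, detected_at: datetime,
            verified_at: Optional[datetime]) -> bool:
    """True only if automated checks passed within the target window.

    verified_at is None while verification has not passed; in that case
    the recovery does not count as complete, however fast the restart was.
    """
    if verified_at is None:
        return False
    return verified_at - detected_at < RTO_TARGETS[scenario]


detected = datetime(2026, 5, 1, 12, 0)
print(rto_met("single_container_failure", detected,
              detected + timedelta(minutes=4)))
print(rto_met("single_container_failure", detected, None))
```

The clock stops at the verification timestamp rather than the restart timestamp, which is what distinguishes this from a plain uptime measurement.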

Backup architecture

Three-layer strategy, each with independent cadence and retention:

Layer 1 — Local (high-frequency)

  • All stateful services (databases, application state, configuration)
  • Retention: 7–30 days (engagement-specific)
  • Storage: local volume, separate from application data

Layer 2 — Offsite (daily)

  • All Layer 1 artifacts, encrypted at rest
  • Retention: 90 days
  • Storage: S3-compatible object storage, client-specified region

Layer 3 — Version control (continuous)

  • All application configuration, infrastructure definitions, custom-built components, sanitized templates
  • Retention: permanent (git history)
  • Storage: version-controlled repository
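The per-layer retention rules can be illustrated with a small pruning sketch. This is an assumption-laden example, not the playbook's tooling: the layer keys and the `expired` helper are hypothetical, and the 30-day local value is just one point in the engagement-specific 7 to 30 day range.

```python
from datetime import datetime, timedelta

# Retention per layer, in days. None means "keep forever" (git history).
# The local value is engagement-specific; 30 is an assumed example.
RETENTION_DAYS = {
    "local": 30,
    "offsite": 90,
    "version_control": None,
}


def expired(backup_times: list[datetime], now: datetime, layer: str) -> list[datetime]:
    """Return the backup timestamps that fall outside the layer's retention."""
    days = RETENTION_DAYS[layer]
    if days is None:
        return []  # permanent retention: nothing ever expires
    cutoff = now - timedelta(days=days)
    return [t for t in backup_times if t < cutoff]


now = datetime(2026, 5, 31)
backups = [datetime(2026, 1, 1), datetime(2026, 4, 15), datetime(2026, 5, 30)]
for layer in RETENTION_DAYS:
    print(layer, [t.date().isoformat() for t in expired(backups, now, layer)])
```

Keeping the retention values in one table per engagement makes it straightforward to apply a different local window without touching the pruning logic.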

Recovery decision tree

Recovery follows a diagnostic decision tree, not a linear checklist:

Something is wrong
│
├─ Is the application UI reachable?
│  ├─ Yes → Specific feature broken?
│  │  ├─ Yes → Isolate: which container/service?
│  │  │  └─ Check logs, restart, verify
│  │  └─ No → Performance degradation?
│  │     └─ Check inference units, resource limits
│  │
│  └─ No → Is any container running?
│     ├─ Yes → Multi-container failure?
│     │  └─ Check dependencies, restart stack
│     └─ No → Host unresponsive?
│        ├─ Yes → Boot drive failure?
│        │  ├─ Yes → Full bare-metal rebuild
│        │  └─ No → Data drive failure?
│        │     ├─ Yes → Restore from backup
│        │     └─ No → Configuration drift?
│        │        └─ Re-apply from version control
│        └─ No → Secrets compromise?
│           └─ Rotate all credentials

Each branch terminates in a specific recovery procedure with explicit verification steps (not just “check if it works”), pre-flight conditions, and post-recovery validation. The decision tree structure is standard. The specific containers, services, and verification commands are engagement-specific.
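To show how the tree differs from a linear checklist, here is one possible encoding of the same branches as a function. The `Observations` fields and the returned strings are illustrative names, and the sketch treats the final question on each path (for example "Performance degradation?") as the working diagnosis rather than asking it separately.

```python
from dataclasses import dataclass


@dataclass
class Observations:
    """Answers to the diagnostic questions, gathered by the operator."""
    ui_reachable: bool
    feature_broken: bool = False
    any_container_running: bool = False
    host_unresponsive: bool = False
    boot_drive_failed: bool = False
    data_drive_failed: bool = False


def choose_procedure(o: Observations) -> str:
    """Walk the decision tree top-down and return the terminal procedure."""
    if o.ui_reachable:
        if o.feature_broken:
            return "isolate container/service: check logs, restart, verify"
        return "performance degradation: check inference units, resource limits"
    if o.any_container_running:
        return "multi-container failure: check dependencies, restart stack"
    if o.host_unresponsive:
        if o.boot_drive_failed:
            return "full bare-metal rebuild"
        if o.data_drive_failed:
            return "restore from backup"
        return "configuration drift: re-apply from version control"
    return "secrets compromise: rotate all credentials"


print(choose_procedure(Observations(ui_reachable=True, feature_broken=True)))
print(choose_procedure(Observations(ui_reachable=False, host_unresponsive=True,
                                    data_drive_failed=True)))
```

Each question is asked only when the answers above it make it relevant, so an operator never runs a drive diagnostic while the UI is still serving requests.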

Verification and stale-check

Pre-recovery checks

  • Container count matches expected inventory
  • Configuration files match documented state
  • Backup freshness within engagement-specific threshold
  • Offsite backup reachability
  • Secrets bundle integrity

Post-recovery checks

  • All services reachable and responding
  • Database connections established
  • Inference units accepting requests
  • Tool integrations functional
  • Performance within expected bounds
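A check list like the ones above could be driven by a small runner in which recovery only counts as complete when every check passes. The sketch below is hypothetical: `run_checks`, `recovery_complete`, and the stand-in lambdas are illustrative, and real checks would probe the engagement's actual services.

```python
from typing import Callable


def run_checks(checks: dict[str, Callable[[], bool]]) -> dict[str, bool]:
    """Run every check; a check that raises is recorded as a failure."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results


def recovery_complete(results: dict[str, bool]) -> bool:
    """Complete only if at least one check ran and all of them passed."""
    return bool(results) and all(results.values())


def broken_probe() -> bool:
    raise ConnectionError("service did not answer")


# Stand-in checks; each lambda would be replaced by a real probe.
post_recovery = {
    "services_reachable": lambda: True,
    "database_connections": lambda: True,
    "inference_accepting_requests": broken_probe,
}

results = run_checks(post_recovery)
print(results)
print(recovery_complete(results))
```

Running all checks instead of stopping at the first failure gives the operator the full picture in one pass, and an empty check set is deliberately treated as "not verified".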

Drift detection

  • Stale-check script runs before any recovery procedure
  • Compares current state against documented state
  • Fails hard if drift exceeds tolerance — forces playbook update before execution
  • Warns on soft drift (e.g., retired containers, updated versions)

The tooling is generated from the engagement’s source of truth. When components change, the verification suite is regenerated — not manually updated.
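The hard/soft distinction might be implemented along these lines. This is a sketch under stated assumptions, not the actual stale-check script: state is modeled as a mapping from container name to version, retired containers and updated versions are treated as soft drift (as in the examples above), and an undocumented container is assumed to count as hard drift. `stale_check`, `HardDriftError`, and `max_hard_findings` are hypothetical names.

```python
class HardDriftError(RuntimeError):
    """Raised when drift exceeds tolerance; the playbook must be updated first."""


def stale_check(documented: dict[str, str], current: dict[str, str],
                max_hard_findings: int = 0) -> list[str]:
    """Compare current state to documented state (container name -> version).

    Returns soft-drift warnings; raises HardDriftError when the number of
    hard findings exceeds max_hard_findings.
    """
    warnings = []
    for name in sorted(documented):
        if name not in current:
            warnings.append(f"retired container: {name}")
        elif current[name] != documented[name]:
            warnings.append(
                f"updated version: {name} {documented[name]} -> {current[name]}")
    # Assumption: anything running that the playbook does not describe is hard drift.
    hard = [f"undocumented container: {name}"
            for name in sorted(current.keys() - documented.keys())]
    if len(hard) > max_hard_findings:
        raise HardDriftError("; ".join(hard))
    return warnings


documented = {"app": "1.0", "db": "16", "cache": "7"}
print(stale_check(documented, {"app": "1.1", "db": "16"}))
try:
    stale_check(documented, {"app": "1.0", "db": "16", "cache": "7", "sidecar": "0.1"})
except HardDriftError as err:
    print("refusing to run recovery:", err)
```

Raising an exception rather than returning a flag means a recovery script that forgets to inspect the result still stops before executing a stale procedure.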

Recovery scenarios covered

The full playbook covers seven recovery scenarios in detail. Each includes a diagnostic decision tree, explicit pre-flight checks, step-by-step recovery with verification at each stage, automated post-recovery validation, and common failure modes with workarounds.

  • Single container failure — isolate, restart, verify
  • Multi-container service degradation — dependency analysis, ordered restart
  • Stack-wide recovery — full shutdown/startup sequence
  • Data drive failure — restore from local backup
  • Full bare-metal rebuild — OS reinstall through application restore
  • Secrets compromise — credential rotation, access revocation
  • Data loss event — restore from offsite backup, verify completeness
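For the "dependency analysis, ordered restart" step, one common approach is a topological sort of the dependency graph, sketched here with Python's standard `graphlib` module. The dependency map is a made-up example and does not describe the reference deployment's real wiring.

```python
from graphlib import TopologicalSorter

# Hypothetical map: service -> the services it depends on.
DEPENDS_ON = {
    "librechat": {"mongodb", "meilisearch", "litellm"},
    "litellm": {"postgres", "redis"},
}


def startup_order(depends_on: dict[str, set[str]]) -> list[str]:
    """Dependencies first; raises graphlib.CycleError on a circular dependency."""
    return list(TopologicalSorter(depends_on).static_order())


def shutdown_order(depends_on: dict[str, set[str]]) -> list[str]:
    """Reverse of startup: stop dependents before the services they rely on."""
    return list(reversed(startup_order(depends_on)))


print("start:", startup_order(DEPENDS_ON))
print("stop: ", shutdown_order(DEPENDS_ON))
```

Services that appear only as dependencies (the databases here) are picked up automatically, so the map only needs entries for services that depend on something.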

How to access the full playbook

The full 54-page DR Playbook is available during scoping conversations:

  • Submit a scope request — cplt.online/contact
  • Initial conversation about your requirements and stack
  • Standard mutual NDA execution
  • Full document shared under NDA

The playbook is part of every CPLT engagement deliverable, not a standalone product.

What you also get in an engagement

Every CPLT engagement ships the documentation set below alongside the production stack. None of these are standalone products — they’re part of the deliverable package, shared under NDA during scoping conversations.

NDA — ENGAGEMENT

Full DR Playbook

54 pages — reference implementation

Complete container BOM and dependency graph, systemd unit inventory with restart procedures, SSOT compiler source structure, all 7 DR scenarios with step-by-step commands, bare-metal rebuild procedure (tested < 2h), and stale-check tooling with current verification results.

NDA — ENGAGEMENT

Architecture Decision Records

Per-engagement decision log

The technical decisions that shape your stack — why Docker Compose over Kubernetes, why LiteLLM over direct provider APIs, why SSOT compilation over manual config. Each ADR documents the decision, alternatives considered, and the tradeoff accepted.

NDA — ENGAGEMENT

Operator Runbook

Tailored to your deployment

Day-to-day operations: container restart procedures, log inspection workflows, backup verification, alert triage, performance investigation, and routine maintenance. Written so your team can run the stack independently after Stage 3 handover.

NDA — ENGAGEMENT

OCR Engine Comparison

When OCR is in scope

Side-by-side comparison of the four OCR engines in the CPLT stack — Vision LLM, Surya, Tesseract, gemma4-ocr. Accuracy benchmarks per document class, speed comparison, routing decision matrix, GPU vs CPU performance notes, and the engagement-specific routing logic for your document mix.

The stack behind the documentation

Every CPLT engagement ships documentation at this standard — plus the production stack it documents. Scoped, built, and handed over with zero retention.
