Resources — Free DR Summary & Engineering Documentation

DR Posture Summary

Status: Public (share freely)
Version: 3.0
Date: May 2026
Audience: Engineering leads pre-scoping

Read it inline below: same content, no gate, no email required.

This page is the actual artifact — not a teaser for a PDF you can’t download. It describes the disaster recovery standard and process applied to every CPLT engagement, and demonstrates the documentation quality, recovery posture, and operational rigor we bring to client deployments.

Every engagement is custom-scoped. The components, container counts, and specific tools vary. What doesn’t vary: the documentation standard, the recovery methodology, and the handover process. The full 54-page playbook contains detailed recovery procedures, verification commands, and operational runbooks from a reference implementation. It is shared under NDA during scoping conversations.

Engagement model: custom-scoped, standard-documented

CPLT deployments are not off-the-shelf products. Each engagement is custom-scoped to client requirements (document types, volume, integrations, compliance needs), custom-built with client-specific components (MCP servers, databases, inference models, application layer), and documented to the same standard regardless of component selection.

What varies by engagement

Container count and composition
Specific MCP servers or tool integrations
Database tier (which databases, sizing, schema)
Inference layer (which models, local vs. provider mix)
Application layer (LibreChat, custom UI, internal tools, etc.)

What doesn’t vary

Documentation standard (operator-grade runbooks, decision records, verification tooling)
Recovery methodology (decision trees, not checklists; verification at every stage)
Backup architecture (layered strategy with independent failure domains)
Handover process (full source code, IaC, runbooks; zero retention after engagement)

Reference implementation

The full playbook documents a reference deployment that demonstrates the CPLT standard:

Orchestration: Docker Compose (not Kubernetes — right scale for on-prem)
Inference layer: LiteLLM routing 6+ providers; local model serving via llama.cpp
Tool integration: 13 MCP servers with host-network isolation
Application: LibreChat (self-hosted) with custom patches
Database tier: MongoDB, PostgreSQL + pgvector, Redis, Meilisearch

Every container, service, and configuration is documented with: purpose and responsibility, dependencies (what it needs, what needs it), verification method (how to confirm it’s working), and recovery procedure (how to fix it when it breaks). The specific container count, tool selections, and configurations vary by engagement. The documentation standard does not.
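The four documented attributes can be pictured as a small record per service. This is an illustrative sketch only, not the playbook's actual schema: `ServiceDoc`, its field names, and the helper functions are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceDoc:
    """Hypothetical per-service record; field names are illustrative."""
    name: str
    purpose: str
    needs: tuple[str, ...] = ()  # what this service depends on
    verify: str = ""             # how to confirm it is working
    recover: str = ""            # how to fix it when it breaks


def needed_by(docs: list[ServiceDoc]) -> dict[str, list[str]]:
    """Derive the reverse edge ("what needs it") from the forward edges."""
    reverse: dict[str, list[str]] = {d.name: [] for d in docs}
    for d in docs:
        for dep in d.needs:
            reverse.setdefault(dep, []).append(d.name)
    return {name: sorted(users) for name, users in reverse.items()}


def incomplete(docs: list[ServiceDoc]) -> list[str]:
    """Names of services missing a verification method or a recovery procedure."""
    return [d.name for d in docs if not (d.verify and d.recover)]
```

Storing only the forward dependencies and deriving the reverse direction keeps the two views from disagreeing, and `incomplete` gives a quick way to spot records that fall short of the standard.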

Recovery time objectives

RTOs are measured from detection to verified recovery. Verification is mandatory — recovery is not complete until automated checks pass.

Scenario	Target RTO	Tested
Single container failure	< 5 minutes	✓
Multi-container service degradation	< 15 minutes	✓
Full application stack restart	< 30 minutes	✓
Data drive failure (restore from backup)	< 2 hours	✓
Complete bare-metal rebuild	< 2 hours	✓

The full-stack restart target (< 30 minutes) assumes a healthy system being restarted in order, with manual verification at each stage. The bare-metal rebuild target (< 2 hours) assumes starting from nothing with automated scripts and verified backups; it is faster than it sounds because it is scripted end-to-end.
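To make the measurement rule concrete, here is a minimal sketch of an RTO check. The targets are copied from the table above; the scenario keys and the `rto_met` function are hypothetical names chosen for the example.

```python
from datetime import datetime, timedelta
from typing import Optional

# Targets from the RTO table; the dictionary keys are illustrative labels.
RTO_TARGETS = {
    "single_container": timedelta(minutes=5),
    "multi_container": timedelta(minutes=15),
    "stack_restart": timedelta(minutes=30),
    "data_drive": timedelta(hours=2),
    "bare_metal": timedelta(hours=2),
}


def rto_met(scenario: str, detected_at: datetime,
            verified_at: Optional[datetime]) -> bool:
    """True only if verification passed, and passed inside the target window."""
    if verified_at is None:
        # Recovery without passing checks does not count as complete.
        return False
    return verified_at - detected_at < RTO_TARGETS[scenario]
```

The clock stops at `verified_at`, not at the moment a service is restarted, which mirrors the rule that recovery is not complete until checks pass.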

Backup architecture

Three-layer strategy, each with independent cadence and retention:

Layer 1 — Local (high-frequency)

All stateful services (databases, application state, configuration)
Retention: 7 to 30 days (engagement-specific)
Storage: local volume, separate from application data

Layer 2 — Offsite (daily)

All Layer 1 artifacts, encrypted at rest
Retention: 90 days
Storage: S3-compatible object storage, client-specified region

Layer 3 — Version control (continuous)

All application configuration, infrastructure definitions, custom-built components, sanitized templates
Retention: permanent (git history)
Storage: version-controlled repository
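The per-layer retention rules can be sketched as a small lookup plus a pruning helper. This is an assumption-laden illustration: the layer keys, the `expired` function, and the 14-day value for Layer 1 (picked arbitrarily from the 7 to 30 day range) are not taken from the playbook.

```python
from datetime import datetime, timedelta
from typing import Optional

# Retention per layer in days; None means "keep forever" (git history).
# Layer 1 is engagement-specific; 14 is an arbitrary example value.
RETENTION_DAYS: dict[str, Optional[int]] = {
    "local": 14,
    "offsite": 90,
    "vcs": None,
}


def expired(layer: str, taken_at: list[datetime], now: datetime) -> list[datetime]:
    """Return the backup timestamps that have aged out of the layer's window."""
    days = RETENTION_DAYS[layer]
    if days is None:
        return []
    cutoff = now - timedelta(days=days)
    return [t for t in taken_at if t < cutoff]
```

Because each layer has its own entry, changing the local retention for one engagement does not touch the offsite or version-control rules.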

Recovery decision tree

Recovery follows a diagnostic decision tree, not a linear checklist:

Something is wrong
│
├─ Is the application UI reachable?
│  ├─ Yes → Specific feature broken?
│  │  ├─ Yes → Isolate: which container/service?
│  │  │  └─ Check logs, restart, verify
│  │  └─ No → Performance degradation?
│  │     └─ Check inference units, resource limits
│  │
│  └─ No → Is any container running?
│     ├─ Yes → Multi-container failure?
│     │  └─ Check dependencies, restart stack
│     └─ No → Host unresponsive?
│        ├─ Yes → Boot drive failure?
│        │  ├─ Yes → Full bare-metal rebuild
│        │  └─ No → Data drive failure?
│        │     ├─ Yes → Restore from backup
│        │     └─ No → Configuration drift?
│        │        └─ Re-apply from version control
│        └─ No → Secrets compromise?
│           └─ Rotate all credentials

Each branch terminates in a specific recovery procedure with explicit verification steps (not just “check if it works”), pre-flight conditions, and post-recovery validation. The decision tree structure is standard. The specific containers, services, and verification commands are engagement-specific.
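The tree above can be read as nested conditionals. The sketch below encodes exactly the branches shown, returning a label for the terminal procedure; the `Observations` fields and the `choose_procedure` function are hypothetical names, and a real implementation would gather the answers from engagement-specific checks.

```python
from dataclasses import dataclass


@dataclass
class Observations:
    """Answers to the diagnostic questions; field names are illustrative."""
    ui_reachable: bool
    feature_broken: bool = False
    any_container_running: bool = False
    host_unresponsive: bool = False
    boot_drive_failed: bool = False
    data_drive_failed: bool = False


def choose_procedure(o: Observations) -> str:
    """Walk the tree top-down and return the label of the terminal branch."""
    if o.ui_reachable:
        if o.feature_broken:
            return "isolate container: check logs, restart, verify"
        return "performance: check inference units, resource limits"
    if o.any_container_running:
        return "multi-container: check dependencies, restart stack"
    if not o.host_unresponsive:
        return "secrets compromise: rotate all credentials"
    if o.boot_drive_failed:
        return "full bare-metal rebuild"
    if o.data_drive_failed:
        return "restore from backup"
    return "configuration drift: re-apply from version control"
```

For example, `choose_procedure(Observations(ui_reachable=False, host_unresponsive=True, data_drive_failed=True))` lands on the restore-from-backup branch.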

Verification and stale-check

Pre-recovery checks

Container count matches expected inventory
Configuration files match documented state
Backup freshness within engagement-specific threshold
Offsite backup reachability
Secrets bundle integrity
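Two of the pre-recovery checks listed above lend themselves to a short sketch: comparing a configuration file against its documented state, and testing backup freshness. The use of SHA-256 digests, the function names, and the `max_age` parameter are assumptions for illustration; the actual tooling is engagement-specific.

```python
import hashlib
from datetime import datetime, timedelta


def config_matches(content: bytes, documented_sha256: str) -> bool:
    """Compare a config file's bytes against the digest recorded in the docs."""
    return hashlib.sha256(content).hexdigest() == documented_sha256


def backup_is_fresh(last_backup: datetime, now: datetime,
                    max_age: timedelta) -> bool:
    """max_age stands in for the engagement-specific freshness threshold."""
    return now - last_backup <= max_age


def preflight(checks: dict[str, bool]) -> list[str]:
    """Return the names of failed checks; an empty list means proceed."""
    return [name for name, ok in checks.items() if not ok]
```

Returning the list of failed check names, rather than a single boolean, tells the operator which pre-flight condition blocked the recovery.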

Post-recovery checks

All services reachable and responding
Database connections established
Inference units accepting requests
Tool integrations functional
Performance within expected bounds
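As a rough sketch of the first post-recovery check, the snippet below tests whether each service accepts a TCP connection. The endpoint map and function names are hypothetical. An open port only shows that something is listening; confirming that a service is actually responding needs a protocol-level check on top of this.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """True if a TCP connection to host:port can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def failed_services(endpoints: dict[str, tuple[str, int]]) -> list[str]:
    """Return the names of services whose port does not accept connections."""
    return sorted(
        name
        for name, (host, port) in endpoints.items()
        if not tcp_reachable(host, port)
    )
```

An empty list from `failed_services` would be the signal to move on to the deeper checks (database connections, inference requests, tool integrations).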

Drift detection

Stale-check script runs before any recovery procedure
Compares current state against documented state
Fails hard if drift exceeds tolerance — forces playbook update before execution
Warns on soft drift (e.g., retired containers, updated versions)

The tooling is generated from the engagement’s source of truth. When components change, the verification suite is regenerated — not manually updated.
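The hard-fail versus soft-warn behavior can be sketched as follows. This is not the playbook's stale-check script: the classification (undocumented containers count as hard drift, retired containers and version changes as soft drift), the `soft_tolerance` parameter, and all names are assumptions made for the example.

```python
class DriftError(RuntimeError):
    """Raised when drift exceeds tolerance and the playbook must be updated."""


def stale_check(documented: dict[str, str], running: dict[str, str],
                soft_tolerance: int = 2) -> list[str]:
    """Compare container name -> version maps and return soft-drift warnings.

    Raises DriftError on any undocumented container or when the number of
    soft-drift warnings exceeds soft_tolerance (both rules are assumptions).
    """
    undocumented = sorted(set(running) - set(documented))
    retired = sorted(set(documented) - set(running))
    updated = sorted(
        name for name in set(documented) & set(running)
        if documented[name] != running[name]
    )
    warnings = [f"retired: {n}" for n in retired]
    warnings += [f"version changed: {n}" for n in updated]
    if undocumented or len(warnings) > soft_tolerance:
        raise DriftError(
            f"update the playbook first: undocumented={undocumented}, "
            f"soft drift={warnings}"
        )
    return warnings
```

Raising an exception, rather than returning a status flag, makes it hard for a recovery script to continue past drift by accident.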

Recovery scenarios covered

The full playbook covers seven recovery scenarios in detail. Each includes a diagnostic decision tree, explicit pre-flight checks, step-by-step recovery with verification at each stage, automated post-recovery validation, and common failure modes with workarounds.

Single container failure — isolate, restart, verify
Multi-container service degradation — dependency analysis, ordered restart
Stack-wide recovery — full shutdown/startup sequence
Data drive failure — restore from local backup
Full bare-metal rebuild — OS reinstall through application restore
Secrets compromise — credential rotation, access revocation
Data loss event — restore from offsite backup, verify completeness
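The "dependency analysis, ordered restart" step in the multi-container scenario amounts to a topological sort of the dependency graph. A minimal sketch using Python's standard `graphlib` module follows; the function names are hypothetical, and any dependency edges passed in are the caller's assumptions about the stack.

```python
from graphlib import TopologicalSorter


def restart_order(needs: dict[str, set[str]]) -> list[str]:
    """Order services so each one starts after everything it depends on.

    `needs` maps a service name to the set of services it requires.
    graphlib raises CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(needs).static_order())


def shutdown_order(needs: dict[str, set[str]]) -> list[str]:
    """Stop dependents first, then the services they rely on."""
    return list(reversed(restart_order(needs)))
```

With a hypothetical graph such as `{"librechat": {"mongodb", "litellm"}, "litellm": {"postgres"}}`, the database services come before `litellm`, and `librechat` comes last.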

How to access the full playbook

The full 54-page DR Playbook is available during scoping conversations:

Submit a scope request — cplt.online/contact
Initial conversation about your requirements and stack
Standard mutual NDA execution
Full document shared under NDA

The playbook is part of every CPLT engagement deliverable, not a standalone product.

This document is public. Share freely. The full playbook is not.

What you also get in an engagement

Every CPLT engagement ships the documentation set below alongside the production stack. None of these are standalone products — they’re part of the deliverable package, shared under NDA during scoping conversations.

NDA — ENGAGEMENT

Full DR Playbook

54 pages — reference implementation

Complete container BOM and dependency graph, systemd unit inventory with restart procedures, SSOT compiler source structure, all 7 DR scenarios with step-by-step commands, bare-metal rebuild procedure (tested < 2h), and stale-check tooling with current verification results.

NDA — ENGAGEMENT

Architecture Decision Records

Per-engagement decision log

The technical decisions that shape your stack — why Docker Compose over Kubernetes, why LiteLLM over direct provider APIs, why SSOT compilation over manual config. Each ADR documents the decision, alternatives considered, and the tradeoff accepted.

NDA — ENGAGEMENT

Operator Runbook

Tailored to your deployment

Day-to-day operations: container restart procedures, log inspection workflows, backup verification, alert triage, performance investigation, and routine maintenance. Written so your team can run the stack independently after Stage 3 handover.

NDA — ENGAGEMENT

OCR Engine Comparison

When OCR is in scope

Side-by-side comparison of the four OCR engines in the CPLT stack — Vision LLM, Surya, Tesseract, gemma4-ocr. Accuracy benchmarks per document class, speed comparison, routing decision matrix, GPU vs CPU performance notes, and the engagement-specific routing logic for your document mix.

The stack behind the documentation

Every CPLT engagement ships documentation at this standard — plus the production stack it documents. Scoped, built, and handed over with zero retention.

