DOCUMENTATION
Engineering documentation from a production stack
Operator-grade documentation — the same standard applied to every CPLT engagement. The full DR posture summary is on this page; the 54-page playbook is shared under NDA during scoping.
DR Posture Summary
Read it inline below: same content, no gate, no email required.
This page is the actual artifact — not a teaser for a PDF you can’t download. It describes the disaster recovery standard and process applied to every CPLT engagement, and demonstrates the documentation quality, recovery posture, and operational rigor we bring to client deployments.
Every engagement is custom-scoped. The components, container counts, and specific tools vary. What doesn’t vary: the documentation standard, the recovery methodology, and the handover process. The full 54-page playbook contains detailed recovery procedures, verification commands, and operational runbooks from a reference implementation. It is shared under NDA during scoping conversations.
Engagement model: custom-scoped, standard-documented
CPLT deployments are not off-the-shelf products. Each engagement is custom-scoped to client requirements (document types, volume, integrations, compliance needs), custom-built with client-specific components (MCP servers, databases, inference models, application layer), and documented to the same standard regardless of component selection.
What varies by engagement
- Container count and composition
- Specific MCP servers or tool integrations
- Database tier (which databases, sizing, schema)
- Inference layer (which models, local vs. provider mix)
- Application layer (LibreChat, custom UI, internal tools, etc.)
What doesn’t vary
- Documentation standard (operator-grade runbooks, decision records, verification tooling)
- Recovery methodology (decision trees, not checklists; verification at every stage)
- Backup architecture (layered strategy with independent failure domains)
- Handover process (full source code, IaC, runbooks; zero retention after engagement)
Reference implementation
The full playbook documents a reference deployment that demonstrates the CPLT standard:
- Orchestration: Docker Compose (not Kubernetes — right scale for on-prem)
- Inference layer: LiteLLM routing 6+ providers; local model serving via llama.cpp
- Tool integration: 13 MCP servers with host-network isolation
- Application: LibreChat (self-hosted) with custom patches
- Database tier: MongoDB, PostgreSQL + pgvector, Redis, Meilisearch
Every container, service, and configuration is documented with: purpose and responsibility, dependencies (what it needs, what needs it), verification method (how to confirm it’s working), and recovery procedure (how to fix it when it breaks). The specific container count, tool selections, and configurations vary by engagement. The documentation standard does not.
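As a rough illustration of the four documented fields, a per-service record could look like the sketch below. The `ServiceDoc` class, its field names, the dependency edges, and the verification notes are all hypothetical and are not taken from the playbook. The "what needs it" side is derived from the "what it needs" side rather than maintained by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceDoc:
    """One inventory entry; the field names are illustrative assumptions."""
    name: str
    purpose: str
    needs: tuple[str, ...] = ()   # what this service depends on
    verify: str = ""              # how to confirm it is working
    recover: str = ""             # how to fix it when it breaks


def needed_by(inventory: list[ServiceDoc]) -> dict[str, list[str]]:
    """Invert the 'needs' edges to get 'what needs it' for every service."""
    reverse: dict[str, list[str]] = {svc.name: [] for svc in inventory}
    for svc in inventory:
        for dep in svc.needs:
            if dep not in reverse:
                raise KeyError(f"{svc.name} depends on undocumented service {dep!r}")
            reverse[dep].append(svc.name)
    return reverse


# Hypothetical entries; the dependency edges are assumed for this example.
INVENTORY = [
    ServiceDoc("mongodb", "chat history store", verify="ping the database"),
    ServiceDoc("litellm", "provider routing", verify="request the health endpoint"),
    ServiceDoc("librechat", "application UI", needs=("mongodb", "litellm")),
]
```

Raising on an undocumented dependency keeps the inventory honest: a service cannot reference something that has no entry of its own.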
Recovery time objectives
RTOs are measured from detection to verified recovery. Verification is mandatory — recovery is not complete until automated checks pass.
| Scenario | Target RTO | Tested |
|---|---|---|
| Single container failure | < 5 minutes | ✓ |
| Multi-container service degradation | < 15 minutes | ✓ |
| Full application stack restart | < 30 minutes | ✓ |
| Data drive failure (restore from backup) | < 2 hours | ✓ |
| Complete bare-metal rebuild | < 2 hours | ✓ |
The stack restart target (< 30 minutes) assumes a healthy system being restarted in order, with manual verification at each stage. The bare-metal rebuild target (< 2 hours) assumes starting from nothing with automated scripts and verified backups; it is faster than it sounds because it is scripted end to end.
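The measurement rule (detection to verified recovery, with verification mandatory) can be sketched as follows. The scenario keys and the `rto_met` helper are hypothetical names chosen for this example; only the target durations come from the table above.

```python
from datetime import datetime, timedelta
from typing import Optional

# Targets copied from the table above; the keys are hypothetical identifiers.
TARGET_RTO = {
    "single_container": timedelta(minutes=5),
    "multi_container": timedelta(minutes=15),
    "stack_restart": timedelta(minutes=30),
    "data_drive_restore": timedelta(hours=2),
    "bare_metal_rebuild": timedelta(hours=2),
}


def rto_met(scenario: str, detected_at: datetime, verified_at: Optional[datetime]) -> bool:
    """Return True only if verified recovery landed inside the target window.

    verified_at stays None until the automated checks pass, so an unverified
    recovery never counts as meeting the target.
    """
    if verified_at is None:
        return False
    elapsed = verified_at - detected_at
    if elapsed < timedelta(0):
        raise ValueError("verification cannot precede detection")
    return elapsed < TARGET_RTO[scenario]
```

The comparison is strict because the targets are stated as "less than" values.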
Backup architecture
Three-layer strategy, each with independent cadence and retention:
Layer 1 — Local (high-frequency)
- All stateful services (databases, application state, configuration)
- Retention: 7–30 days (engagement-specific)
- Storage: local volume, separate from application data
Layer 2 — Offsite (daily)
- All Layer 1 artifacts, encrypted at rest
- Retention: 90 days
- Storage: S3-compatible object storage, client-specified region
Layer 3 — Version control (continuous)
- All application configuration, infrastructure definitions, custom-built components, sanitized templates
- Retention: permanent (git history)
- Storage: version-controlled repository
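A minimal sketch of how the per-layer retention windows might drive pruning, assuming backups are tracked as a mapping of name to creation time. The `RETENTION` table and `expired` function are hypothetical. Layer 1 retention is engagement-specific, so the 7-day value is only an example, and Layer 3 is omitted because git history is kept permanently.

```python
from datetime import datetime, timedelta

# Retention windows from the layers above. The Layer 1 value is an example
# pick from the 7 to 30 day range, not a fixed standard.
RETENTION = {
    "local": timedelta(days=7),
    "offsite": timedelta(days=90),
}


def expired(backups: dict[str, datetime], layer: str, now: datetime) -> list[str]:
    """Return names of backups older than the layer's window, oldest first."""
    cutoff = now - RETENTION[layer]
    stale = [name for name, taken_at in backups.items() if taken_at < cutoff]
    return sorted(stale, key=backups.__getitem__)
```

Keeping the windows in one table per layer reflects the independent cadence and retention of each layer: changing one does not touch the others.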
Recovery decision tree
Recovery follows a diagnostic decision tree, not a linear checklist:
Something is wrong
│
├─ Is the application UI reachable?
│  ├─ Yes → Specific feature broken?
│  │  ├─ Yes → Isolate: which container/service?
│  │  │  └─ Check logs, restart, verify
│  │  └─ No → Performance degradation?
│  │     └─ Check inference units, resource limits
│  │
│  └─ No → Is any container running?
│     ├─ Yes → Multi-container failure?
│     │  └─ Check dependencies, restart stack
│     └─ No → Host unresponsive?
│        ├─ Yes → Boot drive failure?
│        │  ├─ Yes → Full bare-metal rebuild
│        │  └─ No → Data drive failure?
│        │     ├─ Yes → Restore from backup
│        │     └─ No → Configuration drift?
│        │        └─ Re-apply from version control
│        └─ No → Secrets compromise?
│           └─ Rotate all credentials
Each branch terminates in a specific recovery procedure with explicit verification steps (not just “check if it works”), pre-flight conditions, and post-recovery validation. The decision tree structure is standard. The specific containers, services, and verification commands are engagement-specific.
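One way to make such a tree executable is to hold it as data and walk it with recorded answers. The sketch below encodes only the host-unresponsive branch of the tree above. The `Question` type, the `walk` function, and the folding of the terminal "Configuration drift?" check into its single action are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Question:
    """An internal node: a yes/no diagnostic question."""
    text: str
    yes: "Node"
    no: "Node"


# A leaf is a plain string naming the recovery procedure to run.
Node = Union[Question, str]

HOST_UNRESPONSIVE = Question(
    "Boot drive failure?",
    yes="Full bare-metal rebuild",
    no=Question(
        "Data drive failure?",
        yes="Restore from backup",
        # Assumption: the terminal "Configuration drift?" check is folded
        # into its single documented action.
        no="Re-apply from version control",
    ),
)


def walk(node: Node, answers: dict[str, bool]) -> str:
    """Follow recorded yes/no answers until a leaf procedure is reached."""
    while isinstance(node, Question):
        if node.text not in answers:
            raise KeyError(f"no answer recorded for: {node.text}")
        node = node.yes if answers[node.text] else node.no
    return node
```

Refusing to guess when an answer is missing mirrors the diagnostic intent: the operator has to settle each question before a procedure is selected.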
Verification and stale-check
Pre-recovery checks
- Container count matches expected inventory
- Configuration files match documented state
- Backup freshness within engagement-specific threshold
- Offsite backup reachability
- Secrets bundle integrity
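Two of these checks reduce to small pure functions. The sketch below assumes backup freshness is judged by the age of the newest backup and that bundle integrity is verified against a recorded SHA-256 digest. Both the function names and the choice of SHA-256 are assumptions, since this page does not specify the mechanism.

```python
import hashlib
import hmac
from datetime import datetime, timedelta


def backup_is_fresh(last_backup: datetime, now: datetime, max_age: timedelta) -> bool:
    """True if the newest backup is within the engagement-specific threshold."""
    return now - last_backup <= max_age


def bundle_is_intact(bundle: bytes, expected_sha256: str) -> bool:
    """Compare the bundle's SHA-256 hex digest against the recorded value.

    Assumption: a hex digest was stored when the bundle was created.
    """
    actual = hashlib.sha256(bundle).hexdigest()
    return hmac.compare_digest(actual, expected_sha256.lower())
```

`hmac.compare_digest` is used so the comparison takes constant time, which is a cheap habit when the data involved is secret material.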
Post-recovery checks
- All services reachable and responding
- Database connections established
- Inference units accepting requests
- Tool integrations functional
- Performance within expected bounds
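A minimal sketch of a post-recovery gate, assuming each check is wrapped as a zero-argument callable that returns a boolean. The `run_checks` and `recovery_complete` names are hypothetical. The point it illustrates is that recovery counts as complete only when every check passes.

```python
from typing import Callable

Check = Callable[[], bool]


def run_checks(checks: dict[str, Check]) -> dict[str, bool]:
    """Run every check; an exception counts as a failure, not an abort."""
    results: dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results


def recovery_complete(results: dict[str, bool]) -> bool:
    """Complete only if at least one check ran and all of them passed."""
    return bool(results) and all(results.values())
```

Running all checks instead of stopping at the first failure gives the operator the full picture in one pass.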
Drift detection
- Stale-check script runs before any recovery procedure
- Compares current state against documented state
- Fails hard if drift exceeds tolerance — forces playbook update before execution
- Warns on soft drift (e.g., retired containers, updated versions)
The tooling is generated from the engagement’s source of truth. When components change, the verification suite is regenerated — not manually updated.
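The hard versus soft distinction could be expressed roughly as below. The tolerance policy here is an assumption for the sketch: an undocumented running container is treated as hard drift, while a retired container or a changed version is treated as soft drift, following the examples in the list above. The real thresholds are engagement-specific.

```python
from enum import Enum


class Drift(Enum):
    NONE = "none"
    SOFT = "soft"   # warn and continue
    HARD = "hard"   # refuse to run until the playbook is updated


def classify_drift(documented: dict[str, str], running: dict[str, str]) -> Drift:
    """Compare documented {container: version} against what is running.

    Assumed policy: anything running that the playbook does not know about
    is hard drift; retired containers and version changes are soft drift.
    """
    if set(running) - set(documented):
        return Drift.HARD
    retired = set(documented) - set(running)
    changed = {name for name in running if running[name] != documented[name]}
    return Drift.SOFT if retired or changed else Drift.NONE
```

A caller would abort the recovery procedure on `Drift.HARD` and log a warning on `Drift.SOFT`.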
Recovery scenarios covered
The full playbook covers seven recovery scenarios in detail. Each includes a diagnostic decision tree, explicit pre-flight checks, step-by-step recovery with verification at each stage, automated post-recovery validation, and common failure modes with workarounds.
- Single container failure — isolate, restart, verify
- Multi-container service degradation — dependency analysis, ordered restart
- Stack-wide recovery — full shutdown/startup sequence
- Data drive failure — restore from local backup
- Full bare-metal rebuild — OS reinstall through application restore
- Secrets compromise — credential rotation, access revocation
- Data loss event — restore from offsite backup, verify completeness
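For the ordered-restart scenarios, the start sequence can be derived from the dependency graph with a topological sort. The sketch below uses Python's standard `graphlib` module. The `DEPENDS_ON` edges are hypothetical and do not describe the reference deployment's actual dependencies.

```python
from graphlib import TopologicalSorter

# Hypothetical edges: each service starts only after the services it lists.
DEPENDS_ON = {
    "librechat": {"mongodb", "meilisearch", "litellm"},
    "litellm": {"postgres", "redis"},
    "mongodb": set(),
    "meilisearch": set(),
    "postgres": set(),
    "redis": set(),
}


def start_order(depends_on: dict[str, set[str]]) -> list[str]:
    """Return services with every dependency ahead of its dependents.

    Raises graphlib.CycleError if the graph contains a dependency cycle.
    """
    return list(TopologicalSorter(depends_on).static_order())


def stop_order(depends_on: dict[str, set[str]]) -> list[str]:
    """Shutdown is the reverse: dependents stop before what they rely on."""
    return list(reversed(start_order(depends_on)))
```

The relative order of independent services is not fixed, so callers should rely only on the guarantee that dependencies come first.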
How to access the full playbook
The full 54-page DR Playbook is available during scoping conversations:
- Submit a scope request — cplt.online/contact
- Initial conversation about your requirements and stack
- Standard mutual NDA execution
- Full document shared under NDA
The playbook is part of every CPLT engagement deliverable, not a standalone product.
What you also get in an engagement
Every CPLT engagement ships the documentation set below alongside the production stack. None of these are standalone products — they’re part of the deliverable package, shared under NDA during scoping conversations.
Full DR Playbook
Complete container BOM and dependency graph, systemd unit inventory with restart procedures, SSOT compiler source structure, all 7 DR scenarios with step-by-step commands, bare-metal rebuild procedure (tested < 2h), and stale-check tooling with current verification results.
Architecture Decision Records
The technical decisions that shape your stack — why Docker Compose over Kubernetes, why LiteLLM over direct provider APIs, why SSOT compilation over manual config. Each ADR documents the decision, alternatives considered, and the tradeoff accepted.
Operator Runbook
Day-to-day operations: container restart procedures, log inspection workflows, backup verification, alert triage, performance investigation, and routine maintenance. Written so your team can run the stack independently after Stage 3 handover.
OCR Engine Comparison
Side-by-side comparison of the four OCR engines in the CPLT stack — Vision LLM, Surya, Tesseract, gemma4-ocr. Accuracy benchmarks per document class, speed comparison, routing decision matrix, GPU vs CPU performance notes, and the engagement-specific routing logic for your document mix.
The stack behind the documentation
Every CPLT engagement ships documentation at this standard — plus the production stack it documents. Scoped, built, and handed over with zero retention.