Sovereign OCR: When Scanned PDFs Eat Your AI Pipeline (And Your Compliance Posture)

You bought a private LLM. You routed every chat through your own infrastructure. You wrote a DPA addendum your legal team actually signed. And then you piped your documents through a SaaS OCR API to extract the text — because nobody on your team wanted to debug Tesseract at 2 a.m.

Congratulations: the most sensitive part of your document workflow is now the leakiest part of your stack.

Every contract, every patient record, every internal memo passed through that OCR API is now in a third party's logs, retention store, and incident-response perimeter. Your sovereign LLM dream has a backdoor — and your auditors will eventually find it.

This post is about how to actually close that hole. Not in theory, with a "just self-host everything" hand-wave, but in practice: which open-weight components hold up under regulated workloads, where the in-memory contract has to be enforced at the container level (not the application level), and what an auditable purge looks like when "delete" has to mean provably gone.

Why most "self-hosted AI" stacks aren't

Walk into any RAG pipeline diagram from 2024–25 and you'll see the same shape: a vector store, an LLM, an embedding model — and a thin grey arrow labelled "ingestion" pointing into the side, with no detail.

That arrow is doing a lot of unspoken work. In the median deployment we audit, that ingestion path:

1. Uploads PDFs to AWS Textract, Google Document AI, or Azure Document Intelligence for OCR.
2. Receives the extracted text back as JSON.
3. Embeds the text using a third-party embedding API (often the same vendor's).
4. Persists the embeddings in a vector DB.

Steps 1 and 3 send the regulated payload to a vendor outside your perimeter. Step 4 keeps a permanent embedding of that payload, and for short documents the original text is partially recoverable from the embedding. The "self-hosted" LLM at the end of the pipeline is irrelevant; the data has already left the building.

The GDPR Article 28 question your DPO asks ("name every processor that touches this data") has a longer answer than you advertised.
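One way to make that answer concrete is to write the ingestion path down as data and let a few lines of code list who actually sees document content. The sketch below is illustrative only: the `Stage` type and the inventory entries are assumptions that mirror the four steps above, not a description of any particular deployment.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stage:
    name: str
    processor: str           # who operates this step
    external: bool           # True if the payload leaves your perimeter
    sees_raw_content: bool   # True if the step receives document text or bytes


# Hypothetical inventory mirroring the four ingestion steps described above.
INGESTION_PATH = [
    Stage("ocr", "SaaS OCR API", external=True, sees_raw_content=True),
    Stage("ocr_result", "own ingestion service", external=False, sees_raw_content=True),
    Stage("embedding", "third-party embedding API", external=True, sees_raw_content=True),
    Stage("vector_store", "own vector DB", external=False, sees_raw_content=False),
]


def external_processors(path: List[Stage]) -> List[str]:
    """Return the distinct external processors that receive document content."""
    seen: List[str] = []
    for stage in path:
        if stage.external and stage.sees_raw_content and stage.processor not in seen:
            seen.append(stage.processor)
    return seen


if __name__ == "__main__":
    for name in external_processors(INGESTION_PATH):
        print("processor to declare:", name)
```

If that list is non-empty, the DPA list has to grow, regardless of where the LLM runs.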

The non-negotiable rule

A sovereign document AI pipeline has exactly one rule that everything else hangs off:

Raw document bytes never persist outside the active job's RAM.

Not "encrypted at rest." Not "deleted on completion." Not "stored in a region you control." Those are all weaker than what regulated document workflows actually need. The bytes go in, get processed, and the only thing that survives the job is the project-scoped retrieval index — the embeddings — written to disk under a lifecycle you control end-to-end.

Building to this rule constrains the architecture in three places, in this order: the OCR stage, the embedding stage, and the cleanup contract.
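At the application level, the rule can be approximated along these lines. This is a minimal sketch under stated assumptions: `process_in_memory` and the `extract_text` callable are hypothetical names, and Python cannot guarantee that a library made no hidden copies of the buffer. That limitation is the reason the real enforcement belongs at the container level (read-only filesystem, no swap), with code like this as a second line of defence.

```python
import hashlib
from typing import Callable, Tuple


def process_in_memory(
    buf: bytearray,
    extract_text: Callable[[memoryview], str],
) -> Tuple[str, str]:
    """Run extraction over an in-memory buffer, then overwrite the buffer.

    Returns the extracted text and the SHA-256 of the input, so the audit
    trail can reference the document without retaining it.
    """
    digest = hashlib.sha256(buf).hexdigest()
    try:
        # A memoryview gives the extractor zero-copy access to the bytes.
        with memoryview(buf) as view:
            text = extract_text(view)
    finally:
        # Overwrite in place, even if extraction raised. This clears this
        # buffer only; copies made inside extract_text are out of reach.
        buf[:] = bytes(len(buf))
    return text, digest
```

The caller hands over a `bytearray` rather than `bytes` because an immutable object cannot be overwritten after use.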

Stage 1: OCR — Surya for layout, MiniCPM-V for visual reasoning

The naive answer is "run Tesseract." Tesseract is fine for clean, English-language, single-column text. It is not fine for:

Multi-column legal pleadings
Scanned medical records with handwriting
Financial statements with embedded tables
Documents in non-Latin scripts
Anything with a stamp, a signature, or a scribbled annotation

This is where most on-prem OCR efforts fail and quietly route the "hard" documents back to a SaaS vendor — defeating the whole point. The fix is a two-engine pipeline.

Surya is the workhorse. It's a modern open-weight OCR stack that handles layout-aware extraction (column ordering, tables, reading order) on a single consumer GPU. It speaks ~90 languages, runs as a Python service you can dockerise, and produces structured output (text + bounding boxes + reading order) — not just a flat string.

MiniCPM-V (or a similarly-sized vision-language model) is the escalation path. When Surya returns low-confidence regions, hits handwriting, or encounters a form whose layout heuristics don't apply, the document — or just the relevant page region — is handed to an open-weight VLM running through vLLM for high-throughput inference. The VLM reads the page like a human does: "this is a signature block, this is a date stamp, this hand-written annotation says 'rejected'."
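The escalation decision itself is small. Here is a rough sketch, assuming the first-pass engine's output has been normalised into a hypothetical `Region` record; this is not Surya's actual output schema, and the 0.85 threshold is a placeholder to tune on your own documents.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Region:
    page: int
    text: str
    confidence: float          # 0.0 to 1.0, as reported by the first-pass engine
    is_handwriting: bool = False


CONFIDENCE_FLOOR = 0.85  # assumed value, not a recommendation


def split_for_escalation(
    regions: List[Region],
    floor: float = CONFIDENCE_FLOOR,
) -> Tuple[List[Region], List[Region]]:
    """Partition regions into (accepted as-is, escalate to the VLM)."""
    accepted: List[Region] = []
    escalate: List[Region] = []
    for region in regions:
        if region.is_handwriting or region.confidence < floor:
            escalate.append(region)
        else:
            accepted.append(region)
    return accepted, escalate
```

Escalating per region rather than per document keeps the expensive VLM calls limited to the parts of the page that need them.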

Both engines run as decoupled MCP tools behind your assistant, which means three things for operations:

You can scale OCR independently of the rest of the stack (throw GPUs at it for a quarterly review, tear them down after).
You can swap Surya for <better-engine-launching-next-quarter> without touching the LLM, the embedding model, or the vector store.
Each tool gets its own audit log — who triggered an OCR job, which document, when, what came out.
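A per-tool audit entry along those lines might look like the sketch below. The field names and the `ocr_audit_record` helper are assumptions for illustration; the point is that the record identifies the document by hash and describes the output by size, and never contains page text.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Optional


def ocr_audit_record(
    operator: str,
    project_id: str,
    document_bytes: bytes,
    page_count: int,
    chars_extracted: int,
    now: Optional[datetime] = None,
) -> str:
    """Build one JSON audit line for an OCR job, with no document content."""
    now = now or datetime.now(timezone.utc)
    record = {
        "event": "ocr_job",
        "timestamp": now.isoformat(),
        "operator": operator,                  # who triggered the job
        "project_id": project_id,
        "document_sha256": hashlib.sha256(document_bytes).hexdigest(),  # which document
        "page_count": page_count,              # what came out, by size only
        "chars_extracted": chars_extracted,
    }
    return json.dumps(record, sort_keys=True)
```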

If your current OCR is a boto3.client('textract') call, none of those properties hold. You can't scale it independently, you can't swap it, and your audit log lives in someone else's S3 bucket.

Stage 2: Embedding — local model, project-scoped index

Once Surya + MiniCPM-V have produced clean text, the temptation is to call OpenAI's text-embedding-3-large because it's cheap and excellent. Don't. Embedding APIs see the full payload too. The vector you store is a byproduct of a request that revealed the document.

Run nomic-embed-text-v1.5 (or bge-large-en-v1.5, or your team's preferred open-weight model) on the same hardware as the OCR stack. It's an order of magnitude smaller than the LLM and runs comfortably on the same GPU during off-peak hours, or on a separate small GPU that does nothing else.

Two architectural notes that matter more than the choice of model:

Project-scoped isolation. A user reviewing Project A's documents must not be able to surface content from Project B. The concern is not only that someone might leak it: regulated workflows often have legal walls between matters even within the same firm. Enforce this below the UI. At minimum, every vector carries a project tag and every query is filtered by the requesting user's authorised project list. A tag filter still lives in application code, though, and an auditor will reasonably ask what happens when that code has a bug. If isolation lives in the storage layer (separate collections per project), the answer is "nothing," because the wrong vectors are not reachable.
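As a toy illustration of the storage-layer variant, the sketch below keeps one collection per project, so a query can only ever scan the collection it names. `ProjectScopedIndex` is a hypothetical in-memory stand-in, not the API of any real vector database; a production store would map each project to its own collection or namespace.

```python
from typing import Dict, List, Sequence, Set, Tuple


def _dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


class ProjectScopedIndex:
    """One collection per project; no code path scans across collections."""

    def __init__(self) -> None:
        self._collections: Dict[str, List[Tuple[List[float], str]]] = {}

    def add(self, project_id: str, vector: Sequence[float], chunk_id: str) -> None:
        self._collections.setdefault(project_id, []).append((list(vector), chunk_id))

    def search(
        self,
        project_id: str,
        authorised: Set[str],
        query: Sequence[float],
        k: int = 3,
    ) -> List[str]:
        # Authorisation is checked before any vector is touched.
        if project_id not in authorised:
            raise PermissionError(f"not authorised for project {project_id!r}")
        rows = self._collections.get(project_id, [])
        ranked = sorted(rows, key=lambda row: _dot(query, row[0]), reverse=True)
        return [chunk_id for _, chunk_id in ranked[:k]]
```

Even if the authorisation check were removed by mistake, a search against Project A would still only read Project A's collection.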

The retrieval index is the only thing that survives. Surya's output? RAM. MiniCPM-V's intermediate tensors? RAM. The full extracted text used to compute embeddings? Held just long enough to chunk + embed, then dropped. Only the embeddings + their chunk metadata are written to disk. This is the line where "we processed it" stops and "we retain it" starts — and you should be able to point at the line in code.
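In code, that line can be as short as the function below. This is a sketch under assumptions: `embed` is any callable wrapping your local embedding model, the chunk sizes are placeholders, and the metadata fields (index, offsets, hash) are one possible reading of "chunk metadata", chosen here so that no chunk text reaches the persisted records.

```python
import hashlib
from typing import Callable, Dict, List, Sequence


def chunk_spans(length: int, size: int = 800, overlap: int = 100) -> List[range]:
    """Return character ranges covering a text of the given length."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    spans: List[range] = []
    start = 0
    while start < length:
        end = min(start + size, length)
        spans.append(range(start, end))
        if end == length:
            break
        start += size - overlap
    return spans


def embed_and_discard(
    text: str,
    embed: Callable[[str], Sequence[float]],
    project_id: str,
) -> List[Dict[str, object]]:
    """Chunk and embed the text; return only what is allowed to persist."""
    records: List[Dict[str, object]] = []
    for i, span in enumerate(chunk_spans(len(text))):
        chunk = text[span.start:span.stop]
        records.append({
            "project_id": project_id,
            "chunk_index": i,
            "char_start": span.start,
            "char_end": span.stop,
            "chunk_sha256": hashlib.sha256(chunk.encode("utf-8")).hexdigest(),
            "vector": list(embed(chunk)),
        })
    # The extracted text goes out of scope here; only records are persisted.
    return records
```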

Stage 3: The cleanup contract

This is the part most "sovereign AI" architectures get wrong, because it's the unsexy one.

When a project ends (a matter closes, a contract expires, a patient is discharged), the embeddings and every derived artifact must be provably deleted. Not "marked as deleted in a tombstone column." Not "expired by a TTL we hope ran." Provably gone, with a log entry and an empty-result verification your auditor can check.

Concretely:

The vector store collection for that project is dropped (not soft-deleted).
Any cached chunk metadata in Redis / SQLite is purged with explicit DELETE + verification read returning empty.
The OCR job logs are rotated — you keep the fact that processing happened (date, operator, document count) for your audit trail, but not the document content or the page text.
A signed log entry records the purge: "project X embeddings purged at T, by operator U, returning empty-set verification."
Backups are rolled forward past the deletion point on the next cycle (which means your backup retention has to be designed around this — see our DR posture summary).
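The first four items can be sketched as one function. Everything here is a simplification under stated assumptions: plain dicts stand in for the vector store and the Redis / SQLite cache, the `project_id:` key prefix is a hypothetical convention, and HMAC-SHA256 with a shared key stands in for "signed" (an asymmetric signature held by a separate party would be stronger).

```python
import hashlib
import hmac
import json
from datetime import datetime, timezone
from typing import Dict, Optional


def purge_project(
    store: Dict[str, list],
    cache: Dict[str, str],
    project_id: str,
    operator: str,
    signing_key: bytes,
    now: Optional[datetime] = None,
) -> Dict[str, object]:
    """Drop a project's data, verify nothing remains, return a signed entry."""
    prefix = f"{project_id}:"
    store.pop(project_id, None)                    # hard drop, no tombstone
    for key in [k for k in cache if k.startswith(prefix)]:
        del cache[key]

    # Verification read: fail loudly if anything is still reachable.
    leftovers = [k for k in cache if k.startswith(prefix)]
    if project_id in store or leftovers:
        raise RuntimeError(f"purge of {project_id!r} left data behind")

    entry = {
        "event": "project_purged",
        "project_id": project_id,
        "operator": operator,
        "timestamp": (now or datetime.now(timezone.utc)).isoformat(),
        "verification": "empty-set",
    }
    payload = json.dumps(entry, sort_keys=True).encode("utf-8")
    signature = hmac.new(signing_key, payload, hashlib.sha256).hexdigest()
    return {"entry": entry, "hmac_sha256": signature}


def verify_purge_entry(signed: Dict[str, object], signing_key: bytes) -> bool:
    """Recompute the HMAC over the entry and compare in constant time."""
    payload = json.dumps(signed["entry"], sort_keys=True).encode("utf-8")
    expected = hmac.new(signing_key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, str(signed["hmac_sha256"]))
```

The backup item is not covered by this sketch; it is a property of the backup schedule rather than of the purge call.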

The difference between "we deleted it" and "we can prove we deleted it" is the difference between a five-minute audit conversation and a six-month investigation. Most pipelines optimise for the wrong one.

What this buys you, in board-meeting language

If you've followed the architecture above, your DPO can say all of these things truthfully:

No third-party data processor for document content. The DPA list does not grow when this pipeline is adopted. Whatever processors you had before, you still have. No new ones.
Complete chain of custody. Ingestion, OCR, embedding, query, and purge are each logged with timestamps, project IDs, and the operator account that initiated them. The auditor doesn't need to scrape it together from five vendors' dashboards — it's one log stream.
Project-scoped erasure on demand. When a matter closes, the regulator asks for a deletion attestation, or a client exercises their GDPR Article 17 rights, you have a button — not a six-week project.
Independent scaling. A quarterly compliance review with a 10× document spike doesn't trigger a renegotiation of a SaaS contract. You add GPUs for two weeks and remove them.
No vendor kill-switch. Surya, MiniCPM-V, the embedding model, the vector store — every layer has at least two replaceable open-weight options. A vendor pulling a model from public availability does not break your pipeline.

The honest caveats

A few things this architecture does not do, and you should know about them up front:

It is not "set up in an afternoon." A production-grade ingestion pipeline with the cleanup contract above takes 2–4 weeks of engineering for a competent team that knows what they're building. If your team has never run a GPU workload, double that. This is the trade-off for sovereignty: you stop renting and you start operating.

It is not faster than Textract. Cloud OCR services are exceptional at cold-start latency because they have warm capacity sitting on tap. A self-hosted Surya + MiniCPM-V stack on shared GPUs pays a cold-start cost. For interactive single-document workflows, you're typically 1.5–3× slower per document. For batch ingestion of a corpus, the gap closes (or reverses, when you're not paying per-page).

It is not a frontier-class VLM. MiniCPM-V is excellent, but it's not GPT-4o. For documents where you genuinely need state-of-the-art visual reasoning (say, pixel-level layout extraction from a hand-drawn diagram), you may need a hybrid posture where unclassified documents go to a frontier API and regulated ones stay in-house. That's a perfectly reasonable design; just be explicit about the data classification gate.
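An explicit gate can be very small. The sketch below assumes a two-level classification and hypothetical backend names; the one design choice worth copying is that it fails closed, so a document with a missing or unknown label never leaves the building.

```python
from enum import Enum
from typing import Optional


class Classification(Enum):
    UNCLASSIFIED = "unclassified"
    REGULATED = "regulated"


def choose_backend(classification: Optional[Classification]) -> str:
    """Route a document to a VLM backend based on its classification."""
    if classification is Classification.UNCLASSIFIED:
        return "frontier_api"
    # REGULATED, None, or anything unexpected stays in-house.
    return "in_house"
```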

What we do here

Most of the engagements we run at CPLT for regulated mid-market clients include some version of this pipeline — because the moment a buyer says "we can't put our documents in OpenAI", the ingestion path is the first thing that fails the audit. The Sovereign OCR stack is one of the add-ons we scope per-engagement: Surya + MiniCPM-V + vLLM + nomic-embed + project-scoped retrieval, all packaged as MCP tools your assistant can call, with the cleanup contract tested and runbook'd as part of Stage 2.

If your existing pipeline still routes documents through a SaaS OCR API and you're not sure how to close that gap without a 6-month project: tell us what you're working with. We'll respond within 5 business days with a written scope — or an honest "no" if the shape doesn't fit.

Want the full Architecture Decision Matrix before you build? Our 8-page PDF covers build-vs-buy economics, hardware sizing tables, and a vendor-neutral RFP template you can use against any provider in the space (including us).