Spinning up a local Large Language Model is trivial today. A Docker Compose file, a consumer GPU, and you have a private AI assistant. The self-hosted AI dream is real — until you invite your team to use it.
Past roughly 25 registered users — or 5 to 8 concurrent requests — the physics of bare-metal infrastructure asserts itself.
Most tutorials and "AI experts" will tell you the answer is more VRAM. That's a fundamental misunderstanding of hardware architecture. You aren't running out of compute. Your infrastructure is collapsing under mismanaged state, I/O bottlenecks, and scheduler contention — problems that no amount of GPU shopping will fix.
If you're building production-grade, privacy-first AI infrastructure, you need an architectural perspective. Here are the three failure modes that will crash your self-hosted LLM at scale, the real numbers behind them, and why the standard cloud-native playbook won't save you on bare metal.
1. KV Cache Fragmentation — The Memory Bleed
The myth: You need 80 GB of VRAM because the model weights are huge.
The reality: Your model weights are static. Your KV cache is dynamic, and it's fragmenting VRAM into unusable Swiss cheese.
Every time a user generates a token, the inference engine stores attention tensors — the Key-Value cache — in VRAM so it doesn't recompute the entire prompt on the next token. The numbers are not small.
On a 20B parameter Llama-3-style model (8 KV heads via Grouped-Query Attention) with 32K context and f16 KV cache, a single user slot consumes approximately 5.2 GB of VRAM in attention state alone — on top of the ~15 GB of model weights at Q5 quantization. Run two concurrent slots and you need 25+ GB just for one inference instance. Older architectures without GQA (40+ KV heads) push this past 25 GB per slot. Now imagine 8 concurrent requests with wildly different context lengths: a 200-token chat query next to a 28,000-token document summarization.
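The per-slot figure follows from the cache shape. Here is a minimal sketch of that arithmetic, assuming a hypothetical 40-layer model with a head dimension of 128 (neither value is stated above, so the result only approximates the quoted numbers):

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache one sequence holds at full context.

    The leading factor of 2 covers the separate key and value tensors;
    bytes_per_value=2 corresponds to an f16 cache.
    """
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_value


# Assumed model shape (not from the article): 40 layers, head_dim 128.
CONTEXT = 32 * 1024
gqa = kv_cache_bytes(layers=40, kv_heads=8, head_dim=128, context_tokens=CONTEXT)
mha = kv_cache_bytes(layers=40, kv_heads=40, head_dim=128, context_tokens=CONTEXT)

print(f"GQA, 8 KV heads:  {gqa / 1e9:.1f} GB per slot")
print(f"MHA, 40 KV heads: {mha / 1e9:.1f} GB per slot")
```

Under these assumptions the two cases come out at about 5.4 GB and 26.8 GB per slot. A different layer count or head dimension shifts both numbers proportionally.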
Standard inference servers allocate VRAM in contiguous blocks. When User A finishes a short prompt and releases 800 MB, and User B requests a 4 GB context window, that freed 800 MB is useless — it's the wrong shape. You might have 20 GB of technically free VRAM, but because it's heavily fragmented, the engine throws an Out-Of-Memory error and kills the container.
At 2 concurrent requests, this is invisible. At 8+ concurrent requests with unpredictable conversational lengths, it's the single most common cause of "mysterious" OOM crashes on self-hosted stacks.
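The failure is easy to reproduce in a toy model. The sketch below assumes a simple first-fit contiguous allocator over 12 abstract VRAM units. It is an illustration of the fragmentation pattern, not the allocator of any particular inference server:

```python
from typing import List, Optional


def allocate_contiguous(vram: List[Optional[str]], owner: str, size: int) -> bool:
    """First-fit: claim the first run of `size` adjacent free units, or fail."""
    run = 0
    for i, unit in enumerate(vram):
        run = run + 1 if unit is None else 0
        if run == size:
            for j in range(i - size + 1, i + 1):
                vram[j] = owner
            return True
    return False


def release(vram: List[Optional[str]], owner: str) -> None:
    """Free every unit held by `owner`."""
    for i, unit in enumerate(vram):
        if unit == owner:
            vram[i] = None


vram: List[Optional[str]] = [None] * 12  # unit size is arbitrary
for owner, size in [("A", 2), ("B", 3), ("C", 2), ("D", 3), ("E", 2)]:
    allocate_contiguous(vram, owner, size)
for owner in ("A", "C", "E"):  # the short requests finish first
    release(vram, owner)

free_units = vram.count(None)
fits = allocate_contiguous(vram, "F", 4)
print(f"free units: {free_units}, 4-unit request satisfied: {fits}")
```

Six units are free, yet the 4-unit request fails because no free run is longer than two units.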
The fix: PagedAttention and dynamic memory pooling
Move away from naive inference servers. Deploy engines built for high-concurrency memory management, such as vLLM, which implements PagedAttention. PagedAttention treats VRAM the way an operating system treats virtual memory: the KV cache is split into fixed-size blocks that are allocated on demand, need not be contiguous, and return to a shared pool as soon as a sequence finishes. The original Kwon et al. paper reports a 2–4× throughput improvement over FasterTransformer and Orca at the same latency. That is headroom you reclaim before buying another GPU. (How CPLT integrates vLLM into the broader stack →)
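The idea can be sketched with a block table per sequence. This is a simplified illustration with hypothetical names (`PagedKVPool`, `block_tables`), not vLLM's actual implementation:

```python
from typing import Dict, List


class PagedKVPool:
    """Toy block-level KV allocator: any free block can serve any sequence."""

    def __init__(self, num_blocks: int) -> None:
        self.free_blocks: List[int] = list(range(num_blocks))
        self.block_tables: Dict[str, List[int]] = {}

    def allocate(self, seq_id: str, num_blocks: int) -> bool:
        """Hand out blocks from the free pool; adjacency is irrelevant."""
        if num_blocks > len(self.free_blocks):
            return False
        blocks = [self.free_blocks.pop() for _ in range(num_blocks)]
        self.block_tables.setdefault(seq_id, []).extend(blocks)
        return True

    def release(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))


pool = PagedKVPool(num_blocks=12)
for seq_id, size in [("A", 2), ("B", 3), ("C", 2), ("D", 3), ("E", 2)]:
    pool.allocate(seq_id, size)
for seq_id in ("A", "C", "E"):
    pool.release(seq_id)

ok = pool.allocate("F", 4)
print(ok, sorted(pool.block_tables["F"]))
```

The 4-block request succeeds even though the six free blocks are scattered in three separate pairs, because the block table maps a sequence's logical positions onto whatever physical blocks are free.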
A minimal production-ready vLLM serve command for a 24 GB GPU:
vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
--max-num-seqs 16 \
--gpu-memory-utilization 0.92 \
--enable-prefix-caching \
--max-model-len 8192
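A rough capacity check for the flags above, as a sketch under stated assumptions: the GPU exposes 24 GiB, the 16-bit weights take about 16 GB, and the cache shape is Llama-3-8B's (32 layers, 8 KV heads, head dimension 128). Activation memory and runtime overhead are ignored, so the real budget is smaller:

```python
GIB = 2**30


def kv_token_budget(gpu_bytes: int, utilization: float,
                    weight_bytes: int, bytes_per_token: int) -> int:
    """Tokens of KV cache that fit after the weights are loaded."""
    usable = int(gpu_bytes * utilization) - weight_bytes
    return max(usable, 0) // bytes_per_token


# K and V tensors * layers * KV heads * head_dim * 2 bytes (16-bit cache)
bytes_per_token = 2 * 32 * 8 * 128 * 2
weight_bytes = 16 * 10**9  # assumption: ~8B parameters at 2 bytes each

budget = kv_token_budget(24 * GIB, 0.92, weight_bytes, bytes_per_token)
per_sequence = budget // 16  # --max-num-seqs 16

print(f"KV budget: {budget} tokens, {per_sequence} per sequence at 16 sequences")
```

Under these assumptions the budget covers a few thousand tokens per sequence, not 16 sequences at the full 8,192. That is acceptable when context lengths are mixed, since vLLM queues or preempts sequences once the block pool runs out, but it is worth knowing before promising every user a full-length context.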
For smaller stacks running llama.cpp, the server's cache_reuse configuration parameter enables KV prefix sharing across requests, reducing redundant computation on shared prompt prefixes. The tradeoff is that llama.cpp's KV management is simpler than vLLM's — it works well for moderate concurrency (3–5 concurrent slots) but wasn't designed for the regime where PagedAttention's block-level allocation pays off.
The architectural principle is the same: treat VRAM like virtual memory, not like a fixed-size array.
2. The Context Window I/O Trap
The myth: LLM performance is entirely bottlenecked by GPU TFLOPS.
The reality: RAG at scale stresses your storage I/O and PCIe bus long before your GPU breaks a sweat.
When you connect your LLM to a local filesystem — using a Model Context Protocol (MCP) server to ingest enterprise documents, OCR pipelines to extract text from scanned PDFs, or embedding models for vector search — you're moving massive amounts of text into the context window. Every token that enters the prompt had to be read from somewhere.
Here's what 25 concurrent RAG queries actually looks like on the I/O path:
- Each user triggers a vector search across a document corpus — say, 200 PDFs totalling 1.4 GB of extracted text
- The retrieval layer reads, chunks, and embeds document fragments — that's thousands of small random reads hitting your NVMe simultaneously
- The top-k results (typically 5–10 chunks per query) get assembled into a prompt — 25 users × 10 chunks × ~2,000 tokens each = 500,000 tokens queued for inference
- The GPU sits idle, waiting for the CPU to finish assembling prompts
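The backlog arithmetic from the list above, as a small sketch. The prefill rate used to turn tokens into seconds is a hypothetical placeholder, not a measured value:

```python
def queued_prompt_tokens(users: int, chunks_per_query: int, tokens_per_chunk: int) -> int:
    """Total prompt tokens waiting when every user submits one RAG query."""
    return users * chunks_per_query * tokens_per_chunk


def drain_seconds(tokens: int, prefill_tokens_per_second: float) -> float:
    """Time to prefill the backlog at a given aggregate prefill rate."""
    return tokens / prefill_tokens_per_second


tokens = queued_prompt_tokens(users=25, chunks_per_query=10, tokens_per_chunk=2000)
# Hypothetical aggregate prefill rate; substitute a number measured on your stack.
seconds = drain_seconds(tokens, prefill_tokens_per_second=5000)

print(f"{tokens} tokens queued, about {seconds:.0f} s to drain at the assumed rate")
```

Whatever the real rate is, the backlog only starts draining once the prompts have been assembled, which is why the retrieval path sets the floor on latency.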
Consumer NVMe SSDs peak around 1M IOPS at QD32 on synthetic benchmarks, but a real RAG workload typically saturates at 500K–800K sustained IOPS with materially worse tail latency. That workload is small random reads, mixed with LUKS encryption overhead (common in EU-regulated environments for data-at-rest compliance) and competing with logging, monitoring, and document ingestion writes. Access-time updates compound the contention: ext4's default relatime still writes metadata the first time a file is read after it changes and at least once every 24 hours, and strictatime writes on every read. Mounting with noatime removes those writes entirely.
The GPU isn't the bottleneck. The storage subsystem and its tail latency are. And if your inference engine's request queue fills up while waiting for I/O, your token-per-second rate collapses even though the GPU has capacity to spare.
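Two host settings that influence this tail latency are the atime mount option and the block-layer I/O scheduler. The sketch below parses the text of `/proc/mounts` and `/sys/block/<device>/queue/scheduler`; the sample strings and the device and mount names in them are made up for illustration:

```python
from typing import List, Sequence


def active_scheduler(sysfs_text: str) -> str:
    """Return the bracketed (active) entry, e.g. 'none' from '[none] mq-deadline'."""
    for word in sysfs_text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return sysfs_text.strip()


def mounts_without_noatime(proc_mounts_text: str,
                           fstypes: Sequence[str] = ("ext4", "xfs", "btrfs")) -> List[str]:
    """List mount points of the given filesystem types that lack noatime."""
    flagged = []
    for line in proc_mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        _device, mountpoint, fstype, options = fields[:4]
        if fstype in fstypes and "noatime" not in options.split(","):
            flagged.append(mountpoint)
    return flagged


# Sample input; on a real host read the two files instead.
sample_mounts = (
    "/dev/nvme0n1p2 / ext4 rw,noatime 0 0\n"
    "/dev/mapper/data /data ext4 rw,relatime 0 0\n"
    "tmpfs /run tmpfs rw,nosuid 0 0\n"
)
print(active_scheduler("[none] mq-deadline kyber"))
print(mounts_without_noatime(sample_mounts))
```

On a live system, pass in `open("/proc/mounts").read()` and the contents of the scheduler file for the NVMe device in question.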
The fix: Async, multi-tiered ingestion pipelines
Your storage layer and PCIe lane topology must be mapped before you design the ingestion path:
- Heavy lifting offloaded to dedicated workers. OCR, document parsing, and embedding should run in isolated containers or processes — not in the same event loop that serves inference requests.
- Pre-cached markdown over raw PDF. Store extracted, chunked text in a fast retrieval layer (PostgreSQL with pgvector, or a dedicated vector DB) so the inference path never touches the filesystem directly. (CPLT's OCR + ingestion architecture →)
- NVMe mount tuning. noatime, asynchronous discard (discard=async is a Btrfs mount option; ext4 has no async variant, so a periodic fstrim fills that role there), and proper I/O scheduler selection (none for NVMe, not mq-deadline) eliminate unnecessary metadata writes and reduce tail latency.
- PCIe lane awareness. On multi-GPU setups, lane sharing between the NVMe controller and the GPU can silently halve your bandwidth. Check your motherboard's PCIe topology before assuming your hardware can sustain concurrent I/O and inference.
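The first two points can be sketched in a few lines. This is a minimal illustration with hypothetical names: `extract_text` stands in for OCR or PDF parsing, a dict stands in for the retrieval store, and a substring match stands in for vector search. A thread pool keeps the example short; separate processes or containers, as described above, are the better fit for CPU-heavy extraction:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def extract_text(document: str) -> str:
    """Hypothetical stand-in for OCR or PDF parsing (slow in real life)."""
    return document.lower()


async def ingest(documents: List[str], cache: Dict[str, str], workers: int = 2) -> None:
    """Run extraction in a worker pool so the event loop stays responsive."""
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        jobs = [loop.run_in_executor(pool, extract_text, doc) for doc in documents]
        texts = await asyncio.gather(*jobs)
    cache.update(zip(documents, texts))


async def retrieve(query: str, cache: Dict[str, str]) -> List[str]:
    """Inference-path lookup: reads only pre-extracted text, never raw files."""
    needle = query.lower()
    return [text for text in cache.values() if needle in text]


cache: Dict[str, str] = {}
asyncio.run(ingest(["Q3 Audit Report", "Onboarding Memo"], cache))
hits = asyncio.run(retrieve("audit", cache))
print(hits)
```

The design point is the split: ingestion can be as slow as it needs to be, because the request path only ever reads from the pre-built cache.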
The architectural principle: the data path to the context window is as important as the model itself.
3. Kernel Scheduler Thrashing
Fairness is the enemy of throughput in asymmetric workloads.
The myth: "It's a Docker container issue. Give it more CPU limits."
The reality: The Linux scheduler's fairness-first design is the opposite of what a high-concurrency LLM workload needs.
Self-hosted LLMs are not traditional web servers. They are heavily asymmetric workloads: a GPU doing heavy matrix math, dependent on CPU threads to feed it instructions, surrounded by websockets, API gateways, embedding models, and orchestration containers — all competing for CPU time.
A Linux context switch takes 1–10 microseconds, not milliseconds as some tutorials claim. But the damage isn't the switch itself. It's what happens after: CPU cache pollution. When the scheduler preempts your inference orchestrator to run a background logging task, nothing flushes the cache outright, but the logging task's working set evicts the orchestrator's lines from L1/L2. When the orchestrator gets the core back, it has to reload its working set: L3 hits are ~30–50 nanoseconds, local DRAM ~100 nanoseconds, and cross-NUMA-socket reloads exceed 200 nanoseconds. A hot inference loop touches thousands of cache lines per token.
Multiply this micro-stutter by every concurrent inflight request, and your effective throughput drops while your tail latency spikes. The GPU pipeline stalls waiting for the CPU to finish feeding it the next batch.
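To put the reload cost next to the switch cost, here is a back-of-the-envelope sketch. It assumes a hypothetical working set of 4,000 cache lines and treats every reload as a serial miss, which makes it an upper bound: real hardware overlaps misses and prefetches, so the measured cost will be lower.

```python
def refill_microseconds(cache_lines: int, ns_per_line: float) -> float:
    """Worst-case serial cost of reloading an evicted working set."""
    return cache_lines * ns_per_line / 1000


WORKING_SET_LINES = 4000  # hypothetical; "thousands of cache lines"

for source, latency_ns in [("L3", 40), ("local DRAM", 100), ("remote NUMA", 200)]:
    cost = refill_microseconds(WORKING_SET_LINES, latency_ns)
    print(f"reload from {source}: up to {cost:.0f} microseconds")
```

Even as a loose bound, the refill runs to hundreds of microseconds, one to two orders of magnitude above the 1–10 microsecond switch itself.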
This gets worse on containerized stacks. Default Docker configurations put every container on the same CPU scheduler domain. A noisy neighbor — a Prometheus scrape, a healthcheck, a log rotation — can preempt your inference thread at the worst possible moment.
The fix: Core isolation, cgroup weights, and scheduler tuning
Deep Linux kernel tuning becomes mandatory at this scale. The single highest-leverage change — isolating CPU cores for inference at boot time — is one line in your bootloader config:
# /etc/default/grub — reserve cores 4-15 for the inference hot path
GRUB_CMDLINE_LINUX_DEFAULT="isolcpus=4-15 nohz_full=4-15 rcu_nocbs=4-15"
# Then: update-grub && reboot
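This is documented in the Linux kernel admin guide. Background tasks land on cores 0–3, and the scheduler places nothing on cores 4–15 unless you explicitly pin work there. Device interrupts and per-CPU kernel threads can still run on the isolated cores, so steer IRQ affinity away from them as well. The remaining tuning layers build on this foundation: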
- cpuset cgroups to pin your inference orchestrator and critical I/O threads to the isolated cores. Background tasks (monitoring, logging, health checks) run on the remaining cores and are physically prevented from preempting the hot path.
- cpu.weight instead of hard CPU limits. Hard limits cause throttling, which is exactly the wrong behavior when an inference burst arrives. Weight gives inference containers 4–8× the scheduling priority of auxiliary services without capping them.
- SCHED_FIFO for the GPU-feeding threads. Surgical real-time scheduling: only the threads that feed the GPU, not the entire process. They're never preempted by normal-priority tasks.
- NUMA-aware placement. On multi-socket systems, ensure inference processes run on the NUMA node closest to the GPU. Cross-socket memory access adds 100+ nanoseconds per access, which is invisible at low load and visible under concurrency.
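Pinning and real-time priority can also be applied from inside the process. The sketch below uses Python's Linux-only `os.sched_*` calls; the helper names are hypothetical, and the real-time step needs CAP_SYS_NICE or root, so it reports failure when the privilege is missing:

```python
import os
from typing import Set


def parse_cpu_list(spec: str) -> Set[int]:
    """Expand a kernel-style CPU list such as '4-15' or '0,2,4-7'."""
    cpus: Set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def pin_current_thread(cpus: Set[int]) -> Set[int]:
    """Restrict the calling thread to `cpus` (pid 0 means the caller)."""
    target = cpus & os.sched_getaffinity(0)
    if not target:
        raise ValueError("none of the requested cores are available to this thread")
    os.sched_setaffinity(0, target)
    return os.sched_getaffinity(0)


def try_sched_fifo(priority: int = 1) -> bool:
    """Request SCHED_FIFO for the calling thread; False if not permitted."""
    try:
        os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(priority))
        return True
    except OSError:
        return False


print(sorted(parse_cpu_list("4-15")))
```

In a GPU-feeding thread the intended sequence would be `pin_current_thread(parse_cpu_list("4-15"))` followed by `try_sched_fifo()`. Call both from that thread only, so the rest of the process keeps normal scheduling.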
The Cloud Contrast: When You Should NOT Self-Host
If you're running on AWS Bedrock, Azure AI, or Google Vertex, none of this touches you — because hyperscalers solve these problems with proprietary inference stacks tuned for their specific workloads, fronted by sales engineers, and over-provisioned to make the failure modes invisible. They have an Anyscale-documented advantage on bursty workloads: paying per-token means you don't pay during idle hours.
That's a real advantage, and self-hosting is the wrong call for some workloads. Cloud inference is the right answer when:
- Your usage is bursty and unpredictable. If you spike from 0 to 200 concurrent requests for an hour a day and idle the rest of the time, cloud's pay-per-token model beats bare-metal utilization economics.
- You have no compliance constraint. If your data doesn't trigger GDPR Article 28 concerns, sector-specific data residency rules, or contractual exfiltration prohibitions, the cloud's convenience may dominate.
- You don't have the in-house Linux engineering depth. The fixes above require kernel-level tuning. If you don't have that capability and can't bring in someone who does, the cloud is honest infrastructure.
Self-hosting wins when the opposite holds: sustained load, strict data-residency requirements, and either internal Linux expertise or a partner who brings it. At sustained load, bare metal can pay back the kernel-tuning investment within months and deliver lower tail latencies than a shared cloud inference API. The failure modes above are the cost of that trade. Fix them properly and your bare-metal stack can beat the cloud on both cost and latency. Ignore them and you'll spend more time firefighting OOM crashes than you would have spent on the cloud bill.
When you've done it right
These three failure modes share a property: none of them show up in a single-user demo. Self-hosted AI looks "production-ready" at 5 users and falls apart at 25. The only way to know whether your stack survives the 26th user is to build it correctly the first time, or to load-test it under realistic concurrency before launch.
That's the work we do at CPLT.
What we ship at CPLT:
| Layer | What's included | Engagement type |
|---|---|---|
| Inference | PagedAttention-capable engine, KV cache reuse, concurrent slot management sized to your user count | Foundation |
| Storage | NVMe-tuned I/O pipeline, async document ingestion, retrieval layer separated from the inference path | Foundation |
| Resilience | Sub-2h bare-metal recovery, git-versioned configuration, operator-grade DR runbooks | Foundation |
| Kernel tuning | Core isolation, cgroup weights, NUMA-aware placement — tuned to your hardware profile | Add-on |
| Compliance attestation | LUKS-encrypted, EU-hosted, GDPR Article 28 evidence pack, auditor-ready documentation | Add-on |
Foundation engagements start at €5K–€10K and cover inference, storage, and resilience — enough to survive the 26th user. Kernel-level tuning and compliance attestation are priced as add-ons once we've scoped your hardware and regulatory profile.
31 containers. 6+ LLM providers. 4 OCR engines. 13 tool integrations. All independently restartable, systemd-supervised where stateful, provider calls fronted by an OpenAI-compatible router with per-call cost tracking. No cloud dependency, no vendor lock-in, no mandatory retainer.
If your self-hosted AI stack hasn't been stress-tested past concurrent-slot exhaustion, it will fail in production. Get a written architectural review in 5 business days: scope a deployment. Or grab the DR summary PDF — the runbook we follow when something does fail.