If you run ollama run llama3.1 on your MacBook, see tokens flying across the screen at 80 TPS, and conclude you're ready to deploy an enterprise AI API — you're walking into a trap.

Single-user local inference is a solved problem. Production LLM inference at scale is an entirely different engineering domain. We watch startups and platform teams constantly attempt to bridge the gap between "local dev" and "multi-user production" using the wrong architectural foundations, and we wrote this post because the same misunderstanding shows up in roughly half the engagements we're asked to scope.

The benchmarks below are not ours. They're a distillation of publicly published data from Red Hat's vLLM team, Spheron's high-concurrency tests, and the official vLLM project documentation. The point isn't to claim a novel benchmark — the point is that the published numbers all agree on the shape of the failure, and that shape is what determines whether your stack survives the day a real team starts using it.

The architecture: why concurrency breaks local engines

The core bottleneck in LLM serving isn't compute — it's memory allocation. Specifically: how the engine manages the KV cache, the VRAM that stores attention tensors for every active request.

llama.cpp and Ollama: the static-slot problem

Ollama is an excellent developer-friendly wrapper around llama.cpp. It uses the quantised GGUF model format and is heavily optimised for edge devices, CPUs, and Apple Silicon. But llama.cpp handles concurrency through static memory slots.

When you configure llama-server (or tune Ollama with OLLAMA_NUM_PARALLEL=25), the engine divides your context window and pre-allocates rigid, contiguous chunks of VRAM for each concurrent request.

The failure mode: if User A sends a 50-token prompt and User B sends a 5,000-token prompt, they both occupy identical rigid memory slots. This produces VRAM fragmentation. Once the parallel limit is hit, new requests queue sequentially. You get severe head-of-line blocking — a single 28K-token document summarisation request stalls the API for everyone else.

vLLM: PagedAttention and continuous batching

vLLM was engineered specifically for high-throughput serving. Instead of static slots, it uses PagedAttention — an algorithm that treats GPU VRAM the way a modern operating system handles virtual memory. The KV cache is broken into small, non-contiguous blocks (pages). Memory is dynamically allocated only as tokens are generated.

Combined with continuous batching — where completed requests are instantly swapped out for new ones without waiting for the whole batch to finish — vLLM maintains near-100% GPU utilisation without OOM crashes.

This isn't a theoretical advantage. It's a different memory model, and the difference shows up the second you put real concurrent load on the engine.

The numbers: what the published benchmarks show

Across publicly available benchmarks of Llama 3.1 8B on a single NVIDIA A100 (40GB), the metrics that matter in production are Time to First Token (TTFT), Inter-Token Latency (ITL), and total throughput measured in tokens-per-second across all concurrent requests.

1. Single user — the illusion

Engine Throughput TTFT
Ollama (llama.cpp backend) ~420 TPS ~35 ms
vLLM ~510 TPS ~28 ms

At one user, the gap is negligible. Both feel instantaneous. This is exactly why developers falsely conclude that their local Ollama stack is production-ready. They benchmarked the easy case.

2. 25–32 concurrent users — the breaking point

Engine Total throughput TTFT
Ollama (tuned for parallelism) ~320 TPS (flatlines) spikes past 290 ms, erratic ITL
vLLM ~1,450 TPS (scales) stable at ~95 ms, smooth ITL

This is the smoking gun. Under concurrent load, llama.cpp total throughput goes flat — it processes a fixed amount of work, and TTFT degrades exponentially as the internal queue backs up. The more users you add, the longer everyone waits.

Conversely, vLLM throughput scales close to linearly. By batching 25 users dynamically via PagedAttention, it extracts roughly 4× the total tokens-per-second from the exact same silicon.

(Caveats on the numbers: published benchmarks vary by ±10–15% depending on prompt length distribution, quantisation, and tokeniser. The architectural shape is consistent across sources; the absolute numbers are not pinpoint-accurate. Treat them as orders of magnitude, not lab readings.)

What this means for your deployment

The right answer isn't "vLLM always wins." Each engine has a real production niche. Here's an honest decision tree:

Use Ollama when: You're doing local prototyping, running AI strictly on a developer machine, or operating an offline single-user edge device. It's the undisputed king of local DX, and trying to push it past that envelope is operator error, not a tooling failure.

Use llama.cpp directly when: You need extreme portability — embedded hardware, Raspberry Pis, CPU-only deployments — or you're shipping a desktop app where each user has their own process. Excellent for that profile.

Use vLLM when: You're building an API. If your application expects 10, 25, or 100 concurrent users, vLLM (or alternatives like SGLang or TGI) is the only mathematically viable choice. Deploying a llama.cpp server for a multi-user enterprise application is committing to a TTFT cliff that arrives the moment your platform gains traction.

The expensive failure mode we keep seeing

The most common engagement we walk into looks identical every time:

"We deployed Ollama on a beefy GPU box six months ago. It worked great. Then we rolled it out to the team and now everyone complains it's slow. We're considering a bigger GPU."

A bigger GPU does not fix this. Static memory slots are static memory slots regardless of whether you have 24GB or 80GB of VRAM. You can buy enough headroom to delay the cliff, but you can't avoid it without changing engines. The fix is architectural, not financial.

Most of the time, the migration path is: keep Ollama for individual developer machines, deploy vLLM (with the same quantised model) behind a LiteLLM router for the shared inference path, and let LiteLLM expose a single OpenAI-compatible endpoint to your applications. Your developer experience doesn't change; your scaling story does.

What we do here

CPLT engagements that include multi-user inference always start with the same question: what does your concurrency profile actually look like? Five users sporadically through the day is one architecture. Twenty-five users hitting a RAG pipeline at 9 a.m. on Monday is a completely different one. The difference between an €8K server that works and an €8K server that gets retired in shame six months later is which architecture you picked on day one.

If your stack is currently Ollama-on-a-shared-GPU and you're starting to see the symptoms in this post — erratic latency, complaints about "the AI being slow today," requests that mysteriously time out — that's the signal. Tell us what you're running and we'll respond within 5 business days with a written scope or an honest "no" if your situation doesn't match what we do.


Want the full picture before you commit? Our Architecture Decision Matrix compares CPLT against OpenAI Enterprise, Anthropic, Together, Anyscale, and rolling-your-own with Ollama — across deployment, compliance, cost, and lock-in. 8-page PDF, free, no email gate on the comparison page.