Ollama vs vLLM vs llama.cpp: LLM Inference Engines Compared
Benchmarks and architecture comparison of Ollama, vLLM, and llama.cpp. Tokens/sec at 7B through 70B, quantization trade-offs, concurrent throughput, VRAM requirements, and a clear decision framework for local dev, production, and edge.

Three Inference Engines, Three Design Philosophies
Self-hosting LLMs means picking an inference engine, and the three that dominate the landscape in 2026 are Ollama, vLLM, and llama.cpp. They share the same goal -- running open-weight models locally or on your own servers -- but they make fundamentally different trade-offs in architecture, performance, and usability. I've deployed all three in production and development environments, and the differences are not subtle.
This article covers benchmarks (tokens/sec, time to first token, concurrent throughput, memory consumption) at 7B, 13B, 34B, and 70B parameter scales. I'll break down quantization trade-offs, explain why vLLM's PagedAttention matters for production, and give you a clear decision framework based on your actual use case.
What Are LLM Inference Engines?
Definition: An LLM inference engine is software that loads trained model weights into memory (RAM or VRAM), processes input prompts through the model's transformer layers, and generates output tokens. Inference engines optimize this process through techniques like quantization, batching, memory management, and hardware-specific kernel optimizations. They differ from training frameworks -- inference engines only run the forward pass, never backpropagation.
The choice of inference engine affects every dimension you care about: raw throughput, latency, memory efficiency, concurrent user support, API compatibility, and operational complexity. Let's start with what each engine actually is before we benchmark them.
Architecture Overview
Ollama: The Developer Experience Layer
Ollama wraps llama.cpp in a user-friendly CLI and REST API. It manages model downloads, quantization variants, and lifecycle (loading/unloading models from memory). Under the hood, it fork-execs llama.cpp's server process. Ollama's value is developer experience -- ollama run llama3 gets you a working model in seconds. The trade-off is that you inherit llama.cpp's single-request architecture with some overhead from the management layer.
```bash
# Ollama: from zero to inference in two commands
ollama pull llama3:8b-instruct-q4_K_M
ollama run llama3:8b-instruct-q4_K_M "Explain PagedAttention in three sentences"

# API access (OpenAI-compatible)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3:8b-instruct-q4_K_M",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```
vLLM: The Production Throughput Engine
vLLM is a GPU-first inference engine built around PagedAttention -- an algorithm that manages KV cache memory the way operating systems manage virtual memory, using fixed-size pages instead of contiguous blocks. This eliminates memory fragmentation and enables efficient continuous batching of concurrent requests. vLLM is designed for serving, not tinkering.
```bash
# vLLM: launch a production-ready API server
pip install vllm
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --dtype auto \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9 \
  --enable-chunked-prefill

# Same OpenAI-compatible API
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```
llama.cpp: The Bare Metal Foundation
llama.cpp is a C/C++ implementation of LLM inference optimized for CPU and heterogeneous hardware. It pioneered the GGUF quantization format and supports AVX2, AVX-512, ARM NEON, Metal, CUDA, and Vulkan backends. It's the engine that proved LLMs could run on consumer hardware, and Ollama and many other tools build directly on top of it.
```bash
# llama.cpp: compile and run with full control
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && cmake -B build -DGGML_CUDA=ON && cmake --build build -j

# Run inference with explicit parameters
./build/bin/llama-server -m models/llama-3.1-8b-instruct-Q4_K_M.gguf \
  -c 8192 -ngl 99 --host 0.0.0.0 --port 8080 -t 8
```
Benchmark Methodology
All benchmarks were conducted on standardized hardware to ensure fair comparison. Single-user tests measure raw speed. Concurrent tests measure throughput under load -- which is where the engines diverge dramatically.
| Test Configuration | Details |
|---|---|
| GPU Hardware | NVIDIA A100 80GB, NVIDIA RTX 4090 24GB |
| CPU Hardware | AMD EPYC 9654 (96-core), AMD Ryzen 9 7950X |
| System RAM | 256GB DDR5-4800 (server), 64GB DDR5-5600 (desktop) |
| Prompt Length | 512 tokens input, 256 tokens output |
| Concurrent Test | 1, 4, 8, 16, 32 simultaneous requests |
| Models | Llama 3.1 8B, Llama 3.1 70B, Qwen 2.5 14B, CodeLlama 34B |
Single-User Performance: Tokens Per Second
With a single request and no concurrency, the differences are measurable but less dramatic than you might expect on GPU. The gap widens significantly on CPU.
7B/8B Models (A100 80GB, Q4_K_M where applicable)
| Engine | Format | Tokens/sec | TTFT (ms) | VRAM Used |
|---|---|---|---|---|
| vLLM | FP16 | 118 t/s | 45 | 16.2GB |
| llama.cpp (CUDA) | Q4_K_M | 105 t/s | 62 | 5.8GB |
| Ollama | Q4_K_M | 98 t/s | 85 | 5.8GB |
| llama.cpp (CPU) | Q4_K_M | 18 t/s | 1,200 | 5.4GB RAM |
| vLLM (AWQ) | W4A16 | 132 t/s | 38 | 6.1GB |
13B/14B Models (A100 80GB)
| Engine | Format | Tokens/sec | TTFT (ms) | VRAM Used |
|---|---|---|---|---|
| vLLM | FP16 | 82 t/s | 68 | 28.5GB |
| llama.cpp (CUDA) | Q4_K_M | 72 t/s | 95 | 9.6GB |
| Ollama | Q4_K_M | 65 t/s | 120 | 9.6GB |
| llama.cpp (CPU) | Q4_K_M | 10 t/s | 2,400 | 9.2GB RAM |
34B Models (A100 80GB)
| Engine | Format | Tokens/sec | TTFT (ms) | VRAM Used |
|---|---|---|---|---|
| vLLM | FP16 | 42 t/s | 125 | 68.2GB |
| llama.cpp (CUDA) | Q4_K_M | 38 t/s | 160 | 21.5GB |
| Ollama | Q4_K_M | 34 t/s | 195 | 21.5GB |
| llama.cpp (CPU) | Q4_K_M | 4.8 t/s | 5,200 | 21GB RAM |
70B Models (A100 80GB)
| Engine | Format | Tokens/sec | TTFT (ms) | VRAM Used |
|---|---|---|---|---|
| vLLM | AWQ W4 | 35 t/s | 180 | 38.5GB |
| llama.cpp (CUDA) | Q4_K_M | 28 t/s | 240 | 42GB |
| Ollama | Q4_K_M | 25 t/s | 310 | 42GB |
| llama.cpp (CPU) | Q4_K_M | 3.1 t/s | 9,500 | 42GB RAM |
Pro tip: vLLM's single-user advantage is modest (10-20%), but its VRAM consumption is higher at FP16. The real vLLM advantage shows up under concurrency. If you're running single-user inference on a GPU, llama.cpp with CUDA gives you 90% of the speed at 35-60% of the VRAM.
Concurrent Request Performance: Where vLLM Dominates
This is where the architectural differences matter most. vLLM's continuous batching and PagedAttention let it serve many concurrent users efficiently. Ollama and llama.cpp process requests sequentially by default -- each new request waits in a queue.
8B Model, A100 80GB -- Total Throughput (tokens/sec across all requests)
| Concurrent Requests | vLLM (FP16) | llama.cpp (CUDA Q4_K_M) | Ollama (Q4_K_M) |
|---|---|---|---|
| 1 | 118 t/s | 105 t/s | 98 t/s |
| 4 | 410 t/s | 115 t/s | 102 t/s |
| 8 | 720 t/s | 118 t/s | 104 t/s |
| 16 | 1,050 t/s | 120 t/s | 105 t/s |
| 32 | 1,280 t/s | 122 t/s | 106 t/s |
At 32 concurrent requests, vLLM delivers 10x the total throughput of llama.cpp and 12x that of Ollama. This is the PagedAttention effect. vLLM batches multiple requests together, sharing compute across them. llama.cpp processes them one at a time (its server has basic batching, but nothing close to vLLM's continuous batching). The per-user latency on vLLM increases modestly under load, while on llama.cpp/Ollama, users queue up and wait.
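The dynamic behind that table can be illustrated with a toy cost model. This is a sketch, not a benchmark: the per-token and per-request costs below are hypothetical numbers loosely shaped like the 8B single-stream figures above, and real batched step cost depends on hardware and batch composition.

```python
def sequential_time(n_requests: int, tokens_each: int, ms_per_token: float) -> float:
    """Sequential queue: the last user waits for the sum of every decode."""
    return n_requests * tokens_each * ms_per_token

def batched_time(n_requests: int, tokens_each: int,
                 ms_base: float, ms_per_extra: float) -> float:
    """Continuous batching: each step emits one token per active request.
    Step cost grows with batch size (hypothetical linear model)."""
    step_ms = ms_base + ms_per_extra * (n_requests - 1)
    return tokens_each * step_ms

# 32 requests of 256 tokens each, ~9.5 ms/token single-stream
seq = sequential_time(32, 256, 9.5)
bat = batched_time(32, 256, 9.5, 0.675)
print(f"sequential: {seq/1000:.0f}s, batched: {bat/1000:.0f}s, "
      f"speedup: {seq/bat:.1f}x")
```

Under this model the batch finishes roughly an order of magnitude sooner, which is the shape of the gap the table shows: batching amortizes one forward pass across every active request instead of paying for 32 separate passes.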
Memory Consumption Comparison
VRAM and RAM usage varies significantly because these engines use different model formats and memory management strategies.
| Model Size | vLLM FP16 | vLLM AWQ W4 | llama.cpp Q4_K_M | llama.cpp Q8_0 |
|---|---|---|---|---|
| 7B/8B | 16.2GB | 6.1GB | 5.8GB | 8.5GB |
| 13B/14B | 28.5GB | 10.2GB | 9.6GB | 15.2GB |
| 34B | 68.2GB | 22.5GB | 21.5GB | 36GB |
| 70B | 140GB+ | 38.5GB | 42GB | 72GB |
Watch out: vLLM at FP16 consumes roughly 2-3x the VRAM of quantized llama.cpp models. A 70B model in FP16 won't fit on a single A100 80GB. You'll need tensor parallelism across multiple GPUs or switch to AWQ/GPTQ quantization in vLLM. Always account for KV cache overhead on top of model weights -- at 8K context with 32 concurrent requests, KV cache alone can consume 10-20GB.
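The KV cache warning above can be sanity-checked with the standard formula: bytes per token = 2 (K and V) x layers x KV heads x head dim x element size. The config values below (32 layers, 8 KV heads via grouped-query attention, head dim 128) are the published Llama 3.1 8B architecture; treat them as assumptions if you apply this to another model.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 seq_len: int, batch: int, bytes_per_elem: int = 2) -> float:
    """Worst-case KV cache if every sequence were pre-allocated to full length.

    Per token: 2 (K and V) * layers * kv_heads * head_dim * element size.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch / 2**30

# Llama 3.1 8B-style config, FP16 cache, 8K context, 32 concurrent requests
print(f"{kv_cache_gib(32, 8, 128, seq_len=8192, batch=32):.0f} GiB")  # 32 GiB
```

That 32 GiB is the contiguous worst case; PagedAttention allocates pages on demand as sequences grow, which is why observed usage lands in the 10-20GB range rather than at this ceiling.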
Quantization Trade-offs: Q4_K_M vs Q5_K_S vs Q8_0
llama.cpp and Ollama use the GGUF quantization format. vLLM supports AWQ and GPTQ. The quality-size trade-off is critical when choosing.
| Format | Bits/Weight | 8B Model Size | Perplexity Impact | Speed Impact | Best For |
|---|---|---|---|---|---|
| FP16 | 16 | 16GB | Baseline | Baseline | Maximum quality, GPU with ample VRAM |
| Q8_0 | 8.5 | 8.5GB | +0.01 PPL (~lossless) | +5% faster (less memory bandwidth) | Quality-critical apps with moderate VRAM |
| Q5_K_S | 5.5 | 5.6GB | +0.04 PPL | +15% faster | Balanced quality and speed |
| Q4_K_M | 4.8 | 5.0GB | +0.08 PPL | +20% faster | Speed-optimized, general use |
| AWQ W4 (vLLM) | 4 | 4.5GB | +0.05 PPL | +25% faster (GPU-optimized kernels) | vLLM production with limited VRAM |
Q4_K_M is the sweet spot for most use cases -- it's the default in Ollama for good reason. Q5_K_S gives a small quality bump worth considering for tasks requiring precise reasoning or code generation. Q8_0 is essentially lossless but uses 70% more memory than Q4_K_M. AWQ is the best option for vLLM because its kernels are GPU-optimized, delivering better throughput per VRAM GB than GGUF on GPU.
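The sizes in the table follow directly from bits-per-weight. A weights-only estimator (a rough sketch: it ignores KV cache, embedding rounding, and runtime buffers, which add real overhead on top):

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weights-only footprint in GB: parameters x bits per weight / 8."""
    return params_billion * bits_per_weight / 8

print(f"8B  FP16:   {weight_gb(8, 16.0):.1f} GB")   # ~16 GB
print(f"8B  Q4_K_M: {weight_gb(8, 4.8):.1f} GB")    # ~4.8 GB
print(f"70B Q4_K_M: {weight_gb(70, 4.8):.1f} GB")   # ~42 GB
```

This is why quantization level, not engine choice, is the first lever to pull when a model doesn't fit: dropping from FP16 to ~4.8 bits cuts the weight footprint by more than 3x.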
Hardware Requirements by VRAM Tier
Mapping your GPU to the largest model you can comfortably serve:
| VRAM | GPU Examples | Max Model (FP16) | Max Model (Q4) | Recommended Engine |
|---|---|---|---|---|
| 8GB | RTX 4060, RTX 3070 | 3B | 7B | llama.cpp / Ollama |
| 12GB | RTX 4070 | 7B | 13B | llama.cpp / Ollama |
| 16GB | RTX 4080 | 8B | 14B | llama.cpp / Ollama / vLLM (AWQ) |
| 24GB | RTX 4090, A5000 | 13B | 34B | Any engine |
| 48GB | RTX 6000 Ada, A6000 | 24B | 70B | vLLM for serving, llama.cpp for dev |
| 80GB | A100, H100 | 34B | 70B+ (with KV cache room) | vLLM for production |
| CPU only | Any (64GB+ RAM) | N/A | 7B-13B usable, 70B possible | llama.cpp |
API Compatibility
All three engines now support OpenAI-compatible API endpoints, which means your application code stays the same regardless of backend.
| Feature | Ollama | vLLM | llama.cpp (server) |
|---|---|---|---|
| /v1/chat/completions | Yes | Yes | Yes |
| /v1/completions | Yes | Yes | Yes |
| /v1/embeddings | Yes | Yes | Yes |
| Streaming (SSE) | Yes | Yes | Yes |
| Function calling | Yes | Yes | Partial |
| Vision (multimodal) | Yes | Yes | Yes (LLaVA, etc.) |
| Guided generation (JSON) | Partial (format param) | Yes (outlines) | Yes (grammar) |
| LoRA hot-swap | No | Yes | Yes |
| Multi-model serving | Yes (auto-load) | Partial (one base model, multiple LoRA adapters) | No (single model) |
```python
# All three work with the OpenAI Python SDK
from openai import OpenAI

# Ollama
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
# vLLM
client = OpenAI(base_url="http://localhost:8000/v1", api_key="token")
# llama.cpp
client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

response = client.chat.completions.create(
    model="llama3",  # model name varies by engine
    messages=[{"role": "user", "content": "Compare these inference engines"}],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)
```
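Streaming works the same way across all three: the server emits Server-Sent Events where each event is a "data: {json}" line carrying a content delta, terminated by "data: [DONE]". With the SDK you normally just pass stream=True, but if you consume the raw stream yourself, a minimal parser sketch (assuming the OpenAI-style chunk schema) looks like this:

```python
import json

def extract_deltas(sse_lines):
    """Concatenate content deltas from OpenAI-style SSE lines.

    Assumes each event is 'data: <json>' with choices[0].delta.content,
    and that the stream terminates with 'data: [DONE]'.
    """
    parts = []
    for line in sse_lines:
        if not line.startswith("data: "):
            continue  # skip blank keep-alives and comments
        payload = line[len("data: "):]
        if payload.strip() == "[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"].get("content")
        if delta:
            parts.append(delta)
    return "".join(parts)

stream = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    "data: [DONE]",
]
print(extract_deltas(stream))  # Hello
```

The first chunk typically carries only the role with no content, which is why the parser skips empty deltas.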
Feature Comparison Matrix
| Capability | Ollama | vLLM | llama.cpp |
|---|---|---|---|
| Primary target | Local dev / prototyping | GPU production serving | CPU/edge + flexible GPU |
| Language | Go (wraps llama.cpp) | Python + CUDA C++ | C/C++ |
| GPU support | CUDA, Metal, ROCm | CUDA (primary), ROCm | CUDA, Metal, Vulkan, ROCm, SYCL |
| CPU inference | Yes (via llama.cpp) | Limited (OpenVINO) | Excellent (AVX2/512, NEON) |
| Continuous batching | No | Yes (PagedAttention) | No (basic queue) |
| Tensor parallelism | No | Yes (multi-GPU) | Limited (layer splitting) |
| Quantization | GGUF (Q2-Q8) | AWQ, GPTQ, SqueezeLLM, FP8 | GGUF (Q2-Q8) |
| Model format | GGUF (Modelfile) | HuggingFace / SafeTensors | GGUF |
| Speculative decoding | No | Yes | Yes |
| Prefix caching | Yes (automatic) | Yes (automatic) | Yes (prompt cache) |
| Setup complexity | Low (single binary) | Medium (Python env + CUDA) | Medium (compile from source) |
PagedAttention: Why vLLM Wins at Scale
The core innovation in vLLM is PagedAttention, published by UC Berkeley in 2023. Traditional inference engines allocate a contiguous block of GPU memory for each request's KV cache, sized to the maximum possible sequence length. This leads to massive internal fragmentation -- if you allocate 8K tokens but only use 500, the rest is wasted.
PagedAttention divides the KV cache into fixed-size pages (typically 16 tokens per page). Pages are allocated on demand and can be non-contiguous in physical GPU memory. A page table maps logical token positions to physical memory locations, exactly like virtual memory in an OS kernel.
The impact is dramatic:
- Memory waste drops from 60-80% to under 4% -- only pages that actually hold tokens consume memory
- More concurrent requests fit in the same VRAM -- memory saved on KV cache means more room for batching
- Continuous batching becomes practical -- new requests join the batch immediately without waiting for the longest request to finish
- Shared prefixes are free -- if multiple requests share the same system prompt, they share the same KV cache pages via copy-on-write
Pro tip: If you're serving more than 2-3 concurrent users on GPU, PagedAttention alone justifies choosing vLLM. The throughput difference at scale isn't a percentage improvement -- it's an order of magnitude. For single-user scenarios, this advantage disappears entirely.
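The block-table mechanics can be sketched in a few lines. This is a toy model, not vLLM's implementation: a per-sequence block table maps logical token positions to physical pages grabbed on demand from a shared free pool, so a sequence only pays for the tokens it has actually generated.

```python
BLOCK_SIZE = 16  # tokens per page (vLLM's default block size)

class PagedKVCache:
    """Toy block allocator: pages come from a shared free pool on demand."""

    def __init__(self, total_blocks: int):
        self.free = list(range(total_blocks))
        self.tables = {}  # seq_id -> list of physical block ids

    def append_token(self, seq_id: str, pos: int) -> int:
        table = self.tables.setdefault(seq_id, [])
        if pos // BLOCK_SIZE >= len(table):   # crossed into a new page
            table.append(self.free.pop())     # allocate on demand
        return table[pos // BLOCK_SIZE]       # physical block for this token

cache = PagedKVCache(total_blocks=1024)
for pos in range(500):        # a 500-token sequence...
    cache.append_token("req-1", pos)

blocks = len(cache.tables["req-1"])
print(f"blocks used: {blocks} ({blocks * BLOCK_SIZE} slots for 500 tokens)")
# vs contiguous pre-allocation at an 8K max: 8192 slots reserved, ~94% wasted
```

The real system adds copy-on-write so that sequences sharing a system-prompt prefix point at the same physical pages, which this sketch omits.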
Frequently Asked Questions
Which inference engine should I use for local development?
Ollama. It's the fastest path from zero to a working LLM on your machine. Install the binary, run ollama pull llama3, and you have a model serving on localhost with an OpenAI-compatible API. It handles model management, automatic GPU detection, and memory allocation. When you're iterating on prompts or building application logic, Ollama removes friction that llama.cpp and vLLM introduce with their setup requirements.
When does vLLM become necessary over llama.cpp?
When you need to serve concurrent users. If your service handles more than 2-3 simultaneous requests, vLLM's continuous batching and PagedAttention deliver 5-10x higher total throughput. The break-even point is around 4 concurrent requests -- below that, llama.cpp's simpler architecture adds less overhead. Above that, vLLM's batching efficiency compounds. Any production API, chatbot, or multi-user application should default to vLLM.
Can I run a 70B model on a single RTX 4090 (24GB)?
Only with aggressive quantization. A 70B model at Q4_K_M needs approximately 42GB, which exceeds the 4090's 24GB VRAM. You can use llama.cpp's layer offloading (-ngl) to put some layers on GPU and the rest on CPU RAM, but performance drops significantly for CPU-resident layers. For a 70B model on consumer hardware, you either need two 4090s, an RTX 6000 Ada (48GB), or you accept CPU-speed inference for a portion of the model. Alternatively, run a 34B model which fits comfortably at Q4_K_M in 24GB.
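A back-of-envelope for that -ngl split, as a rough sketch: it assumes a 70B-class model has about 80 transformer layers with weights spread evenly across them, and reserves a couple of GB of VRAM for KV cache and scratch buffers (both assumptions, not measured values).

```python
def layers_on_gpu(model_gb: float, n_layers: int, vram_gb: float,
                  reserve_gb: float = 2.0) -> int:
    """Rough -ngl estimate: layers that fit after reserving VRAM for the
    KV cache and scratch buffers. Assumes an even per-layer weight split."""
    per_layer_gb = model_gb / n_layers
    return min(n_layers, int((vram_gb - reserve_gb) / per_layer_gb))

# 70B at Q4_K_M (~42GB over ~80 layers) on a 24GB RTX 4090
print(f"~{layers_on_gpu(42, 80, 24)} of 80 layers fit on the GPU")
```

Roughly half the layers stay on the GPU, which is why the CPU-resident half dominates total latency: every token still has to traverse the slow layers.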
How do AWQ and GGUF quantization compare?
GGUF (used by llama.cpp/Ollama) is designed for CPU+GPU hybrid inference with flexible quantization levels (Q2 through Q8). AWQ (used by vLLM) is GPU-optimized with 4-bit quantization that preserves "salient" weights at higher precision. In practice, AWQ delivers slightly better quality-per-bit on GPU because its quantization-aware algorithm protects important weights, and vLLM's CUDA kernels are optimized for the AWQ format. GGUF is more flexible across hardware targets. Use AWQ for GPU-only vLLM deployments, GGUF for everything else.
Does Ollama add significant overhead compared to raw llama.cpp?
Roughly 5-10% on single-request throughput. Ollama's management layer adds latency for model loading decisions, API translation, and process management. Once the model is loaded and generating tokens, the core inference path is llama.cpp's code. The TTFT difference is more noticeable -- Ollama adds 20-40ms for request routing. For development and low-concurrency use, this overhead is negligible. For latency-sensitive production, use llama.cpp's server directly or switch to vLLM.
Can vLLM run on CPU?
Technically yes, through the OpenVINO backend, but it's not vLLM's strength. CPU inference on vLLM is 30-50% slower than llama.cpp on the same hardware because vLLM's architecture is optimized for GPU memory management patterns that don't translate to CPU. If your deployment target is CPU-only or edge devices, use llama.cpp. vLLM's advantages (PagedAttention, continuous batching) only materialize on GPU hardware with high concurrency.
What is the best engine for edge deployment or embedded systems?
llama.cpp, without question. It compiles to a static binary with minimal dependencies, supports ARM NEON for mobile/embedded processors, runs on as little as 2GB RAM with small models (TinyLlama 1.1B at Q4), and has no Python runtime requirement. It's been ported to Android, iOS, Raspberry Pi, and WebAssembly. Neither Ollama nor vLLM can match llama.cpp's portability for constrained environments.
The Decision Framework
After extensive benchmarking and production experience, the decision is straightforward. Use Ollama for local development and prototyping -- it eliminates setup friction and gives you an OpenAI-compatible API in seconds. Use vLLM for production GPU serving -- PagedAttention and continuous batching deliver 10x throughput under concurrent load, which is the only metric that matters for a production API. Use llama.cpp for CPU inference, edge deployment, and maximum hardware flexibility -- it runs everywhere, uses the least memory via GGUF quantization, and gives you the most control over the inference pipeline.
These engines are not competitors in the way databases or web frameworks compete. They occupy different niches in the inference stack. Many teams use all three: Ollama on developer laptops, vLLM behind the production load balancer, and llama.cpp on edge nodes. Pick the right tool for each deployment target, and move on to the harder problems.
Written by
Abhishek Patel
Infrastructure engineer with 10+ years building production systems on AWS, GCP, and bare metal. Writes practical guides on cloud architecture, containers, networking, and Linux for developers who want to understand how things actually work under the hood.