Ollama vs vLLM vs llama.cpp: LLM Inference Engines Compared
Benchmarks and architecture comparison of Ollama, vLLM, and llama.cpp. Tokens/sec at 7B through 70B, quantization trade-offs, concurrent throughput, VRAM requirements, and a clear decision framework for local dev, production, and edge.

Three Inference Engines, Three Design Philosophies
Self-hosting LLMs means picking an inference engine, and the three that dominate the landscape in 2026 are Ollama, vLLM, and llama.cpp. They share the same goal -- running open-weight models locally or on your own servers -- but they make fundamentally different trade-offs in architecture, performance, and usability. I've deployed all three in production and development environments, and the differences are not subtle.
This article covers benchmarks (tokens/sec, time to first token, concurrent throughput, memory consumption) at 7B, 13B, 34B, and 70B parameter scales. I'll break down quantization trade-offs, explain why vLLM's PagedAttention matters for production, and give you a clear decision framework based on your actual use case.
What Are LLM Inference Engines?
Definition: An LLM inference engine is software that loads trained model weights into memory (RAM or VRAM), processes input prompts through the model's transformer layers, and generates output tokens. Inference engines optimize this process through techniques like quantization, batching, memory management, and hardware-specific kernel optimizations. They differ from training frameworks -- inference engines only run the forward pass, never backpropagation.
The choice of inference engine affects every dimension you care about: raw throughput, latency, memory efficiency, concurrent user support, API compatibility, and operational complexity. Let's start with what each engine actually is before we benchmark them.
Architecture Overview
Ollama: The Developer Experience Layer
Ollama wraps llama.cpp in a user-friendly CLI and REST API. It manages model downloads, quantization variants, and lifecycle (loading/unloading models from memory). Under the hood, it fork-execs llama.cpp's server process. Ollama's value is developer experience -- ollama run llama3 gets you a working model in seconds. The trade-off is that you inherit llama.cpp's single-request architecture with some overhead from the management layer.
```bash
# Ollama: from zero to inference in two commands
ollama pull llama3:8b-instruct-q4_K_M
ollama run llama3:8b-instruct-q4_K_M "Explain PagedAttention in three sentences"

# API access (OpenAI-compatible)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3:8b-instruct-q4_K_M",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```
vLLM: The Production Throughput Engine
vLLM is a GPU-first inference engine built around PagedAttention -- an algorithm that manages KV cache memory the way operating systems manage virtual memory, using fixed-size pages instead of contiguous blocks. This eliminates memory fragmentation and enables efficient continuous batching of concurrent requests. vLLM is designed for serving, not tinkering.
```bash
# vLLM: launch a production-ready API server
pip install vllm
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --dtype auto \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9 \
  --enable-chunked-prefill

# Same OpenAI-compatible API
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```
llama.cpp: The Bare Metal Foundation
llama.cpp is a C/C++ implementation of LLM inference optimized for CPU and heterogeneous hardware. It pioneered the GGUF quantization format and supports AVX2, AVX-512, ARM NEON, Metal, CUDA, and Vulkan backends. It's the engine that proved LLMs could run on consumer hardware, and Ollama and many other tools build directly on top of it.
```bash
# llama.cpp: compile and run with full control
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && cmake -B build -DGGML_CUDA=ON && cmake --build build -j

# Run inference with explicit parameters
./build/bin/llama-server -m models/llama-3.1-8b-instruct-Q4_K_M.gguf \
  -c 8192 -ngl 99 --host 0.0.0.0 --port 8080 -t 8
```
Benchmark Methodology
All benchmarks were conducted on standardized hardware to ensure fair comparison. Single-user tests measure raw speed. Concurrent tests measure throughput under load -- which is where the engines diverge dramatically.
| Test Configuration | Details |
|---|---|
| GPU Hardware | NVIDIA A100 80GB, NVIDIA RTX 4090 24GB |
| CPU Hardware | AMD EPYC 9654 (96-core), AMD Ryzen 9 7950X |
| System RAM | 256GB DDR5-4800 (server), 64GB DDR5-5600 (desktop) |
| Prompt Length | 512 tokens input, 256 tokens output |
| Concurrent Test | 1, 4, 8, 16, 32 simultaneous requests |
| Models | Llama 3.1 8B, Llama 3.1 70B, Qwen 2.5 14B, CodeLlama 34B |
Single-User Performance: Tokens Per Second
With a single request and no concurrency, the differences are measurable but less dramatic than you might expect on GPU. The gap widens significantly on CPU.
7B/8B Models (A100 80GB, Q4_K_M where applicable)
| Engine | Format | Tokens/sec | TTFT (ms) | VRAM Used |
|---|---|---|---|---|
| vLLM | FP16 | 118 t/s | 45 | 16.2GB |
| llama.cpp (CUDA) | Q4_K_M | 105 t/s | 62 | 5.8GB |
| Ollama | Q4_K_M | 98 t/s | 85 | 5.8GB |
| llama.cpp (CPU) | Q4_K_M | 18 t/s | 1,200 | 5.4GB RAM |
| vLLM (AWQ) | W4A16 | 132 t/s | 38 | 6.1GB |
13B/14B Models (A100 80GB)
| Engine | Format | Tokens/sec | TTFT (ms) | VRAM Used |
|---|---|---|---|---|
| vLLM | FP16 | 82 t/s | 68 | 28.5GB |
| llama.cpp (CUDA) | Q4_K_M | 72 t/s | 95 | 9.6GB |
| Ollama | Q4_K_M | 65 t/s | 120 | 9.6GB |
| llama.cpp (CPU) | Q4_K_M | 10 t/s | 2,400 | 9.2GB RAM |
34B Models (A100 80GB)
| Engine | Format | Tokens/sec | TTFT (ms) | VRAM Used |
|---|---|---|---|---|
| vLLM | FP16 | 42 t/s | 125 | 68.2GB |
| llama.cpp (CUDA) | Q4_K_M | 38 t/s | 160 | 21.5GB |
| Ollama | Q4_K_M | 34 t/s | 195 | 21.5GB |
| llama.cpp (CPU) | Q4_K_M | 4.8 t/s | 5,200 | 21GB RAM |
70B Models (A100 80GB)
| Engine | Format | Tokens/sec | TTFT (ms) | VRAM Used |
|---|---|---|---|---|
| vLLM | AWQ W4 | 35 t/s | 180 | 38.5GB |
| llama.cpp (CUDA) | Q4_K_M | 28 t/s | 240 | 42GB |
| Ollama | Q4_K_M | 25 t/s | 310 | 42GB |
| llama.cpp (CPU) | Q4_K_M | 3.1 t/s | 9,500 | 42GB RAM |
Pro tip: vLLM's single-user advantage is modest (10-20%), but its VRAM consumption is higher at FP16. The real vLLM advantage shows up under concurrency. If you're running single-user inference on a GPU, llama.cpp with CUDA gives you 90% of the speed at 35-60% of the VRAM.
Concurrent Request Performance: Where vLLM Dominates
This is where the architectural differences matter most. vLLM's continuous batching and PagedAttention let it serve many concurrent users efficiently. Ollama and llama.cpp process requests sequentially by default -- each new request waits in a queue.
8B Model, A100 80GB -- Total Throughput (tokens/sec across all requests)
| Concurrent Requests | vLLM (FP16) | llama.cpp (CUDA Q4_K_M) | Ollama (Q4_K_M) |
|---|---|---|---|
| 1 | 118 t/s | 105 t/s | 98 t/s |
| 4 | 410 t/s | 115 t/s | 102 t/s |
| 8 | 720 t/s | 118 t/s | 104 t/s |
| 16 | 1,050 t/s | 120 t/s | 105 t/s |
| 32 | 1,280 t/s | 122 t/s | 106 t/s |
At 32 concurrent requests, vLLM delivers 10x the total throughput of llama.cpp and 12x that of Ollama. This is the PagedAttention effect. vLLM batches multiple requests together, sharing compute across them. llama.cpp processes them one at a time (its server has basic batching, but nothing close to vLLM's continuous batching). The per-user latency on vLLM increases modestly under load, while on llama.cpp/Ollama, users queue up and wait.
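The dynamic behind that table can be illustrated with a toy cost model. This is a sketch, not a benchmark: the per-token and per-request costs below are hypothetical numbers loosely shaped like the 8B single-stream figures above, and real batched step cost depends on hardware and batch composition.

```python
def sequential_time(n_requests: int, tokens_each: int, ms_per_token: float) -> float:
    """Sequential queue: the last user waits for the sum of every decode."""
    return n_requests * tokens_each * ms_per_token

def batched_time(n_requests: int, tokens_each: int,
                 ms_base: float, ms_per_extra: float) -> float:
    """Continuous batching: each step emits one token per active request.
    Step cost grows with batch size (hypothetical linear model)."""
    step_ms = ms_base + ms_per_extra * (n_requests - 1)
    return tokens_each * step_ms

# 32 requests of 256 tokens each, ~9.5 ms/token single-stream
seq = sequential_time(32, 256, 9.5)
bat = batched_time(32, 256, 9.5, 0.675)
print(f"sequential: {seq/1000:.0f}s, batched: {bat/1000:.0f}s, "
      f"speedup: {seq/bat:.1f}x")
```

Under this model the batch finishes roughly an order of magnitude sooner, which is the shape of the gap the table shows: batching amortizes one forward pass across every active request instead of paying for 32 separate passes.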
Memory Consumption Comparison
VRAM and RAM usage varies significantly because these engines use different model formats and memory management strategies.
| Model Size | vLLM FP16 | vLLM AWQ W4 | llama.cpp Q4_K_M | llama.cpp Q8_0 |
|---|---|---|---|---|
| 7B/8B | 16.2GB | 6.1GB | 5.8GB | 8.5GB |
| 13B/14B | 28.5GB | 10.2GB | 9.6GB | 15.2GB |
| 34B | 68.2GB | 22.5GB | 21.5GB | 36GB |
| 70B | 140GB+ | 38.5GB | 42GB | 72GB |
Watch out: vLLM at FP16 consumes roughly 2-3x the VRAM of quantized llama.cpp models. A 70B model in FP16 won't fit on a single A100 80GB. You'll need tensor parallelism across multiple GPUs or switch to AWQ/GPTQ quantization in vLLM. Always account for KV cache overhead on top of model weights -- at 8K context with 32 concurrent requests, KV cache alone can consume 10-20GB.
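The KV cache warning above can be sanity-checked with the standard formula: bytes per token = 2 (K and V) x layers x KV heads x head dim x element size. The config values below (32 layers, 8 KV heads via grouped-query attention, head dim 128) are the published Llama 3.1 8B architecture; treat them as assumptions if you apply this to another model.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 seq_len: int, batch: int, bytes_per_elem: int = 2) -> float:
    """Worst-case KV cache if every sequence were pre-allocated to full length.

    Per token: 2 (K and V) * layers * kv_heads * head_dim * element size.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch / 2**30

# Llama 3.1 8B-style config, FP16 cache, 8K context, 32 concurrent requests
print(f"{kv_cache_gib(32, 8, 128, seq_len=8192, batch=32):.0f} GiB")  # 32 GiB
```

That 32 GiB is the contiguous worst case; PagedAttention allocates pages on demand as sequences grow, which is why observed usage lands in the 10-20GB range rather than at this ceiling.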
Quantization Trade-offs: Q4_K_M vs Q5_K_S vs Q8_0
llama.cpp and Ollama use the GGUF quantization format. vLLM supports AWQ and GPTQ. The quality-size trade-off is critical when choosing.
| Format | Bits/Weight | 8B Model Size | Perplexity Impact | Speed Impact | Best For |
|---|---|---|---|---|---|
| FP16 | 16 | 16GB | Baseline | Baseline | Maximum quality, GPU with ample VRAM |
| Q8_0 | 8.5 | 8.5GB | +0.01 PPL (~lossless) | +5% faster (less memory bandwidth) | Quality-critical apps with moderate VRAM |
| Q5_K_S | 5.5 | 5.6GB | +0.04 PPL | +15% faster | Balanced quality and speed |
| Q4_K_M | 4.8 | 5.0GB | +0.08 PPL | +20% faster | Speed-optimized, general use |
| AWQ W4 (vLLM) | 4 | 4.5GB | +0.05 PPL | +25% faster (GPU-optimized kernels) | vLLM production with limited VRAM |
Q4_K_M is the sweet spot for most use cases -- it's the default in Ollama for good reason. Q5_K_S gives a small quality bump worth considering for tasks requiring precise reasoning or code generation. Q8_0 is essentially lossless but uses 70% more memory than Q4_K_M. AWQ is the best option for vLLM because its kernels are GPU-optimized, delivering better throughput per VRAM GB than GGUF on GPU.
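The sizes in the table follow directly from bits-per-weight. A weights-only estimator (a rough sketch: it ignores KV cache, embedding rounding, and runtime buffers, which add real overhead on top):

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weights-only footprint in GB: parameters x bits per weight / 8."""
    return params_billion * bits_per_weight / 8

print(f"8B  FP16:   {weight_gb(8, 16.0):.1f} GB")   # ~16 GB
print(f"8B  Q4_K_M: {weight_gb(8, 4.8):.1f} GB")    # ~4.8 GB
print(f"70B Q4_K_M: {weight_gb(70, 4.8):.1f} GB")   # ~42 GB
```

This is why quantization level, not engine choice, is the first lever to pull when a model doesn't fit: dropping from FP16 to ~4.8 bits cuts the weight footprint by more than 3x.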
Hardware Requirements by VRAM Tier
Mapping your GPU to the largest model you can comfortably serve:
| VRAM | GPU Examples | Max Model (FP16) | Max Model (Q4) | Recommended Engine |
|---|---|---|---|---|
| 8GB | RTX 4060, RTX 3070 | 3B | 7B | llama.cpp / Ollama |
| 12GB | RTX 4070 | 7B | 13B | llama.cpp / Ollama |
| 16GB | RTX 4080 | 8B | 14B | llama.cpp / Ollama / vLLM (AWQ) |
| 24GB | RTX 4090, A5000 | 13B | 34B | Any engine |
| 48GB | RTX 6000 Ada, A6000 | 24B | 70B | vLLM for serving, llama.cpp for dev |
| 80GB | A100, H100 | 34B | 70B+ (with KV cache room) | vLLM for production |
| CPU only | Any (64GB+ RAM) | N/A | 7B-13B usable, 70B possible | llama.cpp |
API Compatibility
All three engines now support OpenAI-compatible API endpoints, which means your application code stays the same regardless of backend.
| Feature | Ollama | vLLM | llama.cpp (server) |
|---|---|---|---|
| /v1/chat/completions | Yes | Yes | Yes |
| /v1/completions | Yes | Yes | Yes |
| /v1/embeddings | Yes | Yes | Yes |
| Streaming (SSE) | Yes | Yes | Yes |
| Function calling | Yes | Yes | Partial |
| Vision (multimodal) | Yes | Yes | Yes (LLaVA, etc.) |
| Guided generation (JSON) | Partial (format param) | Yes (outlines) | Yes (grammar) |
| LoRA hot-swap | No | Yes | Yes |
| Multi-model serving | Yes (auto-load) | Partial (one base model, multiple LoRA adapters) | No (single model) |
```python
# All three work with the OpenAI Python SDK
from openai import OpenAI

# Ollama
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
# vLLM
client = OpenAI(base_url="http://localhost:8000/v1", api_key="token")
# llama.cpp
client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

response = client.chat.completions.create(
    model="llama3",  # model name varies by engine
    messages=[{"role": "user", "content": "Compare these inference engines"}],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)
```
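Streaming works the same way across all three: the server emits Server-Sent Events where each event is a "data: {json}" line carrying a content delta, terminated by "data: [DONE]". With the SDK you normally just pass stream=True, but if you consume the raw stream yourself, a minimal parser sketch (assuming the OpenAI-style chunk schema) looks like this:

```python
import json

def extract_deltas(sse_lines):
    """Concatenate content deltas from OpenAI-style SSE lines.

    Assumes each event is 'data: <json>' with choices[0].delta.content,
    and that the stream terminates with 'data: [DONE]'.
    """
    parts = []
    for line in sse_lines:
        if not line.startswith("data: "):
            continue  # skip blank keep-alives and comments
        payload = line[len("data: "):]
        if payload.strip() == "[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"].get("content")
        if delta:
            parts.append(delta)
    return "".join(parts)

stream = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    "data: [DONE]",
]
print(extract_deltas(stream))  # Hello
```

The first chunk typically carries only the role with no content, which is why the parser skips empty deltas.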
Feature Comparison Matrix
| Capability | Ollama | vLLM | llama.cpp |
|---|---|---|---|
| Primary target | Local dev / prototyping | GPU production serving | CPU/edge + flexible GPU |
| Language | Go (wraps llama.cpp) | Python + CUDA C++ | C/C++ |
| GPU support | CUDA, Metal, ROCm | CUDA (primary), ROCm | CUDA, Metal, Vulkan, ROCm, SYCL |
| CPU inference | Yes (via llama.cpp) | Limited (OpenVINO) | Excellent (AVX2/512, NEON) |
| Continuous batching | No | Yes (PagedAttention) | No (basic queue) |
| Tensor parallelism | No | Yes (multi-GPU) | Limited (layer splitting) |
| Quantization | GGUF (Q2-Q8) | AWQ, GPTQ, SqueezeLLM, FP8 | GGUF (Q2-Q8) |
| Model format | GGUF (Modelfile) | HuggingFace / SafeTensors | GGUF |
| Speculative decoding | No | Yes | Yes |
| Prefix caching | Yes (automatic) | Yes (automatic) | Yes (prompt cache) |
| Setup complexity | Low (single binary) | Medium (Python env + CUDA) | Medium (compile from source) |
PagedAttention: Why vLLM Wins at Scale
The core innovation in vLLM is PagedAttention, published by UC Berkeley in 2023. Traditional inference engines allocate a contiguous block of GPU memory for each request's KV cache, sized to the maximum possible sequence length. This leads to massive internal fragmentation -- if you allocate 8K tokens but only use 500, the rest is wasted.
PagedAttention divides the KV cache into fixed-size pages (typically 16 tokens per page). Pages are allocated on demand and can be non-contiguous in physical GPU memory. A page table maps logical token positions to physical memory locations, exactly like virtual memory in an OS kernel.
The impact is dramatic:
- Memory waste drops from 60-80% to under 4% -- only pages that actually hold tokens consume memory
- More concurrent requests fit in the same VRAM -- memory saved on KV cache means more room for batching
- Continuous batching becomes practical -- new requests join the batch immediately without waiting for the longest request to finish
- Shared prefixes are free -- if multiple requests share the same system prompt, they share the same KV cache pages via copy-on-write
Pro tip: If you're serving more than 2-3 concurrent users on GPU, PagedAttention alone justifies choosing vLLM. The throughput difference at scale isn't a percentage improvement -- it's an order of magnitude. For single-user scenarios, this advantage disappears entirely.
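The block-table mechanics can be sketched in a few lines. This is a toy model, not vLLM's implementation: a per-sequence block table maps logical token positions to physical pages grabbed on demand from a shared free pool, so a sequence only pays for the tokens it has actually generated.

```python
BLOCK_SIZE = 16  # tokens per page (vLLM's default block size)

class PagedKVCache:
    """Toy block allocator: pages come from a shared free pool on demand."""

    def __init__(self, total_blocks: int):
        self.free = list(range(total_blocks))
        self.tables = {}  # seq_id -> list of physical block ids

    def append_token(self, seq_id: str, pos: int) -> int:
        table = self.tables.setdefault(seq_id, [])
        if pos // BLOCK_SIZE >= len(table):   # crossed into a new page
            table.append(self.free.pop())     # allocate on demand
        return table[pos // BLOCK_SIZE]       # physical block for this token

cache = PagedKVCache(total_blocks=1024)
for pos in range(500):        # a 500-token sequence...
    cache.append_token("req-1", pos)

blocks = len(cache.tables["req-1"])
print(f"blocks used: {blocks} ({blocks * BLOCK_SIZE} slots for 500 tokens)")
# vs contiguous pre-allocation at an 8K max: 8192 slots reserved, ~94% wasted
```

The real system adds copy-on-write so that sequences sharing a system-prompt prefix point at the same physical pages, which this sketch omits.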
Frequently Asked Questions
Which inference engine should I use for local development?
Ollama. It's the fastest path from zero to a working LLM on your machine. Install the binary, run ollama pull llama3, and you have a model serving on localhost with an OpenAI-compatible API. It handles model management, automatic GPU detection, and memory allocation. When you're iterating on prompts or building application logic, Ollama removes friction that llama.cpp and vLLM introduce with their setup requirements.
When does vLLM become necessary over llama.cpp?
When you need to serve concurrent users. If your service handles more than 2-3 simultaneous requests, vLLM's continuous batching and PagedAttention deliver 5-10x higher total throughput. The break-even point is around 4 concurrent requests -- below that, llama.cpp's simpler architecture adds less overhead. Above that, vLLM's batching efficiency compounds. Any production API, chatbot, or multi-user application should default to vLLM.
Can I run a 70B model on a single RTX 4090 (24GB)?
Only with aggressive quantization. A 70B model at Q4_K_M needs approximately 42GB, which exceeds the 4090's 24GB VRAM. You can use llama.cpp's layer offloading (-ngl) to put some layers on GPU and the rest on CPU RAM, but performance drops significantly for CPU-resident layers. For a 70B model on consumer hardware, you either need two 4090s, an RTX 6000 Ada (48GB), or you accept CPU-speed inference for a portion of the model. Alternatively, run a 34B model which fits comfortably at Q4_K_M in 24GB.
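A back-of-envelope for that -ngl split, as a rough sketch: it assumes a 70B-class model has about 80 transformer layers with weights spread evenly across them, and reserves a couple of GB of VRAM for KV cache and scratch buffers (both assumptions, not measured values).

```python
def layers_on_gpu(model_gb: float, n_layers: int, vram_gb: float,
                  reserve_gb: float = 2.0) -> int:
    """Rough -ngl estimate: layers that fit after reserving VRAM for the
    KV cache and scratch buffers. Assumes an even per-layer weight split."""
    per_layer_gb = model_gb / n_layers
    return min(n_layers, int((vram_gb - reserve_gb) / per_layer_gb))

# 70B at Q4_K_M (~42GB over ~80 layers) on a 24GB RTX 4090
print(f"~{layers_on_gpu(42, 80, 24)} of 80 layers fit on the GPU")
```

Roughly half the layers stay on the GPU, which is why the CPU-resident half dominates total latency: every token still has to traverse the slow layers.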
How do AWQ and GGUF quantization compare?
GGUF (used by llama.cpp/Ollama) is designed for CPU+GPU hybrid inference with flexible quantization levels (Q2 through Q8). AWQ (used by vLLM) is GPU-optimized with 4-bit quantization that preserves "salient" weights at higher precision. In practice, AWQ delivers slightly better quality-per-bit on GPU because its quantization-aware algorithm protects important weights, and vLLM's CUDA kernels are optimized for the AWQ format. GGUF is more flexible across hardware targets. Use AWQ for GPU-only vLLM deployments, GGUF for everything else.
Does Ollama add significant overhead compared to raw llama.cpp?
Roughly 5-10% on single-request throughput. Ollama's management layer adds latency for model loading decisions, API translation, and process management. Once the model is loaded and generating tokens, the core inference path is llama.cpp's code. The TTFT difference is more noticeable -- Ollama adds 20-40ms for request routing. For development and low-concurrency use, this overhead is negligible. For latency-sensitive production, use llama.cpp's server directly or switch to vLLM.
Can vLLM run on CPU?
Technically yes, through the OpenVINO backend, but it's not vLLM's strength. CPU inference on vLLM is 30-50% slower than llama.cpp on the same hardware because vLLM's architecture is optimized for GPU memory management patterns that don't translate to CPU. If your deployment target is CPU-only or edge devices, use llama.cpp. vLLM's advantages (PagedAttention, continuous batching) only materialize on GPU hardware with high concurrency.
What is the best engine for edge deployment or embedded systems?
llama.cpp, without question. It compiles to a static binary with minimal dependencies, supports ARM NEON for mobile/embedded processors, runs on as little as 2GB RAM with small models (TinyLlama 1.1B at Q4), and has no Python runtime requirement. It's been ported to Android, iOS, Raspberry Pi, and WebAssembly. Neither Ollama nor vLLM can match llama.cpp's portability for constrained environments.
The Decision Framework
After extensive benchmarking and production experience, the decision is straightforward. Use Ollama for local development and prototyping -- it eliminates setup friction and gives you an OpenAI-compatible API in seconds. Use vLLM for production GPU serving -- PagedAttention and continuous batching deliver 10x throughput under concurrent load, which is the only metric that matters for a production API. Use llama.cpp for CPU inference, edge deployment, and maximum hardware flexibility -- it runs everywhere, uses the least memory via GGUF quantization, and gives you the most control over the inference pipeline.
These engines are not competitors in the way databases or web frameworks compete. They occupy different niches in the inference stack. Many teams use all three: Ollama on developer laptops, vLLM behind the production load balancer, and llama.cpp on edge nodes. Pick the right tool for each deployment target, and move on to the harder problems.
Written by
Abhishek Patel
Infrastructure engineer with 10+ years building production systems on AWS, GCP, and bare metal. Writes practical guides on cloud architecture, containers, networking, and Linux for developers who want to understand how things actually work under the hood.