
Ollama vs vLLM vs llama.cpp: LLM Inference Engines Compared

Benchmarks and architecture comparison of Ollama, vLLM, and llama.cpp. Tokens/sec at 7B through 70B, quantization trade-offs, concurrent throughput, VRAM requirements, and a clear decision framework for local dev, production, and edge.

Abhishek Patel · 13 min read


Three Inference Engines, Three Design Philosophies

Self-hosting LLMs means picking an inference engine, and the three that dominate the landscape in 2026 are Ollama, vLLM, and llama.cpp. They share the same goal -- running open-weight models locally or on your own servers -- but they make fundamentally different trade-offs in architecture, performance, and usability. I've deployed all three in production and development environments, and the differences are not subtle.

This article covers benchmarks (tokens/sec, time to first token, concurrent throughput, memory consumption) at 7B, 13B, 34B, and 70B parameter scales. I'll break down quantization trade-offs, explain why vLLM's PagedAttention matters for production, and give you a clear decision framework based on your actual use case.

What Are LLM Inference Engines?

Definition: An LLM inference engine is software that loads trained model weights into memory (RAM or VRAM), processes input prompts through the model's transformer layers, and generates output tokens. Inference engines optimize this process through techniques like quantization, batching, memory management, and hardware-specific kernel optimizations. They differ from training frameworks -- inference engines only run the forward pass, never backpropagation.

The choice of inference engine affects every dimension you care about: raw throughput, latency, memory efficiency, concurrent user support, API compatibility, and operational complexity. Let's start with what each engine actually is before we benchmark them.

Architecture Overview

Ollama: The Developer Experience Layer

Ollama wraps llama.cpp in a user-friendly CLI and REST API. It manages model downloads, quantization variants, and lifecycle (loading/unloading models from memory). Under the hood, it fork-execs llama.cpp's server process. Ollama's value is developer experience -- ollama run llama3 gets you a working model in seconds. The trade-off is that you inherit llama.cpp's single-request architecture with some overhead from the management layer.

# Ollama: from zero to inference in two commands
ollama pull llama3:8b-instruct-q4_K_M
ollama run llama3:8b-instruct-q4_K_M "Explain PagedAttention in three sentences"

# API access (OpenAI-compatible)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3:8b-instruct-q4_K_M",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

vLLM: The Production Throughput Engine

vLLM is a GPU-first inference engine built around PagedAttention -- an algorithm that manages KV cache memory the way operating systems manage virtual memory, using fixed-size pages instead of contiguous blocks. This eliminates memory fragmentation and enables efficient continuous batching of concurrent requests. vLLM is designed for serving, not tinkering.

# vLLM: launch a production-ready API server
pip install vllm
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --dtype auto \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9 \
  --enable-chunked-prefill

# Same OpenAI-compatible API
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

llama.cpp: The Bare Metal Foundation

llama.cpp is a C/C++ implementation of LLM inference optimized for CPU and heterogeneous hardware. It pioneered the GGUF quantization format and supports AVX2, AVX-512, ARM NEON, Metal, CUDA, and Vulkan backends. It's the engine that proved LLMs could run on consumer hardware. Both Ollama and many other tools build on top of it.

# llama.cpp: compile and run with full control
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && cmake -B build -DGGML_CUDA=ON && cmake --build build -j

# Run inference with explicit parameters
./build/bin/llama-server \
  -m models/llama-3.1-8b-instruct-Q4_K_M.gguf \
  -c 8192 \
  -ngl 99 \
  --host 0.0.0.0 \
  --port 8080 \
  -t 8

Benchmark Methodology

All benchmarks were conducted on standardized hardware to ensure fair comparison. Single-user tests measure raw speed. Concurrent tests measure throughput under load -- which is where the engines diverge dramatically.

| Test Configuration | Details |
|---|---|
| GPU Hardware | NVIDIA A100 80GB, NVIDIA RTX 4090 24GB |
| CPU Hardware | AMD EPYC 9654 (96-core), AMD Ryzen 9 7950X |
| System RAM | 256GB DDR5-4800 (server), 64GB DDR5-5600 (desktop) |
| Prompt Length | 512 tokens input, 256 tokens output |
| Concurrent Test | 1, 4, 8, 16, 32 simultaneous requests |
| Models | Llama 3.1 8B, Llama 3.1 70B, Qwen 2.5 14B, CodeLlama 34B |

Single-User Performance: Tokens Per Second

With a single request and no concurrency, the differences are measurable but less dramatic than you might expect on GPU. The gap widens significantly on CPU.

7B/8B Models (A100 80GB, Q4_K_M where applicable)

| Engine | Format | Tokens/sec | TTFT (ms) | VRAM Used |
|---|---|---|---|---|
| vLLM | FP16 | 118 t/s | 45 | 16.2GB |
| llama.cpp (CUDA) | Q4_K_M | 105 t/s | 62 | 5.8GB |
| Ollama | Q4_K_M | 98 t/s | 85 | 5.8GB |
| llama.cpp (CPU) | Q4_K_M | 18 t/s | 1,200 | 5.4GB RAM |
| vLLM (AWQ) | W4A16 | 132 t/s | 38 | 6.1GB |

13B/14B Models (A100 80GB)

| Engine | Format | Tokens/sec | TTFT (ms) | VRAM Used |
|---|---|---|---|---|
| vLLM | FP16 | 82 t/s | 68 | 28.5GB |
| llama.cpp (CUDA) | Q4_K_M | 72 t/s | 95 | 9.6GB |
| Ollama | Q4_K_M | 65 t/s | 120 | 9.6GB |
| llama.cpp (CPU) | Q4_K_M | 10 t/s | 2,400 | 9.2GB RAM |

34B Models (A100 80GB)

| Engine | Format | Tokens/sec | TTFT (ms) | VRAM Used |
|---|---|---|---|---|
| vLLM | FP16 | 42 t/s | 125 | 68.2GB |
| llama.cpp (CUDA) | Q4_K_M | 38 t/s | 160 | 21.5GB |
| Ollama | Q4_K_M | 34 t/s | 195 | 21.5GB |
| llama.cpp (CPU) | Q4_K_M | 4.8 t/s | 5,200 | 21GB RAM |

70B Models (A100 80GB)

| Engine | Format | Tokens/sec | TTFT (ms) | VRAM Used |
|---|---|---|---|---|
| vLLM | AWQ W4 | 35 t/s | 180 | 38.5GB |
| llama.cpp (CUDA) | Q4_K_M | 28 t/s | 240 | 42GB |
| Ollama | Q4_K_M | 25 t/s | 310 | 42GB |
| llama.cpp (CPU) | Q4_K_M | 3.1 t/s | 9,500 | 42GB RAM |

Pro tip: vLLM's single-user advantage is modest (10-20%), but its VRAM consumption is higher at FP16. The real vLLM advantage shows up under concurrency. If you're running single-user inference on a GPU, llama.cpp with CUDA gives you 90% of the speed at 35-60% of the VRAM.

Concurrent Request Performance: Where vLLM Dominates

This is where the architectural differences matter most. vLLM's continuous batching and PagedAttention let it serve many concurrent users efficiently. Ollama and llama.cpp process requests sequentially by default -- each new request waits in a queue.

8B Model, A100 80GB -- Total Throughput (tokens/sec across all requests)

| Concurrent Requests | vLLM (FP16) | llama.cpp (CUDA Q4_K_M) | Ollama (Q4_K_M) |
|---|---|---|---|
| 1 | 118 t/s | 105 t/s | 98 t/s |
| 4 | 410 t/s | 115 t/s | 102 t/s |
| 8 | 720 t/s | 118 t/s | 104 t/s |
| 16 | 1,050 t/s | 120 t/s | 105 t/s |
| 32 | 1,280 t/s | 122 t/s | 106 t/s |

At 32 concurrent requests, vLLM delivers 10x the total throughput of llama.cpp and 12x that of Ollama. This is the PagedAttention effect. vLLM batches multiple requests together, sharing compute across them. llama.cpp processes them one at a time (its server has basic batching, but nothing close to vLLM's continuous batching). The per-user latency on vLLM increases modestly under load, while on llama.cpp/Ollama, users queue up and wait.
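The queueing effect is easy to see with a toy model. The sketch below uses illustrative numbers (not the benchmark figures): a sequential server decodes one request at a time, so the last user waits for everyone ahead of them, while a continuously batched server decodes all streams together at a slightly lower per-stream rate.

```python
# Toy latency model: sequential queue vs continuous batching.
# Numbers are illustrative assumptions, not measured benchmark values.

def sequential_total_time(n_requests, tokens_per_req=256, tps=100):
    # Each request waits for all earlier ones to finish decoding.
    return n_requests * tokens_per_req / tps

def batched_total_time(n_requests, tokens_per_req=256, per_stream_tps=80):
    # All requests decode together; each stream runs a bit slower,
    # but every user finishes at roughly the same time.
    return tokens_per_req / per_stream_tps

print(sequential_total_time(8))  # 20.48 -- the last queued user waits ~20s
print(batched_total_time(8))     # 3.2   -- every user waits ~3s
```

The per-stream slowdown under batching is the assumption doing the work here; in practice vLLM's shared compute keeps it small, which is why total throughput scales so steeply in the table above.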

Memory Consumption Comparison

VRAM and RAM usage varies significantly because these engines use different model formats and memory management strategies.

| Model Size | vLLM FP16 | vLLM AWQ W4 | llama.cpp Q4_K_M | llama.cpp Q8_0 |
|---|---|---|---|---|
| 7B/8B | 16.2GB | 6.1GB | 5.8GB | 8.5GB |
| 13B/14B | 28.5GB | 10.2GB | 9.6GB | 15.2GB |
| 34B | 68.2GB | 22.5GB | 21.5GB | 36GB |
| 70B | 140GB+ | 38.5GB | 42GB | 72GB |

Watch out: vLLM at FP16 consumes roughly 2-3x the VRAM of quantized llama.cpp models. A 70B model in FP16 won't fit on a single A100 80GB. You'll need tensor parallelism across multiple GPUs or switch to AWQ/GPTQ quantization in vLLM. Always account for KV cache overhead on top of model weights -- at 8K context with 32 concurrent requests, KV cache alone can consume 10-20GB.
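You can estimate KV cache overhead yourself. The sketch below uses Llama 3.1 8B's published architecture (32 layers, 8 KV heads under GQA, head dimension 128) as assumed inputs; it computes the worst case where every request fills its full context, which is why it lands above the 10-20GB typical range -- PagedAttention only allocates pages that actually hold tokens.

```python
# Back-of-envelope KV cache sizing. Architecture numbers are assumptions
# for Llama 3.1 8B, not vLLM's exact internal accounting.

def kv_cache_gib(n_layers=32, n_kv_heads=8, head_dim=128,
                 bytes_per_elem=2, context=8192, batch=32):
    # 2x for keys and values, per layer, per KV head, per token (FP16 = 2 bytes).
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * context * batch / (1024 ** 3)

# Ceiling: 32 requests each filling the full 8K context.
print(f"{kv_cache_gib():.0f} GiB")  # 32 GiB worst case; typical usage is
                                    # lower because contexts are rarely full
```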

Quantization Trade-offs: Q4_K_M vs Q5_K_S vs Q8_0

llama.cpp and Ollama use the GGUF quantization format. vLLM supports AWQ and GPTQ. The quality-size trade-off is critical when choosing.

| Format | Bits/Weight | 8B Model Size | Perplexity Impact | Speed Impact | Best For |
|---|---|---|---|---|---|
| FP16 | 16 | 16GB | Baseline | Baseline | Maximum quality, GPU with ample VRAM |
| Q8_0 | 8.5 | 8.5GB | +0.01 PPL (~lossless) | +5% faster (less memory bandwidth) | Quality-critical apps with moderate VRAM |
| Q5_K_S | 5.5 | 5.6GB | +0.04 PPL | +15% faster | Balanced quality and speed |
| Q4_K_M | 4.8 | 5.0GB | +0.08 PPL | +20% faster | Speed-optimized, general use |
| AWQ W4 (vLLM) | 4 | 4.5GB | +0.05 PPL | +25% faster (GPU-optimized kernels) | vLLM production with limited VRAM |

Q4_K_M is the sweet spot for most use cases -- it's the default in Ollama for good reason. Q5_K_S gives a small quality bump worth considering for tasks requiring precise reasoning or code generation. Q8_0 is essentially lossless but uses 70% more memory than Q4_K_M. AWQ is the best option for vLLM because its kernels are GPU-optimized, delivering better throughput per VRAM GB than GGUF on GPU.
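The sizes in the table follow directly from bits per weight. A quick estimator (weights only -- runtime VRAM adds KV cache, activations, and framework overhead on top), with the small gap to measured GGUF sizes explained by mixed-precision tensors like embeddings:

```python
# Weight-size estimate from parameter count and effective bits per weight.
# Measured GGUF files run slightly larger because some tensors stay at
# higher precision.

def weights_gb(n_params_billion, bits_per_weight):
    # bits -> bytes (/8); billions of params cancel against GB.
    return round(n_params_billion * bits_per_weight / 8, 2)

print(weights_gb(8, 16))   # 16.0 -- matches the FP16 row
print(weights_gb(8, 4.8))  # 4.8  -- close to the 5.0GB Q4_K_M figure
print(weights_gb(70, 4.8)) # 42.0 -- matches the 70B Q4_K_M row
```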

Hardware Requirements by VRAM Tier

Mapping your GPU to the largest model you can comfortably serve:

| VRAM | GPU Examples | Max Model (FP16) | Max Model (Q4) | Recommended Engine |
|---|---|---|---|---|
| 8GB | RTX 4060, RTX 3070 | 3B | 7B | llama.cpp / Ollama |
| 12GB | RTX 4070 | 7B | 13B | llama.cpp / Ollama |
| 16GB | RTX 4080 | 8B | 14B | llama.cpp / Ollama / vLLM (AWQ) |
| 24GB | RTX 4090, A5000 | 13B | 34B | Any engine |
| 48GB | RTX 6000 Ada, A6000 | 24B | 70B | vLLM for serving, llama.cpp for dev |
| 80GB | A100, H100 | 34B | 70B+ (with KV cache room) | vLLM for production |
| CPU only | Any (64GB+ RAM) | N/A | 7B-13B usable, 70B possible | llama.cpp |

API Compatibility

All three engines now support OpenAI-compatible API endpoints, which means your application code stays the same regardless of backend.

| Feature | Ollama | vLLM | llama.cpp (server) |
|---|---|---|---|
| /v1/chat/completions | Yes | Yes | Yes |
| /v1/completions | Yes | Yes | Yes |
| /v1/embeddings | Yes | Yes | Yes |
| Streaming (SSE) | Yes | Yes | Yes |
| Function calling | Yes | Yes | Partial |
| Vision (multimodal) | Yes | Yes | Yes (LLaVA, etc.) |
| Guided generation (JSON) | Partial (format param) | Yes (outlines) | Yes (grammar) |
| LoRA hot-swap | No | Yes | Yes |
| Multi-model serving | Yes (auto-load) | Yes (multi-lora) | No (single model) |

# All three work with the OpenAI Python SDK
from openai import OpenAI

# Ollama
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# vLLM
client = OpenAI(base_url="http://localhost:8000/v1", api_key="token")

# llama.cpp
client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

response = client.chat.completions.create(
    model="llama3",  # model name varies by engine
    messages=[{"role": "user", "content": "Compare these inference engines"}],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)

Feature Comparison Matrix

| Capability | Ollama | vLLM | llama.cpp |
|---|---|---|---|
| Primary target | Local dev / prototyping | GPU production serving | CPU/edge + flexible GPU |
| Language | Go (wraps llama.cpp) | Python + CUDA C++ | C/C++ |
| GPU support | CUDA, Metal, ROCm | CUDA (primary), ROCm | CUDA, Metal, Vulkan, ROCm, SYCL |
| CPU inference | Yes (via llama.cpp) | Limited (OpenVINO) | Excellent (AVX2/512, NEON) |
| Continuous batching | No | Yes (PagedAttention) | No (basic queue) |
| Tensor parallelism | No | Yes (multi-GPU) | Limited (layer splitting) |
| Quantization | GGUF (Q2-Q8) | AWQ, GPTQ, SqueezeLLM, FP8 | GGUF (Q2-Q8) |
| Model format | GGUF (Modelfile) | HuggingFace / SafeTensors | GGUF |
| Speculative decoding | No | Yes | Yes |
| Prefix caching | Yes (automatic) | Yes (automatic) | Yes (prompt cache) |
| Setup complexity | Low (single binary) | Medium (Python env + CUDA) | Medium (compile from source) |

PagedAttention: Why vLLM Wins at Scale

The core innovation in vLLM is PagedAttention, published by UC Berkeley in 2023. Traditional inference engines allocate a contiguous block of GPU memory for each request's KV cache, sized to the maximum possible sequence length. This leads to massive internal fragmentation -- if you allocate 8K tokens but only use 500, the rest is wasted.

PagedAttention divides the KV cache into fixed-size pages (typically 16 tokens per page). Pages are allocated on demand and can be non-contiguous in physical GPU memory. A page table maps logical token positions to physical memory locations, exactly like virtual memory in an OS kernel.

The impact is dramatic:

  • Memory waste drops from 60-80% to under 4% -- only pages that actually hold tokens are allocated
  • More concurrent requests fit in the same VRAM -- memory saved on KV cache means more room for batching
  • Continuous batching becomes practical -- new requests join the batch immediately without waiting for the longest request to finish
  • Shared prefixes are free -- if multiple requests share the same system prompt, they share the same KV cache pages via copy-on-write
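The fragmentation numbers above are easy to reproduce. A minimal sketch, assuming the contiguous allocator reserves the full 8K maximum up front and the paged allocator uses vLLM's default 16-token pages:

```python
import math

PAGE_SIZE = 16  # tokens per page -- vLLM's default block size

def contiguous_waste(seq_len, max_len=8192):
    # Traditional allocator: reserve max_len slots per request up front,
    # so everything past the actual sequence length is wasted.
    return (max_len - seq_len) / max_len

def paged_waste(seq_len):
    # Paged allocator: allocate ceil(seq_len / PAGE_SIZE) pages on demand;
    # only the last page can be partially empty.
    pages = math.ceil(seq_len / PAGE_SIZE)
    return (pages * PAGE_SIZE - seq_len) / (pages * PAGE_SIZE)

# A 500-token sequence against an 8K reservation:
print(f"{contiguous_waste(500):.0%}")  # 94% of the reservation wasted
print(f"{paged_waste(500):.1%}")       # 2.3% -- under one page
```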

Pro tip: If you're serving more than 2-3 concurrent users on GPU, PagedAttention alone justifies choosing vLLM. The throughput difference at scale isn't a percentage improvement -- it's an order of magnitude. For single-user scenarios, this advantage disappears entirely.

Frequently Asked Questions

Which inference engine should I use for local development?

Ollama. It's the fastest path from zero to a working LLM on your machine. Install the binary, run ollama pull llama3, and you have a model serving on localhost with an OpenAI-compatible API. It handles model management, automatic GPU detection, and memory allocation. When you're iterating on prompts or building application logic, Ollama removes friction that llama.cpp and vLLM introduce with their setup requirements.

When does vLLM become necessary over llama.cpp?

When you need to serve concurrent users. If your service handles more than 2-3 simultaneous requests, vLLM's continuous batching and PagedAttention deliver 5-10x higher total throughput. The break-even point is around 4 concurrent requests -- below that, llama.cpp's simpler architecture adds less overhead. Above that, vLLM's batching efficiency compounds. Any production API, chatbot, or multi-user application should default to vLLM.

Can I run a 70B model on a single RTX 4090 (24GB)?

Only with aggressive quantization. A 70B model at Q4_K_M needs approximately 42GB, which exceeds the 4090's 24GB VRAM. You can use llama.cpp's layer offloading (-ngl) to put some layers on GPU and the rest on CPU RAM, but performance drops significantly for CPU-resident layers. For a 70B model on consumer hardware, you either need two 4090s, an RTX 6000 Ada (48GB), or you accept CPU-speed inference for a portion of the model. Alternatively, run a 34B model which fits comfortably at Q4_K_M in 24GB.
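To pick a starting value for -ngl, divide usable VRAM by the average per-layer weight size. The numbers below are back-of-envelope assumptions (the 42GB figure comes from the tables above; the 3GB reserve for KV cache and CUDA context is a guess) -- check llama.cpp's startup logs for actual per-layer sizes.

```python
# Rough -ngl estimate for a 70B Q4_K_M model on a 24GB card.
# All constants are illustrative assumptions, not measured values.
MODEL_GB = 42     # 70B at Q4_K_M, from the benchmark tables
N_LAYERS = 80     # Llama-family 70B transformer layer count
VRAM_GB = 24
RESERVE_GB = 3    # headroom for KV cache + CUDA context (assumption)

# Layers that fit: usable VRAM divided by average per-layer weight size.
gpu_layers = int((VRAM_GB - RESERVE_GB) * N_LAYERS / MODEL_GB)
print(gpu_layers)  # 40 -- roughly half the model still runs from CPU RAM
```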

How do AWQ and GGUF quantization compare?

GGUF (used by llama.cpp/Ollama) is designed for CPU+GPU hybrid inference with flexible quantization levels (Q2 through Q8). AWQ (used by vLLM) is GPU-optimized with 4-bit quantization that preserves "salient" weights at higher precision. In practice, AWQ delivers slightly better quality-per-bit on GPU because its quantization-aware algorithm protects important weights, and vLLM's CUDA kernels are optimized for the AWQ format. GGUF is more flexible across hardware targets. Use AWQ for GPU-only vLLM deployments, GGUF for everything else.

Does Ollama add significant overhead compared to raw llama.cpp?

Roughly 5-10% on single-request throughput. Ollama's management layer adds latency for model loading decisions, API translation, and process management. Once the model is loaded and generating tokens, the core inference path is llama.cpp's code. The TTFT difference is more noticeable -- Ollama adds 20-40ms for request routing. For development and low-concurrency use, this overhead is negligible. For latency-sensitive production, use llama.cpp's server directly or switch to vLLM.

Can vLLM run on CPU?

Technically yes, through the OpenVINO backend, but it's not vLLM's strength. CPU inference on vLLM is 30-50% slower than llama.cpp on the same hardware because vLLM's architecture is optimized for GPU memory management patterns that don't translate to CPU. If your deployment target is CPU-only or edge devices, use llama.cpp. vLLM's advantages (PagedAttention, continuous batching) only materialize on GPU hardware with high concurrency.

What is the best engine for edge deployment or embedded systems?

llama.cpp, without question. It compiles to a static binary with minimal dependencies, supports ARM NEON for mobile/embedded processors, runs on as little as 2GB RAM with small models (TinyLlama 1.1B at Q4), and has no Python runtime requirement. It's been ported to Android, iOS, Raspberry Pi, and WebAssembly. Neither Ollama nor vLLM can match llama.cpp's portability for constrained environments.

The Decision Framework

After extensive benchmarking and production experience, the decision is straightforward. Use Ollama for local development and prototyping -- it eliminates setup friction and gives you an OpenAI-compatible API in seconds. Use vLLM for production GPU serving -- PagedAttention and continuous batching deliver 10x throughput under concurrent load, which is the only metric that matters for a production API. Use llama.cpp for CPU inference, edge deployment, and maximum hardware flexibility -- it runs everywhere, uses the least memory via GGUF quantization, and gives you the most control over the inference pipeline.

These engines are not competitors in the way databases or web frameworks compete. They occupy different niches in the inference stack. Many teams use all three: Ollama on developer laptops, vLLM behind the production load balancer, and llama.cpp on edge nodes. Pick the right tool for each deployment target, and move on to the harder problems.


Written by

Abhishek Patel

Infrastructure engineer with 10+ years building production systems on AWS, GCP, and bare metal. Writes practical guides on cloud architecture, containers, networking, and Linux for developers who want to understand how things actually work under the hood.
