LLM API Pricing Compared (2026): OpenAI vs Anthropic vs Google vs Open Source
Per-token pricing, caching credits, batch discounts, and hidden costs across OpenAI, Anthropic, Google, and open-source LLM providers. Includes four real workload simulations and cost optimization strategies.

LLM API Pricing Has Never Been More Confusing
Eighteen months ago you had two serious options: OpenAI or Anthropic. Today there are at least ten providers worth evaluating, each with its own pricing model, discount mechanics, and hidden surcharges. Per-token rates only tell half the story -- caching credits, batch discounts, fine-tuning hosting fees, and rate-limit tiers change the effective cost by 2-5x depending on how you call the API.
I maintain LLM cost dashboards for multiple production systems, and the pattern I see repeatedly is teams choosing a provider based on headline price, then discovering that their actual bill is 40-60% higher because of output-heavy workloads, cache misses, or rate-limit-driven over-provisioning. This guide gives you the full picture -- every major provider, every pricing lever, and four real workload simulations so you can model your own costs accurately.
Understanding LLM API Pricing Models
Definition: LLM API pricing is the cost structure for accessing large language model inference through a hosted API. Providers charge per token (input and output separately), with rates varying by model size, capability tier, and volume. Additional costs include prompt caching fees, batch processing discounts, fine-tuning compute, rate-limit upgrades, and platform markups on third-party hosting.
Every provider bills on a per-token basis, but input and output tokens are priced differently. Output tokens cost 2-5x more because each one requires a full sequential forward pass through the model, while input tokens are processed in parallel. This distinction matters enormously: a summarization workload (high input, low output) has a fundamentally different cost profile than a code-generation workload (low input, high output).
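The asymmetry can be sketched with a small cost function. This is a minimal model using GPT-4.1's rates from the flagship table below ($2.00/1M input, $8.00/1M output); the two workload shapes are illustrative.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_rate: float = 2.00, output_rate: float = 8.00) -> float:
    """Per-request cost in dollars; rates are per 1M tokens (GPT-4.1 defaults)."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# Summarization: input-heavy (8,000 in, 500 out)
summarize = request_cost(8_000, 500)    # $0.020 -- only 20% of cost is output
# Code generation: output-heavy (2,000 in, 1,500 out)
codegen = request_cost(2_000, 1_500)    # $0.016 -- 75% of cost is output
```

The two requests cost nearly the same despite the code-generation call processing less than a third of the tokens: output pricing dominates as soon as generation length grows.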
Per-Token Pricing: Flagship Models Compared
These are the frontier-class models you would reach for when accuracy and reasoning quality matter most. Prices are per million tokens as of April 2026.
| Provider / Model | Input (per 1M) | Output (per 1M) | Context Window | Notes |
|---|---|---|---|---|
| OpenAI GPT-4.1 | $2.00 | $8.00 | 1M | Long-context flagship |
| OpenAI o3 | $10.00 | $40.00 | 200K | Reasoning model; thinking tokens billed as output |
| OpenAI o4-mini | $1.10 | $4.40 | 200K | Compact reasoning model |
| Anthropic Claude Opus 4 | $15.00 | $75.00 | 200K | Highest capability; extended thinking extra |
| Anthropic Claude Sonnet 4 | $3.00 | $15.00 | 200K | Best quality-to-cost ratio in class |
| Anthropic Claude Haiku 3.5 | $0.80 | $4.00 | 200K | Speed-optimized |
| Google Gemini 2.5 Pro | $1.25 | $10.00 | 1M | Free tier available (rate-limited) |
| Google Gemini 2.5 Flash | $0.15 | $0.60 | 1M | Thinking tokens: $0.70/1M output |
| Mistral Large | $2.00 | $6.00 | 128K | Strong multilingual support |
| DeepSeek V3 | $0.27 | $1.10 | 128K | Cache hits: $0.07/1M input |
| DeepSeek R1 | $0.55 | $2.19 | 128K | Reasoning model; open-weights |
Watch out: Reasoning models (o3, o4-mini, DeepSeek R1) generate internal "thinking" tokens that are billed as output tokens but never shown to the user. A single reasoning call can produce 5,000-20,000 thinking tokens before the visible response begins. This makes reasoning models 3-10x more expensive per request than their headline output price suggests.
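The effective cost of a reasoning call can be modeled directly. This sketch uses o3's rates ($10/1M input, $40/1M output); the 10,000 thinking tokens are an assumed value inside the 5,000-20,000 range described above, not a measured figure.

```python
def reasoning_request_cost(input_tokens: int, visible_output: int,
                           thinking_tokens: int,
                           input_rate: float = 10.0,
                           output_rate: float = 40.0) -> float:
    """Per-request cost in dollars; thinking tokens are billed as output."""
    billed_output = visible_output + thinking_tokens
    return (input_tokens * input_rate + billed_output * output_rate) / 1_000_000

headline = reasoning_request_cost(3_000, 500, 0)       # $0.05 -- naive estimate
actual = reasoning_request_cost(3_000, 500, 10_000)    # $0.45 -- 9x higher
```

A 9x gap from a mid-range thinking budget is exactly why headline rates mislead for reasoning models: the multiplier scales with how much the model chooses to think, which you don't fully control.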
Inference Hosting Platforms: Open-Source Model Pricing
If you want to run open-weight models (Llama 3.1, Mistral, DeepSeek, Qwen) without managing GPUs, several inference platforms offer per-token pricing that undercuts first-party APIs significantly.
| Platform | Model Example | Input (per 1M) | Output (per 1M) | Rate Limits |
|---|---|---|---|---|
| Together AI | Llama 3.1 405B | $3.50 | $3.50 | 600 RPM default |
| Together AI | Llama 3.1 70B | $0.88 | $0.88 | 600 RPM default |
| Fireworks AI | Llama 3.1 405B | $3.00 | $3.00 | 600 RPM, 10M TPM |
| Fireworks AI | Llama 3.1 70B | $0.90 | $0.90 | 600 RPM, 10M TPM |
| Groq | Llama 3.1 70B | $0.59 | $0.79 | 30 RPM free, 1000 RPM paid |
| Groq | Llama 3.3 70B | $0.59 | $0.79 | Ultra-low latency (LPU) |
| AWS Bedrock | Claude Sonnet 4 | $3.00 | $15.00 | Provisioned throughput available |
| GCP Vertex AI | Claude Sonnet 4 | $3.00 | $15.00 | Committed use discounts |
| AWS Bedrock | Llama 3.1 70B | $0.72 | $0.72 | On-demand or provisioned |
A few things jump out. Open-weight models on third-party platforms cost 60-80% less than frontier proprietary models for comparable quality tiers. Groq's LPU-based inference delivers the lowest latency in the market, but their free-tier rate limits (30 RPM) make them impractical for production without a paid plan. Bedrock and Vertex charge the same per-token rates as the first-party APIs but offer provisioned throughput options that guarantee capacity -- critical for production workloads that can't tolerate 429 errors.
Hidden Costs That Change the Math
Prompt Caching Credits
Both OpenAI and Anthropic offer prompt caching -- if your requests share a common prefix (system prompt, few-shot examples, or large document context), the cached portion is billed at a reduced rate. OpenAI's cached input tokens cost 50% of standard input price. Anthropic charges 90% less for cache reads but adds a 25% surcharge for cache writes (the first request that populates the cache). DeepSeek offers an automatic cache with 90% discount on cache hits.
For a RAG application sending a 4,000-token system prompt on every request, caching can reduce system-prompt costs by 50-90%. But if your prompts vary significantly between requests and cache hit rates stay below 30%, the write surcharge on Anthropic can actually increase your costs.
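The break-even hit rate falls out of the two multipliers above: Anthropic reads at 10% of the input rate, writes at a 25% surcharge. This sketch assumes every miss is a cache write, which is the worst case.

```python
def effective_input_multiplier(hit_rate: float) -> float:
    """Cost of the cacheable prefix relative to uncached input tokens.
    Reads cost 0.10x, misses (writes) cost 1.25x."""
    return 0.10 * hit_rate + 1.25 * (1 - hit_rate)

# Caching pays off only once the multiplier drops below 1.0:
#   1.25 - 1.15*h < 1.0  =>  h > ~21.7%
for h in (0.10, 0.30, 0.90):
    print(f"hit rate {h:.0%}: {effective_input_multiplier(h):.3f}x")
```

At a 10% hit rate the cached prefix costs about 1.14x what you'd pay with caching disabled, which matches the warning above: below roughly a 22% hit rate, Anthropic's write surcharge makes caching a net loss.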
Batch API Discounts
OpenAI's Batch API offers 50% off input and output tokens for requests that can tolerate 24-hour turnaround. Anthropic's Message Batches API similarly provides 50% discounts. If your workload is offline processing -- document classification, bulk summarization, content moderation at scale -- batch APIs cut costs in half with zero code changes beyond swapping the endpoint.
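The endpoint swap amounts to preparing a JSONL input file: one JSON object per line, each carrying a `custom_id` and the same request body you would send synchronously. This sketch targets OpenAI's Batch API format; the model name, document IDs, and prompt are placeholders.

```python
import json

# Hypothetical documents to summarize in bulk (IDs and text are made up).
docs = {
    "doc-001": "First document text...",
    "doc-002": "Second document text...",
}

with open("summarize_batch.jsonl", "w") as f:
    for doc_id, text in docs.items():
        line = {
            "custom_id": doc_id,            # ties each result back to its input
            "method": "POST",
            "url": "/v1/chat/completions",  # same endpoint as synchronous calls
            "body": {
                "model": "gpt-4.1",
                "messages": [{"role": "user", "content": f"Summarize:\n{text}"}],
                "max_tokens": 500,
            },
        }
        f.write(json.dumps(line) + "\n")
```

From there you upload the file with `purpose="batch"` and create a batch job with a 24-hour completion window; results come back as a JSONL file keyed by `custom_id`, billed at the 50% discount.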
Fine-Tuning Hosting Costs
Fine-tuning is not just a training cost. OpenAI charges $25/1M training tokens for GPT-4o fine-tuning, but the ongoing hosting cost is the real expense -- fine-tuned GPT-4o inference costs $3.75/1M input and $15.00/1M output, a 50% premium over the base model. Anthropic does not offer public fine-tuning. Google's Gemini fine-tuning is priced per compute-hour during training, with no inference surcharge on tuned models.
Rate Limits and Throughput Tiers
Rate limits are a stealth cost. OpenAI's entry paid tier caps at 500 RPM and 30,000 TPM. Tier 5 (requires $1,000+ cumulative spend) provides 10,000 RPM and 12M TPM. If you need guaranteed throughput above published limits, you're looking at provisioned throughput contracts -- Anthropic's start at 1-month commitments, and AWS Bedrock offers per-model-unit provisioning starting around $50/hour for Claude Sonnet.
Four Workloads Modeled: Real Monthly Costs
Headline per-token rates are meaningless without workload context. Here are four common production patterns with estimated monthly costs across providers.
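All four simulations below reduce to the same arithmetic, which you can reuse to model your own workload. Rates are the per-1M-token prices from the flagship table above.

```python
def monthly_cost(calls: int, in_tokens: int, out_tokens: int,
                 in_rate: float, out_rate: float,
                 batch_discount: float = 0.0) -> float:
    """Monthly cost in dollars. calls = API calls per month;
    rates in $ per 1M tokens; batch_discount e.g. 0.5 for 50% off."""
    cost = calls * (in_tokens * in_rate + out_tokens * out_rate) / 1_000_000
    return cost * (1 - batch_discount)

# Workload 2 (code generation): 10,000 requests/day x 30 days,
# 2,000 input / 1,500 output tokens, GPT-4.1 at $2 / $8:
gpt41 = monthly_cost(300_000, 2_000, 1_500, 2.00, 8.00)   # $4,800
```

Swap in your own token counts and the rates for any model in the tables to reproduce or extend the figures that follow.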
Workload 1: Customer Support Chatbot
Specs: 50,000 conversations/month, avg 4 turns each, 1,500 input tokens per turn (system prompt + history + retrieval), 400 output tokens per turn. System prompt cacheable.
| Provider / Model | Input Cost | Output Cost | Cache Savings | Monthly Total |
|---|---|---|---|---|
| OpenAI GPT-4.1 | $600 | $640 | -$180 (cache) | $1,060 |
| Anthropic Sonnet 4 | $900 | $1,200 | -$405 (cache) | $1,695 |
| Anthropic Haiku 3.5 | $240 | $320 | -$108 (cache) | $452 |
| Gemini 2.5 Flash | $45 | $48 | -$20 (cache) | $73 |
| Llama 3.1 70B (Together) | $264 | $70 | N/A | $334 |
Gemini 2.5 Flash dominates on cost, but quality matters for customer-facing interactions. Many teams run Haiku 3.5 or GPT-4.1 for quality, with Flash as a fallback for simple queries -- a model-routing strategy that can cut costs by 40-60% without sacrificing quality on hard questions.
Workload 2: Code Generation Pipeline
Specs: 10,000 requests/day, 2,000 input tokens (context + instructions), 1,500 output tokens (generated code). No caching benefit (varied inputs).
| Provider / Model | Input Cost | Output Cost | Monthly Total |
|---|---|---|---|
| OpenAI GPT-4.1 | $1,200 | $3,600 | $4,800 |
| Anthropic Sonnet 4 | $1,800 | $6,750 | $8,550 |
| Anthropic Opus 4 | $9,000 | $33,750 | $42,750 |
| Gemini 2.5 Pro | $750 | $4,500 | $5,250 |
| DeepSeek V3 | $162 | $495 | $657 |
Output-heavy workloads expose the asymmetry between input and output pricing. DeepSeek V3 is an order of magnitude cheaper, but its latency and reliability may not meet the bar for interactive pipelines. GPT-4.1 offers the best price-to-capability balance for production code generation among frontier models.
Workload 3: Document Summarization (Batch)
Specs: 100,000 documents/month, avg 8,000 input tokens each, 500 output tokens per summary. Batch API eligible.
| Provider / Model | Standard Cost | Batch Cost (50% off) | Monthly Savings |
|---|---|---|---|
| OpenAI GPT-4.1 | $2,000 | $1,000 | $1,000 |
| Anthropic Sonnet 4 | $3,150 | $1,575 | $1,575 |
| Anthropic Haiku 3.5 | $840 | $420 | $420 |
| Gemini 2.5 Flash | $150 | N/A | N/A |
Batch APIs make a significant difference at scale, halving every bill in the table. Note, though, that Gemini 2.5 Flash at standard rates ($150) still undercuts Haiku 3.5 even in batch mode ($420). If turnaround time is flexible, always evaluate batch pricing first -- then compare it against the cheapest standard-rate alternative.
Workload 4: Agentic Workflow (Multi-Step Reasoning)
Specs: 5,000 tasks/month, avg 8 LLM calls per task (tool use, chain-of-thought), 3,000 input tokens and 2,000 output tokens per call. Reasoning model with thinking tokens.
| Provider / Model | Visible Token Cost | Thinking Token Cost | Monthly Total |
|---|---|---|---|
| OpenAI o3 | $4,400 | ~$12,800 | $17,200 |
| OpenAI o4-mini | $484 | ~$4,400 | ~$4,884 |
| Anthropic Sonnet 4 (ext. thinking) | $1,560 | ~$6,000 | ~$7,560 |
| DeepSeek R1 | $241 | ~$2,200 | ~$2,441 |
| Gemini 2.5 Flash (thinking) | $66 | ~$560 | ~$626 |
Agentic workloads multiply costs because every tool call or chain step is a separate inference call, and reasoning models add hidden thinking-token overhead. Gemini 2.5 Flash with thinking mode is the most cost-effective option for autonomous agent loops, though o3 and Sonnet 4 remain the quality leaders for complex multi-step reasoning tasks.
Cost Optimization Strategies
Prompt Caching
If your system prompt exceeds 1,024 tokens and you're sending more than 100 requests per minute, caching should be your first optimization. Anthropic's cache requires a minimum prefix of 1,024 tokens for Sonnet and Opus and 2,048 for Haiku. OpenAI's automatic caching activates on prompts over 1,024 tokens with no code changes needed. At 50,000 requests per month, caching a 3,000-token system prompt saves $150-$450/month depending on the model.
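The savings range is straightforward to derive. This sketch assumes 50,000 requests per month with a fully cacheable 3,000-token prefix, and ignores Anthropic's one-time write surcharge for simplicity.

```python
def cache_savings(requests_per_month: int, prompt_tokens: int,
                  input_rate: float, discount: float) -> float:
    """Monthly savings in dollars on the cached prefix.
    discount: 0.5 for OpenAI-style cached input, 0.9 for Anthropic reads."""
    cached_millions = requests_per_month * prompt_tokens / 1_000_000
    return cached_millions * input_rate * discount

gpt41 = cache_savings(50_000, 3_000, 2.00, 0.5)    # $150/month
sonnet = cache_savings(50_000, 3_000, 3.00, 0.9)   # $405/month
```

The spread between the two endpoints comes almost entirely from the discount depth (50% vs 90%), not the base rate, which is why Anthropic caching pays off faster once hit rates are high.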
Batch Processing
Move any workload that can tolerate latency to batch APIs. Document processing, content moderation, data extraction, and evaluation pipelines are natural candidates. The 50% discount is the single biggest cost lever available. Structure your pipeline to queue requests and submit batches on a 1-6 hour cadence for near-real-time results at batch pricing.
Model Routing
Route requests to different models based on complexity. A classifier (or even a regex-based heuristic) can identify simple queries that a cheap model handles well versus complex ones that need a frontier model. A typical routing split is 60% Flash/Haiku, 30% Sonnet/GPT-4.1, and 10% Opus/o3. This alone can reduce costs by 50-70% compared to routing everything through a single frontier model.
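A minimal router can be sketched in a few lines. The keyword heuristic, thresholds, and model names here are illustrative placeholders; production routers usually put a small classifier model in this slot instead.

```python
CHEAP, MID, FRONTIER = "gemini-2.5-flash", "claude-sonnet-4", "o3"

# Hypothetical markers suggesting a request needs deeper reasoning.
HARD_MARKERS = ("prove", "multi-step", "refactor", "analyze", "plan")

def route(prompt: str) -> str:
    """Pick a model tier from a crude complexity estimate."""
    text = prompt.lower()
    has_marker = any(m in text for m in HARD_MARKERS)
    if len(prompt) < 200 and not has_marker:
        return CHEAP       # short and simple -> cheapest tier
    if has_marker:
        return FRONTIER    # explicit reasoning cues -> frontier model
    return MID             # long but routine -> mid-tier

print(route("What are your business hours?"))          # gemini-2.5-flash
print(route("Plan a three-phase database migration"))  # o3
```

Even this crude version captures the economics: as long as the classifier errs toward the expensive model on ambiguous inputs, you bank the savings on the clear-cut majority without degrading hard queries.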
Self-Hosting Break-Even Analysis
Self-hosting open-weight models makes economic sense at scale. The break-even point depends on GPU costs and utilization. Running Llama 3.1 70B on a single A100 80GB (roughly $2/hour on cloud GPU providers) requires a quantized build -- full-precision weights need two cards -- and supports about 15-30 requests per second at short context with continuous batching. At 1 million requests per month, self-hosting costs approximately $1,440/month versus $880 on Together AI for the same model. Self-hosting wins only when you exceed 2-3 million requests per month, maintain >70% GPU utilization, and have the engineering capacity to operate inference infrastructure. Below that threshold, managed APIs are cheaper when you factor in engineering time.
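The raw-compute break-even follows from the figures above: a $2/hour GPU against Together AI's Llama 3.1 70B rate, assuming roughly 1,000 combined tokens per request.

```python
GPU_MONTHLY = 2.00 * 24 * 30                 # $1,440/month for one A100
API_PER_REQUEST = 1_000 * 0.88 / 1_000_000   # ~$0.00088 at $0.88/1M tokens

break_even = GPU_MONTHLY / API_PER_REQUEST
print(f"{break_even:,.0f} requests/month")   # ~1.6M requests/month
```

The pure-hardware crossover lands around 1.6M requests per month; the 2-3M threshold quoted above is higher because it also prices in sub-100% GPU utilization and the engineering time to run the stack.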
Frequently Asked Questions
Which LLM API is cheapest for production use in 2026?
It depends on the workload. For high-volume, quality-flexible tasks, Gemini 2.5 Flash offers the lowest per-token rates among production-grade models at $0.15/$0.60 per million input/output tokens. For tasks requiring frontier reasoning quality, OpenAI GPT-4.1 at $2/$8 offers the best cost-to-capability ratio. DeepSeek V3 is cheapest in absolute terms but has higher latency and availability concerns outside of China-region endpoints.
How do reasoning model costs compare to standard models?
Reasoning models (OpenAI o3/o4-mini, DeepSeek R1, Gemini with thinking) generate internal thinking tokens billed as output. A typical reasoning call produces 5,000-20,000 thinking tokens in addition to the visible response. This makes the effective per-request cost 3-10x higher than standard models. Use reasoning models selectively for tasks that genuinely benefit from multi-step logic -- math, complex analysis, planning -- and route simpler tasks to standard models.
Is prompt caching worth implementing?
Yes, if you meet two conditions: your system prompt or shared context exceeds the minimum cacheable length (1,024-2,048 tokens depending on provider), and you send at least 50-100 requests per minute to maintain cache residency. At high request volumes with a 4,000-token system prompt, caching saves 50-90% on input token costs for the cached portion. For low-volume or highly variable prompts, cache hit rates drop below 20% and the benefit disappears.
Should I use Bedrock or Vertex instead of direct APIs?
Per-token pricing is identical on Bedrock and Vertex compared to direct Anthropic or Google APIs. The value proposition is operational: unified billing through your existing cloud account, VPC endpoints for network isolation, provisioned throughput guarantees, compliance certifications (HIPAA, SOC 2) inherited from the cloud provider, and consolidated IAM. If you're already on AWS or GCP and need enterprise controls, the platform markup is zero on per-token rates -- the cost is purely in the cloud infrastructure (VPC endpoints, NAT gateways) surrounding it.
When does self-hosting LLMs become cheaper than API access?
The break-even point for self-hosting a 70B parameter model is roughly 2-3 million requests per month, assuming you achieve 70%+ GPU utilization and have SRE capacity to maintain the inference stack. Below that volume, managed APIs are almost always cheaper when you account for GPU idle time, engineering overhead, and the operational cost of model updates. Fine-tuned models shift this calculation -- if you need a custom model that no provider hosts, self-hosting is your only option regardless of cost.
How much do rate limits actually cost in practice?
Rate limits impose indirect costs through queuing delays, dropped requests, and over-provisioning. If your application needs 5,000 RPM but your tier caps at 3,500 RPM, you either queue (adding latency), distribute across multiple API keys (adding complexity), or upgrade your tier (often requiring minimum monthly commits of $1,000-$5,000). Some teams run parallel accounts across providers as a rate-limit arbitrage strategy -- routing overflow traffic to a secondary provider during peak hours.
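The overflow-routing idea can be sketched with a sliding-window counter: send traffic to the primary provider until the window nears the tier cap, then spill to a secondary. The provider labels and the 3,500 RPM cap are illustrative, not any provider's actual limit.

```python
import time
from collections import deque

class OverflowRouter:
    """Route to 'primary' until its RPM cap is reached, then 'secondary'."""

    def __init__(self, primary_rpm: int = 3_500):
        self.cap = primary_rpm
        self.window: deque[float] = deque()   # timestamps of primary calls

    def pick_provider(self) -> str:
        now = time.monotonic()
        while self.window and now - self.window[0] > 60:
            self.window.popleft()             # drop calls older than 60s
        if len(self.window) < self.cap:
            self.window.append(now)
            return "primary"
        return "secondary"                    # overflow during peak traffic
```

A production version would also track the secondary's limits and handle 429 responses as a signal to shrink the effective cap, but the core mechanic -- counting recent calls and spilling at the threshold -- is this simple.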
What is the most cost-effective approach for agentic AI workloads?
Model routing combined with caching is the most effective strategy. Use a cheap, fast model (Gemini Flash, Haiku 3.5) for tool-use orchestration and simple decisions within the agent loop, and escalate to a frontier reasoning model (o3, Sonnet 4) only for steps that require complex analysis. Cache the agent's system prompt and tool definitions aggressively. This hybrid approach typically costs 60-75% less than routing all agent steps through a single frontier model, with minimal quality degradation on the orchestration steps.
The Bottom Line on LLM API Costs
Per-token pricing is the starting point, not the answer. Your actual LLM spend is determined by the interaction of model choice, caching behavior, output-to-input ratio, batch eligibility, and rate-limit tier. The providers with the lowest headline rates (DeepSeek, Gemini Flash) are not always cheapest in practice -- latency constraints, reliability requirements, and quality thresholds push many production workloads toward mid-tier pricing.
Start by profiling your workload: measure your input/output token ratio, cache hit potential, batch eligibility percentage, and peak request rate. Then model costs across 3-4 providers using the tables above. The teams that treat LLM cost optimization as an ongoing practice -- not a one-time provider selection -- consistently spend 50-70% less than those that pick a model and never revisit the decision.
Written by
Abhishek Patel
Infrastructure engineer with 10+ years building production systems on AWS, GCP, and bare metal. Writes practical guides on cloud architecture, containers, networking, and Linux for developers who want to understand how things actually work under the hood.