LLM API Pricing Compared (2026): OpenAI vs Anthropic vs Google vs Open Source
Per-token pricing, caching credits, batch discounts, and hidden costs across OpenAI, Anthropic, Google, and open-source LLM providers. Includes four real workload simulations and cost optimization strategies.

LLM API Pricing Has Never Been More Confusing
Eighteen months ago you had two serious options: OpenAI or Anthropic. Today there are at least ten providers worth evaluating, each with its own pricing model, discount mechanics, and hidden surcharges. Per-token rates only tell half the story -- caching credits, batch discounts, fine-tuning hosting fees, and rate-limit tiers change the effective cost by 2-5x depending on how you call the API.
I maintain LLM cost dashboards for multiple production systems, and the pattern I see repeatedly is teams choosing a provider based on headline price, then discovering that their actual bill is 40-60% higher because of output-heavy workloads, cache misses, or rate-limit-driven over-provisioning. This guide gives you the full picture -- every major provider, every pricing lever, and four real workload simulations so you can model your own costs accurately.
Understanding LLM API Pricing Models
Definition: LLM API pricing is the cost structure for accessing large language model inference through a hosted API. Providers charge per token (input and output separately), with rates varying by model size, capability tier, and volume. Additional costs include prompt caching fees, batch processing discounts, fine-tuning compute, rate-limit upgrades, and platform markups on third-party hosting.
Every provider bills on a per-token basis, but input and output tokens are priced differently. Output tokens cost 2-5x more because each one requires a full sequential forward pass through the model, while input tokens are processed in parallel. This distinction matters enormously: a summarization workload (high input, low output) has a fundamentally different cost profile than a code-generation workload (low input, high output).
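The asymmetry can be sketched with a small cost function. This is a minimal model using GPT-4.1's rates from the flagship table below ($2.00/1M input, $8.00/1M output); the two workload shapes are illustrative.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_rate: float = 2.00, output_rate: float = 8.00) -> float:
    """Per-request cost in dollars; rates are per 1M tokens (GPT-4.1 defaults)."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# Summarization: input-heavy (8,000 in, 500 out)
summarize = request_cost(8_000, 500)    # $0.020 -- only 20% of cost is output
# Code generation: output-heavy (2,000 in, 1,500 out)
codegen = request_cost(2_000, 1_500)    # $0.016 -- 75% of cost is output
```

The two requests cost nearly the same despite the code-generation call processing less than a third of the tokens: output pricing dominates as soon as generation length grows.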
Per-Token Pricing: Flagship Models Compared
These are the frontier-class models you would reach for when accuracy and reasoning quality matter most. Prices are per million tokens as of April 2026.
| Provider / Model | Input (per 1M) | Output (per 1M) | Context Window | Notes |
|---|---|---|---|---|
| OpenAI GPT-4.1 | $2.00 | $8.00 | 1M | Long-context flagship |
| OpenAI o3 | $10.00 | $40.00 | 200K | Reasoning model; thinking tokens billed as output |
| OpenAI o4-mini | $1.10 | $4.40 | 200K | Compact reasoning model |
| Anthropic Claude Opus 4 | $15.00 | $75.00 | 200K | Highest capability; extended thinking extra |
| Anthropic Claude Sonnet 4 | $3.00 | $15.00 | 200K | Best quality-to-cost ratio in class |
| Anthropic Claude Haiku 3.5 | $0.80 | $4.00 | 200K | Speed-optimized |
| Google Gemini 2.5 Pro | $1.25 | $10.00 | 1M | Free tier available (rate-limited) |
| Google Gemini 2.5 Flash | $0.15 | $0.60 | 1M | Thinking tokens: $0.70/1M output |
| Mistral Large | $2.00 | $6.00 | 128K | Strong multilingual support |
| DeepSeek V3 | $0.27 | $1.10 | 128K | Cache hits: $0.07/1M input |
| DeepSeek R1 | $0.55 | $2.19 | 128K | Reasoning model; open-weights |
Watch out: Reasoning models (o3, o4-mini, DeepSeek R1) generate internal "thinking" tokens that are billed as output tokens but never shown to the user. A single reasoning call can produce 5,000-20,000 thinking tokens before the visible response begins. This makes reasoning models 3-10x more expensive per request than their headline output price suggests.
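The effective cost of a reasoning call can be modeled directly. This sketch uses o3's rates ($10/1M input, $40/1M output); the 10,000 thinking tokens are an assumed value inside the 5,000-20,000 range described above, not a measured figure.

```python
def reasoning_request_cost(input_tokens: int, visible_output: int,
                           thinking_tokens: int,
                           input_rate: float = 10.0,
                           output_rate: float = 40.0) -> float:
    """Per-request cost in dollars; thinking tokens are billed as output."""
    billed_output = visible_output + thinking_tokens
    return (input_tokens * input_rate + billed_output * output_rate) / 1_000_000

headline = reasoning_request_cost(3_000, 500, 0)       # $0.05 -- naive estimate
actual = reasoning_request_cost(3_000, 500, 10_000)    # $0.45 -- 9x higher
```

A 9x gap from a mid-range thinking budget is exactly why headline rates mislead for reasoning models: the multiplier scales with how much the model chooses to think, which you don't fully control.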
Inference Hosting Platforms: Open-Source Model Pricing
If you want to run open-weight models (Llama 3.1, Mistral, DeepSeek, Qwen) without managing GPUs, several inference platforms offer per-token pricing that undercuts first-party APIs significantly.
| Platform | Model Example | Input (per 1M) | Output (per 1M) | Rate Limits |
|---|---|---|---|---|
| Together AI | Llama 3.1 405B | $3.50 | $3.50 | 600 RPM default |
| Together AI | Llama 3.1 70B | $0.88 | $0.88 | 600 RPM default |
| Fireworks AI | Llama 3.1 405B | $3.00 | $3.00 | 600 RPM, 10M TPM |
| Fireworks AI | Llama 3.1 70B | $0.90 | $0.90 | 600 RPM, 10M TPM |
| Groq | Llama 3.1 70B | $0.59 | $0.79 | 30 RPM free, 1000 RPM paid |
| Groq | Llama 3.3 70B | $0.59 | $0.79 | Ultra-low latency (LPU) |
| AWS Bedrock | Claude Sonnet 4 | $3.00 | $15.00 | Provisioned throughput available |
| GCP Vertex AI | Claude Sonnet 4 | $3.00 | $15.00 | Committed use discounts |
| AWS Bedrock | Llama 3.1 70B | $0.72 | $0.72 | On-demand or provisioned |
A few things jump out. Open-weight models on third-party platforms cost 60-80% less than frontier proprietary models for comparable quality tiers. Groq's LPU-based inference delivers the lowest latency in the market, but their free-tier rate limits (30 RPM) make them impractical for production without a paid plan. Bedrock and Vertex charge the same per-token rates as the first-party APIs but offer provisioned throughput options that guarantee capacity -- critical for production workloads that can't tolerate 429 errors.
Hidden Costs That Change the Math
Prompt Caching Credits
Both OpenAI and Anthropic offer prompt caching -- if your requests share a common prefix (system prompt, few-shot examples, or large document context), the cached portion is billed at a reduced rate. OpenAI's cached input tokens cost 50% of standard input price. Anthropic charges 90% less for cache reads but adds a 25% surcharge for cache writes (the first request that populates the cache). DeepSeek offers an automatic cache with 90% discount on cache hits.
For a RAG application sending a 4,000-token system prompt on every request, caching can reduce system-prompt costs by 50-90%. But if your prompts vary significantly between requests and cache hit rates stay below 30%, the write surcharge on Anthropic can actually increase your costs.
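The break-even hit rate falls out of the two multipliers above: Anthropic reads at 10% of the input rate, writes at a 25% surcharge. This sketch assumes every miss is a cache write, which is the worst case.

```python
def effective_input_multiplier(hit_rate: float) -> float:
    """Cost of the cacheable prefix relative to uncached input tokens.
    Reads cost 0.10x, misses (writes) cost 1.25x."""
    return 0.10 * hit_rate + 1.25 * (1 - hit_rate)

# Caching pays off only once the multiplier drops below 1.0:
#   1.25 - 1.15*h < 1.0  =>  h > ~21.7%
for h in (0.10, 0.30, 0.90):
    print(f"hit rate {h:.0%}: {effective_input_multiplier(h):.3f}x")
```

At a 10% hit rate the cached prefix costs about 1.14x what you'd pay with caching disabled, which matches the warning above: below roughly a 22% hit rate, Anthropic's write surcharge makes caching a net loss.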
Batch API Discounts
OpenAI's Batch API offers 50% off input and output tokens for requests that can tolerate 24-hour turnaround. Anthropic's Message Batches API similarly provides 50% discounts. If your workload is offline processing -- document classification, bulk summarization, content moderation at scale -- batch APIs cut costs in half with zero code changes beyond swapping the endpoint.
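The endpoint swap amounts to preparing a JSONL input file: one JSON object per line, each carrying a `custom_id` and the same request body you would send synchronously. This sketch targets OpenAI's Batch API format; the model name, document IDs, and prompt are placeholders.

```python
import json

# Hypothetical documents to summarize in bulk (IDs and text are made up).
docs = {
    "doc-001": "First document text...",
    "doc-002": "Second document text...",
}

with open("summarize_batch.jsonl", "w") as f:
    for doc_id, text in docs.items():
        line = {
            "custom_id": doc_id,            # ties each result back to its input
            "method": "POST",
            "url": "/v1/chat/completions",  # same endpoint as synchronous calls
            "body": {
                "model": "gpt-4.1",
                "messages": [{"role": "user", "content": f"Summarize:\n{text}"}],
                "max_tokens": 500,
            },
        }
        f.write(json.dumps(line) + "\n")
```

From there you upload the file with `purpose="batch"` and create a batch job with a 24-hour completion window; results come back as a JSONL file keyed by `custom_id`, billed at the 50% discount.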
Fine-Tuning Hosting Costs
Fine-tuning is not just a training cost. OpenAI charges $25/1M training tokens for GPT-4o fine-tuning, but the ongoing hosting cost is the real expense -- fine-tuned GPT-4o inference costs $3.75/1M input and $15.00/1M output, a 50% premium over the base model. Anthropic does not offer public fine-tuning. Google's Gemini fine-tuning is priced per compute-hour during training, with no inference surcharge on tuned models.
Rate Limits and Throughput Tiers
Rate limits are a stealth cost. OpenAI's entry paid tier caps at 500 RPM and 30,000 TPM. Tier 5 (requires $1,000+ cumulative spend) provides 10,000 RPM and 12M TPM. If you need guaranteed throughput above published limits, you're looking at provisioned throughput contracts -- Anthropic's start at 1-month commitments, and AWS Bedrock offers per-model-unit provisioning starting around $50/hour for Claude Sonnet.
Four Workloads Modeled: Real Monthly Costs
Headline per-token rates are meaningless without workload context. Here are four common production patterns with estimated monthly costs across providers.
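All four simulations below reduce to the same arithmetic, which you can reuse to model your own workload. Rates are the per-1M-token prices from the flagship table above.

```python
def monthly_cost(calls: int, in_tokens: int, out_tokens: int,
                 in_rate: float, out_rate: float,
                 batch_discount: float = 0.0) -> float:
    """Monthly cost in dollars. calls = API calls per month;
    rates in $ per 1M tokens; batch_discount e.g. 0.5 for 50% off."""
    cost = calls * (in_tokens * in_rate + out_tokens * out_rate) / 1_000_000
    return cost * (1 - batch_discount)

# Workload 2 (code generation): 10,000 requests/day x 30 days,
# 2,000 input / 1,500 output tokens, GPT-4.1 at $2 / $8:
gpt41 = monthly_cost(300_000, 2_000, 1_500, 2.00, 8.00)   # $4,800
```

Swap in your own token counts and the rates for any model in the tables to reproduce or extend the figures that follow.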
Workload 1: Customer Support Chatbot
Specs: 50,000 conversations/month, avg 4 turns each, 1,500 input tokens per turn (system prompt + history + retrieval), 400 output tokens per turn. System prompt cacheable.
| Provider / Model | Input Cost | Output Cost | Cache Savings | Monthly Total |
|---|---|---|---|---|
| OpenAI GPT-4.1 | $600 | $640 | -$180 (cache) | $1,060 |
| Anthropic Sonnet 4 | $900 | $1,200 | -$405 (cache) | $1,695 |
| Anthropic Haiku 3.5 | $240 | $320 | -$108 (cache) | $452 |
| Gemini 2.5 Flash | $45 | $48 | -$20 (cache) | $73 |
| Llama 3.1 70B (Together) | $264 | $70 | N/A | $334 |
Gemini 2.5 Flash dominates on cost, but quality matters for customer-facing interactions. Many teams run Haiku 3.5 or GPT-4.1 for quality, with Flash as a fallback for simple queries -- a model-routing strategy that can cut costs by 40-60% without sacrificing quality on hard questions.
Workload 2: Code Generation Pipeline
Specs: 10,000 requests/day, 2,000 input tokens (context + instructions), 1,500 output tokens (generated code). No caching benefit (varied inputs).
| Provider / Model | Input Cost | Output Cost | Monthly Total |
|---|---|---|---|
| OpenAI GPT-4.1 | $1,200 | $3,600 | $4,800 |
| Anthropic Sonnet 4 | $1,800 | $6,750 | $8,550 |
| Anthropic Opus 4 | $9,000 | $33,750 | $42,750 |
| Gemini 2.5 Pro | $750 | $4,500 | $5,250 |
| DeepSeek V3 | $162 | $495 | $657 |
Output-heavy workloads expose the asymmetry between input and output pricing. DeepSeek V3 is an order of magnitude cheaper, but its latency and reliability may not meet the bar for interactive pipelines. GPT-4.1 offers the best price-to-capability balance for production code generation among frontier models.
Workload 3: Document Summarization (Batch)
Specs: 100,000 documents/month, avg 8,000 input tokens each, 500 output tokens per summary. Batch API eligible.
| Provider / Model | Standard Cost | Batch Cost (50% off) | Monthly Savings |
|---|---|---|---|
| OpenAI GPT-4.1 | $2,000 | $1,000 | $1,000 |
| Anthropic Sonnet 4 | $3,150 | $1,575 | $1,575 |
| Anthropic Haiku 3.5 | $840 | $420 | $420 |
| Gemini 2.5 Flash | $150 | N/A | N/A |
Batch APIs make a significant difference at scale, halving every bill in the table. Note, though, that Gemini 2.5 Flash at standard rates ($150) still undercuts Haiku 3.5 even in batch mode ($420). If turnaround time is flexible, always evaluate batch pricing first -- then compare it against the cheapest standard-rate alternative.
Workload 4: Agentic Workflow (Multi-Step Reasoning)
Specs: 5,000 tasks/month, avg 8 LLM calls per task (tool use, chain-of-thought), 3,000 input tokens and 2,000 output tokens per call. Reasoning model with thinking tokens.
| Provider / Model | Visible Token Cost | Thinking Token Cost | Monthly Total |
|---|---|---|---|
| OpenAI o3 | $4,400 | ~$12,800 | $17,200 |
| OpenAI o4-mini | $484 | ~$4,400 | ~$4,884 |
| Anthropic Sonnet 4 (ext. thinking) | $1,560 | ~$6,000 | ~$7,560 |
| DeepSeek R1 | $241 | ~$2,200 | ~$2,441 |
| Gemini 2.5 Flash (thinking) | $66 | ~$560 | ~$626 |
Agentic workloads multiply costs because every tool call or chain step is a separate inference call, and reasoning models add hidden thinking-token overhead. Gemini 2.5 Flash with thinking mode is the most cost-effective option for autonomous agent loops, though o3 and Sonnet 4 remain the quality leaders for complex multi-step reasoning tasks.
Cost Optimization Strategies
Prompt Caching
If your system prompt exceeds 1,024 tokens and you're sending more than 100 requests per minute, caching should be your first optimization. Anthropic's cache requires a minimum prefix of 1,024 tokens for Sonnet and Opus and 2,048 for Haiku. OpenAI's automatic caching activates on prompts over 1,024 tokens with no code changes needed. At 50,000 requests per month, caching a 3,000-token system prompt saves $150-$450/month depending on the model.
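The savings range is straightforward to derive. This sketch assumes 50,000 requests per month with a fully cacheable 3,000-token prefix, and ignores Anthropic's one-time write surcharge for simplicity.

```python
def cache_savings(requests_per_month: int, prompt_tokens: int,
                  input_rate: float, discount: float) -> float:
    """Monthly savings in dollars on the cached prefix.
    discount: 0.5 for OpenAI-style cached input, 0.9 for Anthropic reads."""
    cached_millions = requests_per_month * prompt_tokens / 1_000_000
    return cached_millions * input_rate * discount

gpt41 = cache_savings(50_000, 3_000, 2.00, 0.5)    # $150/month
sonnet = cache_savings(50_000, 3_000, 3.00, 0.9)   # $405/month
```

The spread between the two endpoints comes almost entirely from the discount depth (50% vs 90%), not the base rate, which is why Anthropic caching pays off faster once hit rates are high.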
Batch Processing
Move any workload that can tolerate latency to batch APIs. Document processing, content moderation, data extraction, and evaluation pipelines are natural candidates. The 50% discount is the single biggest cost lever available. Structure your pipeline to queue requests and submit batches on a 1-6 hour cadence for near-real-time results at batch pricing.
Model Routing
Route requests to different models based on complexity. A classifier (or even a regex-based heuristic) can identify simple queries that a cheap model handles well versus complex ones that need a frontier model. A typical routing split is 60% Flash/Haiku, 30% Sonnet/GPT-4.1, and 10% Opus/o3. This alone can reduce costs by 50-70% compared to routing everything through a single frontier model.
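A minimal router can be sketched in a few lines. The keyword heuristic, thresholds, and model names here are illustrative placeholders; production routers usually put a small classifier model in this slot instead.

```python
CHEAP, MID, FRONTIER = "gemini-2.5-flash", "claude-sonnet-4", "o3"

# Hypothetical markers suggesting a request needs deeper reasoning.
HARD_MARKERS = ("prove", "multi-step", "refactor", "analyze", "plan")

def route(prompt: str) -> str:
    """Pick a model tier from a crude complexity estimate."""
    text = prompt.lower()
    has_marker = any(m in text for m in HARD_MARKERS)
    if len(prompt) < 200 and not has_marker:
        return CHEAP       # short and simple -> cheapest tier
    if has_marker:
        return FRONTIER    # explicit reasoning cues -> frontier model
    return MID             # long but routine -> mid-tier

print(route("What are your business hours?"))          # gemini-2.5-flash
print(route("Plan a three-phase database migration"))  # o3
```

Even this crude version captures the economics: as long as the classifier errs toward the expensive model on ambiguous inputs, you bank the savings on the clear-cut majority without degrading hard queries.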
Self-Hosting Break-Even Analysis
Self-hosting open-weight models makes economic sense at scale. The break-even point depends on GPU costs and utilization. Running Llama 3.1 70B on a single A100 80GB (roughly $2/hour on cloud GPU providers) requires a quantized build -- full-precision weights need two cards -- and supports about 15-30 requests per second at short context with continuous batching. At 1 million requests per month, self-hosting costs approximately $1,440/month versus $880 on Together AI for the same model. Self-hosting wins only when you exceed 2-3 million requests per month, maintain >70% GPU utilization, and have the engineering capacity to operate inference infrastructure. Below that threshold, managed APIs are cheaper when you factor in engineering time.
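The raw-compute break-even follows from the figures above: a $2/hour GPU against Together AI's Llama 3.1 70B rate, assuming roughly 1,000 combined tokens per request.

```python
GPU_MONTHLY = 2.00 * 24 * 30                 # $1,440/month for one A100
API_PER_REQUEST = 1_000 * 0.88 / 1_000_000   # ~$0.00088 at $0.88/1M tokens

break_even = GPU_MONTHLY / API_PER_REQUEST
print(f"{break_even:,.0f} requests/month")   # ~1.6M requests/month
```

The pure-hardware crossover lands around 1.6M requests per month; the 2-3M threshold quoted above is higher because it also prices in sub-100% GPU utilization and the engineering time to run the stack.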
Frequently Asked Questions
Which LLM API is cheapest for production use in 2026?
It depends on the workload. For high-volume, quality-flexible tasks, Gemini 2.5 Flash offers the lowest per-token rates among production-grade models at $0.15/$0.60 per million input/output tokens. For tasks requiring frontier reasoning quality, OpenAI GPT-4.1 at $2/$8 offers the best cost-to-capability ratio. DeepSeek V3 is cheapest in absolute terms but has higher latency and availability concerns outside of China-region endpoints.
How do reasoning model costs compare to standard models?
Reasoning models (OpenAI o3/o4-mini, DeepSeek R1, Gemini with thinking) generate internal thinking tokens billed as output. A typical reasoning call produces 5,000-20,000 thinking tokens in addition to the visible response. This makes the effective per-request cost 3-10x higher than standard models. Use reasoning models selectively for tasks that genuinely benefit from multi-step logic -- math, complex analysis, planning -- and route simpler tasks to standard models.
Is prompt caching worth implementing?
Yes, if you meet two conditions: your system prompt or shared context exceeds the minimum cacheable length (1,024-2,048 tokens depending on provider), and you send at least 50-100 requests per minute to maintain cache residency. At high request volumes with a 4,000-token system prompt, caching saves 50-90% on input token costs for the cached portion. For low-volume or highly variable prompts, cache hit rates drop below 20% and the benefit disappears.
Should I use Bedrock or Vertex instead of direct APIs?
Per-token pricing is identical on Bedrock and Vertex compared to direct Anthropic or Google APIs. The value proposition is operational: unified billing through your existing cloud account, VPC endpoints for network isolation, provisioned throughput guarantees, compliance certifications (HIPAA, SOC 2) inherited from the cloud provider, and consolidated IAM. If you're already on AWS or GCP and need enterprise controls, the platform markup is zero on per-token rates -- the cost is purely in the cloud infrastructure (VPC endpoints, NAT gateways) surrounding it.
When does self-hosting LLMs become cheaper than API access?
The break-even point for self-hosting a 70B parameter model is roughly 2-3 million requests per month, assuming you achieve 70%+ GPU utilization and have SRE capacity to maintain the inference stack. Below that volume, managed APIs are almost always cheaper when you account for GPU idle time, engineering overhead, and the operational cost of model updates. Fine-tuned models shift this calculation -- if you need a custom model that no provider hosts, self-hosting is your only option regardless of cost.
How much do rate limits actually cost in practice?
Rate limits impose indirect costs through queuing delays, dropped requests, and over-provisioning. If your application needs 5,000 RPM but your tier caps at 3,500 RPM, you either queue (adding latency), distribute across multiple API keys (adding complexity), or upgrade your tier (often requiring minimum monthly commits of $1,000-$5,000). Some teams run parallel accounts across providers as a rate-limit arbitrage strategy -- routing overflow traffic to a secondary provider during peak hours.
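The overflow-routing idea can be sketched with a sliding-window counter: send traffic to the primary provider until the window nears the tier cap, then spill to a secondary. The provider labels and the 3,500 RPM cap are illustrative, not any provider's actual limit.

```python
import time
from collections import deque

class OverflowRouter:
    """Route to 'primary' until its RPM cap is reached, then 'secondary'."""

    def __init__(self, primary_rpm: int = 3_500):
        self.cap = primary_rpm
        self.window: deque[float] = deque()   # timestamps of primary calls

    def pick_provider(self) -> str:
        now = time.monotonic()
        while self.window and now - self.window[0] > 60:
            self.window.popleft()             # drop calls older than 60s
        if len(self.window) < self.cap:
            self.window.append(now)
            return "primary"
        return "secondary"                    # overflow during peak traffic
```

A production version would also track the secondary's limits and handle 429 responses as a signal to shrink the effective cap, but the core mechanic -- counting recent calls and spilling at the threshold -- is this simple.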
What is the most cost-effective approach for agentic AI workloads?
Model routing combined with caching is the most effective strategy. Use a cheap, fast model (Gemini Flash, Haiku 3.5) for tool-use orchestration and simple decisions within the agent loop, and escalate to a frontier reasoning model (o3, Sonnet 4) only for steps that require complex analysis. Cache the agent's system prompt and tool definitions aggressively. This hybrid approach typically costs 60-75% less than routing all agent steps through a single frontier model, with minimal quality degradation on the orchestration steps.
The Bottom Line on LLM API Costs
Per-token pricing is the starting point, not the answer. Your actual LLM spend is determined by the interaction of model choice, caching behavior, output-to-input ratio, batch eligibility, and rate-limit tier. The providers with the lowest headline rates (DeepSeek, Gemini Flash) are not always cheapest in practice -- latency constraints, reliability requirements, and quality thresholds push many production workloads toward mid-tier pricing.
Start by profiling your workload: measure your input/output token ratio, cache hit potential, batch eligibility percentage, and peak request rate. Then model costs across 3-4 providers using the tables above. The teams that treat LLM cost optimization as an ongoing practice -- not a one-time provider selection -- consistently spend 50-70% less than those that pick a model and never revisit the decision.
Written by
Abhishek Patel
Infrastructure engineer with 10+ years building production systems on AWS, GCP, and bare metal. Writes practical guides on cloud architecture, containers, networking, and Linux for developers who want to understand how things actually work under the hood.