LLM APIs charge per token — but not all tokens cost the same, not all models charge the same rates, and the gap between the cheapest and most expensive options is wider than most developers realize. If you're building on top of LLM APIs and haven't looked closely at the pricing structure, there's a good chance you're overpaying. Understanding how pricing actually works is the first step to fixing that.
How LLM Pricing Works
Every LLM API call is billed based on the number of tokens processed — not characters, not words, not requests. A token is roughly 4 characters of English text, or about ¾ of a word. That means a 500-word response contains approximately 667 output tokens, and a 200-word system prompt contains roughly 267 input tokens.
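These rules of thumb are easy to encode. A quick sketch in Python — the ratios are heuristics only; real tokenizers (such as OpenAI's tiktoken) vary by model and language:

```python
# Rough token estimates from the rules of thumb above.
# Real tokenizers (e.g. OpenAI's tiktoken) vary by model and language.

def estimate_tokens_from_words(word_count: int) -> int:
    """One token is about 3/4 of a word, so tokens ~= words / 0.75."""
    return round(word_count / 0.75)

def estimate_tokens_from_chars(char_count: int) -> int:
    """One token is about 4 characters of English text."""
    return round(char_count / 4)

print(estimate_tokens_from_words(500))   # ~667 tokens for a 500-word response
print(estimate_tokens_from_chars(2000))  # ~500 tokens for 2,000 characters
```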
Pricing is split into two separate rates: input tokens (what you send to the model) and output tokens (what the model generates back). This distinction matters enormously, because output tokens typically cost 4 to 5 times more than input tokens at the major first-party providers; open-weight models served through aggregators like OpenRouter are the main exception.
The reason for this asymmetry is computational. Input tokens are processed in a single, highly parallelized pass, which is comparatively cheap. Output tokens are generated one at a time: each new token requires a full forward pass through the model, conditioned on everything generated so far. So a call where you send a 500-token prompt and receive a 500-token response isn't billed evenly: the output half costs 4-5x more per token than the input half.
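The split billing is just a weighted sum. A minimal sketch, using GPT-4o's rates from the table below as the example:

```python
def call_cost(input_tokens: int, output_tokens: int,
              input_rate: float, output_rate: float) -> float:
    """Cost in USD for one call. Rates are quoted in USD per 1 million tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# A 500-in / 500-out call at GPT-4o's rates ($2.50 input, $10.00 output):
cost = call_cost(500, 500, input_rate=2.50, output_rate=10.00)
print(f"${cost:.5f}")  # $0.00625 -- the output half is 4x the input half
```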
2025 Pricing Comparison
Here's the current landscape across the major providers and models. All prices are per 1 million tokens:
| Model | Provider | Input $/1M tokens | Output $/1M tokens |
|---|---|---|---|
| GPT-4o | OpenAI | $2.50 | $10.00 |
| GPT-4o Mini | OpenAI | $0.15 | $0.60 |
| Claude 3.5 Sonnet | Anthropic | $3.00 | $15.00 |
| Claude 3 Haiku | Anthropic | $0.25 | $1.25 |
| Gemini 1.5 Pro | Google | $1.25 | $5.00 |
| Gemini 1.5 Flash | Google | $0.075 | $0.30 |
| Llama 3.1 70B | OpenRouter | $0.52 | $0.75 |
| Mixtral 8x7B | OpenRouter | $0.24 | $0.24 |
The spread is staggering. Claude 3.5 Sonnet output costs $15 per million tokens; Gemini 1.5 Flash output costs $0.30 per million tokens — a 50x difference. GPT-4o vs. GPT-4o Mini is a nearly 17x gap on both input and output. Even within a single family of models, the pricing tiers are designed to reflect very different capability levels — and very different cost profiles.
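For programmatic comparisons, the table is small enough to inline. A sketch — the keys are informal shorthand, not official API model identifiers:

```python
# The table above as a lookup: (input, output) rates in USD per 1M tokens.
# Keys are informal shorthand, not official API model identifiers.
PRICING = {
    "gpt-4o":            (2.50, 10.00),
    "gpt-4o-mini":       (0.15, 0.60),
    "claude-3.5-sonnet": (3.00, 15.00),
    "claude-3-haiku":    (0.25, 1.25),
    "gemini-1.5-pro":    (1.25, 5.00),
    "gemini-1.5-flash":  (0.075, 0.30),
    "llama-3.1-70b":     (0.52, 0.75),
    "mixtral-8x7b":      (0.24, 0.24),
}

def output_spread(model_a: str, model_b: str) -> float:
    """How many times more model_a's output tokens cost than model_b's."""
    return PRICING[model_a][1] / PRICING[model_b][1]

print(round(output_spread("claude-3.5-sonnet", "gemini-1.5-flash"), 1))  # 50.0
```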
The Hidden Multiplier — Output Tokens
Let's make this concrete. Suppose your application averages 500 output tokens per request — a typical-length response of around 375 words. Here's what output tokens alone cost across three models at different monthly volumes:
| Requests / month | GPT-4o | GPT-4o Mini | Claude 3 Haiku |
|---|---|---|---|
| 10,000 | $50.00 | $3.00 | $6.25 |
| 100,000 | $500.00 | $30.00 | $62.50 |
| 1,000,000 | $5,000.00 | $300.00 | $625.00 |
(Calculation: 500 output tokens × output rate × request count. Input costs excluded for simplicity.)
At 100,000 requests per month, the difference between GPT-4o and GPT-4o Mini is $470 per month — on output tokens alone. At one million requests, that's a $4,700 monthly gap from a single pricing decision. This is the hidden multiplier: output tokens compound at scale in ways that are easy to underestimate when you're prototyping with low volumes.
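The table's numbers come from a one-line formula that's easy to adapt to your own volumes:

```python
def monthly_output_cost(requests_per_month: int, avg_output_tokens: int,
                        output_rate: float) -> float:
    """Monthly output-token spend in USD. Input costs excluded, as in the table."""
    return requests_per_month * avg_output_tokens * output_rate / 1_000_000

# Reproduce the 100,000-requests row (500 output tokens per request):
for name, rate in [("GPT-4o", 10.00), ("GPT-4o Mini", 0.60), ("Claude 3 Haiku", 1.25)]:
    print(f"{name}: ${monthly_output_cost(100_000, 500, rate):,.2f}")
```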
Why You're Probably Overpaying
The most common reason teams overpay for LLM API calls is simple: they default to the best available model and never revisit that choice. You pick GPT-4o during prototyping because it works well and gives you confidence in the output. Then you ship, you scale, and the model choice stays locked in because changing it feels risky.
But here's the thing: in most production workloads, the large majority of requests are simple tasks — summarizations, classifications, entity extractions, short-form Q&A, format conversions, spam detection. These tasks don't require frontier reasoning. They require a model that's reliable, fast, and accurate at pattern-based work — which describes every competent small model available today.
The waste is compounding. You're paying GPT-4o prices ($10/M output tokens) for tasks where GPT-4o Mini ($0.60/M output tokens) would produce an identical result. You're paying Claude 3.5 Sonnet prices ($15/M output tokens) for tasks where Claude 3 Haiku ($1.25/M output tokens) is perfectly sufficient. At scale, this inefficiency becomes a very large number.
The problem isn't that teams are making a bad decision — it's that they're making no decision at all. They're applying a blanket model choice uniformly across requests that have wildly different complexity profiles.
How to Stop Overpaying
Reducing your LLM API bill doesn't require a major overhaul. There are three practical steps:
- Audit your request types. Log a sample of 100–500 real requests from your application and categorize them by complexity. You'll likely find that the majority fall into a handful of simple categories: classify this, summarize that, extract these fields. Identify what percentage of your requests genuinely need frontier-level reasoning versus what percentage are pattern-based tasks a smaller model handles equally well.
- Match model to complexity. Use the pricing table above as your guide. For classification, extraction, and summarization tasks, GPT-4o Mini, Claude 3 Haiku, or Gemini 1.5 Flash are all strong options at a fraction of the cost. Reserve GPT-4o, Claude 3.5 Sonnet, or Gemini 1.5 Pro for the minority of requests that genuinely benefit from deeper reasoning.
- Automate with routing. Manual model selection is fragile — you can't classify every request in real time without adding complexity to your application logic. The most scalable approach is to use a routing layer that automatically evaluates each incoming request and dispatches it to the appropriate model. You define the quality thresholds; the router handles the dispatch decision for every call.
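To make the third step concrete, here's a deliberately naive routing sketch. The keyword-and-length heuristic is a crude stand-in for a real request classifier, and every name and threshold in it is an illustrative assumption, not any product's actual API:

```python
# A deliberately naive router: keyword and length heuristics stand in for
# a real request classifier. SIMPLE_TASK_HINTS, the length threshold, and
# the model names are illustrative assumptions.
SIMPLE_TASK_HINTS = ("classify", "summarize", "extract", "translate", "convert")

def pick_model(prompt: str) -> str:
    """Route pattern-based tasks to a cheap model, everything else to a frontier one."""
    text = prompt.lower()
    if any(hint in text for hint in SIMPLE_TASK_HINTS) and len(text) < 4000:
        return "gpt-4o-mini"  # $0.60/M output tokens
    return "gpt-4o"           # $10.00/M output tokens

print(pick_model("Classify this support ticket as billing, bug, or other."))   # gpt-4o-mini
print(pick_model("Design a phased migration plan for our payments service."))  # gpt-4o
```

A production router would replace the keyword check with a trained classifier or an LLM judge, and fall back to the larger model whenever confidence is low — the dispatch logic around it stays the same.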
If you want to go deeper on specific tactics, these posts walk through the details:
- 5 Ways to Reduce OpenAI Costs — practical techniques beyond model selection, including prompt compression, caching, and batching
- LLM Cost Calculator — run the numbers for your specific usage pattern across all major models
- What Is LLM Routing? — how automated routing works and when it makes sense to adopt it
The core insight is simple: not all LLM requests are created equal, and not all models need to cost the same. Once you stop treating every request as if it requires the most powerful model available, the savings at scale become hard to ignore.
Stop overpaying for every request
TokenSurf automatically routes each call to the cheapest model that meets your quality bar. One line change, no SDK required.
Get Started Free