LLM APIs charge per token — but not all tokens cost the same, not all models charge the same rates, and the gap between the cheapest and most expensive options is wider than most developers realize. If you're building on top of LLM APIs and haven't looked closely at the pricing structure, there's a good chance you're overpaying. Understanding how pricing actually works is the first step to fixing that.

How LLM Pricing Works

Every LLM API call is billed based on the number of tokens processed, not characters, words, or requests. A token is roughly 4 characters of English text, or about ¾ of a word. That means a 500-word response contains approximately 665 output tokens, and a 200-word system prompt contains roughly 265 input tokens.
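
The 4-characters-per-token figure is only an approximation. For OpenAI models you can get exact counts with the open-source tiktoken library, which uses the same encodings the models do; other providers tokenize differently, so treat this as an estimate elsewhere. A minimal sketch:

```python
# pip install tiktoken
import tiktoken

# o200k_base is the encoding used by GPT-4o; older models use other encodings.
enc = tiktoken.get_encoding("o200k_base")

text = "Summarize the following support ticket in two sentences."
tokens = enc.encode(text)

print(f"{len(text.split())} words -> {len(tokens)} tokens")
```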

Pricing is split into two separate rates: input tokens (what you send to the model) and output tokens (what the model generates back). This distinction matters enormously because output tokens typically cost 4 to 5 times more than input tokens at the major providers (open-weight models served through aggregators like OpenRouter are the main exception, as the table below shows).

The reason for this asymmetry is computational: generation is sequential, so each output token requires its own full forward pass through the model. Processing input tokens is comparatively cheap, since the whole prompt is handled in a single parallelized pass. So a call where you send a 500-token prompt and receive a 500-token response isn't billed evenly: the output half costs 4-5x more per token than the input half.

Rule of thumb: output tokens dominate your bill. If you want to cut costs, start by looking at how many output tokens your typical response generates — not just how long your prompts are.
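
To see the asymmetry in a single call, here's the arithmetic at GPT-4o's rates ($2.50/M input, $10.00/M output, taken from the table in the next section):

```python
INPUT_RATE = 2.50    # $ per 1M input tokens (GPT-4o)
OUTPUT_RATE = 10.00  # $ per 1M output tokens (GPT-4o)

input_tokens, output_tokens = 500, 500  # a perfectly symmetric call

input_cost = input_tokens * INPUT_RATE / 1_000_000
output_cost = output_tokens * OUTPUT_RATE / 1_000_000

print(f"input: ${input_cost:.5f}  output: ${output_cost:.5f}")
# input: $0.00125  output: $0.00500 -- the output half is 4x the cost
```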

2025 Pricing Comparison

Here's the current landscape across the major providers and models. All prices are per 1 million tokens:

| Model | Provider | Input ($/1M tokens) | Output ($/1M tokens) |
|-------|----------|---------------------|----------------------|
| GPT-4o | OpenAI | $2.50 | $10.00 |
| GPT-4o Mini | OpenAI | $0.15 | $0.60 |
| Claude 3.5 Sonnet | Anthropic | $3.00 | $15.00 |
| Claude 3 Haiku | Anthropic | $0.25 | $1.25 |
| Gemini 1.5 Pro | Google | $1.25 | $5.00 |
| Gemini 1.5 Flash | Google | $0.075 | $0.30 |
| Llama 3.1 70B | OpenRouter | $0.52 | $0.75 |
| Mixtral 8x7B | OpenRouter | $0.24 | $0.24 |

The spread is staggering. Claude 3.5 Sonnet output costs $15 per million tokens; Gemini 1.5 Flash output costs $0.30 per million tokens. That's a 50x difference. GPT-4o vs. GPT-4o Mini is roughly a 17x difference on both input and output. Even within a single family of models, the pricing tiers are designed to reflect very different capability levels and very different cost profiles.
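
It's worth encoding these rates as data so you can compare per-request costs directly. A sketch using the table above, with a hypothetical 200-token prompt and 500-token response:

```python
# (input $/1M tokens, output $/1M tokens), from the comparison table above
PRICES = {
    "gpt-4o":            (2.50, 10.00),
    "gpt-4o-mini":       (0.15, 0.60),
    "claude-3.5-sonnet": (3.00, 15.00),
    "claude-3-haiku":    (0.25, 1.25),
    "gemini-1.5-pro":    (1.25, 5.00),
    "gemini-1.5-flash":  (0.075, 0.30),
}

def cost_per_request(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one call, given the model's per-1M-token rates."""
    input_rate, output_rate = PRICES[model]
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

for model in PRICES:
    print(f"{model:<18} ${cost_per_request(model, 200, 500):.6f}")
```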

The Hidden Multiplier — Output Tokens

Let's make this concrete. Suppose your application averages 500 output tokens per request, a typical-length response. Here's what output tokens alone cost across three models at different monthly volumes:

| Requests/month | GPT-4o | GPT-4o Mini | Claude 3 Haiku |
|----------------|--------|-------------|----------------|
| 10,000 | $50.00 | $3.00 | $6.25 |
| 100,000 | $500.00 | $30.00 | $62.50 |
| 1,000,000 | $5,000.00 | $300.00 | $625.00 |

(Calculation: 500 output tokens × output rate × request count. Input costs excluded for simplicity.)

At 100,000 requests per month, the difference between GPT-4o and GPT-4o Mini is $470 per month — on output tokens alone. At one million requests, that's a $4,700 monthly gap from a single pricing decision. This is the hidden multiplier: output tokens compound at scale in ways that are easy to underestimate when you're prototyping with low volumes.
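
The projection is easy to rerun with your own traffic numbers; a sketch that reproduces the table above:

```python
OUTPUT_RATES = {"GPT-4o": 10.00, "GPT-4o Mini": 0.60, "Claude 3 Haiku": 1.25}
TOKENS_PER_RESPONSE = 500  # average output tokens per request

for requests in (10_000, 100_000, 1_000_000):
    output_tokens = requests * TOKENS_PER_RESPONSE
    costs = {m: output_tokens * rate / 1_000_000
             for m, rate in OUTPUT_RATES.items()}
    row = "  ".join(f"{m}: ${c:,.2f}" for m, c in costs.items())
    print(f"{requests:>9,} req/mo: {row}")
```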

Why You're Probably Overpaying

The most common reason teams overpay for LLM API calls is simple: they default to the best available model and never revisit that choice. You pick GPT-4o during prototyping because it works well and gives you confidence in the output. Then you ship, you scale, and the model choice stays locked in because changing it feels risky.

But here's the thing: in most production workloads, the large majority of requests (commonly estimated at around 70%) are simple tasks: summarizations, classifications, entity extractions, short-form Q&A, format conversions, spam detection. These tasks don't require frontier reasoning. They require a model that's reliable, fast, and accurate at pattern-based work, which describes every competent small model available today.

The waste is compounding. You're paying GPT-4o prices ($10/M output tokens) for tasks where GPT-4o Mini ($0.60/M output tokens) would produce an identical result. You're paying Claude 3.5 Sonnet prices ($15/M output tokens) for tasks where Claude 3 Haiku ($1.25/M output tokens) is perfectly sufficient. At scale, this inefficiency becomes a very large number.
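
To put a number on the waste: if a share of your traffic can move to the cheaper model, the blended output rate drops almost linearly. A sketch using the 70% figure from above (an estimate, not a measured split for your workload):

```python
def blended_rate(cheap_share: float, cheap_rate: float, frontier_rate: float) -> float:
    """Average output $/1M tokens when cheap_share of traffic uses the cheap model."""
    return cheap_share * cheap_rate + (1 - cheap_share) * frontier_rate

before = blended_rate(0.0, 0.60, 10.00)  # everything on GPT-4o
after = blended_rate(0.7, 0.60, 10.00)   # 70% rerouted to GPT-4o Mini

print(f"${before:.2f} -> ${after:.2f} per 1M output tokens "
      f"({1 - after / before:.0%} saved)")
# $10.00 -> $3.42 per 1M output tokens (66% saved)
```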

The problem isn't that teams are making a bad decision — it's that they're making no decision at all. They're applying a blanket model choice uniformly across requests that have wildly different complexity profiles.

How to Stop Overpaying

Reducing your LLM API bill doesn't require a major overhaul. There are three practical steps:

  1. Audit your request types. Log a sample of 100–500 real requests from your application and categorize them by complexity. You'll likely find that the majority fall into a handful of simple categories: classify this, summarize that, extract these fields. Identify what percentage of your requests genuinely need frontier-level reasoning versus what percentage are pattern-based tasks a smaller model handles equally well.
  2. Match model to complexity. Use the pricing table above as your guide. For classification, extraction, and summarization tasks, GPT-4o Mini, Claude 3 Haiku, or Gemini 1.5 Flash are all strong options at a fraction of the cost. Reserve GPT-4o, Claude 3.5 Sonnet, or Gemini 1.5 Pro for the minority of requests that genuinely benefit from deeper reasoning.
  3. Automate with routing. Manual model selection is fragile: you can't hand-classify every request in real time without adding complexity to your application logic. The most scalable approach is a routing layer that automatically evaluates each incoming request and dispatches it to the appropriate model. You define the quality thresholds; the router handles the dispatch decision for every call. A minimal sketch of the idea follows this list.
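
Here's what the dispatch decision can look like in application code. The keyword heuristic and model names are illustrative only; a production router would score requests with a trained classifier or an LLM-based judge:

```python
# Illustrative only: real routers score requests with a classifier,
# not keyword matching.
SIMPLE_TASK_HINTS = ("classify", "summarize", "extract", "translate", "label")

CHEAP_MODEL = "gpt-4o-mini"  # pattern-based work
FRONTIER_MODEL = "gpt-4o"    # multi-step reasoning

def pick_model(prompt: str) -> str:
    """Route short, pattern-based prompts to the cheap model."""
    p = prompt.lower()
    if any(hint in p for hint in SIMPLE_TASK_HINTS) and len(p) < 2_000:
        return CHEAP_MODEL
    return FRONTIER_MODEL

print(pick_model("Classify this ticket as billing, bug, or feature."))  # gpt-4o-mini
print(pick_model("Design a migration plan for our sharded database."))  # gpt-4o
```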


The core insight is simple: not all LLM requests are created equal, and not all models need to cost the same. Once you stop treating every request as if it requires the most powerful model available, the savings at scale become hard to ignore.

Stop overpaying for every request

TokenSurf automatically routes each call to the cheapest model that meets your quality bar. A one-line change, no SDK required.

Get Started Free