Before you can optimize your LLM costs, you need to know what you're actually spending. Most developers have a rough idea — "we use GPT-4o and it's getting expensive" — but don't have precise numbers. These tables give you those numbers. Find your model, find your request volume, and you'll know your monthly bill before you open a single invoice.
Assumptions
All calculations in this post use the following baseline:
- 200 input tokens per request — your prompt, system message, and any context
- 300 output tokens per request — the model's response
Pricing reflects each provider's published rates at the time of writing: GPT-4o at $2.50/$10.00 per million input/output tokens, GPT-4o Mini at $0.15/$0.60, Claude 3.5 Sonnet at $3.00/$15.00, Claude 3 Haiku at $0.25/$1.25, Gemini 1.5 Pro at $1.25/$5.00, and Gemini 1.5 Flash at $0.075/$0.30. For a deeper breakdown of how these prices work and what drives them, see LLM API Costs Explained.
Monthly Cost by Model
The table below shows estimated monthly costs at different request volumes. Each cell is calculated as: (200 × input_price + 300 × output_price) ÷ 1,000,000 × requests.
| Model | 1K / mo | 10K / mo | 100K / mo | 500K / mo | 1M / mo |
|---|---|---|---|---|---|
| GPT-4o | $3.50 | $35.00 | $350.00 | $1,750.00 | $3,500.00 |
| GPT-4o Mini | $0.21 | $2.10 | $21.00 | $105.00 | $210.00 |
| Claude 3.5 Sonnet | $5.10 | $51.00 | $510.00 | $2,550.00 | $5,100.00 |
| Claude 3 Haiku | $0.43 | $4.25 | $42.50 | $212.50 | $425.00 |
| Gemini 1.5 Pro | $1.75 | $17.50 | $175.00 | $875.00 | $1,750.00 |
| Gemini 1.5 Flash | $0.11 | $1.05 | $10.50 | $52.50 | $105.00 |
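The per-cell formula above can be sketched in a few lines of Python. The model names in the dictionary are illustrative labels for this post's pricing table, not official API identifiers:

```python
# Per-million-token rates quoted above: (input $/1M, output $/1M).
PRICES = {
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
    "claude-3.5-sonnet": (3.00, 15.00),
    "claude-3-haiku": (0.25, 1.25),
    "gemini-1.5-pro": (1.25, 5.00),
    "gemini-1.5-flash": (0.075, 0.30),
}

def monthly_cost(model, requests, input_tokens=200, output_tokens=300):
    """(input_tokens * input_price + output_tokens * output_price) / 1M * requests."""
    in_price, out_price = PRICES[model]
    per_request = (input_tokens * in_price + output_tokens * out_price) / 1_000_000
    return per_request * requests

print(f"${monthly_cost('gpt-4o', 100_000):,.2f}")            # $350.00
print(f"${monthly_cost('gemini-1.5-flash', 100_000):,.2f}")  # $10.50
```

Swap in your own average token counts to see how sensitive your bill is to prompt and response length.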
The spread is enormous. At 100K requests per month, always using Claude 3.5 Sonnet costs $510. Always using Gemini 1.5 Flash costs $10.50. That's a 48x difference for the same number of API calls. For tasks where either model would produce acceptable output — simple Q&A, classification, short summarization — you're burning 48x more than you need to.
Savings with Routing
The routing scenario below assumes a 70/30 split: 70% of requests go to GPT-4o Mini (cheap, fast, handles most tasks), and 30% go to GPT-4o (frontier, for requests that genuinely need it). The blended per-request cost is 0.70 × $0.000210 + 0.30 × $0.003500 = $0.001197.
| Scenario | 1K / mo | 10K / mo | 100K / mo | 500K / mo | 1M / mo |
|---|---|---|---|---|---|
| Always GPT-4o | $3.50 | $35.00 | $350.00 | $1,750.00 | $3,500.00 |
| Routed (70% Mini / 30% GPT-4o) | $1.20 | $11.97 | $119.70 | $598.50 | $1,197.00 |
| Savings | $2.30 (66%) | $23.03 (66%) | $230.30 (66%) | $1,151.50 (66%) | $2,303.00 (66%) |
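The blended-cost arithmetic is a one-line weighted average; a minimal sketch using the per-request costs from the 200-in/300-out baseline:

```python
# Per-request costs at the 200-input/300-output baseline.
MINI_PER_REQUEST = 0.000210   # GPT-4o Mini
GPT4O_PER_REQUEST = 0.003500  # GPT-4o

def routed_cost(requests, cheap_share=0.70):
    """Monthly cost when cheap_share of requests go to the cheaper model."""
    blended = cheap_share * MINI_PER_REQUEST + (1 - cheap_share) * GPT4O_PER_REQUEST
    return blended * requests

baseline = GPT4O_PER_REQUEST * 100_000  # always GPT-4o: $350.00
routed = routed_cost(100_000)           # 70/30 split: $119.70
savings = baseline - routed             # $230.30, about 66%
```

Adjust `cheap_share` to model the more aggressive 80-90% splits discussed below.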
A 66% reduction in LLM spend without changing anything your users see. At 100K requests per month, that's $230 saved. At 1M requests, it's over $2,300 per month. These numbers get even larger if you include cheaper models like Gemini Flash or Claude Haiku in the routing mix.
The 70/30 split is conservative. In practice, most teams find that 80-90% of their requests can go to cheaper models once they actually analyze their workload. The 10-20% that genuinely need frontier capability are usually the complex, high-value interactions — exactly where you want to spend the extra budget.
Cost Per Conversation
For chatbot and assistant use cases, it's useful to think in terms of cost per conversation rather than cost per request. This table assumes 8 turns per conversation, with each turn consuming 200 input and 300 output tokens, for a total of 1,600 input tokens and 2,400 output tokens per conversation.
| Model | Cost per Conversation | Cost per 1,000 Conversations |
|---|---|---|
| GPT-4o | $0.0280 | $28.00 |
| GPT-4o Mini | $0.0017 | $1.68 |
| Claude 3.5 Sonnet | $0.0408 | $40.80 |
| Claude 3 Haiku | $0.0034 | $3.40 |
| Gemini 1.5 Pro | $0.0140 | $14.00 |
| Gemini 1.5 Flash | $0.00084 | $0.84 |
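Per-conversation cost is the same formula with 8x the tokens. A small sketch, passing each model's per-million rates directly:

```python
def conversation_cost(input_price, output_price, turns=8,
                      input_per_turn=200, output_per_turn=300):
    """Cost of one conversation: 8 turns of 200 input / 300 output tokens."""
    total_in = turns * input_per_turn    # 1,600 input tokens
    total_out = turns * output_per_turn  # 2,400 output tokens
    return (total_in * input_price + total_out * output_price) / 1_000_000

conversation_cost(2.50, 10.00)  # GPT-4o: $0.0280
conversation_cost(0.15, 0.60)   # GPT-4o Mini: $0.00168
```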
A customer support bot handling 10,000 conversations per month on GPT-4o costs $280. The same bot on GPT-4o Mini costs $16.80 — a $263 monthly difference. For most support use cases, Mini handles routine questions just fine. You can reserve the more expensive model for escalations or complex technical queries where users are more likely to notice the quality difference.
What Drives Your Cost
Four levers determine your LLM bill, in order of impact:
- Model choice. This is by far the biggest factor — as the tables above show, the difference between the cheapest and most expensive model is 48x or more. Before optimizing anything else, make sure you're using the right model for each task. Many teams default to GPT-4o for everything without questioning whether a cheaper model would work just as well.
- Output length. Output tokens are consistently more expensive than input tokens — 4-5x more per token for every model priced above. If your application generates long responses, trimming unnecessary verbosity has a meaningful impact. Instructing models to be concise, or using structured output formats that don't pad responses, can reduce output tokens by 20-40% without degrading quality.
- Input length. System prompts that are hundreds of tokens long, or RAG pipelines that include large context chunks in every request, add up quickly at scale. Audit your average input length and look for opportunities to reduce it — especially in system prompts that rarely change but get sent with every request.
- Volume. The obvious one: more requests mean more cost. But volume also interacts with the other levers. At low volume, model choice barely matters in absolute terms. At high volume, even small per-request optimizations compound into significant monthly savings.
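As a rough illustration of the output-length lever, here is what a 30% trim (within the 20-40% range mentioned above) does to per-request cost at GPT-4o rates:

```python
def per_request(in_tokens, out_tokens, in_price=2.50, out_price=10.00):
    """Per-request cost at GPT-4o rates ($/1M tokens)."""
    return (in_tokens * in_price + out_tokens * out_price) / 1_000_000

before = per_request(200, 300)      # baseline: $0.00350
after = per_request(200, 210)       # 30% fewer output tokens: $0.00260
reduction = 1 - after / before      # ~26% cheaper per request
```

Because output tokens dominate the bill at these ratios, a 30% output trim cuts total per-request cost by roughly a quarter, before touching the model or the prompt.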
For a more detailed breakdown of how these pricing mechanisms work under the hood, see LLM API Costs Explained. For specific tactics to reduce your bill without degrading quality, see 5 Ways to Reduce OpenAI Costs.
TokenSurf automates this analysis and routing for you — it classifies each incoming request and routes it to the cheapest model that meets your quality requirements. You set the threshold, it handles the switching. The tables above represent what's possible; TokenSurf makes it automatic.
Turn these estimates into actual savings
TokenSurf routes every request to the cheapest model that fits. Drop-in API, no infrastructure work, no lock-in.
Get Started Free