LLM prices have dropped over 90% in two years. GPT-4 launched in March 2023 at $60 per million output tokens, a number that felt reasonable at the time because there was nothing to compare it to. Today, GPT-4o Mini does comparable work on a vast range of tasks for $0.60 per million output tokens: a 99% reduction in less than 18 months. Here's where prices are heading next, and what it means for how you should architect your AI applications.

The Price Drop Timeline

The decline in LLM pricing has been relentless and, if anything, accelerating. To understand where things are going, it helps to trace how we got here.

| Model | Release | Input (per 1M tokens) | Output (per 1M tokens) | vs. GPT-4 launch |
|---|---|---|---|---|
| GPT-4 (8K) | March 2023 | $30.00 | $60.00 | baseline |
| Claude 2 | Mid 2023 | $8.00 | $24.00 | ~2.5x cheaper |
| GPT-4 Turbo | Late 2023 | $10.00 | $30.00 | ~2x cheaper |
| Claude 3 Haiku | Early 2024 | $0.25 | $1.25 | ~48x cheaper |
| GPT-4o | Mid 2024 | $5.00 | $15.00 | ~4x cheaper |
| GPT-4o Mini | Mid 2024 | $0.15 | $0.60 | ~100x cheaper |
| Gemini 1.5 Flash | Mid 2024 | $0.075 | $0.30 | ~200x cheaper |

(All "vs. GPT-4 launch" multiples are computed on output-token price.)

The pattern is clear: roughly every six to nine months, a new tier of pricing emerges that makes the previous generation feel expensive. GPT-4 Turbo made the original GPT-4 look overpriced. GPT-4o made Turbo look overpriced. GPT-4o Mini made GPT-4o look overpriced for most tasks. This cycle isn't slowing down.

Why Prices Keep Falling

Three structural forces are driving the decline, and all three are likely to intensify over the next 12-18 months.

Competition. OpenAI, Anthropic, Google, and Meta are engaged in an arms race where pricing is a primary weapon. When Google launched Gemini 1.5 Flash at $0.075 per million input tokens — cheaper than any comparable API model at the time — it forced OpenAI to respond. When Meta releases Llama models for free, it changes what "expensive" means for API providers. Each new entrant and each new model release creates pressure on the entire market.

Efficiency gains. The cost to serve a model keeps falling through distillation (training smaller models to mimic larger ones), quantization (reducing model precision without proportional quality loss), and hardware improvements (Nvidia H100s, Google TPUs, and increasingly custom silicon from model providers themselves). What cost $10 to run in 2023 costs $1 to run today — and the engineering teams at these companies haven't stopped optimizing.

Open-source pressure. Llama 3.1 and Mixtral demonstrated that open-source models could match or approach the quality of frontier proprietary models on a wide range of tasks. When a company can self-host a near-equivalent model for infrastructure costs alone, it creates a hard ceiling on what API providers can charge. At scale, the price of LLM inference is trending toward the raw cost of compute, which per token is close to zero.

The Commoditization Thesis

For many categories of tasks, LLMs are becoming commodities. When five different models can all classify customer sentiment with 94% accuracy, the differentiator isn't intelligence — it's cost, reliability, latency, and tooling. The "best model wins" era is giving way to "cheapest adequate model wins."

This doesn't mean all tasks are commoditized. Complex reasoning, creative writing, and nuanced judgment still show meaningful differences between frontier models and their cheaper alternatives. But these represent a minority of real-world LLM workloads. Most production AI applications are doing tasks that any capable model can handle: summarization, classification, extraction, simple Q&A, formatting, translation.

"In a commodity market, the winners are the companies that help you buy smart, not the companies that sell the commodity."

This is why model providers are racing to differentiate on dimensions beyond raw intelligence: context windows, multimodality, tool use, fine-tuning options, latency guarantees, and enterprise features. Pure text inference on standard tasks is becoming a race to the bottom on price — which is excellent news if you're a buyer.

Open Source Changes Everything

The release of Llama 3.1 405B in mid-2024 was a landmark moment. For the first time, an open-source model delivered genuinely GPT-4-class performance on many benchmarks, and you could run it yourself. Combined with Mixtral, DeepSeek, Qwen, and a growing ecosystem of fine-tuned derivatives, the open-source tier has matured into a real alternative.

The economics of self-hosting look like this: at low request volumes, self-hosting is more expensive than API access because you're paying for idle GPU time. At high volumes, it flips dramatically — you're paying only for hardware, not per-token margin.

| Monthly Volume | API Cost (GPT-4o Mini) | Self-Hosting Cost (est.) | Better Option |
|---|---|---|---|
| 1M tokens | ~$1 | ~$300+ (idle GPU) | API |
| 10M tokens | ~$10 | ~$300 (light usage) | API |
| 100M tokens | ~$100 | ~$350 (moderate usage) | API (close) |
| 1B tokens | ~$1,000 | ~$500 (heavy usage) | Self-host |
| 10B tokens | ~$10,000 | ~$2,000 (multi-GPU) | Self-host |

The break-even point is somewhere around 500M-1B tokens per month for a single-GPU setup — roughly the scale of a mid-sized production application. Below that, APIs win on economics. Above it, self-hosting starts to look attractive.
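As a sanity check on that range, here's a minimal back-of-the-envelope sketch in Python. The inputs are assumptions, not quotes: a blended API price of about $1 per million tokens (roughly the GPT-4o Mini figures above) and a dedicated inference GPU at about $500 per month.

```python
# Back-of-the-envelope: monthly token volume at which a fixed GPU cost
# equals API spend. All inputs are illustrative assumptions.

def break_even_tokens(api_price_per_million_usd: float, gpu_monthly_cost_usd: float) -> float:
    """Monthly token volume where self-hosting spend matches API spend."""
    return gpu_monthly_cost_usd / api_price_per_million_usd * 1_000_000

volume = break_even_tokens(api_price_per_million_usd=1.00, gpu_monthly_cost_usd=500.0)
print(f"Break-even at ~{volume / 1e6:.0f}M tokens/month")  # ~500M tokens/month
```

Push the GPU cost up (redundancy, ops time) or the API price down and the break-even moves accordingly, which is why the table above shows a crossover zone rather than a hard line.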

But the more important effect of open source isn't that it makes self-hosting viable — it's that it creates a price ceiling for API providers. No API provider can sustainably charge $10 per million output tokens for a task that a free Llama model can handle adequately. The existence of open-source alternatives anchors the entire market's pricing expectations downward.

Why Routing Gets More Valuable

Here's the counterintuitive part: as prices fall, routing becomes more valuable, not less. The reason is abundance.

In 2023, there were a handful of models worth considering. Today there are dozens — multiple tiers from OpenAI, Anthropic, Google, and Mistral, plus open-source options via hosting providers like Together AI and Fireworks. By 2026, there will likely be hundreds of viable options across price points, specializations, and capability profiles.

Choosing the right model for each request manually at this scale is impossible. And the cost of making the wrong choice is real: send too many simple requests to frontier models and you're overpaying significantly; send too many complex requests to cheap models and you're degrading quality. The sweet spot moves constantly as providers reprice their models, new options enter the market, and the relative capabilities of models shift.

Routing is the answer to abundance. Instead of picking one model and living with its tradeoffs, you let a routing layer make the decision per-request, optimizing continuously against the current pricing landscape. As new cheaper models emerge, your costs automatically drop. As a provider has an outage, traffic automatically reroutes. The more complex the model landscape becomes, the more value routing captures.
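To make the mechanics concrete, here's a deliberately minimal sketch of that per-request decision. The model names, prices, and quality scores are placeholders, not benchmark results, and a production router would also weigh latency, rate limits, and provider health:

```python
from dataclasses import dataclass

@dataclass
class Model:
    name: str
    output_price: float  # USD per 1M output tokens
    quality: float       # score on your own eval set (0-1)

# Hypothetical catalog; in practice this comes from live pricing data
# plus quality scores measured on your own evaluations.
CATALOG = [
    Model("frontier-large", 15.00, 0.97),
    Model("mid-tier", 3.00, 0.93),
    Model("small-fast", 0.60, 0.89),
]

def route(required_quality: float) -> Model:
    """Pick the cheapest model that clears the quality bar."""
    eligible = [m for m in CATALOG if m.quality >= required_quality]
    if not eligible:  # nothing qualifies: fall back to the strongest model
        return max(CATALOG, key=lambda m: m.quality)
    return min(eligible, key=lambda m: m.output_price)

print(route(0.90).name)  # -> "mid-tier"

# When a cheaper model that clears the bar launches, adding it to the
# catalog is the only change; eligible traffic shifts automatically.
CATALOG.append(Model("new-budget", 0.30, 0.91))
print(route(0.90).name)  # -> "new-budget"
```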

For a deeper look at how routing works in practice, see What Is LLM Routing?

What to Do Now

Given where LLM pricing is heading, here are three concrete recommendations for how to architect your AI applications today.

1. Don't lock into one provider. The single biggest mistake teams make is hardcoding a specific model into their application. When GPT-4o Mini launched, teams that had architected around a single OpenAI model couldn't take advantage of the price drop without significant refactoring. Design your system to be provider-agnostic from day one: use an abstraction layer, an OpenAI-compatible router, or at minimum keep your model selection configurable (see the sketch after this list).

2. Use routing to automatically benefit from price drops. Every time a cheaper model launches that meets your quality bar, a routing layer can automatically start sending eligible requests to it — without any code changes on your end. What would otherwise require a sprint of testing and deployment becomes a configuration update. Over a 12-month horizon, this is worth more than any initial optimization effort.

3. Watch open source — it's catching up faster than expected. The gap between frontier proprietary models and the best open-source alternatives is narrowing at every benchmark cycle. For many tasks, it has already effectively closed. If you're building at scale and haven't seriously evaluated whether Llama 3.1 or its successors could handle your workload, it's worth revisiting. The economics at high volume are compelling, and the quality story improves with every release.
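As one lightweight way to keep that flexibility, here's a sketch using the OpenAI Python SDK. Because most providers and routing layers expose OpenAI-compatible endpoints, pointing at a different backend becomes a configuration change rather than a refactor. The environment variable names here are illustrative, not a standard:

```python
import os
from openai import OpenAI  # pip install openai

# Endpoint, key, and model all come from configuration, not code,
# so a provider switch (or a move to a router) is a config change.
client = OpenAI(
    api_key=os.environ["LLM_API_KEY"],
    base_url=os.environ.get("LLM_BASE_URL", "https://api.openai.com/v1"),
)
MODEL_NAME = os.environ.get("LLM_MODEL", "gpt-4o-mini")

def complete(prompt: str) -> str:
    """Send a single-turn chat completion to whatever backend is configured."""
    response = client.chat.completions.create(
        model=MODEL_NAME,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```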

For practical guidance on building a flexible multi-provider architecture, see Building a Multi-Model LLM Architecture. If you're still in the early stages of thinking about this, LLM Routing for Startups covers how to approach these decisions without over-engineering before you need to.

The bottom line: LLM pricing will continue falling, the model landscape will continue expanding, and the teams that build flexibility into their architecture now will benefit automatically from every wave of price cuts. The question isn't whether to prepare for a cheaper, more competitive market — it's whether your architecture is ready to capture the savings when they arrive.

Stay ahead of LLM pricing shifts

TokenSurf routes your requests to the cheapest model that meets your quality bar — automatically adapting as the market changes.

Get Started Free