OpenAI API costs add up fast. At small scale, the bill feels manageable. But as soon as your product gets traction — 10,000 requests a day, 100,000, a million — the cost curve becomes your biggest infrastructure problem. The good news: most teams are dramatically overpaying, not because OpenAI is expensive per se, but because they haven't applied a handful of well-known optimizations.

Here are five things you can do today to reduce your OpenAI bill, with code examples and rough savings estimates for each. None of these require major architecture changes. All of them compound.

1. Optimize Your Prompts

Every token you send costs money. System prompts, few-shot examples, context preambles — it all adds up before the model has answered a single word. Most production prompts have significant fat that can be trimmed without affecting output quality.

The most common waste categories are: overly verbose system instructions ("You are an extremely helpful, knowledgeable, and professional assistant who always strives to..."), large few-shot example sets where 2 examples would work as well as 6, and irrelevant context stuffed into every request regardless of whether the task needs it.

Here's a before-and-after example for a sentiment classification prompt:

# BEFORE — verbose prompt (187 tokens)
system = """
You are a professional sentiment analysis expert with years of experience
analyzing customer feedback. Your job is to carefully read the provided text
and determine whether the overall sentiment expressed is positive, negative,
or neutral. Please analyze the text thoroughly before providing your answer.
Always respond with a single word: Positive, Negative, or Neutral.

Here are some examples to guide you:
Example 1: "I love this product!" → Positive
Example 2: "This is terrible." → Negative
Example 3: "It arrived on time." → Neutral
Example 4: "Best purchase I've made." → Positive
Example 5: "Complete waste of money." → Negative
"""

# AFTER — concise prompt (28 tokens)
system = """
Classify sentiment as Positive, Negative, or Neutral. Reply with one word.
"""

The shorter prompt produces identical classification accuracy for this task. That's an 85% reduction in system prompt tokens on every single request. If you're making 50,000 requests per day, that's roughly 8 million tokens you're no longer paying for daily.

Audit your longest system prompts. Ask: does each sentence actually change model behavior? If removing it doesn't change output quality in your evals, cut it.
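
A quick way to quantify the trim is to count tokens before and after with the tiktoken library. The snippet below is a minimal sketch (o200k_base is the encoding used by the GPT-4o family; the prompt variables are placeholders for your real prompts):

import tiktoken

# o200k_base is the tokenizer used by GPT-4o and GPT-4o Mini
enc = tiktoken.get_encoding("o200k_base")

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

verbose_prompt = "You are a professional sentiment analysis expert..."  # paste the full prompt
concise_prompt = "Classify sentiment as Positive, Negative, or Neutral. Reply with one word."

print(count_tokens(verbose_prompt), count_tokens(concise_prompt))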

Estimated savings: 20–40% reduction in input token costs.

2. Use the Right Model for Each Task

This is the single highest-leverage cost lever available. GPT-4o is an extraordinary model, but at $2.50 per million input tokens and $10.00 per million output tokens, it's dramatic overkill for the majority of tasks teams throw at it. GPT-4o Mini costs $0.15 per million input tokens and $0.60 per million output tokens, roughly a 17x difference on both input and output.

The question isn't whether GPT-4o is better than Mini — it obviously is on hard benchmarks. The question is whether your specific tasks require that margin of capability. For the majority of real-world workloads — text classification, entity extraction, short summarizations, simple Q&A, format conversions, code completions under 50 lines — GPT-4o Mini produces outputs that are functionally indistinguishable from GPT-4o to end users.
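
To make the price gap concrete, here is a back-of-the-envelope cost per request at the list prices above; the 500-input/300-output request size is just an illustrative assumption:

# Rough cost per request, with prices in dollars per 1M tokens
def cost_per_request(input_price: float, output_price: float,
                     input_tokens: int = 500, output_tokens: int = 300) -> float:
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

print(cost_per_request(2.50, 10.00))  # GPT-4o:      ~$0.00425 per request
print(cost_per_request(0.15, 0.60))   # GPT-4o Mini: ~$0.000255 per request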

See our detailed breakdown: GPT-4o vs GPT-4o Mini: When to Use Each

A practical decision framework:

  • Use GPT-4o Mini for: classification, tagging, extraction, simple Q&A, short summaries, basic code generation, data formatting
  • Use GPT-4o for: complex multi-step reasoning, detailed technical explanations, nuanced creative writing, difficult debugging, chain-of-thought tasks requiring deep context

If 70% of your requests are Mini-eligible (a conservative estimate for most products), switching those requests drops your average cost per request by roughly two-thirds:

# Model selection by task type
def get_model(task_type: str) -> str:
    MINI_TASKS = {
        "classify", "extract", "summarize_short",
        "tag", "format", "qa_simple"
    }
    return "gpt-4o-mini" if task_type in MINI_TASKS else "gpt-4o"

# Usage
model = get_model("classify")   # → "gpt-4o-mini"  ($0.15/1M input)
model = get_model("reasoning")  # → "gpt-4o"        ($2.50/1M input)

Estimated savings: 50–80% on requests that can be routed to Mini.

3. Cache Repeated Requests

Many LLM applications send the same prompt — or nearly the same prompt — repeatedly. A customer support bot that fields the same five questions all day. A data pipeline that processes records with identical structure. A coding assistant where the same boilerplate generation is triggered hundreds of times per day. All of these are paying full price for requests that have already been answered.

The fix is straightforward: hash your prompt, cache the response, return the cached response on subsequent identical calls.

import hashlib
import json

# Simple in-memory cache (swap for Redis in production)
_cache = {}

def cached_completion(client, messages: list, model: str, **kwargs):
    # Build a stable cache key from the full request, including kwargs
    # (temperature, max_tokens, etc.) so different settings don't collide
    key_data = json.dumps({"messages": messages, "model": model, **kwargs}, sort_keys=True)
    cache_key = hashlib.sha256(key_data.encode()).hexdigest()

    if cache_key in _cache:
        return _cache[cache_key]  # Cache hit — $0.00

    # Cache miss — make the real API call
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        **kwargs
    )
    _cache[cache_key] = response
    return response

For production use, replace the in-memory dict with Redis and add a TTL appropriate to your use case. A 24-hour TTL is reasonable for most content that doesn't change frequently. For truly static content (FAQ answers, boilerplate generation), a week or more is fine.
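
Here is what the Redis version might look like, as a minimal sketch; it assumes the redis-py client, a local Redis instance, and the openai 1.x SDK (whose response objects serialize via model_dump_json). Note that it returns a plain dict on both the hit and miss paths:

import hashlib
import json
import redis

r = redis.Redis(host="localhost", port=6379, db=0)
TTL_SECONDS = 24 * 60 * 60  # 24-hour TTL; tune per use case

def cached_completion(client, messages: list, model: str, **kwargs):
    # Same cache key as before, including kwargs so different settings don't collide
    key_data = json.dumps({"messages": messages, "model": model, **kwargs}, sort_keys=True)
    cache_key = "llm:" + hashlib.sha256(key_data.encode()).hexdigest()

    cached = r.get(cache_key)
    if cached is not None:
        return json.loads(cached)  # cache hit, no API call

    # Cache miss: make the real API call and store the serialized response
    response = client.chat.completions.create(model=model, messages=messages, **kwargs)
    payload = response.model_dump_json()
    r.setex(cache_key, TTL_SECONDS, payload)  # set with expiry
    return json.loads(payload)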

OpenAI also has a built-in prompt caching feature that automatically discounts repeated prompt prefixes. For requests where the first 1,024+ tokens of the prompt are identical, you get a 50% discount on those cached tokens automatically. This works particularly well for large system prompts that don't change between requests — OpenAI caches the prompt prefix server-side and charges you half price for it.
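
The practical implication is ordering: keep the large, unchanging part of the prompt at the start and the per-request content at the end, so the prefix stays identical across calls. A small sketch (LONG_POLICY_TEXT is a placeholder for your real static system prompt):

# Static instructions first so the 1,024+ token prefix can be cached server-side
LONG_POLICY_TEXT = "..."  # placeholder: a large system prompt that never changes

def build_messages(user_input: str) -> list:
    return [
        {"role": "system", "content": LONG_POLICY_TEXT},  # cacheable prefix
        {"role": "user", "content": user_input},          # varies per request
    ]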

Estimated savings: Up to 100% on cache hits; 50% on the prompt tokens covered by OpenAI's built-in prefix caching.

4. Batch Your Requests

OpenAI's Batch API offers a 50% discount on all requests that don't need to be processed in real time. If you have any workloads that could tolerate a completion window of up to 24 hours — data enrichment pipelines, nightly processing jobs, large-scale classification runs, content generation queues — you're leaving half your money on the table by calling the synchronous API.

The Batch API accepts a JSONL file of up to 50,000 requests, processes them asynchronously, and returns results within 24 hours. Here's how to submit a batch:

# Step 1: Create a JSONL file with your requests
# Each line is a JSON object with a custom_id for tracking
cat requests.jsonl
{"custom_id": "req-001", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "Classify this review: 'Great product, fast shipping'"}]}}
{"custom_id": "req-002", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "Classify this review: 'Arrived broken, terrible packaging'"}]}}

# Step 2: Upload the file
curl https://api.openai.com/v1/files \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -F "purpose=batch" \
  -F "file=@requests.jsonl"
# → {"id": "file-abc123", ...}

# Step 3: Create the batch job
curl https://api.openai.com/v1/batches \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"input_file_id": "file-abc123", "endpoint": "/v1/chat/completions", "completion_window": "24h"}'

# Step 4: Check status using the batch id returned by Step 3; once it
# reports "completed", download the file referenced by "output_file_id"
curl https://api.openai.com/v1/batches/batch-xyz \
  -H "Authorization: Bearer $OPENAI_API_KEY"

The 50% discount applies to both input and output tokens. If you're running a nightly pipeline that processes 100,000 records at $0.15/1M tokens with GPT-4o Mini, batching cuts that to $0.075/1M. At scale, the savings are significant — and the only cost is a delay in processing that your pipeline already tolerates.

Estimated savings: 50% on all batch-eligible requests.

5. Route with a Proxy

The four techniques above all require you to make individual decisions: which prompt to shorten, which requests to cache, which calls to batch. A routing proxy automates the most important of those decisions — model selection — for every single request, without you having to think about it.

A routing proxy sits between your application and the LLM providers. It analyzes each incoming request and automatically selects the cheapest model capable of handling it. You don't change your code to handle each request type — the router figures it out.

The integration is a two-field change:

# Without routing — hardcoded GPT-4o on everything
curl https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o", "messages": [{"role": "user", "content": "Summarize this email in one sentence."}]}'
# Cost: ~$2.50/1M input tokens

# With routing — auto-selects cheapest capable model
curl https://api.tokensurf.io/v1/chat/completions \
  -H "Authorization: Bearer $TOKENSURF_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o", "messages": [{"role": "user", "content": "Summarize this email in one sentence."}]}'
# Router detects simple summarization → routes to GPT-4o Mini
# Cost: ~$0.15/1M input tokens

The routing proxy is OpenAI API-compatible, so your existing code, SDK calls, and response parsing all work without modification. You specify the model you want as a ceiling; the router uses the cheapest model that can meet that quality bar.
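
In the Python SDK, that compatibility means the switch is just a different base URL and key. This sketch mirrors the curl example above (the /v1 base path and the TOKENSURF_KEY variable name are assumptions):

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.tokensurf.io/v1",  # proxy endpoint instead of api.openai.com
    api_key=os.environ["TOKENSURF_KEY"],     # proxy key instead of the OpenAI key
)

response = client.chat.completions.create(
    model="gpt-4o",  # treated as a quality ceiling; the router may pick a cheaper model
    messages=[{"role": "user", "content": "Summarize this email in one sentence."}],
)
print(response.choices[0].message.content)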

For a deeper explanation of how routing works: What Is LLM Routing? A Beginner's Guide

Estimated savings: 40–60% average cost reduction across a mixed workload.

Putting It All Together

These five techniques aren't mutually exclusive — they stack. Here's what the combined impact looks like for a team sending 100,000 GPT-4o requests per day at 500 input tokens and 300 output tokens per request:

Combined savings estimate (100K requests/day):

Baseline cost (all GPT-4o): ~$12,750/month
After prompt optimization (30% fewer input tokens): ~$11,625/month
After model routing (70% to Mini): ~$3,975/month
After caching (20% hit rate): ~$3,180/month
After batching eligible requests (30% at half price): ~$2,700/month

Total reduction: roughly 79%, from ~$12,750 down to ~$2,700 per month.

The exact numbers will vary based on your workload composition, but the direction is consistent: layering these techniques compounds the savings significantly beyond what any single technique achieves alone.
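
To sanity-check the arithmetic: 100,000 requests per day at 500 input and 300 output tokens is 50M input and 30M output tokens daily, or about $425 per day at GPT-4o list prices. Here is a quick sketch of the full cascade under the assumptions stated above:

# Reproduce the cascade above (30-day month, prices in dollars per 1M tokens)
REQUESTS_PER_DAY, DAYS = 100_000, 30
IN_TOK, OUT_TOK = 500, 300
GPT4O_IN, GPT4O_OUT = 2.50, 10.00
MINI_IN, MINI_OUT = 0.15, 0.60

def monthly_cost(in_tok, out_tok, in_price, out_price):
    return REQUESTS_PER_DAY * DAYS * (in_tok * in_price + out_tok * out_price) / 1_000_000

baseline       = monthly_cost(IN_TOK, OUT_TOK, GPT4O_IN, GPT4O_OUT)        # ~$12,750
after_prompts  = monthly_cost(IN_TOK * 0.7, OUT_TOK, GPT4O_IN, GPT4O_OUT)  # ~$11,625
mini_fraction  = 0.3 + 0.7 * (MINI_IN / GPT4O_IN)  # Mini is ~6% of GPT-4o's price per token
# note: the output ratio (MINI_OUT / GPT4O_OUT) is the same ~6%, so one factor suffices
after_routing  = after_prompts * mini_fraction                             # ~$3,975
after_caching  = after_routing * 0.80               # 20% of requests served from cache
after_batching = after_caching * (0.7 + 0.3 * 0.5)  # 30% of requests at half price
print(round(baseline), round(after_batching))       # ~12750 and ~2704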

To see how the math works for your specific usage volume and model mix, use our LLM Cost Calculator — it lets you model the impact of routing, caching, and batching on your actual numbers.

Stop overpaying for every request

TokenSurf automatically routes each request to the cheapest model that fits. One URL change, instant savings.

Get Started Free