The key insight behind LLM routing is deceptively simple: not all prompts are created equal. A request to classify an email as spam doesn't need a model that costs $10 per million output tokens. A request to reason through a multi-step legal analysis probably does. The difference between those two tasks — and knowing which model to use for each — is what prompt complexity routing is all about.

When you route based on complexity, you're making a specific claim: that the quality of the output for a simpler task is indistinguishable whether it was produced by a $0.60/M-token model or a $10/M-token model. For the tasks where that's true — and it's true for a large fraction of real-world LLM workloads — you capture the savings without any user-visible tradeoff.

This post covers how to classify prompt complexity, how to build a routing system that acts on that classification, and where the accuracy vs. cost tradeoffs live.

What Is Prompt Complexity?

Prompt complexity is a measure of how much reasoning, knowledge synthesis, and nuanced judgment a request requires to answer well. It exists on a spectrum, but it's useful to think in two broad tiers: simple and complex.

Simple prompts have a clear, deterministic answer that doesn't require chaining multiple reasoning steps. Examples:

  • "Is this email spam or not spam?" — binary classification
  • "Translate this paragraph to Spanish." — transformation with a single well-defined output
  • "Extract the name, date, and amount from this invoice." — structured extraction
  • "Summarize this 200-word product description in one sentence." — compression with minimal judgment
  • "What is the capital of France?" — factual recall

Complex prompts require multi-step reasoning, synthesis across multiple concepts, or nuanced judgment calls with no single right answer. Examples:

  • "Explain why this Python code has a race condition and suggest three different ways to fix it." — causal analysis plus solution generation
  • "Compare and contrast these two architectural approaches for a high-throughput event processing system." — trade-off analysis across multiple dimensions
  • "Write a compelling 800-word case study about this product for a technical audience." — long-form creative with quality requirements
  • "Analyze the sentiment trajectory of this customer conversation and recommend how the support agent should respond." — multi-variable analysis plus recommendation

The boundary between simple and complex is fuzzy in practice, and that's exactly why routing classification is a non-trivial problem. The goal isn't perfect classification — it's accurate-enough classification to capture savings on the obvious cases while being conservative with the ambiguous ones.

Heuristic-Based Classification

Rule-based classification is fast, transparent, and works well for the obvious ends of the complexity spectrum. It won't catch every edge case, but it can accurately classify 60–70% of requests with near-zero overhead.

Here are the most reliable heuristics:

  • Token count threshold — Prompts under ~100 tokens are usually simple. Long prompts (500+ tokens with a complex system prompt) are usually complex.
  • Complexity trigger phrases — Words and phrases like "explain why", "compare and contrast", "analyze", "evaluate", "design", "reason through", "what are the trade-offs" signal complex tasks.
  • Simple task indicators — Phrases like "extract", "translate", "classify", "summarize in one sentence", "is this a", "what is the" signal simple tasks.
  • Code block presence — A prompt that includes a large code block and asks for it to be debugged, reviewed, or extended is almost always complex.
  • System prompt length and specificity — Long, highly specific system prompts often indicate an application that's already done the complexity classification for you.

Here's a JavaScript implementation of basic heuristic classification:

function classifyComplexity(messages) {
  const userMessage = messages
    .filter(m => m.role === 'user')
    .map(m => m.content)
    .join(' ');

  const totalTokens = estimateTokens(userMessage);

  // Simple indicators — if any match, lean simple
  const simplePatterns = [
    /\b(classify|translate|extract|summarize in one|is this|what is the|convert)\b/i,
    /\b(yes or no|true or false|spam or not|positive or negative)\b/i,
  ];

  // Complex indicators — if any match, route to capable model
  const complexPatterns = [
    /\b(explain why|analyze|compare and contrast|evaluate|design|reason through)\b/i,
    /\b(trade-?offs?|pros and cons|multi-?step|step by step)\b/i,
    /\b(debug|refactor|architect|review this code)\b/i,
    /```[\s\S]{200,}```/,  // Code block with 200+ chars
  ];

  // Hard threshold: very long prompts are almost always complex
  if (totalTokens > 800) return 'complex';

  // Pattern matching. Check complex patterns before the short-prompt shortcut
  // so a terse "explain why this fails" isn't misrouted to the cheap model.
  for (const pattern of complexPatterns) {
    if (pattern.test(userMessage)) return 'complex';
  }

  for (const pattern of simplePatterns) {
    if (pattern.test(userMessage)) return 'simple';
  }

  // Very short prompts with no complexity signal lean simple
  if (totalTokens < 50) return 'simple';

  // Default: ambiguous — handled by ML classifier or conservative fallback
  return 'ambiguous';
}

function estimateTokens(text) {
  // Rough approximation: ~4 chars per token
  return Math.ceil(text.length / 4);
}

function routeToModel(complexity) {
  switch (complexity) {
    case 'simple':    return 'gpt-4o-mini';
    case 'complex':   return 'gpt-4o';
    case 'ambiguous': return 'gpt-4o'; // conservative fallback
  }
}

This implementation handles the obvious cases immediately. The ambiguous bucket is where the interesting work happens — and where ML classification pays off.

ML-Based Classification

For the 30–40% of requests that fall in the ambiguous middle, a machine learning classifier can make smarter decisions than keyword rules. The most practical approach is to use a small, fast model as the classifier itself — something like Claude Haiku or GPT-4o Mini — to evaluate whether a request needs a more capable model.

Here's the classifier prompt:

const CLASSIFIER_SYSTEM_PROMPT = `You are a task complexity classifier for LLM routing.
Analyze the user's request and classify it as one of:
- SIMPLE: Clear factual answer, classification, extraction, translation, or short summary.
  No multi-step reasoning required. Output is deterministic or near-deterministic.
- COMPLEX: Requires multi-step reasoning, synthesis across concepts, code analysis,
  trade-off evaluation, or nuanced creative judgment.

Respond with only: SIMPLE or COMPLEX`;

async function mlClassify(messages) {
  const classifierResponse = await fetch('https://api.tokensurf.io/v1/chat/completions', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'Authorization': `Bearer ${process.env.TOKENSURF_KEY}`
    },
    body: JSON.stringify({
      model: 'gpt-4o-mini',  // Use the cheapest model for classification
      messages: [
        { role: 'system', content: CLASSIFIER_SYSTEM_PROMPT },
        { role: 'user', content: JSON.stringify(messages) }
      ],
      max_tokens: 5,
      temperature: 0
    })
  });

  const data = await classifierResponse.json();
  const classification = data.choices[0].message.content.trim();
  return classification === 'SIMPLE' ? 'simple' : 'complex';
}

This adds roughly 100–200ms of latency on ambiguous requests, the time it takes for a fast model to return a handful of tokens. For most applications, that's an acceptable trade: an extra ~150ms of time-to-first-token on some requests in exchange for significantly better cost optimization across your whole traffic profile.

The classifier itself is cheap: GPT-4o Mini at max_tokens=5 costs a fraction of a cent per classification. Even at 100,000 classifications per month, the overhead is negligible compared to the savings on the traffic you correctly downgrade.
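
To put a number on that, here's a quick back-of-the-envelope calculation. The per-token prices below are illustrative assumptions, not quoted rates; substitute whatever your provider actually charges.

// Rough monthly cost of running the ML classifier on ambiguous traffic.
// Prices are illustrative assumptions, not quoted rates.
const CLASSIFIER_PRICES = {
  inputPerMillionTokens: 0.15,   // $ per 1M input tokens (assumed)
  outputPerMillionTokens: 0.60,  // $ per 1M output tokens (assumed)
};

function classifierOverheadUSD(requestsPerMonth, avgPromptTokens) {
  const inputCost  = (requestsPerMonth * avgPromptTokens / 1e6) * CLASSIFIER_PRICES.inputPerMillionTokens;
  const outputCost = (requestsPerMonth * 5 / 1e6) * CLASSIFIER_PRICES.outputPerMillionTokens; // max_tokens: 5
  return inputCost + outputCost;
}

// 100,000 classifications/month at ~300 prompt tokens each comes out to roughly $4.80
console.log(classifierOverheadUSD(100_000, 300).toFixed(2));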

The Hybrid Approach

In production, the most effective architecture combines both methods: heuristics first for the obvious cases, ML classification for the ambiguous ones. This gives you the best of both approaches — zero latency for the clear cases, accurate classification for the edge cases.

async function routeRequest(messages) {
  // Step 1: Fast heuristic classification
  const heuristicResult = classifyComplexity(messages);

  if (heuristicResult === 'simple') {
    // Clear simple case — route immediately, no ML needed
    return { model: 'gpt-4o-mini', classifiedBy: 'heuristic' };
  }

  if (heuristicResult === 'complex') {
    // Clear complex case — route immediately, no ML needed
    return { model: 'gpt-4o', classifiedBy: 'heuristic' };
  }

  // Step 2: ML classification for ambiguous cases only
  const mlResult = await mlClassify(messages);
  return {
    model: mlResult === 'simple' ? 'gpt-4o-mini' : 'gpt-4o',
    classifiedBy: 'ml'
  };
}

// Usage
async function callLLM(messages) {
  const { model, classifiedBy } = await routeRequest(messages);

  const response = await fetch('https://api.tokensurf.io/v1/chat/completions', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'Authorization': `Bearer ${process.env.TOKENSURF_KEY}`
    },
    body: JSON.stringify({ model, messages })
  });

  return response.json();
}

In a typical workload, this hybrid approach classifies roughly 60–70% of requests via fast heuristics with no latency overhead, and routes the remaining 30–40% through the ML classifier.

The decision tree looks like this:

  1. Does the prompt match a clear simple pattern? → Route to cheap model immediately
  2. Does the prompt match a clear complex pattern? → Route to capable model immediately
  3. Is the prompt ambiguous? → Run ML classifier → Route based on result

Accuracy vs. Cost Trade-off

A misrouted complex request doesn't just produce a bad response — it can produce a confidently wrong response. The cost of routing everything to capable models is measurable in dollars. The cost of routing complex requests to weak models is measurable in user trust.

This asymmetry is important. If you over-route to cheap models (too aggressive), you'll occasionally get low-quality outputs on complex tasks. If you under-route to cheap models (too conservative), you just spend more money than necessary. The failure modes are not symmetric.

The practical implication: when in doubt, route to the more capable model. The economics still work in your favor as long as you're correctly classifying the obvious simple cases. You don't need to push the boundaries of the ambiguous middle — you capture most of the savings just by getting the clear cases right.

Classification Strategy           % Requests Downgraded   Misrouting Risk   Typical Cost Reduction
Heuristics only (conservative)    40–50%                   Very low          30–40%
Heuristics only (aggressive)      60–70%                   Moderate          45–55%
Hybrid (heuristics + ML)          55–65%                   Low               40–50%
ML only                           50–60%                   Low-moderate      35–45%

The sweet spot is the hybrid approach at conservative thresholds — you capture 40–50% cost reduction with very low misrouting risk. If you want to push further, you can tune the classifier thresholds based on your specific domain and workload characteristics.
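
If you do tune thresholds, it helps to pull them out into explicit configuration rather than leaving them as constants inside the classifier. Here's a minimal sketch; the profile names and numbers are placeholders to tune against your own traffic, and the token thresholds would replace the hardcoded values in classifyComplexity above.

// Illustrative routing profiles; the values are starting points, not recommendations
const ROUTING_PROFILES = {
  conservative: { simpleMaxTokens: 50,  complexMinTokens: 800,  ambiguousFallback: 'gpt-4o' },
  aggressive:   { simpleMaxTokens: 120, complexMinTokens: 1200, ambiguousFallback: 'gpt-4o-mini' },
};

// Profile-aware variant of routeToModel from earlier
function routeToModelWithProfile(complexity, profile = ROUTING_PROFILES.conservative) {
  if (complexity === 'simple')  return 'gpt-4o-mini';
  if (complexity === 'complex') return 'gpt-4o';
  return profile.ambiguousFallback; // only the ambiguous bucket changes between profiles
}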

Measuring Routing Quality

Once you have a routing system in place, you need to know whether it's working well. Three metrics matter:

  • Downgrade rate — what percentage of requests are being routed to cheaper models? If it's below 20%, your classifier may be too conservative. If it's above 80%, check for quality regressions.
  • Output quality sampling — periodically compare outputs from downgraded requests against what the expensive model would have produced. You can do this by routing a small percentage (1–5%) of nominally "simple" requests to both models and comparing. Automated scoring with an LLM judge works well here; a sketch follows this list.
  • User-reported quality — if you have thumbs up/down or any quality signal from users, segment it by which model handled the request. A divergence in quality ratings between model tiers is a signal to tighten your thresholds.
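
Here's one way to implement that shadow comparison. This is a sketch, not production code: it reuses the same chat completions endpoint as the earlier snippets, the 2% sampling rate and judge prompt are illustrative choices, and recordQualitySample is a hypothetical hook for your own metrics pipeline.

// Shadow-sample a slice of "simple"-routed traffic through the capable model
// and ask a judge model to compare the two answers.
const SHADOW_SAMPLE_RATE = 0.02; // illustrative: ~2% of downgraded requests

async function chatCompletion(model, messages) {
  const res = await fetch('https://api.tokensurf.io/v1/chat/completions', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'Authorization': `Bearer ${process.env.TOKENSURF_KEY}`
    },
    body: JSON.stringify({ model, messages, temperature: 0 })
  });
  const data = await res.json();
  return data.choices[0].message.content;
}

async function maybeShadowCompare(messages, cheapAnswer) {
  if (Math.random() >= SHADOW_SAMPLE_RATE) return;

  // Re-run the same request on the capable model
  const expensiveAnswer = await chatCompletion('gpt-4o', messages);

  // LLM-as-judge: which answer is better, or are they equivalent?
  const verdict = await chatCompletion('gpt-4o-mini', [
    { role: 'system', content: 'You compare two answers to the same request. Reply with only: A, B, or TIE.' },
    { role: 'user', content: `Request:\n${JSON.stringify(messages)}\n\nAnswer A (cheap model):\n${cheapAnswer}\n\nAnswer B (capable model):\n${expensiveAnswer}` }
  ]);

  recordQualitySample({ verdict: verdict.trim() }); // hypothetical hook: write to your metrics store
}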

For a deeper look at how to structure a system that handles multiple models in production — including fallback logic, provider redundancy, and observability — see Building a Multi-Model LLM Architecture. And if you're trying to decide whether a specific model like GPT-4o Mini is appropriate for your downgraded tier, GPT-4o vs GPT-4o Mini: When to Use Each covers the capability differences in detail.

Prompt complexity routing is the foundational technique behind most effective LLM cost optimization strategies. If you're curious about the broader concept, What Is LLM Routing? is a good place to start. The key takeaway: you don't need a perfect classifier to see meaningful savings. You just need to correctly identify the obvious simple cases — and then be conservative with everything else.

Let TokenSurf handle the routing for you

We classify prompt complexity automatically and route each request to the right model. One API endpoint, zero configuration required.
