If you're locked into one LLM provider, you're at their mercy on pricing, uptime, and rate limits. A provider can raise prices overnight, suffer an outage during your peak traffic hours, or throttle your requests when you need them most. A multi-model architecture fixes all three problems at once — and it's not as complicated to build as it sounds.
Why Multi-Model?
There are three compelling reasons to run requests across multiple providers rather than committing to one:
Price competition. LLM providers actively undercut each other. Google dropped Gemini Flash pricing by 80% in a single announcement. OpenAI responded with aggressive Mini pricing. Anthropic followed with Haiku. When you're tied to one provider, you can't take advantage of these price wars. A multi-model setup lets you route to whoever is cheapest for a given task — and update that routing as prices change.
Redundancy. Provider outages happen. OpenAI has had high-profile incidents. Anthropic has had degraded performance windows. Google Cloud has had regional issues. When your entire application runs through a single API endpoint, any upstream incident becomes your incident. Distributing across providers means a single provider going down doesn't take your product down with it.
Best model per task. No single provider wins at everything. Anthropic's Claude tends to perform better on instruction-following and long-context tasks. OpenAI's GPT-4o is strong at code generation. Google's Gemini has advantages on multimodal tasks. A multi-model architecture lets you route each request to the model that's genuinely best at that type of task — not just the model your vendor happens to offer.
The Architecture
A production multi-model architecture has four layers:
                Your Application
                       |
                       v
                 Routing Layer   <--- the brain
                       |
        _______________|______________________________
        |              |              |              |
        v              v              v              v
     OpenAI        Anthropic       Google       OpenRouter
    (GPT-4o,       (Claude        (Gemini      (fallback +
     Mini)          Sonnet,        Pro, Flash)  additional
                    Haiku)                       models)
Your application never calls provider APIs directly. Instead, it sends every request to the routing layer, which makes the model selection decision and forwards the request to the appropriate provider. The response comes back through the same layer, normalized to a consistent format.
The routing layer is the brain of the system. It holds the logic for which model to use when, handles failover when a provider is down, tracks costs, and gives you a single place to update routing decisions without touching application code.
OpenRouter serves a useful role as a catch-all fallback: it aggregates access to dozens of models through a single API, so you can route to obscure or newly-released models without individually integrating each provider.
Building the API Abstraction Layer
The key challenge with multiple providers is that each has a slightly different API format. OpenAI uses messages with role and content. Anthropic requires a system parameter separate from the messages array. Google uses a completely different request schema. Your abstraction layer needs to handle these differences transparently.
The approach is to define a unified internal request format — essentially OpenAI-compatible, since it's the closest thing to a standard — and then translate from that format to each provider's native format before sending:
// Unified request format (OpenAI-compatible)
// { model, messages: [{role, content}], max_tokens, temperature }
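// Assumption: openaiClient, anthropicClient, and googleClient are provider SDK
// clients initialized elsewhere with their respective API keys.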
async function callProvider(provider, unifiedRequest) {
  switch (provider) {
    case 'openai':
      // OpenAI is our canonical format — pass through directly
      return await openaiClient.chat.completions.create(unifiedRequest);

    case 'anthropic': {
      // Anthropic separates system messages and uses different field names
      const systemMsg = unifiedRequest.messages.find(m => m.role === 'system');
      const userMessages = unifiedRequest.messages.filter(m => m.role !== 'system');
      return await anthropicClient.messages.create({
        model: unifiedRequest.model,
        system: systemMsg?.content,
        messages: userMessages,
        max_tokens: unifiedRequest.max_tokens ?? 1024,
      });
    }

    case 'google': {
      // Google uses a different schema entirely
      const contents = unifiedRequest.messages.map(m => ({
        role: m.role === 'assistant' ? 'model' : 'user',
        parts: [{ text: m.content }],
      }));
      return await googleClient.generateContent({
        model: unifiedRequest.model,
        contents,
        generationConfig: { maxOutputTokens: unifiedRequest.max_tokens },
      });
    }

    default:
      throw new Error(`Unknown provider: ${provider}`);
  }
}
You'll also want a response normalizer that converts each provider's response format back to your unified format. This keeps your application code clean — it always deals with a consistent response object regardless of which provider actually handled the request.
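Here's a minimal sketch of that normalizer, assuming the response shapes of the current OpenAI, Anthropic, and Google Node SDKs; verify the field access against the SDK versions you actually run:
// Sketch: collapse each provider's response into one shape.
// Field access below assumes current SDK response formats; adjust for your versions.
function normalizeResponse(provider, raw) {
  switch (provider) {
    case 'openai':
      return {
        text: raw.choices[0].message.content,
        inputTokens: raw.usage?.prompt_tokens,
        outputTokens: raw.usage?.completion_tokens,
      };
    case 'anthropic':
      return {
        text: raw.content[0].text,
        inputTokens: raw.usage?.input_tokens,
        outputTokens: raw.usage?.output_tokens,
      };
    case 'google':
      return {
        text: raw.response.text(),
        inputTokens: raw.response.usageMetadata?.promptTokenCount,
        outputTokens: raw.response.usageMetadata?.candidatesTokenCount,
      };
    default:
      throw new Error(`Unknown provider: ${provider}`);
  }
}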
Routing Logic
Once you have the abstraction layer in place, you need logic to decide which provider and model to use for each request. There are three main strategies:
Cost-based routing always picks the cheapest model that meets a minimum quality bar. This is the simplest approach and often the most impactful for reducing bills. You maintain a sorted list of models by cost and route to the cheapest one by default, escalating only when the request signals it needs more capability.
Complexity-based routing analyzes the incoming prompt to estimate how much reasoning power it needs. Simple requests — classification, extraction, short Q&A — go to fast cheap models. Complex requests — multi-step reasoning, long-form generation, code review — go to frontier models. See Prompt Complexity Routing for a deep dive on how to build the classifier.
Latency-based routing picks the fastest provider for time-sensitive requests. If a user is waiting for an autocomplete suggestion, you want sub-200ms responses — pick the model with the lowest current latency, not the cheapest. For background batch jobs, you can optimize purely for cost.
Here's a simple cost-based router:
// Models sorted cheapest to most expensive (per request)
const MODEL_PRIORITY = [
  { provider: 'google',    model: 'gemini-1.5-flash',  costPerReq: 0.000105 },
  { provider: 'openai',    model: 'gpt-4o-mini',       costPerReq: 0.000210 },
  { provider: 'anthropic', model: 'claude-3-haiku',    costPerReq: 0.000425 },
  { provider: 'google',    model: 'gemini-1.5-pro',    costPerReq: 0.00175 },
  { provider: 'openai',    model: 'gpt-4o',            costPerReq: 0.00350 },
  { provider: 'anthropic', model: 'claude-3-5-sonnet', costPerReq: 0.00510 },
];

function selectModel(request) {
  // Check for explicit quality hints from the application
  const qualityHint = request.headers?.['x-quality-hint'] ?? 'standard';

  if (qualityHint === 'high') {
    // Route to frontier models only
    return MODEL_PRIORITY.find(m => m.costPerReq >= 0.003);
  }
  if (qualityHint === 'low') {
    // Route to cheapest available
    return MODEL_PRIORITY[0];
  }
  // Default: cost-based with a mid-tier floor
  return MODEL_PRIORITY.find(m => m.costPerReq >= 0.0002) ?? MODEL_PRIORITY[0];
}
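Latency-based routing needs one more ingredient: a live view of how fast each provider currently is. Here's a rough sketch that keeps a rolling window of recent response times per provider; the window size and provider names are illustrative:
// Rolling latency tracker: keep the last N response times per provider and
// pick the provider with the lowest current average. Values are illustrative.
const WINDOW_SIZE = 50;
const latencySamples = new Map(); // provider -> recent response times in ms

function recordLatency(provider, ms) {
  const samples = latencySamples.get(provider) ?? [];
  samples.push(ms);
  if (samples.length > WINDOW_SIZE) samples.shift();
  latencySamples.set(provider, samples);
}

function fastestProvider(candidates) {
  let best = candidates[0];
  let bestAvg = Infinity;
  for (const provider of candidates) {
    const samples = latencySamples.get(provider) ?? [];
    if (samples.length === 0) continue; // no data yet for this provider
    const avg = samples.reduce((a, b) => a + b, 0) / samples.length;
    if (avg < bestAvg) {
      bestAvg = avg;
      best = provider;
    }
  }
  return best;
}

// Usage: call recordLatency('openai', elapsedMs) after every response, then
// fastestProvider(['openai', 'anthropic', 'google']) for latency-sensitive requests.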
Failover and Fallback Chains
Provider failures come in three forms: rate limit errors (429), server errors (500, 503), and timeouts. Your routing layer needs to handle all three gracefully by automatically retrying with a different provider. This is one of the biggest reliability wins of multi-model architecture — users never see a failed request just because one upstream service had a hiccup.
For more background on why this matters, see What Is LLM Routing?
async function routeWithFallback(request, fallbackChain) {
  const errors = [];

  for (const { provider, model } of fallbackChain) {
    try {
      const result = await callProvider(provider, { ...request, model });
      // Success — log which provider handled it and return
      console.log(`Handled by ${provider}/${model}`);
      return result;
    } catch (err) {
      const isRetryable =
        err.status === 429 ||                                   // rate limit
        err.status >= 500 ||                                    // server error
        err.code === 'ETIMEDOUT' || err.code === 'ECONNRESET';  // timeout

      errors.push({ provider, model, error: err.message, status: err.status });

      if (!isRetryable) {
        // Hard error (e.g. invalid request) — don't try other providers
        throw err;
      }

      // Retryable — log and continue to next provider in chain
      console.warn(`${provider}/${model} failed (${err.status ?? err.code}), trying next...`);
      // Brief backoff before next attempt
      await new Promise(r => setTimeout(r, 200));
    }
  }

  // All providers failed
  throw new Error(`All providers failed: ${JSON.stringify(errors)}`);
}

// Usage: primary is GPT-4o, fallbacks are Claude then Gemini
const chain = [
  { provider: 'openai', model: 'gpt-4o' },
  { provider: 'anthropic', model: 'claude-3-5-sonnet' },
  { provider: 'google', model: 'gemini-1.5-pro' },
];
const response = await routeWithFallback(userRequest, chain);
One important detail: only retry on retryable errors. A 400 Bad Request means your prompt is malformed — retrying with a different provider will just give you the same error. Only 429, 5xx, and network errors are worth retrying.
Load Balancing Across Providers
Each provider enforces rate limits — requests per minute, tokens per minute, or both. At scale, you can hit a single provider's limits long before cost becomes the constraint. Multi-provider load balancing distributes traffic so you stay under every provider's limits.
Round-robin is the simplest approach: rotate through providers in order, one request at a time. It distributes load evenly and is trivial to implement. The downside is that it doesn't account for cost differences — you might route 33% of requests to GPT-4o even when GPT-4o Mini would work fine for most of them.
Weighted distribution is more sophisticated. Assign weights to each provider based on your cost targets and quality requirements. Route 60% of requests to cheap models, 30% to mid-tier, 10% to frontier. Adjust the weights based on observed error rates — if one provider starts returning frequent errors, reduce its weight dynamically.
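A weighted random pick is enough to implement this. The tiers and weights below are illustrative; in practice you'd tune them from your own cost and error-rate data:
// Weighted random selection across tiers. Weights are illustrative; adjust
// them dynamically as prices and error rates change.
const TIER_WEIGHTS = [
  { tier: 'cheap',    weight: 0.6, models: ['gemini-1.5-flash', 'gpt-4o-mini'] },
  { tier: 'mid',      weight: 0.3, models: ['claude-3-haiku', 'gemini-1.5-pro'] },
  { tier: 'frontier', weight: 0.1, models: ['gpt-4o', 'claude-3-5-sonnet'] },
];

function pickTier() {
  const roll = Math.random();
  let cumulative = 0;
  for (const entry of TIER_WEIGHTS) {
    cumulative += entry.weight;
    if (roll < cumulative) return entry;
  }
  return TIER_WEIGHTS[TIER_WEIGHTS.length - 1]; // guard against floating-point rounding
}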
You'll also want per-provider rate limit tracking. Keep a sliding window counter of requests per minute for each provider, and skip any provider that's approaching its limit. This prevents cascading 429 errors when one provider is under load.
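The sliding window itself can be as simple as an array of recent request timestamps per provider. The limits below are placeholders; use the real per-minute limits from your provider dashboards or response headers:
// Sliding-window request counter per provider. Limits are placeholders; read
// your real per-minute limits from each provider's dashboard or headers.
const PROVIDER_LIMITS = { openai: 500, anthropic: 400, google: 1000 }; // requests/minute
const requestLog = { openai: [], anthropic: [], google: [] };          // recent timestamps

function recordRequest(provider) {
  requestLog[provider].push(Date.now());
}

function hasCapacity(provider, headroom = 0.9) {
  const windowStart = Date.now() - 60_000;
  // Keep only requests from the last minute, then compare against the limit
  requestLog[provider] = requestLog[provider].filter(t => t > windowStart);
  return requestLog[provider].length < PROVIDER_LIMITS[provider] * headroom;
}

// Before routing, skip any provider where hasCapacity(provider) returns false.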
Monitoring and Observability
A multi-model architecture gives you more levers to pull, but only if you can see what's happening. The metrics that matter most:
- Cost per provider per model — broken down by time period. This tells you whether your routing logic is working as intended and catches unexpected spend spikes early.
- Error rates — by provider, by error type. A sudden spike in 429s from one provider means you're hitting rate limits and need to rebalance. A spike in 500s might indicate a provider incident before their status page updates.
- Latency percentiles — p50, p95, p99 per provider per model. Don't just track averages; tail latency is what your slowest users experience. If one provider's p99 starts degrading, you want to know before your users complain.
- Routing decisions — log which model handled each request class. This lets you verify that your routing rules are firing correctly and helps you tune them over time.
Alert on: daily spend exceeding a threshold (catches runaway loops or unexpected traffic spikes), error rate from any single provider exceeding 5% over a 5-minute window (catches provider incidents), and p95 latency increasing more than 2x from baseline (catches performance degradation).
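The raw material for all of these metrics and alerts is a structured log record per request. A minimal sketch of what that record might contain (field names here are a suggestion, not a standard):
// One record per request; ship these to your metrics pipeline and aggregate
// them into the cost, error-rate, and latency views described above.
function logRoutingDecision({ provider, model, latencyMs, status, estimatedCost }) {
  console.log(JSON.stringify({
    ts: new Date().toISOString(),
    provider,
    model,
    latencyMs,
    status,        // e.g. 'ok', 'rate_limited', 'server_error', 'timeout'
    estimatedCost,
  }));
}

// A 5-minute aggregation over these records is enough to drive the alerts above,
// e.g. page (or down-weight a provider) when its error rate exceeds 5%.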
Getting Started
The best way to start is simple: integrate two providers (OpenAI and Anthropic are the natural first pair) and add a basic failover. That alone gives you the redundancy benefit. Then add routing logic once you have real traffic data — you'll be able to see which requests are going to expensive models and make informed decisions about where to add routing rules.
Don't try to build the full architecture in one shot. A routing layer that handles two providers with basic failover is already a significant reliability and cost improvement. You can layer in complexity — weighted distribution, complexity-based routing, per-provider rate limit tracking — as your needs grow.
For related reading:
- Prompt Complexity Routing — how to classify prompts by difficulty to make smarter routing decisions
- Claude vs GPT Cost Comparison — a detailed breakdown of which provider wins on cost for different workloads
Or skip the infrastructure work entirely and use TokenSurf — a drop-in routing layer that handles provider abstraction, failover, and intelligent routing out of the box. Two-field change to your existing OpenAI API calls, no lock-in.