OpenAI, Anthropic, Google Gemini, OpenRouter with 300+ models. Unified OpenAI-compatible API format across all providers.
Live
Fallback Chains
If a provider is down, automatically retry with an equivalent model on another provider. Cross-provider model mapping (e.g. gpt-4o to claude-sonnet-4-6).
Live
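The fallback behavior described above can be sketched as a small loop over an equivalence table. This is an illustrative sketch, not TokenSurf's implementation: `EQUIVALENTS`, `ProviderDown`, and `call_provider` are hypothetical names invented for the example.

```python
# Hypothetical cross-provider equivalence table: requested model -> ordered fallbacks.
EQUIVALENTS = {
    "gpt-4o": ["claude-sonnet-4-6", "gemini-2.5-pro"],
}

class ProviderDown(Exception):
    """Raised by the (hypothetical) provider client when a provider is unavailable."""

def complete_with_fallback(model, prompt, call_provider):
    """Try the requested model first, then each mapped equivalent in order."""
    last_error = None
    for candidate in [model, *EQUIVALENTS.get(model, [])]:
        try:
            return call_provider(candidate, prompt)
        except ProviderDown as exc:
            last_error = exc  # provider down: fall through to the next equivalent
    raise last_error
```

The key design point is that the chain starts with the caller's own model, so fallbacks only engage on failure.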
Retry with Exponential Backoff
Auto-retry on 429/500/502/503 with jitter. Respects Retry-After headers. Up to 2 retries before failing.
Live
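The retry policy above (retryable status codes, exponential backoff with jitter, `Retry-After` taking precedence, at most 2 retries) can be sketched like this. The `send` callable and tuple shape are assumptions for the example, not the proxy's real internals:

```python
import random
import time

RETRYABLE = {429, 500, 502, 503}

def request_with_retry(send, max_retries=2, base_delay=0.5, sleep=time.sleep):
    """Retry on 429/500/502/503 with exponential backoff and full jitter.

    A server-provided Retry-After header (in seconds) overrides the
    computed backoff. send() returns (status, headers, body)."""
    for attempt in range(max_retries + 1):
        status, headers, body = send()
        if status not in RETRYABLE or attempt == max_retries:
            return status, body  # success, non-retryable error, or out of retries
        retry_after = headers.get("Retry-After")
        if retry_after is not None:
            delay = float(retry_after)
        else:
            # Full jitter: uniform in [0, base * 2^attempt * 2)
            delay = random.uniform(0, base_delay * 2 ** attempt * 2)
        sleep(delay)
```

Full jitter spreads retries out so that many clients hitting the same 429 don't retry in lockstep.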
Rate Limiting & Abuse Detection
Token bucket rate limiter (60 req/min per key). Automatic abuse detection with throttle and block escalation on anomalous patterns.
Live
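A 60 req/min token bucket amounts to a capacity of 60 tokens refilled at one token per second. A minimal in-memory sketch (the real limiter is per-key and presumably backed by Redis; this example covers a single key):

```python
import time

class TokenBucket:
    """60 req/min = capacity 60, refilled at 1 token per second."""

    def __init__(self, capacity=60, refill_per_sec=1.0, now=time.monotonic):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.now = now                  # injectable clock for testing
        self.tokens = float(capacity)   # bucket starts full
        self.last = self.now()

    def allow(self):
        """Refill based on elapsed time, then spend one token if available."""
        t = self.now()
        self.tokens = min(self.capacity,
                          self.tokens + (t - self.last) * self.refill_per_sec)
        self.last = t
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Lazy refill on each call avoids a background timer: the bucket only needs to know how much time passed since the last request.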
Model Quality Scoring
5% of responses auto-scored by Gemini Flash-Lite (1-10 scale). Per-model quality aggregation, downgraded vs original comparison in dashboard.
Live
Teams & Organizations
Create orgs, invite members with roles (owner/admin/member). Labeled API keys with per-key budgets, rate limits, and model allowlists.
Live
System Health & Observability
Real-time health endpoint, circuit breaker per provider, latency percentiles (p50/p95/p99), cache hit rates, structured audit logging.
Live
Multi-Region Deployment
Proxy deployed in US, EU, and Asia (us-central1, europe-west1, asia-northeast1). Redis caching, connection pooling, API key rotation with 24h grace period.
Live
Semantic Response Cache
Exact-match caching of identical requests. Cache hit = instant response, zero provider cost, zero credits. Configurable TTL, toggle per account.
Live
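Exact-match caching with a TTL reduces to hashing a canonical form of the request and storing the response with an expiry. A sketch under stated assumptions (in-memory dict standing in for whatever store the proxy actually uses):

```python
import hashlib
import json
import time

class ExactMatchCache:
    def __init__(self, ttl_seconds=300, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}

    @staticmethod
    def key(request):
        # Canonical JSON (sorted keys) so byte-identical requests hash identically.
        blob = json.dumps(request, sort_keys=True).encode("utf-8")
        return hashlib.sha256(blob).hexdigest()

    def get(self, request):
        entry = self._store.get(self.key(request))
        if entry is None:
            return None
        expires, response = entry
        return response if self.clock() <= expires else None

    def put(self, request, response):
        self._store[self.key(request)] = (self.clock() + self.ttl, response)
```

Sorting keys before hashing is what makes "identical requests" robust: two clients serializing the same payload with different key order still hit the same cache entry.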
Volume Subscription Plans
Growth ($400/mo, 500K requests) and Scale ($3K/mo, 5M requests) plans with volume discounts up to 40%. Annual billing saves 15% more. Enterprise custom pricing.
Live
Request Logs & Cost Breakdown
Per-request log viewer in dashboard with model routing, tokens, cost, savings, and latency. Per-model cost breakdown with bar charts.
Live
Teams Dashboard
Full org management UI: create teams, invite members, manage roles, create labeled API keys with per-key budget caps and rate limits.
Live
Content-Based Routing Rules
Custom regex/pattern rules for routing overrides. "If prompt contains code block, never downgrade." Rules evaluated in order, first match wins.
Live
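"Rules evaluated in order, first match wins" is an ordered scan over compiled patterns. A hedged sketch — the rule syntax and action names here are invented for illustration, not the dashboard's actual schema:

```python
import re

# Hypothetical rule list: (compiled pattern, routing action), checked in order.
RULES = [
    (re.compile(r"`{3}"), "never_downgrade"),       # prompt contains a code block
    (re.compile(r"(?i)\btranslate\b"), "downgrade"),  # translation is cheap-model work
]

def routing_override(prompt, rules=RULES, default="auto"):
    """Return the action of the first rule whose pattern matches, else default."""
    for pattern, action in rules:
        if pattern.search(prompt):
            return action
    return default
```

Because the first match wins, rule order is part of the policy: put the most specific protections (like the code-block rule) before broader cost-saving rules.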
Billing History
Full payment history in the dashboard — dates, amounts, credits, and transaction IDs. Supports enterprise expense reporting.
Live
Usage Alerts
In-dashboard alerts for low credits, high error rates, and daily spend exceeding budget. Catch runaway loops before they drain your wallet.
Live
API Playground
Test any model from the dashboard. Send a prompt, see the response, routing decision, cost breakdown, and cache status side-by-side.
Live
Webhooks on Routing Decisions
POST to your webhook URL for every routing decision. Configurable in dashboard. Feed into Slack, PagerDuty, or custom analytics. Fire-and-forget, never blocks.
Live
Streaming Metrics (TTFT, tok/s)
Time-to-first-token and throughput tracking for streaming responses. P50/P95 percentiles on system health dashboard. Know which provider is actually fastest.
Live
PII Redaction Guardrails
Auto-detect and redact emails, SSNs, credit cards, phone numbers, IP addresses from prompts before forwarding to providers. Toggle per account. Enterprise compliance checkbox.
Live
Cost Analytics by Feature Tag
Tag requests with X-TokenSurf-Tag header to track spend per feature, team, or environment. Dashboard shows cost breakdown by tag with savings.
Live
Prompt Template Library
Store and version system prompts server-side. Reference by ID via X-TokenSurf-Template header. Change routing behavior without redeploying your app.
Live
Custom Classifier Prompt
Override the default AI classifier with your own domain-specific prompt. Define what "simple" means for your use case. Full control over routing decisions.
Live
Latency-Based Routing
Set max latency target — if a provider's p95 exceeds it, auto-downgrade to faster models. Dashboard config with preset thresholds.
Live
Priority Routing
Per-request priority via X-TokenSurf-Priority header. High = never downgrade (user-facing). Low = always downgrade (batch jobs). Toggle in dashboard.
Live
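Setting per-request priority is just an extra header on an otherwise OpenAI-compatible request. A client-side sketch — the base URL and API key below are placeholders, and only the `X-TokenSurf-Priority` header name comes from the docs:

```python
def build_request(prompt, priority="high", api_key="sk-example",
                  base_url="https://api.example.com"):
    """Assemble a chat-completions request with a per-request priority.

    priority="high": never downgrade (user-facing traffic).
    priority="low":  always downgrade (batch jobs)."""
    return {
        "url": base_url + "/v1/chat/completions",
        "headers": {
            "Authorization": "Bearer " + api_key,
            "X-TokenSurf-Priority": priority,
        },
        "json": {
            "model": "gpt-4o",
            "messages": [{"role": "user", "content": prompt}],
        },
    }
```

Because the priority rides on a header rather than the request body, the same payload can be sent at different priorities without touching the OpenAI-format JSON.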
Context Window Management
Auto-trim conversation history to fit model context limits. Prevents 400 errors on long conversations. Returns X-TokenSurf-Context-Trimmed header.
Up Next
In development
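Auto-trimming to a context limit typically means dropping the oldest non-system messages until the conversation fits. A sketch under stated assumptions: the crude length-based token counter below is a stand-in for a real tokenizer, and the function shape is invented for illustration.

```python
def trim_history(messages, max_tokens,
                 count_tokens=lambda m: len(m["content"]) // 4):
    """Drop oldest non-system messages until the conversation fits.

    The system prompt is always preserved. Returns (messages, was_trimmed);
    was_trimmed corresponds to what X-TokenSurf-Context-Trimmed would report."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    original_rest = len(rest)

    def total(msgs):
        return sum(count_tokens(m) for m in msgs)

    while rest and total(system + rest) > max_tokens:
        rest.pop(0)  # evict the oldest turn first
    return system + rest, len(rest) < original_rest
```

Evicting from the front keeps the most recent turns, which usually matter most for the next completion, while the preserved system prompt keeps behavior stable.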
Semantic Cache v2 (Embedding Similarity)
Fuzzy matching via embeddings. "What are your hours?" and "When are you open?" return the same cached response. 5-10x more cache hits than exact match.
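The fuzzy matching described above usually means a cosine-similarity lookup over cached prompt embeddings with a threshold. A minimal sketch (the 0.92 threshold and the cache shape are illustrative assumptions; a production version would use a vector index, not a linear scan):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def semantic_lookup(query_vec, cache, threshold=0.92):
    """cache: list of (embedding, cached_response) pairs.

    Return the response whose embedding best matches the query,
    provided it clears the similarity threshold; else None (cache miss)."""
    best_score, best_response = threshold, None
    for vec, response in cache:
        score = cosine(query_vec, vec)
        if score >= best_score:
            best_score, best_response = score, response
    return best_response
```

The threshold is the whole trade-off: too low and "What are your hours?" starts matching unrelated prompts; too high and the cache degrades back toward exact match.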
Streaming Cache Support
Cache streaming (SSE) responses and replay them as streams. Extends semantic caching to the most common API usage pattern.
Exploring
Research & design
Prompt Optimization Engine
Automatically rewrite prompts to work well on cheaper models. Not just routing to a cheaper model — making the prompt work better on it. Defensible and high-value.