Rate Limits · Multi-LLM · High Availability · API Gateway

Mitigating Upstream Rate Limits (HTTP 429) at Scale in Multi-LLM Deployments

June 29, 2026·9 min read·Selixes Engineering

The Scaling Bottleneck: Rate Limits and TPM Caps

As your AI applications grow, you will quickly hit the limits imposed by LLM providers. Rate limits are typically defined in two ways: Requests Per Minute (RPM) and Tokens Per Minute (TPM). Even high-tier enterprise accounts with OpenAI or Anthropic can hit sudden TPM caps during traffic spikes, resulting in HTTP 429 Too Many Requests errors that disrupt your users.
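To make the two limits concrete, here is a minimal sketch of a fixed-window budget check that tracks requests and tokens per minute side by side. The class name `MinuteBudget`, its `tryConsume` method, and the limit values are illustrative assumptions; real providers publish their own quotas and may use a different windowing scheme.

```javascript
// Hypothetical fixed-window tracker for RPM and TPM budgets.
// Providers may count usage differently; this only illustrates the idea.
class MinuteBudget {
  constructor({ rpm, tpm }) {
    this.rpm = rpm;
    this.tpm = tpm;
    this.windowStart = 0;
    this.requests = 0;
    this.tokens = 0;
  }

  // Records the usage and returns true if the request fits in the current
  // one-minute window; returns false if it would exceed either cap.
  tryConsume(tokens, nowMs = Date.now()) {
    if (nowMs - this.windowStart >= 60000) {
      // A minute has passed, so start a fresh window.
      this.windowStart = nowMs;
      this.requests = 0;
      this.tokens = 0;
    }
    if (this.requests + 1 > this.rpm || this.tokens + tokens > this.tpm) {
      return false;
    }
    this.requests += 1;
    this.tokens += tokens;
    return true;
  }
}

// Made-up limits: 3 requests per minute and 1,000 tokens per minute.
const budget = new MinuteBudget({ rpm: 3, tpm: 1000 });
const first = budget.tryConsume(600, 0);          // fits
const second = budget.tryConsume(600, 1000);      // rejected: 1,200 tokens exceeds TPM
const third = budget.tryConsume(300, 2000);       // fits: 900 tokens in total
const afterReset = budget.tryConsume(600, 61000); // fits: new window
```

Note that the second call is rejected after only two requests: a few large prompts can exhaust the TPM cap long before the RPM cap is reached.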

The Multi-Provider Fallback Solution

Instead of relying on a single provider and begging for quota increases, high-volume production applications should employ a multi-provider routing strategy. When the primary provider returns a 429, the gateway should detect the error immediately and fail over to a standby provider with equivalent capabilities.
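The failover behavior described above can be sketched as a loop over providers in priority order. This is a simplified illustration, not a production gateway: `RateLimitError`, `routeWithFallback`, and the `{ name, call }` provider shape are hypothetical names, and `call` stands in for whatever SDK or HTTP client you actually use.

```javascript
// Hypothetical error type: a provider client throws this on HTTP 429.
class RateLimitError extends Error {
  constructor(provider) {
    super(`429 Too Many Requests from ${provider}`);
    this.name = 'RateLimitError';
    this.status = 429;
    this.provider = provider;
  }
}

// Tries each provider in priority order and fails over only on a 429.
// `providers` is an array of { name, call }, where `call(request)` is an
// assumed async function wrapping the provider's real client.
async function routeWithFallback(providers, request) {
  const attempted = [];
  for (const { name, call } of providers) {
    attempted.push(name);
    try {
      const response = await call(request);
      return { provider: name, response, attempted };
    } catch (err) {
      // Only rate-limit errors trigger failover; other errors propagate.
      if (err.status !== 429) throw err;
    }
  }
  throw new Error(`All providers rate limited: ${attempted.join(', ')}`);
}

// Usage with stub providers standing in for real API clients.
const providers = [
  { name: 'primary', call: async () => { throw new RateLimitError('primary'); } },
  { name: 'standby', call: async (req) => `echo: ${req.prompt}` },
];

const routed = routeWithFallback(providers, { prompt: 'hello' });
routed.then((result) => console.log(result.provider, result.response));
// prints "standby echo: hello"
```

Failing over only on status 429 is a deliberate choice in this sketch: errors such as an invalid request would fail identically on the standby, so they are surfaced to the caller instead.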

1. Model Mapping Equivalency

A resilient gateway maps request payloads to equivalent models. For example, if a request to OpenAI's gpt-4o fails due to rate limits, the gateway automatically maps the parameters (temperature, messages, system prompt) and redirects the request to Anthropic's claude-3-5-sonnet or Google's gemini-1.5-pro.
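As a rough sketch of that mapping, the function below converts an OpenAI-style chat request into the shape used by Anthropic's Messages API. The names `EQUIVALENT_MODELS` and `toAnthropicRequest` and the default of 1024 output tokens are assumptions for illustration. The model names are copied from the text above; the identifiers a provider actually accepts may carry a version suffix. The sketch also assumes plain string message content, with no images or tool calls.

```javascript
// Hypothetical equivalence table; model names are taken from the article.
const EQUIVALENT_MODELS = {
  'gpt-4o': 'claude-3-5-sonnet',
};

// Converts an OpenAI-style chat request into an Anthropic Messages-style
// request. Assumes string message content only.
function toAnthropicRequest(openaiRequest, defaultMaxTokens = 1024) {
  const { model, messages, temperature, max_tokens } = openaiRequest;
  const target = EQUIVALENT_MODELS[model];
  if (!target) throw new Error(`No fallback mapping for model: ${model}`);

  // Anthropic takes the system prompt as a top-level field, not a message.
  const system = messages
    .filter((m) => m.role === 'system')
    .map((m) => m.content)
    .join('\n');

  const mapped = {
    model: target,
    max_tokens: max_tokens ?? defaultMaxTokens, // required by Anthropic
    messages: messages.filter((m) => m.role !== 'system'),
  };
  if (system) mapped.system = system;
  // OpenAI accepts temperatures up to 2; Anthropic caps the range at 1.
  if (temperature !== undefined) mapped.temperature = Math.min(temperature, 1);
  return mapped;
}

// Usage: the system message moves to `system` and temperature is clamped.
const mappedExample = toAnthropicRequest({
  model: 'gpt-4o',
  temperature: 1.4,
  messages: [
    { role: 'system', content: 'You are concise.' },
    { role: 'user', content: 'Summarize HTTP 429.' },
  ],
});
```

Parameters that have no direct counterpart on the fallback provider need an explicit policy (drop, approximate, or reject the request); this sketch handles only the fields named above.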

2. Intelligent Rate-Limit Backoff

When a provider returns a 429, the gateway should temporarily stop routing traffic to that specific endpoint. This is known as a cooldown period. By keeping each provider's health status in Redis, multiple gateway nodes can coordinate to avoid sending traffic to a rate-limited provider until the cooldown window expires.

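// Redis-backed rate limit cooldown logic
// Assumes the ioredis client, whose set() accepts 'EX', seconds as arguments.
import Redis from 'ioredis';

const redis = new Redis(); // connects to 127.0.0.1:6379 by default

// Returns true when the provider is not in an active cooldown window.
async function checkProviderHealth(provider) {
  const isCooldown = await redis.get(`cooldown:${provider}`);
  return isCooldown === null;
}

// Flags the provider as rate-limited; Redis expires the key automatically
// after durationSeconds, which ends the cooldown.
async function markCooldown(provider, durationSeconds = 60) {
  await redis.set(`cooldown:${provider}`, 'true', 'EX', durationSeconds);
}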
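The fixed 60-second default in `markCooldown` is a reasonable starting point, but a 429 response may also carry a `Retry-After` header that says how long to wait. The helper below is one possible way to turn that header into a cooldown duration; `cooldownFromRetryAfter` and its fallback and maximum values are assumptions for illustration, and not every provider sends this header.

```javascript
// Derives a cooldown in seconds from a Retry-After header value.
// The header may hold delta-seconds ("30") or an HTTP-date.
// fallbackSeconds and maxSeconds are illustrative defaults.
function cooldownFromRetryAfter(
  headerValue,
  nowMs = Date.now(),
  fallbackSeconds = 60,
  maxSeconds = 300
) {
  if (headerValue == null || headerValue.trim() === '') return fallbackSeconds;
  const value = headerValue.trim();

  let seconds;
  if (/^\d+$/.test(value)) {
    // Delta-seconds form.
    seconds = Number(value);
  } else {
    // HTTP-date form; fall back if the date cannot be parsed.
    const dateMs = Date.parse(value);
    if (Number.isNaN(dateMs)) return fallbackSeconds;
    seconds = Math.ceil((dateMs - nowMs) / 1000);
  }
  // A Redis expiry must be a positive integer, so clamp to [1, maxSeconds].
  return Math.min(Math.max(seconds, 1), maxSeconds);
}
```

The result can be passed as the second argument to `markCooldown`, so the cooldown key expires when the provider says it is ready rather than after an arbitrary interval. The upper clamp keeps a misbehaving upstream from sidelining a provider indefinitely.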

Seamless Client Integration

By routing through a unified proxy like Selixes, client applications do not need to implement complex retry-and-fallback logic. The proxy intercepts the HTTP 429 status code, retries the request with a fallback model, and returns that model's response to the client. This keeps your application code clean and helps your service stay available while an upstream provider is throttling you.