How to Reduce 429 Errors in Agent Workloads: Routing, Retry, and Failover Patterns

EvoLink Team
Product Team
March 25, 2026
10 min read
If your agents keep hitting 429 Too Many Requests, the problem is usually not that your team is "using AI wrong." The problem is that agent traffic is bursty, and most provider limits are still enforced against shared rate-limit buckets.
As of March 25, 2026, the official docs from OpenAI, Anthropic, and Google all make the same broad point in different ways:
  • limits are real
  • limits apply above the individual request level
  • limits are tied to organization or project scope
  • short bursts can still trigger failures even when monthly usage looks healthy

This guide focuses on what is verifiable from official documentation, then translates it into production patterns that actually reduce 429s.

TL;DR

  • Agent systems hit 429s faster than simple apps because they create bursts, not smooth traffic.
  • You need to budget tokens and concurrency, not just request count.
  • Retry logic should follow provider behavior: use retry-after when available, and add jittered backoff when it is not.
  • Queues, checkpoints, and graceful degradation matter as much as raw throughput.
  • Routing helps when you want to reduce dependency on a single upstream limit bucket.

Why agent workloads hit 429 differently

Traditional applications often look like this:

  • one user request
  • one LLM call
  • one response

Agent systems do not behave that way. They often trigger:

  • long-context reasoning steps
  • tool-call fanout
  • multi-agent concurrency
  • streaming responses that keep connections open
  • background retries at the same time as foreground work

That means rate limiting shows up as a burst-management problem, not just a "too many requests per minute" problem.

What the provider docs actually say

| Provider | Official limit dimensions | Scope | Operational takeaway |
| --- | --- | --- | --- |
| OpenAI | RPM, TPM, RPD, TPD, IPM | Organization and project level, model-specific, with some shared limits | One noisy workflow can still consume the pool your other requests depend on |
| Anthropic | RPM, ITPM, OTPM | Organization level with tier-based limits | Short bursts can trigger 429s before a full minute of traffic has elapsed |
| Gemini API | RPM, TPM, RPD | Per project, model-specific, tier-based | Multiple agents in one project still compete for the same project limits |

OpenAI: project-level control does not remove burst risk

OpenAI's current rate-limit guide says limits are defined at the organization level and the project level, not at the user level. The API reference also exposes project rate-limit objects per model.

The practical implication is straightforward:

  • splitting traffic across features inside one project does not make bursts disappear
  • some model families can share a limit pool
  • high-throughput agent traffic can starve unrelated requests if you do not throttle client-side

Anthropic: input and output pressure are separate

Anthropic's rate-limit docs are especially useful for agent systems because they explicitly separate:

  • RPM
  • ITPM for input tokens
  • OTPM for output tokens

The docs also state that rate limits are enforced at the organization level, use a token bucket algorithm, and can fail over intervals shorter than a full minute. Anthropic returns a retry-after header on 429 responses, which is exactly the kind of signal your retry layer should honor.

For agent systems, this matters because large prompts, long outputs, and parallel tool calls stress different parts of the budget.
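Anthropic's ITPM/OTPM split can be made concrete by tracking input and output tokens in separate windows, so a long-output job cannot silently exhaust the budget that your prompt-heavy work depends on. The sketch below is a minimal fixed-window approximation: the limit values are illustrative, not real tier limits, and actual enforcement uses a token bucket rather than a strict 60-second window.

```python
import time
from collections import deque


class DualTokenWindow:
    """Track input and output tokens in separate rolling 60-second windows.

    Sketch only: ITPM/OTPM values here are illustrative, and a provider's
    real token bucket can refuse bursts earlier than this approximation.
    """

    def __init__(self, itpm_limit: int, otpm_limit: int):
        self.itpm_limit = itpm_limit
        self.otpm_limit = otpm_limit
        self.events = deque()  # (timestamp, input_tokens, output_tokens)

    def _prune(self, now: float) -> None:
        # Drop entries older than the 60-second window
        while self.events and self.events[0][0] < now - 60:
            self.events.popleft()

    def would_exceed(self, input_tokens: int, output_tokens: int) -> bool:
        """Return True if admitting this call would break either budget."""
        now = time.time()
        self._prune(now)
        used_in = sum(i for _, i, _ in self.events)
        used_out = sum(o for _, _, o in self.events)
        return (used_in + input_tokens > self.itpm_limit
                or used_out + output_tokens > self.otpm_limit)

    def record(self, input_tokens: int, output_tokens: int) -> None:
        self.events.append((time.time(), input_tokens, output_tokens))
```

Checking both dimensions before dispatch means a fanout of parallel tool calls gets throttled by whichever budget it actually stresses, instead of sailing past a request counter.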

Gemini API: project scope still means shared pressure

Google's Gemini API docs say rate limits are measured across RPM, TPM, and RPD, and they are applied per project, not per API key.

That means:

  • multiple agents under one project still share limits
  • project-level tier upgrades help, but they do not solve burst coordination inside your app
  • you should treat active limits as infrastructure constraints, not as an afterthought

Patterns that actually reduce 429 errors

1. Budget tokens, not just requests

A request counter is too crude for agent systems. A single long-context reasoning step can consume more of your real budget than many small requests.

Use a token-aware budget:

import asyncio
import time
from collections import deque


class TokenBudget:
    def __init__(self, tpm_limit: int):
        self.tpm_limit = tpm_limit
        self.window = deque()  # (timestamp, estimated_tokens)

    async def reserve(self, estimated_tokens: int) -> None:
        while True:
            now = time.time()

            # Drop entries that have aged out of the 60-second window
            while self.window and self.window[0][0] < now - 60:
                self.window.popleft()

            used = sum(tokens for _, tokens in self.window)

            if used + estimated_tokens <= self.tpm_limit or not self.window:
                break

            # Sleep until the oldest entry expires, then re-check the budget
            wait_seconds = 60 - (now - self.window[0][0])
            await asyncio.sleep(max(wait_seconds, 0.1))

        self.window.append((time.time(), estimated_tokens))

The important part is not the exact class. The important part is building a pre-request admission check before the provider has to reject you.

2. Cap concurrency around the agent loop

A lot of 429 storms are self-inflicted. Tool calls, background jobs, and retries all pile on each other.

Use concurrency controls:

import asyncio

agent_slots = asyncio.Semaphore(5)


async def run_agent(task):
    async with agent_slots:
        return await execute_agent(task)

This will not eliminate 429s on its own, but it prevents your own app from turning one spike into a collapse.

3. Make retries provider-aware

The retry layer should not treat every 429 the same way.

import asyncio
import random


async def retry_with_backoff(call, provider_name, attempts=5):
    for attempt in range(attempts):
        try:
            return await call()
        except Exception as exc:
            # In production, re-raise anything that is not a rate-limit error
            if attempt == attempts - 1:
                break  # budget exhausted; do not sleep one last time

            retry_after = getattr(exc, "retry_after", None)

            if retry_after is not None:
                await asyncio.sleep(float(retry_after))
                continue

            # Jittered fallback when the provider does not give a wait value
            delay = min(30, (2 ** attempt) + random.random())
            await asyncio.sleep(delay)

    raise RuntimeError(f"{provider_name} retry budget exhausted")

Production guidance:

  • respect retry-after when the provider returns it
  • use jittered exponential backoff as a fallback, not as your only strategy
  • do not retry forever
  • track retry volume separately from primary traffic
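The retry loop above assumes exceptions expose a retry_after attribute. A small adapter at your HTTP layer can provide that by parsing the Retry-After header from a 429 response. This is a sketch with hypothetical names (RateLimitError, parse_retry_after); note that the real Retry-After header can also carry an HTTP date, which this helper does not handle.

```python
class RateLimitError(Exception):
    """Raised on HTTP 429; carries the provider's suggested wait, if any."""

    def __init__(self, retry_after=None):
        super().__init__("rate limited (429)")
        self.retry_after = retry_after


def parse_retry_after(headers: dict):
    """Read a delay-seconds Retry-After value from response headers.

    Sketch only: ignores the HTTP-date form of Retry-After and returns
    None when the header is missing or unparseable.
    """
    raw = headers.get("retry-after") or headers.get("Retry-After")
    if raw is None:
        return None
    try:
        return max(float(raw), 0.0)
    except ValueError:
        return None
```

Raising RateLimitError(retry_after=parse_retry_after(response.headers)) from your transport code is enough for the retry layer to honor provider hints without knowing which provider it is talking to.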

4. Separate foreground and background queues

When the same pool handles user-visible agent work and background analysis jobs, your low-value backlog can block high-value traffic.

Use at least two queues:

  • a foreground queue for user-facing responses
  • a background queue for batch or catch-up work

This gives you a place to shed or defer lower-priority traffic before the provider does it for you.
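A minimal way to express that priority is a worker that always drains the foreground queue before touching the backlog. The sketch below is simplified for illustration: a real worker loops forever and blocks on new work instead of returning when both queues are empty, and the queued items here are (label, coroutine-factory) pairs rather than your real task objects.

```python
import asyncio


async def drain(foreground: asyncio.Queue, background: asyncio.Queue) -> list:
    """Process queued jobs, always preferring foreground work.

    Sketch only: stops when both queues are empty, which a production
    worker would not do.
    """
    order = []
    while not foreground.empty() or not background.empty():
        # Foreground strictly wins whenever it has pending work
        queue = foreground if not foreground.empty() else background
        label, job = queue.get_nowait()
        await job()
        order.append(label)
    return order
```

Because the background queue only runs when the foreground is empty, a pile of batch jobs can be arbitrarily deep without adding a single millisecond to user-facing latency.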

5. Checkpoint long-running loops

Do not make a 429 restart twenty minutes of work.

Checkpoint before each expensive call:

  • current task state
  • tool results already collected
  • last successful reasoning step
  • retry count and next-attempt timestamp

That turns a 429 from a workflow failure into a scheduling delay.
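The checkpoint itself can be as simple as a serializable record saved before each expensive call. The field names below are illustrative, and where you persist the JSON (database, object store, local disk) depends on your stack.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class AgentCheckpoint:
    """State saved before each expensive call so a 429 only delays work.

    Field names are illustrative; persist wherever your stack keeps
    durable state.
    """
    task_id: str
    step: int
    tool_results: list = field(default_factory=list)
    retry_count: int = 0
    next_attempt_at: float = 0.0

    def dump(self) -> str:
        return json.dumps(asdict(self))

    @classmethod
    def load(cls, raw: str) -> "AgentCheckpoint":
        return cls(**json.loads(raw))
```

On a 429, you bump retry_count, set next_attempt_at from the provider's retry-after hint, persist, and let a scheduler resume from the saved step instead of replaying the whole loop.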

6. Degrade gracefully instead of failing hard

Your agent does not always need the biggest model in the system.

Graceful degradation might mean:

  • smaller context windows
  • fewer parallel tools
  • a cheaper or faster fallback model
  • queueing low-priority jobs instead of executing immediately

The right fallback depends on the workflow, but the architecture principle is the same: a partial answer is usually better than a crashed agent session.
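One way to keep that degradation deliberate is a small fallback ladder that steps down a tier per recent burst of 429s. The model names and limits below are purely hypothetical placeholders; the point is the shape, not the specific tiers.

```python
# Hypothetical fallback ladder: model names and limits are illustrative only.
DEGRADATION_LADDER = [
    {"model": "large-reasoning", "max_context": 128_000, "parallel_tools": 8},
    {"model": "mid-tier",        "max_context": 32_000,  "parallel_tools": 4},
    {"model": "small-fast",      "max_context": 8_000,   "parallel_tools": 1},
]


def pick_tier(consecutive_429s: int) -> dict:
    """Step down one tier per recent 429 burst, bottoming out at the cheapest."""
    index = min(consecutive_429s, len(DEGRADATION_LADDER) - 1)
    return DEGRADATION_LADDER[index]
```

Resetting the counter after a window of clean responses lets the system climb back up the ladder on its own instead of staying degraded forever.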

Where routing fits into the picture

Routing is not a replacement for throttling. It is a way to keep single-provider pressure from dictating all of your application behavior.

The right way to think about routing is:

  • throttling controls what your app sends
  • retry logic controls how your app reacts
  • routing controls where your app can send work when conditions change

A simple before-and-after table

| Pattern | What happens under load | Main weakness |
| --- | --- | --- |
| Single provider, no admission control | Requests pile up until the upstream starts rejecting them | 429 storms and cascading retries |
| Single provider with retries only | The app survives some spikes | Sustained bursts still block the same upstream bucket |
| Single provider with throttling and queues | Traffic is smoother and failures are less chaotic | You still depend on one upstream pool |
| Routed gateway with throttling and retries | The app can smooth bursts and keep the integration surface stable | More infrastructure choices to evaluate |

The current repository copy for EvoLink Smart Router supports these publishable claims:

  • EvoLink provides a self-built routing layer for mixed workloads
  • you can send evolink/auto as the model ID
  • the actual model used is returned in the response
  • the request shape stays OpenAI-compatible
  • the routing layer itself does not add a separate routing fee

That does not mean you should promise zero 429s. It means routing can move more of the selection and fallback logic out of application code and into the gateway layer.

Here is the shape that repository copy supports:

curl https://api.evolink.ai/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "evolink/auto",
    "messages": [
      {
        "role": "user",
        "content": "Summarize the latest deployment incident and suggest next steps."
      }
    ]
  }'

A practical rollout checklist

Before you blame the provider, verify these first:

| Check | Why it matters |
| --- | --- |
| Estimate token volume per agent step | Many 429s are really TPM or ITPM problems |
| Cap concurrent agent loops | Prevent self-inflicted burst amplification |
| Respect retry-after when present | Reduces wasteful retry storms |
| Split foreground and background queues | Protects user-facing latency |
| Save checkpoints for long-running work | Prevents total restart after transient 429s |
| Decide when to route or fail over | Keeps fallback behavior deliberate instead of ad hoc |

FAQ

Why do agents hit 429 faster than a normal chat app?

Because agents generate burst traffic: long prompts, tool fanout, retries, background jobs, and concurrency all stack together.

Will adding more API keys solve the problem?

Usually not by itself. OpenAI documents organization-level and project-level limits, Anthropic documents organization-level limits, and Gemini documents project-level limits. Extra keys inside the same scope do not create a brand-new capacity pool.

Should I use exponential backoff for every 429?

Not as your first rule. Use retry-after when the provider gives you one. Fall back to jittered exponential backoff when it does not.

Do I need both throttling and routing?

Yes, if you are operating at production scale. Throttling smooths traffic before the provider rejects it. Routing helps reduce dependence on one upstream path.

What should I log when debugging 429s?

Log estimated tokens, concurrent tasks, queue depth, retry counts, request size, and any provider wait value such as retry-after.
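Those fields are easiest to correlate later if each 429 produces one structured log line. A minimal sketch, with a hypothetical helper name and illustrative field set:

```python
import json
import time


def log_rate_limit_event(provider: str, estimated_tokens: int,
                         concurrent_tasks: int, queue_depth: int,
                         retry_count: int, retry_after) -> str:
    """Emit one JSON line per 429 so bursts can be correlated afterwards.

    Sketch only: in production this would go to your logger or metrics
    pipeline rather than returning a string.
    """
    record = {
        "ts": time.time(),
        "provider": provider,
        "estimated_tokens": estimated_tokens,
        "concurrent_tasks": concurrent_tasks,
        "queue_depth": queue_depth,
        "retry_count": retry_count,
        "retry_after": retry_after,  # provider wait hint, None if absent
    }
    return json.dumps(record)
```

With this in place, a spike in retry_count alongside rising queue_depth usually points at self-inflicted burst amplification rather than a provider-side change.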

When does a routed gateway make sense?

It is useful when you want to keep an OpenAI-compatible request shape while moving model selection and mixed-workload routing out of your app code.

Can a router guarantee I will never see 429s again?

No. A router can improve resilience and flexibility, but it does not remove the need for client-side throttling, retry budgets, and queue control.

Build the Control Layer Before You Scale

If your agent system is already showing burst behavior, fixing 429s is usually an infrastructure problem before it is a prompt problem.

Explore EvoLink Smart Router
