
How to Reduce 429 Errors in Agent Workloads: Routing, Retry, and Failover Patterns

Across major providers, a few facts about rate limits hold:
- limits are real
- limits apply above the individual request level
- limits are tied to organization or project scope
- short bursts can still trigger failures even when monthly usage looks healthy
This guide focuses on what is verifiable from official documentation, then translates it into production patterns that actually reduce 429s.
TL;DR
- Agent systems hit 429s faster than simple apps because they create bursts, not smooth traffic.
- You need to budget tokens and concurrency, not just request count.
- Retry logic should follow provider behavior: use `retry-after` when available, and add jittered backoff when it is not.
- Queues, checkpoints, and graceful degradation matter as much as raw throughput.
- Routing helps when you want to reduce dependency on a single upstream limit bucket.
Why agent workloads hit 429 differently
Traditional applications often look like this:
- one user request
- one LLM call
- one response
Agent systems do not behave that way. They often trigger:
- long-context reasoning steps
- tool-call fanout
- multi-agent concurrency
- streaming responses that keep connections open
- background retries at the same time as foreground work
What the provider docs actually say
| Provider | Official limit dimensions | Scope | Operational takeaway |
|---|---|---|---|
| OpenAI | RPM, TPM, RPD, TPD, IPM | Organization and project level, model-specific, with some shared limits | One noisy workflow can still consume the pool your other requests depend on |
| Anthropic | RPM, ITPM, OTPM | Organization level with tier-based limits | Short bursts can trigger 429s before a full minute of traffic has elapsed |
| Gemini API | RPM, TPM, RPD | Per project, model-specific, tier-based | Multiple agents in one project still compete for the same project limits |
OpenAI: project-level control does not remove burst risk
The practical implication is straightforward:
- splitting traffic across features inside one project does not make bursts disappear
- some model families can share a limit pool
- high-throughput agent traffic can starve unrelated requests if you do not throttle client-side
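Client-side throttling can start with a simple sliding-window request limiter. This is a minimal sketch (the class name and `rpm_limit` value are illustrative; set the limit to whatever your documented tier allows):

```python
import asyncio
import time
from collections import deque

class RequestRateLimiter:
    """Sliding-window limiter for requests per minute (RPM)."""

    def __init__(self, rpm_limit: int):
        self.rpm_limit = rpm_limit
        self.timestamps: deque = deque()

    async def acquire(self) -> None:
        while True:
            now = time.time()
            # Drop timestamps older than the 60-second window.
            while self.timestamps and self.timestamps[0] < now - 60:
                self.timestamps.popleft()
            if len(self.timestamps) < self.rpm_limit:
                self.timestamps.append(now)
                return
            # Wait until the oldest request ages out of the window.
            await asyncio.sleep(self.timestamps[0] + 60 - now)
```

Calling `await limiter.acquire()` before each request smooths bursts before they reach the provider's limit bucket.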
Anthropic: input and output pressure are separate
Anthropic's rate-limit docs are especially useful for agent systems because they explicitly separate:
- RPM
- ITPM for input tokens
- OTPM for output tokens
Anthropic also documents a `retry-after` header on 429 responses, which is exactly the kind of signal your retry layer should honor. For agent systems, this separation matters because large prompts, long outputs, and parallel tool calls stress different parts of the budget.
Gemini API: project scope still means shared pressure
Gemini documents its limits per project, so every agent running under one project draws from the same pool. That means:
- multiple agents under one project still share limits
- project-level tier upgrades help, but they do not solve burst coordination inside your app
- you should treat rate limits as infrastructure constraints, not as an afterthought
Patterns that actually reduce 429 errors
1. Budget tokens, not just requests
A request counter is too crude for agent systems. A single long-context reasoning step can consume more of your real budget than many small requests.
Use a token-aware budget:
```python
import asyncio
import time
from collections import deque

class TokenBudget:
    """Sliding-window budget for tokens per minute (TPM)."""

    def __init__(self, tpm_limit: int):
        self.tpm_limit = tpm_limit
        self.window: deque = deque()  # (timestamp, estimated_tokens) pairs

    async def reserve(self, estimated_tokens: int) -> None:
        while True:
            now = time.time()
            # Drop entries older than the 60-second window.
            while self.window and self.window[0][0] < now - 60:
                self.window.popleft()
            used = sum(tokens for _, tokens in self.window)
            if used + estimated_tokens <= self.tpm_limit or not self.window:
                break
            # Wait until the oldest entry ages out, then re-check the budget.
            wait_seconds = 60 - (now - self.window[0][0])
            await asyncio.sleep(max(wait_seconds, 0))
        self.window.append((time.time(), estimated_tokens))
```

2. Cap concurrency around the agent loop
A lot of 429 storms are self-inflicted. Tool calls, background jobs, and retries all pile on each other.
Use concurrency controls:
```python
import asyncio

# Allow at most five agent loops to run at once.
agent_slots = asyncio.Semaphore(5)

async def run_agent(task):
    async with agent_slots:
        return await execute_agent(task)  # your existing agent entry point
```

This will not eliminate 429s on its own, but it prevents your own app from turning one spike into a collapse.
3. Make retries provider-aware
The retry layer should not treat every 429 the same way.
```python
import asyncio
import random

async def retry_with_backoff(call, provider_name, attempts=5):
    for attempt in range(attempts):
        try:
            return await call()
        except Exception as exc:
            # Honor the provider's retry-after signal when it is exposed.
            retry_after = getattr(exc, "retry_after", None)
            if retry_after is not None:
                await asyncio.sleep(float(retry_after))
                continue
            # Jittered fallback when the provider does not give a wait value.
            delay = min(30, (2 ** attempt) + random.random())
            await asyncio.sleep(delay)
    raise RuntimeError(f"{provider_name} retry budget exhausted")
```

Production guidance:
- respect `retry-after` when the provider returns it
- use jittered exponential backoff as a fallback, not as your only strategy
- do not retry forever
- track retry volume separately from primary traffic
4. Separate foreground and background queues
When the same pool handles user-visible agent work and background analysis jobs, your low-value backlog can block high-value traffic.
Use at least two queues:
- a foreground queue for user-facing responses
- a background queue for batch or catch-up work
This gives you a place to shed or defer lower-priority traffic before the provider does it for you.
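The two-queue idea can be sketched with plain asyncio queues. The dispatcher below is an illustrative minimum (the name `dispatch_one` and the string return values are hypothetical):

```python
import asyncio

foreground: asyncio.Queue = asyncio.Queue()  # user-facing work
background: asyncio.Queue = asyncio.Queue()  # batch / catch-up work

async def dispatch_one() -> str:
    """Serve foreground work first; touch background only when idle."""
    if not foreground.empty():
        task = foreground.get_nowait()
    elif not background.empty():
        task = background.get_nowait()
    else:
        return "idle"
    # In a real system this would invoke the agent loop with the task.
    return f"ran {task}"
```

Because the dispatcher always checks the foreground queue first, a deep background backlog can never delay user-facing traffic, only its own completion time.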
5. Checkpoint long-running loops
Do not make a 429 restart twenty minutes of work.
Checkpoint before each expensive call:
- current task state
- tool results already collected
- last successful reasoning step
- retry count and next-attempt timestamp
That turns a 429 from a workflow failure into a scheduling delay.
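One minimal way to persist those fields is a small dataclass serialized to JSON before each expensive call. This is a sketch, not a prescribed schema; the field names simply mirror the list above:

```python
import json
from dataclasses import dataclass, field, asdict
from pathlib import Path

@dataclass
class AgentCheckpoint:
    task_state: dict
    tool_results: list = field(default_factory=list)
    last_step: str = ""
    retry_count: int = 0
    next_attempt_at: float = 0.0  # unix timestamp for the next retry

    def save(self, path: Path) -> None:
        path.write_text(json.dumps(asdict(self)))

    @classmethod
    def load(cls, path: Path) -> "AgentCheckpoint":
        return cls(**json.loads(path.read_text()))
```

On a 429, the loop writes the checkpoint, schedules `next_attempt_at`, and exits; the resumed run reloads it instead of redoing completed steps.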
6. Degrade gracefully instead of failing hard
Your agent does not always need the biggest model in the system.
Graceful degradation might mean:
- smaller context windows
- fewer parallel tools
- a cheaper or faster fallback model
- queueing low-priority jobs instead of executing immediately
The right fallback depends on the workflow, but the architecture principle is the same: a partial answer is usually better than a crashed agent session.
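A fallback ladder can be expressed as an ordered list of configurations tried in sequence. This sketch assumes a hypothetical `RateLimitError` raised by your client wrapper, and placeholder model names:

```python
class RateLimitError(Exception):
    """Stand-in for a provider 429 raised by your client wrapper."""

# Hypothetical ladder: preferred configuration first, degraded ones after.
FALLBACK_CHAIN = [
    {"model": "large-model", "max_parallel_tools": 4},
    {"model": "small-model", "max_parallel_tools": 1},
]

def degrade(call_model, chain=FALLBACK_CHAIN):
    """Try each configuration in order; re-raise only if all fail."""
    last_error = None
    for config in chain:
        try:
            return call_model(config)
        except RateLimitError as exc:
            last_error = exc
    raise last_error
```

The ladder keeps the degradation policy in one place, so changing "what a partial answer looks like" is a config edit rather than scattered conditionals.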
Where routing fits into the picture
The right way to think about routing is:
- throttling controls what your app sends
- retry logic controls how your app reacts
- routing controls where your app can send work when conditions change
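One lightweight way to express the "where" is a cooldown-based router that skips upstreams which recently returned 429. This is an illustrative sketch, not a production circuit breaker; upstream names and the cooldown value are assumptions:

```python
import time

class UpstreamRouter:
    """Route around upstreams that recently returned 429."""

    def __init__(self, upstreams, cooldown_seconds: float = 60.0):
        self.upstreams = list(upstreams)  # ordered by preference
        self.cooldown = cooldown_seconds
        self.blocked_until: dict = {}

    def report_429(self, name: str) -> None:
        # Put the upstream on cooldown after a rate-limit response.
        self.blocked_until[name] = time.time() + self.cooldown

    def pick(self) -> str:
        now = time.time()
        for name in self.upstreams:
            if self.blocked_until.get(name, 0.0) <= now:
                return name
        raise RuntimeError("all upstreams are cooling down")
```

A real gateway adds health checks and weighted selection, but the core decision is the same: prefer the primary path, and step aside when it signals pressure.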
A simple before-and-after table
| Pattern | What happens under load | Main weakness |
|---|---|---|
| Single provider, no admission control | Requests pile up until the upstream starts rejecting them | 429 storms and cascading retries |
| Single provider with retries only | The app survives some spikes | Sustained bursts still block the same upstream bucket |
| Single provider with throttling and queues | Traffic is smoother and failures are less chaotic | You still depend on one upstream pool |
| Routed gateway with throttling and retries | The app can smooth bursts and keep the integration surface stable | More infrastructure choices to evaluate |
What the repository copy supports for EvoLink Smart Router
The current repository copy for EvoLink Smart Router supports these publishable claims:
- EvoLink provides a self-built routing layer for mixed workloads
- you can send `evolink/auto` as the model ID
- the actual model used is returned in the response
- the request shape stays OpenAI-compatible
- the routing layer itself does not add a separate routing fee
Here is the shape that repository copy supports:
```shell
curl https://api.evolink.ai/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "evolink/auto",
    "messages": [
      {
        "role": "user",
        "content": "Summarize the latest deployment incident and suggest next steps."
      }
    ]
  }'
```

A practical rollout checklist
Before you blame the provider, verify these first:
| Check | Why it matters |
|---|---|
| Estimate token volume per agent step | Many 429s are really TPM or ITPM problems |
| Cap concurrent agent loops | Prevent self-inflicted burst amplification |
| Respect `retry-after` when present | Reduces wasteful retry storms |
| Split foreground and background queues | Protects user-facing latency |
| Save checkpoints for long-running work | Prevents total restart after transient 429s |
| Decide when to route or fail over | Keeps fallback behavior deliberate instead of ad hoc |
FAQ
Why do agents hit 429 faster than a normal chat app?
Because agents generate burst traffic: long prompts, tool fanout, retries, background jobs, and concurrency all stack together.
Will adding more API keys solve the problem?
Usually not by itself. OpenAI documents organization-level and project-level limits, Anthropic documents organization-level limits, and Gemini documents project-level limits. Extra keys inside the same scope do not create a brand-new capacity pool.
Should I use exponential backoff for every 429?
Not as the first resort. Respect `retry-after` when the provider gives you one. Fall back to jittered exponential backoff when it does not.
Do I need both throttling and routing?
Yes, if you are operating at production scale. Throttling smooths traffic before the provider rejects it. Routing helps reduce dependence on one upstream path.
What should I log when debugging 429s?
At minimum: the provider and model, which limit dimension was hit, request and token counts for the window, retry attempts, and any returned `retry-after`.
When is a gateway like EvoLink useful?
It is useful when you want to keep an OpenAI-compatible request shape while moving model selection and mixed-workload routing out of your app code.
Can a router guarantee I will never see 429s again?
No. A router can improve resilience and flexibility, but it does not remove the need for client-side throttling, retry budgets, and queue control.
Build the Control Layer Before You Scale
If your agent system is already showing burst behavior, fixing 429s is usually an infrastructure problem before it is a prompt problem.
Explore EvoLink Smart Router

