
How to Reduce 429 Errors in Agent Workloads: Routing, Retry, and Failover Patterns

Across major providers, a few facts about rate limits hold:
- limits are real
- limits apply above the individual request level
- limits are tied to organization or project scope
- short bursts can still trigger failures even when monthly usage looks healthy
This guide focuses on what is verifiable from official documentation, then translates it into production patterns that actually reduce 429s.
TL;DR
- Agent systems hit 429s faster than simple apps because they create bursts, not smooth traffic.
- You need to budget tokens and concurrency, not just request count.
- Retry logic should follow provider behavior: use `retry-after` when available, and add jittered backoff when it is not.
- Queues, checkpoints, and graceful degradation matter as much as raw throughput.
- Routing helps when you want to reduce dependency on a single upstream limit bucket.
Why agent workloads hit 429 differently
Traditional applications often look like this:
- one user request
- one LLM call
- one response
Agent systems do not behave that way. They often trigger:
- long-context reasoning steps
- tool-call fanout
- multi-agent concurrency
- streaming responses that keep connections open
- background retries at the same time as foreground work
What the provider docs actually say
| Provider | Official limit dimensions | Scope | Operational takeaway |
|---|---|---|---|
| OpenAI | RPM, TPM, RPD, TPD, IPM | Organization and project level, model-specific, with some shared limits | One noisy workflow can still consume the pool your other requests depend on |
| Anthropic | RPM, ITPM, OTPM | Organization level with tier-based limits | Short bursts can trigger 429s before a full minute of traffic has elapsed |
| Gemini API | RPM, TPM, RPD | Per project, model-specific, tier-based | Multiple agents in one project still compete for the same project limits |
OpenAI: project-level control does not remove burst risk
The practical implication is straightforward:
- splitting traffic across features inside one project does not make bursts disappear
- some model families can share a limit pool
- high-throughput agent traffic can starve unrelated requests if you do not throttle client-side
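Client-side throttling can start with a simple sliding-window request limiter. This is a minimal sketch (the class name and `rpm_limit` value are illustrative; set the limit to whatever your documented tier allows):

```python
import asyncio
import time
from collections import deque

class RequestRateLimiter:
    """Sliding-window limiter for requests per minute (RPM)."""

    def __init__(self, rpm_limit: int):
        self.rpm_limit = rpm_limit
        self.timestamps: deque = deque()

    async def acquire(self) -> None:
        while True:
            now = time.time()
            # Drop timestamps older than the 60-second window.
            while self.timestamps and self.timestamps[0] < now - 60:
                self.timestamps.popleft()
            if len(self.timestamps) < self.rpm_limit:
                self.timestamps.append(now)
                return
            # Wait until the oldest request ages out of the window.
            await asyncio.sleep(self.timestamps[0] + 60 - now)
```

Calling `await limiter.acquire()` before each request smooths bursts before they reach the provider's limit bucket.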
Anthropic: input and output pressure are separate
Anthropic's rate-limit docs are especially useful for agent systems because they explicitly separate:
- RPM
- ITPM for input tokens
- OTPM for output tokens
Anthropic also documents a `retry-after` header on 429 responses, which is exactly the kind of signal your retry layer should honor. For agent systems, this separation matters because large prompts, long outputs, and parallel tool calls stress different parts of the budget.
Gemini API: project scope still means shared pressure
Gemini documents its limits per project, so every agent running under one project draws from the same pool. That means:
- multiple agents under one project still share limits
- project-level tier upgrades help, but they do not solve burst coordination inside your app
- you should treat rate limits as infrastructure constraints, not as an afterthought
Patterns that actually reduce 429 errors
1. Budget tokens, not just requests
A request counter is too crude for agent systems. A single long-context reasoning step can consume more of your real budget than many small requests.
Use a token-aware budget:
```python
import asyncio
import time
from collections import deque

class TokenBudget:
    """Sliding-window budget for tokens per minute (TPM)."""

    def __init__(self, tpm_limit: int):
        self.tpm_limit = tpm_limit
        self.window: deque = deque()  # (timestamp, estimated_tokens) pairs

    async def reserve(self, estimated_tokens: int) -> None:
        while True:
            now = time.time()
            # Drop entries older than the 60-second window.
            while self.window and self.window[0][0] < now - 60:
                self.window.popleft()
            used = sum(tokens for _, tokens in self.window)
            if used + estimated_tokens <= self.tpm_limit or not self.window:
                break
            # Wait until the oldest entry ages out, then re-check the budget.
            wait_seconds = 60 - (now - self.window[0][0])
            await asyncio.sleep(max(wait_seconds, 0))
        self.window.append((time.time(), estimated_tokens))
```

2. Cap concurrency around the agent loop
A lot of 429 storms are self-inflicted. Tool calls, background jobs, and retries all pile on each other.
Use concurrency controls:
```python
import asyncio

# Allow at most five agent loops to run at once.
agent_slots = asyncio.Semaphore(5)

async def run_agent(task):
    async with agent_slots:
        return await execute_agent(task)  # your existing agent entry point
```

This will not eliminate 429s on its own, but it prevents your own app from turning one spike into a collapse.
3. Make retries provider-aware
The retry layer should not treat every 429 the same way.
```python
import asyncio
import random

async def retry_with_backoff(call, provider_name, attempts=5):
    for attempt in range(attempts):
        try:
            return await call()
        except Exception as exc:
            # Honor the provider's retry-after signal when it is exposed.
            retry_after = getattr(exc, "retry_after", None)
            if retry_after is not None:
                await asyncio.sleep(float(retry_after))
                continue
            # Jittered fallback when the provider does not give a wait value.
            delay = min(30, (2 ** attempt) + random.random())
            await asyncio.sleep(delay)
    raise RuntimeError(f"{provider_name} retry budget exhausted")
```

Production guidance:
- respect `retry-after` when the provider returns it
- use jittered exponential backoff as a fallback, not as your only strategy
- do not retry forever
- track retry volume separately from primary traffic
4. Separate foreground and background queues
When the same pool handles user-visible agent work and background analysis jobs, your low-value backlog can block high-value traffic.
Use at least two queues:
- a foreground queue for user-facing responses
- a background queue for batch or catch-up work
This gives you a place to shed or defer lower-priority traffic before the provider does it for you.
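The two-queue idea can be sketched with plain asyncio queues. The dispatcher below is an illustrative minimum (the name `dispatch_one` and the string return values are hypothetical):

```python
import asyncio

foreground: asyncio.Queue = asyncio.Queue()  # user-facing work
background: asyncio.Queue = asyncio.Queue()  # batch / catch-up work

async def dispatch_one() -> str:
    """Serve foreground work first; touch background only when idle."""
    if not foreground.empty():
        task = foreground.get_nowait()
    elif not background.empty():
        task = background.get_nowait()
    else:
        return "idle"
    # In a real system this would invoke the agent loop with the task.
    return f"ran {task}"
```

Because the dispatcher always checks the foreground queue first, a deep background backlog can never delay user-facing traffic, only its own completion time.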
5. Checkpoint long-running loops
Do not make a 429 restart twenty minutes of work.
Checkpoint before each expensive call:
- current task state
- tool results already collected
- last successful reasoning step
- retry count and next-attempt timestamp
That turns a 429 from a workflow failure into a scheduling delay.
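One minimal way to persist those fields is a small dataclass serialized to JSON before each expensive call. This is a sketch, not a prescribed schema; the field names simply mirror the list above:

```python
import json
from dataclasses import dataclass, field, asdict
from pathlib import Path

@dataclass
class AgentCheckpoint:
    task_state: dict
    tool_results: list = field(default_factory=list)
    last_step: str = ""
    retry_count: int = 0
    next_attempt_at: float = 0.0  # unix timestamp for the next retry

    def save(self, path: Path) -> None:
        path.write_text(json.dumps(asdict(self)))

    @classmethod
    def load(cls, path: Path) -> "AgentCheckpoint":
        return cls(**json.loads(path.read_text()))
```

On a 429, the loop writes the checkpoint, schedules `next_attempt_at`, and exits; the resumed run reloads it instead of redoing completed steps.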
6. Degrade gracefully instead of failing hard
Your agent does not always need the biggest model in the system.
Graceful degradation might mean:
- smaller context windows
- fewer parallel tools
- a cheaper or faster fallback model
- queueing low-priority jobs instead of executing immediately
The right fallback depends on the workflow, but the architecture principle is the same: a partial answer is usually better than a crashed agent session.
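A fallback ladder can be expressed as an ordered list of configurations tried in sequence. This sketch assumes a hypothetical `RateLimitError` raised by your client wrapper, and placeholder model names:

```python
class RateLimitError(Exception):
    """Stand-in for a provider 429 raised by your client wrapper."""

# Hypothetical ladder: preferred configuration first, degraded ones after.
FALLBACK_CHAIN = [
    {"model": "large-model", "max_parallel_tools": 4},
    {"model": "small-model", "max_parallel_tools": 1},
]

def degrade(call_model, chain=FALLBACK_CHAIN):
    """Try each configuration in order; re-raise only if all fail."""
    last_error = None
    for config in chain:
        try:
            return call_model(config)
        except RateLimitError as exc:
            last_error = exc
    raise last_error
```

The ladder keeps the degradation policy in one place, so changing "what a partial answer looks like" is a config edit rather than scattered conditionals.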
Where routing fits into the picture
The right way to think about routing is:
- throttling controls what your app sends
- retry logic controls how your app reacts
- routing controls where your app can send work when conditions change
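One lightweight way to express the "where" is a cooldown-based router that skips upstreams which recently returned 429. This is an illustrative sketch, not a production circuit breaker; upstream names and the cooldown value are assumptions:

```python
import time

class UpstreamRouter:
    """Route around upstreams that recently returned 429."""

    def __init__(self, upstreams, cooldown_seconds: float = 60.0):
        self.upstreams = list(upstreams)  # ordered by preference
        self.cooldown = cooldown_seconds
        self.blocked_until: dict = {}

    def report_429(self, name: str) -> None:
        # Put the upstream on cooldown after a rate-limit response.
        self.blocked_until[name] = time.time() + self.cooldown

    def pick(self) -> str:
        now = time.time()
        for name in self.upstreams:
            if self.blocked_until.get(name, 0.0) <= now:
                return name
        raise RuntimeError("all upstreams are cooling down")
```

A real gateway adds health checks and weighted selection, but the core decision is the same: prefer the primary path, and step aside when it signals pressure.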
A simple before-and-after table
| Pattern | What happens under load | Main weakness |
|---|---|---|
| Single provider, no admission control | Requests pile up until the upstream starts rejecting them | 429 storms and cascading retries |
| Single provider with retries only | The app survives some spikes | Sustained bursts still block the same upstream bucket |
| Single provider with throttling and queues | Traffic is smoother and failures are less chaotic | You still depend on one upstream pool |
| Routed gateway with throttling and retries | The app can smooth bursts and keep the integration surface stable | More infrastructure choices to evaluate |
What the repository copy supports for EvoLink Smart Router
The current repository copy for EvoLink Smart Router supports these publishable claims:
- EvoLink provides a self-built routing layer for mixed workloads
- you can send `evolink/auto` as the model ID
- the actual model used is returned in the response
- the request shape stays OpenAI-compatible
- the routing layer itself does not add a separate routing fee
Here is the shape that repository copy supports:
```shell
curl https://api.evolink.ai/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "evolink/auto",
    "messages": [
      {
        "role": "user",
        "content": "Summarize the latest deployment incident and suggest next steps."
      }
    ]
  }'
```

A practical rollout checklist
Before you blame the provider, verify these first:
| Check | Why it matters |
|---|---|
| Estimate token volume per agent step | Many 429s are really TPM or ITPM problems |
| Cap concurrent agent loops | Prevent self-inflicted burst amplification |
| Respect `retry-after` when present | Reduces wasteful retry storms |
| Split foreground and background queues | Protects user-facing latency |
| Save checkpoints for long-running work | Prevents total restart after transient 429s |
| Decide when to route or fail over | Keeps fallback behavior deliberate instead of ad hoc |
FAQ
Why do agents hit 429 faster than a normal chat app?
Because agents generate burst traffic: long prompts, tool fanout, retries, background jobs, and concurrency all stack together.
Will adding more API keys solve the problem?
Usually not by itself. OpenAI documents organization-level and project-level limits, Anthropic documents organization-level limits, and Gemini documents project-level limits. Extra keys inside the same scope do not create a brand-new capacity pool.
Should I use exponential backoff for every 429?
Not as the first resort. Respect `retry-after` when the provider gives you one. Fall back to jittered exponential backoff when it does not.
Do I need both throttling and routing?
Yes, if you are operating at production scale. Throttling smooths traffic before the provider rejects it. Routing helps reduce dependence on one upstream path.
What should I log when debugging 429s?
At minimum: the provider and model, which limit dimension was hit, request and token counts for the window, retry attempts, and any returned `retry-after`.
When is a gateway like EvoLink useful?
It is useful when you want to keep an OpenAI-compatible request shape while moving model selection and mixed-workload routing out of your app code.
Can a router guarantee I will never see 429s again?
No. A router can improve resilience and flexibility, but it does not remove the need for client-side throttling, retry budgets, and queue control.
Build the Control Layer Before You Scale
If your agent system is already showing burst behavior, fixing 429s is usually an infrastructure problem before it is a prompt problem.
Explore EvoLink Smart Router

