
# AI API Timeout: Causes, Retry Patterns, and Fallback Design

Your AI API request timed out. But "timeout" is not one problem — it is at least four different problems wearing the same error message.
A text model timing out after 30 seconds is a different issue from a video generation job timing out after 5 minutes. Fixing the wrong type wastes engineering time and may make the problem worse.
This guide helps you diagnose which type of timeout you are dealing with and choose the right response pattern.
## TL;DR
- AI API timeouts have different root causes: model latency, provider queue, large input, or network issues.
- Text model timeouts and video/image generation timeouts need different handling strategies.
- Do not retry blindly — some timeouts mean the request is still processing.
- For long-running tasks (video, image), use async patterns instead of waiting for synchronous responses.
- Design fallback before you need it: shorter timeout + fallback model is often better than a long timeout + no fallback.
## Timeout diagnosis table
Use this to identify your timeout type before choosing a fix:
| Timeout type | Typical duration | Root cause | How to verify | Right response |
|---|---|---|---|---|
| Text model — slow response | 15–60s | Large input, complex reasoning, or high output tokens | Check input size and max_tokens | Reduce input, lower max_tokens, or switch to a faster model |
| Text model — provider overloaded | 30–120s | Provider is under heavy load; queued requests | Try the same request during off-peak hours | Retry with backoff, or route to a different provider |
| Video/image generation — normal processing | 60–300s+ | Generation takes time by design (especially video) | Check provider docs for expected generation time | Use async polling, not synchronous wait |
| Video/image generation — queue backlog | 300s+ | Too many jobs ahead in the provider's queue | Check provider queue status or position | Add queue management, set user expectations, or use a different provider |
| Network timeout | Variable | DNS, firewall, proxy, or connectivity issues | Test with a simple health-check request | Fix network configuration, not the API call |
| Client-side timeout too short | Fixed by config | Your HTTP client timeout is shorter than the model needs | Increase timeout setting and test again | Increase client timeout to match expected response time |
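If the diagnosis points at the last row, the fix is configuration, not code. A minimal sketch, assuming the official `openai` Python SDK, which accepts a `timeout` argument (a float or an `httpx.Timeout`) and a per-request override via `with_options`:

```python
import httpx
from openai import OpenAI

# Client-wide timeout: 10s to connect, up to 120s to read a response
client = OpenAI(
    timeout=httpx.Timeout(120.0, connect=10.0),
)

# Per-request override for a call you expect to be slow
response = client.with_options(timeout=300.0).chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize this long document..."}],
)
```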
## Pattern 1: Handling text model timeouts
Text model timeouts usually trace back to one of two causes: the request itself (large input, high `max_tokens`) or provider load. Streaming, covered in 1.3, mitigates both.
### 1.1 Large input or high max_tokens

Large inputs take longer to process, and a high `max_tokens` setting allows longer generation, which takes more time:

```python
# client is an openai.OpenAI() instance
# Problem: large input + high max_tokens = slow response
response = client.chat.completions.create(
    model="gpt-4o",
    messages=very_long_messages,  # 100K+ tokens
    max_tokens=4096,              # Requesting long output
)

# Fix: reduce input or cap output
response = client.chat.completions.create(
    model="gpt-4o",
    messages=trimmed_messages,  # Reduced to 50K tokens
    max_tokens=1024,            # Shorter output
)
```

### 1.2 Provider under load
During peak usage, providers may queue your request. This shows up as a timeout rather than an explicit queue message.

Signs that the provider is overloaded:

- The same request works fine during off-peak hours
- The error is intermittent, not consistent
- Other users report similar issues at the same time

What to do (see the sketch after this list, plus Patterns 3 and 4 below):

- Retry with jittered backoff
- Route to an alternative provider or model
- Use streaming to get partial results faster
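As a minimal sketch of routing to an alternative provider, assuming both expose OpenAI-compatible endpoints; the fallback base URL, API key, and model name here are placeholders, not real services:

```python
from openai import OpenAI, APITimeoutError

primary = OpenAI(timeout=30.0)  # default endpoint
fallback = OpenAI(
    base_url="https://alt-provider.example.com/v1",  # hypothetical endpoint
    api_key="alt-provider-key",
    timeout=30.0,
)

def complete_with_provider_fallback(messages):
    """Try the primary provider; on timeout, route to the alternative."""
    try:
        return primary.chat.completions.create(model="gpt-4o", messages=messages)
    except APITimeoutError:
        # Primary timed out; make one attempt against the alternative provider
        return fallback.chat.completions.create(
            model="alt-model-name", messages=messages
        )
```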
### 1.3 Streaming as a timeout mitigation
Streaming does not make generation faster, but it starts returning tokens sooner. This can prevent client-side timeout triggers:
```python
# Synchronous: client may time out waiting for the full response
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    max_tokens=2048,
)

# Streaming: first token arrives faster, keeps the connection alive
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    max_tokens=2048,
    stream=True,
)
for chunk in stream:
    # Some chunks carry no content delta, so guard before printing
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```

## Pattern 2: Handling video/image generation timeouts
Video and image generation are fundamentally different from text. Generation times of 30 seconds to several minutes are normal, not errors.
### 2.1 Synchronous vs asynchronous generation
| Approach | How it works | When to use |
|---|---|---|
| Synchronous | Wait for the full response in one HTTP call | Quick generations (< 30s), simple integrations |
| Async polling | Submit job → get job ID → poll for status → fetch result | Video generation, batch processing, any task > 30s |
| Webhooks | Submit job → provider calls your endpoint when done | High-volume pipelines, event-driven architectures |
### 2.2 Async polling pattern
```python
import asyncio
import httpx

async def generate_video_async(client: httpx.AsyncClient, prompt: str, timeout: int = 600):
    """Submit a video generation job and poll for completion."""
    # Step 1: Submit the job
    submit_response = await client.post(
        "/v1/video/generations",
        json={"model": "veo-3-fast", "prompt": prompt},
    )
    job_id = submit_response.json()["id"]

    # Step 2: Poll for completion, checking every 5 seconds
    for _ in range(timeout // 5):
        status_response = await client.get(f"/v1/video/generations/{job_id}")
        status = status_response.json()
        if status["status"] == "completed":
            return status["result"]
        elif status["status"] == "failed":
            raise RuntimeError(f"Generation failed: {status.get('error')}")
        await asyncio.sleep(5)
    raise TimeoutError(f"Video generation did not complete within {timeout}s")
```
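For the webhook approach from the table in 2.1, the provider calls your endpoint when the job finishes instead of you polling. A minimal receiver sketch, assuming a FastAPI app and a provider that POSTs a JSON payload with `id`, `status`, and `result` fields; the payload shape is an assumption, so check your provider's docs:

```python
from fastapi import FastAPI, Request

app = FastAPI()

async def handle_completed_job(job_id: str, result: dict):
    # Placeholder: store the result URL, notify the user, etc.
    print(f"Job {job_id} completed: {result}")

@app.post("/webhooks/video-complete")
async def video_complete(request: Request):
    # Assumed payload shape: {"id": ..., "status": ..., "result": ...}
    payload = await request.json()
    if payload.get("status") == "completed":
        await handle_completed_job(payload["id"], payload["result"])
    return {"ok": True}
```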
### 2.3 Queue position awareness

When a video generation provider has a backlog, your job waits in queue before processing starts. Some providers expose queue position:
```
Status: queued → position 42
Status: queued → position 15
Status: processing → estimated 90s remaining
Status: completed → download URL available
```

If the provider does not expose queue position, estimate based on historical wait times and set user expectations accordingly.
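To surface that information to users, a small extension of the polling loop from 2.2, assuming the status payload includes a `queue_position` field (field names vary by provider):

```python
import asyncio

async def poll_with_queue_updates(client, job_id: str, on_update, timeout: int = 600):
    """Poll a job, reporting queue position to the caller via on_update."""
    for _ in range(timeout // 5):
        status = (await client.get(f"/v1/video/generations/{job_id}")).json()
        if status["status"] == "queued":
            # "queue_position" is an assumed field name; check provider docs
            on_update(f"Queued: position {status.get('queue_position', '?')}")
        elif status["status"] in ("completed", "failed"):
            return status
        await asyncio.sleep(5)
    raise TimeoutError(f"Job {job_id} did not finish within {timeout}s")
```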
## Pattern 3: Smart retry logic for timeouts
Not all timeouts should be retried the same way:
```python
import asyncio
import random

async def retry_with_timeout_awareness(
    make_request,
    max_retries: int = 3,
    base_timeout: float = 30.0,
):
    """Retry with awareness of timeout type."""
    for attempt in range(max_retries):
        try:
            return await asyncio.wait_for(
                make_request(),
                timeout=base_timeout * (1.5 ** attempt),  # Increase timeout each retry
            )
        except asyncio.TimeoutError:
            if attempt == max_retries - 1:
                raise
            # Add jitter to prevent thundering herd
            delay = min(30, (2 ** attempt) + random.uniform(0, 1))
            await asyncio.sleep(delay)
        except Exception as e:
            # Retry only rate limits (429); other errors are not retried
            if "429" in str(e) and attempt < max_retries - 1:
                delay = min(60, (2 ** attempt) + random.uniform(0, 2))
                await asyncio.sleep(delay)
            else:
                raise
```

### Retry rules for different timeout types
| Timeout type | Should you retry? | How |
|---|---|---|
| Text model slow | Yes | Retry with backoff; consider a faster model on retry |
| Provider overloaded | Yes, with caution | Retry with longer backoff; consider a different provider |
| Video generation still processing | No — the job may still be running | Poll for status instead of resubmitting |
| Network timeout | Yes | Fix network first; retry after connectivity is confirmed |
| Client timeout too short | No — increase timeout instead | Adjust configuration, do not retry |
The most dangerous mistake is retrying a video generation job that is still processing. This creates duplicate jobs, wastes money, and may overload the provider queue.
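One way to guard against that, sketched below: before resubmitting, check the status of the job you already submitted, and only create a new job if the old one genuinely failed. The endpoint shape follows the polling example in 2.2:

```python
async def resubmit_safely(client, job_id: str, prompt: str):
    """Resubmit a video job only if the original is not still running."""
    status = (await client.get(f"/v1/video/generations/{job_id}")).json()
    if status["status"] in ("queued", "processing"):
        # Still in flight: keep polling instead of creating a duplicate job
        return job_id
    if status["status"] == "completed":
        return job_id  # Nothing to do; the result is ready
    # Only a genuinely failed job warrants a new submission
    resp = await client.post(
        "/v1/video/generations",
        json={"model": "veo-3-fast", "prompt": prompt},
    )
    return resp.json()["id"]
```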
## Pattern 4: Fallback design for production

### Timeout-triggered model fallback
```python
import asyncio

# client is an async SDK client (e.g. openai.AsyncOpenAI)
async def call_with_fallback(messages, client, primary_model, fallback_model):
    """Try the primary model; on timeout, fall back to a faster model."""
    try:
        return await asyncio.wait_for(
            client.chat.completions.create(
                model=primary_model,
                messages=messages,
            ),
            timeout=30.0,
        )
    except asyncio.TimeoutError:
        # Fall back to a faster, possibly smaller model
        return await client.chat.completions.create(
            model=fallback_model,
            messages=messages,
        )
```

### Using a routing gateway for timeout resilience
Instead of implementing fallback logic in every service, a routing gateway can handle timeouts at the infrastructure level:
- Route to faster providers when primary routes are slow
- Automatically retry on a different upstream path
- Return the actual model used so your app knows what happened
```python
from openai import OpenAI

client = OpenAI(
    api_key="your-evolink-key",
    base_url="https://api.evolink.ai/v1",
)

# Smart Router handles provider selection and fallback
response = client.chat.completions.create(
    model="evolink/auto",
    messages=messages,
)
```

## Timeout configuration reference
| Setting | Recommended value | Why |
|---|---|---|
| HTTP client timeout (text) | 60–120s | Allow for large inputs and complex reasoning |
| HTTP client timeout (image) | 120–300s | Image generation varies by model and resolution |
| HTTP client timeout (video) | Use async polling | Synchronous timeout is not appropriate for video |
| Retry attempts | 2–3 for text, 0 for in-progress video | Avoid duplicate video/image jobs |
| Backoff base delay | 2s with jitter | Prevent thundering herd on provider recovery |
| Fallback model switch timeout | 15–30s | Switch to faster model before user frustration |
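As one way to wire those values into the official `openai` Python SDK (the numbers follow the table; `max_retries` is the SDK's built-in retry count):

```python
import httpx
from openai import OpenAI

# Text: generous read timeout, modest connect timeout, 2 SDK-level retries
text_client = OpenAI(
    timeout=httpx.Timeout(120.0, connect=10.0),
    max_retries=2,
)

# Image: longer read timeout; still synchronous
image_client = OpenAI(
    timeout=httpx.Timeout(300.0, connect=10.0),
    max_retries=2,
)

# Video: no synchronous timeout tuning; use the async polling pattern above
```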
## Related articles
- Fix OpenRouter 429 "Provider Returned Error" — when the error is rate limiting, not timeout
- Context Length Exceeded in LLM API Calls — when large input causes rejection instead of timeout
- How to Reduce 429 Errors in Agent Workloads — managing burst traffic that leads to timeouts
- Best AI API Platform for Production Reliability — choosing a platform with built-in failover
## FAQ
### Why does my AI API request time out even though the model works?
Timeouts are usually caused by one of: (1) large input taking longer to process, (2) provider under heavy load, (3) client-side timeout configured too short, or (4) network issues. The model itself may be fine.
### Should I increase my timeout or use a different approach?
It depends. For text models, increasing timeout helps with occasional slow responses. For video/image generation, switch to async polling instead of increasing synchronous timeout. For persistent timeouts, investigate root cause before increasing limits.
### Is a timeout the same as a rate limit error?
No. A timeout means the server did not respond within your time limit. A rate limit (429) means the server actively rejected your request. Timeouts often indicate slow processing; 429s indicate too many requests.
### How do I handle timeouts in video generation?
Never wait synchronously for video generation. Use async job submission with polling or webhooks. If a video job times out during polling, check status before resubmitting — the job may still be processing.
### Can streaming prevent timeouts?
Streaming can prevent client-side timeouts because the first token arrives quickly, keeping the connection alive. However, streaming does not make the total generation faster; it just changes the delivery pattern.
### When should I switch to a fallback model on timeout?
Set a threshold (e.g., 15–30 seconds for text) and switch to a faster model when the primary times out. This gives users a response instead of an error. The fallback model may be less capable, but a slightly worse answer is better than no answer.

