
Context Length Exceeded in LLM API Calls: Fixes, Tradeoffs, and Model Selection

Your LLM API call just returned:
```json
{
  "error": {
    "message": "This model's maximum context length is 128000 tokens. However, your messages resulted in 142837 tokens.",
    "type": "invalid_request_error",
    "code": "context_length_exceeded"
  }
}
```
This means your input (system prompt + conversation history + user message) exceeds the model's context window. The request was rejected before any tokens were generated.
Unlike rate limit errors (which are about request volume), context length errors are about request size. The fix is not to slow down — it is to reduce input, restructure the request, or switch to a model with a larger context window.
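If you want to catch this case explicitly in application code, it arrives as a 400-class response carrying the code shown above, not a 429. Below is a minimal detection sketch using the official openai Python SDK (v1.x); the attribute lookup is deliberately defensive because the exact error shape can vary by SDK version, and `messages` stands in for your existing request payload:
```python
from openai import OpenAI, BadRequestError

client = OpenAI()

try:
    response = client.chat.completions.create(model="gpt-4o", messages=messages)
except BadRequestError as e:
    # Distinguish a context overflow from other 400-class request errors
    if getattr(e, "code", None) == "context_length_exceeded" or "context_length_exceeded" in str(e):
        # Input too large: reduce it with one of the strategies below, then retry
        ...
    else:
        raise
```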
TL;DR
- Context length exceeded = your input is too large for the model's token window.
- You have four options: truncate, summarize, chunk, or switch models.
- Each option has different cost, quality, and latency tradeoffs.
- Use the decision table below to choose the right approach for your workload.
- For production systems, handle this before the user sees it — not after.
Quick fix checklist
Before choosing a strategy, check these common causes first:
| Check | What to look for | Quick fix |
|---|---|---|
| Conversation history too long | Messages array grows unbounded over multi-turn chat | Trim oldest messages or implement a sliding window |
| System prompt too large | Detailed instructions eating up context budget | Compress system prompt or move static instructions to a reference |
| Duplicate content | Same context injected multiple times (RAG, tool results) | Deduplicate before sending |
| Large tool/function results | Tool call returned a huge JSON or text blob | Truncate or summarize tool output before adding to context (see the sketch below) |
| Unnecessary metadata | Full objects when only a few fields matter | Extract only the relevant fields |
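Several of these quick fixes are only a few lines of code. For example, capping tool results before they enter the messages array might look like the sketch below; `truncate_tool_result`, the 4,000-character cap, and the surrounding variables are illustrative placeholders, not a library API:
```python
def truncate_tool_result(result: str, max_chars: int = 4000) -> str:
    """Cap a tool/function result before appending it to the conversation."""
    if len(result) <= max_chars:
        return result
    # Keep the head of the output and record how much was dropped
    return result[:max_chars] + f"\n...[truncated {len(result) - max_chars} characters]"

# Wrap every tool result before it becomes a message
messages.append({
    "role": "tool",
    "tool_call_id": tool_call.id,
    "content": truncate_tool_result(raw_tool_output),
})
```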
If none of these quick fixes apply, you need a structural approach.
Decision table: truncate vs summarize vs chunk vs switch model
| Strategy | How it works | Quality impact | Cost impact | Latency impact | Best for |
|---|---|---|---|---|---|
| Truncate | Drop oldest messages or trim input | May lose important context | Reduces input tokens = lower cost | Faster (less input) | Chat apps with long histories; cases where recent context matters most |
| Summarize | Use a cheaper model to compress prior context into a summary | Lossy — summary may miss details | Extra API call for summarization, but main call is smaller | Adds one extra call | Agent workflows with accumulated state; knowledge-heavy conversations |
| Chunk + merge | Split input into chunks, process separately, merge results | Risk of missing cross-chunk context | Multiple calls = higher total cost | Slower (sequential or parallel chunks) | Document processing, analysis over long texts |
| Switch model | Use a model with a larger context window | Usually neutral or better | Larger-context models often cost more per token | May vary | When input cannot be reduced without losing essential information |
Strategy 1: Truncate — drop what matters least
Truncation is the simplest approach. Remove the oldest or least relevant parts of the input.
Sliding window for chat
```python
def sliding_window(messages: list, max_tokens: int, system_prompt: str) -> list:
    """Keep system prompt + most recent messages within token budget."""
    # Always keep system prompt
    result = [{"role": "system", "content": system_prompt}]
    # count_tokens is an assumed tokenizer helper (e.g. tiktoken-based)
    token_count = count_tokens(system_prompt)
    # Add messages from newest to oldest
    for msg in reversed(messages):
        msg_tokens = count_tokens(msg["content"])
        if token_count + msg_tokens > max_tokens:
            break
        result.insert(1, msg)  # Insert after system prompt, preserving chronological order
        token_count += msg_tokens
    return result
```
When truncation works well
- Multi-turn chat where recent context is most important
- RAG pipelines where you can re-retrieve if needed
- Batch processing where you can process in order
When truncation is dangerous
- Agent workflows where early instructions affect later behavior
- Legal or compliance contexts where completeness matters
- Multi-step reasoning where dropping steps changes the answer
Strategy 2: Summarize — compress without losing meaning
Summarization uses a cheaper, faster model to compress prior context:
```python
async def summarize_context(messages: list, client) -> str:
    """Compress older messages into a summary."""
    context_text = "\n".join(
        f"{m['role']}: {m['content']}" for m in messages
    )
    response = await client.chat.completions.create(
        model="gpt-4o-mini",  # Cheaper model for summarization
        messages=[{
            "role": "user",
            "content": f"Summarize this conversation context in under 500 tokens. "
                       f"Preserve key decisions, facts, and pending actions:\n\n{context_text}"
        }],
        max_tokens=500
    )
    return response.choices[0].message.content
```
Cost comparison: summarize vs truncate
| Approach | Input tokens to main model | Extra API calls | Total cost |
|---|---|---|---|
| No management (hits error) | N/A — request rejected | 0 | Wasted latency + retry cost |
| Truncate to 80K tokens | 80K | 0 | Base cost at 80K input |
| Summarize older context | ~10K (summary) + 40K (recent) = 50K | 1 summarization call (~$0.01) | Lower main call cost + small summary cost |
Summarization adds one cheap call but often reduces the main call's input tokens significantly — which can be a net cost reduction for expensive models.
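To put rough numbers on that tradeoff, here is the shape of the calculation. The per-million-token prices are illustrative placeholders chosen to resemble a large main model and a small summarizer; substitute your provider's current rates:
```python
# Illustrative per-1M-input-token prices -- substitute your provider's real rates
MAIN_MODEL_PRICE = 2.50     # e.g. a GPT-4o-class model
SUMMARY_MODEL_PRICE = 0.15  # e.g. a gpt-4o-mini-class model

def cost(tokens: int, price_per_million: float) -> float:
    return tokens / 1_000_000 * price_per_million

# Option 1: truncate to 80K tokens and send everything to the main model
truncate_cost = cost(80_000, MAIN_MODEL_PRICE)

# Option 2: summarize ~90K of older context down to ~10K, keep 40K recent verbatim
summary_call_cost = cost(90_000, SUMMARY_MODEL_PRICE)
main_call_cost = cost(10_000 + 40_000, MAIN_MODEL_PRICE)
summarize_cost = summary_call_cost + main_call_cost

# Output-token cost is ignored here because it is roughly the same either way
print(f"Truncate:  ${truncate_cost:.4f} per request")   # ~$0.20 at these example prices
print(f"Summarize: ${summarize_cost:.4f} per request")  # ~$0.14, including ~$0.01 for the summary call
```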
Strategy 3: Chunk and merge — for long documents
When the input is a single long document (not conversation), chunking is often the right approach:
```python
async def process_long_document(
    document: str,
    question: str,
    client,
    chunk_size: int = 50000
) -> str:
    """Process a long document by chunking."""
    chunks = split_into_chunks(document, chunk_size)
    chunk_results = []
    for i, chunk in enumerate(chunks):
        response = await client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": f"Analyze chunk {i+1}/{len(chunks)} of a document.\n"
                           f"Question: {question}\n\n"
                           f"Chunk content:\n{chunk}"
            }]
        )
        chunk_results.append(response.choices[0].message.content)
    # Merge chunk results
    merge_response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"Merge these partial analyses into a final answer.\n"
                       f"Question: {question}\n\n"
                       f"Partial results:\n" + "\n---\n".join(chunk_results)
        }]
    )
    return merge_response.choices[0].message.content
```
Chunking tradeoffs
- Pro: Handles arbitrarily long inputs
- Con: Cross-chunk context is lost
- Con: Multiple API calls = higher total cost and latency
- Mitigation: Overlap chunks by 10-20% to preserve boundary context (see the sketch below)
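For reference, one possible shape for the `split_into_chunks` helper used in the example above, with that overlap applied. It splits on characters for simplicity; a production version would count tokens with the target model's tokenizer, and the 10% default is just the low end of the range mentioned above:
```python
def split_into_chunks(text: str, chunk_size: int, overlap_ratio: float = 0.1) -> list[str]:
    """Split text into chunks of roughly chunk_size characters with overlap.

    Overlapping the tail of one chunk with the head of the next preserves
    context that would otherwise be cut exactly at the boundary.
    """
    overlap = int(chunk_size * overlap_ratio)
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]
```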
Strategy 4: Switch to a larger-context model
If your input genuinely needs to stay large, switch to a model with a bigger context window:
| Model family | Typical max context | When to use |
|---|---|---|
| GPT-4o | 128K tokens | Default for most workloads |
| GPT-4o (long context) | 128K tokens | There is no larger GPT-4o variant; 128K is already the maximum |
| Claude Sonnet/Opus | 200K tokens | When you need more than 128K |
| Gemini Pro | 1M+ tokens | Very long documents, full codebases |
| Gemini Flash | 1M+ tokens | Cost-sensitive long-context tasks |
Cost impact of larger-context models
Larger context windows usually mean higher per-token cost. Calculate whether the quality improvement justifies the cost:
```text
Option A: Truncate to 80K + GPT-4o = $X
Option B: Full 150K input + Claude Sonnet = $Y
Option C: Full 150K input + Gemini Flash = $Z
Compare: quality difference vs cost difference
```
Using a routing gateway for model selection
Instead of hardcoding model selection, you can use a gateway that routes to the appropriate model based on input size:
```python
from openai import OpenAI

client = OpenAI(
    api_key="your-evolink-key",
    base_url="https://api.evolink.ai/v1"
)

# Let the Smart Router choose based on your workload
response = client.chat.completions.create(
    model="evolink/auto",
    messages=your_messages
)
```
Production pattern: pre-check before sending
Do not wait for the API to reject your request. Check input size before sending:
```python
import tiktoken

def check_context_length(messages: list, model: str, max_tokens: int) -> dict:
    """Pre-check whether messages fit within the model's context window."""
    encoder = tiktoken.encoding_for_model(model)
    # Counts message content only; per-message formatting overhead adds a few
    # tokens each, so treat the total as an approximation
    total_tokens = sum(
        len(encoder.encode(m["content"])) for m in messages
    )
    if total_tokens > max_tokens:
        return {
            "fits": False,
            "total_tokens": total_tokens,
            "excess": total_tokens - max_tokens,
            "suggestion": "truncate" if total_tokens < max_tokens * 1.5
                          else "summarize_or_switch"
        }
    return {"fits": True, "total_tokens": total_tokens}
```
This avoids wasted latency from rejected requests and lets you apply the right strategy before the user sees an error.
Related articles
- Fix OpenRouter 429 "Provider Returned Error" — when the issue is rate limits, not input size
- AI API Timeout: Causes, Retry Patterns, and Fallback Design — when long inputs cause timeouts instead of rejections
- How to Reduce 429 Errors in Agent Workloads — managing request volume alongside request size
- Best AI API Platform for Production Reliability — choosing a platform that handles model routing
FAQ
What does "context length exceeded" mean?
It means your input (system prompt + messages + any injected context) contains more tokens than the model's maximum context window allows. The request is rejected before any generation begins.
Is context length exceeded the same as a rate limit error?
No. Rate limit errors (429) are about request volume — too many requests in a time window. Context length errors are about request size — a single request that is too large. They require different fixes.
Which is better: truncation or summarization?
Truncation is simpler and cheaper but loses information. Summarization preserves meaning but adds an extra API call and introduces compression artifacts. Use truncation for chat histories where recency matters most; use summarization for agent workflows where accumulated context is important.
Can I avoid context length errors by using a larger model?
Yes, but at a cost. Models with larger context windows (Gemini 1M+, Claude 200K) can accept bigger inputs, but they typically cost more per token. Calculate whether the quality improvement justifies the additional spend.
How do I count tokens before sending a request?
Use the tiktoken library for OpenAI models, or the provider's token-counting endpoint where one is available. The exact token count depends on the model's tokenizer, so use the right encoder for your target model.
Should I handle this error in application code or at the gateway level?
Both. Application code should pre-check input size and apply the appropriate strategy (truncate, summarize, chunk, or switch). A routing gateway can additionally select a model with a large enough context window for your request.


