
Context Length Exceeded in LLM API Calls: Fixes, Tradeoffs, and Model Selection

Your LLM API call just returned:
```json
{
  "error": {
    "message": "This model's maximum context length is 128000 tokens. However, your messages resulted in 142837 tokens.",
    "type": "invalid_request_error",
    "code": "context_length_exceeded"
  }
}
```
This means your input (system prompt + conversation history + user message) exceeds the model's context window. The request was rejected before any tokens were generated.
Unlike rate limit errors (which are about request volume), context length errors are about request size. The fix is not to slow down — it is to reduce input, restructure the request, or switch to a model with a larger context window.
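If you want to catch this case explicitly in application code, it arrives as a 400-class response carrying the code shown above, not a 429. Below is a minimal detection sketch using the official openai Python SDK (v1.x); the attribute lookup is deliberately defensive because the exact error shape can vary by SDK version, and `messages` stands in for your existing request payload:
```python
from openai import OpenAI, BadRequestError

client = OpenAI()

try:
    response = client.chat.completions.create(model="gpt-4o", messages=messages)
except BadRequestError as e:
    # Distinguish a context overflow from other 400-class request errors
    if getattr(e, "code", None) == "context_length_exceeded" or "context_length_exceeded" in str(e):
        # Input too large: reduce it with one of the strategies below, then retry
        ...
    else:
        raise
```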
TL;DR
- Context length exceeded = your input is too large for the model's token window.
- You have four options: truncate, summarize, chunk, or switch models.
- Each option has different cost, quality, and latency tradeoffs.
- Use the decision table below to choose the right approach for your workload.
- For production systems, handle this before the user sees it — not after.
Quick fix checklist
Before choosing a strategy, check these common causes first:
| Check | What to look for | Quick fix |
|---|---|---|
| Conversation history too long | Messages array grows unbounded over multi-turn chat | Trim oldest messages or implement a sliding window |
| System prompt too large | Detailed instructions eating up context budget | Compress system prompt or move static instructions to a reference |
| Duplicate content | Same context injected multiple times (RAG, tool results) | Deduplicate before sending |
| Large tool/function results | Tool call returned a huge JSON or text blob | Truncate or summarize tool output before adding to context (see the sketch below) |
| Unnecessary metadata | Full objects when only a few fields matter | Extract only the relevant fields |
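Several of these quick fixes are only a few lines of code. For example, capping tool results before they enter the messages array might look like the sketch below; `truncate_tool_result`, the 4,000-character cap, and the surrounding variables are illustrative placeholders, not a library API:
```python
def truncate_tool_result(result: str, max_chars: int = 4000) -> str:
    """Cap a tool/function result before appending it to the conversation."""
    if len(result) <= max_chars:
        return result
    # Keep the head of the output and record how much was dropped
    return result[:max_chars] + f"\n...[truncated {len(result) - max_chars} characters]"

# Wrap every tool result before it becomes a message
messages.append({
    "role": "tool",
    "tool_call_id": tool_call.id,
    "content": truncate_tool_result(raw_tool_output),
})
```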
If none of these quick fixes apply, you need a structural approach.
Decision table: truncate vs summarize vs chunk vs switch model
| Strategy | How it works | Quality impact | Cost impact | Latency impact | Best for |
|---|---|---|---|---|---|
| Truncate | Drop oldest messages or trim input | May lose important context | Reduces input tokens = lower cost | Faster (less input) | Chat apps with long histories; cases where recent context matters most |
| Summarize | Use a cheaper model to compress prior context into a summary | Lossy — summary may miss details | Extra API call for summarization, but main call is smaller | Adds one extra call | Agent workflows with accumulated state; knowledge-heavy conversations |
| Chunk + merge | Split input into chunks, process separately, merge results | Risk of missing cross-chunk context | Multiple calls = higher total cost | Slower (sequential or parallel chunks) | Document processing, analysis over long texts |
| Switch model | Use a model with a larger context window | Usually neutral or better | Larger-context models often cost more per token | May vary | When input cannot be reduced without losing essential information |
Strategy 1: Truncate — drop what matters least
Truncation is the simplest approach. Remove the oldest or least relevant parts of the input.
Sliding window for chat
```python
def sliding_window(messages: list, max_tokens: int, system_prompt: str) -> list:
    """Keep system prompt + most recent messages within token budget."""
    # Always keep system prompt
    result = [{"role": "system", "content": system_prompt}]
    # count_tokens is an assumed tokenizer helper (e.g. tiktoken-based)
    token_count = count_tokens(system_prompt)
    # Add messages from newest to oldest
    for msg in reversed(messages):
        msg_tokens = count_tokens(msg["content"])
        if token_count + msg_tokens > max_tokens:
            break
        result.insert(1, msg)  # Insert after system prompt, preserving chronological order
        token_count += msg_tokens
    return result
```
When truncation works well
- Multi-turn chat where recent context is most important
- RAG pipelines where you can re-retrieve if needed
- Batch processing where you can process in order
When truncation is dangerous
- Agent workflows where early instructions affect later behavior
- Legal or compliance contexts where completeness matters
- Multi-step reasoning where dropping steps changes the answer
Strategy 2: Summarize — compress without losing meaning
Summarization uses a cheaper, faster model to compress prior context:
```python
async def summarize_context(messages: list, client) -> str:
    """Compress older messages into a summary."""
    context_text = "\n".join(
        f"{m['role']}: {m['content']}" for m in messages
    )
    response = await client.chat.completions.create(
        model="gpt-4o-mini",  # Cheaper model for summarization
        messages=[{
            "role": "user",
            "content": f"Summarize this conversation context in under 500 tokens. "
                       f"Preserve key decisions, facts, and pending actions:\n\n{context_text}"
        }],
        max_tokens=500
    )
    return response.choices[0].message.content
```
Cost comparison: summarize vs truncate
| Approach | Input tokens to main model | Extra API calls | Total cost |
|---|---|---|---|
| No management (hits error) | N/A — request rejected | 0 | Wasted latency + retry cost |
| Truncate to 80K tokens | 80K | 0 | Base cost at 80K input |
| Summarize older context | ~10K (summary) + 40K (recent) = 50K | 1 summarization call (~$0.01) | Lower main call cost + small summary cost |
Summarization adds one cheap call but often reduces the main call's input tokens significantly — which can be a net cost reduction for expensive models.
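To put rough numbers on that tradeoff, here is the shape of the calculation. The per-million-token prices are illustrative placeholders chosen to resemble a large main model and a small summarizer; substitute your provider's current rates:
```python
# Illustrative per-1M-input-token prices -- substitute your provider's real rates
MAIN_MODEL_PRICE = 2.50     # e.g. a GPT-4o-class model
SUMMARY_MODEL_PRICE = 0.15  # e.g. a gpt-4o-mini-class model

def cost(tokens: int, price_per_million: float) -> float:
    return tokens / 1_000_000 * price_per_million

# Option 1: truncate to 80K tokens and send everything to the main model
truncate_cost = cost(80_000, MAIN_MODEL_PRICE)

# Option 2: summarize ~90K of older context down to ~10K, keep 40K recent verbatim
summary_call_cost = cost(90_000, SUMMARY_MODEL_PRICE)
main_call_cost = cost(10_000 + 40_000, MAIN_MODEL_PRICE)
summarize_cost = summary_call_cost + main_call_cost

# Output-token cost is ignored here because it is roughly the same either way
print(f"Truncate:  ${truncate_cost:.4f} per request")   # ~$0.20 at these example prices
print(f"Summarize: ${summarize_cost:.4f} per request")  # ~$0.14, including ~$0.01 for the summary call
```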
Strategy 3: Chunk and merge — for long documents
When the input is a single long document (not conversation), chunking is often the right approach:
```python
async def process_long_document(
    document: str,
    question: str,
    client,
    chunk_size: int = 50000
) -> str:
    """Process a long document by chunking."""
    chunks = split_into_chunks(document, chunk_size)
    chunk_results = []
    for i, chunk in enumerate(chunks):
        response = await client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": f"Analyze chunk {i+1}/{len(chunks)} of a document.\n"
                           f"Question: {question}\n\n"
                           f"Chunk content:\n{chunk}"
            }]
        )
        chunk_results.append(response.choices[0].message.content)
    # Merge chunk results
    merge_response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"Merge these partial analyses into a final answer.\n"
                       f"Question: {question}\n\n"
                       f"Partial results:\n" + "\n---\n".join(chunk_results)
        }]
    )
    return merge_response.choices[0].message.content
```
Chunking tradeoffs
- Pro: Handles arbitrarily long inputs
- Con: Cross-chunk context is lost
- Con: Multiple API calls = higher total cost and latency
- Mitigation: Overlap chunks by 10-20% to preserve boundary context (see the sketch below)
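For reference, one possible shape for the `split_into_chunks` helper used in the example above, with that overlap applied. It splits on characters for simplicity; a production version would count tokens with the target model's tokenizer, and the 10% default is just the low end of the range mentioned above:
```python
def split_into_chunks(text: str, chunk_size: int, overlap_ratio: float = 0.1) -> list[str]:
    """Split text into chunks of roughly chunk_size characters with overlap.

    Overlapping the tail of one chunk with the head of the next preserves
    context that would otherwise be cut exactly at the boundary.
    """
    overlap = int(chunk_size * overlap_ratio)
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]
```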
Strategy 4: Switch to a larger-context model
If your input genuinely needs to stay large, switch to a model with a bigger context window:
| Model family | Typical max context | When to use |
|---|---|---|
| GPT-4o | 128K tokens | Default for most workloads |
| GPT-4o (long context) | 128K tokens | There is no larger GPT-4o variant; 128K is already the maximum |
| Claude Sonnet/Opus | 200K tokens | When you need more than 128K |
| Gemini Pro | 1M+ tokens | Very long documents, full codebases |
| Gemini Flash | 1M+ tokens | Cost-sensitive long-context tasks |
Cost impact of larger-context models
Larger context windows usually mean higher per-token cost. Calculate whether the quality improvement justifies the cost:
```text
Option A: Truncate to 80K + GPT-4o = $X
Option B: Full 150K input + Claude Sonnet = $Y
Option C: Full 150K input + Gemini Flash = $Z
Compare: quality difference vs cost difference
```
Using a routing gateway for model selection
Instead of hardcoding model selection, you can use a gateway that routes to the appropriate model based on input size:
```python
from openai import OpenAI

client = OpenAI(
    api_key="your-evolink-key",
    base_url="https://api.evolink.ai/v1"
)

# Let the Smart Router choose based on your workload
response = client.chat.completions.create(
    model="evolink/auto",
    messages=your_messages
)
```
Production pattern: pre-check before sending
Do not wait for the API to reject your request. Check input size before sending:
```python
import tiktoken

def check_context_length(messages: list, model: str, max_tokens: int) -> dict:
    """Pre-check whether messages fit within the model's context window."""
    encoder = tiktoken.encoding_for_model(model)
    # Counts message content only; per-message formatting overhead adds a few
    # tokens each, so treat the total as an approximation
    total_tokens = sum(
        len(encoder.encode(m["content"])) for m in messages
    )
    if total_tokens > max_tokens:
        return {
            "fits": False,
            "total_tokens": total_tokens,
            "excess": total_tokens - max_tokens,
            "suggestion": "truncate" if total_tokens < max_tokens * 1.5
                          else "summarize_or_switch"
        }
    return {"fits": True, "total_tokens": total_tokens}
```
This avoids wasted latency from rejected requests and lets you apply the right strategy before the user sees an error.
Related articles
- Fix OpenRouter 429 "Provider Returned Error" — when the issue is rate limits, not input size
- AI API Timeout: Causes, Retry Patterns, and Fallback Design — when long inputs cause timeouts instead of rejections
- How to Reduce 429 Errors in Agent Workloads — managing request volume alongside request size
- Best AI API Platform for Production Reliability — choosing a platform that handles model routing
FAQ
What does "context length exceeded" mean?
It means your input (system prompt + messages + any injected context) contains more tokens than the model's maximum context window allows. The request is rejected before any generation begins.
Is context length exceeded the same as a rate limit error?
No. Rate limit errors (429) are about request volume — too many requests in a time window. Context length errors are about request size — a single request that is too large. They require different fixes.
Which is better: truncation or summarization?
Truncation is simpler and cheaper but loses information. Summarization preserves meaning but adds an extra API call and introduces compression artifacts. Use truncation for chat histories where recency matters most; use summarization for agent workflows where accumulated context is important.
Can I avoid context length errors by using a larger model?
Yes, but at a cost. Models with larger context windows (Gemini 1M+, Claude 200K) can accept bigger inputs, but they typically cost more per token. Calculate whether the quality improvement justifies the additional spend.
How do I count tokens before sending a request?
Use the tiktoken library for OpenAI models, or the provider's token-counting endpoint where one is available. The exact token count depends on the model's tokenizer, so use the right encoder for your target model.
Should I handle this error in application code or at the gateway level?
Both. Application code should pre-check input size and apply the appropriate strategy (truncate, summarize, chunk, or switch). A routing gateway can additionally select a model with a large enough context window for your request.


