How to Use Gemini 3.5 Flash API: Model ID, Pricing, and Code Examples

EvoLink Team
Product Team
May 20, 2026
20 min read

Gemini 3.5 Flash is Google's latest production-ready Flash model, generally available and stable for scaled production use. It is built for agentic workflows, coding agents, sub-agent deployment, and long-horizon tasks — combining frontier-level intelligence with Flash-tier speed and cost.

This guide covers everything you need to integrate Gemini 3.5 Flash into your application: model ID, pricing, code examples in Python and Node.js, function calling, structured outputs, agent workflow patterns, cost analysis, and how to choose between Flash and Pro.

For the full product page with live pricing, see Gemini 3.5 Flash API on EvoLink.

Quick Reference Card

| Item | Value |
| --- | --- |
| Model ID | gemini-3.5-flash |
| Status | Generally available (GA), stable for production |
| Input pricing | $1.50 per 1M tokens |
| Output pricing | $9.00 per 1M tokens |
| Context window | 1,048,576 input tokens |
| Max output | 65,535 tokens |
| Input modalities | Text, image, video, audio, PDF |
| Output modalities | Text only |
| Function calling | Supported |
| Structured outputs | Supported |
| Code execution | Supported |
| Search grounding | Supported |
| Context caching | Supported |
| Batch API | Supported |
| Streaming | Supported |

Table of Contents

  1. When to Use Gemini 3.5 Flash
  2. Gemini 3.5 Flash vs Other Gemini Models
  3. Pricing Deep Dive
  4. Setup: Getting Started in 2 Minutes
  5. Code Examples
  6. Function Calling
  7. Structured Outputs
  8. Coding Agent Workflow
  9. Sub-Agent Deployment Pattern
  10. Cost Analysis: What Agent Loops Actually Cost
  11. Cost-Control Strategies
  12. Common Mistakes and How to Avoid Them
  13. When NOT to Use Gemini 3.5 Flash
  14. FAQ

When to Use Gemini 3.5 Flash

Gemini 3.5 Flash is not a general-purpose budget model. Google explicitly positions it for specific high-value workloads where speed, cost per iteration, and tool support matter more than maximum reasoning depth.

Best Use Cases

| Use case | Why Gemini 3.5 Flash fits | What to measure |
| --- | --- | --- |
| Coding agents | Fast code generation, debugging, refactoring at Flash-tier speed per iteration | Iterations to fix, cost per session, diff quality |
| Agentic workflows | Native function calling, parallel execution loops, low per-call cost | Tool call accuracy, fallback rate, total workflow cost |
| Sub-agent deployment | Deploy as a sub-agent in multi-agent systems where per-call economics matter | Latency per sub-call, error rate, orchestration overhead |
| Long-horizon tasks | 1M context handles full codebases and multi-document analysis without truncation | Context utilization rate, output quality at high token counts |
| Document processing | PDF, audio, video inputs at unified pricing, with no modality surcharges | Extraction accuracy, processing cost per document |
| Production chat | Built-in reasoning at Flash latency for customer-facing applications | Time to first token, user satisfaction, cost per conversation |

Use Case Decision Tree

Ask yourself these questions in order:

  1. Does the task need the absolute deepest reasoning? If yes → Gemini 3.1 Pro.
  2. Is this a high-volume, simple task (classification, routing, extraction)? If yes → Gemini 3.1 Flash Lite.
  3. Does the task involve coding, agents, tools, or long context? If yes → Gemini 3.5 Flash.
  4. Is this general production chat or summarization? If yes → Gemini 3.5 Flash or Gemini 2.5 Flash (compare on your workload).
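
The ordered questions above can be sketched as a small function in which the first matching check wins. This is only an illustration: the name `choose_model` and its boolean flags are hypothetical, and the model IDs are the ones used elsewhere in this guide.

```python
def choose_model(
    needs_deepest_reasoning: bool = False,
    high_volume_simple: bool = False,
    agentic_or_long_context: bool = False,
) -> str:
    """Walk the decision tree in order; the first matching question wins."""
    if needs_deepest_reasoning:
        return "gemini-3.1-pro-preview"
    if high_volume_simple:
        return "gemini-3.1-flash-lite"
    if agentic_or_long_context:
        return "gemini-3.5-flash"
    # General chat or summarization: start with Gemini 3.5 Flash and compare
    # it against Gemini 2.5 Flash on your own workload.
    return "gemini-3.5-flash"


print(choose_model(high_volume_simple=True))
```

Because the checks run in order, a task that is both high-volume and needs the deepest reasoning is still routed to Pro, which mirrors the priority of the list.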

Gemini 3.5 Flash vs Other Gemini Models

This is the comparison that matters for production routing decisions.

| Feature | Gemini 3.5 Flash | Gemini 3.1 Pro | Gemini 3 Flash | Gemini 3.1 Flash Lite | Gemini 2.5 Flash |
| --- | --- | --- | --- | --- | --- |
| Status | GA, stable | Preview | Preview | Preview | Stable |
| Best for | Agents, coding, long-horizon | Hardest reasoning | General fast workloads | High-volume batch | Production chat |
| Input cost | $1.50/MTok | $2–$4/MTok | $0.50/MTok | $0.25/MTok | $0.30/MTok |
| Output cost | $9.00/MTok | $12–$18/MTok | $3.00/MTok | $1.50/MTok | $2.50/MTok |
| Context (input / max output) | 1M / 65K | 1M / 64K | 1M / 64K | 1M / 64K | 1M / 64K |
| Reasoning | Built-in | Deepest (thinking) | Standard | Lightweight | Standard |
| Function calling | Yes | Yes | Yes | Yes | Yes |
| Code execution | Yes | Yes | Yes | Yes | Yes |
| Production readiness | GA | Preview | Preview | Preview | Stable |

Key takeaway: Gemini 3.5 Flash is the only GA-stable Flash model in the Gemini 3.x generation with built-in reasoning and full tool support. It costs more than Gemini 3 Flash ($1.50 vs $0.50 per MTok input), but delivers frontier-level intelligence that previous Flash models don't match.

Pricing Deep Dive

Standard Pricing

| Token type | Price per 1M tokens |
| --- | --- |
| Text input | $1.50 |
| Text output | $9.00 |
| Audio input | Unified with text (no surcharge) |
| Image input | Unified with text (no surcharge) |
| Video input | Unified with text (no surcharge) |
| PDF input | Unified with text (no surcharge) |

Cost Reduction Options

| Method | How it works | Best for |
| --- | --- | --- |
| Context caching | Cache repeated input prefixes; cache hits cost less than fresh input | Agent loops, repeated code context, system prompts |
| Batch API | Submit requests in batches for offline processing at discounted rates | Test generation, bulk extraction, offline analysis |
| EvoLink credits | Pre-purchase credits for volume discounts | Teams with predictable monthly usage |

Real-World Cost Examples

| Scenario | Input tokens | Output tokens | Estimated cost |
| --- | --- | --- | --- |
| Single text question | ~500 | ~200 | $0.003 |
| Code review (1 file, ~2K lines) | ~8,000 | ~2,000 | $0.03 |
| Coding agent session (20 iterations) | ~80,000 | ~20,000 | $0.30 |
| Full codebase analysis (500K context) | ~500,000 | ~10,000 | $0.84 |
| PDF document extraction (100 pages) | ~150,000 | ~5,000 | $0.27 |
| 8-hour agent deployment (continuous) | ~2,000,000 | ~500,000 | $7.50 |

These estimates assume standard pricing without caching. With context caching enabled, agent loop costs can be significantly reduced.
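
The estimates in the table follow directly from the two standard prices. A minimal sketch of the arithmetic (the function name `estimate_cost` is illustrative, not part of any SDK):

```python
INPUT_PRICE_PER_MTOK = 1.50   # USD per 1M input tokens (standard pricing)
OUTPUT_PRICE_PER_MTOK = 9.00  # USD per 1M output tokens (standard pricing)


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of a request or session, without caching."""
    input_cost = input_tokens / 1_000_000 * INPUT_PRICE_PER_MTOK
    output_cost = output_tokens / 1_000_000 * OUTPUT_PRICE_PER_MTOK
    return input_cost + output_cost


# Coding agent session from the table: ~80,000 input and ~20,000 output tokens
print(f"${estimate_cost(80_000, 20_000):.2f}")  # $0.30
```

Output tokens cost six times as much as input tokens, so output-heavy workloads are more expensive than their total token count suggests.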


Setup: Getting Started in 2 Minutes

Step 1: Get an API Key

Sign up at EvoLink and create an API key in Dashboard → Keys.

Step 2: Install the OpenAI SDK

EvoLink is OpenAI-compatible, so you use the standard OpenAI SDK:

Python:
pip install openai
Node.js:
npm install openai

Step 3: Make Your First Request

Python:
from openai import OpenAI

client = OpenAI(
    api_key="your-evolink-api-key",
    base_url="https://api.evolink.ai/v1"
)

response = client.chat.completions.create(
    model="gemini-3.5-flash",
    messages=[
        {"role": "user", "content": "What is Gemini 3.5 Flash best at?"}
    ]
)

print(response.choices[0].message.content)
Node.js:
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: "your-evolink-api-key",
  baseURL: "https://api.evolink.ai/v1",
});

const response = await client.chat.completions.create({
  model: "gemini-3.5-flash",
  messages: [
    { role: "user", content: "What is Gemini 3.5 Flash best at?" },
  ],
});

console.log(response.choices[0].message.content);

That's it. No Google-specific SDK needed, no separate auth flow, no Vertex AI setup.
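
The snippets above inline the key for brevity. In a real application you would more likely read it from the environment so it never lands in source control. A minimal sketch, assuming a variable named `EVOLINK_API_KEY` (a hypothetical name, not one EvoLink prescribes):

```python
import os


def load_api_key(env_var: str = "EVOLINK_API_KEY") -> str:
    """Read the API key from the environment and fail early if it is missing.

    The variable name is an assumption; use whatever your deployment defines.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key


def make_client():
    """Build the same client as above, with the key taken from the environment."""
    from openai import OpenAI  # imported lazily so load_api_key works on its own

    return OpenAI(api_key=load_api_key(), base_url="https://api.evolink.ai/v1")
```

Failing at startup with a clear message is usually easier to debug than an authentication error on the first request.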


Code Examples

Basic Text Request with System Prompt

response = client.chat.completions.create(
    model="gemini-3.5-flash",
    messages=[
        {"role": "system", "content": "You are a senior software engineer. Be concise and precise."},
        {"role": "user", "content": "Explain the difference between a mutex and a semaphore in 3 sentences."}
    ],
    temperature=0.3,
    max_tokens=512
)

Multimodal: Image Analysis

import base64

with open("screenshot.png", "rb") as f:
    image_data = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gemini-3.5-flash",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What error is shown in this screenshot? Suggest a fix."},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_data}"}}
            ]
        }
    ]
)

All multimodal inputs share the same per-token pricing as text — no audio or video surcharges.
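
The example above hard-codes `image/png` in the data URL. If your inputs mix formats, a small helper can pick the MIME type from the file name. This is a sketch using only the standard library; `to_data_url` is a hypothetical helper name.

```python
import base64
import mimetypes


def to_data_url(path: str) -> str:
    """Encode a local image as a data URL, guessing the MIME type from the file name."""
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Not a recognized image type: {path}")
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```

The returned string can be passed as the `url` value of the `image_url` content part.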

Streaming

For interactive applications where you want tokens to appear as they are generated:

Python:
stream = client.chat.completions.create(
    model="gemini-3.5-flash",
    messages=[{"role": "user", "content": "Write a Python function that validates email addresses."}],
    stream=True
)

for chunk in stream:
    # Some chunks may arrive without choices, so guard before indexing
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
Node.js:
const stream = await client.chat.completions.create({
  model: "gemini-3.5-flash",
  messages: [{ role: "user", content: "Write a Python function that validates email addresses." }],
  stream: true,
});

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content;
  if (content) process.stdout.write(content);
}

Multi-Turn Conversation

messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a linked list implementation in Python."},
]

# First turn
response = client.chat.completions.create(model="gemini-3.5-flash", messages=messages)
assistant_message = response.choices[0].message.content
messages.append({"role": "assistant", "content": assistant_message})

# Follow-up
messages.append({"role": "user", "content": "Now add a reverse() method."})
response = client.chat.completions.create(model="gemini-3.5-flash", messages=messages)
print(response.choices[0].message.content)

Function Calling

Gemini 3.5 Flash supports native function calling, which is essential for agent workflows. Define tools and let the model decide when to call them.

Python Example

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
                },
                "required": ["location"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "search_database",
            "description": "Search the internal knowledge base",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search query"},
                    "limit": {"type": "integer", "description": "Max results to return"}
                },
                "required": ["query"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="gemini-3.5-flash",
    messages=[{"role": "user", "content": "What's the weather in Tokyo and find articles about climate change?"}],
    tools=tools,
    tool_choice="auto"
)

# The model may call one or both tools; tool_calls is None when it answers directly
for tool_call in response.choices[0].message.tool_calls or []:
    print(f"Function: {tool_call.function.name}")
    print(f"Arguments: {tool_call.function.arguments}")
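
The example above stops at printing the requested calls. In the OpenAI chat-completions format, the usual next step is to run each requested function and return its result in a `tool` message that carries the matching `tool_call_id`. A minimal offline sketch: `run_tool_calls`, `TOOL_REGISTRY`, and the stub `get_weather` are hypothetical names, and the fake call object only imitates the shape of an SDK tool call.

```python
import json
from types import SimpleNamespace


def get_weather(location: str, unit: str = "celsius") -> dict:
    """Stub implementation; a real version would query a weather service."""
    return {"location": location, "unit": unit, "temperature": 21}


# Maps tool names from the schema to local Python callables
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_calls(tool_calls) -> list:
    """Execute each requested call and build the matching `tool` messages."""
    tool_messages = []
    for call in tool_calls or []:
        func = TOOL_REGISTRY.get(call.function.name)
        if func is None:
            result = {"error": f"unknown tool: {call.function.name}"}
        else:
            try:
                result = func(**json.loads(call.function.arguments))
            except (json.JSONDecodeError, TypeError) as exc:
                # Report bad arguments to the model instead of crashing
                result = {"error": str(exc)}
        tool_messages.append(
            {"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)}
        )
    return tool_messages


# Offline demonstration with a hand-built object shaped like an SDK tool call
fake_call = SimpleNamespace(
    id="call_1",
    function=SimpleNamespace(name="get_weather", arguments='{"location": "Tokyo"}'),
)
print(run_tool_calls([fake_call]))
```

In a live loop you would append `response.choices[0].message` to `messages`, extend the list with the returned tool messages, and call `client.chat.completions.create` again so the model can write its final answer.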

Node.js Example

const tools = [
  {
    type: "function",
    function: {
      name: "run_tests",
      description: "Run the test suite and return results",
      parameters: {
        type: "object",
        properties: {
          test_file: { type: "string", description: "Path to test file" },
          verbose: { type: "boolean", description: "Show detailed output" },
        },
        required: ["test_file"],
      },
    },
  },
];

const response = await client.chat.completions.create({
  model: "gemini-3.5-flash",
  messages: [{ role: "user", content: "Run the tests for auth module" }],
  tools,
  tool_choice: "auto",
});

// tool_calls is undefined when the model answers without calling a tool
const toolCalls = response.choices[0].message.tool_calls ?? [];
for (const call of toolCalls) {
  console.log(`Call: ${call.function.name}(${call.function.arguments})`);
}

Function Calling Best Practices

| Practice | Why |
| --- | --- |
| Write clear function descriptions | The model relies on descriptions to decide when to call each tool |
| Use required fields | Prevents the model from omitting critical parameters |
| Keep parameter schemas simple | Complex nested schemas increase error rates |
| Handle parallel tool calls | Gemini 3.5 Flash can call multiple tools in a single response |
| Validate tool call arguments | Always validate before executing; don't trust model output blindly |
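
As an illustration of the last practice, the sketch below checks tool-call arguments against the same `parameters` schema you send to the model. It is deliberately small: it covers only `required`, unknown keys, basic types, and `enum`, and is not a full JSON Schema validator. `validate_arguments` is a hypothetical helper name.

```python
import json

TYPE_MAP = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def validate_arguments(raw_arguments: str, parameters: dict) -> dict:
    """Parse tool-call arguments and check them against a simple schema.

    Raises ValueError on any problem (json.JSONDecodeError is a ValueError too).
    """
    args = json.loads(raw_arguments)
    if not isinstance(args, dict):
        raise ValueError("arguments must be a JSON object")

    properties = parameters.get("properties", {})
    for name in parameters.get("required", []):
        if name not in args:
            raise ValueError(f"missing required argument: {name}")

    for name, value in args.items():
        spec = properties.get(name)
        if spec is None:
            raise ValueError(f"unexpected argument: {name}")
        expected = TYPE_MAP.get(spec.get("type"))
        # bool is a subclass of int in Python, so reject it for non-boolean fields
        bool_mismatch = isinstance(value, bool) and spec.get("type") != "boolean"
        if expected and (bool_mismatch or not isinstance(value, expected)):
            raise ValueError(f"{name} should be of type {spec['type']}")
        if "enum" in spec and value not in spec["enum"]:
            raise ValueError(f"{name} must be one of {spec['enum']}")
    return args


weather_parameters = {
    "type": "object",
    "properties": {
        "location": {"type": "string"},
        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
    },
    "required": ["location"],
}
print(validate_arguments('{"location": "Tokyo", "unit": "celsius"}', weather_parameters))
```

For nested or more complex schemas, a dedicated validation library is likely a better fit than extending this by hand.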

Structured Outputs

For workflows that need machine-readable results, request JSON mode through the response_format parameter:

import json

response = client.chat.completions.create(
    model="gemini-3.5-flash",
    messages=[
        {"role": "system", "content": "Extract structured data from the text. Return valid JSON only."},
        # The email address below is a placeholder
        {"role": "user", "content": "John Smith, age 34, works at Acme Corp as a senior engineer since 2022. Email: john.smith@example.com"}
    ],
    response_format={"type": "json_object"}
)

data = json.loads(response.choices[0].message.content)
print(data)
# Example output (exact keys depend on the model):
# {"name": "John Smith", "age": 34, "company": "Acme Corp", "role": "senior engineer", "start_year": 2022, "email": "john.smith@example.com"}

When to Use Structured Outputs

| Scenario | Format | Why |
| --- | --- | --- |
| Data extraction from documents | JSON mode | Downstream systems need structured data |
| Agent tool responses | JSON mode | Tool orchestrators need parseable output |
| Classification tasks | JSON mode | Need a consistent label field, not free text |
| Code generation | Plain text | Code is already structured; JSON wrapping adds overhead |
| Explanations and chat | Plain text | Natural language reads better without JSON |
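
JSON mode generally gives you parseable JSON, but it does not by itself guarantee a particular set of fields, so it is worth checking the result before handing it to downstream systems. A minimal sketch, assuming the extraction example's fields; `parse_person` and `REQUIRED_FIELDS` are hypothetical names.

```python
import json

# Fields the downstream code relies on (an assumption for this example)
REQUIRED_FIELDS = {"name": str, "age": int, "company": str}


def parse_person(raw: str) -> dict:
    """Parse the model's JSON reply and verify the fields downstream code needs."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model did not return valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected_type):
            raise ValueError(f"missing or mistyped field: {field}")
    return data


print(parse_person('{"name": "John Smith", "age": 34, "company": "Acme Corp"}'))
```

On a `ValueError` you can retry the request, optionally including the error message in the follow-up prompt.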

Coding Agent Workflow

This is the highest-value use case for Gemini 3.5 Flash. Here is a complete coding agent loop:

from openai import OpenAI
import subprocess
import json

client = OpenAI(api_key="your-evolink-api-key", base_url="https://api.evolink.ai/v1")

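def run_tests(test_file: str) -> dict:
    """Run tests and return results."""
    try:
        result = subprocess.run(["python", "-m", "pytest", test_file, "-v", "--tb=short"],
                                capture_output=True, text=True, timeout=60)
    except subprocess.TimeoutExpired:
        # Treat a hung test run as a failure instead of crashing the agent loop
        return {"passed": False, "output": "Test run timed out after 60 seconds."}
    return {"passed": result.returncode == 0, "output": result.stdout + result.stderr}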

def read_file(path: str) -> str:
    with open(path) as f:
        return f.read()

def write_file(path: str, content: str):
    with open(path, "w") as f:
        f.write(content)

# Initial context
module_code = read_file("src/auth.py")
test_code = read_file("tests/test_auth.py")
test_result = run_tests("tests/test_auth.py")

messages = [
    {"role": "system", "content": """You are a coding agent. Your job is to fix failing tests.
Rules:
1. Read the code and test output carefully.
2. Identify the root cause.
3. Output the complete fixed file content.
4. Do not change test expectations — fix the implementation."""},
    {"role": "user", "content": f"""Module code:\n```python\n{module_code}\n```\n\nTest code:\n```python\n{test_code}\n```\n\nTest output:\n```\n{test_result['output']}\n```"""}
]

MAX_ITERATIONS = 15
for i in range(MAX_ITERATIONS):
    response = client.chat.completions.create(
        model="gemini-3.5-flash",
        messages=messages,
        temperature=0.2,
        max_tokens=8192
    )

    reply = response.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})

    # Extract and apply the fix (skipped when the reply contains no Python block)
    if "```python" in reply:
        code_block = reply.split("```python", 1)[1].split("```", 1)[0].lstrip("\n")
        write_file("src/auth.py", code_block)

    # Re-run tests
    test_result = run_tests("tests/test_auth.py")

    if test_result["passed"]:
        print(f"All tests pass after {i + 1} iterations.")
        break

    messages.append({"role": "user", "content": f"Tests still failing:\n```\n{test_result['output']}\n```\nAnalyze the failure and try again."})
else:
    print(f"Failed to fix after {MAX_ITERATIONS} iterations.")

Agent Loop Performance Tips

| Tip | Impact |
| --- | --- |
| Use temperature=0.2 for more consistent fixes | Reduces random variation between iterations |
| Set max_tokens=8192 for code output | Reduces the risk of truncation on large files |
| Include test output in context | Gives the model concrete failure signals |
| Limit iterations (15–20) | Prevents runaway cost if the model is stuck |
| Use context caching | The same code context is sent every iteration, so cache hits can significantly reduce input cost |
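
The loop above extracts the fix with a simple string split, which takes the first Python block in the reply. If the model quotes the broken code before the corrected file, that picks the wrong block. One possible alternative, shown as a sketch, is a regular expression that returns the last fenced Python block. Preferring the last block is an assumption about how the model orders its reply, and `extract_python_block` is a hypothetical helper.

```python
import re
from typing import Optional

FENCE = "`" * 3  # built at runtime so this example contains no literal fence
CODE_BLOCK_RE = re.compile(FENCE + r"python[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_python_block(reply: str) -> Optional[str]:
    """Return the last fenced Python block in a reply, or None if there is none."""
    blocks = CODE_BLOCK_RE.findall(reply)
    return blocks[-1] if blocks else None


reply = (
    "The bug is here:\n" + FENCE + "python\nx = 1 / 0\n" + FENCE + "\n"
    "Fixed file:\n" + FENCE + "python\nx = 1\n" + FENCE + "\n"
)
print(extract_python_block(reply))
```

Returning `None` lets the caller skip the write and ask the model again rather than overwrite the file with partial content.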

Sub-Agent Deployment Pattern

In multi-agent systems, Gemini 3.5 Flash works well as a sub-agent handling specific tasks while a coordinator (Pro or another model) manages the overall workflow:

def coding_sub_agent(task: str, context: str) -> str:
    """Fast coding sub-agent using Gemini 3.5 Flash."""
    response = client.chat.completions.create(
        model="gemini-3.5-flash",
        messages=[
            {"role": "system", "content": "You are a fast coding sub-agent. Complete the task concisely."},
            {"role": "user", "content": f"Context:\n{context}\n\nTask:\n{task}"}
        ],
        temperature=0.2,
        max_tokens=4096
    )
    return response.choices[0].message.content

def reasoning_agent(task: str) -> str:
    """Deep reasoning agent using Gemini 3.1 Pro for complex decisions."""
    response = client.chat.completions.create(
        model="gemini-3.1-pro-preview",
        messages=[
            {"role": "system", "content": "You are a senior architect. Analyze deeply and decide."},
            {"role": "user", "content": task}
        ],
        temperature=0.3,
        max_tokens=4096
    )
    return response.choices[0].message.content

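# Coordinator pattern: Pro decides, Flash executes
plan = reasoning_agent("Design a refactoring plan for the auth module to support OAuth2.")
# parse_subtasks is not defined here: supply your own helper that splits the plan into a list of task strings
subtasks = parse_subtasks(plan)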

results = []
for subtask in subtasks:
    result = coding_sub_agent(subtask, context=module_code)
    results.append(result)
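
The snippet above calls `parse_subtasks` without defining it. One possible implementation, assuming the planner is prompted to return one numbered or bulleted item per line (that output format is an assumption, not something the model guarantees):

```python
import re


def parse_subtasks(plan: str) -> list:
    """Split a plan into subtasks, assuming one numbered or bulleted item per line."""
    subtasks = []
    for line in plan.splitlines():
        # Matches "1. task", "2) task", "- task", or "* task"
        match = re.match(r"^\s*(?:\d+[.)]|[-*])\s+(.+)$", line)
        if match:
            subtasks.append(match.group(1).strip())
    return subtasks


plan = """Refactoring plan:
1. Extract token handling into its own module
2. Add an OAuth2 authorization-code flow
- Update the login tests"""
print(parse_subtasks(plan))
```

Asking the coordinator for JSON output and parsing it with `json.loads` is a sturdier alternative when free-form plans prove too variable.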

When to Use Which Model in a Multi-Agent System

| Agent role | Recommended model | Why |
| --- | --- | --- |
| Coordinator / planner | Gemini 3.1 Pro | Needs deepest reasoning for architecture decisions |
| Coding sub-agent | Gemini 3.5 Flash | Fast iteration, good code quality, low per-call cost |
| Classification / routing | Gemini 3.1 Flash Lite | Cheapest option for simple structured decisions |
| Document analysis | Gemini 3.5 Flash | 1M context + multimodal for PDFs and images |
| Validation / review | Gemini 3.5 Flash or Pro | Depends on how critical the review is |

Cost Analysis: What Agent Loops Actually Cost

Most developers underestimate agent costs because they only look at single-request pricing. Here is a realistic breakdown:

Coding Agent: 20-Iteration Debug Session

| Phase | Input tokens | Output tokens | Input cost | Output cost |
| --- | --- | --- | --- | --- |
| Iteration 1 (full context) | 8,000 | 2,000 | $0.012 | $0.018 |
| Iterations 2–5 (growing context) | 40,000 | 6,000 | $0.060 | $0.054 |
| Iterations 6–10 (large context) | 60,000 | 5,000 | $0.090 | $0.045 |
| Iterations 11–20 (plateau) | 100,000 | 7,000 | $0.150 | $0.063 |
| Total | 208,000 | 20,000 | $0.312 | $0.180 |

Session total: $0.49

With context caching (assume 50% hit rate on repeated code context):

| | Without caching | With caching | Savings |
| --- | --- | --- | --- |
| Input cost | $0.312 | ~$0.187 | 40% |
| Output cost | $0.180 | $0.180 | 0% |
| Total | $0.492 | ~$0.367 | ~25% (varies with cache hit rate) |
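
The caching figures can be reproduced with a simple blended-price calculation. A 40% input saving at a 50% hit rate implies that cached tokens are billed at roughly 20% of the standard input price. That ratio is inferred from the table, not a published rate, so treat it as an assumption and substitute the real cached-token price for your account.

```python
INPUT_PRICE_PER_MTOK = 1.50  # USD per 1M input tokens (standard pricing)


def cached_input_cost(input_tokens: int, hit_rate: float,
                      cached_price_ratio: float = 0.2) -> float:
    """Estimate input cost when a share of the tokens is served from cache.

    cached_price_ratio is the price of a cached token relative to a fresh one.
    The default of 0.2 is inferred from the table above, not a published rate.
    """
    full_cost = input_tokens / 1_000_000 * INPUT_PRICE_PER_MTOK
    blended_ratio = hit_rate * cached_price_ratio + (1 - hit_rate)
    return full_cost * blended_ratio


print(f"${cached_input_cost(208_000, hit_rate=0.5):.3f}")  # $0.187
```

Only the input side shrinks: output tokens are billed at the full rate regardless of cache hits, which is why the total saving is smaller than the input saving.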

Cost Comparison: Same Agent Session Across Models

| Model | Input cost | Output cost | Session total | Quality trade-off |
| --- | --- | --- | --- | --- |
| Gemini 3.5 Flash | $0.312 | $0.180 | $0.49 | Best balance for coding agents |
| Gemini 3.1 Pro | $0.416–$0.832 | $0.240–$0.360 | $0.66–$1.19 | Deeper reasoning, roughly 1.3–2.4x the cost |
| Gemini 3 Flash | $0.104 | $0.060 | $0.16 | Cheaper but weaker coding |
| Gemini 3.1 Flash Lite | $0.052 | $0.030 | $0.08 | Cheapest but limited reasoning |

Cost-Control Strategies

1. Enable Context Caching

If your agent sends the same code context repeatedly, context caching reduces input cost on cache hits. For a 20-iteration coding session, this can meaningfully reduce total cost depending on cache hit rate and prefix length.

2. Use Batch API for Non-Urgent Work

For test generation, bulk extraction, or offline code analysis, the Batch API provides discounts. Latency is higher but cost per token is lower.

3. Set Max Tokens

Always set max_tokens to prevent unexpectedly long outputs that inflate cost:
response = client.chat.completions.create(
    model="gemini-3.5-flash",
    messages=messages,
    max_tokens=4096  # Reasonable limit for code output
)

4. Route by Task Complexity

Don't use one model for everything. Build a routing layer:

def route_request(task_type: str) -> str:
    routing_table = {
        "architecture": "gemini-3.1-pro-preview",      # Deep reasoning
        "coding": "gemini-3.5-flash",           # Fast iteration
        "classification": "gemini-3.1-flash-lite",  # Cheapest
        "review": "gemini-3.5-flash",           # Good balance
        "chat": "gemini-3.5-flash",             # Production default
    }
    return routing_table.get(task_type, "gemini-3.5-flash")

5. Monitor Token Usage

Track input and output tokens per request. EvoLink's dashboard provides real-time usage visibility. Check usage regularly and set budget limits on your application side as needed.
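
On the application side, a small tracker can accumulate usage and stop a runaway loop. OpenAI-compatible responses normally expose `usage.prompt_tokens` and `usage.completion_tokens`. The sketch below assumes EvoLink populates those fields, and `UsageTracker` is a hypothetical class rather than part of any SDK.

```python
from types import SimpleNamespace


class UsageTracker:
    """Accumulate token usage across requests and enforce a spending cap."""

    def __init__(self, budget_usd: float,
                 input_price: float = 1.50, output_price: float = 9.00):
        self.budget_usd = budget_usd
        self.input_price = input_price    # USD per 1M input tokens
        self.output_price = output_price  # USD per 1M output tokens
        self.input_tokens = 0
        self.output_tokens = 0

    @property
    def cost_usd(self) -> float:
        """Total estimated spend so far at standard pricing."""
        return (self.input_tokens * self.input_price
                + self.output_tokens * self.output_price) / 1_000_000

    def record(self, usage) -> None:
        """Add one response's usage; raise once the budget is exceeded."""
        self.input_tokens += usage.prompt_tokens
        self.output_tokens += usage.completion_tokens
        if self.cost_usd > self.budget_usd:
            raise RuntimeError(f"budget exceeded: ${self.cost_usd:.4f}")


tracker = UsageTracker(budget_usd=1.00)
# Stand-in for response.usage; with the SDK you would call tracker.record(response.usage)
tracker.record(SimpleNamespace(prompt_tokens=8_000, completion_tokens=2_000))
print(f"${tracker.cost_usd:.3f}")  # $0.030
```

Calling `record` after every request inside an agent loop gives you a hard stop that does not depend on checking the dashboard.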

6. Truncate Context When Possible

Don't send your entire 1M token context if you only need the last 50K tokens. Trim old conversation turns and keep only relevant context.
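
A minimal sketch of trimming old turns, assuming text-only messages (string `content`) and a crude estimate of about four characters per token. For accurate budgeting, use the token counts the API reports rather than this heuristic. Both function names are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    """Very rough heuristic (about 4 characters per token); not a real tokenizer."""
    return max(1, len(text) // 4)


def trim_history(messages: list, max_tokens: int) -> list:
    """Keep system messages plus the most recent turns that fit the budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    budget = max_tokens - sum(estimate_tokens(m["content"]) for m in system)

    kept = []
    for message in reversed(rest):  # walk from newest to oldest
        cost = estimate_tokens(message["content"])
        if cost > budget:
            break
        kept.append(message)
        budget -= cost
    return system + list(reversed(kept))


history = [
    {"role": "system", "content": "s" * 40},
    {"role": "user", "content": "a" * 400},
    {"role": "assistant", "content": "b" * 400},
    {"role": "user", "content": "c" * 40},
]
print([m["role"] for m in trim_history(history, max_tokens=130)])
```

Dropping whole turns from the oldest end keeps the system prompt and the latest exchange intact, which is usually what an agent loop needs most.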


Common Mistakes and How to Avoid Them

| Mistake | What happens | Fix |
| --- | --- | --- |
| Hard-coding model ID everywhere | Can't switch models without code changes | Store model ID in config; route by task type |
| Not setting max_tokens | Output can be unexpectedly long and expensive | Always set a reasonable output limit |
| Sending full context every iteration without caching | Each request resends the whole history, so cumulative input cost climbs faster than the iteration count | Enable context caching for repeated prefixes |
| Using Flash for tasks that need deep reasoning | Lower accuracy on complex architecture decisions | Route hardest steps to Gemini 3.1 Pro |
| Using Pro for tasks that Flash handles well | Roughly 1.3–2.4x higher cost with marginal quality gain | Default to Flash; upgrade to Pro only when needed |
| Ignoring retry cost in budget estimates | Real cost is higher than single-request estimates | Include retry rate and fallback cost in calculations |
| Not validating function call arguments | Model outputs invalid parameters | Always validate tool call args before execution |
| Treating context window as unlimited | 1M tokens is large but not infinite | Monitor context usage; truncate when approaching limits |

When NOT to Use Gemini 3.5 Flash

Gemini 3.5 Flash is strong but not universal. Use something else when:

| Scenario | Why Flash is wrong | Better choice |
| --- | --- | --- |
| Image/audio/video generation | Flash is text-output only | Specialized generation models |
| Hardest multi-step reasoning | Pro offers deeper reasoning traces | Gemini 3.1 Pro |
| Cheapest possible batch extraction | Flash Lite is 6x cheaper on input | Gemini 3.1 Flash Lite |
| Real-time voice conversation | Flash doesn't support Live API | Gemini models with Live API |
| Computer use | Flash doesn't support computer use | Models with computer use support |

FAQ

What is the model ID for Gemini 3.5 Flash?

The model ID is gemini-3.5-flash. Use this exact string in API requests through EvoLink.

Is Gemini 3.5 Flash free?

Gemini 3.5 Flash has a free tier on the Google Gemini API. The paid standard pricing is $1.50 per 1M input tokens and $9.00 per 1M output tokens. Context caching and Batch API offer reduced rates. For EvoLink pricing, check the product page.

Can I use Gemini 3.5 Flash with the OpenAI SDK?

Yes. Point the OpenAI SDK at https://api.evolink.ai/v1 and set model="gemini-3.5-flash". Works with Python, Node.js, Go, and any other OpenAI-compatible client.

Does Gemini 3.5 Flash support function calling?

Yes. Function calling, structured outputs, code execution, and search grounding are all supported natively. You can define tools and the model will call them when appropriate.

How does Gemini 3.5 Flash compare to Gemini 3 Flash?

Gemini 3.5 Flash is the current-generation Flash model with frontier-level intelligence, stronger agentic and coding performance, and built-in reasoning. Gemini 3 Flash is the previous generation with lower capability but also lower cost ($0.50 vs $1.50 per MTok input).

What is the context window?

1,048,576 input tokens and 65,535 output tokens. This is large enough for full codebases, multi-document analysis, and long agent conversation histories.

Is Gemini 3.5 Flash good for coding agents?

Yes. Google explicitly optimizes it for coding tasks and agentic workflows. It handles code generation, debugging, refactoring, and multi-file analysis at Flash-tier speed. A typical 20-iteration debug session costs about $0.30–$0.50.

Is Gemini 3.5 Flash production-ready?

Yes. Google lists it as generally available (GA) and stable for scaled production use. It is not a preview or experimental model.

How much does a coding agent session cost?

A typical 20-iteration debug session with ~200K total input tokens and ~20K output tokens costs approximately $0.49 at standard pricing, or ~$0.37 with context caching enabled.

Can I switch between Gemini models without changing code?

Yes. With EvoLink, all Gemini models share the same API format. Change the model parameter from "gemini-3.5-flash" to "gemini-3.1-pro-preview" or "gemini-3.1-flash-lite" — no other changes needed.

Does Gemini 3.5 Flash support structured JSON output?

Yes. Use response_format={"type": "json_object"} to get structured JSON responses. This is useful for data extraction, classification, and tool orchestration.
