
How to Use Gemini 3.5 Flash API: Model ID, Pricing, and Code Examples

Gemini 3.5 Flash is Google's latest production-ready Flash model, generally available and stable for scaled production use. It is built for agentic workflows, coding agents, sub-agent deployment, and long-horizon tasks — combining frontier-level intelligence with Flash-tier speed and cost.
This guide covers everything you need to integrate Gemini 3.5 Flash into your application: model ID, pricing, code examples in Python and Node.js, function calling, structured outputs, agent workflow patterns, cost analysis, and how to choose between Flash and Pro.
Quick Reference Card
| Item | Value |
|---|---|
| Model ID | gemini-3.5-flash |
| Status | Generally available (GA), stable for production |
| Input pricing | $1.50 per 1M tokens |
| Output pricing | $9.00 per 1M tokens |
| Context window | 1,048,576 input tokens |
| Max output | 65,535 tokens |
| Input modalities | Text, image, video, audio, PDF |
| Output modalities | Text only |
| Function calling | Supported |
| Structured outputs | Supported |
| Code execution | Supported |
| Search grounding | Supported |
| Context caching | Supported |
| Batch API | Supported |
| Streaming | Supported |
Table of Contents
- When to Use Gemini 3.5 Flash
- Gemini 3.5 Flash vs Other Gemini Models
- Pricing Deep Dive
- Setup: Getting Started in 2 Minutes
- Code Examples
- Function Calling
- Structured Outputs
- Coding Agent Workflow
- Sub-Agent Deployment Pattern
- Cost Analysis: What Agent Loops Actually Cost
- Cost-Control Strategies
- Common Mistakes and How to Avoid Them
- When NOT to Use Gemini 3.5 Flash
- FAQ
When to Use Gemini 3.5 Flash
Gemini 3.5 Flash is not a general-purpose budget model. Google explicitly positions it for specific high-value workloads where speed, cost per iteration, and tool support matter more than maximum reasoning depth.
Best Use Cases
| Use case | Why Gemini 3.5 Flash fits | What to measure |
|---|---|---|
| Coding agents | Fast code generation, debugging, refactoring at Flash-tier speed per iteration | Iterations to fix, cost per session, diff quality |
| Agentic workflows | Native function calling, parallel execution loops, low per-call cost | Tool call accuracy, fallback rate, total workflow cost |
| Sub-agent deployment | Deploy as a sub-agent in multi-agent systems where per-call economics matter | Latency per sub-call, error rate, orchestration overhead |
| Long-horizon tasks | 1M context handles full codebases and multi-document analysis without truncation | Context utilization rate, output quality at high token counts |
| Document processing | PDF, audio, video inputs at unified pricing — no modality surcharges | Extraction accuracy, processing cost per document |
| Production chat | Built-in reasoning at Flash latency for customer-facing applications | Time to first token, user satisfaction, cost per conversation |
Use Case Decision Tree
Ask yourself these questions in order:
- Does the task need the absolute deepest reasoning? If yes → Gemini 3.1 Pro.
- Is this a high-volume, simple task (classification, routing, extraction)? If yes → Gemini 3.1 Flash Lite.
- Does the task involve coding, agents, tools, or long context? If yes → Gemini 3.5 Flash.
- Is this general production chat or summarization? If yes → Gemini 3.5 Flash or Gemini 2.5 Flash (compare on your workload).
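
The four questions above can be sketched as a small function. This is only an illustration of the ordering: the function name and boolean parameters are hypothetical, and it returns the model names used in this guide rather than API model IDs.

```python
def choose_model(needs_deepest_reasoning: bool,
                 high_volume_simple: bool,
                 coding_agents_or_long_context: bool) -> str:
    """Walk the decision questions in order and return a model name."""
    if needs_deepest_reasoning:
        return "Gemini 3.1 Pro"
    if high_volume_simple:
        return "Gemini 3.1 Flash Lite"
    if coding_agents_or_long_context:
        return "Gemini 3.5 Flash"
    # General chat or summarization: compare both candidates on your workload.
    return "Gemini 3.5 Flash or Gemini 2.5 Flash"


print(choose_model(False, False, True))
```

Because the checks run in order, a task that needs the deepest reasoning is routed to Pro even if it also involves coding or tools.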
Gemini 3.5 Flash vs Other Gemini Models
This is the comparison that matters for production routing decisions.
| Feature | Gemini 3.5 Flash | Gemini 3.1 Pro | Gemini 3 Flash | Gemini 3.1 Flash Lite | Gemini 2.5 Flash |
|---|---|---|---|---|---|
| Status | GA, stable | Preview | Preview | Preview | Stable |
| Best for | Agents, coding, long-horizon | Hardest reasoning | General fast workloads | High-volume batch | Production chat |
| Input cost | $1.50/MTok | $2–$4/MTok | $0.50/MTok | $0.25/MTok | $0.30/MTok |
| Output cost | $9.00/MTok | $12–$18/MTok | $3.00/MTok | $1.50/MTok | $2.50/MTok |
| Context | 1M / 65K | 1M / 64K | 1M / 64K | 1M / 64K | 1M / 64K |
| Reasoning | Built-in | Deepest (thinking) | Standard | Lightweight | Standard |
| Function calling | Yes | Yes | Yes | Yes | Yes |
| Code execution | Yes | Yes | Yes | Yes | Yes |
| Production readiness | GA | Preview | Preview | Preview | Stable |
Pricing Deep Dive
Standard Pricing
| Token type | Price per 1M tokens |
|---|---|
| Text input | $1.50 |
| Text output | $9.00 |
| Audio input | Unified with text (no surcharge) |
| Image input | Unified with text (no surcharge) |
| Video input | Unified with text (no surcharge) |
| PDF input | Unified with text (no surcharge) |
Cost Reduction Options
| Method | How it works | Best for |
|---|---|---|
| Context caching | Cache repeated input prefixes; cache hits cost less than fresh input | Agent loops, repeated code context, system prompts |
| Batch API | Submit requests in batches for offline processing at discounted rates | Test generation, bulk extraction, offline analysis |
| EvoLink credits | Pre-purchase credits for volume discounts | Teams with predictable monthly usage |
Real-World Cost Examples
| Scenario | Input tokens | Output tokens | Estimated cost |
|---|---|---|---|
| Single text question | ~500 | ~200 | $0.003 |
| Code review (1 file, ~2K lines) | ~8,000 | ~2,000 | $0.03 |
| Coding agent session (20 iterations) | ~80,000 | ~20,000 | $0.30 |
| Full codebase analysis (500K context) | ~500,000 | ~10,000 | $0.84 |
| PDF document extraction (100 pages) | ~150,000 | ~5,000 | $0.27 |
| 8-hour agent deployment (continuous) | ~2,000,000 | ~500,000 | $7.50 |
These estimates assume standard pricing without caching. With context caching enabled, agent loop costs can be significantly reduced.
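
The estimates in the table follow directly from the per-token prices. A minimal sketch of the arithmetic, assuming the standard prices listed above and no caching or batch discount (the function and constant names are illustrative):

```python
INPUT_PRICE_PER_MTOK = 1.50   # USD per 1M input tokens, from the pricing table
OUTPUT_PRICE_PER_MTOK = 9.00  # USD per 1M output tokens, from the pricing table


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated request cost in USD at standard pricing."""
    return (input_tokens * INPUT_PRICE_PER_MTOK
            + output_tokens * OUTPUT_PRICE_PER_MTOK) / 1_000_000


# Token counts taken from the scenarios in the table above.
scenarios = {
    "Code review": (8_000, 2_000),
    "Coding agent session": (80_000, 20_000),
    "Full codebase analysis": (500_000, 10_000),
    "8-hour agent deployment": (2_000_000, 500_000),
}
for name, (tokens_in, tokens_out) in scenarios.items():
    print(f"{name}: ${estimate_cost(tokens_in, tokens_out):.2f}")
```

Output tokens cost six times as much as input tokens, so output-heavy workloads are dominated by the second term.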
Setup: Getting Started in 2 Minutes
Step 1: Get an EvoLink API Key
Step 2: Install the OpenAI SDK
EvoLink is OpenAI-compatible, so you use the standard OpenAI SDK:
```bash
pip install openai   # Python
npm install openai   # Node.js
```
Step 3: Make Your First Request
```python
from openai import OpenAI

client = OpenAI(
    api_key="your-evolink-api-key",
    base_url="https://api.evolink.ai/v1"
)

response = client.chat.completions.create(
    model="gemini-3.5-flash",
    messages=[
        {"role": "user", "content": "What is Gemini 3.5 Flash best at?"}
    ]
)
print(response.choices[0].message.content)
```
```javascript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: "your-evolink-api-key",
  baseURL: "https://api.evolink.ai/v1",
});

const response = await client.chat.completions.create({
  model: "gemini-3.5-flash",
  messages: [
    { role: "user", content: "What is Gemini 3.5 Flash best at?" },
  ],
});
console.log(response.choices[0].message.content);
```
That's it. No Google-specific SDK needed, no separate auth flow, no Vertex AI setup.
Code Examples
Basic Text Request with System Prompt
```python
response = client.chat.completions.create(
    model="gemini-3.5-flash",
    messages=[
        {"role": "system", "content": "You are a senior software engineer. Be concise and precise."},
        {"role": "user", "content": "Explain the difference between a mutex and a semaphore in 3 sentences."}
    ],
    temperature=0.3,
    max_tokens=512
)
```
Multimodal: Image Analysis
```python
import base64

# Encode the image as base64 so it can be sent inline as a data URL.
with open("screenshot.png", "rb") as f:
    image_data = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gemini-3.5-flash",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What error is shown in this screenshot? Suggest a fix."},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_data}"}}
            ]
        }
    ]
)
```
All multimodal inputs share the same per-token pricing as text, with no audio or video surcharges.
Streaming
For interactive applications where you want tokens to appear as they are generated:
```python
stream = client.chat.completions.create(
    model="gemini-3.5-flash",
    messages=[{"role": "user", "content": "Write a Python function that validates email addresses."}],
    stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
```javascript
const stream = await client.chat.completions.create({
  model: "gemini-3.5-flash",
  messages: [{ role: "user", content: "Write a Python function that validates email addresses." }],
  stream: true,
});
for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content;
  if (content) process.stdout.write(content);
}
```
Multi-Turn Conversation
```python
messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a linked list implementation in Python."},
]

# First turn
response = client.chat.completions.create(model="gemini-3.5-flash", messages=messages)
assistant_message = response.choices[0].message.content
messages.append({"role": "assistant", "content": assistant_message})

# Follow-up
messages.append({"role": "user", "content": "Now add a reverse() method."})
response = client.chat.completions.create(model="gemini-3.5-flash", messages=messages)
print(response.choices[0].message.content)
```
Function Calling
Gemini 3.5 Flash supports native function calling, which is essential for agent workflows. Define tools and let the model decide when to call them.
Python Example
```python
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
                },
                "required": ["location"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "search_database",
            "description": "Search the internal knowledge base",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search query"},
                    "limit": {"type": "integer", "description": "Max results to return"}
                },
                "required": ["query"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="gemini-3.5-flash",
    messages=[{"role": "user", "content": "What's the weather in Tokyo and find articles about climate change?"}],
    tools=tools,
    tool_choice="auto"
)

# The model may call one or both tools, or answer directly.
# tool_calls is None when no tool is called, so fall back to an empty list.
for tool_call in response.choices[0].message.tool_calls or []:
    print(f"Function: {tool_call.function.name}")
    print(f"Arguments: {tool_call.function.arguments}")
```
Node.js Example
```javascript
const tools = [
  {
    type: "function",
    function: {
      name: "run_tests",
      description: "Run the test suite and return results",
      parameters: {
        type: "object",
        properties: {
          test_file: { type: "string", description: "Path to test file" },
          verbose: { type: "boolean", description: "Show detailed output" },
        },
        required: ["test_file"],
      },
    },
  },
];

const response = await client.chat.completions.create({
  model: "gemini-3.5-flash",
  messages: [{ role: "user", content: "Run the tests for auth module" }],
  tools,
  tool_choice: "auto",
});

// tool_calls is absent when the model answers without calling a tool.
const toolCalls = response.choices[0].message.tool_calls ?? [];
for (const call of toolCalls) {
  console.log(`Call: ${call.function.name}(${call.function.arguments})`);
}
```
Function Calling Best Practices
| Practice | Why |
|---|---|
| Write clear function descriptions | The model relies on descriptions to decide when to call each tool |
| Use required fields | Prevents the model from omitting critical parameters |
| Keep parameter schemas simple | Complex nested schemas increase error rates |
| Handle parallel tool calls | Gemini 3.5 Flash can call multiple tools in a single response |
| Validate tool call arguments | Always validate before executing — don't trust model output blindly |
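
The last practice in the table can be sketched as follows. This is a deliberately small check, assuming tool arguments arrive as a JSON string (as in the examples above) and that the tool's parameters schema is available; `validate_tool_args` and `WEATHER_PARAMS` are hypothetical names, and the function covers only required keys, unknown keys, basic types, and enums, not full JSON Schema.

```python
import json

# Same shape as the get_weather parameters schema shown earlier.
WEATHER_PARAMS = {
    "type": "object",
    "properties": {
        "location": {"type": "string"},
        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
    },
    "required": ["location"],
}

_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def validate_tool_args(raw_arguments: str, parameters: dict) -> dict:
    """Parse a tool call's JSON arguments and check them against its schema.

    Raises ValueError on any problem (json.JSONDecodeError is a subclass).
    """
    args = json.loads(raw_arguments)
    if not isinstance(args, dict):
        raise ValueError("arguments must be a JSON object")
    properties = parameters.get("properties", {})
    for key in parameters.get("required", []):
        if key not in args:
            raise ValueError(f"missing required argument: {key}")
    for key, value in args.items():
        if key not in properties:
            raise ValueError(f"unexpected argument: {key}")
        spec = properties[key]
        expected = _TYPES.get(spec.get("type"))
        # bool is a subclass of int in Python, so reject it for numeric fields.
        bool_as_number = isinstance(value, bool) and spec.get("type") in ("integer", "number")
        if expected and (not isinstance(value, expected) or bool_as_number):
            raise ValueError(f"wrong type for {key}")
        if "enum" in spec and value not in spec["enum"]:
            raise ValueError(f"invalid value for {key}: {value!r}")
    return args


print(validate_tool_args('{"location": "Tokyo", "unit": "celsius"}', WEATHER_PARAMS))
```

Calling this before dispatching to the real function means a malformed call fails with a clear error that can be sent back to the model, rather than failing inside the tool.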
Structured Outputs
For workflows that need machine-readable results, use JSON mode or response format:
```python
import json

response = client.chat.completions.create(
    model="gemini-3.5-flash",
    messages=[
        {"role": "system", "content": "Extract structured data from the text. Return valid JSON only."},
        {"role": "user", "content": "John Smith, age 34, works at Acme Corp as a senior engineer since 2022. Email: [email protected]"}
    ],
    response_format={"type": "json_object"}
)

data = json.loads(response.choices[0].message.content)
print(data)
# Example output:
# {"name": "John Smith", "age": 34, "company": "Acme Corp", "role": "senior engineer", "start_year": 2022, "email": "[email protected]"}
```
When to Use Structured Outputs
| Scenario | Format | Why |
|---|---|---|
| Data extraction from documents | JSON mode | Downstream systems need structured data |
| Agent tool responses | JSON mode | Tool orchestrators need parseable output |
| Classification tasks | JSON mode | Need a consistent label field, not free text |
| Code generation | Plain text | Code is already structured; JSON wrapping adds overhead |
| Explanations and chat | Plain text | Natural language reads better without JSON |
Coding Agent Workflow
This is the highest-value use case for Gemini 3.5 Flash. Here is a complete coding agent loop:
```python
from openai import OpenAI
import subprocess
import json

client = OpenAI(api_key="your-evolink-api-key", base_url="https://api.evolink.ai/v1")


def run_tests(test_file: str) -> dict:
    """Run tests and return results."""
    result = subprocess.run(["python", "-m", "pytest", test_file, "-v", "--tb=short"],
                            capture_output=True, text=True, timeout=60)
    return {"passed": result.returncode == 0, "output": result.stdout + result.stderr}


def read_file(path: str) -> str:
    with open(path) as f:
        return f.read()


def write_file(path: str, content: str):
    with open(path, "w") as f:
        f.write(content)


# Initial context
module_code = read_file("src/auth.py")
test_code = read_file("tests/test_auth.py")
test_result = run_tests("tests/test_auth.py")

messages = [
    {"role": "system", "content": """You are a coding agent. Your job is to fix failing tests.
Rules:
1. Read the code and test output carefully.
2. Identify the root cause.
3. Output the complete fixed file content.
4. Do not change test expectations; fix the implementation."""},
    {"role": "user", "content": f"""Module code:\n```python\n{module_code}\n```\n\nTest code:\n```python\n{test_code}\n```\n\nTest output:\n```\n{test_result['output']}\n```"""}
]

MAX_ITERATIONS = 15
for i in range(MAX_ITERATIONS):
    response = client.chat.completions.create(
        model="gemini-3.5-flash",
        messages=messages,
        temperature=0.2,
        max_tokens=8192
    )
    reply = response.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})

    # Extract and apply the fix
    if "```python" in reply:
        code_block = reply.split("```python")[1].split("```")[0]
        write_file("src/auth.py", code_block)

    # Re-run tests
    test_result = run_tests("tests/test_auth.py")
    if test_result["passed"]:
        print(f"All tests pass after {i + 1} iterations.")
        break
    messages.append({"role": "user", "content": f"Tests still failing:\n```\n{test_result['output']}\n```\nAnalyze the failure and try again."})
else:
    # The for-else branch runs only if the loop never hit `break`.
    print(f"Failed to fix after {MAX_ITERATIONS} iterations.")
```
Agent Loop Performance Tips
| Tip | Impact |
|---|---|
| Use temperature=0.2 for more repeatable fixes | Reduces random variation between iterations |
| Set max_tokens=8192 for code output | Prevents truncation on large files |
| Include test output in context | Gives the model concrete failure signals |
| Limit iterations (15–20) | Prevents runaway cost if the model is stuck |
| Use context caching | Same code context sent every iteration — cache hits can significantly reduce input cost |
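
One fragile spot in an agent loop like the one above is pulling the code out of the model's reply. As a hedged alternative to splitting on the fence marker, the sketch below uses a regular expression and returns None when no block is found, so the caller can ask the model to resend instead of silently skipping the write. The name `extract_python_block` is hypothetical.

```python
import re
from typing import Optional

# Matches the first fenced block, with or without a "python"/"py" language tag.
CODE_BLOCK = re.compile(r"```(?:python|py)?[ \t]*\n(.*?)```", re.DOTALL)


def extract_python_block(reply: str) -> Optional[str]:
    """Return the first fenced code block in reply, or None if there is none."""
    match = CODE_BLOCK.search(reply)
    if match is None:
        return None
    code = match.group(1).strip("\n")
    # Normalize to exactly one trailing newline; treat an empty block as missing.
    return code + "\n" if code else None


reply = "Here is the fix:\n```python\ndef f():\n    return 1\n```\nDone."
print(extract_python_block(reply))
```

Returning None makes the "no code in reply" case explicit, which is useful when deciding whether an iteration should count against the limit.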
Sub-Agent Deployment Pattern
In multi-agent systems, Gemini 3.5 Flash works well as a sub-agent handling specific tasks while a coordinator (Pro or another model) manages the overall workflow:
```python
def coding_sub_agent(task: str, context: str) -> str:
    """Fast coding sub-agent using Gemini 3.5 Flash."""
    response = client.chat.completions.create(
        model="gemini-3.5-flash",
        messages=[
            {"role": "system", "content": "You are a fast coding sub-agent. Complete the task concisely."},
            {"role": "user", "content": f"Context:\n{context}\n\nTask:\n{task}"}
        ],
        temperature=0.2,
        max_tokens=4096
    )
    return response.choices[0].message.content


def reasoning_agent(task: str) -> str:
    """Deep reasoning agent using Gemini 3.1 Pro for complex decisions."""
    response = client.chat.completions.create(
        model="gemini-3.1-pro-preview",
        messages=[
            {"role": "system", "content": "You are a senior architect. Analyze deeply and decide."},
            {"role": "user", "content": task}
        ],
        temperature=0.3,
        max_tokens=4096
    )
    return response.choices[0].message.content


# Coordinator pattern: Pro decides, Flash executes.
# `client` and `module_code` come from the earlier examples;
# parse_subtasks() is a placeholder for your own plan-splitting logic.
plan = reasoning_agent("Design a refactoring plan for the auth module to support OAuth2.")
subtasks = parse_subtasks(plan)
results = []
for subtask in subtasks:
    result = coding_sub_agent(subtask, context=module_code)
    results.append(result)
```
When to Use Which Model in a Multi-Agent System
| Agent role | Recommended model | Why |
|---|---|---|
| Coordinator / planner | Gemini 3.1 Pro | Needs deepest reasoning for architecture decisions |
| Coding sub-agent | Gemini 3.5 Flash | Fast iteration, good code quality, low per-call cost |
| Classification / routing | Gemini 3.1 Flash Lite | Cheapest option for simple structured decisions |
| Document analysis | Gemini 3.5 Flash | 1M context + multimodal for PDFs and images |
| Validation / review | Gemini 3.5 Flash or Pro | Depends on how critical the review is |
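
The coordinator example calls parse_subtasks without defining it. One possible implementation, assuming the planner returns its plan as a numbered or bulleted list with one subtask per line, is sketched below; that output format is an assumption, and a structured JSON plan would be more robust in practice.

```python
import re

# A list item: "1. text", "2) text", "- text", or "* text".
_ITEM = re.compile(r"^\s*(?:\d+[.)]|[-*])\s+(.*\S)\s*$")


def parse_subtasks(plan: str) -> list[str]:
    """Return the text of each numbered or bulleted line in the plan."""
    subtasks = []
    for line in plan.splitlines():
        match = _ITEM.match(line)
        if match:
            subtasks.append(match.group(1))
    return subtasks


plan = "Plan:\n1. Add OAuth2 client\n2) Update token storage\n- Write tests\nNotes follow."
print(parse_subtasks(plan))
```

Lines that are not list items, such as headings or closing notes, are ignored, so free-form commentary in the plan does not become a subtask.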
Cost Analysis: What Agent Loops Actually Cost
Most developers underestimate agent costs because they only look at single-request pricing. Here is a realistic breakdown:
Coding Agent: 20-Iteration Debug Session
| Phase | Input tokens | Output tokens | Input cost | Output cost |
|---|---|---|---|---|
| Iteration 1 (full context) | 8,000 | 2,000 | $0.012 | $0.018 |
| Iterations 2–5 (growing context) | 40,000 | 6,000 | $0.060 | $0.054 |
| Iterations 6–10 (large context) | 60,000 | 5,000 | $0.090 | $0.045 |
| Iterations 11–20 (plateau) | 100,000 | 7,000 | $0.150 | $0.063 |
| Total | 208,000 | 20,000 | $0.312 | $0.180 |
| Session total | | | | $0.49 |
With context caching (assume 50% hit rate on repeated code context):
| | Without caching | With caching | Savings |
|---|---|---|---|
| Input cost | $0.312 | ~$0.187 | 40% |
| Output cost | $0.180 | $0.180 | 0% |
| Total | $0.492 | ~$0.367 | ~25% |
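
The caching figures can be reproduced with a short calculation. Note the assumption: a 40% input saving at a 50% hit rate is consistent with cached tokens being billed at roughly 20% of the standard input rate. That ratio is inferred from the numbers in this guide, not a published price, so check current pricing before relying on it.

```python
def cached_input_cost(input_cost: float, hit_rate: float,
                      cached_price_ratio: float) -> float:
    """Input cost when a fraction of tokens is served from cache at a discount."""
    uncached = input_cost * (1 - hit_rate)
    cached = input_cost * hit_rate * cached_price_ratio
    return uncached + cached


input_cost, output_cost = 0.312, 0.180  # session totals from the table above
with_cache = cached_input_cost(input_cost, hit_rate=0.5, cached_price_ratio=0.2)
total = with_cache + output_cost

print(f"Input with caching: ${with_cache:.3f}")   # $0.187
print(f"Session total:      ${total:.3f}")        # $0.367
print(f"Total savings:      {1 - total / (input_cost + output_cost):.0%}")  # 25%
```

Caching only touches input cost, so the saving on the whole session is smaller than the saving on input alone.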
Cost Comparison: Same Agent Session Across Models
| Model | Input cost | Output cost | Session total | Quality trade-off |
|---|---|---|---|---|
| Gemini 3.5 Flash | $0.312 | $0.180 | $0.49 | Best balance for coding agents |
| Gemini 3.1 Pro | $0.416–$0.832 | $0.240–$0.360 | $0.66–$1.19 | Deeper reasoning, roughly 1.3–2.4x cost |
| Gemini 3 Flash | $0.104 | $0.060 | $0.16 | Cheaper but weaker coding |
| Gemini 3.1 Flash Lite | $0.052 | $0.030 | $0.08 | Cheapest but limited reasoning |
Cost-Control Strategies
1. Enable Context Caching
If your agent sends the same code context repeatedly, context caching reduces input cost on cache hits. For a 20-iteration coding session, this can meaningfully reduce total cost depending on cache hit rate and prefix length.
2. Use Batch API for Non-Urgent Work
For test generation, bulk extraction, or offline code analysis, the Batch API provides discounts. Latency is higher but cost per token is lower.
3. Set Max Tokens
Set max_tokens to prevent unexpectedly long outputs that inflate cost:
```python
response = client.chat.completions.create(
    model="gemini-3.5-flash",
    messages=messages,
    max_tokens=4096  # Reasonable limit for code output
)
```
4. Route by Task Complexity
Don't use one model for everything. Build a routing layer:
```python
def route_request(task_type: str) -> str:
    routing_table = {
        "architecture": "gemini-3.1-pro-preview",    # Deep reasoning
        "coding": "gemini-3.5-flash",                # Fast iteration
        "classification": "gemini-3.1-flash-lite",   # Cheapest
        "review": "gemini-3.5-flash",                # Good balance
        "chat": "gemini-3.5-flash",                  # Production default
    }
    # Unknown task types fall back to the production default.
    return routing_table.get(task_type, "gemini-3.5-flash")
```
5. Monitor Token Usage
Track input and output tokens per request. EvoLink's dashboard provides real-time usage visibility. Check usage regularly and set budget limits on your application side as needed.
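An application-side budget check could look like the sketch below. It assumes you feed it the token counts from each response (with the OpenAI SDK these are `response.usage.prompt_tokens` and `response.usage.completion_tokens`) and uses the standard prices from this guide; `UsageTracker` is a hypothetical helper, not part of any SDK.

```python
from dataclasses import dataclass

INPUT_PRICE_PER_MTOK = 1.50   # USD, standard pricing from this guide
OUTPUT_PRICE_PER_MTOK = 9.00


@dataclass
class UsageTracker:
    """Accumulates token counts and compares estimated spend to a budget."""
    budget_usd: float
    input_tokens: int = 0
    output_tokens: int = 0

    def record(self, prompt_tokens: int, completion_tokens: int) -> None:
        self.input_tokens += prompt_tokens
        self.output_tokens += completion_tokens

    @property
    def cost_usd(self) -> float:
        return (self.input_tokens * INPUT_PRICE_PER_MTOK
                + self.output_tokens * OUTPUT_PRICE_PER_MTOK) / 1_000_000

    def over_budget(self) -> bool:
        return self.cost_usd > self.budget_usd


tracker = UsageTracker(budget_usd=0.50)
# In a real loop: tracker.record(response.usage.prompt_tokens,
#                                response.usage.completion_tokens)
tracker.record(8_000, 2_000)
print(f"Spent so far: ${tracker.cost_usd:.3f}, over budget: {tracker.over_budget()}")
```

Checking `over_budget()` at the top of each agent iteration gives a hard stop that does not depend on a dashboard being watched.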
6. Truncate Context When Possible
Don't send your entire 1M token context if you only need the last 50K tokens. Trim old conversation turns and keep only relevant context.
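A minimal sketch of trimming old turns, assuming messages use the chat format shown throughout this guide. The helper name `trim_history` is hypothetical, and it counts messages rather than tokens. If your history contains tool calls, trim in whole call/result pairs so a tool result is never left without the call that produced it.

```python
def trim_history(messages: list[dict], max_recent: int) -> list[dict]:
    """Keep all system messages plus the last max_recent other messages."""
    system = [m for m in messages if m["role"] == "system"]
    others = [m for m in messages if m["role"] != "system"]
    recent = others[-max_recent:] if max_recent > 0 else []
    return system + recent


history = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a linked list implementation in Python."},
    {"role": "assistant", "content": "(first answer)"},
    {"role": "user", "content": "Now add a reverse() method."},
]
trimmed = trim_history(history, max_recent=1)
print([m["role"] for m in trimmed])  # ['system', 'user']
```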
Common Mistakes and How to Avoid Them
| Mistake | What happens | Fix |
|---|---|---|
| Hard-coding model ID everywhere | Can't switch models without code changes | Store model ID in config; route by task type |
| Not setting max_tokens | Output can be unexpectedly long and expensive | Always set a reasonable output limit |
| Sending full context every iteration without caching | Input cost grows linearly with iterations | Enable context caching for repeated prefixes |
| Using Flash for tasks that need deep reasoning | Lower accuracy on complex architecture decisions | Route hardest steps to Gemini 3.1 Pro |
| Using Pro for tasks that Flash handles well | Roughly 1.3–2.4x higher session cost with marginal quality gain | Default to Flash; upgrade to Pro only when needed |
| Ignoring retry cost in budget estimates | Real cost is higher than single-request estimates | Include retry rate and fallback cost in calculations |
| Not validating function call arguments | Model outputs invalid parameters | Always validate tool call args before execution |
| Treating context window as unlimited | 1M tokens is large but not infinite | Monitor context usage; truncate when approaching limits |
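
To make the retry-cost row concrete, here is a rough way to fold retries into a budget estimate. It assumes each attempt fails independently with the same probability and that every attempt costs the same, which is a simplification; the function name and the 10% retry rate are illustrative, not measured values.

```python
def expected_cost_with_retries(base_cost: float, retry_rate: float,
                               max_retries: int) -> float:
    """Expected cost when a failed attempt is retried up to max_retries times.

    Attempt k+1 only happens if the first k attempts all failed, which has
    probability retry_rate ** k, so expected attempts is the sum of those terms.
    """
    expected_attempts = sum(retry_rate ** k for k in range(max_retries + 1))
    return base_cost * expected_attempts


# A $0.49 session with a 10% retry rate and at most 2 retries.
print(f"${expected_cost_with_retries(0.49, 0.10, 2):.3f}")
```

Fallback to a more expensive model can be added the same way, by weighting its session cost by the probability that all Flash attempts fail.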
When NOT to Use Gemini 3.5 Flash
Gemini 3.5 Flash is strong but not universal. Use something else when:
| Scenario | Why Flash is wrong | Better choice |
|---|---|---|
| Image/audio/video generation | Flash is text-output only | Specialized generation models |
| Hardest multi-step reasoning | Pro offers deeper reasoning traces | Gemini 3.1 Pro |
| Cheapest possible batch extraction | Flash Lite is 6x cheaper on input | Gemini 3.1 Flash Lite |
| Real-time voice conversation | Flash doesn't support Live API | Gemini models with Live API |
| Computer use | Flash doesn't support computer use | Models with computer use support |
FAQ
What is the model ID for Gemini 3.5 Flash?
gemini-3.5-flash. Use this exact string in API requests through EvoLink.
Is Gemini 3.5 Flash free?
Can I use Gemini 3.5 Flash with the OpenAI SDK?
Yes. Set the base URL to https://api.evolink.ai/v1 and set model="gemini-3.5-flash". Works with Python, Node.js, Go, and any other OpenAI-compatible client.
Does Gemini 3.5 Flash support function calling?
Yes. Function calling, structured outputs, code execution, and search grounding are all supported natively. You can define tools and the model will call them when appropriate.
How does Gemini 3.5 Flash compare to Gemini 3 Flash?
Gemini 3.5 Flash is the current-generation Flash model with frontier-level intelligence, stronger agentic and coding performance, and built-in reasoning. Gemini 3 Flash is the previous generation with lower capability but also lower cost ($0.50 vs $1.50 per MTok input).
What is the context window?
1,048,576 input tokens and 65,535 output tokens. This is large enough for full codebases, multi-document analysis, and long agent conversation histories.
Is Gemini 3.5 Flash good for coding agents?
Yes. Google explicitly optimizes it for coding tasks and agentic workflows. It handles code generation, debugging, refactoring, and multi-file analysis at Flash-tier speed. A typical 20-iteration debug session costs about $0.30–$0.50.
Is Gemini 3.5 Flash production-ready?
Yes. Google lists it as generally available (GA) and stable for scaled production use. It is not a preview or experimental model.
How much does a coding agent session cost?
A typical 20-iteration debug session with ~200K total input tokens and ~20K output tokens costs approximately $0.49 at standard pricing, or ~$0.37 with context caching enabled.
Can I switch between Gemini models without changing code?
Yes. Change the model parameter from "gemini-3.5-flash" to "gemini-3.1-pro-preview" or "gemini-3.1-flash-lite"; no other changes are needed.
Does Gemini 3.5 Flash support structured JSON output?
Yes. Set response_format={"type": "json_object"} to get structured JSON responses. This is useful for data extraction, classification, and tool orchestration.
Next Steps
- Gemini 3.5 Flash API — Full Product Page — Live pricing, status, and model details
- Compare All Gemini Models — Side-by-side comparison of 7 Gemini routes
- Gemini 3.5 Flash Release Notes — What changed from preview to GA
- EvoLink API Docs — Full API reference and integration guides
- Create API Key — Start building in 2 minutes


