Gemini 3.5 Flash Benchmarks: Agentic and Coding Scores Explained

Gemini 3.5 Flash is not a normal speed-tier update. Google announced it on May 19, 2026 as the first model in the Gemini 3.5 family, and the positioning is unusually direct: Flash is now being sold as a serious model for long-horizon agents, iterative coding, multimodal reasoning, and production workflows where latency still matters. That combination is why the benchmark table is worth reading carefully. The important story is not any single score in isolation. It is that a Flash-class model is competitive with flagship models on several agentic and coding evaluations while keeping the latency and cost profile of a faster model. For product teams, developer-tool companies, and AI SaaS founders, that changes how model selection conversations should happen.

The benchmark snapshot

Google's published table reports Gemini 3.5 Flash across coding, agentic tool use, UI control, expert tasks, multimodal understanding, long context, and reasoning. The headline coding marks are 76.2 percent on Terminal-Bench 2.1 and 55.1 percent on SWE-Bench Pro Public. The headline agentic mark is 83.6 percent on MCP Atlas, with 56.5 percent on Toolathlon and 78.4 percent on OSWorld-Verified. It also posts 57.9 percent on Finance Agent v2, 1656 Elo on GDPval-AA, 84.2 percent on CharXiv Reasoning, and 83.6 percent on MMMU-Pro. These numbers do not all mean the same thing, but together they show a model tuned for tool-using work, not only chat response quality.

Benchmark graph: Terminal-Bench 2.1 agentic terminal coding. Published benchmark scores from Google's Gemini 3.5 Flash model table. Higher is better.

Model	Score
GPT-5.5	78.2%
Gemini 3.5 Flash	76.2%
Gemini 3.1 Pro	70.3%
Claude Opus 4.7	66.1%
Gemini 3 Flash	58.0%

Coding marks: strong terminal work, harder repo work

Terminal-Bench 2.1 is the most flattering coding number for Gemini 3.5 Flash. It measures agentic terminal coding through the Terminus-2 harness, which means the model must operate through a realistic command-line loop rather than simply answer a programming question. A 76.2 percent score puts it ahead of Gemini 3.1 Pro at 70.3 percent and Claude Opus 4.7 at 66.1 percent, while still slightly behind GPT-5.5 at 78.2 percent. That matters because terminal coding is close to how AI coding agents actually work: inspect files, run commands, change code, react to errors, and keep the task state alive across multiple steps.

Terminal-Bench 2.1: Gemini 3.5 Flash scores 76.2 percent.
Gemini 3 Flash scores 58.0 percent, so the published jump is 18.2 percentage points.
Gemini 3.1 Pro scores 70.3 percent, meaning Flash beats Pro on this particular agentic coding benchmark.
GPT-5.5 remains slightly ahead on this benchmark at 78.2 percent.
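
The gaps quoted above are percentage-point differences, which are easy to confuse with relative gains. As a minimal sketch, the snippet below recomputes both from the Terminal-Bench 2.1 scores given in this article. The function names `point_gap` and `relative_gain` are illustrative only.

```python
# Terminal-Bench 2.1 scores as quoted in this article from Google's table.
scores = {
    "GPT-5.5": 78.2,
    "Gemini 3.5 Flash": 76.2,
    "Gemini 3.1 Pro": 70.3,
    "Claude Opus 4.7": 66.1,
    "Gemini 3 Flash": 58.0,
}


def point_gap(model: str, baseline: str) -> float:
    """Percentage-point difference between two models (one decimal)."""
    return round(scores[model] - scores[baseline], 1)


def relative_gain(model: str, baseline: str) -> float:
    """Relative improvement over the baseline, in percent (one decimal)."""
    return round(100 * (scores[model] - scores[baseline]) / scores[baseline], 1)


if __name__ == "__main__":
    for baseline in ("Gemini 3 Flash", "Gemini 3.1 Pro", "GPT-5.5"):
        gap = point_gap("Gemini 3.5 Flash", baseline)
        print(f"Gemini 3.5 Flash vs {baseline}: {gap:+.1f} points")
```

The 18.2-point jump over Gemini 3 Flash corresponds to a relative gain of roughly 31 percent, so it is worth stating which of the two figures is meant when quoting the result internally.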

SWE-Bench Pro is more cautious

SWE-Bench Pro Public tells a more restrained story. Gemini 3.5 Flash reaches 55.1 percent, ahead of Gemini 3 Flash at 49.6 percent and a touch ahead of Gemini 3.1 Pro at 54.2 percent. But Claude Opus 4.7 scores 64.3 percent and GPT-5.5 scores 58.6 percent, so Gemini 3.5 Flash is not the outright coding champion if your definition of coding is broad repository repair under a benchmark harness. The useful interpretation is narrower and more practical: Flash has become strong enough that many coding-agent workflows can start on a cheaper, faster model and escalate only when the problem becomes deeply architectural or starts to resemble the broad repository repair that SWE-Bench Pro measures.
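
One way to picture the "start cheap, escalate when needed" pattern is the sketch below. It is not a vendor API: `fast`, `strong`, and `verify` are hypothetical callables you would wire to your own model clients and checks (tests, lint, review rules), and the retry count is an arbitrary assumption.

```python
from typing import Callable, Optional, Tuple

# Hypothetical types: a "model" takes a task description and returns a
# candidate patch (or None if it gives up). A "verifier" returns True when
# the patch passes your own checks.
Model = Callable[[str], Optional[str]]
Verifier = Callable[[str], bool]


def solve_with_escalation(
    task: str,
    fast: Model,
    strong: Model,
    verify: Verifier,
    fast_attempts: int = 2,
) -> Tuple[str, Optional[str]]:
    """Try the cheaper model first; escalate only if no attempt verifies."""
    for _ in range(fast_attempts):
        patch = fast(task)
        if patch is not None and verify(patch):
            return "fast", patch
    # The fast tier did not produce a verified patch, so pay for the stronger one.
    patch = strong(task)
    if patch is not None and verify(patch):
        return "strong", patch
    return "unsolved", None
```

The returned tier label makes it possible to log how often escalation actually happens, which is the number that decides whether the cheaper tier is paying for itself.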

Benchmark	Gemini 3.5 Flash	Table leader	Practical reading
Terminal-Bench 2.1	76.2%	GPT-5.5 leads at 78.2%	Excellent fit for terminal-style agent loops
SWE-Bench Pro Public	55.1%	Claude Opus 4.7 leads at 64.3%	Good, but not the top repo-repair score
MCP Atlas	83.6%	Highest score in Google's table	Very strong for multi-step MCP workflows
OSWorld-Verified	78.4%	GPT-5.5 leads at 78.7%	Near-frontier UI control score

Agentic marks: MCP Atlas is the real headline

The most strategically important score is probably MCP Atlas at 83.6 percent. MCP Atlas focuses on multi-step workflows using the Model Context Protocol, so it is closer to the direction developer tools are moving: models that can inspect external systems, call tools, preserve state, and work through a task with structured context. Gemini 3.5 Flash beats Gemini 3 Flash by 21.6 points on this benchmark and leads the comparison set that Google published, including Gemini 3.1 Pro, Claude Sonnet 4.6, Claude Opus 4.7, and GPT-5.5. If you are building an agent that lives inside product data, Figma files, codebases, CRMs, databases, or internal tools, this is the number to notice first.
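
To make "multi-step workflow with preserved state" concrete, here is a small sketch. It builds requests in the JSON-RPC 2.0 shape that MCP uses for `tools/call`, but everything else is an assumption for illustration: the two tools, their arguments, and the in-memory `TOOLS` registry stand in for a real MCP server and are not part of MCP Atlas.

```python
from itertools import count
from typing import Any, Callable, Dict, List, Tuple

_ids = count(1)  # monotonically increasing JSON-RPC request ids


def tool_call_request(name: str, arguments: Dict[str, Any]) -> Dict[str, Any]:
    """Build a JSON-RPC 2.0 request in the shape MCP uses for tools/call."""
    return {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


# Hypothetical in-memory tools standing in for a real MCP server.
TOOLS: Dict[str, Callable[[Dict[str, Any]], Dict[str, Any]]] = {
    "lookup_customer": lambda args: {"customer_id": 42, "email": args["email"]},
    "list_open_tickets": lambda args: {
        "tickets": ["T-1", "T-2"] if args["customer_id"] == 42 else []
    },
}


def dispatch(request: Dict[str, Any]) -> Dict[str, Any]:
    """Route a tools/call request to the matching fake tool."""
    params = request["params"]
    return TOOLS[params["name"]](params["arguments"])


def run_workflow(email: str) -> Tuple[Dict[str, Any], List[Dict[str, Any]]]:
    """Two dependent steps: the second call needs state from the first."""
    state: Dict[str, Any] = {"email": email}
    transcript: List[Dict[str, Any]] = []

    request = tool_call_request("lookup_customer", {"email": state["email"]})
    transcript.append(request)
    state.update(dispatch(request))

    request = tool_call_request(
        "list_open_tickets", {"customer_id": state["customer_id"]}
    )
    transcript.append(request)
    state.update(dispatch(request))
    return state, transcript
```

The point of the sketch is the dependency between steps: if the model loses `customer_id` between calls, the second request cannot be formed correctly, which is the kind of failure a multi-step benchmark is designed to surface.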

Benchmark graph: MCP Atlas multi-step workflow benchmark. Agentic workflow scores from Google's published benchmark comparison.

Model	Score
Gemini 3.5 Flash	83.6%
Claude Opus 4.7	79.1%
Gemini 3.1 Pro	78.2%
GPT-5.5	75.3%
Claude Sonnet 4.6	69.5%
Gemini 3 Flash	62.0%

Toolathlon, OSWorld, and Finance Agent

The agent story gets stronger when you look beyond MCP Atlas. Gemini 3.5 Flash scores 56.5 percent on Toolathlon, compared with 49.4 percent for Gemini 3 Flash and 55.6 percent for GPT-5.5. On OSWorld-Verified, it reaches 78.4 percent, nearly level with GPT-5.5 at 78.7 percent and slightly above Claude Opus 4.7 at 78.0 percent. On Finance Agent v2, it leads the published comparison at 57.9 percent. These are different tasks, but they share a common shape: the model needs to coordinate tools, read context, make intermediate decisions, and avoid drifting away from the user's objective.

Area	Benchmark	Gemini 3.5 Flash	Why it matters
Tool use	Toolathlon	56.5%	Measures general tool use in realistic tasks
UI control	OSWorld-Verified	78.4%	Tests agentic computer-use style workflows
Expert work	Finance Agent v2	57.9%	Evaluates financial analysis and decision-making
Knowledge work	GDPval-AA	1656 Elo	Tracks economically valuable work against other systems

Where Gemini 3.5 Flash does not lead

The benchmark table is strong, but it is not a clean sweep. GPT-5.5 leads Terminal-Bench 2.1, OSWorld-Verified, GDPval-AA, Blueprint-Bench 2, MRCR v2 at 128k, and ARC-AGI-2 in the published comparison. Claude Opus 4.7 leads SWE-Bench Pro Public and Humanity's Last Exam. Gemini 3.1 Pro still leads Gemini 3.5 Flash on some reasoning and long-context measures, including Humanity's Last Exam and ARC-AGI-2. That is not a failure. It simply means Gemini 3.5 Flash is best understood as a high-throughput agentic work model, not as the single best model for every reasoning benchmark.

Benchmark graph: Selected non-leading Gemini 3.5 Flash comparisons. A few benchmarks where another model is ahead in Google's published table.

Benchmark	Leader	Gemini 3.5 Flash
MRCR 128k	GPT-5.5 at 94.8%	77.3%
SWE-Bench Pro	Claude Opus 4.7 at 64.3%	55.1%

Methodology caveats

Google's evaluation methodology matters. The Gemini scores are reported as pass@1 unless otherwise noted, with no majority voting or parallel test-time compute for single-attempt settings. The methodology document also says smaller benchmarks are averaged across multiple trials to reduce variance. Some results are self-computed by Google, some come from public leaderboards, and non-Gemini model numbers are generally sourced from provider self-reports unless the methodology says otherwise. That does not make the table useless. It does mean teams should treat it as a selection signal, then run private evals on their own codebase, tool stack, data volume, latency limits, and failure tolerance.

Use official numbers to shortlist models, not to choose blindly.
Run your own evals with the same tools, files, and permissions your agent will use in production.
Measure cost per completed task, not just input and output token price.
Track wrong tool calls, repeated tool calls, and recovery from failed commands.
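
The last two points can be turned into simple metrics over agent run logs. The sketch below is one possible shape, not a standard: the `RunLog` fields, the `tool:argument` string format, and the per-million-token prices passed in are all assumptions you would replace with your own logging schema and your provider's actual pricing.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class RunLog:
    """One agent run. Field names are hypothetical, not from any SDK."""
    completed: bool
    input_tokens: int
    output_tokens: int
    tool_calls: List[str]  # e.g. "run_tests" or "read_file:src/app.py", in call order


def cost_per_completed_task(
    runs: List[RunLog], usd_per_m_input: float, usd_per_m_output: float
) -> float:
    """Total spend across all runs divided by the number that completed."""
    total = sum(
        r.input_tokens / 1_000_000 * usd_per_m_input
        + r.output_tokens / 1_000_000 * usd_per_m_output
        for r in runs
    )
    completed = sum(1 for r in runs if r.completed)
    # Failed runs still cost money, so with zero completions the cost is unbounded.
    return total / completed if completed else float("inf")


def repeated_tool_calls(run: RunLog) -> int:
    """Count calls that are identical to the call immediately before them."""
    return sum(1 for prev, cur in zip(run.tool_calls, run.tool_calls[1:]) if prev == cur)


def wrong_tool_calls(run: RunLog, allowed: Set[str]) -> int:
    """Count calls to tools outside the set the task was expected to need."""
    return sum(1 for call in run.tool_calls if call.split(":", 1)[0] not in allowed)
```

Dividing total spend by completed tasks, rather than by all tasks, is the design choice that matters here: a cheaper model that fails half its runs can end up more expensive per useful result than a pricier one that rarely fails.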

How teams should benchmark it locally

The cleanest local benchmark is a small suite of real tasks that already cost your team time. For a coding agent, include one bug fix, one refactor, one failing-test diagnosis, one feature touching two modules, and one task where the model must decide not to make a change. For an enterprise agent, include a document workflow, a tool-use workflow, a data-retrieval workflow, and an escalation workflow. Score each model on completion, correctness, review burden, latency, tool count, and total cost. Gemini 3.5 Flash should be especially interesting where many medium-difficulty tasks need to run in parallel.

Benchmark against Gemini 3 Flash if you already use the Flash tier.
Benchmark against your current flagship model if the workflow is expensive or latency-sensitive.
Include at least one adversarial or ambiguous task to test overconfident execution.
Review logs manually before trusting aggregate scores.
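
A local scorecard along these lines could be aggregated as in the sketch below. The `TaskResult` fields mirror the dimensions named above (completion, correctness, review burden, latency, tool count, cost), but the field names, units, and the choice of median for latency are assumptions rather than a prescribed format.

```python
from dataclasses import dataclass
from statistics import median
from typing import Dict, List


@dataclass
class TaskResult:
    """One model's attempt at one task. All fields are filled in by your harness."""
    model: str
    task: str
    completed: bool
    correct: bool
    review_minutes: float  # human time spent checking the output
    latency_s: float
    tool_calls: int
    cost_usd: float


def summarize(results: List[TaskResult]) -> Dict[str, Dict[str, float]]:
    """Aggregate per-model metrics from a flat list of task results."""
    by_model: Dict[str, List[TaskResult]] = {}
    for r in results:
        by_model.setdefault(r.model, []).append(r)

    summary: Dict[str, Dict[str, float]] = {}
    for model, rs in by_model.items():
        n = len(rs)
        n_correct = sum(1 for r in rs if r.correct)
        total_cost = sum(r.cost_usd for r in rs)
        summary[model] = {
            "completion_rate": sum(1 for r in rs if r.completed) / n,
            "correct_rate": n_correct / n,
            "mean_review_minutes": sum(r.review_minutes for r in rs) / n,
            "median_latency_s": median(r.latency_s for r in rs),
            "mean_tool_calls": sum(r.tool_calls for r in rs) / n,
            # Cost is charged against correct results only.
            "cost_per_correct_usd": total_cost / n_correct if n_correct else float("inf"),
        }
    return summary
```

Keeping completion and correctness as separate rates is deliberate: an agent that confidently finishes the wrong change completes the task but fails it, and that gap is exactly what the adversarial or ambiguous task in the suite is meant to expose.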

Verdict

Gemini 3.5 Flash looks like a serious agentic workhorse. It is not the top model on every published benchmark, and it should not be sold internally as a universal replacement for every Pro or flagship model. But the combination of 76.2 percent on Terminal-Bench 2.1, 83.6 percent on MCP Atlas, 78.4 percent on OSWorld-Verified, and strong multimodal scores makes it one of the most important models for teams building coding agents, internal automation, AI SaaS features, and long-running tool workflows in 2026.

Frequently asked questions

What is Gemini 3.5 Flash's main coding benchmark score?
Google reports 76.2 percent on Terminal-Bench 2.1 and 55.1 percent on SWE-Bench Pro Public for Gemini 3.5 Flash.

What is Gemini 3.5 Flash's main agentic benchmark score?
The headline agentic score is 83.6 percent on MCP Atlas, a multi-step workflow benchmark using MCP.

Does Gemini 3.5 Flash beat GPT-5.5 on every benchmark?
No. In Google's published table, GPT-5.5 leads several benchmarks, including Terminal-Bench 2.1, OSWorld-Verified, GDPval-AA, MRCR v2 at 128k, Blueprint-Bench 2, and ARC-AGI-2.
