Gemini 3.0 Flash Benchmark Details

Gemini 3.0 Flash's benchmark results are led by τ²-Bench (rank 3 of 40, score 90.20), AIME2025 (rank 8 of 106, score 99.70), and GPQA Diamond (rank 18 of 180, score 90.40). This page also compares it with 2 competitor models and 2 predecessor or same-series models, with performance and pricing views where available. One source link is attached for reference.

Benchmark Results

General Knowledge

6 evaluations

Benchmark	Mode	Score	Rank / total
GPQA Diamond	--	90.40	18 / 180
LiveBench	Standard Mode	56.35	79 / 115
LiveBench	Thinking Level · High	72.40	26 / 115
HLE	--	43.50	42 / 164
HLE	--	33.70	73 / 164
ARC-AGI-2	--	33.60	27 / 59

Common Sense

1 evaluation

Benchmark	Mode	Score	Rank / total
SimpleQA	--	68.70	7 / 45

Coding and Software Engineering

2 evaluations

Benchmark	Mode	Score	Rank / total
SWE-bench Verified	--	68.70	63 / 109
SWE-Bench Pro - Public	Thinking Level · High ｜ Tools	49.60	36 / 47

Math and Reasoning

3 evaluations

Benchmark	Mode	Score	Rank / total
AIME2025	--	99.70	8 / 106
AIME2025	--	95.20	24 / 106
FrontierMath - Tier 4	Standard Mode	4.20	40 / 80

Agent Level Benchmark

1 evaluation

Benchmark	Mode	Score	Rank / total
τ²-Bench	--	90.20	3 / 40

AI Agent - Tool Usage

3 evaluations

Benchmark	Mode	Score	Rank / total
MCP-Atlas	Standard Mode ｜ Tools	--	18 / 25
TerminalBench 2.1	Thinking Level · High ｜ Tools	--	20 / 21
Terminal Bench 2.0	--	47.60	37 / 46

Claw-style Agent Evaluation

2 evaluations

Benchmark	Mode	Score	Rank / total
Claw Bench	Thinking Mode ｜ Tools	85.70	15 / 29
Pinch Bench	Thinking Mode ｜ Tools	85.20	16 / 37
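
Because each benchmark ranks a different number of models, the "Rank/total" values above are easier to compare once converted to a relative position. The following is a minimal sketch, not part of the original page: the helper name `top_percent` is hypothetical, and it assumes rank 1 is the best result on each leaderboard.

```python
def top_percent(rank: int, total: int) -> float:
    """Return the rank as a percentage of the leaderboard size (lower is better)."""
    if not 1 <= rank <= total:
        raise ValueError("rank must be between 1 and total")
    return 100.0 * rank / total


# (benchmark, rank, total) copied from the tables above.
results = [
    ("τ²-Bench", 3, 40),
    ("AIME2025", 8, 106),
    ("GPQA Diamond", 18, 180),
    ("SWE-bench Verified", 63, 109),
    ("TerminalBench 2.1", 20, 21),
]

for name, rank, total in results:
    print(f"{name}: rank {rank} of {total} -> top {top_percent(rank, total):.1f}%")
```

On this reading, rank 3 of 40 on τ²-Bench and rank 18 of 180 on GPQA Diamond correspond to the top 7.5% and top 10.0% respectively, even though the raw ranks differ widely.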

Compare with other models

Competitor Comparison

Benchmark scores for Gemini 3.0 Flash compared against top models in its class

Models: Gemini 3.0 Flash, Claude Sonnet 4, GPT-5.3

The chart shows each model’s highest score per benchmark within the current filter. Out-of-100 benchmarks use raw heights; out-of-range benchmarks are scaled within that benchmark while labels keep the original scores.

10 benchmarks have comparable scores. Each model is shown at its best score, together with the mode that produced it.

Benchmark	Category	Gemini 3.0 Flash (current)	Gemini 3.0 Flash mode	Claude Sonnet 4	Claude Sonnet 4 mode
ARC-AGI-2	General Knowledge	33.60	Thinking Enabled	5.90	Thinking Enabled
GPQA Diamond	General Knowledge	90.40	Thinking Enabled	83.80	Deep Thinking Mode ｜ Tools
HLE	General Knowledge	43.50	Thinking Enabled ｜ Tools	9.60	Thinking Enabled
LiveBench	General Knowledge	72.40	Thinking Level · High	61.27	64K
SWE-Bench Pro - Public	Coding and Software Engineering	49.60	Thinking Level · High ｜ Tools	42.70	Thinking Enabled
SWE-bench Verified	Coding and Software Engineering	68.70	Thinking Enabled	80.20	Thinking Enabled ｜ Tools
AIME2025	Math and Reasoning	99.70	Thinking Enabled ｜ Tools	85.00	Deep Thinking Mode ｜ Tools
τ²-Bench	Agent Level Benchmark	90.20	Thinking Enabled ｜ Tools	52.00	Standard Mode ｜ Tools
Claw Bench	Claw-style Agent Evaluation (OpenClaw)	85.70	Thinking Enabled ｜ Tools	77.80	Thinking Enabled ｜ Tools
Pinch Bench	Claw-style Agent Evaluation (OpenClaw)	85.20	Thinking Enabled ｜ Tools	80.50	Thinking Enabled ｜ Tools
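
The "best score per benchmark" rule used for this comparison can be sketched as a simple reduction over the per-mode rows. This is an illustrative sketch only: the function name `best_per_benchmark` is hypothetical, the rows are the Gemini 3.0 Flash entries listed earlier on this page, and `None` marks a mode the page does not state.

```python
# (benchmark, mode, score) rows for Gemini 3.0 Flash, copied from this page.
# Modes for the 43.50 and 99.70 rows come from the comparison table above;
# None marks a mode that the page does not state.
rows = [
    ("LiveBench", "Standard Mode", 56.35),
    ("LiveBench", "Thinking Level · High", 72.40),
    ("HLE", "Thinking Enabled ｜ Tools", 43.50),
    ("HLE", None, 33.70),
    ("AIME2025", "Thinking Enabled ｜ Tools", 99.70),
    ("AIME2025", None, 95.20),
]


def best_per_benchmark(rows):
    """Keep, for each benchmark, the (mode, score) pair with the highest score."""
    best = {}
    for benchmark, mode, score in rows:
        if benchmark not in best or score > best[benchmark][1]:
            best[benchmark] = (mode, score)
    return best


for benchmark, (mode, score) in best_per_benchmark(rows).items():
    print(f"{benchmark}: {score:.2f} ({mode})")
```

Because the best score for each model may come from a different mode (for example, with or without tools), a row in the comparison is not necessarily a like-for-like measurement.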

Standard API Pricing: Gemini 3.0 Flash vs. Peer Models

Shows standard text input and output pricing side by side for each model. If extended-context pricing exists, the chart keeps the base rate and explains the threshold below.

Source: DataLearnerAI. Standard text prices shown here use the default supplier and are in USD per 1M tokens.

When a context threshold exists, the charted base price only applies within these limits:

Claude Sonnet 4: base price applies to contexts of <= 200,000 tokens

Model	Supplier	Standard input	Standard output	Base price applies to
Gemini 3.0 Flash	Google DeepMind	$0.5 / 1M tokens	$3 / 1M tokens	--
Claude Sonnet 4	Anthropic	$3 / 1M tokens	$15 / 1M tokens	<= 200,000 tokens
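
To make the per-million-token prices concrete, the sketch below estimates the cost of a single request. The prices are taken from the table above; the function name `request_cost` and the example workload (10,000 input tokens, 2,000 output tokens) are assumptions chosen for illustration, and the workload stays within the 200,000-token threshold so that base prices apply.

```python
# USD per 1M tokens as (input, output), from the pricing table above.
PRICES = {
    "Gemini 3.0 Flash": (0.5, 3.0),
    "Claude Sonnet 4": (3.0, 15.0),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request at standard (base) pricing."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical workload: 10,000 input tokens and 2,000 output tokens.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 10_000, 2_000):.4f}")
```

For this hypothetical request the estimate is about $0.011 for Gemini 3.0 Flash and $0.06 for Claude Sonnet 4. The sketch ignores extended-context pricing and any other billing rules not shown on this page.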

Version History

How each version of the Gemini 3.0 Flash series stacks up on benchmark tests

Models: Gemini 3.0 Flash, Gemini 2.5 Flash, Gemini 2.0 Flash Experimental

7 benchmarks have comparable scores. Each model is shown at its best score, together with the mode that produced it.

Benchmark	Category	Gemini 3.0 Flash (current)	Gemini 3.0 Flash mode	Gemini 2.5 Flash	Gemini 2.5 Flash mode	Gemini 2.0 Flash Experimental	Gemini 2.0 Flash Experimental mode
GPQA Diamond	General Knowledge	90.40	Thinking Enabled	82.80	Thinking Enabled	65.20	Standard Mode
HLE	General Knowledge	43.50	Thinking Enabled ｜ Tools	11.00	Thinking Enabled	5.10	Standard Mode
LiveBench	General Knowledge	72.40	Thinking Level · High	47.74	Thinking Level · High	--	--
SimpleQA	Common Sense	68.70	Thinking Enabled	26.90	Thinking Enabled	29.90	Standard Mode
SWE-bench Verified	Coding and Software Engineering	68.70	Thinking Enabled	50.00	Standard Mode	21.40	Standard Mode
AIME2025	Math and Reasoning	99.70	Thinking Enabled ｜ Tools	72.00	Thinking Enabled	29.70	Standard Mode
Pinch Bench	Claw-style Agent Evaluation (OpenClaw)	85.20	Thinking Enabled ｜ Tools	70.70	Thinking Enabled ｜ Tools	--	--
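
The version-to-version change on each benchmark can be read off the table with simple subtraction. The sketch below does this for a few rows; the names `history` and `gain` are hypothetical, and `None` stands for a "--" cell. The modes differ between versions on some rows, so these differences are indicative rather than like-for-like.

```python
# Best scores as (Gemini 3.0 Flash, Gemini 2.5 Flash, Gemini 2.0 Flash Experimental),
# copied from the table above; None marks a "--" cell.
history = {
    "GPQA Diamond": (90.40, 82.80, 65.20),
    "LiveBench": (72.40, 47.74, None),
    "SimpleQA": (68.70, 26.90, 29.90),
    "AIME2025": (99.70, 72.00, 29.70),
}


def gain(newer, older):
    """Score difference in points, or None when either score is missing."""
    if newer is None or older is None:
        return None
    return round(newer - older, 2)


# Per benchmark: (3.0 vs 2.5, 2.5 vs 2.0 Experimental).
gains = {
    name: (gain(v30, v25), gain(v25, v20))
    for name, (v30, v25, v20) in history.items()
}

for name, (latest, previous) in gains.items():
    print(f"{name}: 3.0 vs 2.5 = {latest}, 2.5 vs 2.0 Exp = {previous}")
```

One detail this makes visible: on SimpleQA, Gemini 2.5 Flash scored 3.0 points below Gemini 2.0 Flash Experimental before Gemini 3.0 Flash gained 41.8 points.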

Single-Benchmark Version Trend

[Chart: GPQA Diamond (General Knowledge) score by model version. X-axis: model and release date. Y-axis: score. Series: Normal, Normal + Tools, Thinking, Thinking + Tools, Deep, Deep + Tools. Solid lines connect the same mode across versions; dotted guides align modes within the same generation.]

Standard API Pricing Across the Gemini 3.0 Flash Series

Shows standard text input and output pricing side by side for each model. If extended-context pricing exists, the chart keeps the base rate and explains the threshold below.

Source: DataLearnerAI. Standard text prices shown here use the default supplier and are in USD per 1M tokens.

Model	Supplier	Standard input	Standard output	Base price applies to
Gemini 3.0 Flash	Google DeepMind	$0.5 / 1M tokens	$3 / 1M tokens	--
Gemini 2.5 Flash	Google DeepMind	$0.3 / 1M tokens	$2.5 / 1M tokens	--
Gemini 2.0 Flash Experimental	Google DeepMind	$0.1 / 1M tokens	$0.4 / 1M tokens	--

Sources

blog.google