Gemini 3.0 Pro (Preview 11-2025) Benchmark Details

Gemini 3.0 Pro (Preview 11-2025) currently shows benchmark results led by MMLU Pro (rank 2 / 132, score 90), LiveCodeBench (rank 2 / 123, score 92), and GPQA Diamond (rank 5 / 187, score 93.80). One source link is attached for reference.
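The "Rank/total" values on this page come from leaderboards of different sizes, so one way to compare them is to divide rank by total. The sketch below is an assumption about how such a comparison could be done, not the site's documented method; the helper name `relative_rank` is hypothetical, and the rows are a small sample copied from this page.

```python
from fractions import Fraction

def relative_rank(rank_total: str) -> Fraction:
    """Turn a 'rank / total' string such as '5 / 187' into the fraction rank/total."""
    rank, total = (int(part) for part in rank_total.split("/"))
    return Fraction(rank, total)

# Sample rows taken from this page: (benchmark, "rank / total").
rows = [
    ("GPQA Diamond", "5 / 187"),
    ("MMLU Pro", "2 / 132"),
    ("LiveCodeBench", "2 / 123"),
    ("AA-LCR", "2 / 14"),
    ("SimpleQA", "6 / 47"),
]

# A smaller fraction means a placement closer to the top of its leaderboard.
ordered = sorted(rows, key=lambda row: relative_rank(row[1]))
top_three = [name for name, _ in ordered[:3]]
print(top_three)
```

Under this reading, a rank of 2 out of 14 counts as a weaker placement than 5 out of 187, because the leaderboard is much smaller.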

Benchmark Results

General Knowledge

14 evaluations

| Benchmark / mode | Score | Rank/total |
| --- | --- | --- |
| GPQA Diamond | 93.80 | 5 / 187 |
| GPQA Diamond | 91.90 | 14 / 187 |
| GPQA Diamond | | 17 / 187 |
| MMLU Pro | | 2 / 132 |
| ARC-AGI | 87.50 | 19 / 67 |
| ARC-AGI | | 26 / 67 |
| LiveBench (Thinking Level · Low) | 63.90 | 54 / 115 |
| LiveBench (Thinking Level · High) | 73.39 | 24 / 115 |
| HLE | 45.80 | 38 / 170 |
| HLE | | 58 / 170 |
| HLE | 37.50 | 67 / 170 |
| HLE | 37.20 | 68 / 170 |
| ARC-AGI-2 | 45.10 | 25 / 61 |
| ARC-AGI-2 | 31.10 | 31 / 61 |

Common Sense

1 evaluation

| Benchmark / mode | Score | Rank/total |
| --- | --- | --- |
| SimpleQA | 72.10 | 6 / 47 |

Coding and Software Engineering

2 evaluations

| Benchmark / mode | Score | Rank/total |
| --- | --- | --- |
| LiveCodeBench | | 2 / 123 |
| SWE-bench Verified | 76.20 | 35 / 111 |

Math and Reasoning

5 evaluations

| Benchmark / mode | Score | Rank/total |
| --- | --- | --- |
| AIME 2025 | | 25 / 107 |
| AIME 2026 | 90.60 | 14 / 18 |
| FrontierMath | | 10 / 60 |
| FrontierMath - Tier 4 (Standard Mode) | 18.80 | 16 / 80 |
| FrontierMath - Tier 4 | 18.80 | 16 / 80 |

Common Sense Reasoning

1 evaluation

| Benchmark / mode | Score | Rank/total |
| --- | --- | --- |
| Simple Bench (Thinking Mode) | 76.40 | 5 / 63 |

Agent Level Benchmark

4 evaluations

| Benchmark / mode | Score | Rank/total |
| --- | --- | --- |
| τ²-Bench - Telecom | | 5 / 35 |
| τ²-Bench | 85.40 | 8 / 43 |
| Terminal Bench Hard | | 4 / 13 |
| Terminal Bench Hard | | 5 / 13 |

Instruction Following

2 evaluations

| Benchmark / mode | Score | Rank/total |
| --- | --- | --- |
| IF Bench | | 13 / 30 |
| IF Bench | | 13 / 30 |

AI Agent - Information Search

1 evaluation

| Benchmark / mode | Score | Rank/total |
| --- | --- | --- |
| BrowseComp | 59.20 | 37 / 52 |

AI Agent - Tool Usage

3 evaluations

| Benchmark / mode | Score | Rank/total |
| --- | --- | --- |
| MCP-Atlas (Standard Mode, Tools) | 70.30 | 15 / 27 |
| Terminal Bench 2.0 | 56.90 | 25 / 47 |
| Terminal Bench 2.0 | 54.20 | 29 / 47 |

Productivity Knowledge

1 evaluation

| Benchmark / mode | Score | Rank/total |
| --- | --- | --- |
| GDPval-AA | | 18 / 21 |

Long Context

1 evaluation

| Benchmark / mode | Score | Rank/total |
| --- | --- | --- |
| AA-LCR | | 2 / 14 |

Claw-style Agent Evaluation

1 evaluation

| Benchmark / mode | Score | Rank/total |
| --- | --- | --- |
| Pinch Bench (Thinking Mode, Tools) | 70.70 | 31 / 37 |

Sources

artificialanalysis.ai