GPT-5.1 vs Gemini 2.5-Pro

Across 15 shared benchmarks, GPT-5.1 leads overall: GPT-5.1 wins 13, Gemini 2.5-Pro wins 2, with 0 ties and an average score difference of +12.83.

GPT-5.1: OpenAI · 2025-11-12 · Reasoning model

Gemini 2.5-Pro: Google DeepMind · 2025-06-05 · Reasoning model

GPT-5.1: 13 wins (87%) · Gemini 2.5-Pro: 2 wins (13%)

Benchmark scores

Grouped by capability, sorted by largest gap within each. 15 shared benchmarks.

General Knowledge · GPT-5.1 leads 4/5

Benchmark	GPT-5.1	Gemini 2.5-Pro	Diff
ARC-AGI	72.80 · rank 28/68	37 · rank 50/68	+35.80
LiveBench	42.65 · rank 106/115 · Normal (No Tools)	58.33 · rank 76/115 · Thinking High (No Tools)	-15.68
ARC-AGI-2	17.60 · rank 36/62	4.90 · rank 47/62	+12.70
HLE	26.50 · rank 97/172	21.60 · rank 112/172	+4.90
GPQA Diamond	88.10 · rank 31/187	86.40 · rank 45/187	+1.70

Math and Reasoning · GPT-5.1 leads 3/3

Benchmark	GPT-5.1	Gemini 2.5-Pro	Diff
FrontierMath	26.70 · rank 13/60 · Thinking High (With Tools)	11 · rank 23/60	+15.70
FrontierMath - Tier 4	12.50 · rank 29/80 · Thinking High (With Tools)	2.10 · rank 56/80 · Normal (No Tools)	+10.40
AIME2025	94 · rank 28/107	88 · rank 44/107	+6

Agent Level Benchmark · GPT-5.1 leads 2/2

Benchmark	GPT-5.1	Gemini 2.5-Pro	Diff
τ²-Bench - Telecom	95.60 · rank 14/35 · Thinking High (With Tools)	54 · rank 32/35	+41.60
Terminal Bench Hard	43 · rank 2/13 · Thinking High (With Tools)	25 · rank 12/13	+18

AI Agent - Information Search · GPT-5.1 leads 1/1

Benchmark	GPT-5.1	Gemini 2.5-Pro	Diff
BrowseComp	50.80 · rank 43/53 · Thinking High (No Tools)	7.80 · rank 52/53	+43

AI Agent - Tool Usage · GPT-5.1 leads 1/1

Benchmark	GPT-5.1	Gemini 2.5-Pro	Diff
Terminal Bench 2.0	47.60 · rank 38/47 · Thinking High (With Tools)	32.60 · rank 47/47	+15

Coding and Software Engineer · GPT-5.1 leads 1/1

Benchmark	GPT-5.1	Gemini 2.5-Pro	Diff
SWE-bench Verified	76.30 · rank 34/112	67.20 · rank 72/112	+9.10

Commonsense Reasoning · Gemini 2.5-Pro leads 1/1

Benchmark	GPT-5.1	Gemini 2.5-Pro	Diff
Simple Bench	53.20 · rank 23/63 · Thinking High (No Tools)	62.40 · rank 11/63 · Thinking (No Tools)	-9.20

Multimodal Understanding · GPT-5.1 leads 1/1

Benchmark	GPT-5.1	Gemini 2.5-Pro	Diff
MMMU	85.40 · rank 2/29	82 · rank 10/29	+3.40

Prices use DataLearner records when available; missing fields are not inferred.

GPT-5.1 leads in: General Knowledge (4/5), Math and Reasoning (3/3), Agent Level Benchmark (2/2), AI Agent - Information Search (1/1), AI Agent - Tool Usage (1/1), Coding and Software Engineer (1/1), Multimodal Understanding (1/1)
Gemini 2.5-Pro leads in: Commonsense Reasoning (1/1)

On average across the 15 shared benchmarks, GPT-5.1 scores 12.83 higher.

Largest single-benchmark gap: BrowseComp — GPT-5.1 50.80 vs Gemini 2.5-Pro 7.80 (+43).
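The summary figures above (win counts, win shares, average difference, largest gap) can be reproduced from the Diff column alone. The following is a minimal Python sketch, not the page's actual generation logic: the `diffs` dictionary and the `summarize` function are illustrative names, and the values are copied from the Diff column of the tables, where a positive value means GPT-5.1 scored higher.

```python
# Diff values (GPT-5.1 score minus Gemini 2.5-Pro score), copied from the
# benchmark tables. The dictionary and function names are illustrative only.
diffs = {
    "ARC-AGI": 35.80,
    "LiveBench": -15.68,
    "ARC-AGI-2": 12.70,
    "HLE": 4.90,
    "GPQA Diamond": 1.70,
    "FrontierMath": 15.70,
    "FrontierMath - Tier 4": 10.40,
    "AIME2025": 6.0,
    "τ²-Bench - Telecom": 41.60,
    "Terminal Bench Hard": 18.0,
    "BrowseComp": 43.0,
    "Terminal Bench 2.0": 15.0,
    "SWE-bench Verified": 9.10,
    "Simple Bench": -9.20,
    "MMMU": 3.40,
}


def summarize(diffs):
    """Return win counts, win shares, mean difference, and the largest gap."""
    n = len(diffs)
    gpt_wins = sum(1 for d in diffs.values() if d > 0)
    gemini_wins = sum(1 for d in diffs.values() if d < 0)
    # The benchmark with the largest absolute difference, in either direction.
    largest = max(diffs, key=lambda name: abs(diffs[name]))
    return {
        "benchmarks": n,
        "gpt_wins": gpt_wins,
        "gemini_wins": gemini_wins,
        "ties": n - gpt_wins - gemini_wins,
        "gpt_win_pct": round(100 * gpt_wins / n),
        "gemini_win_pct": round(100 * gemini_wins / n),
        "mean_diff": round(sum(diffs.values()) / n, 2),
        "largest_gap": (largest, diffs[largest]),
    }


summary = summarize(diffs)
for key, value in summary.items():
    print(f"{key}: {value}")
```

Note that the mean is a plain unweighted average of score differences across benchmarks with different scales, so it is best read as a rough indicator rather than a calibrated measure.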

Page generated from structured model, pricing and benchmark records. No real-time LLM is used to write the prose.