GPT-5.1 vs Claude Sonnet 4.5

Across 17 shared benchmarks, GPT-5.1 leads overall: GPT-5.1 wins 10, Claude Sonnet 4.5 wins 7, with 0 ties and an average score difference of +3.72.

GPT-5.1: OpenAI · 2025-11-12 · Reasoning model

Claude Sonnet 4.5: Anthropic · 2025-09-30 · Chat model

GPT-5.1: 10 wins (59%) · Claude Sonnet 4.5: 7 wins (41%)

Benchmark scores

Grouped by capability, sorted by largest gap within each. 17 shared benchmarks.

General Knowledge · GPT-5.1 3/5

Benchmark	GPT-5.1	Claude Sonnet 4.5	Diff
LiveBench	42.65 · rank 106 / 115 · Normal (No Tools)	53.69 · rank 83 / 115 · Normal (No Tools)	-11.04
ARC-AGI	72.80 · rank 28 / 68	63.70 · rank 35 / 68	+9.10
HLE	26.50 · rank 97 / 172	33.60 · rank 80 / 172	-7.10
GPQA Diamond	88.10 · rank 31 / 187	83.40 · rank 63 / 187	+4.70
ARC-AGI-2	17.60 · rank 36 / 62	13.60 · rank 38 / 62	+4

Math and Reasoning · GPT-5.1 2/3

Benchmark	GPT-5.1	Claude Sonnet 4.5	Diff
FrontierMath	26.70 · rank 13 / 60 · Thinking High (With Tools)	5.20 · rank 38 / 60	+21.50
FrontierMath - Tier 4	12.50 · rank 29 / 80 · Thinking High (With Tools)	2.10 · rank 56 / 80 · Normal (No Tools)	+10.40
AIME2025	94 · rank 28 / 107	100 · rank 1 / 107	-6

Agent Level Benchmark · Even 2/2

Benchmark	GPT-5.1	Claude Sonnet 4.5	Diff
Terminal Bench Hard	43 · rank 2 / 13 · Thinking High (With Tools)	33 · rank 8 / 13	+10
τ²-Bench - Telecom	95.60 · rank 14 / 35 · Thinking High (With Tools)	98 · rank 5 / 35	-2.40

AI Agent - Tool Usage · Even 2/2

Benchmark	GPT-5.1	Claude Sonnet 4.5	Diff
MCP-Atlas	50.10 · rank 25 / 27 · Thinking High (With Tools)	59.50 · rank 21 / 27 · Thinking (With Tools)	-9.40
Terminal Bench 2.0	47.60 · rank 38 / 47 · Thinking High (With Tools)	42.80 · rank 42 / 47	+4.80

Coding and Software Engineer · Even 2/2

Benchmark	GPT-5.1	Claude Sonnet 4.5	Diff
SWE-Bench Pro - Public	50.80 · rank 40 / 54 · Thinking High (No Tools)	43.60 · rank 47 / 54	+7.20
SWE-bench Verified	76.30 · rank 34 / 112	82 · rank 8 / 112	-5.70

AI Agent - Information Search · GPT-5.1 1/1

Benchmark	GPT-5.1	Claude Sonnet 4.5	Diff
BrowseComp	50.80 · rank 43 / 53 · Thinking High (No Tools)	24.10 · rank 51 / 53	+26.70

Commonsense Reasoning · Claude Sonnet 4.5 1/1

Benchmark	GPT-5.1	Claude Sonnet 4.5	Diff
Simple Bench	53.20 · rank 23 / 63 · Thinking High (No Tools)	54.30 · rank 22 / 63 · Normal (No Tools)	-1.10

Multimodal Understanding · GPT-5.1 1/1

Benchmark	GPT-5.1	Claude Sonnet 4.5	Diff
MMMU	85.40 · rank 2 / 29	77.80 · rank 15 / 29	+7.60

Prices use DataLearner records when available; missing fields are not inferred.

GPT-5.1 leads in: General Knowledge (3/5), Math and Reasoning (2/3), AI Agent - Information Search (1/1), Multimodal Understanding (1/1)
Claude Sonnet 4.5 leads in: Commonsense Reasoning (1/1)
Tied in: Agent Level Benchmark, AI Agent - Tool Usage, Coding and Software Engineer

On average across the 17 shared benchmarks, GPT-5.1 scores 3.72 higher.

Largest single-benchmark gap: BrowseComp — GPT-5.1 50.80 vs Claude Sonnet 4.5 24.10 (+26.70).
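The headline figures can be rechecked from the per-benchmark scores. The sketch below is an illustration only, not the code that generates this page: the names `SCORES` and `summarize` are hypothetical, and it assumes each Diff is simply the GPT-5.1 score minus the Claude Sonnet 4.5 score, which matches every row in the tables.

```python
from statistics import mean

# (benchmark, GPT-5.1 score, Claude Sonnet 4.5 score), copied from the tables above.
SCORES = [
    ("LiveBench", 42.65, 53.69),
    ("ARC-AGI", 72.80, 63.70),
    ("HLE", 26.50, 33.60),
    ("GPQA Diamond", 88.10, 83.40),
    ("ARC-AGI-2", 17.60, 13.60),
    ("FrontierMath", 26.70, 5.20),
    ("FrontierMath - Tier 4", 12.50, 2.10),
    ("AIME2025", 94, 100),
    ("Terminal Bench Hard", 43, 33),
    ("τ²-Bench - Telecom", 95.60, 98),
    ("MCP-Atlas", 50.10, 59.50),
    ("Terminal Bench 2.0", 47.60, 42.80),
    ("SWE-Bench Pro - Public", 50.80, 43.60),
    ("SWE-bench Verified", 76.30, 82),
    ("BrowseComp", 50.80, 24.10),
    ("Simple Bench", 53.20, 54.30),
    ("MMMU", 85.40, 77.80),
]


def summarize(scores):
    """Recompute win counts, mean difference and largest gap (assumed: diff = GPT-5.1 - Claude)."""
    diffs = {name: round(gpt - claude, 2) for name, gpt, claude in scores}
    wins_gpt = sum(d > 0 for d in diffs.values())
    wins_claude = sum(d < 0 for d in diffs.values())
    ties = sum(d == 0 for d in diffs.values())
    largest = max(diffs, key=lambda name: abs(diffs[name]))
    return {
        "benchmarks": len(diffs),
        "wins_gpt": wins_gpt,
        "wins_claude": wins_claude,
        "ties": ties,
        "pct_gpt": round(100 * wins_gpt / len(diffs)),
        "pct_claude": round(100 * wins_claude / len(diffs)),
        "mean_diff": round(mean(diffs.values()), 2),
        "largest_gap": (largest, diffs[largest]),
    }


summary = summarize(SCORES)
print(summary)
```

Under that assumption, the result reproduces the figures stated on this page: 10 wins to 7 with no ties (59% to 41%), a mean difference of +3.72, and BrowseComp as the largest gap at +26.70.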

Page generated from structured model, pricing and benchmark records. No real-time LLM is used to write the prose.