GPT-5.1 vs Claude Opus 4

Across 9 shared benchmarks, GPT-5.1 leads overall: it wins 8 to Claude Opus 4's 1, with 0 ties and an average score difference of +13.07 points (GPT-5.1 minus Claude Opus 4).

GPT-5.1

OpenAI · 2025-11-12 · Reasoning model

Claude Opus 4

Anthropic · 2025-05-23 · Reasoning model

GPT-5.1: 8 wins (89%) · Claude Opus 4: 1 win (11%)

Benchmark scores

Grouped by capability and sorted by largest gap within each group. 9 shared benchmarks. Diff is the GPT-5.1 score minus the Claude Opus 4 score.

General Knowledge

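GPT-5.1 4/4

| Benchmark | GPT-5.1 | Claude Opus 4 | Diff |
|---|---|---|---|
| ARC-AGI | 72.80 (25 / 65) | 35.70 (48 / 65) | +37.10 |
| HLE | 26.50 (83 / 157) | 10.70 (129 / 157) | +15.80 |
| ARC-AGI-2 | 17.60 (33 / 59) | 8.60 (39 / 59) | +9.00 |
| GPQA Diamond | 88.10 (28 / 178) | 79.60 (79 / 178) | +8.50 |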

Math and Reasoning

GPT-5.1 3/4

| Benchmark | GPT-5.1 | Claude Opus 4 | Diff |
|---|---|---|---|
| FrontierMath | 26.70 (13 / 60), Thinking High (With Tools) | 4.50 (39 / 60) | +22.20 |
| AIME 2025 | 94.00 (28 / 106) | 75.50 (65 / 106) | +18.50 |
| FrontierMath - Tier 4 | 12.50 (29 / 80), Thinking High (With Tools) | 4.20 (40 / 80) | +8.30 |
| Simple Bench | 53.20 (10 / 27) | 58.80 (7 / 27) | -5.60 |

Coding and Software Engineer

GPT-5.1 1/1

| Benchmark | GPT-5.1 | Claude Opus 4 | Diff |
|---|---|---|---|
| SWE-bench Verified | 76.30 (30 / 108) | 72.50 (48 / 108) | +3.80 |
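
The grouping and ordering used in the tables above can be sketched in a few lines of Python. This is only an illustration, not the site's actual pipeline: the `ROWS` list and the `group_and_sort` helper are hypothetical names, the diff values are copied from this page, and sorting by absolute gap (rather than signed gap) is an assumption, since both orderings agree on this data.

```python
from collections import defaultdict

# (capability, benchmark, diff) copied from the tables on this page.
# diff = GPT-5.1 score minus Claude Opus 4 score. Rows are deliberately unsorted.
ROWS = [
    ("General Knowledge", "GPQA Diamond", 8.50),
    ("General Knowledge", "ARC-AGI", 37.10),
    ("General Knowledge", "ARC-AGI-2", 9.00),
    ("General Knowledge", "HLE", 15.80),
    ("Math and Reasoning", "Simple Bench", -5.60),
    ("Math and Reasoning", "FrontierMath", 22.20),
    ("Math and Reasoning", "FrontierMath - Tier 4", 8.30),
    ("Math and Reasoning", "AIME 2025", 18.50),
    ("Coding and Software Engineer", "SWE-bench Verified", 3.80),
]


def group_and_sort(rows):
    """Group rows by capability, then order each group by absolute gap, largest first."""
    groups = defaultdict(list)
    for capability, benchmark, diff in rows:
        groups[capability].append((benchmark, diff))
    for entries in groups.values():
        # Assumption: "largest gap" means largest absolute difference.
        entries.sort(key=lambda entry: abs(entry[1]), reverse=True)
    return dict(groups)


grouped = group_and_sort(ROWS)
for capability, entries in grouped.items():
    leads = sum(1 for _, diff in entries if diff > 0)
    print(f"{capability}: GPT-5.1 {leads}/{len(entries)}")
    for benchmark, diff in entries:
        print(f"  {benchmark}: {diff:+.2f}")
```

The per-group tally printed for each capability corresponds to the "GPT-5.1 4/4", "3/4" and "1/1" labels shown above each table.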

Specs

| Field | GPT-5.1 | Claude Opus 4 |
|---|---|---|
| Publisher | OpenAI | Anthropic |
| Release date | 2025-11-12 | 2025-05-23 |
| Model type | Reasoning model | Reasoning model |
| Architecture | Dense | Dense |
| Parameters | Not available | Not available |
| Context length (tokens) | 400K | 200K |
| Max output (tokens) | 128K | 32K |

Summary

  • GPT-5.1 leads in: General Knowledge (4/4), Math and Reasoning (3/4), Coding and Software Engineer (1/1)

On average across the 9 shared benchmarks, GPT-5.1 scores 13.07 points higher.

Largest single-benchmark gap: ARC-AGI — GPT-5.1 72.80 vs Claude Opus 4 35.70 (+37.10).
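
The summary figures can be checked directly from the per-benchmark scores. The sketch below is a minimal recomputation under stated assumptions: `SCORES` and `summarize` are hypothetical names introduced here, the scores are copied from this page, and the average is taken as a plain unweighted mean of the nine differences.

```python
# (benchmark, GPT-5.1 score, Claude Opus 4 score), copied from this page.
SCORES = [
    ("ARC-AGI", 72.80, 35.70),
    ("HLE", 26.50, 10.70),
    ("ARC-AGI-2", 17.60, 8.60),
    ("GPQA Diamond", 88.10, 79.60),
    ("FrontierMath", 26.70, 4.50),
    ("AIME 2025", 94.00, 75.50),
    ("FrontierMath - Tier 4", 12.50, 4.20),
    ("Simple Bench", 53.20, 58.80),
    ("SWE-bench Verified", 76.30, 72.50),
]


def summarize(scores):
    """Return win counts, win shares, mean difference and the largest absolute gap."""
    # Round each difference to 2 decimals to avoid floating-point noise.
    diffs = [(name, round(a - b, 2)) for name, a, b in scores]
    wins_a = sum(1 for _, d in diffs if d > 0)
    wins_b = sum(1 for _, d in diffs if d < 0)
    total = len(diffs)
    return {
        "wins_a": wins_a,
        "wins_b": wins_b,
        "ties": total - wins_a - wins_b,
        "share_a_pct": round(100 * wins_a / total),
        "share_b_pct": round(100 * wins_b / total),
        # Assumption: unweighted mean over all shared benchmarks.
        "mean_diff": round(sum(d for _, d in diffs) / total, 2),
        "largest_gap": max(diffs, key=lambda item: abs(item[1])),
    }


summary = summarize(SCORES)
print(summary)
```

Under these assumptions the result matches the page: 8 wins to 1 with no ties, win shares of 89% and 11%, a mean difference of 13.07, and ARC-AGI as the largest gap.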

Page generated from structured model, pricing and benchmark records. No real-time LLM is used to write the prose.