GPT-5.1 vs Claude Sonnet 4.5

Across 15 shared benchmarks, GPT-5.1 leads overall with 10 wins to Claude Sonnet 4.5's 5 and no ties. The average score difference (GPT-5.1 minus Claude Sonnet 4.5) is +5.58 points.

GPT-5.1

OpenAI · 2025-11-12 · Reasoning model

Claude Sonnet 4.5

Anthropic · 2025-09-30 · Chat model

GPT-5.1: 10 wins (67%) · Claude Sonnet 4.5: 5 wins (33%)

Benchmark scores

Grouped by capability, sorted by largest gap within each. 15 shared benchmarks.
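The grouping and ordering described above can be sketched as follows. This is a minimal illustration, not the site's actual code: the function name `group_and_sort` and the tuple layout are hypothetical, and it assumes "largest gap" means the largest absolute difference, which matches the row order in the tables. The rows are a subset of the scores listed on this page, deliberately shuffled.

```python
from collections import defaultdict

# (capability, benchmark, GPT-5.1 score, Claude Sonnet 4.5 score)
# Values copied from this page; input order is deliberately shuffled.
rows = [
    ("General Knowledge", "ARC-AGI-2", 17.60, 13.60),
    ("Coding and Software Engineer", "SWE-bench Verified", 76.30, 82.00),
    ("General Knowledge", "HLE", 26.50, 33.60),
    ("General Knowledge", "ARC-AGI", 72.80, 63.70),
    ("Coding and Software Engineer", "SWE-Bench Pro - Public", 50.80, 43.60),
    ("General Knowledge", "GPQA Diamond", 88.10, 83.40),
]


def group_and_sort(rows):
    """Group rows by capability, then sort each group by absolute gap, largest first."""
    groups = defaultdict(list)
    for capability, name, gpt, claude in rows:
        # Diff is GPT-5.1 minus Claude Sonnet 4.5, rounded to two decimals.
        groups[capability].append((name, round(gpt - claude, 2)))
    return {
        capability: sorted(items, key=lambda item: abs(item[1]), reverse=True)
        for capability, items in groups.items()
    }


for capability, items in group_and_sort(rows).items():
    print(capability)
    for name, diff in items:
        print(f"  {name}: {diff:+.2f}")
```

Sorting on the absolute value keeps a benchmark where Claude Sonnet 4.5 leads by a wide margin (such as HLE) near the top of its group instead of pushing it to the bottom.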

General Knowledge

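GPT-5.1 leads 3/4

| Benchmark | GPT-5.1 | Claude Sonnet 4.5 | Diff |
| --- | --- | --- | --- |
| ARC-AGI | 72.80 · rank 25 / 65 | 63.70 · rank 32 / 65 | +9.10 |
| HLE | 26.50 · rank 83 / 157 | 33.60 · rank 67 / 157 | -7.10 |
| GPQA Diamond | 88.10 · rank 28 / 178 | 83.40 · rank 58 / 178 | +4.70 |
| ARC-AGI-2 | 17.60 · rank 33 / 59 | 13.60 · rank 35 / 59 | +4 |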

Math and Reasoning

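Even: 2 wins each out of 4

| Benchmark | GPT-5.1 | Claude Sonnet 4.5 | Diff |
| --- | --- | --- | --- |
| FrontierMath | 26.70 · rank 13 / 60 · Thinking High (With Tools) | 5.20 · rank 38 / 60 | +21.50 |
| FrontierMath - Tier 4 | 12.50 · rank 29 / 80 · Thinking High (With Tools) | 2.10 · rank 56 / 80 · Normal (No Tools) | +10.40 |
| AIME2025 | 94 · rank 28 / 106 | 100 · rank 1 / 106 | -6 |
| Simple Bench | 53.20 · rank 10 / 27 | 54.30 · rank 9 / 27 | -1.10 |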

Agent Level Benchmark

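Even: 1 win each out of 2

| Benchmark | GPT-5.1 | Claude Sonnet 4.5 | Diff |
| --- | --- | --- | --- |
| Terminal Bench Hard | 43 · rank 2 / 13 · Thinking High (With Tools) | 33 · rank 8 / 13 | +10 |
| τ²-Bench - Telecom | 95.60 · rank 14 / 35 · Thinking High (With Tools) | 98 · rank 5 / 35 | -2.40 |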

Coding and Software Engineer

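Even: 1 win each out of 2

| Benchmark | GPT-5.1 | Claude Sonnet 4.5 | Diff |
| --- | --- | --- | --- |
| SWE-Bench Pro - Public | 50.80 · rank 30 / 43 · Thinking High (No Tools) | 43.60 · rank 36 / 43 | +7.20 |
| SWE-bench Verified | 76.30 · rank 30 / 108 | 82 · rank 6 / 108 | -5.70 |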

AI Agent - Information Search

GPT-5.1 leads 1/1

| Benchmark | GPT-5.1 | Claude Sonnet 4.5 | Diff |
| --- | --- | --- | --- |
| BrowseComp | 50.80 · rank 36 / 45 · Thinking High (No Tools) | 24.10 · rank 43 / 45 | +26.70 |

AI Agent - Tool Usage

GPT-5.1 leads 1/1

| Benchmark | GPT-5.1 | Claude Sonnet 4.5 | Diff |
| --- | --- | --- | --- |
| Terminal Bench 2.0 | 47.60 · rank 37 / 46 · Thinking High (With Tools) | 42.80 · rank 41 / 46 | +4.80 |

Multimodal Understanding

GPT-5.1 leads 1/1

| Benchmark | GPT-5.1 | Claude Sonnet 4.5 | Diff |
| --- | --- | --- | --- |
| MMMU | 85.40 · rank 2 / 28 | 77.80 · rank 14 / 28 | +7.60 |

Specs

| Field | GPT-5.1 | Claude Sonnet 4.5 |
| --- | --- | --- |
| Publisher | OpenAI | Anthropic |
| Release date | 2025-11-12 | 2025-09-30 |
| Model type | Reasoning model | Chat model |
| Architecture | Dense | Dense |
| Parameters | Not available | Not available |
| Context length | 400K | 1000K |
| Max output | 128K | 64K |

Summary

  • GPT-5.1 leads in: General Knowledge (3/4), AI Agent - Information Search (1/1), AI Agent - Tool Usage (1/1), Multimodal Understanding (1/1)
  • Tied in: Math and Reasoning, Agent Level Benchmark, Coding and Software Engineer
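The category verdicts in this list can be reproduced from the Diff column alone. The sketch below assumes a category leader is decided by win count rather than by average gap. The page does not state the rule, but Math and Reasoning is listed as tied despite a positive average gap, which is consistent with counting wins. The function name `category_leader` is hypothetical.

```python
# Diff values per category (GPT-5.1 minus Claude Sonnet 4.5), copied from this page.
categories = {
    "General Knowledge": [9.10, -7.10, 4.70, 4.00],
    "Math and Reasoning": [21.50, 10.40, -6.00, -1.10],
    "Agent Level Benchmark": [10.00, -2.40],
    "Coding and Software Engineer": [7.20, -5.70],
    "AI Agent - Information Search": [26.70],
    "AI Agent - Tool Usage": [4.80],
    "Multimodal Understanding": [7.60],
}


def category_leader(diffs):
    """Return the model with more benchmark wins, or 'Tied' on equal counts.

    Assumption: a positive diff is a GPT-5.1 win, a negative diff is a
    Claude Sonnet 4.5 win, and a diff of zero counts for neither model.
    """
    gpt_wins = sum(1 for d in diffs if d > 0)
    claude_wins = sum(1 for d in diffs if d < 0)
    if gpt_wins > claude_wins:
        return "GPT-5.1"
    if claude_wins > gpt_wins:
        return "Claude Sonnet 4.5"
    return "Tied"


leaders = {name: category_leader(diffs) for name, diffs in categories.items()}
for name, leader in leaders.items():
    print(f"{name}: {leader}")
```

Under this rule no category goes to Claude Sonnet 4.5, which is why the list has no "leads in" entry for it.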

On average across the 15 shared benchmarks, GPT-5.1 scores 5.58 higher.

Largest single-benchmark gap: BrowseComp — GPT-5.1 50.80 vs Claude Sonnet 4.5 24.10 (+26.70).
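The headline figures (win counts, average difference, largest gap) follow directly from the 15 Diff values. The sketch below recomputes them as a sanity check. The function name `summarize` is hypothetical, and it assumes the average is a plain unweighted mean of the Diff column, rounded to two decimals.

```python
# Diff column (GPT-5.1 score minus Claude Sonnet 4.5 score), copied from this page.
diffs = {
    "ARC-AGI": 9.10,
    "HLE": -7.10,
    "GPQA Diamond": 4.70,
    "ARC-AGI-2": 4.00,
    "FrontierMath": 21.50,
    "FrontierMath - Tier 4": 10.40,
    "AIME2025": -6.00,
    "Simple Bench": -1.10,
    "Terminal Bench Hard": 10.00,
    "τ²-Bench - Telecom": -2.40,
    "SWE-Bench Pro - Public": 7.20,
    "SWE-bench Verified": -5.70,
    "BrowseComp": 26.70,
    "Terminal Bench 2.0": 4.80,
    "MMMU": 7.60,
}


def summarize(diffs):
    """Return (GPT-5.1 wins, Claude wins, ties, mean diff, benchmark with largest gap)."""
    gpt_wins = sum(1 for d in diffs.values() if d > 0)
    claude_wins = sum(1 for d in diffs.values() if d < 0)
    ties = sum(1 for d in diffs.values() if d == 0)
    mean_diff = round(sum(diffs.values()) / len(diffs), 2)
    # Largest gap is judged by absolute value, regardless of which model leads.
    largest = max(diffs, key=lambda name: abs(diffs[name]))
    return gpt_wins, claude_wins, ties, mean_diff, largest


gpt_wins, claude_wins, ties, mean_diff, largest = summarize(diffs)
print(f"GPT-5.1 wins: {gpt_wins}, Claude Sonnet 4.5 wins: {claude_wins}, ties: {ties}")
print(f"Average difference: {mean_diff:+.2f}")
print(f"Largest gap: {largest} ({diffs[largest]:+.2f})")
```

With the values from this page, the result agrees with the summary: 10 wins to 5, no ties, an average of +5.58, and BrowseComp as the largest gap.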

Page generated from structured model, pricing and benchmark records. No real-time LLM is used to write the prose.