GPT-5.1 Benchmark Details
GPT-5.1's strongest benchmark placements are currently MMMU (rank 2 of 28, score 85.40), GPQA Diamond (rank 19 of 165, score 88.10), and FrontierMath (rank 7 of 54, score 26.70). This page also compares it against 2 competitor models and 2 predecessor or same-series models, with performance and pricing views where available. 2 source links are attached for reference.
Benchmark Results
Competitor Comparison
Benchmark scores for GPT-5.1 compared against top models in its class
Benchmark Score Comparison
12 benchmarks with comparable scores
| Benchmark | GPT-5.1 (this model) | Claude Opus 4 | Gemini 2.5-Pro |
|---|---|---|---|
| ARC-AGI (General Evaluation) | 72.80 high | 35.70 normal | 37.00 thinking |
| ARC-AGI-2 (General Evaluation) | 17.60 high | 8.60 normal | 4.90 thinking |
| GPQA Diamond (General Evaluation) | 88.10 thinking | 79.60 normal | 86.40 thinking |
| HLE (General Evaluation) | 42.70 thinking High (tools + web) | 10.70 normal | 21.60 thinking |
| SWE-bench Verified (Coding & Software Engineering) | 76.30 high | 72.50 normal | 67.20 thinking |
| AIME2025 (Mathematical Reasoning) | 94.00 high | 75.50 normal | 88.00 thinking |
| FrontierMath (Mathematical Reasoning) | 26.70 thinking High (tools) | 4.50 normal | 11.00 normal |
| -- | 12.50 high | 4.20 thinking | 4.20 normal |
| MMMU (Multimodal Understanding) | 85.40 high | -- | 82.00 thinking |
| Simple Bench (Commonsense Reasoning) | 53.20 high | 58.80 thinking | 62.40 thinking |
| Terminal Bench Hard (Agent Capability) | 43.00 thinking High (tools) | -- | 25.00 thinking + tools |
| τ²-Bench - Telecom (Agent Capability) | 95.60 thinking High (tools) | -- | 54.00 thinking + tools |
Standard API Pricing: GPT-5.1 vs. Peer Models
Shows standard text input and output pricing side by side for each model. If extended-context pricing exists, the chart keeps the base rate and explains the threshold below.
Source: DataLearnerAI. Standard text prices shown here use the default supplier.
Prices are listed as raw values rather than plotted on a shared bar chart.
| Model | Supplier | Standard input | Standard output | Base price applies to |
|---|---|---|---|---|
| GPT-5.1 (current model) | — | $1.25 / 1M tokens | $10.00 / 1M tokens | — |
| Claude Opus 4 | — | $15.00 / 1M tokens | $75.00 / 1M tokens | — |
| Gemini 2.5-Pro | — | $1.25 / 1M tokens | $10.00 / 1M tokens | ≤ 200K tokens |
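Since these are flat per-million-token rates, estimating the cost of a call is simple arithmetic. A minimal sketch, assuming GPT-5.1's rates from the table above (the function name and example token counts are illustrative, not part of the source):

```python
def request_cost_usd(input_tokens: int, output_tokens: int,
                     input_rate_per_m: float = 1.25,
                     output_rate_per_m: float = 10.0) -> float:
    """Return the USD cost of one API call at flat per-1M-token rates.

    Defaults use GPT-5.1's standard pricing ($1.25 input / $10.00 output
    per 1M tokens); pass other rates to compare models.
    """
    return (input_tokens / 1_000_000) * input_rate_per_m \
         + (output_tokens / 1_000_000) * output_rate_per_m

# e.g. 20,000 input tokens and 2,000 output tokens on GPT-5.1:
print(round(request_cost_usd(20_000, 2_000), 4))  # 0.045
```

Note that Gemini 2.5-Pro's base rate only applies up to the 200K-token threshold; beyond that, its extended-context pricing would apply instead.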
Version History
How each version of the GPT-5.1 series stacks up on benchmark tests
Benchmark Score Comparison
12 benchmarks with comparable scores
| Benchmark | GPT-5.1 (this model) | GPT-5 | GPT-4.5 |
|---|---|---|---|
| ARC-AGI (General Evaluation) | 72.80 high | 65.70 high | -- |
| ARC-AGI-2 (General Evaluation) | 17.60 high | 9.90 high | -- |
| GPQA Diamond (General Evaluation) | 88.10 thinking | 87.30 thinking + tools | 71.40 normal |
| HLE (General Evaluation) | 42.70 thinking High (tools + web) | 35.20 thinking + tools | -- |
| IC SWE-Lancer (Diamond) (Coding & Software Engineering) | 69.70 thinking High (no tools) | -- | 32.60 normal |
| SWE-Bench Pro - Public (Coding & Software Engineering) | 50.80 thinking High (no tools) | 36.30 high | -- |
| SWE-bench Verified (Coding & Software Engineering) | 76.30 high | 72.80 high | 38.00 normal |
| AIME2025 (Mathematical Reasoning) | 94.00 high | 99.60 thinking + tools | -- |
| FrontierMath (Mathematical Reasoning) | 26.70 thinking High (tools) | 26.30 thinking High (tools) | -- |
| -- | 12.50 high | 12.50 high | -- |
| MMMU (Multimodal Understanding) | 85.40 high | 84.20 high | -- |
| Simple Bench (Commonsense Reasoning) | 53.20 high | 56.70 high | 34.50 normal |
Standard API Pricing Across the GPT-5.1 Series
Shows standard text input and output pricing side by side for each model. If extended-context pricing exists, the chart keeps the base rate and explains the threshold below.
| Model | Supplier | Standard input | Standard output | Base price applies to |
|---|---|---|---|---|
| GPT-5.1 (current model) | — | $1.25 / 1M tokens | $10.00 / 1M tokens | — |
| GPT-5 | — | $1.25 / 1M tokens | $10.00 / 1M tokens | — |
Series Overview
See how each version of the GPT-5.1 series performs across major benchmarks, with scores broken down by reasoning mode.