GPT-5 Benchmark Details

GPT-5's leading benchmark results are Aider-Polyglot (rank 1 / 59, score 88), AIME2025 (rank 9 / 106, score 99.60), and IMO-ProofBench (rank 2 / 16, score 59). This page also compares GPT-5 with 2 competitor models and 3 predecessor or same-series models, including performance and pricing views where available. One source link is attached for reference.

Benchmark Results

General Knowledge

14 evaluations

Benchmark / mode	Score	Rank/total
GPQA Diamond	87.30	38 / 179
GPQA Diamond	85.70	45 / 179
GPQA Diamond	77.80	85 / 179
ARC-AGI	65.70	30 / 65
ARC-AGI	56.20	40 / 65
ARC-AGI	--	45 / 65
ARC-AGI	--	61 / 65
HLE	35.20	62 / 159
HLE	24.80	90 / 159
HLE	6.30	148 / 159
ARC-AGI-2	9.90	37 / 59
ARC-AGI-2	7.50	40 / 59
ARC-AGI-2	1.90	50 / 59
ARC-AGI-2	--	57 / 59

Coding and Software Engineering

3 evaluations

Benchmark / mode	Score	Rank/total
CodeClash (Standard Mode ｜ Tools)	1360	2 / 8
SWE-bench Verified	72.80	46 / 108
SWE-Bench Pro - Public	36.30	42 / 44

Math and Reasoning

12 evaluations

Benchmark / mode	Score	Rank/total
AIME2025	99.60	9 / 106
AIME2025	94.60	26 / 106
AIME2025	61.90	80 / 106
IMO-ProofBench	59.00	2 / 16
IMO 2025	29.00	2 / 9
FrontierMath	24.80	15 / 60
FrontierMath	24.80	15 / 60
FrontierMath (Thinking Level · High ｜ Tools)	26.30	14 / 60
IMO-ProofBench Advanced	20.00	2 / 8
FrontierMath - Tier 4 (Thinking Level · Medium)	6.30	35 / 80
FrontierMath - Tier 4 (Thinking Level · High)	12.50	29 / 80
IMO 2024	11.00	4 / 10

AI Agent - Tool Usage

1 evaluation

Benchmark / mode	Score	Rank/total
Terminal-Bench	43.80	8 / 35

Multimodal Understanding

1 evaluation

Benchmark / mode	Score	Rank/total
MMMU	84.20	5 / 28

Commonsense Reasoning

1 evaluation

Benchmark / mode	Score	Rank/total
Simple Bench (Thinking Level · High)	56.70	20 / 63

Agent Level Benchmark

6 evaluations

Benchmark / mode	Score	Rank/total
τ²-Bench - Telecom	95.80	13 / 35
τ²-Bench - Telecom (Thinking Level · High ｜ Tools)	96.70	11 / 35
Aider-Polyglot (Thinking Level · Low)	81.30	5 / 59
Aider-Polyglot (Thinking Level · Medium)	86.70	2 / 59
Aider-Polyglot (Thinking Level · High)	88.00	1 / 59
τ²-Bench	80.00	15 / 40

Instruction Following

1 evaluation

Benchmark / mode	Score	Rank/total
IF Bench	73.10	8 / 29

AI Agent - Information Search

1 evaluation

Benchmark / mode	Score	Rank/total
BrowseComp	54.90	32 / 45

Compare with other models

Competitor Comparison

Benchmark scores for GPT-5 compared against top models in its class

The chart shows each model’s highest score per benchmark within the current filter. Out-of-100 benchmarks use raw heights; out-of-range benchmarks are scaled within that benchmark while labels keep the original scores.
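
As a rough illustration of the scaling rule described above, the sketch below computes bar heights for a single benchmark. It assumes that "out-of-range" means a benchmark whose top score exceeds 100 and that such scores are scaled relative to that benchmark's top score; the page states neither detail, and `bar_heights` is a hypothetical helper name. The sample scores are taken from the comparison table on this page.

```python
from typing import Dict, Optional


def bar_heights(
    scores: Dict[str, Optional[float]], full_scale: float = 100.0
) -> Dict[str, Optional[float]]:
    """Return chart heights for one benchmark; None means no score."""
    present = [s for s in scores.values() if s is not None]
    if not present:
        return {model: None for model in scores}

    top = max(present)
    if top <= full_scale:
        # Assumed "out-of-100" case: use raw scores as heights.
        return dict(scores)

    # Assumed "out-of-range" case: scale within this benchmark so the
    # top score fills the chart. Labels would still show raw scores.
    return {
        model: None if s is None else s / top * full_scale
        for model, s in scores.items()
    }


codeclash = {"GPT-5": 1360.0, "Claude Opus 4": None, "Gemini 2.5-Pro": 1125.0}
swe_bench = {"GPT-5": 72.8, "Claude Opus 4": 72.5, "Gemini 2.5-Pro": 67.2}

codeclash_heights = bar_heights(codeclash)
swe_bench_heights = bar_heights(swe_bench)
```

Under these assumptions, CodeClash heights are relative to GPT-5's 1360, while SWE-bench Verified heights equal the raw scores.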

12 benchmarks with comparable scores. Each model shows its best score, with the mode label shown next to it.

Benchmark	Category	GPT-5 (current)	Claude Opus 4	Gemini 2.5-Pro
ARC-AGI	General Knowledge	65.70 (Thinking Level · High)	35.70 (Standard Mode)	37.00 (Thinking Enabled)
ARC-AGI-2	General Knowledge	9.90 (Thinking Level · High)	8.60 (Standard Mode)	4.90 (Thinking Enabled)
GPQA Diamond	General Knowledge	87.30 (Thinking Enabled ｜ Tools)	79.60 (Standard Mode)	86.40 (Thinking Enabled)
HLE	General Knowledge	35.20 (Thinking Enabled ｜ Tools)	10.70 (Standard Mode)	21.60 (Thinking Enabled)
CodeClash	Coding and Software Engineering	1360.00 (Standard Mode ｜ Tools)	--	1125.00 (Standard Mode ｜ Tools)
SWE-bench Verified	Coding and Software Engineering	72.80 (Thinking Level · High)	72.50 (Standard Mode)	67.20 (Thinking Enabled)
AIME2025	Math and Reasoning	99.60 (Thinking Enabled ｜ Tools)	75.50 (Standard Mode)	88.00 (Thinking Enabled)
FrontierMath	Math and Reasoning	26.30 (Thinking Level · High ｜ Tools)	4.50 (Standard Mode)	11.00 (Standard Mode)
IMO 2024	Math and Reasoning	11.00 (Thinking Enabled)	--	19.00 (Thinking Enabled)
IMO 2025	Math and Reasoning	29.00 (Thinking Enabled)	--	15.20 (Thinking Enabled)
IMO-ProofBench	Math and Reasoning	59.00 (Thinking Enabled)	2.90 (Thinking Enabled)	55.20 (Thinking Enabled)
IMO-ProofBench Advanced	Math and Reasoning	20.00 (Thinking Enabled)	--	17.60 (Thinking Enabled)

8 additional benchmarks remain in the chart above.

Standard API Pricing: GPT-5 vs. Peer Models

Shows standard text input and output pricing side by side for each model. If extended-context pricing exists, the chart keeps the base rate and explains the threshold below.

Source: DataLearnerAI. Standard text prices shown here use the default supplier.

Comparable standard text pricing is not available for these models.

Version History

How each version of the GPT-5 series stacks up on benchmark tests

9 benchmarks with comparable scores. Each model shows its best score, with the mode label shown next to it.

Benchmark	Category	GPT-5 (current)	GPT-4.5	GPT-4.1	GPT-4o (2025-03-27)
ARC-AGI	General Knowledge	65.70 (Thinking Level · High)	--	--	8.80 (Standard Mode)
GPQA Diamond	General Knowledge	87.30 (Thinking Enabled ｜ Tools)	71.40 (Standard Mode)	66.30 (Standard Mode)	66.90 (Standard Mode)
HLE	General Knowledge	35.20 (Thinking Enabled ｜ Tools)	--	3.70 (Standard Mode)	--
SWE-bench Verified	Coding and Software Engineering	72.80 (Thinking Level · High)	38.00 (Standard Mode)	54.60 (Standard Mode)	--
AIME2025	Math and Reasoning	99.60 (Thinking Enabled ｜ Tools)	--	36.70 (Standard Mode)	26.70 (Standard Mode)
FrontierMath	Math and Reasoning	26.30 (Thinking Level · High ｜ Tools)	--	5.50 (Standard Mode)	--
Simple Bench	Commonsense Reasoning	56.70 (Thinking Level · High)	34.50 (Standard Mode)	27.00 (Standard Mode)	--
Aider-Polyglot	Agent Level Benchmark	88.00 (Thinking Level · High)	44.90 (Standard Mode)	52.40 (Standard Mode)	45.30 (Standard Mode)
τ²-Bench	Agent Level Benchmark	80.00 (Thinking Enabled ｜ Tools)	--	54.70 (Standard Mode ｜ Tools)	--
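
The "best score per model" rule behind the table above can be sketched as a simple reduction over individual evaluations. This is only an illustration under stated assumptions: `Evaluation` and `best_per_benchmark` are hypothetical names, and the code assumes a higher score is always better. The sample rows are GPT-5's three Aider-Polyglot evaluations listed on this page.

```python
from typing import Dict, Iterable, NamedTuple


class Evaluation(NamedTuple):
    benchmark: str
    mode: str
    score: float


def best_per_benchmark(evals: Iterable[Evaluation]) -> Dict[str, Evaluation]:
    """Keep the highest-scoring evaluation for each benchmark."""
    best: Dict[str, Evaluation] = {}
    for ev in evals:
        current = best.get(ev.benchmark)
        # Assumption: higher is better for every benchmark.
        if current is None or ev.score > current.score:
            best[ev.benchmark] = ev
    return best


gpt5_evals = [
    Evaluation("Aider-Polyglot", "Thinking Level · Low", 81.30),
    Evaluation("Aider-Polyglot", "Thinking Level · Medium", 86.70),
    Evaluation("Aider-Polyglot", "Thinking Level · High", 88.00),
]

best = best_per_benchmark(gpt5_evals)
top = best["Aider-Polyglot"]
print(f"{top.benchmark}: {top.score:.2f} ({top.mode})")
```

The selected row carries its mode along with the score, which is why each table cell can show both.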

Single-Benchmark Version Trend

Viewing: ARC-AGI (General Knowledge)

[Trend chart. Modes: Normal, Normal + Tools, Thinking, Thinking + Tools, Deep, Deep + Tools.]

X-axis shows model and release date, Y-axis shows score; solid lines connect the same mode across versions, while dotted guides align modes within the same generation.

Standard API Pricing Across the GPT-5 Series

Shows standard text input and output pricing side by side for each model. If extended-context pricing exists, the chart keeps the base rate and explains the threshold below.

Source: DataLearnerAI. Standard text prices shown here use the default supplier.

Comparable standard text pricing is not available for these models.

Sources

openai.com