GPT-5 Benchmark Details

GPT-5's leading benchmark results are Aider-Polyglot (rank 1 / 59, score 88), AIME2025 (rank 9 / 106, score 99.60), and IMO-ProofBench (rank 2 / 16, score 59). This page also compares GPT-5 with 2 competitor models and 3 predecessor or same-series models, including performance and pricing views when available. One source link is attached for reference.

Benchmark Results

General Knowledge

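14 evaluations

| Benchmark / mode | Score | Rank/total |
|---|---|---|
| GPQA Diamond (Thinking Enabled + Tools) | 87.30 | 38 / 179 |
| -- | 85.70 | 45 / 179 |
| -- | 77.80 | 85 / 179 |
| ARC-AGI (Thinking Level · High) | 65.70 | 30 / 65 |
| -- | 56.20 | 40 / 65 |
| -- | 44 | 45 / 65 |
| -- | 6 | 61 / 65 |
| HLE (Thinking Enabled + Tools) | 35.20 | 62 / 159 |
| -- | 24.80 | 90 / 159 |
| -- | 6.30 | 148 / 159 |
| ARC-AGI-2 (Thinking Level · High) | 9.90 | 37 / 59 |
| -- | 7.50 | 40 / 59 |
| -- | 1.90 | 50 / 59 |
| -- | 0 | 57 / 59 |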

Coding and Software Engineering

3 evaluations

| Benchmark / mode | Score | Rank/total |
|---|---|---|
| CodeClash (Standard Mode + Tools) | 1360 | 2 / 8 |
| SWE-bench Verified (Thinking Level · High) | 72.80 | 46 / 108 |

Math and Reasoning

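12 evaluations

| Benchmark / mode | Score | Rank/total |
|---|---|---|
| AIME2025 (Thinking Enabled + Tools) | 99.60 | 9 / 106 |
| -- | 94.60 | 26 / 106 |
| -- | 61.90 | 80 / 106 |
| IMO 2025 (Thinking Enabled) | 29 | 2 / 9 |
| -- | 24.80 | 15 / 60 |
| -- | 24.80 | 15 / 60 |
| FrontierMath (Thinking Level · High + Tools) | 26.30 | 14 / 60 |
| FrontierMath - Tier 4 (Thinking Level · Medium) | 6.30 | 35 / 80 |
| FrontierMath - Tier 4 (Thinking Level · High) | 12.50 | 29 / 80 |
| IMO 2024 (Thinking Enabled) | 11 | 4 / 10 |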

AI Agent - Tool Usage

1 evaluation

| Benchmark / mode | Score | Rank/total |
|---|---|---|
| -- | 43.80 | 8 / 35 |

Multimodal Understanding

1 evaluation

| Benchmark / mode | Score | Rank/total |
|---|---|---|
| -- | 84.20 | 5 / 28 |

Commonsense Reasoning

1 evaluation

| Benchmark / mode | Score | Rank/total |
|---|---|---|
| Simple Bench (Thinking Level · High) | 56.70 | 20 / 63 |

Agent Level Benchmark

6 evaluations

| Benchmark / mode | Score | Rank/total |
|---|---|---|
| τ²-Bench - Telecom (Thinking Level · High + Tools) | 96.70 | 11 / 35 |
| Aider-Polyglot (Thinking Level · Low) | 81.30 | 5 / 59 |
| Aider-Polyglot (Thinking Level · Medium) | 86.70 | 2 / 59 |
| Aider-Polyglot (Thinking Level · High) | 88 | 1 / 59 |
| τ²-Bench (Thinking Enabled + Tools) | 80 | 15 / 40 |

Instruction Following

1 evaluation

| Benchmark / mode | Score | Rank/total |
|---|---|---|
| -- | 73.10 | 8 / 29 |

AI Agent - Information Search

1 evaluation

| Benchmark / mode | Score | Rank/total |
|---|---|---|
| -- | 54.90 | 32 / 45 |

Competitor Comparison

Benchmark scores for GPT-5 compared against top models in its class

The chart shows each model's highest score per benchmark within the current filter. Benchmarks scored out of 100 are drawn at their raw height; benchmarks on other scales are rescaled within that benchmark, and the labels keep the original scores.
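The page does not say how out-of-range benchmarks are rescaled. The sketch below assumes the simplest option: divide each score by the highest score on that benchmark and map it onto a 0 to 100 axis. The function name `bar_heights` and the `axis_max` parameter are hypothetical and exist only for illustration.

```python
from typing import Dict, Optional


def bar_heights(scores: Dict[str, Optional[float]],
                axis_max: float = 100.0) -> Dict[str, float]:
    """Return bar heights on a 0..axis_max axis for one benchmark.

    Assumption (not confirmed by the page): a benchmark whose top score
    exceeds axis_max is rescaled so that its top score fills the axis.
    Benchmarks already within range keep their raw scores as heights.
    Models without a score (None, shown as "--" on the page) are skipped.
    """
    present = {model: s for model, s in scores.items() if s is not None}
    if not present:
        return {}
    top = max(present.values())
    if top <= axis_max:
        return dict(present)  # out-of-100 benchmark: raw heights
    return {model: s / top * axis_max for model, s in present.items()}


# Scores taken from the comparison table on this page.
codeclash = {"GPT-5": 1360.0, "Claude Opus 4": None, "Gemini 2.5-Pro": 1125.0}
aime2025 = {"GPT-5": 99.6, "Claude Opus 4": 75.5, "Gemini 2.5-Pro": 88.0}

codeclash_heights = bar_heights(codeclash)
aime_heights = bar_heights(aime2025)
```

Under this assumption the CodeClash bars keep their relative proportions, while the labels would still read 1360 and 1125.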

12 benchmarks have comparable scores. Each model shows its best score together with the mode that produced it.

| Benchmark | Category | GPT-5 (current) | Claude Opus 4 | Gemini 2.5-Pro |
|---|---|---|---|---|
| ARC-AGI | General Knowledge | 65.70 (Thinking Level · High) | 35.70 (Standard Mode) | 37.00 (Thinking Enabled) |
| ARC-AGI-2 | General Knowledge | 9.90 (Thinking Level · High) | 8.60 (Standard Mode) | 4.90 (Thinking Enabled) |
| GPQA Diamond | General Knowledge | 87.30 (Thinking Enabled + Tools) | 79.60 (Standard Mode) | 86.40 (Thinking Enabled) |
| HLE | General Knowledge | 35.20 (Thinking Enabled + Tools) | 10.70 (Standard Mode) | 21.60 (Thinking Enabled) |
| CodeClash | Coding and Software Engineering | 1360.00 (Standard Mode + Tools) | -- | 1125.00 (Standard Mode + Tools) |
| SWE-bench Verified | Coding and Software Engineering | 72.80 (Thinking Level · High) | 72.50 (Standard Mode) | 67.20 (Thinking Enabled) |
| AIME2025 | Math and Reasoning | 99.60 (Thinking Enabled + Tools) | 75.50 (Standard Mode) | 88.00 (Thinking Enabled) |
| FrontierMath | Math and Reasoning | 26.30 (Thinking Level · High + Tools) | 4.50 (Standard Mode) | 11.00 (Standard Mode) |
| IMO 2024 | Math and Reasoning | 11.00 (Thinking Enabled) | -- | 19.00 (Thinking Enabled) |
| IMO 2025 | Math and Reasoning | 29.00 (Thinking Enabled) | -- | 15.20 (Thinking Enabled) |
| IMO-ProofBench | Math and Reasoning | 59.00 (Thinking Enabled) | 2.90 (Thinking Enabled) | 55.20 (Thinking Enabled) |
| -- | -- | 20.00 (Thinking Enabled) | -- | 17.60 (Thinking Enabled) |
8 additional benchmarks remain in the chart above.

Standard API Pricing: GPT-5 vs. Peer Models

Shows standard text input and output pricing side by side for each model. If extended-context pricing exists, the chart keeps the base rate and explains the threshold below.

Source: DataLearnerAI. Standard text prices shown here use the default supplier.

Comparable standard text pricing is not available for these models.

Version History

How each version of the GPT-5 series stacks up on benchmark tests

The chart shows each model's highest score per benchmark within the current filter. Benchmarks scored out of 100 are drawn at their raw height; benchmarks on other scales are rescaled within that benchmark, and the labels keep the original scores.

9 benchmarks have comparable scores. Each model shows its best score together with the mode that produced it.

| Benchmark | Category | GPT-5 (current) | GPT-4.5 | GPT-4.1 | GPT-4o (2025-03-27) |
|---|---|---|---|---|---|
| ARC-AGI | General Knowledge | 65.70 (Thinking Level · High) | -- | -- | 8.80 (Standard Mode) |
| GPQA Diamond | General Knowledge | 87.30 (Thinking Enabled + Tools) | 71.40 (Standard Mode) | 66.30 (Standard Mode) | 66.90 (Standard Mode) |
| HLE | General Knowledge | 35.20 (Thinking Enabled + Tools) | -- | 3.70 (Standard Mode) | -- |
| SWE-bench Verified | Coding and Software Engineering | 72.80 (Thinking Level · High) | 38.00 (Standard Mode) | 54.60 (Standard Mode) | -- |
| AIME2025 | Math and Reasoning | 99.60 (Thinking Enabled + Tools) | -- | 36.70 (Standard Mode) | 26.70 (Standard Mode) |
| FrontierMath | Math and Reasoning | 26.30 (Thinking Level · High + Tools) | -- | 5.50 (Standard Mode) | -- |
| Simple Bench | Commonsense Reasoning | 56.70 (Thinking Level · High) | 34.50 (Standard Mode) | 27.00 (Standard Mode) | -- |
| Aider-Polyglot | Agent Level Benchmark | 88.00 (Thinking Level · High) | 44.90 (Standard Mode) | 52.40 (Standard Mode) | 45.30 (Standard Mode) |
| τ²-Bench | Agent Level Benchmark | 80.00 (Thinking Enabled + Tools) | -- | 54.70 (Standard Mode + Tools) | -- |
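The comparison views list one best score per model along with the mode that produced it. As a rough sketch of that selection step, the snippet below picks the highest-scoring run per model for a single benchmark. The `Result` type and `best_per_model` function are hypothetical names, not part of the site, and the sample rows reuse Aider-Polyglot figures reported on this page.

```python
from typing import Dict, List, NamedTuple


class Result(NamedTuple):
    """One evaluation row: a model run on a benchmark in a given mode."""
    model: str
    benchmark: str
    mode: str
    score: float


def best_per_model(results: List[Result], benchmark: str) -> Dict[str, Result]:
    """Keep, for each model, the highest-scoring run on `benchmark`."""
    best: Dict[str, Result] = {}
    for r in results:
        if r.benchmark != benchmark:
            continue
        if r.model not in best or r.score > best[r.model].score:
            best[r.model] = r
    return best


# Aider-Polyglot rows as reported on this page.
rows = [
    Result("GPT-5", "Aider-Polyglot", "Thinking Level · Low", 81.30),
    Result("GPT-5", "Aider-Polyglot", "Thinking Level · Medium", 86.70),
    Result("GPT-5", "Aider-Polyglot", "Thinking Level · High", 88.0),
    Result("GPT-4.1", "Aider-Polyglot", "Standard Mode", 52.40),
]

best = best_per_model(rows, "Aider-Polyglot")
for model, r in best.items():
    print(f"{model}: {r.score:.2f} ({r.mode})")
```

Keeping the whole `Result` rather than only the score is what lets the mode label travel with the best score.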

Single-Benchmark Version Trend

Viewing: ARC-AGI · General Knowledge

Modes: Normal, Normal + Tools, Thinking, Thinking + Tools, Deep, Deep + Tools

X-axis shows model and release date, Y-axis shows score; solid lines connect the same mode across versions, while dotted guides align modes within the same generation.

Standard API Pricing Across the GPT-5 Series

Shows standard text input and output pricing side by side for each model. If extended-context pricing exists, the chart keeps the base rate and explains the threshold below.

Source: DataLearnerAI. Standard text prices shown here use the default supplier.

Comparable standard text pricing is not available for these models.

Sources