GPT-5 Benchmark Details
GPT-5's benchmark results are led by Aider-Polyglot (1 / 59, score 88), AIME2025 (9 / 106, score 99.60), and IMO-ProofBench (2 / 16, score 59). This page also compares it with 2 competitor models and 3 predecessor or same-series models, including performance and pricing views when available. One source link is attached for reference.
Benchmark Results
General Knowledge: 14 evaluations
Coding and Software Engineering: 3 evaluations
Math and Reasoning: 12 evaluations
Agent-Level Benchmarks: 6 evaluations
Competitor Comparison
Benchmark scores for GPT-5 compared against top models in its class
12 benchmarks have comparable scores. Each model's best score is shown, with the evaluation mode noted alongside it.
| Benchmark | Category | GPT-5 (current) | Claude Opus 4 | Gemini 2.5-Pro |
|---|---|---|---|---|
| ARC-AGI | Comprehensive Evaluation | 65.70 (Thinking Level · High) | 35.70 (Standard Mode) | 37.00 (Thinking Enabled) |
| ARC-AGI-2 | Comprehensive Evaluation | 9.90 (Thinking Level · High) | 8.60 (Standard Mode) | 4.90 (Thinking Enabled) |
| GPQA Diamond | Comprehensive Evaluation | 87.30 (Thinking Enabled, Tools) | 79.60 (Standard Mode) | 86.40 (Thinking Enabled) |
| HLE | Comprehensive Evaluation | 35.20 (Thinking Enabled, Tools) | 10.70 (Standard Mode) | 21.60 (Thinking Enabled) |
| CodeClash | Coding and Software Engineering | 1360.00 (Standard Mode, Tools) | -- | 1125.00 (Standard Mode, Tools) |
| SWE-bench Verified | Coding and Software Engineering | 72.80 (Thinking Level · High) | 72.50 (Standard Mode) | 67.20 (Thinking Enabled) |
| AIME2025 | Math Reasoning | 99.60 (Thinking Enabled, Tools) | 75.50 (Standard Mode) | 88.00 (Thinking Enabled) |
| FrontierMath | Math Reasoning | 26.30 (Thinking Level · High, Tools) | 4.50 (Standard Mode) | 11.00 (Standard Mode) |
| IMO 2024 | Math Reasoning | 11.00 (Thinking Enabled) | -- | 19.00 (Thinking Enabled) |
| IMO 2025 | Math Reasoning | 29.00 (Thinking Enabled) | -- | 15.20 (Thinking Enabled) |
| IMO-ProofBench | Math Reasoning | 59.00 (Thinking Enabled) | 2.90 (Thinking Enabled) | 55.20 (Thinking Enabled) |
| | | 20.00 (Thinking Enabled) | -- | 17.60 (Thinking Enabled) |
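To make the comparison easier to read, the following minimal sketch computes which model leads each benchmark and by how much. It uses four rows copied from the competitor table. The names `SCORES` and `leader_and_margin` are hypothetical helpers introduced here for illustration and are not part of the page. Because the compared scores come from different modes (thinking, standard, with or without tools), the margins are indicative only.

```python
# Scores copied from the competitor comparison table.
# None marks a "--" cell (no comparable score listed).
SCORES = {
    "ARC-AGI": {"GPT-5": 65.70, "Claude Opus 4": 35.70, "Gemini 2.5-Pro": 37.00},
    "SWE-bench Verified": {"GPT-5": 72.80, "Claude Opus 4": 72.50, "Gemini 2.5-Pro": 67.20},
    "IMO 2024": {"GPT-5": 11.00, "Claude Opus 4": None, "Gemini 2.5-Pro": 19.00},
    "IMO-ProofBench": {"GPT-5": 59.00, "Claude Opus 4": 2.90, "Gemini 2.5-Pro": 55.20},
}


def leader_and_margin(row):
    """Return (leading model, margin over the runner-up).

    Models with a missing score are ignored. The margin is None when
    fewer than two models have a score.
    """
    ranked = sorted(
        ((score, model) for model, score in row.items() if score is not None),
        reverse=True,
    )
    top_score, top_model = ranked[0]
    if len(ranked) < 2:
        return top_model, None
    second_score, _ = ranked[1]
    return top_model, round(top_score - second_score, 2)


for benchmark, row in SCORES.items():
    model, margin = leader_and_margin(row)
    print(f"{benchmark}: {model} leads by {margin}")
```

On these rows, GPT-5 leads ARC-AGI and IMO-ProofBench, is nearly tied with Claude Opus 4 on SWE-bench Verified, and trails Gemini 2.5-Pro on IMO 2024.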
Standard API Pricing: GPT-5 vs. Peer Models
Shows standard text input and output pricing side by side for each model. If extended-context pricing exists, the chart keeps the base rate and explains the threshold below.
Source: DataLearnerAI. Standard text prices shown here use the default supplier.
Version History
How each version of the GPT-5 series stacks up on benchmark tests
9 benchmarks have comparable scores. Each model's best score is shown, with the evaluation mode noted alongside it.
| Benchmark | Category | GPT-5 (current) | GPT-4.5 | GPT-4.1 | GPT-4o (2025-03-27) |
|---|---|---|---|---|---|
| ARC-AGI | Comprehensive Evaluation | 65.70 (Thinking Level · High) | -- | -- | 8.80 (Standard Mode) |
| GPQA Diamond | Comprehensive Evaluation | 87.30 (Thinking Enabled, Tools) | 71.40 (Standard Mode) | 66.30 (Standard Mode) | 66.90 (Standard Mode) |
| HLE | Comprehensive Evaluation | 35.20 (Thinking Enabled, Tools) | -- | 3.70 (Standard Mode) | -- |
| SWE-bench Verified | Coding and Software Engineering | 72.80 (Thinking Level · High) | 38.00 (Standard Mode) | 54.60 (Standard Mode) | -- |
| AIME2025 | Math Reasoning | 99.60 (Thinking Enabled, Tools) | -- | 36.70 (Standard Mode) | 26.70 (Standard Mode) |
| FrontierMath | Math Reasoning | 26.30 (Thinking Level · High, Tools) | -- | 5.50 (Standard Mode) | -- |
| Simple Bench | Commonsense Reasoning | 56.70 (Thinking Level · High) | 34.50 (Standard Mode) | 27.00 (Standard Mode) | -- |
| Aider-Polyglot | Agent Capability Evaluation | 88.00 (Thinking Level · High) | 44.90 (Standard Mode) | 52.40 (Standard Mode) | 45.30 (Standard Mode) |
| τ²-Bench | Agent Capability Evaluation | 80.00 (Thinking Enabled, Tools) | -- | 54.70 (Standard Mode, Tools) | -- |
Single-Benchmark Version Trend
Viewing: ARC-AGI · Comprehensive Evaluation
Standard API Pricing Across the GPT-5 Series
Shows standard text input and output pricing side by side for each model. If extended-context pricing exists, the chart keeps the base rate and explains the threshold below.
Source: DataLearnerAI. Standard text prices shown here use the default supplier.