Benchmark Results
Comprehensive evaluation: 16 evaluations
Coding and software engineering: 2 evaluations
Mathematical reasoning: 12 evaluations
Agent capability evaluation: 3 evaluations
Competitor Comparison
Benchmark scores for GPT-5 compared against top models in its class
12 benchmarks with comparable scores. Each model shows its best score, with the evaluation mode shown alongside it.
| Benchmark | Category | GPT-5 (current) | Claude Opus 4 | Gemini 2.5-Pro |
|---|---|---|---|---|
| ARC-AGI | Comprehensive evaluation | 65.70 (Thinking Level: High) | 35.70 (Standard Mode) | 37.00 (Thinking Enabled) |
| ARC-AGI-2 | Comprehensive evaluation | 9.90 (Thinking Level: High) | 8.60 (Standard Mode) | 4.90 (Thinking Enabled) |
| GPQA Diamond | Comprehensive evaluation | 87.30 (Thinking Enabled, Tools) | 79.60 (Standard Mode) | 86.40 (Thinking Enabled) |
| HLE | Comprehensive evaluation | 35.20 (Thinking Enabled, Tools) | 10.70 (Standard Mode) | 21.60 (Thinking Enabled) |
| LiveBench | Comprehensive evaluation | 79.33 (Thinking Level: High) | -- | 71.92 (Thinking Enabled) |
| SWE-bench Verified | Coding and software engineering | 72.80 (Thinking Level: High) | 72.50 (Standard Mode) | 67.20 (Thinking Enabled) |
| AIME2025 | Mathematical reasoning | 99.60 (Thinking Enabled, Tools) | 75.50 (Standard Mode) | 88.00 (Thinking Enabled) |
| FrontierMath | Mathematical reasoning | 26.30 (Thinking Level: High, Tools) | 4.50 (Standard Mode) | 11.00 (Standard Mode) |
|  |  | 12.50 (Thinking Level: High) | 4.20 (Thinking Enabled) | 2.10 (Standard Mode) |
| IMO 2024 | Mathematical reasoning | 11.00 (Thinking Enabled) | -- | 19.00 (Thinking Enabled) |
| IMO 2025 | Mathematical reasoning | 29.00 (Thinking Enabled) | -- | 15.20 (Thinking Enabled) |
| IMO-ProofBench | Mathematical reasoning | 59.00 (Thinking Enabled) | 2.90 (Thinking Enabled) | 55.20 (Thinking Enabled) |
Standard API Pricing: GPT-5 vs. Peer Models
Shows standard text input and output pricing side by side for each model. If extended-context pricing exists, the chart keeps the base rate and explains the threshold below.
Source: DataLearnerAI. Standard text prices shown here use the default supplier.
These models use different currencies or billing units, so the page falls back to raw price values instead of a shared bar chart.
| Model | Supplier | Standard input (USD per 1M tokens) | Standard output (USD per 1M tokens) | Base price applies to |
|---|---|---|---|---|
| GPT-5 | -- | 1.25 | 10 | -- |
| Claude Opus 4 | -- | 15 | 75 | -- |
| Gemini 2.5-Pro | -- | 1.25 | 10 | <= 200K |
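To make the per-million-token rates easier to compare, here is a minimal sketch that estimates the cost of a single request from the standard prices listed in the table above. The function name `request_cost` and the example workload (10,000 input tokens, 2,000 output tokens) are illustrative assumptions, not part of the source page. The sketch uses only the base rate and does not model any extended-context pricing tier.

```python
# Standard prices in USD per 1M tokens, copied from the pricing table:
# (input price, output price)
PRICES = {
    "GPT-5": (1.25, 10.0),
    "Claude Opus 4": (15.0, 75.0),
    "Gemini 2.5-Pro": (1.25, 10.0),  # base rate only (listed as applying to <= 200K)
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request at the standard (base) rate."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 10,000 input tokens and 2,000 output tokens.
    for model in PRICES:
        print(f"{model}: ${request_cost(model, 10_000, 2_000):.4f}")
```

Under this assumed workload the output tokens account for more of the cost than the input tokens for GPT-5, even though there are five times fewer of them, because the output rate is eight times the input rate.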
Version History
How each version of the GPT-5 series stacks up on benchmark tests
8 benchmarks with comparable scores. Each model shows its best score, with the evaluation mode shown alongside it.
| Benchmark | Category | GPT-5 (current) | GPT-4.5 | GPT-4.1 | GPT-4o (2025-03-27) |
|---|---|---|---|---|---|
| ARC-AGI | Comprehensive evaluation | 65.70 (Thinking Level: High) | -- | -- | 8.80 (Standard Mode) |
| GPQA Diamond | Comprehensive evaluation | 87.30 (Thinking Enabled, Tools) | 71.40 (Standard Mode) | 66.30 (Standard Mode) | 66.90 (Standard Mode) |
| HLE | Comprehensive evaluation | 35.20 (Thinking Enabled, Tools) | -- | 3.70 (Standard Mode) | -- |
| SWE-bench Verified | Coding and software engineering | 72.80 (Thinking Level: High) | 38.00 (Standard Mode) | 54.60 (Standard Mode) | -- |
| AIME2025 | Mathematical reasoning | 99.60 (Thinking Enabled, Tools) | -- | 36.70 (Standard Mode) | 26.70 (Standard Mode) |
| FrontierMath | Mathematical reasoning | 26.30 (Thinking Level: High, Tools) | -- | 5.50 (Standard Mode) | -- |
| Simple Bench | Commonsense reasoning | 56.70 (Thinking Level: High) | 34.50 (Standard Mode) | 27.00 (Standard Mode) | -- |
| τ²-Bench | Agent capability evaluation | 80.00 (Thinking Enabled, Tools) | -- | 54.70 (Standard Mode, Tools) | -- |
Single-Benchmark Version Trend (chart): ARC-AGI, comprehensive evaluation
Standard API Pricing Across the GPT-5 Series
| Model | Supplier | Standard input (USD per 1M tokens) | Standard output (USD per 1M tokens) | Base price applies to |
|---|---|---|---|---|
| GPT-5 | -- | 1.25 | 10 | -- |
| GPT-4.1 | -- | 2 | 8 | -- |
| GPT-4o (2025-03-27) | -- | 2.5 | 10 | -- |