Benchmark Results
General evaluation: 12 evaluations
Programming and software engineering: 4 evaluations
Mathematical reasoning: 6 evaluations
Agent capability evaluation: 2 evaluations
Competitor Comparison
Benchmark scores for GPT-5.1 compared against top models in its class
12 benchmarks with comparable scores. Each model shows its best score, and the mode used for that score is listed with it.
| Benchmark | Category | GPT-5.1 (current) | Claude Opus 4 | Gemini 2.5-Pro |
|---|---|---|---|---|
| ARC-AGI | General evaluation | 72.80 (Thinking Level: High) | 35.70 (Standard Mode) | 37.00 (Thinking Enabled) |
| ARC-AGI-2 | General evaluation | 17.60 (Thinking Level: High) | 8.60 (Standard Mode) | 4.90 (Thinking Enabled) |
| GPQA Diamond | General evaluation | 88.10 (Thinking Enabled) | 79.60 (Standard Mode) | 86.40 (Thinking Enabled) |
| HLE | General evaluation | 42.70 (Thinking Level: High, Tools) | 10.70 (Standard Mode) | 21.60 (Thinking Enabled) |
| SWE-bench Verified | Programming and software engineering | 76.30 (Thinking Level: High) | 72.50 (Standard Mode) | 67.20 (Thinking Enabled) |
| AIME2025 | Mathematical reasoning | 94.00 (Thinking Level: High) | 75.50 (Standard Mode) | 88.00 (Thinking Enabled) |
| FrontierMath | Mathematical reasoning | 26.70 (Thinking Level: High, Tools) | 4.50 (Standard Mode) | 11.00 (Standard Mode) |
| (benchmark name missing in source) | (missing) | 12.50 (Thinking Level: High, Tools) | 4.20 (Thinking Enabled) | 2.10 (Standard Mode) |
| MMMU | Multimodal understanding | 85.40 (Thinking Level: High) | -- | 82.00 (Thinking Enabled) |
| Simple Bench | Commonsense reasoning | 53.20 (Thinking Level: High) | 58.80 (Thinking Enabled) | 62.40 (Thinking Enabled) |
| Terminal Bench Hard | Agent capability evaluation | 43.00 (Thinking Level: High, Tools) | -- | 25.00 (Thinking Enabled, Tools) |
| τ²-Bench - Telecom | Agent capability evaluation | 95.60 (Thinking Level: High, Tools) | -- | 54.00 (Thinking Enabled, Tools) |
Standard API Pricing: GPT-5.1 vs. Peer Models
The table shows standard text input and output pricing side by side for each model. Where extended-context pricing exists, the base rate is kept and the threshold is given in the last column.
Source: DataLearnerAI. Standard text prices shown here use the default supplier.
| Model | Supplier | Standard input | Standard output | Base price applies to |
|---|---|---|---|---|
| GPT-5.1 | -- | 1.25 USD / 1M tokens | 10 USD / 1M tokens | -- |
| Claude Opus 4 | -- | 15 USD / 1M tokens | 75 USD / 1M tokens | -- |
| Gemini 2.5-Pro | -- | 1.25 USD / 1M tokens | 10 USD / 1M tokens | <= 200K |
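To make the per-million-token rates concrete, here is a minimal sketch of the arithmetic for a single request. The prices are copied from the pricing table; the function name `request_cost_usd` and the workload of 100,000 input tokens and 10,000 output tokens are hypothetical, chosen only for illustration. The Gemini 2.5-Pro entry uses the base rate, which the table lists as applying up to 200K.

```python
# Standard prices in USD per 1 million tokens, copied from the pricing table.
PRICES_USD_PER_MILLION = {
    "GPT-5.1": {"input": 1.25, "output": 10.0},
    "Claude Opus 4": {"input": 15.0, "output": 75.0},
    "Gemini 2.5-Pro": {"input": 1.25, "output": 10.0},  # base rate (<= 200K)
}


def request_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the cost in USD of one request at the standard rates."""
    price = PRICES_USD_PER_MILLION[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


# Hypothetical workload: 100,000 input tokens and 10,000 output tokens.
for model in PRICES_USD_PER_MILLION:
    cost = request_cost_usd(model, 100_000, 10_000)
    print(f"{model}: {cost:.3f} USD")
```

For this assumed workload the listed rates make the Claude Opus 4 request ten times the cost of the GPT-5.1 request (2.25 USD against 0.225 USD).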
Version History
How each version of the GPT-5.1 series stacks up on benchmark tests
12 benchmarks with comparable scores. Each model shows its best score, and the mode used for that score is listed with it.
| Benchmark | Category | GPT-5.1 (current) | GPT-5 | GPT-4.5 |
|---|---|---|---|---|
| ARC-AGI | General evaluation | 72.80 (Thinking Level: High) | 65.70 (Thinking Level: High) | -- |
| ARC-AGI-2 | General evaluation | 17.60 (Thinking Level: High) | 9.90 (Thinking Level: High) | -- |
| GPQA Diamond | General evaluation | 88.10 (Thinking Enabled) | 87.30 (Thinking Enabled, Tools) | 71.40 (Standard Mode) |
| HLE | General evaluation | 42.70 (Thinking Level: High, Tools) | 35.20 (Thinking Enabled, Tools) | -- |
| IC SWE-Lancer (Diamond) | Programming and software engineering | 69.70 (Thinking Level: High) | -- | 32.60 (Standard Mode) |
| SWE-Bench Pro - Public | Programming and software engineering | 50.80 (Thinking Level: High) | 36.30 (Thinking Level: High) | -- |
| SWE-bench Verified | Programming and software engineering | 76.30 (Thinking Level: High) | 72.80 (Thinking Level: High) | 38.00 (Standard Mode) |
| AIME2025 | Mathematical reasoning | 94.00 (Thinking Level: High) | 99.60 (Thinking Enabled, Tools) | -- |
| FrontierMath | Mathematical reasoning | 26.70 (Thinking Level: High, Tools) | 26.30 (Thinking Level: High, Tools) | -- |
| (benchmark name missing in source) | (missing) | 12.50 (Thinking Level: High, Tools) | 12.50 (Thinking Level: High) | -- |
| MMMU | Multimodal understanding | 85.40 (Thinking Level: High) | 84.20 (Thinking Level: High) | -- |
| Simple Bench | Commonsense reasoning | 53.20 (Thinking Level: High) | 56.70 (Thinking Level: High) | 34.50 (Standard Mode) |
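As a rough illustration of how the version-to-version change can be read off the table, the sketch below computes the GPT-5.1 minus GPT-5 difference for each named benchmark where both scores are listed. The scores are copied from the table; the helper name `score_deltas` is hypothetical. Note that some pairs were measured in different modes (for example, AIME2025 lists GPT-5 with tools enabled), so the differences are indicative rather than a like-for-like comparison.

```python
# (GPT-5.1, GPT-5) scores copied from the version-history table.
# Rows with a missing score ("--") or a missing benchmark name are omitted.
SCORES = {
    "ARC-AGI": (72.80, 65.70),
    "ARC-AGI-2": (17.60, 9.90),
    "GPQA Diamond": (88.10, 87.30),
    "HLE": (42.70, 35.20),
    "SWE-Bench Pro - Public": (50.80, 36.30),
    "SWE-bench Verified": (76.30, 72.80),
    "AIME2025": (94.00, 99.60),
    "FrontierMath": (26.70, 26.30),
    "MMMU": (85.40, 84.20),
    "Simple Bench": (53.20, 56.70),
}


def score_deltas(scores: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Return new-minus-old score differences, rounded to two decimals."""
    return {name: round(new - old, 2) for name, (new, old) in scores.items()}


deltas = score_deltas(SCORES)

# Print the largest gains first; negative values are regressions.
for name, delta in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<24} {delta:+.2f}")

regressions = sorted(name for name, delta in deltas.items() if delta < 0)
```

On the listed numbers, the largest gain is on SWE-Bench Pro - Public, and the two benchmarks where GPT-5.1 scores below GPT-5 are AIME2025 and Simple Bench.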
Single-Benchmark Version Trend
[Chart not reproduced: version trend for ARC-AGI (general evaluation)]
Standard API Pricing Across the GPT-5.1 Series
| Model | Supplier | Standard input | Standard output | Base price applies to |
|---|---|---|---|---|
| GPT-5.1 | -- | 1.25 USD / 1M tokens | 10 USD / 1M tokens | -- |
| GPT-5 | -- | 1.25 USD / 1M tokens | 10 USD / 1M tokens | -- |