GPT-5.1 Benchmark Details
GPT-5.1's strongest benchmark placements are currently MMMU (rank 2 of 28, score 85.40), GPQA Diamond (rank 19 of 165, score 88.10), and FrontierMath (rank 7 of 54, score 26.70). This page also compares it against 2 competitor models and 2 predecessor or same-series models, with performance and pricing views where available. 2 source links are attached for reference.
Benchmark Results
Competitor Comparison
Benchmark scores for GPT-5.1 compared against top models in its class
Benchmark Score Comparison
12 benchmarks with comparable scores
| Benchmark | GPT-5.1 (this model) | Claude Opus 4 | Gemini 2.5-Pro |
|---|---|---|---|
| ARC-AGI (General Evaluation) | 72.80 high | 35.70 normal | 37.00 thinking |
| ARC-AGI-2 (General Evaluation) | 17.60 high | 8.60 normal | 4.90 thinking |
| GPQA Diamond (General Evaluation) | 88.10 thinking | 79.60 normal | 86.40 thinking |
| HLE (General Evaluation) | 42.70 thinking High (tools + web) | 10.70 normal | 21.60 thinking |
| SWE-bench Verified (Coding & Software Engineering) | 76.30 high | 72.50 normal | 67.20 thinking |
| AIME2025 (Mathematical Reasoning) | 94.00 high | 75.50 normal | 88.00 thinking |
| FrontierMath (Mathematical Reasoning) | 26.70 thinking High (tools) | 4.50 normal | 11.00 normal |
| -- | 12.50 high | 4.20 thinking | 4.20 normal |
| MMMU (Multimodal Understanding) | 85.40 high | -- | 82.00 thinking |
| Simple Bench (Commonsense Reasoning) | 53.20 high | 58.80 thinking | 62.40 thinking |
| Terminal Bench Hard (Agent Capability) | 43.00 thinking High (tools) | -- | 25.00 thinking + tools |
| τ²-Bench - Telecom (Agent Capability) | 95.60 thinking High (tools) | -- | 54.00 thinking + tools |
Standard API Pricing: GPT-5.1 vs. Peer Models
Shows standard text input and output pricing side by side for each model. If extended-context pricing exists, the chart keeps the base rate and explains the threshold below.
Source: DataLearnerAI. Standard text prices shown here use the default supplier.
Prices are listed as raw values rather than plotted on a shared bar chart.
| Model | Supplier | Standard input | Standard output | Base price applies to |
|---|---|---|---|---|
| GPT-5.1 (current model) | — | $1.25 / 1M tokens | $10.00 / 1M tokens | — |
| Claude Opus 4 | — | $15.00 / 1M tokens | $75.00 / 1M tokens | — |
| Gemini 2.5-Pro | — | $1.25 / 1M tokens | $10.00 / 1M tokens | ≤ 200K tokens |
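Since these are flat per-million-token rates, estimating the cost of a call is simple arithmetic. A minimal sketch, assuming GPT-5.1's rates from the table above (the function name and example token counts are illustrative, not part of the source):

```python
def request_cost_usd(input_tokens: int, output_tokens: int,
                     input_rate_per_m: float = 1.25,
                     output_rate_per_m: float = 10.0) -> float:
    """Return the USD cost of one API call at flat per-1M-token rates.

    Defaults use GPT-5.1's standard pricing ($1.25 input / $10.00 output
    per 1M tokens); pass other rates to compare models.
    """
    return (input_tokens / 1_000_000) * input_rate_per_m \
         + (output_tokens / 1_000_000) * output_rate_per_m

# e.g. 20,000 input tokens and 2,000 output tokens on GPT-5.1:
print(round(request_cost_usd(20_000, 2_000), 4))  # 0.045
```

Note that Gemini 2.5-Pro's base rate only applies up to the 200K-token threshold; beyond that, its extended-context pricing would apply instead.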
Version History
How each version of the GPT-5.1 series stacks up on benchmark tests
Benchmark Score Comparison
12 benchmarks with comparable scores
| Benchmark | GPT-5.1 (this model) | GPT-5 | GPT-4.5 |
|---|---|---|---|
| ARC-AGI (General Evaluation) | 72.80 high | 65.70 high | -- |
| ARC-AGI-2 (General Evaluation) | 17.60 high | 9.90 high | -- |
| GPQA Diamond (General Evaluation) | 88.10 thinking | 87.30 thinking + tools | 71.40 normal |
| HLE (General Evaluation) | 42.70 thinking High (tools + web) | 35.20 thinking + tools | -- |
| IC SWE-Lancer (Diamond) (Coding & Software Engineering) | 69.70 thinking High (no tools) | -- | 32.60 normal |
| SWE-Bench Pro - Public (Coding & Software Engineering) | 50.80 thinking High (no tools) | 36.30 high | -- |
| SWE-bench Verified (Coding & Software Engineering) | 76.30 high | 72.80 high | 38.00 normal |
| AIME2025 (Mathematical Reasoning) | 94.00 high | 99.60 thinking + tools | -- |
| FrontierMath (Mathematical Reasoning) | 26.70 thinking High (tools) | 26.30 thinking High (tools) | -- |
| -- | 12.50 high | 12.50 high | -- |
| MMMU (Multimodal Understanding) | 85.40 high | 84.20 high | -- |
| Simple Bench (Commonsense Reasoning) | 53.20 high | 56.70 high | 34.50 normal |
Standard API Pricing Across the GPT-5.1 Series
Shows standard text input and output pricing side by side for each model. If extended-context pricing exists, the chart keeps the base rate and explains the threshold below.
| Model | Supplier | Standard input | Standard output | Base price applies to |
|---|---|---|---|---|
| GPT-5.1 (current model) | — | $1.25 / 1M tokens | $10.00 / 1M tokens | — |
| GPT-5 | — | $1.25 / 1M tokens | $10.00 / 1M tokens | — |
Series Overview
See how each version of the GPT-5.1 series performs across major benchmarks, with scores broken down by reasoning mode.