GPT-5.5 Benchmark Details
GPT-5.5's leading benchmark results are ARC-AGI-2 (rank 1 of 49, score 85), Terminal Bench 2.0 (rank 1 of 37, score 82.70), and FrontierMath (rank 2 of 57, score 51.70). This page also compares it with 3 competitor models and 3 predecessor or same-series models, with performance and pricing views where available. One source link is attached for reference.
Benchmark Results
General Evaluation: 5 evaluations
Mathematical Reasoning: 2 evaluations
AI Agent - Tool Use: 2 evaluations
Competitor Comparison
Benchmark scores for GPT-5.5 compared against top models in its class
Benchmark Score Comparison
10 benchmarks with comparable scores. Each model shows its best score together with the mode label for that run.
| Benchmark | Category | GPT-5.5 (Current) | Opus 4.7 | Claude Mythos Preview | Gemini 3.1 Pro Preview |
|---|---|---|---|---|---|
| ARC-AGI-2 | General Evaluation | 85.00 (Thinking Level · High) | -- | -- | 77.10 (Thinking Level · High) |
| GPQA Diamond | General Evaluation | 93.60 (Thinking Level · High) | 94.20 (Extended Thinking) | 94.60 (Extended Thinking) | 94.30 (Thinking Level · High) |
| HLE | General Evaluation | 52.20 (Thinking Level · High, Tools) | 54.70 (Extended Thinking, Tools) | 64.70 (Extended Thinking, Tools) | 51.40 (Thinking Level · High, Tools) |
| FrontierMath | Mathematical Reasoning | 51.70 (Thinking Level · High, Tools) | 43.80 (Thinking Level · Extra High) | -- | -- |
| (unlabeled) | -- | 35.40 (Thinking Level · High, Tools) | 22.90 (Thinking Level · Extra High) | -- | -- |
| SWE-Bench Pro - Public | Coding and Software Engineering | 58.60 (Thinking Level · High, Tools) | 64.30 (Extended Thinking, Tools) | 77.80 (Extended Thinking, Tools) | 54.20 (Thinking Level · High, Tools) |
| τ²-Bench - Telecom | Agent Capability Evaluation | 98.00 (Thinking Level · High, Tools) | -- | -- | 99.30 (Thinking Level · High, Tools) |
| BrowseComp | AI Agent - Information Gathering | 84.40 (Thinking Level · High, Tools) | 79.30 (Extended Thinking, Tools) | 84.90 (Extended Thinking, Tools) | 85.90 (Thinking Level · High, Tools) |
| OSWorld-Verified | AI Agent - Tool Use | 78.70 (Thinking Level · High, Tools) | 78.00 (Extended Thinking, Tools) | 79.60 (Extended Thinking, Tools) | -- |
| Terminal Bench 2.0 | AI Agent - Tool Use | 82.70 (Thinking Level · High, Tools) | 69.40 (Extended Thinking, Tools) | 82.00 (Extended Thinking, Tools) | 68.50 (Thinking Level · High, Tools) |
Standard API Pricing: GPT-5.5 vs. Peer Models
Standard text input and output prices are shown side by side for each model. If extended-context pricing exists, only the base rate is listed.
Source: DataLearnerAI. Standard text prices use the default supplier and are in USD per 1M tokens.
When a context threshold exists, the base price applies only within these limits:
| Model | Supplier | Standard input | Standard output | Base price applies to |
|---|---|---|---|---|
| GPT-5.5 | OpenAI | $5 / 1M tokens | $30 / 1M tokens | -- |
| Opus 4.7 | Anthropic | $5 / 1M tokens | $25 / 1M tokens | -- |
| Claude Mythos Preview | Anthropic | $25 / 1M tokens | $125 / 1M tokens | -- |
| Gemini 3.1 Pro Preview | Google DeepMind | $2 / 1M tokens | $12 / 1M tokens | <= 200K tokens |
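To make the per-million-token rates concrete, the following minimal sketch estimates the cost of a single request from the standard prices listed above. The names `PRICES` and `estimate_cost_usd` and the example token counts are illustrative assumptions, not part of the source. The sketch ignores extended-context pricing, so for Gemini 3.1 Pro Preview it is only meaningful for requests within the listed 200K threshold.

```python
# Standard prices in USD per 1M tokens, copied from the pricing table:
# model name -> (input price, output price).
PRICES = {
    "GPT-5.5": (5.0, 30.0),
    "Opus 4.7": (5.0, 25.0),
    "Claude Mythos Preview": (25.0, 125.0),
    "Gemini 3.1 Pro Preview": (2.0, 12.0),
}


def estimate_cost_usd(model, input_tokens, output_tokens):
    """Estimate the base-rate cost of one request in USD.

    Hypothetical helper: it applies only the standard (base) rates and
    does not model any extended-context surcharge.
    """
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


if __name__ == "__main__":
    # Example workload (assumed): 10,000 input tokens and 2,000 output tokens.
    for model in PRICES:
        cost = estimate_cost_usd(model, 10_000, 2_000)
        print(f"{model}: ${cost:.4f}")
```

For the assumed workload of 10,000 input and 2,000 output tokens, this works out to $0.11 for GPT-5.5 and $0.10 for Opus 4.7 at the listed rates.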
Version History
How each version of the GPT-5.5 series stacks up on benchmark tests
Benchmark Score Comparison
12 benchmarks with comparable scores. Each model shows its best score together with the mode label for that run.
| Benchmark | Category | GPT-5.5 (Current) | GPT-5.4 | GPT-5.2 | GPT-5.1 |
|---|---|---|---|---|---|
| ARC-AGI | General Evaluation | 95.00 (Thinking Level · High) | 93.70 (Thinking Level · Extra High) | 90.50 (Deep Thinking Mode) | 72.80 (Thinking Level · High) |
| ARC-AGI-2 | General Evaluation | 85.00 (Thinking Level · High) | 77.10 (Standard Mode) | 54.20 (Deep Thinking Mode) | 17.60 (Thinking Level · High) |
| GPQA Diamond | General Evaluation | 93.60 (Thinking Level · High) | 92.80 (Thinking Level · Extra High) | 93.20 (Deep Thinking Mode) | 88.10 (Thinking Enabled) |
| HLE | General Evaluation | 52.20 (Thinking Level · High, Tools) | 52.10 (Thinking Level · Extra High, Tools) | 45.50 (Deep Thinking Mode, Tools) | 42.70 (Thinking Level · High, Tools) |
| FrontierMath | Mathematical Reasoning | 51.70 (Thinking Level · High, Tools) | 47.60 (Thinking Level · Extra High) | 40.30 (Thinking Level · Extra High, Tools) | 26.70 (Thinking Level · High, Tools) |
| (unlabeled) | -- | 35.40 (Thinking Level · High, Tools) | 27.10 (Thinking Level · Extra High) | 14.60 (Thinking Level · Extra High, Tools) | 12.50 (Thinking Level · High) |
| SWE-Bench Pro - Public | Coding and Software Engineering | 58.60 (Thinking Level · High, Tools) | 57.70 (Thinking Level · Extra High) | 55.60 (Thinking Level · Extra High, Tools) | 50.80 (Thinking Level · High) |
| τ²-Bench - Telecom | Agent Capability Evaluation | 98.00 (Thinking Level · High, Tools) | 98.90 (Thinking Level · Extra High, Tools) | 98.70 (Thinking Level · Extra High, Tools) | 95.60 (Thinking Level · High, Tools) |
| BrowseComp | AI Agent - Information Gathering | 84.40 (Thinking Level · High, Tools) | 82.70 (Thinking Level · Extra High, Tools) | 65.80 (Thinking Level · Extra High, Tools) | 50.80 (Thinking Level · High) |
| OSWorld-Verified | AI Agent - Tool Use | 78.70 (Thinking Level · High, Tools) | 75.00 (Thinking Level · Extra High, Tools) | -- | -- |
| Terminal Bench 2.0 | AI Agent - Tool Use | 82.70 (Thinking Level · High, Tools) | 75.10 (Thinking Level · Extra High, Tools) | -- | 47.60 (Thinking Level · High, Tools) |
| GDPval-AA | Productivity Knowledge | 84.90 (Thinking Level · High) | -- | 70.90 (Thinking Level · High, Tools) | -- |
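As a rough way to read the version table, the sketch below computes the score difference between GPT-5.5 and GPT-5.4 for each named benchmark where both have a score. The `SCORES` dictionary and the `score_deltas` helper are illustrative assumptions, not part of the source. The listed scores come from different modes (for example Thinking Level · High versus Extra High, with or without Tools), so these differences are not like-for-like comparisons.

```python
# Best scores copied from the version table: benchmark -> (GPT-5.5, GPT-5.4).
# The unlabeled row and GDPval-AA (no GPT-5.4 score) are left out.
SCORES = {
    "ARC-AGI": (95.00, 93.70),
    "ARC-AGI-2": (85.00, 77.10),
    "GPQA Diamond": (93.60, 92.80),
    "HLE": (52.20, 52.10),
    "FrontierMath": (51.70, 47.60),
    "SWE-Bench Pro - Public": (58.60, 57.70),
    "τ²-Bench - Telecom": (98.00, 98.90),
    "BrowseComp": (84.40, 82.70),
    "OSWorld-Verified": (78.70, 75.00),
    "Terminal Bench 2.0": (82.70, 75.10),
}


def score_deltas(scores):
    """Return {benchmark: GPT-5.5 score minus GPT-5.4 score}, rounded to 2 decimals."""
    return {name: round(new - old, 2) for name, (new, old) in scores.items()}


if __name__ == "__main__":
    deltas = score_deltas(SCORES)
    # Print the largest gains first; negative values are regressions.
    for name, delta in sorted(deltas.items(), key=lambda item: item[1], reverse=True):
        print(f"{name}: {delta:+.2f}")
```

With the table's values, τ²-Bench - Telecom is the only benchmark in this subset where GPT-5.5 scores lower than GPT-5.4 (98.00 versus 98.90).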
Single-Benchmark Version Trend
Viewing: ARC-AGI · General Evaluation
Standard API Pricing Across the GPT-5.5 Series
Standard text input and output prices are shown side by side for each model. If extended-context pricing exists, only the base rate is listed.
Source: DataLearnerAI. Standard text prices use the default supplier and are in USD per 1M tokens.
When a context threshold exists, the base price applies only within these limits:
| Model | Supplier | Standard input | Standard output | Base price applies to |
|---|---|---|---|---|
| GPT-5.5 | OpenAI | $5 / 1M tokens | $30 / 1M tokens | -- |
| GPT-5.4 | OpenAI | $2.5 / 1M tokens | $15 / 1M tokens | <= 272K tokens |
| GPT-5.2 | OpenAI | $1.75 / 1M tokens | $14 / 1M tokens | -- |
| GPT-5.1 | -- | $1.25 / 1M tokens | $10 / 1M tokens | -- |