GPT-5.5 Benchmark Details

GPT-5.5's strongest benchmark placements are LiveBench (rank 1 of 115, score 80.71), Terminal Bench 2.0 (rank 1 of 47, score 82.70), and GPQA Diamond (rank 6 of 187, score 93.60). This page also compares it with three competitor models and three predecessor models from the same series, with performance and pricing views where available. Two source links are listed at the end.

Benchmark Results

General Knowledge

15 evaluations

Benchmark	Mode	Score	Rank/total
ARC-AGI	Low	76.20	26 / 68
ARC-AGI	Medium	92.20	12 / 68
ARC-AGI	High	94.50	7 / 68
ARC-AGI	Extra-High	--	5 / 68
GPQA Diamond	High	93.60	6 / 187
ARC-AGI-2	Low	33.30	31 / 62
ARC-AGI-2	Medium	70.40	14 / 62
ARC-AGI-2	High	--	2 / 62
ARC-AGI-2	Extra-High	--	2 / 62
LiveBench	Medium	68.66	44 / 115
LiveBench	High	76.24	9 / 115
LiveBench	Deep Thinking Mode	80.71	1 / 115
HLE	High	41.40	58 / 172
HLE	High + Tools	52.20	20 / 172
ARC-AGI-3	High	--	5 / 9
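Because the Rank/total column uses a different field size for each benchmark, raw ranks are hard to compare directly. The following is a minimal sketch, not part of the original page, that converts a rank and total into a "top X%" figure. The helper name `top_fraction` is hypothetical; the sample values are copied from the General Knowledge rows.

```python
def top_fraction(rank: int, total: int) -> float:
    """Return rank / total, i.e. the share of ranked entries at or above this rank."""
    if not 1 <= rank <= total:
        raise ValueError("rank must be between 1 and total")
    return rank / total


# (rank, total) pairs copied from the General Knowledge results.
results = {
    "GPQA Diamond (High)": (6, 187),
    "LiveBench (Deep Thinking Mode)": (1, 115),
    "HLE (High)": (58, 172),
}

for name, (rank, total) in results.items():
    print(f"{name}: top {top_fraction(rank, total):.1%}")
```

A smaller fraction means a stronger placement. Note that a rank of 5 out of 9 (ARC-AGI-3) says much less than a rank of 6 out of 187, which is why the total matters.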

Common Sense Reasoning

1 evaluation

Benchmark	Mode	Score	Rank/total
Simple Bench	Standard Mode	--	7 / 63

Math and Reasoning

3 evaluations

Benchmark	Mode	Score	Rank/total
FrontierMath	High + Tools	51.70	2 / 60
FrontierMath - Tier 4	High + Tools	35.40	7 / 80
FrontierMath - Tier 4	Extra-High	35.40	7 / 80

Coding and Software Engineering

2 evaluations

Benchmark	Mode	Score	Rank/total
DeepSWE	Extra-High + Tools	--	7 / 19
SWE-Bench Pro - Public	High + Tools	58.60	13 / 54

Agent Level Benchmark

1 evaluation

Benchmark	Mode	Score	Rank/total
τ²-Bench - Telecom	High + Tools	--	5 / 35

AI Agent - Information Search

1 evaluation

Benchmark	Mode	Score	Rank/total
BrowseComp	High + Tools + Internet	84.40	8 / 53

AI Agent - Tool Usage

4 evaluations

Benchmark	Mode	Score	Rank/total
TerminalBench 2.1	High + Tools	83.40	7 / 27
Terminal Bench 2.0	High + Tools	82.70	1 / 47
OSWorld-Verified	High + Tools	78.70	8 / 24
MCP-Atlas	Extra-High + Tools	75.30	12 / 27

Productivity Knowledge

1 evaluation

Benchmark	Mode	Score	Rank/total
GDPval-AA	High	1769	2 / 21

Compare with other models

Competitor Comparison

Benchmark scores for GPT-5.5 compared against top models in its class

Models: GPT-5.5, Opus 4.7, Claude Mythos Preview, Gemini 3.1 Pro Preview

The chart shows each model’s highest score per benchmark within the current filter. Out-of-100 benchmarks use raw heights; out-of-range benchmarks are scaled within that benchmark while labels keep the original scores.
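The scaling rule described above can be sketched as follows. This is an illustration under stated assumptions, not the site's actual charting code: it assumes a benchmark counts as "out-of-range" when any score exceeds 100, and that such benchmarks are scaled against the highest score in that benchmark. The function name `bar_height` and the second GDPval-AA score are hypothetical.

```python
def bar_height(score: float, benchmark_scores: list[float]) -> float:
    """Return a bar height on a 0-100 axis for one score.

    Assumption: benchmarks whose scores all fit in 0-100 are drawn at raw
    height; otherwise heights are scaled relative to the benchmark's top score.
    """
    peak = max(benchmark_scores)
    if peak <= 100:
        return score
    return score / peak * 100


# GPQA Diamond is an out-of-100 benchmark, so the raw score is used.
print(bar_height(93.60, [93.60, 94.20, 94.60, 94.30]))

# GDPval-AA is not out of 100. 1769 is GPT-5.5's score from this page;
# 1500 is a made-up second score used only to show the scaling.
print(bar_height(1769, [1769, 1500]))
print(bar_height(1500, [1769, 1500]))
```

In either case the label next to the bar would still show the original score; only the drawn height changes.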

12 benchmarks with comparable scores. Each model shows its best score; mode label is displayed below.
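The "best score per benchmark" selection can be illustrated with a short sketch. The rows are GPT-5.5 per-mode results taken from this page; the function name `best_per_benchmark` is hypothetical, and the page's own selection logic is not documented here.

```python
# (benchmark, mode, score) rows for GPT-5.5, copied from the per-mode results.
rows = [
    ("LiveBench", "Medium", 68.66),
    ("LiveBench", "High", 76.24),
    ("LiveBench", "Deep Thinking Mode", 80.71),
    ("HLE", "High", 41.40),
    ("HLE", "High + Tools", 52.20),
]


def best_per_benchmark(rows):
    """Keep the highest-scoring (mode, score) pair for each benchmark."""
    best = {}
    for benchmark, mode, score in rows:
        if benchmark not in best or score > best[benchmark][1]:
            best[benchmark] = (mode, score)
    return best


for benchmark, (mode, score) in best_per_benchmark(rows).items():
    print(f"{benchmark}: {score:.2f} ({mode})")
```

Because each cell may come from a different mode, two models' best scores on the same benchmark are not always a like-for-like comparison; the mode label next to each score shows which setting produced it.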

Benchmark	Category	GPT-5.5 (current)	Opus 4.7	Claude Mythos Preview	Gemini 3.1 Pro Preview
ARC-AGI	General Knowledge	95.00 (Thinking Level · Extra High)	93.50 (Thinking Level · High)	--	--
ARC-AGI-2	General Knowledge	85.00 (Thinking Level · Extra High)	75.80 (Thinking Level · High)	--	77.10 (Thinking Level · High)
GPQA Diamond	General Knowledge	93.60 (Thinking Level · High)	94.20 (Extended Thinking)	94.60 (Extended Thinking)	94.30 (Thinking Level · High)
HLE	General Knowledge	52.20 (Thinking Level · High + Tools)	54.70 (Extended Thinking + Tools)	64.70 (Extended Thinking + Tools)	51.40 (Thinking Level · High + Tools)
LiveBench	General Knowledge	80.71 (Deep Thinking Mode)	76.91 (Deep Thinking Mode)	--	79.93 (Thinking Level · High)
Simple Bench	Common Sense Reasoning	69.00 (Standard Mode)	61.70 (Standard Mode)	--	79.60 (Standard Mode)
FrontierMath	Math and Reasoning	51.70 (Thinking Level · High + Tools)	43.80 (Thinking Level · Extra High)	--	36.90 (Thinking Level · High)
FrontierMath - Tier 4	Math and Reasoning	35.40 (Thinking Level · Extra High)	22.90 (Thinking Level · Extra High)	--	16.70 (Standard Mode)
DeepSWE	Coding and Software Engineering	67.00 (Thinking Level · Extra High + Tools)	--	--	12.00 (Thinking Level · High + Tools)
SWE-Bench Pro - Public	Coding and Software Engineering	58.60 (Thinking Level · High + Tools)	64.30 (Extended Thinking + Tools)	77.80 (Extended Thinking + Tools)	54.20 (Thinking Level · High + Tools)
τ²-Bench - Telecom	Agent Level Benchmark	98.00 (Thinking Level · High + Tools)	--	--	99.30 (Thinking Level · High + Tools)
BrowseComp	AI Agent - Information Search	84.40 (Thinking Level · High + Tools)	79.30 (Extended Thinking + Tools)	84.90 (Extended Thinking + Tools)	85.90 (Thinking Level · High + Tools)

4 additional benchmarks remain in the chart above.

Standard API Pricing: GPT-5.5 vs. Peer Models

Shows standard text input and output pricing side by side for each model. If extended-context pricing exists, the chart keeps the base rate and explains the threshold below.

Source: DataLearnerAI. Standard text prices shown here use the default supplier. · USD / 1M tokens

When a context threshold exists, the charted base price only applies within these limits:

Gemini 3.1 Pro Preview: Base price applies to <= 200K

Model	Supplier	Standard input	Standard output	Base price applies to
GPT-5.5	OpenAI	$5 / 1M tokens	$30 / 1M tokens	—
Opus 4.7	Anthropic	$5 / 1M tokens	$25 / 1M tokens	—
Claude Mythos Preview	Anthropic	$25 / 1M tokens	$125 / 1M tokens	—
Gemini 3.1 Pro Preview	Google DeepMind	$2 / 1M tokens	$12 / 1M tokens	<= 200K
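To make the per-million-token prices concrete, here is a minimal sketch that estimates the cost of a single request at the standard rates listed above. The function name `request_cost` and the example token counts are hypothetical. The sketch ignores extended-context pricing, caching discounts, and any other billing rules not shown on this page.

```python
# (input, output) standard prices in USD per 1M tokens, from the table above.
PRICES_USD_PER_1M = {
    "GPT-5.5": (5.0, 30.0),
    "Opus 4.7": (5.0, 25.0),
    "Claude Mythos Preview": (25.0, 125.0),
    "Gemini 3.1 Pro Preview": (2.0, 12.0),  # base rate; listed as applying to <= 200K
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request at the standard base rates."""
    input_price, output_price = PRICES_USD_PER_1M[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical workload: 100,000 input tokens and 10,000 output tokens.
for model in PRICES_USD_PER_1M:
    print(f"{model}: ${request_cost(model, 100_000, 10_000):.2f}")
```

Output tokens cost five to six times as much as input tokens for every model in the table, so workloads with long generated answers or long reasoning traces are dominated by the output rate.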

Version History

How each version of the GPT-5.5 series stacks up on benchmark tests

Models: GPT-5.5, GPT-5.4, GPT-5.2, GPT-5.1

12 benchmarks with comparable scores. Each model shows its best score; mode label is displayed below.

Benchmark	Category	GPT-5.5 (current)	GPT-5.4	GPT-5.2	GPT-5.1
ARC-AGI	General Knowledge	95.00 (Thinking Level · Extra High)	93.70 (Thinking Level · Extra High)	90.50 (Deep Thinking Mode)	72.80 (Thinking Level · High)
ARC-AGI-2	General Knowledge	85.00 (Thinking Level · Extra High)	77.10 (Standard Mode)	54.20 (Deep Thinking Mode)	17.60 (Thinking Level · High)
GPQA Diamond	General Knowledge	93.60 (Thinking Level · High)	92.80 (Thinking Level · Extra High)	93.20 (Deep Thinking Mode)	88.10 (Thinking Enabled)
HLE	General Knowledge	52.20 (Thinking Level · High + Tools)	52.10 (Thinking Level · Extra High + Tools)	45.50 (Deep Thinking Mode + Tools)	42.70 (Thinking Level · High + Tools)
LiveBench	General Knowledge	80.71 (Deep Thinking Mode)	80.28 (Deep Thinking Mode)	74.84 (Thinking Level · High)	72.04 (Thinking Level · High)
Simple Bench	Common Sense Reasoning	69.00 (Standard Mode)	--	45.80 (Thinking Level · High)	53.20 (Thinking Level · High)
FrontierMath	Math and Reasoning	51.70 (Thinking Level · High + Tools)	47.60 (Thinking Level · Extra High)	40.30 (Thinking Level · Extra High + Tools)	26.70 (Thinking Level · High + Tools)
FrontierMath - Tier 4	Math and Reasoning	35.40 (Thinking Level · Extra High)	27.10 (Thinking Level · Extra High)	18.80 (Thinking Level · Extra High)	12.50 (Thinking Level · High + Tools)
DeepSWE	Coding and Software Engineering	67.00 (Thinking Level · Extra High + Tools)	52.00 (Thinking Level · Extra High + Tools)	--	--
SWE-Bench Pro - Public	Coding and Software Engineering	58.60 (Thinking Level · High + Tools)	57.70 (Thinking Level · Extra High)	55.60 (Thinking Level · Extra High + Tools)	50.80 (Thinking Level · High)
τ²-Bench - Telecom	Agent Level Benchmark	98.00 (Thinking Level · High + Tools)	98.90 (Thinking Level · Extra High + Tools)	98.70 (Thinking Level · Extra High + Tools)	95.60 (Thinking Level · High + Tools)
BrowseComp	AI Agent - Information Search	84.40 (Thinking Level · High + Tools)	82.70 (Thinking Level · Extra High + Tools)	65.80 (Thinking Level · Extra High + Tools)	50.80 (Thinking Level · High)

4 additional benchmarks remain in the chart above.
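One way to read the version history is to compute the change in best score from each release to the next. The sketch below does this for two benchmarks using the values in the table above; the helper name `deltas` is hypothetical. Since each version's best score may come from a different mode (for example, Standard Mode for GPT-5.4 on ARC-AGI-2 versus Extra High for GPT-5.5), these differences are a rough trend, not a controlled comparison.

```python
# Oldest to newest, matching the columns of the version table read right to left.
versions = ["GPT-5.1", "GPT-5.2", "GPT-5.4", "GPT-5.5"]

# Best scores per version, copied from the table above.
best_scores = {
    "ARC-AGI-2": [17.60, 54.20, 77.10, 85.00],
    "τ²-Bench - Telecom": [95.60, 98.70, 98.90, 98.00],
}


def deltas(scores):
    """Return the score change between each consecutive pair of versions."""
    return [round(later - earlier, 2) for earlier, later in zip(scores, scores[1:])]


for benchmark, scores in best_scores.items():
    steps = zip(versions, versions[1:], deltas(scores))
    summary = ", ".join(f"{a} -> {b}: {d:+.2f}" for a, b, d in steps)
    print(f"{benchmark}: {summary}")
```

The two benchmarks show different patterns: ARC-AGI-2 gains at every step, while τ²-Bench - Telecom is already near its ceiling and dips slightly from GPT-5.4 to GPT-5.5.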

Single-Benchmark Version Trend

Viewing: ARC-AGI (General Knowledge)

Chart modes: Normal, Normal + Tools, Thinking, Thinking + Tools, Deep, Deep + Tools

The x-axis shows model and release date and the y-axis shows score. Solid lines connect the same mode across versions; dotted guides align modes within the same generation.

Standard API Pricing Across the GPT-5.5 Series

Shows standard text input and output pricing side by side for each model. If extended-context pricing exists, the chart keeps the base rate and explains the threshold below.

Source: DataLearnerAI. Standard text prices shown here use the default supplier. · USD / 1M tokens

When a context threshold exists, the charted base price only applies within these limits:

GPT-5.4: Base price applies to <= 272K

Model	Supplier	Standard input	Standard output	Base price applies to
GPT-5.5	OpenAI	$5 / 1M tokens	$30 / 1M tokens	—
GPT-5.4	OpenAI	$2.5 / 1M tokens	$15 / 1M tokens	<= 272K
GPT-5.2	OpenAI	$1.75 / 1M tokens	$14 / 1M tokens	—
GPT-5.1	OpenAI	$1.25 / 1M tokens	$10 / 1M tokens	—
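The context threshold matters when estimating costs: the base rate for GPT-5.4 is listed as applying only up to 272K, and this page does not give the rate above that limit. The sketch below therefore refuses to price a request past the threshold. It assumes "272K" means 272,000 tokens and that the limit is measured on input tokens; both are assumptions, and the name `base_cost` is hypothetical.

```python
# Threshold from the note above; "272K" is read as 272,000 tokens (assumption).
BASE_PRICE_LIMIT = {"GPT-5.4": 272_000}

# (input, output) base prices in USD per 1M tokens, from the table above.
BASE_PRICE = {
    "GPT-5.5": (5.0, 30.0),
    "GPT-5.4": (2.5, 15.0),
    "GPT-5.2": (1.75, 14.0),
    "GPT-5.1": (1.25, 10.0),
}


def base_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate USD cost at the base rate, or raise if the base rate does not apply."""
    limit = BASE_PRICE_LIMIT.get(model)
    if limit is not None and input_tokens > limit:
        # The extended-context rate is not listed on this page, so do not guess.
        raise ValueError(f"{model}: base price only applies up to {limit} tokens")
    input_price, output_price = BASE_PRICE[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


print(base_cost("GPT-5.4", 200_000, 1_000))  # within the threshold
print(base_cost("GPT-5.5", 300_000, 1_000))  # no threshold listed for GPT-5.5
```

Raising an error instead of silently applying the base rate keeps an estimate from understating the cost of long-context requests.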

Sources

openai.com
arcprize.org