Benchmark Results

GPT-5.4

Comprehensive Evaluation

12 evaluations

Benchmark	Mode	Score	Rank / total
ARC-AGI	Standard Mode	93.70	7 / 65
ARC-AGI	Low	68.20	28 / 65
ARC-AGI	Medium	86.20	18 / 65
ARC-AGI	Extra-High	93.70	7 / 65
GPQA Diamond	Extra-High	92.80	9 / 175
ARC-AGI-2	Standard Mode	77.10	7 / 58
ARC-AGI-2	Low	29.20	29 / 58
ARC-AGI-2	Medium	55.40	18 / 58
ARC-AGI-2	Extra-High	--	10 / 58
HLE	Extra-High	39.80	45 / 149
HLE	Extra-High | Tools	52.10	11 / 149
ARC-AGI-3	High	--	4 / 6

Mathematical Reasoning

2 evaluations

Benchmark	Mode	Score	Rank / total
FrontierMath	Extra-High	47.60	5 / 60
FrontierMath - Tier 4	Extra-High	27.10	11 / 80

Coding and Software Engineering

1 evaluation

Benchmark	Mode	Score	Rank / total
SWE-Bench Pro - Public	Extra-High	57.70	6 / 36

Agent Capability Evaluation

2 evaluations

Benchmark	Mode	Score	Rank / total
τ²-Bench - Telecom	Standard Mode | Tools	64.30	30 / 35
τ²-Bench - Telecom	Extra-High | Tools	98.90	3 / 35

AI Agent - Information Gathering

1 evaluation

Benchmark	Mode	Score	Rank / total
BrowseComp	Extra-High | Tools	82.70	9 / 43

AI Agent - Tool Use

2 evaluations

Benchmark	Mode	Score	Rank / total
Terminal Bench 2.0	Extra-High | Tools	75.10	4 / 43
OSWorld-Verified	Extra-High | Tools	--	4 / 14

OpenClaw Agent Capability Comprehensive Evaluation

2 evaluations

Benchmark	Mode	Score	Rank / total
Claw Bench	Thinking Enabled | Tools	92.70	3 / 29
Pinch Bench	Thinking Enabled | Tools	90.50	1 / 37

Competitor Comparison

Benchmark scores for GPT-5.4 compared against top models in its class

Models compared: GPT-5.4, Gemini 3.1 Pro Preview, Claude Opus 4.6.

The chart shows each model’s highest score per benchmark within the current filter. See the table below for per-mode details.

9 benchmarks with comparable scores. Each model shows its best score, with the mode that produced it in parentheses.

Benchmark	Category	GPT-5.4 (current)	Gemini 3.1 Pro Preview	Claude Opus 4.6
ARC-AGI	Comprehensive Evaluation	93.70 (Standard Mode)	--	92.00 (Extended Thinking)
ARC-AGI-2	Comprehensive Evaluation	77.10 (Standard Mode)	77.10 (Thinking Level · High)	66.30 (Extended Thinking)
HLE	Comprehensive Evaluation	52.10 (Thinking Level · Extra High | Tools)	51.40 (Thinking Level · High | Tools)	53.00 (Extended Thinking | Tools)
FrontierMath - Tier 4	Mathematical Reasoning	27.10 (Thinking Level · Extra High)	16.70 (Standard Mode)	22.90 (Thinking Level · High)
τ²-Bench - Telecom	Agent Capability Evaluation	98.90 (Thinking Level · Extra High | Tools)	99.30 (Thinking Level · High | Tools)	99.25 (Extended Thinking | Tools)
BrowseComp	AI Agent - Information Gathering	82.70 (Thinking Level · Extra High | Tools)	85.90 (Thinking Level · High | Tools)	84.00 (Thinking Enabled | Tools)
OSWorld-Verified	AI Agent - Tool Use	75.00 (Thinking Level · Extra High | Tools)	--	72.70 (Extended Thinking | Tools)
Terminal Bench 2.0	AI Agent - Tool Use	75.10 (Thinking Level · Extra High | Tools)	68.50 (Thinking Level · High | Tools)	65.40 (Extended Thinking | Tools)
Pinch Bench	OpenClaw Agent Capability Comprehensive Evaluation	90.50 (Thinking Enabled | Tools)	86.70 (Thinking Enabled | Tools)	87.40 (Thinking Enabled | Tools)
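
The "best score per benchmark" rule used for the chart and the comparison table can be sketched as follows. This is only an illustration of the selection rule, not the site's actual code: the function name `best_per_benchmark` is hypothetical, the rows are a subset of the GPT-5.4 per-mode results listed on this page, and breaking ties in favor of the first row listed is an assumption (it happens to reproduce the "Standard Mode" label shown for ARC-AGI).

```python
# (benchmark, mode, score) rows copied from the GPT-5.4 per-mode results on this page.
ROWS = [
    ("ARC-AGI", "Standard Mode", 93.70),
    ("ARC-AGI", "Low", 68.20),
    ("ARC-AGI", "Medium", 86.20),
    ("ARC-AGI", "Extra-High", 93.70),
    ("HLE", "Extra-High", 39.80),
    ("HLE", "Extra-High | Tools", 52.10),
    ("τ²-Bench - Telecom", "Standard Mode | Tools", 64.30),
    ("τ²-Bench - Telecom", "Extra-High | Tools", 98.90),
]


def best_per_benchmark(rows):
    """Return {benchmark: (score, mode)} keeping the highest score per benchmark.

    Assumption: on a tie, the row listed first is kept.
    """
    best = {}
    for benchmark, mode, score in rows:
        # Strict ">" means a later row with an equal score does not replace the first.
        if benchmark not in best or score > best[benchmark][0]:
            best[benchmark] = (score, mode)
    return best


if __name__ == "__main__":
    for name, (score, mode) in best_per_benchmark(ROWS).items():
        print(f"{name}: {score:.2f} ({mode})")
```

Because each model's best score may come from a different mode, a row in the comparison table does not necessarily compare like with like.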

Standard API Pricing: GPT-5.4 vs. Peer Models

Shows standard text input and output pricing side by side for each model. If extended-context pricing exists, the chart keeps the base rate and explains the threshold below.

Source: DataLearnerAI. Standard text prices use each model's default supplier and are given in USD per 1M tokens.

When a context threshold exists, the charted base price only applies within these limits:

GPT-5.4: Base price applies to <= 272K

Gemini 3.1 Pro Preview: Base price applies to <= 200K

Claude Opus 4.6: Base price applies to <= 200K

Model	Supplier	Standard input	Standard output	Base price applies to
GPT-5.4	OpenAI	$2.5 / 1M tokens	$15 / 1M tokens	<= 272K
Gemini 3.1 Pro Preview	Google DeepMind	$2 / 1M tokens	$12 / 1M tokens	<= 200K
Claude Opus 4.6	Anthropic	$5 / 1M tokens	$25 / 1M tokens	<= 200K
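
To make the per-token prices concrete, here is a minimal sketch that estimates the cost of one request from the standard rates in the table. The function name `estimate_cost` and the example token counts are hypothetical. The sketch assumes the request stays within each model's base-price threshold, because this page does not list the extended-context rates.

```python
# Standard prices in USD per 1M tokens, copied from the pricing table on this page.
PRICES = {
    "GPT-5.4": {"input": 2.50, "output": 15.00},
    "Gemini 3.1 Pro Preview": {"input": 2.00, "output": 12.00},
    "Claude Opus 4.6": {"input": 5.00, "output": 25.00},
}

TOKENS_PER_UNIT = 1_000_000  # prices are quoted per 1M tokens


def estimate_cost(model, input_tokens, output_tokens):
    """Estimate the USD cost of one request at the standard (base) rate.

    Assumption: the request is within the model's base-price context threshold.
    """
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / TOKENS_PER_UNIT


if __name__ == "__main__":
    # Hypothetical request: 100,000 input tokens and 20,000 output tokens.
    for model in PRICES:
        print(f"{model}: ${estimate_cost(model, 100_000, 20_000):.2f}")
```

For this hypothetical request, the input and output portions are added separately, so a model with a cheaper input rate can still cost more overall if its output rate is higher and the response is long.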

Version History

How each version of the GPT-5.4 series stacks up on benchmark tests

Models compared: GPT-5.4, GPT-5.2, GPT-5.1.

The chart shows each model’s highest score per benchmark within the current filter. See the table below for per-mode details.

7 benchmarks with comparable scores. Each model shows its best score, with the mode that produced it in parentheses.

Benchmark	Category	GPT-5.4 (current)	GPT-5.2	GPT-5.1
ARC-AGI	Comprehensive Evaluation	93.70 (Standard Mode)	90.50 (Deep Thinking Mode)	72.80 (Thinking Level · High)
ARC-AGI-2	Comprehensive Evaluation	77.10 (Standard Mode)	54.20 (Deep Thinking Mode)	17.60 (Thinking Level · High)
HLE	Comprehensive Evaluation	52.10 (Thinking Level · Extra High | Tools)	45.50 (Deep Thinking Mode | Tools)	42.70 (Thinking Level · High | Tools)
FrontierMath - Tier 4	Mathematical Reasoning	27.10 (Thinking Level · Extra High)	18.80 (Thinking Level · Extra High)	12.50 (Thinking Level · High | Tools)
τ²-Bench - Telecom	Agent Capability Evaluation	98.90 (Thinking Level · Extra High | Tools)	98.70 (Thinking Level · Extra High | Tools)	95.60 (Thinking Level · High | Tools)
BrowseComp	AI Agent - Information Gathering	82.70 (Thinking Level · Extra High | Tools)	65.80 (Thinking Level · Extra High | Tools)	50.80 (Thinking Level · High)
Terminal Bench 2.0	AI Agent - Tool Use	75.10 (Thinking Level · Extra High | Tools)	--	47.60 (Thinking Level · High | Tools)
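
The version-to-version change on a benchmark is the difference between the best scores in the table. The sketch below shows that arithmetic for a few rows; the function name `gain` is hypothetical, and a missing entry ("--") is represented as `None`. Since the best scores were often obtained under different modes, the differences are only a rough indication.

```python
# Best scores per benchmark, copied from the version table on this page.
# None stands for a missing entry ("--").
SCORES = {
    "ARC-AGI": {"GPT-5.4": 93.70, "GPT-5.2": 90.50, "GPT-5.1": 72.80},
    "ARC-AGI-2": {"GPT-5.4": 77.10, "GPT-5.2": 54.20, "GPT-5.1": 17.60},
    "Terminal Bench 2.0": {"GPT-5.4": 75.10, "GPT-5.2": None, "GPT-5.1": 47.60},
}


def gain(benchmark, newer, older):
    """Return the score difference (newer minus older) in points, or None if either is missing."""
    new_score = SCORES[benchmark][newer]
    old_score = SCORES[benchmark][older]
    if new_score is None or old_score is None:
        return None
    # Round to two decimals to hide floating-point noise in the subtraction.
    return round(new_score - old_score, 2)


if __name__ == "__main__":
    for benchmark in SCORES:
        print(benchmark, gain(benchmark, "GPT-5.4", "GPT-5.2"), gain(benchmark, "GPT-5.4", "GPT-5.1"))
```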

Single-Benchmark Version Trend

Chart: ARC-AGI (Comprehensive Evaluation) score by model version.

Series: Normal, Normal + Tools, Thinking, Thinking + Tools, Deep, Deep + Tools.

The X-axis shows model and release date, and the Y-axis shows score. Solid lines connect the same mode across versions, while dotted guides align modes within the same generation.

Standard API Pricing Across the GPT-5.4 Series

Shows standard text input and output pricing side by side for each model. If extended-context pricing exists, the chart keeps the base rate and explains the threshold below.

Source: DataLearnerAI. Standard text prices use each model's default supplier and are given in USD per 1M tokens.

When a context threshold exists, the charted base price only applies within these limits:

GPT-5.4: Base price applies to <= 272K

Model	Supplier	Standard input	Standard output	Base price applies to
GPT-5.4	OpenAI	$2.5 / 1M tokens	$15 / 1M tokens	<= 272K
GPT-5.2	OpenAI	$1.75 / 1M tokens	$14 / 1M tokens	--
GPT-5.1	--	$1.25 / 1M tokens	$10 / 1M tokens	--
