GPT-5.4 Benchmark Details

GPT-5.4's leading benchmark results are LiveBench (rank 2 of 115, score 80.28), Pinch Bench (rank 1 of 37, score 90.50), and GPQA Diamond (rank 10 of 179, score 92.80). This page also compares it with 2 competitor models and 2 predecessor or same-series models, including performance and pricing views where available. 2 source links are attached for reference.

Benchmark Results

General Knowledge

14 evaluations
Benchmark | Mode | Score | Rank / total
ARC-AGI | Standard Mode | 93.70 | 7 / 65
-- | -- | 68.20 | 28 / 65
ARC-AGI | Medium | 86.20 | 18 / 65
ARC-AGI | Extra-High | 93.70 | 7 / 65
GPQA Diamond | Extra-High | 92.80 | 10 / 179
-- | -- | 75.07 | 16 / 115
LiveBench | Deep Thinking Mode | 80.28 | 2 / 115
ARC-AGI-2 | Standard Mode | 77.10 | 7 / 59
-- | -- | 29.20 | 30 / 59
ARC-AGI-2 | Medium | 55.40 | 19 / 59
ARC-AGI-2 | Extra-High | 74 | 10 / 59
HLE | Extra-High | 39.80 | 54 / 159
HLE | Extra-High, Tools | 52.10 | 15 / 159
-- | -- | 0 | 4 / 6

Math and Reasoning

2 evaluations
Benchmark | Mode | Score | Rank / total
FrontierMath | Extra-High | 47.60 | 5 / 60
-- | -- | 27.10 | 11 / 80

Coding and Software Engineering

2 evaluations
Benchmark | Mode | Score | Rank / total
-- | -- | 57.70 | 11 / 44
DeepSWE | Extra-High, Tools | 52 | 4 / 9

Agent Level Benchmark

2 evaluations
Benchmark | Mode | Score | Rank / total
τ²-Bench - Telecom | Standard Mode, Tools | 64.30 | 30 / 35
τ²-Bench - Telecom | Extra-High, Tools | 98.90 | 3 / 35

AI Agent - Information Search

1 evaluation
Benchmark | Mode | Score | Rank / total
BrowseComp | Extra-High, Tools | 82.70 | 11 / 45

AI Agent - Tool Usage

3 evaluations
Benchmark | Mode | Score | Rank / total
Terminal Bench 2.0 | Extra-High, Tools | 75.10 | 4 / 46
OSWorld-Verified | Extra-High, Tools | 75 | 7 / 18
MCP-Atlas | Extra-High, Tools | 70.60 | 10 / 23

Claw-style Agent Evaluation

2 evaluations
Benchmark | Mode | Score | Rank / total
Claw Bench | Thinking Enabled, Tools | 92.70 | 3 / 29
Pinch Bench | Thinking Enabled, Tools | 90.50 | 1 / 37
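
Because each benchmark ranks a different number of entries, the rank/total figures above are easier to compare as percentages. The following is a minimal sketch of that arithmetic; the helper name `top_percent` is hypothetical and not part of the page, and it assumes every rank is given as a "rank / total" string.

```python
def top_percent(rank_total: str) -> float:
    """Convert a 'rank / total' string such as '2 / 115' to a top-percent figure."""
    rank_text, total_text = rank_total.split("/")
    rank, total = int(rank_text), int(total_text)  # int() tolerates surrounding spaces
    if not 1 <= rank <= total:
        raise ValueError(f"rank must be between 1 and total, got {rank_total!r}")
    return 100.0 * rank / total


# The three headline results quoted at the top of this page.
headline = {
    "LiveBench": "2 / 115",
    "Pinch Bench": "1 / 37",
    "GPQA Diamond": "10 / 179",
}

for name, rank_total in headline.items():
    print(f"{name}: top {top_percent(rank_total):.2f}% of ranked entries")
```

A lower percentage means a better placement. A rank of 1 never maps to 0% with this formula, which keeps small leaderboards from looking better than large ones.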

Competitor Comparison

Benchmark scores for GPT-5.4 compared against top models in its class

The chart shows each model's highest score per benchmark within the current filter. Benchmarks scored out of 100 are plotted at their raw height; benchmarks on any other scale are scaled within that benchmark, while labels keep the original scores.

10 benchmarks with comparable scores. Each model shows its best score, together with the mode that produced it.
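
The "best score per benchmark" selection described above can be sketched as follows. This is an illustration only, not the site's implementation: the `Result` type and `best_per_benchmark` function are hypothetical names, the rows are GPT-5.4 scores copied from the Benchmark Results tables on this page (rows without a benchmark label are left out), and the first-row-wins tie-break is an assumption. The scaling applied to benchmarks that are not scored out of 100 is not specified on the page, so it is not modelled here.

```python
from typing import NamedTuple


class Result(NamedTuple):
    benchmark: str
    mode: str
    score: float


# GPT-5.4 rows copied from this page; unlabeled rows are omitted.
results = [
    Result("ARC-AGI", "Standard Mode", 93.70),
    Result("ARC-AGI", "Medium", 86.20),
    Result("ARC-AGI", "Extra-High", 93.70),
    Result("ARC-AGI-2", "Standard Mode", 77.10),
    Result("ARC-AGI-2", "Medium", 55.40),
    Result("ARC-AGI-2", "Extra-High", 74.0),
    Result("HLE", "Extra-High", 39.80),
    Result("HLE", "Extra-High, Tools", 52.10),
]


def best_per_benchmark(rows):
    """Keep the highest-scoring row per benchmark; on a tie the first row wins (assumed)."""
    best = {}
    for row in rows:
        current = best.get(row.benchmark)
        if current is None or row.score > current.score:
            best[row.benchmark] = row
    return best


for benchmark, row in best_per_benchmark(results).items():
    print(f"{benchmark}: {row.score:.2f} ({row.mode})")
```

With these rows the selection picks the same GPT-5.4 scores and mode labels that the comparison lists for ARC-AGI, ARC-AGI-2, and HLE.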

Benchmark | Category | GPT-5.4 (current) | Gemini 3.1 Pro Preview | Claude Opus 4.6
ARC-AGI | General evaluation | 93.70 (Standard Mode) | -- | 92.00 (Extended Thinking)
ARC-AGI-2 | General evaluation | 77.10 (Standard Mode) | 77.10 (Thinking Level · High) | 66.30 (Extended Thinking)
HLE | General evaluation | 52.10 (Thinking Level · Extra High, Tools) | 51.40 (Thinking Level · High, Tools) | 53.00 (Extended Thinking, Tools)
-- | -- | 27.10 (Thinking Level · Extra High) | 16.70 (Thinking Level · High) | 22.90 (Thinking Level · High)
τ²-Bench - Telecom | Agent capability evaluation | 98.90 (Thinking Level · Extra High, Tools) | 99.30 (Thinking Level · High, Tools) | 99.25 (Extended Thinking, Tools)
BrowseComp | AI Agent - Information Search | 82.70 (Thinking Level · Extra High, Tools) | 85.90 (Thinking Level · High, Tools) | 84.00 (Thinking Enabled, Tools)
MCP-Atlas | AI Agent - Tool Usage | 70.60 (Thinking Level · Extra High, Tools) | -- | 76.80 (Deep Thinking Mode, Tools)
OSWorld-Verified | AI Agent - Tool Usage | 75.00 (Thinking Level · Extra High, Tools) | -- | 72.70 (Extended Thinking, Tools)
Terminal Bench 2.0 | AI Agent - Tool Usage | 75.10 (Thinking Level · Extra High, Tools) | 68.50 (Thinking Level · High, Tools) | 65.40 (Extended Thinking, Tools)
Pinch Bench | OpenClaw agent capability evaluation | 90.50 (Thinking Enabled, Tools) | 86.70 (Thinking Enabled, Tools) | 87.40 (Thinking Enabled, Tools)

Standard API Pricing: GPT-5.4 vs. Peer Models

Shows standard text input and output pricing side by side for each model. If extended-context pricing exists, the chart keeps the base rate and explains the threshold below.

Source: DataLearnerAI. Standard text prices shown here use the default supplier and are in USD per 1M tokens.

When a context threshold exists, the charted base price only applies within these limits:

GPT-5.4: Base price applies to <= 272K
Gemini 3.1 Pro Preview: Base price applies to <= 200K
Claude Opus 4.6: Base price applies to <= 200K
Model | Supplier | Standard input | Standard output | Base price applies to
GPT-5.4 | OpenAI | $2.5 / 1M tokens | $15 / 1M tokens | <= 272K
Gemini 3.1 Pro Preview | Google DeepMind | $2 / 1M tokens | $12 / 1M tokens | <= 200K
Claude Opus 4.6 | Anthropic | $5 / 1M tokens | $25 / 1M tokens | <= 200K
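
To make the per-million-token prices concrete, here is a small sketch that estimates the cost of one request at the base rates listed above. It rests on several assumptions that the page does not confirm: that "272K" and "200K" mean 272,000 and 200,000 tokens, that the threshold is measured on input tokens, and that a request over the threshold should simply be rejected because the extended-context rates are not given here. The names `Pricing`, `PRICING`, and `estimate_cost_usd` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_usd_per_million: float
    output_usd_per_million: float
    base_limit_tokens: int  # assumed: K = 1,000 tokens, measured on the input side


# Base prices copied from the pricing table on this page.
PRICING = {
    "GPT-5.4": Pricing(2.5, 15.0, 272_000),
    "Gemini 3.1 Pro Preview": Pricing(2.0, 12.0, 200_000),
    "Claude Opus 4.6": Pricing(5.0, 25.0, 200_000),
}


def estimate_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request at the base rate, in USD."""
    pricing = PRICING[model]
    if input_tokens > pricing.base_limit_tokens:
        # Extended-context prices are not listed on this page, so do not guess them.
        raise ValueError(f"{model}: base price only applies up to {pricing.base_limit_tokens} tokens")
    total = (
        input_tokens * pricing.input_usd_per_million
        + output_tokens * pricing.output_usd_per_million
    )
    return total / 1_000_000


for model in PRICING:
    cost = estimate_cost_usd(model, input_tokens=100_000, output_tokens=10_000)
    print(f"{model}: ${cost:.2f}")
```

For a request with 100,000 input tokens and 10,000 output tokens, the arithmetic gives $0.40 for GPT-5.4, $0.32 for Gemini 3.1 Pro Preview, and $0.75 for Claude Opus 4.6.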

Version History

How each version of the GPT-5.4 series stacks up on benchmark tests

The chart shows each model's highest score per benchmark within the current filter. Benchmarks scored out of 100 are plotted at their raw height; benchmarks on any other scale are scaled within that benchmark, while labels keep the original scores.

8 benchmarks with comparable scores. Each model shows its best score, together with the mode that produced it.

Benchmark | Category | GPT-5.4 (current) | GPT-5.2 | GPT-5.1
ARC-AGI | General evaluation | 93.70 (Standard Mode) | 90.50 (Deep Thinking Mode) | 72.80 (Thinking Level · High)
ARC-AGI-2 | General evaluation | 77.10 (Standard Mode) | 54.20 (Deep Thinking Mode) | 17.60 (Thinking Level · High)
HLE | General evaluation | 52.10 (Thinking Level · Extra High, Tools) | 45.50 (Deep Thinking Mode, Tools) | 42.70 (Thinking Level · High, Tools)
LiveBench | General evaluation | 80.28 (Deep Thinking Mode) | 48.91 (Standard Mode) | 72.04 (Thinking Level · High)
-- | -- | 27.10 (Thinking Level · Extra High) | 18.80 (Thinking Level · Extra High) | 12.50 (Thinking Level · High, Tools)
τ²-Bench - Telecom | Agent capability evaluation | 98.90 (Thinking Level · Extra High, Tools) | 98.70 (Thinking Level · Extra High, Tools) | 95.60 (Thinking Level · High, Tools)
BrowseComp | AI Agent - Information Search | 82.70 (Thinking Level · Extra High, Tools) | 65.80 (Thinking Level · Extra High, Tools) | 50.80 (Thinking Level · High)
Terminal Bench 2.0 | AI Agent - Tool Usage | 75.10 (Thinking Level · Extra High, Tools) | -- | 47.60 (Thinking Level · High, Tools)
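
The version-to-version change can be read off as a simple difference in points. The sketch below does that for the best scores in the version table; the `scores` mapping and `gain` helper are hypothetical names, the row without a benchmark label is left out, and `None` stands for a "--" cell. The best scores often come from different modes (for example Standard Mode versus Deep Thinking Mode), so these differences are not like-for-like comparisons.

```python
# Best scores copied from the version table on this page; None marks a "--" cell.
scores = {
    "ARC-AGI": {"GPT-5.4": 93.70, "GPT-5.2": 90.50, "GPT-5.1": 72.80},
    "ARC-AGI-2": {"GPT-5.4": 77.10, "GPT-5.2": 54.20, "GPT-5.1": 17.60},
    "HLE": {"GPT-5.4": 52.10, "GPT-5.2": 45.50, "GPT-5.1": 42.70},
    "LiveBench": {"GPT-5.4": 80.28, "GPT-5.2": 48.91, "GPT-5.1": 72.04},
    "τ²-Bench - Telecom": {"GPT-5.4": 98.90, "GPT-5.2": 98.70, "GPT-5.1": 95.60},
    "BrowseComp": {"GPT-5.4": 82.70, "GPT-5.2": 65.80, "GPT-5.1": 50.80},
    "Terminal Bench 2.0": {"GPT-5.4": 75.10, "GPT-5.2": None, "GPT-5.1": 47.60},
}


def gain(benchmark: str, newer: str, older: str):
    """Return the score difference in points, or None if either score is missing."""
    new, old = scores[benchmark][newer], scores[benchmark][older]
    if new is None or old is None:
        return None
    return round(new - old, 2)  # rounding hides floating-point noise


for benchmark in scores:
    delta = gain(benchmark, "GPT-5.4", "GPT-5.2")
    label = "n/a" if delta is None else f"{delta:+.2f}"
    print(f"{benchmark}: {label} points vs GPT-5.2")
```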

Single-Benchmark Version Trend

[Chart: ARC-AGI (general evaluation) score by model version. Series: Normal, Normal + Tools, Thinking, Thinking + Tools, Deep, Deep + Tools. The X-axis shows model and release date and the Y-axis shows score; solid lines connect the same mode across versions, while dotted guides align modes within the same generation.]

Standard API Pricing Across the GPT-5.4 Series

Shows standard text input and output pricing side by side for each model. If extended-context pricing exists, the chart keeps the base rate and explains the threshold below.

Source: DataLearnerAI. Standard text prices shown here use the default supplier and are in USD per 1M tokens.

When a context threshold exists, the charted base price only applies within these limits:

GPT-5.4: Base price applies to <= 272K
Model | Supplier | Standard input | Standard output | Base price applies to
GPT-5.4 | OpenAI | $2.5 / 1M tokens | $15 / 1M tokens | <= 272K
GPT-5.2 | OpenAI | $1.75 / 1M tokens | $14 / 1M tokens | --

Sources