Claude Sonnet 4.5 Benchmark Analysis

Claude Sonnet 4.5's strongest benchmark results are AIME2025 (rank 1 of 106, score 100), SWE-bench Verified (rank 6 of 108, score 82), and MMLU Pro (rank 7 of 126, score 88). This page also compares it with 2 competitor models and 4 predecessor or same-series models, including performance and pricing views when available. 2 source links are attached for reference.

Sonnet 4.5 is Anthropic's mid-tier model, yet on many benchmarks its results are no worse than Opus's.

Benchmark Results


General Knowledge

12 evaluations
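| Benchmark / mode | Score | Rank/total |
| --- | --- | --- |
| MMLU Pro / Thinking Enabled | 88 | 7 / 126 |
| GPQA Diamond / Thinking Enabled | 83.40 | 59 / 179 |
|  | 73.70 | 98 / 179 |
| LiveBench / Standard Mode | 53.69 | 83 / 115 |
| LiveBench / 64K | 68.19 | 46 / 115 |
| ARC-AGI / Thinking Enabled | 63.70 | 32 / 65 |
|  | 25.50 | 52 / 65 |
| HLE / Thinking Enabled, Tools | 33.60 | 69 / 159 |
|  | 17.70 | 113 / 159 |
|  | 7.10 | 146 / 159 |
| ARC-AGI-2 / Thinking Enabled | 13.60 | 35 / 59 |
|  | 3.80 | 49 / 59 |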

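Coding and Software Engineering

6 evaluations
| Benchmark / mode | Score | Rank/total |
| --- | --- | --- |
| CodeClash / Standard Mode, Tools | 1389 | 1 / 8 |
|  | 77.20 | 25 / 108 |
| LiveCodeBench / Thinking Enabled | 71 | 47 / 120 |
|  | 59 | 71 / 120 |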

Math and Reasoning

8 evaluations
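| Benchmark / mode | Score | Rank/total |
| --- | --- | --- |
| AIME2025 / Thinking Enabled, Tools | 100 | 1 / 106 |
|  | 87 | 45 / 106 |
|  | 37 | 96 / 106 |
|  | 27.10 | 8 / 16 |
| FrontierMath / Standard Mode | 5.20 | 38 / 60 |
|  | 2.10 | 56 / 80 |
|  | 4.20 | 40 / 80 |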

AI Agent - Tool Usage

5 evaluations
| Benchmark / mode | Score | Rank/total |
| --- | --- | --- |
|  | 61.40 | 14 / 18 |
| MCP-Atlas / Thinking Enabled, Tools | 59.50 | 17 / 23 |
|  | 42.80 | 41 / 46 |

Multimodal Understanding

1 evaluation
| Benchmark / mode | Score | Rank/total |
| --- | --- | --- |
|  | 77.80 | 14 / 28 |

Commonsense Reasoning

1 evaluation
| Benchmark / mode | Score | Rank/total |
| --- | --- | --- |
| Simple Bench / Standard Mode | 54.30 | 22 / 63 |

Agent Level Benchmark

4 evaluations
| Benchmark / mode | Score | Rank/total |
| --- | --- | --- |

Instruction Following

1 evaluation
| Benchmark / mode | Score | Rank/total |
| --- | --- | --- |
|  | 57.30 | 21 / 29 |

AI Agent - Information Search

1 evaluation
| Benchmark / mode | Score | Rank/total |
| --- | --- | --- |
|  | 24.10 | 43 / 45 |

Productivity Knowledge

1 evaluation
| Benchmark / mode | Score | Rank/total |
| --- | --- | --- |
|  | 39 | 16 / 21 |

Long Context

1 evaluation
| Benchmark / mode | Score | Rank/total |
| --- | --- | --- |
|  | 66 | 8 / 13 |

Claw-style Agent Evaluation

2 evaluations
| Benchmark / mode | Score | Rank/total |
| --- | --- | --- |
| Pinch Bench / Thinking Enabled, Tools | 88.20 | 4 / 37 |
| Claw Bench / Thinking Enabled, Tools | 88.10 | 13 / 29 |

Competitor Comparison

Benchmark scores for Claude Sonnet 4.5 compared against top models in its class

Models compared: Claude Sonnet 4.5, GPT-5.1, Gemini 2.5-Pro
The chart shows each model’s highest score per benchmark within the current filter. Out-of-100 benchmarks use raw heights; out-of-range benchmarks are scaled within that benchmark while labels keep the original scores.
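
The scaling rule described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the site's actual implementation: the function name `bar_heights`, the test for what counts as an "out-of-100" benchmark, and the choice to scale by the largest score within the benchmark are all hypothetical, since the page does not give the exact formula.

```python
def bar_heights(scores: dict[str, float]) -> dict[str, float]:
    """Map one benchmark's per-model scores to bar heights on a 0-100 axis.

    Assumption: a benchmark is treated as "out of 100" when every score lies
    in [0, 100]; otherwise scores are scaled so the largest magnitude in that
    benchmark fills the full height.
    """
    if not scores:
        return {}
    values = list(scores.values())
    if all(0.0 <= v <= 100.0 for v in values):
        # Out-of-100 benchmark: the raw score is used as the bar height.
        return dict(scores)
    # Out-of-range benchmark (e.g. an Elo-style rating): scale within it.
    peak = max(abs(v) for v in values)
    return {model: v / peak * 100.0 for model, v in scores.items()}


# Scores taken from the comparison on this page.
swe_bench = {"Claude Sonnet 4.5": 82.0, "GPT-5.1": 76.3, "Gemini 2.5-Pro": 67.2}
codeclash = {"Claude Sonnet 4.5": 1389.0, "Gemini 2.5-Pro": 1125.0}

for name, scores in (("SWE-bench Verified", swe_bench), ("CodeClash", codeclash)):
    for model, height in bar_heights(scores).items():
        # The label keeps the original score; only the bar height is scaled.
        print(f"{name:20s} {model:18s} label={scores[model]:8.2f} height={height:6.2f}")
```

Under these assumptions, SWE-bench Verified bars keep their raw heights, while the CodeClash bars are drawn relative to the highest CodeClash score and still carry labels of 1389 and 1125.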

12 benchmarks with comparable scores. Each model shows its best score; mode label is displayed below.
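
The "best score per model" rule above could be implemented along these lines. This is a minimal sketch, assuming that a higher score is better on every benchmark and that each result carries a mode label; the `Result` type and `best_per_model` function are hypothetical names, not part of the site.

```python
from typing import NamedTuple


class Result(NamedTuple):
    model: str
    benchmark: str
    mode: str
    score: float


def best_per_model(results: list[Result]) -> dict[tuple[str, str], Result]:
    """Keep the highest-scoring result for each (benchmark, model) pair."""
    best: dict[tuple[str, str], Result] = {}
    for r in results:
        key = (r.benchmark, r.model)
        # Assumption: higher is better for every benchmark shown.
        if key not in best or r.score > best[key].score:
            best[key] = r
    return best


# LiveBench rows as listed on this page.
results = [
    Result("Claude Sonnet 4.5", "LiveBench", "Standard Mode", 53.69),
    Result("Claude Sonnet 4.5", "LiveBench", "64K", 68.19),
    Result("GPT-5.1", "LiveBench", "Thinking Level · High", 72.04),
]

for (benchmark, model), r in best_per_model(results).items():
    print(f"{benchmark} | {model}: {r.score:.2f} ({r.mode})")
```

With the LiveBench rows, Claude Sonnet 4.5 is represented by its 64K result (68.19) rather than its Standard Mode result (53.69), which matches the entry in the comparison.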

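| Benchmark | Category | Claude Sonnet 4.5 (current) | GPT-5.1 | Gemini 2.5-Pro |
| --- | --- | --- | --- | --- |
| ARC-AGI | Comprehensive Evaluation | 63.70 (Thinking Enabled) | 72.80 (Thinking Level · High) | 37.00 (Thinking Enabled) |
| ARC-AGI-2 | Comprehensive Evaluation | 13.60 (Thinking Enabled) | 17.60 (Thinking Level · High) | 4.90 (Thinking Enabled) |
| GPQA Diamond | Comprehensive Evaluation | 83.40 (Thinking Enabled) | 88.10 (Thinking Enabled) | 86.40 (Thinking Enabled) |
| HLE | Comprehensive Evaluation | 33.60 (Thinking Enabled, Tools) | 42.70 (Thinking Level · High, Tools) | 21.60 (Thinking Enabled) |
| LiveBench | Comprehensive Evaluation | 68.19 (64K) | 72.04 (Thinking Level · High) | 58.33 (Thinking Level · High) |
| MMLU Pro | Comprehensive Evaluation | 88.00 (Thinking Enabled) | -- | 86.00 (Standard Mode) |
| CodeClash | Coding and Software Engineering | 1389.00 (Standard Mode, Tools) | -- | 1125.00 (Standard Mode, Tools) |
| LiveCodeBench | Coding and Software Engineering | 71.00 (Thinking Enabled) | -- | 77.10 (Standard Mode) |
| SWE-Bench Pro - Public | Coding and Software Engineering | 43.60 (Thinking Enabled) | 50.80 (Thinking Level · High) | -- |
| SWE-bench Verified | Coding and Software Engineering | 82.00 (Thinking Enabled, Tools) | 76.30 (Thinking Level · High) | 67.20 (Thinking Enabled) |
| AIME2025 | Mathematical Reasoning | 100.00 (Thinking Enabled, Tools) | 94.00 (Thinking Level · High) | 88.00 (Thinking Enabled) |
| FrontierMath | Mathematical Reasoning | 5.20 (Standard Mode) | 26.70 (Thinking Level · High, Tools) | 11.00 (Standard Mode) |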
13 additional benchmarks remain in the chart above.

Standard API Pricing: Claude Sonnet 4.5 vs. Peer Models

Shows standard text input and output pricing side by side for each model. If extended-context pricing exists, the chart keeps the base rate and explains the threshold below.

Source: DataLearnerAI. Standard text prices shown here use the default supplier.

Comparable standard text pricing is not available for these models.

Version History

How each version of the Claude Sonnet 4.5 series stacks up on benchmark tests

The chart shows each model’s highest score per benchmark within the current filter. Out-of-100 benchmarks use raw heights; out-of-range benchmarks are scaled within that benchmark while labels keep the original scores.

12 benchmarks with comparable scores. Each model shows its best score; mode label is displayed below.

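| Benchmark | Category | Claude Sonnet 4.5 (current) | Claude Sonnet 4 | Claude Sonnet 3.7 | Claude 3.5 Sonnet New | Claude 3.5 Sonnet |
| --- | --- | --- | --- | --- | --- | --- |
| ARC-AGI | Comprehensive Evaluation | 63.70 (Thinking Enabled) | 40.00 (Thinking Enabled) | -- | -- | -- |
| ARC-AGI-2 | Comprehensive Evaluation | 13.60 (Thinking Enabled) | 5.90 (Thinking Enabled) | -- | -- | -- |
| GPQA Diamond | Comprehensive Evaluation | 83.40 (Thinking Enabled) | 83.80 (Deep Thinking Mode, Tools) | 77.00 (Thinking Enabled) | 65.00 (Standard Mode) | 59.40 (Standard Mode) |
| HLE | Comprehensive Evaluation | 33.60 (Thinking Enabled, Tools) | 9.60 (Thinking Enabled) | 10.30 (Thinking Enabled) | -- | -- |
| LiveBench | Comprehensive Evaluation | 68.19 (64K) | 61.27 (64K) | -- | -- | -- |
| MMLU Pro | Comprehensive Evaluation | 88.00 (Thinking Enabled) | 84.00 (Thinking Enabled) | -- | 78.00 (Standard Mode) | 77.64 (Standard Mode) |
| CodeClash | Coding and Software Engineering | 1389.00 (Standard Mode, Tools) | 1223.00 (Standard Mode, Tools) | -- | -- | -- |
| LiveCodeBench | Coding and Software Engineering | 71.00 (Thinking Enabled) | 66.00 (Thinking Enabled) | -- | 38.70 (Standard Mode) | -- |
| SWE-Bench Pro - Public | Coding and Software Engineering | 43.60 (Thinking Enabled) | 42.70 (Thinking Enabled) | -- | -- | -- |
| SWE-bench Verified | Coding and Software Engineering | 82.00 (Thinking Enabled, Tools) | 80.20 (Thinking Enabled, Tools) | 70.30 (Thinking Enabled, Tools) | 49.00 (Standard Mode) | -- |
| AIME2025 | Mathematical Reasoning | 100.00 (Thinking Enabled, Tools) | 85.00 (Deep Thinking Mode, Tools) | 54.80 (Standard Mode) | -- | -- |
| FrontierMath | Mathematical Reasoning | 5.20 (Standard Mode) | 4.10 (Standard Mode) | 4.10 (Thinking Enabled) | 2.10 (Standard Mode) | 1.00 (Standard Mode) |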
11 additional benchmarks remain in the chart above.

Single-Benchmark Version Trend

Viewing: ARC-AGI · Comprehensive Evaluation

Modes: Normal, Normal + Tools, Thinking, Thinking + Tools, Deep, Deep + Tools

X-axis shows model and release date, Y-axis shows score; solid lines connect the same mode across versions, while dotted guides align modes within the same generation.

Standard API Pricing Across the Claude Sonnet 4.5 Series

Shows standard text input and output pricing side by side for each model. If extended-context pricing exists, the chart keeps the base rate and explains the threshold below.

Source: DataLearnerAI. Standard text prices shown here use the default supplier.

Comparable standard text pricing is not available for these models.
