Claude Sonnet 4.5 Benchmark Analysis

Claude Sonnet 4.5's strongest benchmark results are AIME2025 (rank 1 of 106, score 100), SWE-bench Verified (rank 6 of 108, score 82), and MMLU Pro (rank 7 of 126, score 88). This page also compares it with 2 competitor models and 4 predecessor or same-series models, including performance and pricing views when available. Two source links are attached for reference.

Sonnet 4.5 is a mid-tier model in Anthropic's lineup, yet on many benchmarks it scores no worse than Opus.

Benchmark Results

General Knowledge

12 evaluations

Benchmark / mode	Score	Rank/total
MMLU Pro	88	7 / 126
GPQA Diamond	83.40	59 / 179
GPQA Diamond	73.70	98 / 179
LiveBench / Standard Mode	53.69	83 / 115
LiveBench / 64K	68.19	46 / 115
ARC-AGI	63.70	32 / 65
ARC-AGI	25.50	52 / 65
HLE	33.60	69 / 159
HLE	17.70	113 / 159
HLE	7.10	146 / 159
ARC-AGI-2	13.60	35 / 59
ARC-AGI-2	3.80	49 / 59

Coding and Software Engineering

6 evaluations

Benchmark / mode	Score	Rank/total
CodeClash / Standard Mode ｜ Tools	1389	1 / 8
SWE-bench Verified	82	6 / 108
SWE-bench Verified	77.20	25 / 108
LiveCodeBench	--	47 / 120
LiveCodeBench	--	71 / 120
SWE-Bench Pro - Public	43.60	37 / 44

Math and Reasoning

8 evaluations

Benchmark / mode	Score	Rank/total
AIME2025	100	1 / 106
AIME2025	--	45 / 106
AIME2025	--	96 / 106
IMO-ProofBench	27.10	8 / 16
FrontierMath	5.20	38 / 60
IMO-ProofBench Advanced	4.80	6 / 8
FrontierMath - Tier 4 / Standard Mode	2.10	56 / 80
FrontierMath - Tier 4 / 32K	4.20	40 / 80

AI Agent - Tool Usage

5 evaluations

Benchmark / mode	Score	Rank/total
OSWorld-Verified	61.40	14 / 18
MCP-Atlas / Thinking Enabled ｜ Tools	59.50	17 / 23
Terminal-Bench	--	3 / 35
Terminal-Bench	--	25 / 35
Terminal Bench 2.0	42.80	41 / 46

Multimodal Understanding

1 evaluation

Benchmark / mode	Score	Rank/total
MMMU	77.80	14 / 28

Common-Sense Reasoning

1 evaluation

Benchmark / mode	Score	Rank/total
Simple Bench / Standard Mode	54.30	22 / 63

Agent Level Benchmark

4 evaluations

Benchmark / mode	Score	Rank/total
τ²-Bench - Telecom	--	5 / 35
τ²-Bench	84.70	9 / 40
τ²-Bench	--	24 / 40
Terminal Bench Hard	--	8 / 13

Instruction Following

1 evaluation

Benchmark / mode	Score	Rank/total
IF Bench	57.30	21 / 29

AI Agent - Information Search

1 evaluation

Benchmark / mode	Score	Rank/total
BrowseComp	24.10	43 / 45

Productivity Knowledge

1 evaluation

Benchmark / mode	Score	Rank/total
GDPval-AA	--	16 / 21

Long Context

1 evaluation

Benchmark / mode	Score	Rank/total
AA-LCR	--	8 / 13

Claw-style Agent Evaluation

2 evaluations

Benchmark / mode	Score	Rank/total
Pinch Bench / Thinking Enabled ｜ Tools	88.20	4 / 37
Claw Bench / Thinking Enabled ｜ Tools	88.10	13 / 29

Compare with other models

Competitor Comparison

Benchmark scores for Claude Sonnet 4.5 compared against top models in its class

Models compared: Claude Sonnet 4.5, GPT-5.1, Gemini 2.5-Pro

The chart shows each model’s highest score per benchmark within the current filter. Out-of-100 benchmarks use raw heights; out-of-range benchmarks are scaled within that benchmark while labels keep the original scores.
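The scaling rule described above can be sketched roughly as follows. This is a minimal illustration, not the page's actual charting code: the function name `bar_heights` is hypothetical, and because the page does not state how out-of-range benchmarks are scaled, the sketch assumes simple normalization against the best score within that benchmark. Labels would still show the original scores.

```python
from typing import Dict


def bar_heights(scores: Dict[str, float], full_scale: float = 100.0) -> Dict[str, float]:
    """Return a bar height in [0, full_scale] for each model on one benchmark."""
    if not scores:
        return {}
    top = max(scores.values())
    if top <= full_scale:
        # Out-of-100 benchmark: the raw score is used directly as the height.
        return dict(scores)
    # Out-of-range benchmark (for example a rating above 100).
    # ASSUMPTION: scale so the best model in this benchmark reaches full height;
    # the page does not document its exact rule.
    return {model: score / top * full_scale for model, score in scores.items()}


# Scores taken from the comparison table on this page.
aime = bar_heights({"Claude Sonnet 4.5": 100.0, "GPT-5.1": 94.0, "Gemini 2.5-Pro": 88.0})
codeclash = bar_heights({"Claude Sonnet 4.5": 1389.0, "Gemini 2.5-Pro": 1125.0})
```

Under this assumption AIME2025 bars keep their raw heights, while the CodeClash bars are rescaled so that 1389 maps to full height and 1125 to roughly 81% of it.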

12 benchmarks have comparable scores. Each model shows its best score, with the mode label next to it.

Benchmark	Category	Claude Sonnet 4.5 (current)	GPT-5.1	Gemini 2.5-Pro
ARC-AGI	General Knowledge	63.70 (Thinking Enabled)	72.80 (Thinking Level · High)	37.00 (Thinking Enabled)
ARC-AGI-2	General Knowledge	13.60 (Thinking Enabled)	17.60 (Thinking Level · High)	4.90 (Thinking Enabled)
GPQA Diamond	General Knowledge	83.40 (Thinking Enabled)	88.10 (Thinking Enabled)	86.40 (Thinking Enabled)
HLE	General Knowledge	33.60 (Thinking Enabled ｜ Tools)	42.70 (Thinking Level · High ｜ Tools)	21.60 (Thinking Enabled)
LiveBench	General Knowledge	68.19 (64K)	72.04 (Thinking Level · High)	58.33 (Thinking Level · High)
MMLU Pro	General Knowledge	88.00 (Thinking Enabled)	--	86.00 (Standard Mode)
CodeClash	Coding and Software Engineering	1389.00 (Standard Mode ｜ Tools)	--	1125.00 (Standard Mode ｜ Tools)
LiveCodeBench	Coding and Software Engineering	71.00 (Thinking Enabled)	--	77.10 (Standard Mode)
SWE-Bench Pro - Public	Coding and Software Engineering	43.60 (Thinking Enabled)	50.80 (Thinking Level · High)	--
SWE-bench Verified	Coding and Software Engineering	82.00 (Thinking Enabled ｜ Tools)	76.30 (Thinking Level · High)	67.20 (Thinking Enabled)
AIME2025	Math and Reasoning	100.00 (Thinking Enabled ｜ Tools)	94.00 (Thinking Level · High)	88.00 (Thinking Enabled)
FrontierMath	Math and Reasoning	5.20 (Standard Mode)	26.70 (Thinking Level · High ｜ Tools)	11.00 (Standard Mode)
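
The "best score per model" rule used in the table above can be illustrated with a short sketch. The helper name `best_per_benchmark` and the tuple layout are hypothetical, and the sketch assumes a higher score is always better; the sample rows are the Claude Sonnet 4.5 results listed earlier on this page for benchmarks that report more than one mode.

```python
from typing import Dict, List, Tuple

# (benchmark, mode, score)
Result = Tuple[str, str, float]


def best_per_benchmark(results: List[Result]) -> Dict[str, Tuple[float, str]]:
    """Keep the highest score, and the mode that produced it, for each benchmark."""
    best: Dict[str, Tuple[float, str]] = {}
    for benchmark, mode, score in results:
        # Replace the stored entry only when this run scores strictly higher.
        if benchmark not in best or score > best[benchmark][0]:
            best[benchmark] = (score, mode)
    return best


sonnet_45 = [
    ("LiveBench", "Standard Mode", 53.69),
    ("LiveBench", "64K", 68.19),
    ("FrontierMath - Tier 4", "Standard Mode", 2.10),
    ("FrontierMath - Tier 4", "32K", 4.20),
]

best = best_per_benchmark(sonnet_45)
```

For LiveBench this keeps the 64K result of 68.19, which matches the value shown for Claude Sonnet 4.5 in the comparison table.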

13 additional benchmarks remain in the chart above.

Standard API Pricing: Claude Sonnet 4.5 vs. Peer Models

Shows standard text input and output pricing side by side for each model. If extended-context pricing exists, the chart keeps the base rate and explains the threshold below.

Source: DataLearnerAI. Standard text prices shown here use the default supplier.

Comparable standard text pricing is not available for these models.

Version History

How each version of the Claude Sonnet 4.5 series stacks up on benchmark tests

Models compared: Claude Sonnet 4.5, Claude Sonnet 4, Claude Sonnet 3.7, Claude 3.5 Sonnet New, Claude 3.5 Sonnet

12 benchmarks have comparable scores. Each model shows its best score, with the mode label next to it.

Benchmark	Category	Claude Sonnet 4.5 (current)	Claude Sonnet 4	Claude Sonnet 3.7	Claude 3.5 Sonnet New	Claude 3.5 Sonnet
ARC-AGI	General Knowledge	63.70 (Thinking Enabled)	40.00 (Thinking Enabled)	--	--	--
ARC-AGI-2	General Knowledge	13.60 (Thinking Enabled)	5.90 (Thinking Enabled)	--	--	--
GPQA Diamond	General Knowledge	83.40 (Thinking Enabled)	83.80 (Deep Thinking Mode ｜ Tools)	77.00 (Thinking Enabled)	65.00 (Standard Mode)	59.40 (Standard Mode)
HLE	General Knowledge	33.60 (Thinking Enabled ｜ Tools)	9.60 (Thinking Enabled)	10.30 (Thinking Enabled)	--	--
LiveBench	General Knowledge	68.19 (64K)	61.27 (64K)	--	--	--
MMLU Pro	General Knowledge	88.00 (Thinking Enabled)	84.00 (Thinking Enabled)	--	78.00 (Standard Mode)	77.64 (Standard Mode)
CodeClash	Coding and Software Engineering	1389.00 (Standard Mode ｜ Tools)	1223.00 (Standard Mode ｜ Tools)	--	--	--
LiveCodeBench	Coding and Software Engineering	71.00 (Thinking Enabled)	66.00 (Thinking Enabled)	--	38.70 (Standard Mode)	--
SWE-Bench Pro - Public	Coding and Software Engineering	43.60 (Thinking Enabled)	42.70 (Thinking Enabled)	--	--	--
SWE-bench Verified	Coding and Software Engineering	82.00 (Thinking Enabled ｜ Tools)	80.20 (Thinking Enabled ｜ Tools)	70.30 (Thinking Enabled ｜ Tools)	49.00 (Standard Mode)	--
AIME2025	Math and Reasoning	100.00 (Thinking Enabled ｜ Tools)	85.00 (Deep Thinking Mode ｜ Tools)	54.80 (Standard Mode)	--	--
FrontierMath	Math and Reasoning	5.20 (Standard Mode)	4.10 (Standard Mode)	4.10 (Thinking Enabled)	2.10 (Standard Mode)	1.00 (Standard Mode)

11 additional benchmarks remain in the chart above.

Single-Benchmark Version Trend

Chart: ARC-AGI (General Knowledge) scores across versions. X-axis shows model and release date, Y-axis shows score. Series: Normal, Normal + Tools, Thinking, Thinking + Tools, Deep, Deep + Tools. Solid lines connect the same mode across versions, while dotted guides align modes within the same generation.

Standard API Pricing Across the Claude Sonnet 4.5 Series

Shows standard text input and output pricing side by side for each model. If extended-context pricing exists, the chart keeps the base rate and explains the threshold below.

Source: DataLearnerAI. Standard text prices shown here use the default supplier.

Comparable standard text pricing is not available for these models.

Sources

anthropic.com

artificialanalysis.ai