Claude Sonnet 4.5 vs Claude Sonnet 4

Across 25 shared benchmarks, Claude Sonnet 4.5 leads overall: it is ahead on 22, Claude Sonnet 4 is ahead on 1, and 2 are tied, for an average score difference of +8.81.

Claude Sonnet 4.5
Anthropic · 2025-09-30 · Chat LLM

Claude Sonnet 4
Anthropic · 2025-05-23 · Reasoning LLM

Benchmarks led: Claude Sonnet 4.5 22 (88%) · Tied 2 · Claude Sonnet 4 1 (4%)

Benchmark scores

Grouped by capability category and sorted by score difference within each group; 25 benchmarks in total.
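The grouping and ordering described above can be sketched in a few lines. This is a minimal illustration, not the page's actual generator: the function names `group_and_sort` and `lead_count` are hypothetical, the rows are a subset copied from this page, and sorting by signed difference (largest first) is an assumption that happens to match the order shown in the tables.

```python
from collections import defaultdict

# (category, benchmark, Claude Sonnet 4.5 score, Claude Sonnet 4 score),
# a subset of rows copied from this page, deliberately shuffled.
ROWS = [
    ("Math and Reasoning", "FrontierMath", 5.20, 4.10),
    ("Coding and Software Engineer", "SWE-bench Verified", 82.0, 80.20),
    ("Math and Reasoning", "IMO-ProofBench", 27.10, 27.10),
    ("Math and Reasoning", "AIME2025", 100.0, 85.0),
    ("Coding and Software Engineer", "LiveCodeBench", 71.0, 66.0),
    ("Math and Reasoning", "FrontierMath - Tier 4", 2.10, 0.0),
    ("Coding and Software Engineer", "SWE-Bench Pro - Public", 43.60, 42.70),
    ("Math and Reasoning", "IMO-ProofBench Advanced", 4.80, 4.80),
]


def group_and_sort(rows):
    """Group rows by category, then sort each group by signed difference, largest first."""
    groups = defaultdict(list)
    for category, name, new_score, old_score in rows:
        # Round to two decimals to avoid floating-point noise such as 1.7999999.
        groups[category].append((name, round(new_score - old_score, 2)))
    for entries in groups.values():
        # list.sort is stable, so tied rows keep their input order.
        entries.sort(key=lambda entry: entry[1], reverse=True)
    return dict(groups)


def lead_count(entries):
    """Return (benchmarks with a positive difference, total benchmarks in the group)."""
    return sum(1 for _, diff in entries if diff > 0), len(entries)


if __name__ == "__main__":
    for category, entries in group_and_sort(ROWS).items():
        wins, total = lead_count(entries)
        print(f"{category}: Claude Sonnet 4.5 leads {wins}/{total}")
        for name, diff in entries:
            print(f"  {name}: {diff:+.2f}")
```

Ties count toward the group total but not toward the lead count, which is why a group with two tied benchmarks reads 3/5 rather than 5/5.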

General Knowledge

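Claude Sonnet 4.5 leads 5/6

| Benchmark | Claude Sonnet 4.5 (rank) | Claude Sonnet 4 (rank) | Difference |
| --- | --- | --- | --- |
| HLE | 33.60 (67 / 157) | 9.60 (134 / 157) | +24 |
| ARC-AGI | 63.70 (32 / 65) | 40 (46 / 65) | +23.70 |
| ARC-AGI-2 | 13.60 (35 / 59) | 5.90 (43 / 59) | +7.70 |
| LiveBench | 78.26 (4 / 52) | 73.82 (11 / 52) | +4.44 |
| MMLU Pro | 88 (7 / 126) | 84 (37 / 126) | +4 |
| GPQA Diamond | 83.40 (58 / 178) | 83.80 (57 / 178) | -0.40 |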

Math and Reasoning

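Claude Sonnet 4.5 leads 3/5

| Benchmark | Claude Sonnet 4.5 (rank) | Claude Sonnet 4 (rank) | Difference |
| --- | --- | --- | --- |
| AIME2025 | 100 (1 / 106) | 85 (50 / 106) | +15 |
| FrontierMath - Tier 4 | 2.10 (56 / 80), Normal (No Tools) | 0 (72 / 80), Normal (No Tools) | +2.10 |
| FrontierMath | 5.20 (38 / 60) | 4.10 (41 / 60) | +1.10 |
| IMO-ProofBench | 27.10 (8 / 16) | 27.10 (8 / 16) | Tied |
| IMO-ProofBench Advanced | 4.80 (6 / 8) | 4.80 (6 / 8) | Tied |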

Coding and Software Engineer

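Claude Sonnet 4.5 leads 3/3

| Benchmark | Claude Sonnet 4.5 (rank) | Claude Sonnet 4 (rank) | Difference |
| --- | --- | --- | --- |
| LiveCodeBench | 71 (47 / 120) | 66 (58 / 120) | +5 |
| SWE-bench Verified | 82 (6 / 108) | 80.20 (13 / 108) | +1.80 |
| SWE-Bench Pro - Public | 43.60 (36 / 43) | 42.70 (37 / 43) | +0.90 |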

Agent Level Benchmark

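Claude Sonnet 4.5 leads 2/2

| Benchmark | Claude Sonnet 4.5 (rank) | Claude Sonnet 4 (rank) | Difference |
| --- | --- | --- | --- |
| τ²-Bench - Telecom | 98 (5 / 35) | 65 (29 / 35) | +33 |
| τ²-Bench | 84.70 (9 / 40) | 52 (33 / 40) | +32.70 |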

AI Agent - Tool Usage

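Claude Sonnet 4.5 leads 2/2

| Benchmark | Claude Sonnet 4.5 (rank) | Claude Sonnet 4 (rank) | Difference |
| --- | --- | --- | --- |
| OSWorld-Verified | 61.40 (14 / 18) | 42.20 (16 / 18) | +19.20 |
| Terminal-Bench | 50 (3 / 35) | 41.30 (10 / 35) | +8.70 |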

Claw-style Agent Evaluation

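Claude Sonnet 4.5 leads 2/2

| Benchmark | Claude Sonnet 4.5 (rank) | Claude Sonnet 4 (rank) | Difference |
| --- | --- | --- | --- |
| Claw Bench | 88.10 (13 / 29), Thinking (With Tools) | 77.80 (23 / 29), Thinking (With Tools) | +10.30 |
| Pinch Bench | 88.20 (4 / 37), Thinking (With Tools) | 80.50 (22 / 37), Thinking (With Tools) | +7.70 |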

Instruction Following

Claude Sonnet 4.5 leads 1/1

| Benchmark | Claude Sonnet 4.5 (rank) | Claude Sonnet 4 (rank) | Difference |
| --- | --- | --- | --- |
| IF Bench | 57.30 (21 / 29) | 55 (22 / 29) | +2.30 |

Long Context

Claude Sonnet 4.5 leads 1/1

| Benchmark | Claude Sonnet 4.5 (rank) | Claude Sonnet 4 (rank) | Difference |
| --- | --- | --- | --- |
| AA-LCR | 66 (8 / 13) | 65 (10 / 13) | +1 |

Multimodal Understanding

Claude Sonnet 4.5 leads 1/1

| Benchmark | Claude Sonnet 4.5 (rank) | Claude Sonnet 4 (rank) | Difference |
| --- | --- | --- | --- |
| MMMU | 77.80 (14 / 28) | 76.50 (16 / 28) | +1.30 |

Productivity Knowledge

Claude Sonnet 4.5 leads 1/1

| Benchmark | Claude Sonnet 4.5 (rank) | Claude Sonnet 4 (rank) | Difference |
| --- | --- | --- | --- |
| GDPval-AA | 39 (16 / 21) | 33 (19 / 21) | +6 |

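Commonsense Reasoning

Claude Sonnet 4.5 leads 1/1

| Benchmark | Claude Sonnet 4.5 (rank) | Claude Sonnet 4 (rank) | Difference |
| --- | --- | --- | --- |
| Simple Bench | 54.30 (9 / 27) | 45.50 (15 / 27) | +8.80 |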

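Specification comparison

| Field | Claude Sonnet 4.5 | Claude Sonnet 4 |
| --- | --- | --- |
| Publisher | Anthropic | Anthropic |
| Release date | 2025-09-30 | 2025-05-23 |
| Model type | Chat LLM | Reasoning LLM |
| Architecture | Dense model | Dense model |
| Parameter count | No data | No data |
| Context length | 1000K | 200K |
| Max output | 64K | 64K |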

Summary

  • Claude Sonnet 4.5 leads in the following categories: General Knowledge (5/6), Math and Reasoning (3/5), Coding and Software Engineer (3/3), Agent Level Benchmark (2/2), AI Agent - Tool Usage (2/2), Claw-style Agent Evaluation (2/2), Instruction Following (1/1), Long Context (1/1), Multimodal Understanding (1/1), Productivity Knowledge (1/1), Commonsense Reasoning (1/1)

Across the 25 shared benchmarks, Claude Sonnet 4.5 scores 8.81 points higher on average.

Largest single-benchmark gap: τ²-Bench - Telecom, where Claude Sonnet 4.5 scores 98 and Claude Sonnet 4 scores 65 (difference +33).
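The summary figures can be rechecked from the per-benchmark differences. The sketch below is illustrative only: the `summarize` helper and the `DIFFS` dictionary are hypothetical names, the values are the score differences copied from the tables on this page, and the mean is assumed to be a plain unweighted average over all 25 benchmarks.

```python
# Score differences (Claude Sonnet 4.5 minus Claude Sonnet 4), copied from this page.
DIFFS = {
    "HLE": 24.0, "ARC-AGI": 23.70, "ARC-AGI-2": 7.70, "LiveBench": 4.44,
    "MMLU Pro": 4.0, "GPQA Diamond": -0.40,
    "AIME2025": 15.0, "FrontierMath - Tier 4": 2.10, "FrontierMath": 1.10,
    "IMO-ProofBench": 0.0, "IMO-ProofBench Advanced": 0.0,
    "LiveCodeBench": 5.0, "SWE-bench Verified": 1.80, "SWE-Bench Pro - Public": 0.90,
    "τ²-Bench - Telecom": 33.0, "τ²-Bench": 32.70,
    "OSWorld-Verified": 19.20, "Terminal-Bench": 8.70,
    "Claw Bench": 10.30, "Pinch Bench": 7.70,
    "IF Bench": 2.30, "AA-LCR": 1.0, "MMMU": 1.30, "GDPval-AA": 6.0,
    "Simple Bench": 8.80,
}


def summarize(diffs):
    """Tally leads, ties, and losses, then compute the mean difference and the largest gap."""
    wins = sum(1 for d in diffs.values() if d > 0)
    losses = sum(1 for d in diffs.values() if d < 0)
    ties = len(diffs) - wins - losses
    # Unweighted mean over all benchmarks, rounded to two decimals as on the page.
    mean_diff = round(sum(diffs.values()) / len(diffs), 2)
    # Benchmark with the largest absolute difference.
    largest_gap = max(diffs, key=lambda name: abs(diffs[name]))
    return {
        "wins": wins,
        "ties": ties,
        "losses": losses,
        "mean_diff": mean_diff,
        "largest_gap": largest_gap,
    }


if __name__ == "__main__":
    print(summarize(DIFFS))
```

Because the mean is unweighted, the two τ²-Bench results alone contribute roughly 2.6 of the 8.81 points, so the average is sensitive to which benchmarks happen to be shared.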

The body of this page is generated from structured model, pricing, and benchmark data; it is not written by a live LLM.