Claude Sonnet 4.6 Benchmark Analysis

Claude Sonnet 4.6 currently shows benchmark results led by AA-LCR (1 / 13, score 71), LiveBench (12 / 115, score 75.47), GPQA Diamond (22 / 179, score 89.90). This page also compares it with 3 competitor models and 3 predecessor or same-series models, including performance and pricing views when available. 3 source links are attached for reference.

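Anthropic officially released Claude Sonnet 4.6 on February 17, 2026, the latest model in its Sonnet line. It continues Anthropic's “reliable, controllable” design philosophy and focuses on coding, computer use (mouse and keyboard operation), long-context reasoning, agentic planning, and knowledge work, while keeping the same API pricing as its predecessor, Sonnet 4.5. Anthropic positions it as its “most capable Sonnet model”; it serves as the default model for Free and Pro users on claude.ai and is also available through the API and major cloud platforms. The analysis below draws on Anthropic's official announcement, the system card, and some third-party reporting, and aims to present the model's actual performance objectively rather than promotionally.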

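Core Features and Availability

  • Context window: 1M tokens (beta), with support for context compaction and an adaptive thinking mode, which helps with whole codebases, long documents, and multi-turn agent tasks.
  • Pricing: $3 per million input tokens and $15 per million output tokens, unchanged from Sonnet 4.5. Against the flagship Opus 4.6 ($5 / $25 in the pricing table on this page), that is about 60% of the cost.
  • Other capabilities: tool calling (web search, code execution), improved visual output, and integration with products such as Claude Code. Safety evaluations show low hallucination and sycophancy rates, with overall alignment on par with or slightly better than Opus 4.6.

These features make Sonnet 4.6 a better fit for high-frequency, near-flagship workloads than for the extremely complex scenarios that still require Opus.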

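Benchmark Performance Data

Anthropic's system card provides a detailed comparison table (most results are averaged over 10 runs with adaptive thinking at maximum effort, unless noted otherwise). The representative metrics below compare the model with its predecessor and its main competitors (Gemini 3 Pro, GPT-5.2, and others):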

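Benchmark | Sonnet 4.6 | Opus 4.6 | Sonnet 4.5 | Gemini 3 Pro | GPT-5.2
SWE-bench Verified (real-world coding) | 79.6% | 80.8% | 77.2% | 76.2% | 80.0%
OSWorld-Verified (computer use) | 72.5% | 72.7% | 61.4% | -- | --
GDPval-AA Elo (knowledge / office tasks) | 1633 | 1606 | 1276 | 1201 | 1462
GPQA Diamond (graduate-level reasoning) | 89.9% | 91.3% | 83.4% | 91.9% | 93.2%
ARC-AGI-2 (max effort) | 60.4% | 69.2% | 13.6% | 31.1% | 54.2%
Terminal-Bench 2.0 | 59.1% | 65.4% | 51.0% | 56.2% | 64.7%
HLE (Humanity’s Last Exam, with tools) | 49.0% | 53.0% | 33.6% | 45.8% | 50.0%
Finance agent analysis (accuracy) | 63.3% (max thinking) | 60.05% | 58.53% | -- | --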

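Reading the Data

  • Coding and agent tasks: on SWE-bench the model comes close to Opus, and the OSWorld computer-use score has risen from 14.9% in October 2024 to 72.5%, reflecting Anthropic's sustained investment in GUI operation. On practical office and finance tasks (GDPval-AA, and 94% on an insurance benchmark), Sonnet 4.6 even edges past Opus 4.6, which points to strong value for money on the “practical agent” dimension.
  • Long context: on the 8-needle MRCR test at 1M tokens, the match rate is 65.1% (64k sampling), far above Sonnet 4.5's 18.5% but still behind Opus 4.6's 78.3%.
  • General reasoning: GPQA, MMMLU, and similar metrics place it near the top, but it does not lead GPT-5.2 or Gemini 3 Pro across the board.
  • User preference tests (internal, in Claude Code): developers preferred Sonnet 4.6 over Sonnet 4.5 about 70% of the time and over Opus 4.5 about 59% of the time, citing more accurate instruction following, fewer hallucinations, and more consistent multi-step execution.

Overall, Sonnet 4.6 makes a marked jump over its predecessor on most benchmarks and approaches or surpasses the more expensive flagship on some real-world agent tasks, but on purely academic reasoning (such as GPQA) it still trails the very best competitors by a small margin.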
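The gains over Sonnet 4.5 and the remaining gaps to Opus 4.6 can be read directly off the system-card table above. The following is a minimal sketch rather than anything published by Anthropic: the dictionary keys and the function name `point_gaps` are illustrative assumptions, and the scores are copied from the percentage rows of that table.

```python
# Scores (percent) copied from the system-card comparison table above.
# Key names such as "sonnet_4_6" are illustrative, not official identifiers.
SCORES = {
    "SWE-bench Verified": {"sonnet_4_6": 79.6, "opus_4_6": 80.8, "sonnet_4_5": 77.2},
    "OSWorld-Verified": {"sonnet_4_6": 72.5, "opus_4_6": 72.7, "sonnet_4_5": 61.4},
    "GPQA Diamond": {"sonnet_4_6": 89.9, "opus_4_6": 91.3, "sonnet_4_5": 83.4},
    "ARC-AGI-2": {"sonnet_4_6": 60.4, "opus_4_6": 69.2, "sonnet_4_5": 13.6},
    "Terminal-Bench 2.0": {"sonnet_4_6": 59.1, "opus_4_6": 65.4, "sonnet_4_5": 51.0},
    "HLE (with tools)": {"sonnet_4_6": 49.0, "opus_4_6": 53.0, "sonnet_4_5": 33.6},
}


def point_gaps(scores):
    """Map each benchmark to (gain over Sonnet 4.5, gap to Opus 4.6) in percentage points."""
    return {
        name: (
            round(row["sonnet_4_6"] - row["sonnet_4_5"], 1),
            round(row["opus_4_6"] - row["sonnet_4_6"], 1),
        )
        for name, row in scores.items()
    }


if __name__ == "__main__":
    for name, (gain, gap) in point_gaps(SCORES).items():
        print(f"{name}: +{gain} pts vs Sonnet 4.5, {gap} pts behind Opus 4.6")
```

On these figures the largest gain is on ARC-AGI-2 and the smallest gap to Opus 4.6 is on OSWorld-Verified, which matches the reading given in the bullets above.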

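Practical Strengths and Potential Limitations

Strengths

  • Value for money: feedback from enterprise users (such as Hex, Box, Replit, and Mercury Banking) indicates that Sonnet 4.6 can take over 80-90% of the workload from Opus in most coding, document-processing, and automation scenarios, and the cost advantage is most visible under high call volumes.
  • Computer use and agents: it can operate browsers and desktops without an API and shows strong self-correction when automating legacy systems such as insurance and ERP software.
  • Safety: it refuses 99.38% of single-turn policy-violating requests, resists prompt injection noticeably better than Sonnet 4.5, meets the ASL-3 standard overall, and has no major alignment risks reported.

Limitations (based on currently public information):

  • The model was released only days ago, so large-scale independent evaluations are still scarce and most data comes from Anthropic or its partners.
  • Some early user tests mention occasional swings in response speed and rare basic errors on simple tasks (more verification is needed).
  • Computer use is still experimental, and reliability in complex GUI scenarios remains short of “fully human-level”.
  • Long-context performance still degrades under extreme 1M-token loads, although it has improved substantially.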

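Summary

Claude Sonnet 4.6 is another product of Anthropic's “efficient mid-tier” strategy: rather than chasing first place on any single benchmark, it balances capability, cost, and reliability to offer a very attractive option for real knowledge work and agent scenarios. For budget-sensitive developers, enterprise automation, or high-frequency interactive applications, it may be the most practical upgrade currently available; users who need cutting-edge reasoning can still pair it with Opus 4.6 in a tiered deployment.

Viewed objectively, AI model iteration has entered a phase of “diminishing marginal returns but steadily rising practical value”. The significance of Sonnet 4.6 lies more in making flagship-level capability scalable than in any disruptive breakthrough. Developers are advised to run small-scale tests against their own workflows to reach the conclusions that fit them best. Official reference: https://www.anthropic.com/news/claude-sonnet-4-6 and the system card.

(Data in this article reflects public information as of February 17-18, 2026; later independent evaluations may supplement or revise it.)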

Benchmark Results

General Knowledge

7 evaluations
Benchmark / mode | Score | Rank / total
GPQA Diamond | 89.90 | 22 / 179
LiveBench (Thinking Level · Low) | 70.44 | 36 / 115
LiveBench (Thinking Level · Medium) | 75.47 | 12 / 115
LiveBench (Thinking Level · High) | 75.32 | 15 / 115
ARC-AGI-2 | 58.30 | 18 / 59
HLE | 49 | 27 / 161
(benchmark name missing) | 33.20 | 72 / 161

Coding and Software Engineering

2 evaluations
Benchmark / mode | Score | Rank / total
SWE-bench Verified | 79.60 | 17 / 108
DeepSWE (Thinking Level · High + Tools) | 30 | 8 / 9

Math and Reasoning

1 evaluation
Benchmark / mode | Score | Rank / total
(benchmark name missing) | 8.30 | 34 / 80

Agent Level Benchmark

1 evaluation
Benchmark / mode | Score | Rank / total

AI Agent - Information Search

1 evaluation
Benchmark / mode | Score | Rank / total
BrowseComp | 74.70 | 21 / 46

AI Agent - Tool Usage

3 evaluations
Benchmark / mode | Score | Rank / total
OSWorld-Verified | 72.50 | 11 / 19
MCP-Atlas (Standard Mode + Tools) | 69.50 | 13 / 23
Terminal Bench 2.0 | 59.10 | 22 / 46

Productivity Knowledge

1 evaluation
Benchmark / mode | Score | Rank / total
GDPval-AA | 57 | 11 / 21

Long Context

1 evaluation
Benchmark / mode | Score | Rank / total
AA-LCR | 71 | 1 / 13

Claw-style Agent Evaluation

1 evaluation
Benchmark / mode | Score | Rank / total
Pinch Bench (Thinking Enabled + Tools) | 88 | 5 / 37
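
The Rank/total cells above are easier to compare across benchmarks with different field sizes once they are converted to a percentile. A minimal sketch, assuming cells use the `rank / total` text format shown on this page; the helper names `parse_rank` and `top_percent` are illustrative and not part of any DataLearnerAI tooling:

```python
def parse_rank(cell: str) -> tuple[int, int]:
    """Split a 'rank / total' cell such as '12 / 115' into two integers."""
    rank, total = (int(part) for part in cell.split("/"))
    return rank, total


def top_percent(rank: int, total: int) -> float:
    """Share of ranked entries at or above this rank, as a percentage."""
    if not 1 <= rank <= total:
        raise ValueError("rank must be between 1 and total")
    return round(100 * rank / total, 1)


if __name__ == "__main__":
    # Rank cells taken from the results listed above.
    for name, cell in [
        ("AA-LCR", "1 / 13"),
        ("LiveBench (Medium)", "12 / 115"),
        ("GPQA Diamond", "22 / 179"),
    ]:
        print(f"{name}: top {top_percent(*parse_rank(cell))}%")
```

Because the fields differ so much in size, a first place out of 13 (top 7.7%) and a 12th place out of 115 (top 10.4%) end up closer than the raw ranks suggest.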

Competitor Comparison

Benchmark scores for Claude Sonnet 4.6 compared against top models in its class

12 benchmarks with comparable scores. Each model shows its best score; mode label is displayed below.

Benchmark | Category | Claude Sonnet 4.6 (current) | Claude Opus 4.6 | GPT-5.2 | Gemini 3.0 Pro (Preview 11-2025)
ARC-AGI-2 | General Knowledge | 58.30 (Thinking Enabled) | 66.30 (Extended Thinking) | 54.20 (Deep Thinking Mode) | 45.10 (Thinking Enabled)
GPQA Diamond | General Knowledge | 89.90 (Thinking Enabled) | 91.31 (Extended Thinking) | 93.20 (Deep Thinking Mode) | 93.80 (Thinking Enabled)
HLE | General Knowledge | 49.00 (Thinking Enabled + Tools) | 53.00 (Extended Thinking + Tools) | 45.50 (Deep Thinking Mode + Tools) | 45.80 (Thinking Level · High + Tools)
LiveBench | General Knowledge | 75.47 (Thinking Level · Medium) | -- | 48.91 (Standard Mode) | 73.39 (Thinking Level · High)
SWE-bench Verified | Coding and Software Engineering | 79.60 (Thinking Enabled) | 80.84 (Extended Thinking + Tools) | -- | 76.20 (Thinking Enabled)
(benchmark name missing) | Math and Reasoning | 8.30 (16K) | 22.90 (Thinking Level · High) | 18.80 (Thinking Level · Extra High) | 18.80 (Thinking Enabled)
τ²-Bench - Telecom | Agent Level Benchmark | 97.90 (Thinking Enabled + Tools) | 99.25 (Extended Thinking + Tools) | -- | 98.00 (Thinking Level · High + Tools)
BrowseComp | AI Agent - Information Search | 74.70 (Thinking Enabled + Tools) | 84.00 (Thinking Enabled + Tools) | 65.80 (Deep Thinking Mode + Tools) | 59.20 (Thinking Level · High + Tools)
MCP-Atlas | AI Agent - Tool Usage | 69.50 (Standard Mode + Tools) | 76.80 (Deep Thinking Mode + Tools) | 67.60 (Thinking Level · Extra High + Tools) | 70.30 (Standard Mode + Tools)
OSWorld-Verified | AI Agent - Tool Usage | 72.50 (Thinking Enabled + Tools) | 72.70 (Extended Thinking + Tools) | -- | --
Terminal Bench 2.0 | AI Agent - Tool Usage | 59.10 (Thinking Enabled + Tools) | 65.40 (Extended Thinking + Tools) | -- | 56.90 (Thinking Level · High + Tools)
GDPval-AA | Productivity Knowledge | 57.00 (Thinking Enabled) | 1606.00 (Extended Thinking + Tools) | 70.90 (Thinking Level · High + Tools) | 35.00 (Thinking Level · High)

Standard API Pricing: Claude Sonnet 4.6 vs. Peer Models

Shows standard text input and output pricing side by side for each model. If extended-context pricing exists, the chart keeps the base rate and explains the threshold below.

Source: DataLearnerAI. Standard text prices shown here use the default supplier. · USD / 1M tokens

When a context threshold exists, the charted base price only applies within these limits:

Claude Sonnet 4.6: Base price applies to <= 200K
Claude Opus 4.6: Base price applies to <= 200K
Model | Supplier | Standard input | Standard output | Base price applies to
Claude Sonnet 4.6 | Anthropic | $3 / 1M tokens | $15 / 1M tokens | <= 200K
Claude Opus 4.6 | Anthropic | $5 / 1M tokens | $25 / 1M tokens | <= 200K
GPT-5.2 | OpenAI | $1.75 / 1M tokens | $14 / 1M tokens | --
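
To see what these per-million-token rates mean for a single request, the arithmetic can be sketched as below. This is an illustration under stated assumptions, not a billing tool: the `Price` class, `PRICES` table, and `request_cost` function are hypothetical names, the rates are copied from the pricing table above, and the sketch assumes the 200K threshold applies to input tokens and simply refuses requests above it, since extended-context rates are not listed on this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Price:
    input_per_m: float                # USD per 1M input tokens
    output_per_m: float               # USD per 1M output tokens
    base_limit: Optional[int] = None  # assumed input-token ceiling for the base rate


# Rates copied from the pricing table above; the structure itself is hypothetical.
PRICES = {
    "Claude Sonnet 4.6": Price(3.00, 15.00, base_limit=200_000),
    "Claude Opus 4.6": Price(5.00, 25.00, base_limit=200_000),
    "GPT-5.2": Price(1.75, 14.00),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request at the base rate."""
    price = PRICES[model]
    if price.base_limit is not None and input_tokens > price.base_limit:
        # Extended-context pricing is not listed on this page, so do not guess it.
        raise ValueError(f"{model}: base price only covers <= {price.base_limit} input tokens")
    total = input_tokens * price.input_per_m + output_tokens * price.output_per_m
    return total / 1_000_000


if __name__ == "__main__":
    for model in PRICES:
        print(f"{model}: ${request_cost(model, 50_000, 2_000):.4f}")
```

For a request with 50,000 input tokens and 2,000 output tokens, this gives about $0.18 on Sonnet 4.6 and $0.30 on Opus 4.6, so Sonnet 4.6 costs 60% of the Opus 4.6 price at these rates.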

Version History

How each version of the Claude Sonnet 4.6 series stacks up on benchmark tests

12 benchmarks with comparable scores. Each model shows its best score; mode label is displayed below.

Benchmark | Category | Claude Sonnet 4.6 (current) | Claude Sonnet 4.5 | Claude Sonnet 4 | Claude Sonnet 3.7
ARC-AGI-2 | General Knowledge | 58.30 (Thinking Enabled) | 13.60 (Thinking Enabled) | 5.90 (Thinking Enabled) | --
GPQA Diamond | General Knowledge | 89.90 (Thinking Enabled) | 83.40 (Thinking Enabled) | 83.80 (Deep Thinking Mode + Tools) | 77.00 (Thinking Enabled)
HLE | General Knowledge | 49.00 (Thinking Enabled + Tools) | 33.60 (Thinking Enabled + Tools) | 9.60 (Thinking Enabled) | 10.30 (Thinking Enabled)
LiveBench | General Knowledge | 75.47 (Thinking Level · Medium) | 68.19 (64K) | 61.27 (64K) | --
SWE-bench Verified | Coding and Software Engineering | 79.60 (Thinking Enabled) | 82.00 (Thinking Enabled + Tools) | 80.20 (Thinking Enabled + Tools) | 70.30 (Thinking Enabled + Tools)
(benchmark name missing) | Math and Reasoning | 8.30 (16K) | 4.20 (32K) | 0.00 (Standard Mode) | --
τ²-Bench - Telecom | Agent Level Benchmark | 97.90 (Thinking Enabled + Tools) | 98.00 (Thinking Enabled + Tools) | 65.00 (Thinking Enabled + Tools) | 55.00 (Thinking Enabled + Tools)
BrowseComp | AI Agent - Information Search | 74.70 (Thinking Enabled + Tools) | 24.10 (Thinking Enabled + Tools) | -- | --
MCP-Atlas | AI Agent - Tool Usage | 69.50 (Standard Mode + Tools) | 59.50 (Thinking Enabled + Tools) | -- | --
OSWorld-Verified | AI Agent - Tool Usage | 72.50 (Thinking Enabled + Tools) | 61.40 (Thinking Enabled + Tools) | 42.20 (Thinking Enabled + Tools) | 28.00 (Thinking Enabled + Tools)
Terminal Bench 2.0 | AI Agent - Tool Usage | 59.10 (Thinking Enabled + Tools) | 42.80 (Thinking Enabled + Tools) | -- | --
GDPval-AA | Productivity Knowledge | 57.00 (Thinking Enabled) | 39.00 (Thinking Enabled) | 33.00 (Thinking Enabled) | 28.00 (Thinking Enabled)

Single-Benchmark Version Trend

[Chart: ARC-AGI-2 (General Knowledge) score by model and release date, with one line per mode: Normal, Normal + Tools, Thinking, Thinking + Tools, Deep, Deep + Tools.]

Standard API Pricing Across the Claude Sonnet 4.6 Series

When a context threshold exists, the charted base price only applies within these limits:

Claude Sonnet 4.6: Base price applies to <= 200K
Model | Supplier | Standard input | Standard output | Base price applies to
Claude Sonnet 4.6 | Anthropic | $3 / 1M tokens | $15 / 1M tokens | <= 200K

Sources