Claude Sonnet 4.6 Benchmark Analysis

Claude Sonnet 4.6's strongest leaderboard placements are currently AA-LCR (1 / 13, score 71), LiveBench (12 / 115, score 75.47), and GPQA Diamond (22 / 179, score 89.90). This page also compares it with 3 competitor models and 3 predecessor or same-series models, including performance and pricing views when available. 3 source links are attached for reference.

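Anthropic officially released Claude Sonnet 4.6 on February 17, 2026, the latest version in its Sonnet line. The model continues Anthropic's "reliable, controllable" design philosophy and focuses its improvements on coding, computer use (mouse and keyboard operation), long-context reasoning, agentic planning, and knowledge work, while keeping the same API pricing as its predecessor, Sonnet 4.5. Anthropic positions it as its "most capable Sonnet model"; it serves as the default model for Free and Pro users on claude.ai and is also available through the API and the major cloud platforms. The analysis below draws on Anthropic's official announcement, the system card, and some third-party reporting, and aims to present actual performance objectively, without promotional language.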

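Core Features and Availability

Context window: 1M tokens (in beta), with support for context compaction and an adaptive thinking mode, which makes it practical to work with entire codebases, long documents, or multi-turn agent tasks.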
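Pricing: $3 per million input tokens and $15 per million output tokens, unchanged from Sonnet 4.5. Against the flagship Opus 4.6, listed at $5/$25 in this page's pricing table, that is about three-fifths of the cost.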
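Other features: support for tool calling (web search, code execution), improved visual output, and integration with products such as Claude Code. Safety evaluations show low rates of hallucination and sycophancy, with overall alignment on par with or slightly better than Opus 4.6.

These features make Sonnet 4.6 better suited to high-frequency workloads just below flagship level than to the extremely complex scenarios that require Opus.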

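Benchmark Performance Data

Anthropic's system card provides a detailed comparison table (results are mostly averages over 10 runs using adaptive thinking at maximum effort, unless otherwise noted). The table below selects representative metrics and compares them against the predecessor and major competitors (Gemini 3 Pro, GPT-5.2, and others):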

Benchmark	Sonnet 4.6	Opus 4.6	Sonnet 4.5	Gemini 3 Pro	GPT-5.2
SWE-bench Verified (real-world coding)	79.6%	80.8%	77.2%	76.2%	80.0%
OSWorld-Verified (computer use)	72.5%	72.7%	61.4%	--	--
GDPval-AA Elo (knowledge/office tasks)	1633	1606	1276	1201	1462
GPQA Diamond (graduate-level reasoning)	89.9%	91.3%	83.4%	91.9%	93.2%
ARC-AGI-2 (max effort)	60.4%	69.2%	13.6%	31.1%	54.2%
Terminal-Bench 2.0	59.1%	65.4%	51.0%	56.2%	64.7%
HLE (Humanity’s Last Exam, with tools)	49.0%	53.0%	33.6%	45.8%	50.0%
Financial agent analysis (accuracy)	63.3% (max thinking)	60.05%	--	--	58.53%

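Reading the data:

Coding and agent tasks: On SWE-bench, Sonnet 4.6 comes close to Opus. On OSWorld, computer-use performance has risen from 14.9% in October 2024 to 72.5%, reflecting Anthropic's sustained investment in GUI operation. In practical office and finance scenarios (GDPval-AA, and 94% on an insurance benchmark), Sonnet 4.6 even slightly outperforms Opus 4.6, which suggests it already offers strong value for money as a "practical agent".
Long context: On the 8-needle MRCR test at 1M tokens, the match rate is 65.1% (64k sampling), far above Sonnet 4.5's 18.5% but still behind Opus 4.6's 78.3%.
General reasoning: Scores on GPQA, MMMLU, and similar benchmarks are near the top, but Sonnet 4.6 does not lead GPT-5.2 or Gemini 3 Pro across the board.
User preference testing (internal, Claude Code): Developers preferred Sonnet 4.6 over Sonnet 4.5 about 70% of the time and over Opus 4.5 about 59% of the time, citing more accurate instruction following, fewer hallucinations, and more consistent multi-step execution.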

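Overall, Sonnet 4.6 makes a significant leap over its predecessor on most benchmarks and, on some real-world agent tasks, approaches or surpasses the more expensive flagship model. On purely academic reasoning such as GPQA, a small gap to the very best competitors remains.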
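The gaps summarized above can be checked directly against the system-card table. The short sketch below copies a few rows from that table and prints the percentage-point difference between Sonnet 4.6 and both Sonnet 4.5 and Opus 4.6. The dictionary layout and the `point_gap` helper are illustrative names, not part of any official tooling.

```python
# Scores (percent) copied from the system-card comparison table above.
# The dictionary layout and function name are illustrative assumptions.
SCORES = {
    "SWE-bench Verified": {"Sonnet 4.6": 79.6, "Opus 4.6": 80.8, "Sonnet 4.5": 77.2},
    "OSWorld-Verified": {"Sonnet 4.6": 72.5, "Opus 4.6": 72.7, "Sonnet 4.5": 61.4},
    "GPQA Diamond": {"Sonnet 4.6": 89.9, "Opus 4.6": 91.3, "Sonnet 4.5": 83.4},
    "ARC-AGI-2": {"Sonnet 4.6": 60.4, "Opus 4.6": 69.2, "Sonnet 4.5": 13.6},
    "Terminal-Bench 2.0": {"Sonnet 4.6": 59.1, "Opus 4.6": 65.4, "Sonnet 4.5": 51.0},
}


def point_gap(benchmark: str, model: str, baseline: str) -> float:
    """Return the model score minus the baseline score, in percentage points."""
    row = SCORES[benchmark]
    return round(row[model] - row[baseline], 1)


if __name__ == "__main__":
    for name in SCORES:
        vs_prev = point_gap(name, "Sonnet 4.6", "Sonnet 4.5")
        vs_opus = point_gap(name, "Sonnet 4.6", "Opus 4.6")
        print(f"{name}: {vs_prev:+.1f} vs Sonnet 4.5, {vs_opus:+.1f} vs Opus 4.6")
```

On these rows the largest jump over Sonnet 4.5 is on ARC-AGI-2, while the gap to Opus 4.6 is smallest on OSWorld-Verified.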

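Practical Strengths and Potential Limitations

Strengths:

Strong value for money: Feedback from enterprise users (such as Hex, Box, Replit, and Mercury Banking) indicates that Sonnet 4.6 can replace Opus for 80-90% of workloads in most coding, document-processing, and automation scenarios, and the cost advantage is especially noticeable under high call volumes.
Computer use and agents: It can operate browsers and desktops without an API, and shows strong self-correction when automating legacy systems such as insurance and ERP software.
Safety: It refuses 99.38% of single-turn policy-violating requests, resists prompt injection markedly better than Sonnet 4.5, meets the ASL-3 standard overall, and has no reported major alignment risks.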

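Limitations (based on currently public information):

The model was released only days ago, so large-scale independent third-party evaluations are still scarce, and most data comes from Anthropic or its partners.
Some early user tests mention occasional fluctuations in response speed and, in rare cases, basic errors on simple tasks (more verification is needed).
Computer use is still experimental, and reliability in complex GUI scenarios remains short of "full human level".
Long-context performance still degrades under extreme 1M-token loads, although it has improved substantially.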

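Summary

Claude Sonnet 4.6 is another product of Anthropic's "efficient mid-tier" strategy: rather than chasing first place on any single benchmark, it balances capability, cost, and reliability to offer a highly attractive option for real-world knowledge work and agent scenarios. For budget-sensitive developers, enterprise automation, or high-frequency interactive applications, it may be the most practical upgrade currently available; users who need the most advanced frontier reasoning can still pair it with Opus 4.6 in a tiered deployment.

Viewed objectively, AI model iteration has entered a phase of "diminishing marginal returns but steadily rising practical value". The significance of Sonnet 4.6 lies more in "making flagship-level capability scalable" than in any disruptive breakthrough. Developers are advised to run small-scale tests against their own workflows to reach the conclusions that fit them best. Official references: https://www.anthropic.com/news/claude-sonnet-4-6 and the system card.

(Data in this article reflects public information as of February 17-18, 2026; later independent evaluations may supplement or revise it.)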

Benchmark Results

General Knowledge

7 evaluations

Benchmark	Mode	Score	Rank/total
GPQA Diamond	--	89.90	22 / 179
LiveBench	Thinking Level · Low	70.44	36 / 115
LiveBench	Thinking Level · Medium	75.47	12 / 115
LiveBench	Thinking Level · High	75.32	15 / 115
ARC-AGI-2	--	58.30	18 / 59
HLE	--	--	27 / 161
HLE	--	33.20	72 / 161

Coding and Software Engineering

2 evaluations

Benchmark	Mode	Score	Rank/total
SWE-bench Verified	--	79.60	17 / 108
DeepSWE	Thinking Level · High ｜ Tools	--	8 / 9

Math and Reasoning

1 evaluation

Benchmark	Mode	Score	Rank/total
FrontierMath - Tier 4	16K	8.30	34 / 80

Agent Level Benchmark

1 evaluation

Benchmark	Mode	Score	Rank/total
τ²-Bench - Telecom	--	97.90	9 / 35

AI Agent - Information Search

1 evaluation

Benchmark	Mode	Score	Rank/total
BrowseComp	--	74.70	21 / 46

AI Agent - Tool Usage

3 evaluations

Benchmark	Mode	Score	Rank/total
OSWorld-Verified	--	72.50	11 / 19
MCP-Atlas	Standard Mode ｜ Tools	69.50	13 / 23
Terminal Bench 2.0	--	59.10	22 / 46

Productivity Knowledge

1 evaluation

Benchmark	Mode	Score	Rank/total
GDPval-AA	--	--	11 / 21

Long Context

1 evaluation

Benchmark	Mode	Score	Rank/total
AA-LCR	--	71	1 / 13

Claw-style Agent Evaluation

1 evaluation

Benchmark	Mode	Score	Rank/total
Pinch Bench	Thinking Enabled ｜ Tools	--	5 / 37
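
Because each leaderboard has a different number of entrants, raw ranks are hard to compare across benchmarks. One simple normalization, sketched below, divides rank by total. This is an illustrative calculation and not necessarily how this page orders its results; the (rank, total) pairs are copied from the results above, and the function names are hypothetical.

```python
# (rank, total) pairs copied from the benchmark results above.
# Dividing rank by total is an assumed normalization for illustration only.
RANKS = {
    "AA-LCR": (1, 13),
    "LiveBench (Medium)": (12, 115),
    "GPQA Diamond": (22, 179),
    "Pinch Bench": (5, 37),
    "SWE-bench Verified": (17, 108),
    "OSWorld-Verified": (11, 19),
    "DeepSWE": (8, 9),
}


def rank_fraction(rank: int, total: int) -> float:
    """Fraction of the leaderboard at or above this rank (lower is better)."""
    if not 1 <= rank <= total:
        raise ValueError("rank must be between 1 and total")
    return rank / total


def best_first(ranks: dict) -> list:
    """Benchmark names sorted from strongest to weakest relative placement."""
    return sorted(ranks, key=lambda name: rank_fraction(*ranks[name]))


if __name__ == "__main__":
    for name in best_first(RANKS):
        rank, total = RANKS[name]
        print(f"{name}: {rank}/{total} = top {rank_fraction(rank, total):.1%}")
```

With these inputs the first three entries are AA-LCR, LiveBench, and GPQA Diamond, the same three benchmarks highlighted in the page summary.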

Compare with other models

Competitor Comparison

Benchmark scores for Claude Sonnet 4.6 compared against top models in its class

[Chart: highest score per benchmark for Claude Sonnet 4.6, Claude Opus 4.6, GPT-5.2, and Gemini 3.0 Pro (Preview 11-2025) within the current filter. Benchmarks scored out of 100 are plotted at raw height; out-of-range benchmarks are scaled within that benchmark, with labels keeping the original scores.]

12 benchmarks with comparable scores. Each model shows its best score, with the mode label in parentheses.

Benchmark	Category	Claude Sonnet 4.6 (current)	Claude Opus 4.6	GPT-5.2	Gemini 3.0 Pro (Preview 11-2025)
ARC-AGI-2	General Knowledge	58.30 (Thinking Enabled)	66.30 (Extended Thinking)	54.20 (Deep Thinking Mode)	45.10 (Thinking Enabled)
GPQA Diamond	General Knowledge	89.90 (Thinking Enabled)	91.31 (Extended Thinking)	93.20 (Deep Thinking Mode)	93.80 (Thinking Enabled)
HLE	General Knowledge	49.00 (Thinking Enabled ｜ Tools)	53.00 (Extended Thinking ｜ Tools)	45.50 (Deep Thinking Mode ｜ Tools)	45.80 (Thinking Level · High ｜ Tools)
LiveBench	General Knowledge	75.47 (Thinking Level · Medium)	--	48.91 (Standard Mode)	73.39 (Thinking Level · High)
SWE-bench Verified	Coding and Software Engineering	79.60 (Thinking Enabled)	80.84 (Extended Thinking ｜ Tools)	--	76.20 (Thinking Enabled)
FrontierMath - Tier 4	Math and Reasoning	8.30 (16K)	22.90 (Thinking Level · High)	18.80 (Thinking Level · Extra High)	18.80 (Thinking Enabled)
τ²-Bench - Telecom	Agent Level Benchmark	97.90 (Thinking Enabled ｜ Tools)	99.25 (Extended Thinking ｜ Tools)	--	98.00 (Thinking Level · High ｜ Tools)
BrowseComp	AI Agent - Information Search	74.70 (Thinking Enabled ｜ Tools)	84.00 (Thinking Enabled ｜ Tools)	65.80 (Deep Thinking Mode ｜ Tools)	59.20 (Thinking Level · High ｜ Tools)
MCP-Atlas	AI Agent - Tool Usage	69.50 (Standard Mode ｜ Tools)	76.80 (Deep Thinking Mode ｜ Tools)	67.60 (Thinking Level · Extra High ｜ Tools)	70.30 (Standard Mode ｜ Tools)
OSWorld-Verified	AI Agent - Tool Usage	72.50 (Thinking Enabled ｜ Tools)	72.70 (Extended Thinking ｜ Tools)	--	--
Terminal Bench 2.0	AI Agent - Tool Usage	59.10 (Thinking Enabled ｜ Tools)	65.40 (Extended Thinking ｜ Tools)	--	56.90 (Thinking Level · High ｜ Tools)
GDPval-AA	Productivity Knowledge	57.00 (Thinking Enabled)	1606.00 (Extended Thinking ｜ Tools)	70.90 (Thinking Level · High ｜ Tools)	35.00 (Thinking Level · High)

2 additional benchmarks remain in the chart above.
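
The chart note for this comparison says that benchmarks scored out of 100 are plotted at raw height while out-of-range benchmarks are scaled within the benchmark. The page does not state the exact scaling rule, so the sketch below assumes the simplest one: when any score in a row exceeds 100, divide every score by the row maximum. The `bar_heights` function and its `full_scale` parameter are hypothetical names used only for illustration.

```python
def bar_heights(scores: dict, full_scale: float = 100.0) -> dict:
    """Map raw scores to bar heights on a 0..full_scale axis.

    Assumed rule (not confirmed by the page): if every score fits within
    full_scale, plot the raw values; otherwise rescale the whole row so
    that its largest score reaches full_scale.
    """
    top = max(scores.values())
    if top <= full_scale:
        return dict(scores)
    return {model: round(value / top * full_scale, 2) for model, value in scores.items()}


# GDPval-AA row copied from the comparison table above.
GDPVAL_AA = {
    "Claude Sonnet 4.6": 57.0,
    "Claude Opus 4.6": 1606.0,
    "GPT-5.2": 70.9,
    "Gemini 3.0 Pro": 35.0,
}

if __name__ == "__main__":
    for model, height in bar_heights(GDPVAL_AA).items():
        print(f"{model}: raw {GDPVAL_AA[model]}, bar height {height}")
```

Under this assumed rule, Opus 4.6's 1606 becomes the full-height bar and the other three models fall below 5% of it. That suggests the GDPval-AA row mixes an Elo-style value with percentage-style values, so it should be read with caution.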

Standard API Pricing: Claude Sonnet 4.6 vs. Peer Models

Shows standard text input and output pricing side by side for each model. If extended-context pricing exists, the chart keeps the base rate and explains the threshold below.

Source: DataLearnerAI. Standard text prices shown here use the default supplier. · USD / 1M tokens

When a context threshold exists, the charted base price only applies within these limits:

Claude Sonnet 4.6: Base price applies to <= 200K

Claude Opus 4.6: Base price applies to <= 200K

Model	Supplier	Standard input	Standard output	Base price applies to
Claude Sonnet 4.6	Anthropic	$3 / 1M tokens	$15 / 1M tokens	<= 200K
Claude Opus 4.6	Anthropic	$5 / 1M tokens	$25 / 1M tokens	<= 200K
GPT-5.2	OpenAI	$1.75 / 1M tokens	$14 / 1M tokens	--
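
To turn the per-million-token prices above into a per-request figure, multiply each token count by its rate and divide by one million. The sketch below does this with the prices from the table. The `Price` class and `request_cost` function are hypothetical names, and treating the 200K threshold as a limit on input tokens is an assumption, since the page does not list extended-context rates.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Price:
    input_per_mtok: float             # USD per 1M input tokens
    output_per_mtok: float            # USD per 1M output tokens
    base_limit_tokens: Optional[int]  # None when the table lists no threshold


# Prices copied from the table above.
PRICES = {
    "Claude Sonnet 4.6": Price(3.00, 15.00, 200_000),
    "Claude Opus 4.6": Price(5.00, 25.00, 200_000),
    "GPT-5.2": Price(1.75, 14.00, None),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the base-rate cost in USD for a single request."""
    price = PRICES[model]
    # Assumption: the threshold applies to the prompt (input) size.
    if price.base_limit_tokens is not None and input_tokens > price.base_limit_tokens:
        raise ValueError("extended-context pricing is not listed in the table")
    total = input_tokens * price.input_per_mtok + output_tokens * price.output_per_mtok
    return total / 1_000_000


if __name__ == "__main__":
    for model in PRICES:
        print(f"{model}: ${request_cost(model, 10_000, 2_000):.4f} per request")
```

For a request with 10,000 input tokens and 2,000 output tokens, this works out to $0.06 on Sonnet 4.6 and $0.10 on Opus 4.6 at the listed base rates.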

Version History

How each version of the Claude Sonnet 4.6 series stacks up on benchmark tests

12 benchmarks with comparable scores. Each model shows its best score, with the mode label in parentheses.

Benchmark	Category	Claude Sonnet 4.6 (current)	Claude Sonnet 4.5	Claude Sonnet 4	Claude Sonnet 3.7
ARC-AGI-2	General Knowledge	58.30 (Thinking Enabled)	13.60 (Thinking Enabled)	5.90 (Thinking Enabled)	--
GPQA Diamond	General Knowledge	89.90 (Thinking Enabled)	83.40 (Thinking Enabled)	83.80 (Deep Thinking Mode ｜ Tools)	77.00 (Thinking Enabled)
HLE	General Knowledge	49.00 (Thinking Enabled ｜ Tools)	33.60 (Thinking Enabled ｜ Tools)	9.60 (Thinking Enabled)	10.30 (Thinking Enabled)
LiveBench	General Knowledge	75.47 (Thinking Level · Medium)	68.19 (64K)	61.27 (64K)	--
SWE-bench Verified	Coding and Software Engineering	79.60 (Thinking Enabled)	82.00 (Thinking Enabled ｜ Tools)	80.20 (Thinking Enabled ｜ Tools)	70.30 (Thinking Enabled ｜ Tools)
FrontierMath - Tier 4	Math and Reasoning	8.30 (16K)	4.20 (32K)	0.00 (Standard Mode)	--
τ²-Bench - Telecom	Agent Level Benchmark	97.90 (Thinking Enabled ｜ Tools)	98.00 (Thinking Enabled ｜ Tools)	65.00 (Thinking Enabled ｜ Tools)	55.00 (Thinking Enabled ｜ Tools)
BrowseComp	AI Agent - Information Search	74.70 (Thinking Enabled ｜ Tools)	24.10 (Thinking Enabled ｜ Tools)	--	--
MCP-Atlas	AI Agent - Tool Usage	69.50 (Standard Mode ｜ Tools)	59.50 (Thinking Enabled ｜ Tools)	--	--
OSWorld-Verified	AI Agent - Tool Usage	72.50 (Thinking Enabled ｜ Tools)	61.40 (Thinking Enabled ｜ Tools)	42.20 (Thinking Enabled ｜ Tools)	28.00 (Thinking Enabled ｜ Tools)
Terminal Bench 2.0	AI Agent - Tool Usage	59.10 (Thinking Enabled ｜ Tools)	42.80 (Thinking Enabled ｜ Tools)	--	--
GDPval-AA	Productivity Knowledge	57.00 (Thinking Enabled)	39.00 (Thinking Enabled)	33.00 (Thinking Enabled)	28.00 (Thinking Enabled)

2 additional benchmarks remain in the chart above.

Single-Benchmark Version Trend

[Chart: ARC-AGI-2 (General Knowledge) score by model version. X-axis: model and release date; Y-axis: score. Solid lines connect the same mode (Normal, Normal + Tools, Thinking, Thinking + Tools, Deep, Deep + Tools) across versions; dotted guides align modes within the same generation.]

Standard API Pricing Across the Claude Sonnet 4.6 Series

Shows standard text input and output pricing side by side for each model. If extended-context pricing exists, the chart keeps the base rate and explains the threshold below.

Source: DataLearnerAI. Standard text prices shown here use the default supplier. · USD / 1M tokens

When a context threshold exists, the charted base price only applies within these limits:

Claude Sonnet 4.6: Base price applies to <= 200K

Model	Supplier	Standard input	Standard output	Base price applies to
Claude Sonnet 4.6	Anthropic	$3 / 1M tokens	$15 / 1M tokens	<= 200K

Sources

anthropic.com
artificialanalysis.ai
pinchbench.com