Benchmark Results

Claude Opus 4.6

Comprehensive Evaluation

8 evaluations

Benchmark	Mode	Score	Rank/total
ARC-AGI	Low	--	20 / 65
ARC-AGI	Extended	--	11 / 65
GPQA Diamond	Extended	91.31	12 / 175
MMLU	Extended	91.05	7 / 65
ARC-AGI-2	Low	64.60	15 / 58
ARC-AGI-2	Extended	66.30	14 / 58
HLE	Extended ｜ Tools ｜ Internet	--	8 / 149
ARC-AGI-3	Thinking Level · Max	--	1 / 6

Coding and Software Engineering

5 evaluations

Benchmark	Mode	Score	Rank/total
HumanEval	Extended	--	2 / 39
SWE-bench Verified	Extended ｜ Tools	80.84	6 / 103
SWE-bench	Extended ｜ Tools	77.83	1 / 2
LiveCodeBench	Extended	--	35 / 118
SWE-bench Multilingual	Extended ｜ Tools	--	9 / 17

General Knowledge Q&A

1 evaluation

Benchmark	Mode	Score	Rank/total
SimpleQA	Extended	--	6 / 45

Mathematical Reasoning

7 evaluations

Benchmark	Mode	Score	Rank/total
AIME2025	Extended	99.79	7 / 106
MATH-500	Extended	97.60	10 / 44
FrontierMath	Thinking Level · Max	40.70	7 / 60
FrontierMath - Tier 4	64K	20.80	14 / 80
FrontierMath - Tier 4	32K	20.80	14 / 80
FrontierMath - Tier 4	High	14.60	23 / 80
FrontierMath - Tier 4	Thinking Level · Max	22.90	12 / 80

Multimodal Understanding

2 evaluations

Benchmark	Mode	Score	Rank/total
MMMU	Extended	73.90	18 / 28
MMMU	Extended ｜ Tools	77.30	15 / 28

Agent Capability Evaluation

2 evaluations

Benchmark	Mode	Score	Rank/total
τ²-Bench - Telecom	Extended ｜ Tools	99.25	2 / 35
τ²-Bench	Extended ｜ Tools	91.89	1 / 40

Instruction Following

1 evaluation

Benchmark	Mode	Score	Rank/total
IF Bench	Extended	--	1 / 27

AI Agent - Information Gathering

1 evaluation

Benchmark	Mode	Score	Rank/total
BrowseComp	Thinking Enabled ｜ Tools ｜ Internet	--	6 / 43

AI Agent - Tool Use

2 evaluations

Benchmark	Mode	Score	Rank/total
OSWorld-Verified	Extended ｜ Tools	72.70	6 / 14
Terminal Bench 2.0	Extended ｜ Tools	65.40	9 / 43

Productivity Knowledge

1 evaluation

Benchmark	Mode	Score	Rank/total
GDPval-AA	Extended ｜ Tools ｜ Internet	1606	1 / 20

OpenClaw Comprehensive Agent Capability Evaluation

1 evaluation

Benchmark	Mode	Score	Rank/total
Pinch Bench	Thinking Enabled ｜ Tools	87.40	7 / 37

Competitor Comparison

Benchmark scores for Claude Opus 4.6 compared against top models in its class

The chart shows each model’s highest score per benchmark within the current filter. See the table below for per-mode details.

12 benchmarks with comparable scores. Each model shows its best score; mode label is displayed below.

Benchmark	Category	Claude Opus 4.6 (current)	GPT-5.4	Gemini 3.1 Pro Preview
ARC-AGI	Comprehensive Evaluation	92.00 (Extended Thinking)	93.70 (Standard Mode)	--
ARC-AGI-2	Comprehensive Evaluation	66.30 (Extended Thinking)	77.10 (Standard Mode)	77.10 (Thinking Level · High)
GPQA Diamond	Comprehensive Evaluation	91.31 (Extended Thinking)	--	94.30 (Thinking Level · High)
HLE	Comprehensive Evaluation	53.00 (Extended Thinking ｜ Tools)	52.10 (Thinking Level · Extra High ｜ Tools)	51.40 (Thinking Level · High ｜ Tools)
MMLU	Comprehensive Evaluation	91.05 (Extended Thinking)	--	92.60 (Thinking Level · High)
LiveCodeBench	Coding and Software Engineering	76.00 (Extended Thinking)	--	91.70 (Thinking Level · High ｜ Tools)
SWE-bench Verified	Coding and Software Engineering	80.84 (Extended Thinking ｜ Tools)	--	80.60 (Thinking Level · High ｜ Tools)
FrontierMath	Mathematical Reasoning	40.70 (Thinking Level · High)	--	36.90 (Thinking Level · High)
FrontierMath - Tier 4	Mathematical Reasoning	22.90 (Thinking Level · High)	27.10 (Thinking Level · Extra High)	16.70 (Standard Mode)
MMMU	Multimodal Understanding	77.30 (Extended Thinking ｜ Tools)	--	80.50 (Thinking Level · High)
τ²-Bench	Agent Capability Evaluation	91.89 (Extended Thinking ｜ Tools)	--	90.80 (Thinking Level · High ｜ Tools)
τ²-Bench - Telecom	Agent Capability Evaluation	99.25 (Extended Thinking ｜ Tools)	98.90 (Thinking Level · Extra High ｜ Tools)	99.30 (Thinking Level · High ｜ Tools)
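
The "best score per benchmark" rule described for this table can be illustrated with a minimal Python sketch. The record layout and the function name `best_per_benchmark` are assumptions made for illustration, not the site's actual implementation; the scores are the per-mode Claude Opus 4.6 results listed earlier on this page.

```python
from typing import NamedTuple


class Result(NamedTuple):
    benchmark: str
    mode: str
    score: float


# Per-mode scores for Claude Opus 4.6, copied from the results listed earlier.
RESULTS = [
    Result("ARC-AGI-2", "Low", 64.60),
    Result("ARC-AGI-2", "Extended", 66.30),
    Result("MMMU", "Extended", 73.90),
    Result("MMMU", "Extended | Tools", 77.30),
    Result("FrontierMath - Tier 4", "64K", 20.80),
    Result("FrontierMath - Tier 4", "32K", 20.80),
    Result("FrontierMath - Tier 4", "High", 14.60),
    Result("FrontierMath - Tier 4", "Thinking Level · Max", 22.90),
]


def best_per_benchmark(results):
    """Return {benchmark: Result}, keeping the highest-scoring mode for each benchmark."""
    best = {}
    for r in results:
        # Keep the first result seen for a benchmark, then replace it only
        # when a strictly higher score appears.
        if r.benchmark not in best or r.score > best[r.benchmark].score:
            best[r.benchmark] = r
    return best


if __name__ == "__main__":
    for name, r in best_per_benchmark(RESULTS).items():
        print(f"{name}: {r.score:.2f} ({r.mode})")
```

Under this rule the three benchmarks collapse to 66.30, 77.30, and 22.90, which matches the Claude Opus 4.6 column of the comparison table.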

4 additional benchmarks remain in the chart above.

Standard API Pricing: Claude Opus 4.6 vs. Peer Models

Shows standard text input and output pricing side by side for each model. If extended-context pricing exists, the chart keeps the base rate and explains the threshold below.

Source: DataLearnerAI. Standard text prices shown here use the default supplier. · USD / 1M tokens

When a context threshold exists, the charted base price only applies within these limits:

Claude Opus 4.6: Base price applies to <= 200K

GPT-5.4: Base price applies to <= 272K

Gemini 3.1 Pro Preview: Base price applies to <= 200K

Model	Supplier	Standard input	Standard output	Base price applies to
Claude Opus 4.6	Anthropic	$5 / 1M tokens	$25 / 1M tokens	<= 200K
GPT-5.4	OpenAI	$2.5 / 1M tokens	$15 / 1M tokens	<= 272K
Gemini 3.1 Pro Preview	Google DeepMind	$2 / 1M tokens	$12 / 1M tokens	<= 200K
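
To make the per-token rates concrete, here is a minimal Python sketch that estimates the cost of a single request at the base rates in the table. The names `Pricing` and `estimate_cost` are hypothetical, and treating the "base price applies to" threshold as a limit on input tokens is an assumption; the page does not list rates beyond the threshold, so the sketch refuses to estimate those cases.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_million: float   # USD per 1M input tokens
    output_per_million: float  # USD per 1M output tokens
    base_limit_tokens: int     # base price applies up to this many tokens


# Values copied from the pricing table above.
PRICING = {
    "Claude Opus 4.6": Pricing(5.0, 25.0, 200_000),
    "GPT-5.4": Pricing(2.5, 15.0, 272_000),
    "Gemini 3.1 Pro Preview": Pricing(2.0, 12.0, 200_000),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request at the base (standard) rate."""
    p = PRICING[model]
    # Assumption: the threshold is checked against the input size. Rates beyond
    # the threshold are not listed on this page, so refuse to guess.
    if input_tokens > p.base_limit_tokens:
        raise ValueError(
            f"{model}: base price only applies to <= {p.base_limit_tokens} tokens"
        )
    return (
        input_tokens * p.input_per_million + output_tokens * p.output_per_million
    ) / 1_000_000


if __name__ == "__main__":
    # 100K input tokens and 10K output tokens on Claude Opus 4.6.
    print(estimate_cost("Claude Opus 4.6", 100_000, 10_000))  # 0.75
```

For the same 100K-in / 10K-out request, the listed rates give roughly $0.40 for GPT-5.4 and $0.32 for Gemini 3.1 Pro Preview.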

Version History

How each version of the Claude Opus 4.6 series stacks up on benchmark tests

The chart shows each model’s highest score per benchmark within the current filter. See the table below for per-mode details.

12 benchmarks with comparable scores. Each model shows its best score, with the mode label shown next to it.

Benchmark	Category	Claude Opus 4.6 (current)	Opus 4.5	Opus 4.1	Claude Opus 4
ARC-AGI	Comprehensive Evaluation	92.00 (Extended Thinking)	80.00 (Extended Thinking)	--	35.70 (Standard Mode)
ARC-AGI-2	Comprehensive Evaluation	66.30 (Extended Thinking)	37.60 (Extended Thinking)	--	8.60 (Standard Mode)
GPQA Diamond	Comprehensive Evaluation	91.31 (Extended Thinking)	87.00 (Extended Thinking)	81.00 (Extended Thinking)	79.60 (Standard Mode)
HLE	Comprehensive Evaluation	53.00 (Extended Thinking ｜ Tools)	43.20 (Extended Thinking ｜ Tools)	--	10.70 (Standard Mode)
LiveCodeBench	Coding and Software Engineering	76.00 (Extended Thinking)	87.00 (Extended Thinking ｜ Tools)	--	56.60 (Standard Mode)
SWE-bench Verified	Coding and Software Engineering	80.84 (Extended Thinking ｜ Tools)	80.90 (Extended Thinking ｜ Tools)	74.50 (Extended Thinking ｜ Tools)	72.50 (Standard Mode)
AIME2025	Mathematical Reasoning	99.79 (Extended Thinking)	--	78.00 (Extended Thinking)	75.50 (Standard Mode)
FrontierMath	Mathematical Reasoning	40.70 (Thinking Level · High)	20.70 (Extended Thinking)	7.20 (Extended Thinking)	4.50 (Standard Mode)
FrontierMath - Tier 4	Mathematical Reasoning	22.90 (Thinking Level · High)	4.20 (Standard Mode)	4.20 (32K)	4.20 (Thinking Enabled)
MATH-500	Mathematical Reasoning	97.60 (Extended Thinking)	--	--	98.20 (Standard Mode)
MMMU	Multimodal Understanding	77.30 (Extended Thinking ｜ Tools)	80.70 (Extended Thinking)	--	--
τ²-Bench	Agent Capability Evaluation	91.89 (Extended Thinking ｜ Tools)	81.99 (Extended Thinking ｜ Tools)	--	72.50 (Thinking Enabled ｜ Tools)
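
One way to read this table is as score changes between two versions. The sketch below, in Python, computes the change from Opus 4.5 to Claude Opus 4.6 for a few rows and skips any benchmark where one side is missing. The function name `score_deltas` and the data layout are assumptions for illustration; note also that the two versions are not always scored in the same mode, so a delta is only a rough indication.

```python
# Best scores copied from the version-history table above; None marks a "--" cell.
SCORES = {
    "ARC-AGI": {"Claude Opus 4.6": 92.00, "Opus 4.5": 80.00},
    "ARC-AGI-2": {"Claude Opus 4.6": 66.30, "Opus 4.5": 37.60},
    "GPQA Diamond": {"Claude Opus 4.6": 91.31, "Opus 4.5": 87.00},
    "LiveCodeBench": {"Claude Opus 4.6": 76.00, "Opus 4.5": 87.00},
    "SWE-bench Verified": {"Claude Opus 4.6": 80.84, "Opus 4.5": 80.90},
    "AIME2025": {"Claude Opus 4.6": 99.79, "Opus 4.5": None},
    "FrontierMath": {"Claude Opus 4.6": 40.70, "Opus 4.5": 20.70},
}


def score_deltas(scores, newer, older):
    """Return {benchmark: newer - older}, skipping rows with a missing score."""
    deltas = {}
    for benchmark, by_model in scores.items():
        new, old = by_model.get(newer), by_model.get(older)
        if new is None or old is None:
            continue  # no comparable score for this benchmark
        deltas[benchmark] = round(new - old, 2)
    return deltas


if __name__ == "__main__":
    deltas = score_deltas(SCORES, "Claude Opus 4.6", "Opus 4.5")
    # Print the largest gains first; regressions end up at the bottom.
    for benchmark, delta in sorted(deltas.items(), key=lambda kv: -kv[1]):
        print(f"{benchmark}: {delta:+.2f}")
```

On these rows the largest gain is on ARC-AGI-2 (+28.70), while LiveCodeBench (-11.00) and SWE-bench Verified (-0.06) are lower than the Opus 4.5 figures.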

4 additional benchmarks remain in the chart above.

Single-Benchmark Version Trend

Viewing: ARC-AGI · Comprehensive Evaluation

Modes: Normal, Normal + Tools, Thinking, Thinking + Tools, Deep, Deep + Tools

X-axis shows model and release date, Y-axis shows score; solid lines connect the same mode across versions, while dotted guides align modes within the same generation.

Standard API Pricing Across the Claude Opus 4.6 Series

Shows standard text input and output pricing side by side for each model. If extended-context pricing exists, the chart keeps the base rate and explains the threshold below.

Source: DataLearnerAI. Standard text prices shown here use the default supplier. · USD / 1M tokens

When a context threshold exists, the charted base price only applies within these limits:

Claude Opus 4.6: Base price applies to <= 200K

Model	Supplier	Standard input	Standard output	Base price applies to
Claude Opus 4.6	Anthropic	$5 / 1M tokens	$25 / 1M tokens	<= 200K
Opus 4.5	Anthropic	$5 / 1M tokens	$25 / 1M tokens	--
Opus 4.1	Anthropic	$15 / 1M tokens	$75 / 1M tokens	--
Claude Opus 4	--	$15 / 1M tokens	$75 / 1M tokens	--

Claude Opus 4.6 Benchmark Analysis

Claude Opus 4.6's strongest benchmark results are τ²-Bench (rank 1 / 40, score 91.89), IF Bench (rank 1 / 27, score 94), and GDPval-AA (rank 1 / 20, score 1606). This page also compares it with 2 competitor models and 3 predecessor or same-series models, with performance and pricing views where available. 6 source links are attached for reference.

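Claude Opus 4.6: In-Depth Analysis of the Evaluation Results

Model Overview

Claude Opus 4.6 is a flagship large language model released by Anthropic on February 5, 2026. As the latest iteration of the Opus series, it makes major advances in reasoning, long-text processing, and AI agent applications. The model supports a context window of up to 1M tokens and an output length of 131K tokens, and it introduces a "Thinking Mode" for the first time, which exposes a clearer reasoning process through extended chain-of-thought processing.

Core technical features:

Context length: 1,000K tokens (industry-leading)
Dual reasoning modes: standard mode + thinking mode
Multimodal support: text and image input and output
Full Chinese-language support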

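Evaluation Performance Overview

Claude Opus 4.6 took part in 25 authoritative evaluations spanning 8 major areas, including comprehensive evaluation, coding, agent capability, and long context. It ranked first or in the top three on several key dimensions, demonstrating its strength as a top-tier large model of 2026.

🏆 Top-Performing Areas

In-Depth Analysis of Core Capabilities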

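1. Abstract Reasoning: The Industry Ceiling

Claude Opus 4.6's results on the ARC-AGI series stand out. ARC-AGI is widely regarded as a touchstone for AI abstract reasoning: it requires a model to recognize complex visual patterns and reason logically in a zero-shot setting.

Data highlights:

ARC-AGI (original), thinking mode at high effort: 94 (rank 1/41, ahead of every other model evaluated)
ARC-AGI-2 (harder version), thinking mode at high effort: 69.2 (rank 1/31, keeping the lead on a harder test)

Technical interpretation: These results indicate that Opus 4.6 handles more than language tasks and has joint visual-logical reasoning close to human level. The higher the thinking effort, the better the model performs, which suggests that its deep reasoning chain is effective.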

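2. The Revolutionary Breakthrough of Thinking Mode

Thinking mode is one of Opus 4.6's core innovations. Comparing standard mode and thinking mode on the same evaluations, we found that thinking mode improves scores by 14 points on average, and by as much as 21 points on some tasks.

Typical comparison cases:

Key finding: Thinking mode has a clear advantage on tasks that need multi-step reasoning and complex decisions. On tasks that emphasize fast execution, such as Terminal Bench Hard, standard mode actually does better (49 vs. 46), which shows that each mode suits different scenarios.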

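3. AI Agent Capability: Near-Perfect Tool Use

On the τ²-Bench - Telecom evaluation, Opus 4.6 scored 99.3 (thinking + tools mode), a near-perfect result on telecom-domain agent tasks. This shows that the model can:

Accurately understand complex industry requirements
Call specialized toolchains efficiently
Orchestrate multi-step tasks

It also ranks first in the Terminal Bench series (49 in standard + tools mode) and second on Terminal Bench 2.0 (65.4 in thinking + tools mode), showing strong terminal operation and system interaction skills suited to DevOps and automated operations.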

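4. Coding and Software Engineering: Real-World Validation

On SWE-bench Verified, an evaluation built on real software engineering tasks, Opus 4.6 scored 80.8 (rank 3/81). The evaluation requires a model to:

Understand real GitHub codebases
Locate and fix actual bugs
Write code that meets engineering standards

This result shows that Opus 4.6 can handle complex tasks in real development environments, not only textbook-style programming problems, making it an ideal choice for AI-assisted coding tools.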

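5. Long-Context Processing: The Business Value of 1M Tokens

With its 1M-token context window, Opus 4.6 scored 71 in thinking mode on the AA-LCR (long-context retrieval) evaluation (rank 1/2), 13 points above the 58 it scored in standard mode.

Practical value:

Process an entire book or a complete codebase in one pass
Full-text analysis of enterprise knowledge bases
Keep context coherent across long conversation histories
Precise understanding of very long texts such as legal documents and contracts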

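Areas for Improvement

Instruction Following Needs Strengthening

On the IF Bench (instruction following) evaluation, Opus 4.6 performs relatively modestly:

Thinking + tools mode: 53 (rank 16/20)
Standard + tools mode: 45 (rank 19/20)

This indicates room for improvement on tasks that require strictly following complex, multi-level instructions. For applications that need precise execution of user instructions (such as formatted output or generation under strict constraints), additional prompt engineering is recommended.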

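Pricing and Cost-Effectiveness Analysis

Opus 4.6 offers three pricing modes to meet different needs:

Cost optimization suggestions:

For non-urgent batch jobs, batch mode can save 50% on input costs
Use standard mode for simple tasks and enable thinking mode for complex reasoning tasks
Choose the mode based on task characteristics and avoid overusing thinking mode, which increases cost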

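Recommended Use Cases

✅ Strongly Recommended

Research and academia: a GPQA Diamond score of 91.3 demonstrates its ability to handle graduate-level science questions
Software development: the real-world coding ability validated by a SWE-bench score of 80.8 suits AI coding assistants
Enterprise knowledge management: the 1M-token context supports analysis and intelligent retrieval across entire document libraries
Complex agent development: the tool calling and task orchestration shown by a τ²-Bench score of 99.3
Strategic analysis and decision-making: the abstract reasoning and deep thinking shown by an ARC-AGI score of 94
Deep understanding and summarization of legal documents, academic papers, and technical manuals

⚠️ Evaluate With Caution

Strict instruction-following tasks: IF Bench ranks near the bottom, so extra optimization is needed
Cost-sensitive applications: as a flagship model, its inference cost is higher than the Sonnet series
Real-time interaction: thinking mode adds response latency, so accuracy must be weighed against speed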

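Top-performing areas:

Area	Benchmark and score	Rank	Capabilities
Abstract reasoning	ARC-AGI, thinking (high effort): 94	🥇 1/41	Pattern recognition, logical reasoning
Agent capability	τ²-Bench Telecom: 99.3	🥇 1/23	Tool calling, task execution
Long context	AA-LCR, thinking mode: 71	🥇 1/2	Long-document understanding
Coding and engineering	SWE-bench Verified: 80.8	🥉 3/81	Code understanding, problem solving
Scientific reasoning	GPQA Diamond, thinking: 91.3	Top 5 / 146	Graduate-level questions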

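Standard mode vs. thinking mode:

Benchmark	Standard mode	Thinking mode	Gain	Use cases
GPQA Diamond	84	91.3	+7.3	Science questions, academic research
HLE (Humanity's Last Exam)	18.6	40	+21.4	Complex decisions, multi-step reasoning
τ²-Bench Telecom	85	99.3	+14.3	Domain-specific agents
AA-LCR (long context)	58	71	+13	Long-document analysis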
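
The analysis cites an average thinking-mode gain of 14 points and a maximum of about 21. A short Python check of that arithmetic, using only the four rows above, is sketched below; the variable names are illustrative, and the figures may not cover every benchmark the original average was based on.

```python
from statistics import mean

# (standard mode, thinking mode) score pairs from the four rows above.
PAIRS = {
    "GPQA Diamond": (84.0, 91.3),
    "HLE": (18.6, 40.0),
    "τ²-Bench Telecom": (85.0, 99.3),
    "AA-LCR": (58.0, 71.0),
}

# Gain per benchmark = thinking-mode score minus standard-mode score.
gains = {name: thinking - standard for name, (standard, thinking) in PAIRS.items()}

mean_gain = mean(gains.values())
top_benchmark = max(gains, key=gains.get)

if __name__ == "__main__":
    print(f"mean gain: {mean_gain:.1f}")  # mean gain: 14.0
    print(f"largest gain: {top_benchmark} (+{gains[top_benchmark]:.1f})")
```

The four gains (7.3, 21.4, 14.3, 13) sum to 56, so the mean is 14.0, consistent with the stated average; the largest gain is on HLE.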

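Pricing modes:

Mode	Input	Output	Use case
Standard mode	$5/1M tokens	$25/1M tokens	General applications
Batch mode	$2.5/1M tokens (50% discount)	$12.5/1M tokens (50% discount)	Large-scale processing
Fast mode	$30/1M tokens	$150/1M tokens	Low-latency needs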
