Large Model Agent Capability Leaderboard

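This page presents a leaderboard of large-model agent capabilities. It covers mainstream agent benchmarks, including Aider-Polyglot, τ²-Bench, Terminal Bench 2.0, Tool Decathlon, and OSWorld-Verified, and compares GPT, Claude, Qwen, DeepSeek, and other models on tool use, task planning, and autonomous execution.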

Data updated: 2026-07-06 20:28:17

As of July 2026, this page covers benchmarks including Aider-Polyglot, τ²-Bench, Terminal Bench 2.0, and Tool Decathlon, with a focus on comparing models by agent capability.

Each model's detail page lists its context length, license, and API pricing. The data methodology explains how the figures are reported.

Benchmarks

Agent capability: Aider-Polyglot, τ²-Bench

AI Agent - tool use: Terminal Bench 2.0, Tool Decathlon, OSWorld-Verified


Leaderboard highlights (sorted by Tool Decathlon)

Current SOTA: Kimi K2.6 (Moonshot AI), Tool Decathlon 50.00

Best open-source model: Kimi K2.6 (Moonshot AI), Tool Decathlon 50.00

Best Chinese model: Kimi K2.6 (Moonshot AI), Tool Decathlon 50.00

Model evaluation results

Data source: DataLearnerAI


Rank	Model	Developer	Aider-Polyglot	τ²-Bench	Terminal Bench 2.0	Tool Decathlon	OSWorld-Verified	License
1	Kimi K2.6	Moonshot AI	—	—	66.70	50.00	73.10	Free for commercial use
2	Hy3	Tencent AI Lab	—	—	—	48.50	—	Free for commercial use
3	o3-pro	OpenAI	84.90	—	—	—	—	Closed source
4	GPT-4.1 nano	OpenAI	8.90	—	—	—	—	Closed source
5	Gemini-2.5-Pro-Preview-05-06	Google DeepMind	76.90	—	—	—	—	Closed source
6	DeepSeek V3.2-Exp	DeepSeek-AI	74.20	66.70	—	—	—	Free for commercial use
7	Claude Opus 4	Anthropic	72.00	72.50	—	—	—	Closed source
8	OpenAI o4-mini	OpenAI	72.00	56.90	—	—	—	Closed source
9	DeepSeek-R1-0528	DeepSeek-AI	71.40	—	—	—	—	Free for commercial use
10	Claude Sonnet 3.7	Anthropic	64.90	61.80	—	—	28.00	Closed source
11	OpenAI o1	OpenAI	61.70	—	—	—	—	Closed source
12	Qwen3-235B-A22B	Alibaba	59.60	34.40	—	—	—	Free for commercial use
13	Kimi K2	Moonshot AI	59.10	64.30	—	—	—	Free for commercial use
14	DeepSeek-R1	DeepSeek-AI	56.90	—	—	—	—	Free for commercial use
15	DeepSeek-V3-0324	DeepSeek-AI	55.10	38.80	—	—	—	Free for commercial use
16	Gemini 2.5 Flash	Google DeepMind	55.10	—	—	—	—	Closed source
17	Grok 3	xAI	53.30	—	—	—	—	Closed source
18	GPT-4.1	OpenAI	52.40	54.70	—	—	—	Closed source
19	Grok 3 mini	xAI	49.30	—	—	—	—	Closed source
20	DeepSeek-V3	DeepSeek-AI	48.40	—	—	—	—	Free for commercial use
21	GPT-4.5	OpenAI	44.90	—	—	—	—	Closed source
22	Gemini 2.0 Flash Experimental	DeepMind	38.20	—	—	—	—	Closed source
23	Gemini 2.0 Pro Experimental	DeepMind	35.60	—	—	—	—	Closed source
24	OpenAI o1-mini	OpenAI	32.90	—	—	—	—	Closed source
25	GPT-4.1 mini	OpenAI	32.40	53.00	—	—	—	Closed source
26	GPT-4o (2025-01-29)	OpenAI	27.10	—	—	—	—	Closed source
27	Qwen2.5-Max	Alibaba	21.80	—	—	—	—	Closed source
28	GPT-4o (2024-11-20)	OpenAI	18.20	—	—	—	—	Closed source
29	DeepSeek-V2-236B-Chat	DeepSeek-AI	17.80	—	—	—	—	Free for commercial use
30	Llama 4 Maverick	Facebook AI Research	15.60	—	—	—	—	Free for commercial use
31	C4AI Command A (202503)	CohereAI	12.00	—	—	—	—	Non-commercial
32	Codestral 25.01	MistralAI	11.10	—	—	—	—	Closed source
33	MiniMax M3	MiniMaxAI	—	—	—	—	70.00	Non-commercial
34	M2.1	MiniMaxAI	—	—	47.90	—	—	Free for commercial use
35	Kimi K2.5	Moonshot AI	—	—	50.80	—	—	Free for commercial use
36	MiniMax M2.5	MiniMaxAI	—	—	51.70	—	—	Free for commercial use
37	DeepSeek-V4-Flash	DeepSeek-AI	—	—	56.90	—	—	Free for commercial use
38	Qwen3.6-Max-Preview	Alibaba	—	—	65.40	—	—	Closed source
39	DeepSeek-V4-Pro	DeepSeek-AI	—	—	67.90	—	—	Free for commercial use
40	Qwen3.7-Max-Preview	Alibaba	—	—	69.70	—	—	Closed source
41	DeepSeek-V3.1 Terminus	DeepSeek-AI	—	37.00	—	—	—	Free for commercial use
42	GLM-4.6	Zhipu AI	—	75.90	—	—	—	Free for commercial use
43	MiniMax M2	MiniMaxAI	—	77.20	—	—	—	Free for commercial use
44	DeepSeek V3.2	DeepSeek-AI	—	80.30	46.40	—	—	Free for commercial use
45	Qwen3-Max-Thinking	Alibaba	—	82.10	—	—	—	Closed source
46	GLM-4.7	Zhipu AI	—	87.40	41.00	—	—	Free for commercial use
47	Step 3.5 Flash	StepFunAI	—	88.20	51.00	—	—	Free for commercial use
48	GLM-5	Zhipu AI	—	89.70	61.10	—	—	Free for commercial use
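
The three highlights (current SOTA, best open-source model, best Chinese model) can be reproduced from the table rows by taking the highest Tool Decathlon score within each group. The following is only a minimal sketch of that idea, not the site's actual implementation. The `Row` type, the `best` helper, the license labels, and the `chinese` flag are all hypothetical names introduced for illustration, and treating every license other than closed source as "open" is an assumption. Only four rows are transcribed from the table; a dash in the table is represented as `None`.

```python
from typing import Callable, List, NamedTuple, Optional


class Row(NamedTuple):
    model: str
    developer: str
    tool_decathlon: Optional[float]  # None where the table shows a dash
    license: str                     # hypothetical labels: "free-commercial", "closed"
    chinese: bool                    # illustrative flag, not a column of the table


# A few rows transcribed from the leaderboard table.
ROWS: List[Row] = [
    Row("Kimi K2.6", "Moonshot AI", 50.00, "free-commercial", True),
    Row("Hy3", "Tencent AI Lab", 48.50, "free-commercial", True),
    Row("o3-pro", "OpenAI", None, "closed", False),
    Row("GLM-5", "Zhipu AI", None, "free-commercial", True),
]


def best(rows: List[Row], keep: Callable[[Row], bool] = lambda r: True) -> Optional[Row]:
    """Return the row with the highest Tool Decathlon score among rows passing `keep`.

    Rows without a score are skipped; returns None if nothing qualifies.
    """
    scored = [r for r in rows if r.tool_decathlon is not None and keep(r)]
    return max(scored, key=lambda r: r.tool_decathlon, default=None)


highlights = {
    "current_sota": best(ROWS),
    # Assumption: any license other than "closed" counts as open.
    "best_open_source": best(ROWS, lambda r: r.license != "closed"),
    "best_chinese": best(ROWS, lambda r: r.chinese),
}

for label, row in highlights.items():
    print(label, row.model if row else "n/a")
```

Because Kimi K2.6 holds the top Tool Decathlon score and is both openly licensed and a Chinese model, all three highlights resolve to the same row, which matches the highlights listed on the page.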
