
LLM Agent Capability Benchmark Leaderboard

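This page provides a leaderboard of LLM agent capabilities. It covers mainstream agent benchmarks such as Aider-Polyglot, τ²-Bench, Terminal Bench 2.0, Tool Decathlon, and OSWorld-Verified, and compares the tool use, task planning, and autonomous execution abilities of models such as GPT, Claude, Qwen, and DeepSeek.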

Updated on 2026-04-28 13:02:03

As of 2026-04, the leaderboard covers Aider-Polyglot, τ²-Bench, Terminal Bench 2.0, Tool Decathlon, and related benchmarks, so models can be compared within the same task family.

Click any model name to check context length, licensing, and pricing on its detail page. See Data Methodology for scoring details.

Benchmarks
Agent capability evaluation: Aider-Polyglot, τ²-Bench
AI agent tool use: Terminal Bench 2.0, Tool Decathlon, OSWorld-Verified

LLM Performance Results

Data source: DataLearnerAI
| Rank | Developer | Model | Mode | Aider-Polyglot | τ²-Bench | Terminal Bench 2.0 | Tool Decathlon | OSWorld-Verified | License |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Moonshot AI | Kimi K2.6 | Thinking enabled, tools | — | — | 66.70 | 50.00 | 73.10 | Free commercial |
| 2 | OpenAI | GPT-5.4 mini | Thinking level: extra high, tools | — | — | 60.00 | 42.90 | 72.10 | Closed source |
| 3 | Zhipu AI | GLM 5.1 | Thinking enabled, tools | — | — | 63.50 | 40.70 | — | Free commercial |
| 4 | Alibaba | Qwen 3.6 Plus Preview | Thinking enabled, tools | — | — | 61.60 | 39.80 | — | Closed source |
| 5 | Alibaba | Qwen3.5-397B-A17B | Thinking enabled, tools | — | 86.70 | 52.50 | 38.30 | 62.20 | Free commercial |
| 6 | OpenAI | GPT-5.4 nano | Thinking level: extra high, tools | — | — | 46.30 | 35.50 | 39.00 | Closed source |
| 7 | Alibaba | Qwen3.6-35B-A3B | Thinking enabled | — | — | 51.50 | 26.90 | — | Free commercial |
| 8 | OpenAI | o3-pro | Thinking level: high | 84.90 | — | — | — | — | Closed source |
| 9 | Google DeepMind | Gemini 2.5-Pro | Thinking enabled | 83.10 | — | — | — | — | Closed source |
| 10 | OpenAI | OpenAI o3 | Thinking level: high | 81.30 | — | — | — | — | Closed source |
| 11 | xAI | Grok 4 | Thinking enabled | 79.60 | — | — | — | — | Closed source |
| 12 | DeepSeek-AI | DeepSeek-V3.1 | Thinking enabled | 76.30 | — | — | — | — | Free commercial |
| 13 | DeepSeek-AI | DeepSeek-V3.1 Terminus | | 76.10 | — | — | — | — | Free commercial |
| 14 | DeepSeek-AI | DeepSeek V3.2-Exp | Thinking enabled, tools | 74.50 | 66.70 | — | — | — | Free commercial |
| 15 | OpenAI | OpenAI o4-mini | Thinking level: high | 72.00 | — | — | — | — | Closed source |
| 16 | Anthropic | Claude Opus 4 | Thinking enabled | 72.00 | — | — | — | — | Closed source |
| 17 | DeepSeek-AI | DeepSeek-R1-0528 | Thinking enabled | 71.40 | — | — | — | — | Free commercial |
| 18 | Anthropic | Claude Opus 4 | | 70.10 | — | — | — | — | Closed source |
| 19 | DeepSeek-AI | DeepSeek V3.2 | Thinking enabled, tools | 69.90 | 80.30 | 46.40 | — | — | Free commercial |
| 20 | DeepSeek-AI | DeepSeek-V3.1 | | 68.40 | — | — | — | — | Free commercial |
| 21 | Alibaba | Qwen3-Coder-Next | Standard mode, tools | 66.20 | — | 36.20 | — | — | Free commercial |
| 22 | Anthropic | Claude Sonnet 3.7 | Thinking enabled | 64.90 | — | — | — | — | Closed source |
| 23 | Anthropic | Claude Sonnet 4 | Thinking enabled | 61.30 | — | — | — | — | Closed source |
| 24 | MiniMaxAI | M2.1 | Thinking enabled, tools | 61.00 | — | 47.90 | — | — | Free commercial |
| 25 | Anthropic | Claude Sonnet 3.7 | | 60.40 | — | — | — | — | Closed source |
| 26 | Moonshot AI | Kimi K2 | | 59.10 | — | — | — | — | Free commercial |
| 27 | Google DeepMind | Gemini 2.5 Flash | Thinking enabled | 56.70 | — | — | — | — | Closed source |
| 28 | DeepSeek-AI | DeepSeek-V3-0324 | | 55.10 | — | — | — | — | Free commercial |
| 29 | Zhipu AI | GLM-4.7 | Thinking enabled, tools | 52.10 | 87.40 | 41.00 | — | — | Free commercial |
| 30 | Anthropic | Claude 3.5 Sonnet New | | 51.60 | — | — | — | — | Closed source |
View all 93 models on the Tool Decathlon benchmark page