
LLM Agent Capability Leaderboard

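This page provides a leaderboard of LLM agent capability evaluations. It covers mainstream agent benchmarks such as Aider-Polyglot, τ²-Bench, Terminal Bench 2.0, Tool Decathlon, and OSWorld-Verified, and compares the tool use, task planning, and autonomous execution capabilities of models such as GPT, Claude, Qwen, and DeepSeek.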

Updated on 2026-04-28 13:02:03

As of 2026-04, this leaderboard covers Aider-Polyglot, τ²-Bench, Terminal Bench 2.0, Tool Decathlon, and related benchmarks, making it straightforward to compare models within the same task family.

Click any model name to check context length, licensing, and pricing on its detail page. See Data Methodology for scoring details.

Benchmarks

  • Agent capability evaluation: Aider-Polyglot, τ²-Bench
  • AI Agent tool use: Terminal Bench 2.0, Tool Decathlon, OSWorld-Verified

LLM Performance Results

Data source: DataLearnerAI
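| Rank | Developer | Model | Aider-Polyglot | τ²-Bench | Terminal Bench 2.0 | Tool Decathlon | OSWorld-Verified | License |
|---|---|---|---|---|---|---|---|---|
| 1 | DeepSeek-AI | DeepSeek-V4-Pro | — | — | 67.90 | — | — | Free for commercial use |
| 2 | Moonshot AI | Kimi K2.6 | — | — | 66.70 | 50.00 | 73.10 | Free for commercial use |
| 3 | Alibaba | Qwen3.6-Max-Preview | — | — | 65.40 | — | — | Closed source |
| 4 | Zhipu AI | GLM-5 | — | 89.70 | 61.10 | — | — | Free for commercial use |
| 5 | DeepSeek-AI | DeepSeek-V4-Flash | — | — | 56.90 | — | — | Free for commercial use |
| 6 | MiniMaxAI | MiniMax M2.5 | — | — | 51.70 | — | — | Free for commercial use |
| 7 | StepFunAI | Step 3.5 Flash | — | 88.20 | 51.00 | — | — | Free for commercial use |
| 8 | Moonshot AI | Kimi K2.5 | — | — | 50.80 | — | — | Free for commercial use |
| 9 | MiniMaxAI | M2.1 | 61.00 | — | 47.90 | — | — | Free for commercial use |
| 10 | DeepSeek-AI | DeepSeek V3.2 | 69.90 | 80.30 | 46.40 | — | — | Free for commercial use |
| 11 | Zhipu AI | GLM-4.7 | 52.10 | 87.40 | 41.00 | — | — | Free for commercial use |
| 12 | OpenAI | o3-pro | 84.90 | — | — | — | — | Closed source |
| 13 | DeepSeek-AI | DeepSeek-V3.1 | 76.30 | — | — | — | — | Free for commercial use |
| 14 | DeepSeek-AI | DeepSeek-V3.1 Terminus | 76.10 | 37.00 | — | — | — | Free for commercial use |
| 15 | DeepSeek-AI | DeepSeek V3.2-Exp | 74.50 | 66.70 | — | — | — | Free for commercial use |
| 16 | OpenAI | OpenAI o4-mini | 72.00 | 56.90 | — | — | — | Closed source |
| 17 | Anthropic | Claude Opus 4 | 72.00 | 72.50 | — | — | — | Closed source |
| 18 | DeepSeek-AI | DeepSeek-R1-0528 | 71.40 | — | — | — | — | Free for commercial use |
| 19 | Anthropic | Claude Sonnet 3.7 | 64.90 | 61.80 | — | — | 28.00 | Closed source |
| 20 | Moonshot AI | Kimi K2 | 59.10 | 64.30 | — | — | — | Free for commercial use |
| 21 | Google DeepMind | Gemini 2.5 Flash | 56.70 | — | — | — | — | Closed source |
| 22 | DeepSeek-AI | DeepSeek-V3-0324 | 55.10 | 38.80 | — | — | — | Free for commercial use |
| 23 | Alibaba | Qwen3-235B-A22B | — | 34.40 | — | — | — | Free for commercial use |
| 24 | OpenAI | GPT-4.1 mini | — | 53.00 | — | — | — | Closed source |
| 25 | OpenAI | GPT-4.1 | — | 54.70 | — | — | — | Closed source |
| 26 | Zhipu AI | GLM-4.6 | — | 75.90 | — | — | — | Free for commercial use |
| 27 | MiniMaxAI | MiniMax M2 | — | 77.20 | — | — | — | Free for commercial use |
| 28 | Alibaba | Qwen3-Max-Thinking | — | 82.10 | — | — | — | Closed source |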
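
Because many models report scores on only some of the benchmarks, a fair comparison within one task family should skip models that have no score for that benchmark. The following is a minimal sketch of that idea in Python, using a handful of rows copied from the results above. The `SCORES` dictionary and the `rank_by` helper are illustrative names and are not part of any DataLearnerAI API; missing scores are assumed to be represented as `None`.

```python
from typing import Optional

# A few rows copied from the leaderboard results; None marks a benchmark
# for which no score is reported. The structure itself is hypothetical.
SCORES: dict[str, dict[str, Optional[float]]] = {
    "DeepSeek-V4-Pro": {"Aider-Polyglot": None, "τ²-Bench": None, "Terminal Bench 2.0": 67.90},
    "Kimi K2.6": {"Aider-Polyglot": None, "τ²-Bench": None, "Terminal Bench 2.0": 66.70},
    "GLM-5": {"Aider-Polyglot": None, "τ²-Bench": 89.70, "Terminal Bench 2.0": 61.10},
    "DeepSeek V3.2": {"Aider-Polyglot": 69.90, "τ²-Bench": 80.30, "Terminal Bench 2.0": 46.40},
    "o3-pro": {"Aider-Polyglot": 84.90, "τ²-Bench": None, "Terminal Bench 2.0": None},
}


def rank_by(benchmark: str) -> list[tuple[str, float]]:
    """Return (model, score) pairs for one benchmark, highest score first.

    Models without a reported score for the benchmark are left out
    rather than being treated as zero.
    """
    scored = [
        (model, results[benchmark])
        for model, results in SCORES.items()
        if results.get(benchmark) is not None
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for model, score in rank_by("Terminal Bench 2.0"):
        print(f"{model}: {score:.2f}")
```

For this subset, `rank_by("τ²-Bench")` returns only GLM-5 and DeepSeek V3.2, since the other three models have no τ²-Bench score. Treating a missing score as zero would instead push those models to the bottom of the ranking, which the published data does not support.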