DataLearner AI

A knowledge platform focused on LLM benchmarking, datasets, and practical instruction with continuously updated capability maps.


LLM Agent Capability Leaderboard

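This page provides a leaderboard of LLM agent capability evaluations. It covers mainstream agent benchmarks such as Aider-Polyglot, τ²-Bench, Terminal Bench 2.0, Tool Decathlon, and OSWorld-Verified, and compares the tool use, task planning, and autonomous execution capabilities of models such as GPT, Claude, Qwen, and DeepSeek.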

Updated on 2026-04-28 13:02:03

As of 2026-04, this LLM agent capability leaderboard covers Aider-Polyglot, τ²-Bench, Terminal Bench 2.0, Tool Decathlon, and related benchmarks, making it straightforward to compare models within the same task family.

Click any model name to check context length, licensing, and pricing on its detail page. See Data Methodology for scoring details.

Benchmarks
Agent capability evaluation: Aider-Polyglot, τ²-Bench
AI Agent, tool use: Terminal Bench 2.0, Tool Decathlon, OSWorld-Verified

LLM Performance Results

Data source: DataLearnerAI
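| Rank | Organization | Model | Mode / tags | Aider-Polyglot | τ²-Bench | Terminal Bench 2.0 | Tool Decathlon | OSWorld-Verified | License |
|---|---|---|---|---|---|---|---|---|---|
| 1 | OpenAI | GPT-5.5 | Thinking Level · High; Tools | — | — | 82.70 | — | 78.70 | Closed source |
| 2 | Anthropic | Claude Mythos Preview | Extended Thinking; Tools | — | — | 82.00 | — | 79.60 | Closed source |
| 3 | OpenAI | GPT-5.3 Codex | Thinking Enabled; Tools | — | — | 77.30 | — | — | Closed source |
| 4 | OpenAI | GPT-5.4 | Thinking Level · Extra High; Tools | — | — | 75.10 | — | 75.00 | Closed source |
| 5 | Anthropic | Opus 4.7 | Extended Thinking; Tools | — | — | 69.40 | — | 78.00 | Closed source |
| 6 | Google DeepMind | Gemini 3.1 Pro Preview | Thinking Level · High; Tools | — | 90.80 | 68.50 | — | — | Closed source |
| 7 | DeepSeek-AI | DeepSeek-V4-Pro | Thinking Level · Extra High; Tools | — | — | 67.90 | — | — | Free commercial |
| 8 | Moonshot AI | Kimi K2.6 | Thinking Enabled; Tools | — | — | 66.70 | 50.00 | 73.10 | Free commercial |
| 9 | Alibaba | Qwen3.6-Max-Preview | Deep Thinking Mode; Tools | — | — | 65.40 | — | — | Closed source |
| 10 | Anthropic | Claude Opus 4.6 | Extended Thinking; Tools | — | 91.89 | 65.40 | — | 72.70 | Closed source |
| 11 | Zhipu AI | GLM 5.1 | Thinking Enabled; Tools | — | — | 63.50 | 40.70 | — | Free commercial |
| 12 | DeepSeek-AI | DeepSeek-V4-Pro | Thinking Level · High; Tools | — | — | 63.30 | — | — | Free commercial |
| 13 | Cursor | Composer 2 | Thinking Enabled | — | — | 61.70 | — | — | Closed source |
| 14 | Alibaba | Qwen 3.6 Plus Preview | Thinking Enabled; Tools | — | — | 61.60 | 39.80 | — | Closed source |
| 15 | Zhipu AI | GLM-5 | Thinking Enabled; Tools | — | 89.70 | 61.10 | — | — | Free commercial |
| 16 | OpenAI | GPT-5.4 mini | Thinking Level · Extra High; Tools | — | — | 60.00 | 42.90 | 72.10 | Closed source |
| 17 | Alibaba | Qwen3.6-27B | Thinking Enabled; Tools | — | — | 59.30 | — | — | Free commercial |
| 18 | Anthropic | Opus 4.5 | Extended Thinking; Tools | — | 81.99 | 59.30 | — | — | Closed source |
| 19 | Anthropic | Claude Sonnet 4.6 | Thinking Enabled; Tools | — | — | 59.10 | — | 72.50 | Closed source |
| 20 | DeepSeek-AI | DeepSeek-V4-Pro | Standard Mode; Tools | — | — | 59.10 | — | — | Free commercial |
| 21 | Facebook AI Research Lab | Muse Spark | Thinking Enabled; Tools | — | — | 59.00 | — | — | Closed source |
| 22 | Google DeepMind | Gemini 3.0 Pro (Preview 11-2025) | Thinking Level · High; Tools | — | — | 56.90 | — | — | Closed source |
| 23 | DeepSeek-AI | DeepSeek-V4-Flash | Thinking Level · Extra High; Tools | — | — | 56.90 | — | — | Free commercial |
| 24 | DeepSeek-AI | DeepSeek-V4-Flash | Thinking Level · High; Tools | — | — | 56.60 | — | — | Free commercial |
| 25 | Google DeepMind | Gemini 3.0 Pro (Preview 11-2025) | Thinking Enabled; Tools | — | 85.40 | 54.20 | — | — | Closed source |
| 26 | Alibaba | Qwen3.5-397B-A17B | Thinking Enabled; Tools | — | 86.70 | 52.50 | 38.30 | 62.20 | Free commercial |
| 27 | MiniMaxAI | MiniMax M2.5 | Thinking Enabled; Tools | — | — | 51.70 | — | — | Free commercial |
| 28 | Alibaba | Qwen3.6-35B-A3B | Thinking Enabled | — | — | 51.50 | 26.90 | — | Free commercial |
| 29 | StepFunAI | Step 3.5 Flash | Thinking Enabled; Tools | — | 88.20 | 51.00 | — | — | Free commercial |
| 30 | Moonshot AI | Kimi K2.5 | Thinking Enabled; Tools | — | — | 50.80 | — | — | Free commercial |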
View all 93 models on the Terminal Bench 2.0 benchmark page
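
The leaderboard rows above appear to be ordered by Terminal Bench 2.0 score, with "—" marking benchmarks for which no score is listed. The following is a minimal Python sketch of re-ranking a handful of those rows by a different benchmark. It assumes that models without a score on the chosen benchmark are left out rather than counted as zero; the names `ROWS` and `rank_by` and the dictionary keys are hypothetical and are not part of any DataLearnerAI tooling.

```python
from typing import Dict, List, Optional, Tuple

# Five rows copied from the leaderboard; None stands for "—" (no score listed).
# The key names are illustrative labels chosen for this sketch.
ROWS: List[Tuple[str, Dict[str, Optional[float]]]] = [
    ("GPT-5.5", {"terminal_bench_2": 82.70, "osworld_verified": 78.70}),
    ("Claude Mythos Preview", {"terminal_bench_2": 82.00, "osworld_verified": 79.60}),
    ("GPT-5.3 Codex", {"terminal_bench_2": 77.30, "osworld_verified": None}),
    ("GPT-5.4", {"terminal_bench_2": 75.10, "osworld_verified": 75.00}),
    ("Opus 4.7", {"terminal_bench_2": 69.40, "osworld_verified": 78.00}),
]


def rank_by(
    rows: List[Tuple[str, Dict[str, Optional[float]]]], benchmark: str
) -> List[Tuple[str, float]]:
    """Return (model, score) pairs sorted from highest to lowest score.

    Assumption: models with no listed score on `benchmark` are skipped,
    not ranked last and not treated as scoring zero.
    """
    scored = [
        (model, scores[benchmark])
        for model, scores in rows
        if scores.get(benchmark) is not None
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for position, (model, score) in enumerate(
        rank_by(ROWS, "osworld_verified"), start=1
    ):
        print(f"{position}. {model}: {score:.2f}")
```

Skipping unscored models keeps a missing entry from being read as a poor result, which matters here because most rows list scores for only one or two of the five benchmarks.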