LLM Agent Benchmark Leaderboard

Name: LLM Agent Benchmark Leaderboard
Creator: DataLearner
License: https://creativecommons.org/licenses/by/4.0/

This page provides the LLM Agent benchmark leaderboard, covering Aider-Polyglot, τ²-Bench, Terminal Bench 2.0, Tool Decathlon, and OSWorld-Verified. Compare GPT, Claude, Qwen, and DeepSeek on tool use, task planning, and autonomous execution.

Updated on 2026-07-28 08:43:41

As of 2026-07, this page covers Aider-Polyglot, τ²-Bench, Terminal Bench 2.0, Tool Decathlon, and related agent benchmarks, making it straightforward to compare models within the same task family.

Click any model name to check context length, licensing, and pricing on its detail page. See Data Methodology for scoring details.

Benchmark

Agent capability evaluation: Aider-Polyglot, τ²-Bench

AI Agent - Tool use: Terminal Bench 2.0, Tool Decathlon, OSWorld-Verified

Top picks

Ranked by OSWorld-Verified

Current SOTA: Claude Fable 5 (Anthropic), 85.00 on OSWorld-Verified

Best Open-Source: Kimi K3 (Moonshot AI), 84.80 on OSWorld-Verified (−0.20 relative to the top score)

Best China-Made: Kimi K3 (Moonshot AI), 84.80 on OSWorld-Verified (−0.20 relative to the top score)

LLM Performance Results

Data source: DataLearnerAI

Rank	Model	Developer	Mode	Aider-Polyglot	τ²-Bench	Terminal Bench 2.0	Tool Decathlon	OSWorld-Verified	License
1	Claude Fable 5	Anthropic	Thinking Enabled, Tools	—	—	—	—	85.00	Proprietary
2	Kimi K3	Moonshot AI	Thinking Level · High, Tools	—	—	—	—	84.80	Free commercial
3	Claude Opus 4.8	Anthropic	Extended Thinking, Tools	—	—	—	—	83.40	Proprietary
4	Gemini 3.6 Flash	Google DeepMind	Thinking Enabled, Tools	—	—	—	—	83.00	Proprietary
5	Claude Sonnet 5	Anthropic	Thinking Level · Extra High, Tools	—	—	—	—	81.20	Proprietary
6	Muse Spark 1.1	Facebook AI Research Lab	Thinking Enabled, Tools	—	—	—	75.60	80.80	Proprietary
7	Claude Mythos Preview	Anthropic	Extended Thinking, Tools	—	—	82.00	—	79.60	Proprietary
8	GPT-5.5	OpenAI	Thinking Enabled, Tools	—	—	82.70	—	78.70	Proprietary
9	Gemini 3.5 Flash	Google DeepMind	Thinking Enabled, Tools	—	—	—	—	78.40	Proprietary
10	Opus 4.7	Anthropic	Extended Thinking, Tools	—	—	69.40	—	78.00	Proprietary
11	Gemini 3.1 Pro Preview	Google DeepMind	Thinking Enabled, Tools	—	—	—	—	76.20	Proprietary
12	GPT-5.4	OpenAI	Thinking Level · Extra High, Tools	—	—	75.10	—	75.00	Proprietary
13	Gemini 3.5 Flash-Lite	Google DeepMind	Thinking Enabled, Tools	—	—	—	—	74.00	Proprietary
14	Kimi K2.6	Moonshot AI	Thinking Enabled, Tools	—	—	66.70	50.00	73.10	Free commercial
15	Claude Opus 4.6	Anthropic	Extended Thinking, Tools	—	91.89	65.40	—	72.70	Proprietary
16	Claude Sonnet 4.6	Anthropic	Thinking Enabled, Tools	—	—	59.10	—	72.50	Proprietary
17	GPT-5.4 mini	OpenAI	Thinking Level · Extra High, Tools	—	—	60.00	42.90	72.10	Proprietary
18	MiniMax M3	MiniMaxAI	Thinking Enabled, Tools	—	—	—	—	70.00	Non-commercial
19	Qwen3.5-397B-A17B	Alibaba	Thinking Enabled, Tools	—	86.70	52.50	38.30	62.20	Free commercial
20	Claude Sonnet 4.5	Anthropic	Thinking Enabled, Tools	—	84.70	42.80	—	61.40	Proprietary
21	Qwen3.5-27B	Alibaba	Thinking Enabled, Tools	—	79.00	41.60	—	56.20	Free commercial
22	Claude Sonnet 4	Anthropic	Thinking Enabled, Tools	—	—	—	—	42.20	Proprietary
23	GPT-5.4 nano	OpenAI	Thinking Level · Extra High, Tools	—	—	46.30	35.50	39.00	Proprietary
24	Claude Sonnet 3.7	Anthropic	Thinking Enabled, Tools	—	61.80	—	—	28.00	Proprietary
25	GPT-5	OpenAI	Thinking Enabled	88.00	—	—	—	—	Proprietary
26	GPT-5	OpenAI	Thinking Enabled	86.70	—	—	—	—	Proprietary
27	o3-pro	OpenAI	Thinking Enabled	84.90	—	—	—	—	Proprietary
28	Gemini 2.5-Pro	Google DeepMind	Thinking Enabled	83.10	—	—	—	—	Proprietary
29	OpenAI o3	OpenAI	Thinking Enabled	81.30	—	—	—	—	Proprietary
30	GPT-5	OpenAI	Thinking Enabled	81.30	—	—	—	—	Proprietary
31	GPT-4.1 nano	OpenAI	—	8.90	—	—	—	—	Proprietary
32	Grok 4	xAI	Thinking Enabled	79.60	—	—	—	—	Proprietary
33	Gemini 2.5-Pro	Google DeepMind	Thinking Enabled	79.10	—	—	—	—	Proprietary
34	OpenAI o3	OpenAI	—	76.90	—	—	—	—	Proprietary
35	Gemini-2.5-Pro-Preview-05-06	Google DeepMind	—	76.90	—	—	—	—	Proprietary
36	DeepSeek V3.2-Exp	DeepSeek-AI	Thinking Enabled	74.20	—	—	—	—	Free commercial
37	Gemini 2.5 Pro Experimental 03-25	Google DeepMind	—	72.90	—	—	—	—	Proprietary
38	OpenAI o4-mini	OpenAI	Thinking Enabled	72.00	—	—	—	—	Proprietary
39	Claude Opus 4	Anthropic	Thinking Enabled	72.00	—	—	—	—	Proprietary
40	DeepSeek-R1-0528	DeepSeek-AI	Thinking Enabled	71.40	—	—	—	—	Free commercial
41	Claude Opus 4	Anthropic	—	70.70	—	—	—	—	Proprietary
42	DeepSeek V3.2-Exp	DeepSeek-AI	—	70.20	—	—	—	—	Free commercial
43	Claude Sonnet 3.7	Anthropic	Thinking Enabled	64.90	—	—	—	—	Proprietary
44	OpenAI o1	OpenAI	Thinking Enabled	61.70	—	—	—	—	Proprietary
45	Claude Sonnet 4	Anthropic	Thinking Enabled	61.30	—	—	—	—	Proprietary
46	OpenAI o3-mini	OpenAI	Thinking Enabled	60.40	—	—	—	—	Proprietary
47	Claude Sonnet 3.7	Anthropic	—	60.40	—	—	—	—	Proprietary
48	Qwen3-235B-A22B	Alibaba	—	59.60	—	—	—	—	Free commercial
49	Kimi K2	Moonshot AI	—	59.10	—	—	—	—	Free commercial
50	DeepSeek-R1	DeepSeek-AI	Thinking Enabled	56.90	—	—	—	—	Free commercial
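
Because most models report only some of the five benchmarks, re-ranking the table by a different column means deciding what to do with the "—" cells. The following is a minimal Python sketch, not the site's actual ranking logic: it treats "—" as a missing score and places those rows last. The `ROWS` sample, its three-column layout, and the helper names `parse_score` and `rank_by` are assumptions made for illustration; the scores themselves are copied from the table above.

```python
from typing import Optional

# Hypothetical sample: (model, Terminal Bench 2.0, OSWorld-Verified).
# Values are copied from the table; "—" marks a score the page does not report.
ROWS = [
    ("Claude Fable 5", "—", "85.00"),
    ("Kimi K3", "—", "84.80"),
    ("GPT-5.5", "82.70", "78.70"),
    ("Claude Mythos Preview", "82.00", "79.60"),
    ("GPT-5", "—", "—"),
]


def parse_score(cell: str) -> Optional[float]:
    """Return the score as a float, or None for the "—" placeholder."""
    cell = cell.strip()
    return None if cell in ("—", "") else float(cell)


def rank_by(rows, column: int):
    """Sort rows by one score column, highest first; rows without a score go last."""
    def key(row):
        score = parse_score(row[column])
        # Missing scores sort after present ones; present ones sort descending.
        return (score is None, -(score or 0.0))
    return sorted(rows, key=key)


by_osworld = rank_by(ROWS, column=2)
by_terminal = rank_by(ROWS, column=1)

# Gap between the runner-up and the top OSWorld-Verified score.
top = parse_score(by_osworld[0][2])
gap = round(parse_score(by_osworld[1][2]) - top, 2)

for model, _, osworld in by_osworld:
    print(f"{model}\t{osworld}")
print(f"Runner-up gap: {gap:+.2f}")
```

Sorting on a numeric key rather than on the raw text also avoids string-ordering surprises, such as a single-digit score like 8.90 landing among scores in the 80s.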

Showing 50 of 149 models.