LLM Agent Benchmark Leaderboard
This page provides the LLM Agent benchmark leaderboard, covering Aider-Polyglot, τ²-Bench, Terminal Bench 2.0, Tool Decathlon, and OSWorld-Verified. Compare GPT, Claude, Qwen, and DeepSeek on tool use, task planning, and autonomous execution.
As of 2026-06, this page covers Aider-Polyglot, τ²-Bench, Terminal Bench 2.0, Tool Decathlon, and related agent benchmarks, so models can be compared directly within the same task family.
Click any model name to check context length, licensing, and pricing on its detail page. See Data Methodology for scoring details.
Top picks
Ranked by τ²-Bench
Gemma 4 31B
DeepMind
No qualifying model on this benchmark.
LLM Performance Results
Data source: DataLearnerAI
Click any row to open the model page. Tick the checkboxes to compare up to 4 models side by side.