© 2026 DataLearner AI. DataLearner curates industry data and case studies so researchers, enterprises, and developers can rely on trustworthy intelligence.


LLM Coding Benchmark Leaderboard

This page presents the LLM coding benchmark leaderboard, covering the SWE-bench Verified, SWE-bench Pro, LiveCodeBench, and SWE-bench Multilingual datasets and comparing GPT, Claude, Qwen, and DeepSeek models.

Updated on 2026-05-02 07:10:24

As of May 2026, this page covers SWE-bench Verified, LiveCodeBench, SWE-Bench Pro (Public), SWE-bench Multilingual, and related benchmarks, making it straightforward to compare models within the same task family.

Click any model name to check context length, licensing, and pricing on its detail page. See Data Methodology for scoring details.
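The benchmark columns below report scores such as 76.80, which for SWE-bench-style benchmarks are conventionally the percentage of task instances a model resolves. A minimal sketch of that computation, assuming simple per-instance pass/fail outcomes (the function name and the 500-instance example, matching SWE-bench Verified's task count, are illustrative, not DataLearner's actual scoring pipeline):

```python
def resolved_percentage(outcomes: list[bool]) -> float:
    """Share of benchmark instances resolved, as a percentage (e.g. 76.80).

    `outcomes` holds one pass/fail flag per benchmark instance.
    """
    if not outcomes:
        raise ValueError("no benchmark instances")
    return round(100.0 * sum(outcomes) / len(outcomes), 2)


# Example: 384 resolved out of 500 instances -> 76.8
score = resolved_percentage([True] * 384 + [False] * 116)
```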

Reference: Composite Coding Rankings

There is no single, universally accepted coding leaderboard. Static benchmarks like SWE-bench and HumanEval measure specific skills but can be gamed through targeted fine-tuning. We selected two complementary human-preference leaderboards: LMArena Coding Arena ranks models on general programming tasks (debugging, algorithms, code generation) via anonymous crowd-sourced voting; DesignArena Code Category focuses specifically on visual, front-end code generation (websites, UI components, games) using the same blind-voting methodology. Reading both together gives a fuller picture of coding capability.

LMArena Coding Arena


Elo ratings from anonymous A/B voting on real general coding tasks (debugging, algorithms, code generation) submitted by developers.
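For readers unfamiliar with how arena-style leaderboards turn pairwise votes into ratings, here is a minimal sketch of a standard Elo update. This is the generic textbook formula with an illustrative K-factor, not LMArena's exact implementation (which uses a related Bradley–Terry-style fit):

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Return both models' updated ratings after one A-vs-B vote."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))


# Two equally rated models: the winner gains what the loser sheds.
ra, rb = update(1500.0, 1500.0, a_won=True)  # -> (1516.0, 1484.0)
```

An upset (a low-rated model beating a high-rated one) moves ratings more than an expected win, which is why thousands of votes are needed before the rankings stabilize.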

Updated 2026-05-07

| # | Model | Organization | Elo |
|---|-------|--------------|-----|
| 1 | Opus 4.7 (thinking) | Anthropic | 1569 |
| 2 | Claude Opus 4.6 (thinking) | Anthropic | 1553 |
| 3 | Opus 4.7 | Anthropic | 1550 |
| 4 | Claude Opus 4.6 | Anthropic | 1550 |
| 5 | Claude Opus 4 (thinking-32k) | Anthropic | 1531 |
| 6 | Muse Spark | Facebook AI Research Lab | 1530 |
| 7 | Gemini 3.1 Pro Preview | Google DeepMind | 1529 |
| 8 | gpt-5.4-high | OpenAI | 1528 |
| 9 | GLM 5.1 | Zhipu AI | 1525 |
| 10 | gpt-5.5-high | OpenAI | 1524 |

Source: LMArena

LLM Performance Results

Data source: DataLearnerAI
| Rank | Model | Organization | SWE-bench Verified | LiveCodeBench | SWE-Bench Pro (Public) | SWE-bench Multilingual | License |
|------|-------|--------------|--------------------|---------------|------------------------|------------------------|---------|
| 1 | GPT-5.1-Codex-Max | OpenAI | 76.80 | — | — | — | Proprietary |
| 2 | GPT-5 Codex | OpenAI | 74.50 | — | — | — | Proprietary |
| 3 | Grok 4 Code | xAI | 72.00 | — | — | — | Proprietary |
| 4 | Grok Code Fast 1 | xAI | 70.80 | — | — | — | Proprietary |
| 5 | Qwen3-Coder-Next | Alibaba | 70.60 | — | 44.30 | — | Free commercial |
| 6 | GPT-5.1 Codex | OpenAI | 70.40 | 85.50 | — | — | Proprietary |
| 7 | Qwen3-Coder-480B-A35B | Alibaba | 67.00 | — | — | — | Free commercial |
| 8 | Devstral Medium | MistralAI | 61.60 | — | — | — | Proprietary |
| 9 | Devstral Small 1.1 | MistralAI | 53.60 | — | — | — | Free commercial |
| 10 | Qwen3-Coder-Flash | Alibaba | 51.60 | — | — | — | Free commercial |
| 11 | Devstral Small 1.0 | MistralAI | 46.80 | — | — | — | Free commercial |
| 12 | Codestral 25.01 | MistralAI | — | 37.90 | — | — | Proprietary |
| 13 | Codestral | MistralAI | — | 31.50 | — | — | Non-commercial |
| 14 | GPT-5.3 Codex | OpenAI | — | — | 56.80 | — | Proprietary |
| 15 | Composer 2 | Cursor | — | — | — | 73.70 | Proprietary |

DesignArena Code Category


Elo ratings from anonymous voting on visual front-end code tasks (websites, UI components, games, data visualization), run by Arcada Labs.

Updated 2026-05-10

| # | Model | Organization | Elo |
|---|-------|--------------|-----|
| 1 | Claude Opus 4.7 (Thinking) | Anthropic | 1350 |
| 2 | Claude Opus 4.6 | Anthropic | 1346 |
| 3 | Claude Opus 4.6 (Thinking) | Anthropic | 1344 |
| 4 | Kimi K2.6 | Moonshot AI | 1343 |
| 5 | GLM 5.1 | Zhipu AI | 1341 |
| 6 | Opus 4.7 | Anthropic | 1338 |
| 7 | GLM 5 Turbo | Zhipu AI | 1336 |
| 8 | Claude Sonnet 4.6 | Anthropic | 1331 |
| 9 | GPT-5.5 | OpenAI | 1314 |
| 10 | DeepSeek-V4-Pro | DeepSeek-AI | 1313 |
Source: DesignArena