© 2026 DataLearner AI. DataLearner curates industry data and case studies so researchers, enterprises, and developers can rely on trustworthy intelligence.


LLM Coding Benchmark Leaderboard

This page presents the LLM coding benchmark leaderboard, covering the SWE-bench Verified, SWE-bench Pro, LiveCodeBench, and SWE-bench Multilingual datasets and comparing GPT, Claude, Qwen, and DeepSeek models.

Updated on 2026-05-02 07:10:24

As of May 2026, this page covers SWE-bench Verified, LiveCodeBench, SWE-Bench Pro (Public), SWE-bench Multilingual, and related benchmarks, making it straightforward to compare models within the same task family.

Click any model name to check context length, licensing, and pricing on its detail page. See Data Methodology for scoring details.
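The benchmark columns below report scores such as 76.80, which for SWE-bench-style benchmarks are conventionally the percentage of task instances a model resolves. A minimal sketch of that computation, assuming simple per-instance pass/fail outcomes (the function name and the 500-instance example, matching SWE-bench Verified's task count, are illustrative, not DataLearner's actual scoring pipeline):

```python
def resolved_percentage(outcomes: list[bool]) -> float:
    """Share of benchmark instances resolved, as a percentage (e.g. 76.80).

    `outcomes` holds one pass/fail flag per benchmark instance.
    """
    if not outcomes:
        raise ValueError("no benchmark instances")
    return round(100.0 * sum(outcomes) / len(outcomes), 2)


# Example: 384 resolved out of 500 instances -> 76.8
score = resolved_percentage([True] * 384 + [False] * 116)
```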

Reference: Composite Coding Rankings

There is no single, universally accepted coding leaderboard. Static benchmarks like SWE-bench and HumanEval measure specific skills but can be gamed through targeted fine-tuning. We selected two complementary human-preference leaderboards: LMArena Coding Arena ranks models on general programming tasks (debugging, algorithms, code generation) via anonymous crowd-sourced voting; DesignArena Code Category focuses specifically on visual, front-end code generation (websites, UI components, games) using the same blind-voting methodology. Reading both together gives a fuller picture of coding capability.

LMArena Coding Arena


Elo ratings from anonymous A/B voting on real general coding tasks (debugging, algorithms, code generation) submitted by developers.
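For readers unfamiliar with how arena-style leaderboards turn pairwise votes into ratings, here is a minimal sketch of a standard Elo update. This is the generic textbook formula with an illustrative K-factor, not LMArena's exact implementation (which uses a related Bradley–Terry-style fit):

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Return both models' updated ratings after one A-vs-B vote."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))


# Two equally rated models: the winner gains what the loser sheds.
ra, rb = update(1500.0, 1500.0, a_won=True)  # -> (1516.0, 1484.0)
```

An upset (a low-rated model beating a high-rated one) moves ratings more than an expected win, which is why thousands of votes are needed before the rankings stabilize.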

Updated 2026-05-07

| # | Model | Organization | Elo |
|---|-------|--------------|-----|
| 1 | Opus 4.7 (thinking) | Anthropic | 1569 |
| 2 | Claude Opus 4.6 (thinking) | Anthropic | 1553 |
| 3 | Opus 4.7 | Anthropic | 1550 |
| 4 | Claude Opus 4.6 | Anthropic | 1550 |
| 5 | Claude Opus 4 (thinking-32k) | Anthropic | 1531 |
| 6 | Muse Spark | Facebook AI Research Lab | 1530 |
| 7 | Gemini 3.1 Pro Preview | Google DeepMind | 1529 |
| 8 | gpt-5.4-high | OpenAI | 1528 |
| 9 | GLM 5.1 | Zhipu AI | 1525 |
| 10 | gpt-5.5-high | OpenAI | 1524 |

Source: LMArena

LLM Performance Results

Data source: DataLearnerAI
| Rank | Model | Organization | SWE-bench Verified | LiveCodeBench | SWE-Bench Pro (Public) | SWE-bench Multilingual | License |
|------|-------|--------------|--------------------|---------------|------------------------|------------------------|---------|
| 1 | GPT-5.1-Codex-Max | OpenAI | 76.80 | — | — | — | Proprietary |
| 2 | GPT-5 Codex | OpenAI | 74.50 | — | — | — | Proprietary |
| 3 | Grok 4 Code | xAI | 72.00 | — | — | — | Proprietary |
| 4 | Grok Code Fast 1 | xAI | 70.80 | — | — | — | Proprietary |
| 5 | Qwen3-Coder-Next | Alibaba | 70.60 | — | 44.30 | — | Free commercial |
| 6 | GPT-5.1 Codex | OpenAI | 70.40 | 85.50 | — | — | Proprietary |
| 7 | Qwen3-Coder-480B-A35B | Alibaba | 67.00 | — | — | — | Free commercial |
| 8 | Devstral Medium | MistralAI | 61.60 | — | — | — | Proprietary |
| 9 | Devstral Small 1.1 | MistralAI | 53.60 | — | — | — | Free commercial |
| 10 | Qwen3-Coder-Flash | Alibaba | 51.60 | — | — | — | Free commercial |
| 11 | Devstral Small 1.0 | MistralAI | 46.80 | — | — | — | Free commercial |
| 12 | Codestral 25.01 | MistralAI | — | 37.90 | — | — | Proprietary |
| 13 | Codestral | MistralAI | — | 31.50 | — | — | Non-commercial |
| 14 | GPT-5.3 Codex | OpenAI | — | — | 56.80 | — | Proprietary |
| 15 | Composer 2 | Cursor | — | — | — | 73.70 | Proprietary |

DesignArena Code Category


Elo ratings from anonymous voting on visual front-end code tasks (websites, UI components, games, data visualization), run by Arcada Labs.

Updated 2026-05-10

| # | Model | Organization | Elo |
|---|-------|--------------|-----|
| 1 | Claude Opus 4.7 (Thinking) | Anthropic | 1350 |
| 2 | Claude Opus 4.6 | Anthropic | 1346 |
| 3 | Claude Opus 4.6 (Thinking) | Anthropic | 1344 |
| 4 | Kimi K2.6 | Moonshot AI | 1343 |
| 5 | GLM 5.1 | Zhipu AI | 1341 |
| 6 | Opus 4.7 | Anthropic | 1338 |
| 7 | GLM 5 Turbo | Zhipu AI | 1336 |
| 8 | Claude Sonnet 4.6 | Anthropic | 1331 |
| 9 | GPT-5.5 | OpenAI | 1314 |
| 10 | DeepSeek-V4-Pro | DeepSeek-AI | 1313 |
Source: DesignArena