LLM Coding Benchmark Leaderboard

This page is an LLM coding benchmark leaderboard covering the SWE-bench Verified, SWE-Bench Pro, LiveCodeBench, and SWE-bench Multilingual datasets. It compares GPT, Claude, Qwen, and DeepSeek models.

Updated on 2026-05-21 22:14:17

As of 2026-05, this leaderboard covers SWE-bench Verified, LiveCodeBench, SWE-Bench Pro - Public, SWE-bench Multilingual, and related benchmarks, so models can be compared within the same task family.

Click any model name to check context length, licensing, and pricing on its detail page. See Data Methodology for scoring details.

Reference: Composite Coding Rankings

There is no single, universally accepted coding leaderboard. Static benchmarks like SWE-bench and HumanEval measure specific skills but can be gamed through targeted fine-tuning. We selected two complementary human-preference leaderboards: LMArena Coding Arena ranks models on general programming tasks (debugging, algorithms, code generation) via anonymous crowd-sourced voting; DesignArena Code Category focuses specifically on visual, front-end code generation (websites, UI components, games) using the same blind-voting methodology. Reading both together gives a fuller picture of coding capability.

LMArena Coding Arena

Elo ratings from anonymous A/B voting on real general coding tasks (debugging, algorithms, code generation) submitted by developers.

Updated 2026-05-14

| # | Model | Developer | Elo |
|---|-------|-----------|-----|
| 1 | Opus 4.7 (thinking) | Anthropic | 1563 |
| 2 | Opus 4.7 | Anthropic | 1551 |
| 3 | Claude Opus 4.6 (thinking) | Anthropic | 1550 |
| 4 | Claude Opus 4.6 | Anthropic | 1549 |
| 5 | Claude Opus 4 (thinking-32k) | Anthropic | 1531 |
| 6 | Muse Spark | Facebook AI Research Lab | 1530 |
| 7 | GPT-5.4 (high) | OpenAI | 1527 |
| 8 | GLM 5.1 | Zhipu AI | 1527 |
| 9 | Gemini 3.1 Pro Preview | Google DeepMind | 1526 |
| 10 | Claude Sonnet 4.6 | Anthropic | 1522 |
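
To make the Elo gaps easier to read, here is a minimal sketch that converts a rating difference into an expected head-to-head preference rate. It assumes the conventional Elo logistic curve (base 10, scale 400); this page does not describe the arenas' exact fitting procedure, so treat the function name `expected_score` and the `scale` default as illustrative assumptions rather than the leaderboard's method.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected share of votes for A over B under the standard Elo logistic curve.

    Assumption: conventional Elo with base 10 and a 400-point scale.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Ratings taken from the LMArena Coding Arena ranking above.
opus_47_thinking = 1563   # rank 1
claude_sonnet_46 = 1522   # rank 10

p = expected_score(opus_47_thinking, claude_sonnet_46)
print(f"Expected preference rate for rank 1 over rank 10: {p:.3f}")
```

Under this assumption, the 41-point gap between rank 1 and rank 10 corresponds to an expected preference rate of only about 0.56, which suggests the top ten are closely matched.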

Top picks

Ranked by SWE-bench Multilingual

Current SOTA: Qwen3.7-Max-Preview (Alibaba), 78.30 on SWE-bench Multilingual

Best Open-Source: No qualifying model on this benchmark.

Best China-Made: Qwen3-Max-Thinking (Alibaba), SWE-bench Multilingual score: —

LLM Performance Results

Data source: DataLearnerAI


| Rank | Model | Developer | SWE-bench Verified | LiveCodeBench | SWE-Bench Pro - Public | SWE-bench Multilingual | License |
|------|-------|-----------|--------------------|---------------|------------------------|------------------------|---------|
| 1 | Qwen3.7-Max-Preview | Alibaba | 80.40 | 91.60 | 60.60 | 78.30 | Proprietary |
| 2 | Claude Opus 4.6 | Anthropic | 80.84 | 76.00 | — | 72.00 | Proprietary |
| 3 | Composer 1.5 | Cursor | — | — | — | 65.90 | Proprietary |
| 4 | Opus 4.7 | Anthropic | 87.60 | — | 64.30 | — | Proprietary |
| 5 | Opus 4.5 | Anthropic | 80.90 | 87.00 | — | — | Proprietary |
| 6 | Claude Sonnet 4 | Anthropic | 80.20 | 66.00 | 42.70 | — | Proprietary |
| 7 | Muse Spark | Facebook AI Research Lab | 77.40 | — | — | — | Proprietary |
| 8 | GPT-5.1 | OpenAI | 76.30 | — | 50.80 | — | Proprietary |
| 9 | Qwen3-Max-Thinking | Alibaba | 75.30 | 85.90 | — | — | Proprietary |
| 10 | o3-pro | OpenAI | 75.00 | — | — | — | Proprietary |
| 11 | Opus 4.1 | Anthropic | 74.50 | — | — | — | Proprietary |
| 12 | Claude Opus 4 | Anthropic | 72.50 | 56.60 | — | — | Proprietary |
| 13 | OpenAI o3 | OpenAI | 69.10 | 75.80 | — | — | Proprietary |
| 14 | OpenAI o4 - mini | OpenAI | 68.10 | — | — | — | Proprietary |
| 15 | Gemini 2.5-Pro | Google DeepMind | 67.20 | 77.10 | — | — | Proprietary |
| 16 | Gemini 2.5 Pro Experimental 03-25 | Google DeepMind | 63.80 | 70.40 | — | — | Proprietary |
| 17 | Gemini-2.5-Pro-Preview-05-06 | Google DeepMind | 63.20 | 77.10 | — | — | Proprietary |
| 18 | Grok 4 | xAI | 58.60 | 82.00 | — | — | Proprietary |
| 19 | Grok 4.1 | xAI | 54.60 | — | — | — | Proprietary |
| 20 | Gemini 2.5 Flash | Google DeepMind | 50.00 | 55.40 | — | — | Proprietary |
| 21 | OpenAI o3-mini (high) | OpenAI | 49.30 | 69.50 | — | — | Proprietary |
| 22 | OpenAI o1 | OpenAI | 48.90 | 71.00 | — | — | Proprietary |
| 23 | OpenAI o3-mini | OpenAI | 40.80 | — | — | — | Proprietary |
| 24 | Gemini 2.5 Flash-Lite | Google DeepMind | 27.60 | 34.30 | — | — | Proprietary |
| 25 | Magistral-Medium-2506 | MistralAI | — | 59.36 | — | — | Proprietary |
| 26 | OpenAI o1-mini | OpenAI | — | 52.00 | — | — | Proprietary |
| 27 | Hunyuan-TurboS | Tencent AI Lab | — | 32.00 | — | — | Proprietary |
| 28 | GPT-5.5 | OpenAI | — | — | 58.60 | — | Proprietary |
| 29 | GPT-5.4 mini | OpenAI | — | — | 54.40 | — | Proprietary |
| 30 | Hunyuan-T1 | Tencent AI Lab | — | 64.90 | — | — | Proprietary |
| 31 | Kimi-k1.6-IOI | Moonshot AI | — | 65.90 | — | — | Proprietary |
| 32 | OpenAI o3-mini (medium) | OpenAI | — | 67.40 | — | — | Proprietary |
| 33 | Kimi-k1.6-IOI-high | Moonshot AI | — | 73.80 | — | — | Proprietary |
| 34 | Grok-3 - Reasoning Beta | xAI | — | 79.40 | — | — | Proprietary |
| 35 | Gemini 2.5 Pro Deep Think | Google DeepMind | — | 80.40 | — | — | Proprietary |
| 36 | Grok 4.1 Fast | xAI | — | 82.00 | — | — | Proprietary |
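
Many models have no score on one or more benchmarks, so any ranking has to decide where those rows go. The sketch below shows one way to order rows by a primary benchmark with missing scores placed last and a secondary benchmark as the fallback. This is an assumption that happens to reproduce the first five rows above; the site's actual ordering rule is not stated on this page, and the helper name `sort_key` is hypothetical.

```python
from typing import Optional

Row = tuple[str, Optional[float], Optional[float]]

# (model, SWE-bench Multilingual, SWE-bench Verified), copied from the table above.
# None stands for a score the table does not list.
rows: list[Row] = [
    ("Opus 4.7", None, 87.60),
    ("Qwen3.7-Max-Preview", 78.30, 80.40),
    ("Composer 1.5", 65.90, None),
    ("Claude Opus 4.6", 72.00, 80.84),
    ("Opus 4.5", None, 80.90),
]


def sort_key(row: Row) -> tuple[bool, float, bool, float]:
    """Primary benchmark first (highest wins), missing values last, then the fallback."""
    _, primary, fallback = row
    return (
        primary is None,                            # False sorts before True
        -(primary if primary is not None else 0.0),
        fallback is None,
        -(fallback if fallback is not None else 0.0),
    )


ranked = sorted(rows, key=sort_key)
for rank, (model, primary, fallback) in enumerate(ranked, start=1):
    print(rank, model, primary, fallback)
```

Placing missing values last keeps a model with no reported score from outranking one with a low reported score, but it also means a strong model can sit low in the list simply because a benchmark result was never published.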
Source for the LMArena Coding Arena ranking: LMArena

DesignArena Code Category

Elo ratings from anonymous voting on visual front-end code tasks (websites, UI components, games, data visualization), run by Arcada Labs.

Updated 2026-05-17

| # | Model | Developer | Elo |
|---|-------|-----------|-----|
| 1 | Claude Opus 4.6 | Anthropic | 1348 |
| 2 | Opus 4.7 (thinking) | Anthropic | 1345 |
| 3 | Claude Opus 4.6 (thinking) | Anthropic | 1344 |
| 4 | Kimi K2.6 | Moonshot AI | 1343 |
| 5 | GLM 5.1 | Zhipu AI | 1338 |
| 6 | Opus 4.7 | Anthropic | 1335 |
| 7 | GLM-5-Turbo | Zhipu AI | 1334 |
| 8 | Claude Sonnet 4.6 | Anthropic | 1331 |
| 9 | MiMo-V2.5-Pro | Xiaomi | 1329 |
| 10 | GPT-5.5 | OpenAI | 1320 |
Source: DesignArena