Text Generation Arena Leaderboard
The latest AI text generation leaderboard based on LMArena anonymous user voting. Covers Elo scores, confidence intervals, and vote counts for leading language models.
Top Model
ernie-5.1
Top Score
1,474
Model Count
357
Data version
May 7, 2026
Data source: LM Arena
About This Leaderboard
This leaderboard ranks the strongest AI models for text generation. Data comes from LMArena (formerly LMSYS Chatbot Arena), the world's largest crowdsourced AI evaluation platform. Users chat with two anonymous models side-by-side and vote for the better response — rankings are determined entirely by real user preferences, not lab benchmarks.
Methodology Overview
Blind testing: Users chat with two anonymous models and vote based on response quality, eliminating brand bias.
Elo scoring: Each model's strength score is estimated from battle outcomes using the Bradley-Terry model, a statistical relative of the chess Elo rating system. Higher scores mean users more frequently prefer that model.
Broad scenario coverage: Testing spans coding, creative writing, math reasoning, Q&A, role-playing, and more.
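The Bradley-Terry fitting step can be illustrated with a small sketch. The win counts below are made up for demonstration; the update rule is the standard minorization-maximization (MM) iteration for Bradley-Terry maximum likelihood, and the final conversion to an Elo-like scale (a log transform plus an arbitrary anchor) is an assumption about presentation, not LMArena's exact pipeline.

```python
import math

# Hypothetical battle outcomes: wins[i][j] = times model i beat model j.
models = ["model-a", "model-b", "model-c"]
wins = [
    [0, 8, 6],
    [2, 0, 5],
    [4, 5, 0],
]

def bradley_terry(wins, iters=200):
    """Estimate Bradley-Terry strengths p_i with the classic MM update:
    p_i <- W_i / sum_j (n_ij / (p_i + p_j)), then renormalize."""
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            w_i = sum(wins[i])  # total wins of model i
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            new_p.append(w_i / denom if denom else p[i])
        s = sum(new_p)
        p = [x / s for x in new_p]  # normalize: strengths are scale-free
    return p

strengths = bradley_terry(wins)
# Map strengths onto an Elo-like scale (anchor value is arbitrary).
anchor = 1500
ratings = [400 * math.log10(x) + anchor for x in strengths]
```

Because Bradley-Terry strengths are only identified up to a common scale, the normalization and the anchor are free choices; only rating differences between models are meaningful.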
DataLearner provides in-depth analysis on top of the raw data, linking leaderboard models to the DataLearner model database so you can quickly access model details, API pricing, benchmark scores, and more.
Ranking Table
| Rank | Model | Score | 95% CI | Votes | Organization | License |
|---|---|---|---|---|---|---|
| 14 | ernie-5.1 | 1,474 | +/-8 | 5,733 | Baidu | Proprietary |
| 25 | qwen3.5-max-preview | 1,464 | +/-5 | 14,558 | Alibaba | Proprietary |
| 27 | DeepSeek-V4-Pro | 1,463 | +/-9 | 4,160 | DeepSeek-AI | MIT |
| 28 | Kimi K2.6 | 1,462 | +/-7 | 7,108 | Moonshot AI | Modified MIT |
| 29 | deepseek-v4-pro-thinking | 1,462 | +/-9 | 3,808 | DeepSeek | MIT |
| 31 | dola-seed-2.0-pro | 1,459 | +/-5 | 26,587 | Bytedance | Proprietary |
| 41 | Kimi K2 Thinking | 1,449 | +/-4 | 27,282 | Moonshot AI | Modified MIT |
| 53 | deepseek-v4-flash-thinking | 1,440 | +/-9 | 3,600 | DeepSeek | MIT |
| 62 | DeepSeek-V4-Flash | 1,433 | +/-9 | 3,506 | DeepSeek-AI | MIT |
| 63 | kimi-k2.5-instant | 1,432 | +/-7 | 8,207 | Moonshot | Modified MIT |
| 66 | Kimi K2 Thinking (thinking-turbo) | 1,430 | +/-3 | 52,935 | Moonshot AI | Modified MIT |
| 70 | DeepSeek V3.2-Exp (thinking) | 1,425 | +/-7 | 9,076 | DeepSeek-AI | MIT |
| 71 | DeepSeek V3.2 | 1,424 | +/-4 | 44,820 | DeepSeek-AI | MIT |
| 72 | qwen3-max-2025-09-23 | 1,424 | +/-6 | 9,179 | Alibaba | Proprietary |
| 74 | DeepSeek V3.2-Exp | 1,423 | +/-6 | 11,943 | DeepSeek-AI | MIT |
| 77 | DeepSeek V3.2 (thinking) | 1,422 | +/-4 | 39,071 | DeepSeek-AI | MIT |
| 78 | DeepSeek-R1-0528 | 1,422 | +/-6 | 18,469 | DeepSeek-AI | MIT |
| 82 | hunyuan-hy3-preview | 1,418 | +/-8 | 4,582 | Tencent | tencent-hunyuan-community |
| 83 | Kimi K2 0905 | 1,418 | +/-6 | 11,798 | Moonshot AI | Modified MIT |
| 84 | DeepSeek-V3.1 | 1,418 | +/-6 | 14,985 | DeepSeek-AI | MIT |
| 85 | Kimi K2 | 1,417 | +/-5 | 27,644 | Moonshot AI | Modified MIT |
| 86 | deepseek-v3.1-terminus-thinking | 1,417 | +/-10 | 3,474 | DeepSeek | MIT |
| 87 | DeepSeek-V3.1 (thinking) | 1,417 | +/-7 | 11,754 | DeepSeek-AI | MIT |
| 88 | DeepSeek-V3.1 Terminus | 1,416 | +/-10 | 3,713 | DeepSeek-AI | MIT |
| 100 |  | 1,407 | +/-6 | 13,525 | MiniMaxAI | Modified MIT |
| 105 | qwen3-235b-a22b-no-thinking | 1,403 | +/-5 | 38,241 | Alibaba | Apache 2.0 |
| 109 | qwen3-235b-a22b-thinking-2507 | 1,399 | +/-7 | 9,004 | Alibaba | Apache 2.0 |
| 111 | Step 3.5 Flash | 1,398 | +/-5 | 19,649 | StepFunAI | Proprietary |
| 112 | DeepSeek-R1 | 1,398 | +/-5 | 18,524 | DeepSeek-AI | MIT |
| 114 | hunyuan-vision-1.5-thinking | 1,396 | +/-12 | 2,221 | Tencent | Proprietary |
| 117 | DeepSeek-V3-0324 | 1,395 | +/-4 | 45,533 | DeepSeek-AI | MIT |
| 118 |  | 1,395 | +/-4 | 24,885 | MiniMaxAI | Modified MIT |
| 119 | Step 3.5 Flash | 1,393 | +/-4 | 25,112 | StepFunAI | Apache 2.0 |
| 131 |  | 1,385 | +/-5 | 17,165 | MiniMaxAI | MIT |
| 134 | hunyuan-turbos-20250416 | 1,382 | +/-6 | 10,723 | Tencent | Proprietary |
| 149 | minimax-m1 | 1,363 | +/-4 | 35,233 | MiniMax | Apache 2.0 |
| 154 | DeepSeek-V3 | 1,358 | +/-5 | 21,770 | DeepSeek-AI | DeepSeek |
| 164 | hunyuan-turbos-20250226 | 1,348 | +/-12 | 2,220 | Tencent | Proprietary |
| 165 | Step3 | 1,348 | +/-7 | 6,551 | StepFunAI | Apache 2.0 |
| 172 |  | 1,346 | +/-8 | 6,871 | MiniMaxAI | Apache 2.0 |
| 173 | qwen-plus-0125 | 1,346 | +/-8 | 5,819 | Alibaba | Proprietary |
| 176 | glm-4-plus-0111 | 1,343 | +/-8 | 5,760 | Zhipu | Proprietary |
| 179 | hunyuan-turbo-0110 | 1,340 | +/-12 | 2,290 | Tencent | Proprietary |
| 188 | step-2-16k-exp-202412 | 1,334 | +/-9 | 4,833 | StepFun | Proprietary |
| 196 | hunyuan-large-2025-02-10 | 1,326 | +/-10 | 3,738 | Tencent | Proprietary |
| 198 | deepseek-v2.5-1210 | 1,323 | +/-8 | 6,795 | DeepSeek | DeepSeek |
| 205 | step-1o-turbo-202506 | 1,320 | +/-7 | 9,039 | StepFun | Proprietary |
| 206 | glm-4-plus | 1,319 | +/-5 | 26,126 | Zhipu AI | Proprietary |
| 209 | qwen-max-0919 | 1,318 | +/-6 | 16,478 | Alibaba | Qwen |
| 213 | qwen2.5-plus-1127 | 1,315 | +/-6 | 10,187 | Alibaba | Proprietary |
| 218 | hunyuan-standard-2025-02-10 | 1,311 | +/-10 | 3,904 | Tencent | Proprietary |
| 221 | deepseek-v2.5 | 1,307 | +/-5 | 24,572 | DeepSeek | DeepSeek |
| 229 | qwen2.5-72b-instruct | 1,302 | +/-4 | 39,406 | Alibaba | Qwen |
| 231 | hunyuan-large-vision | 1,294 | +/-9 | 5,370 | Tencent | Proprietary |
| 250 | glm-4-0520 | 1,273 | +/-7 | 9,788 | Zhipu AI | Proprietary |
| 252 | qwen2.5-coder-32b-instruct | 1,270 | +/-8 | 5,432 | Alibaba | Apache 2.0 |
| 255 | deepseek-coder-v2 | 1,264 | +/-6 | 15,147 | DeepSeek | DeepSeek License |
| 257 | qwen2-72b-instruct | 1,261 | +/-5 | 37,325 | Alibaba | Qianwen LICENSE |
| 269 | qwen1.5-110b-chat | 1,233 | +/-6 | 26,195 | Alibaba | Qianwen LICENSE |
| 270 | hunyuan-standard-256k | 1,233 | +/-12 | 2,728 | Tencent | Proprietary |
| 272 | qwen1.5-72b-chat | 1,232 | +/-5 | 39,302 | Alibaba | Qianwen LICENSE |
| 286 | qwen1.5-32b-chat | 1,203 | +/-6 | 21,741 | Alibaba | Qianwen LICENSE |
| 292 | internlm2_5-20b-chat | 1,191 | +/-7 | 9,901 | InternLM | Other |
| 293 | qwen1.5-14b-chat | 1,190 | +/-7 | 17,839 | Alibaba | Qianwen LICENSE |
| 295 | deepseek-llm-67b-chat | 1,183 | +/-12 | 4,932 | DeepSeek | DeepSeek License |
| 312 | qwq-32b-preview | 1,156 | +/-12 | 3,231 | Alibaba | Apache 2.0 |
| 321 | qwen1.5-7b-chat | 1,143 | +/-10 | 4,737 | Alibaba | Qianwen LICENSE |
| 325 | qwen-14b-chat | 1,137 | +/-11 | 4,964 | Alibaba | Qianwen LICENSE |
| 343 | qwen1.5-4b-chat | 1,089 | +/-9 | 7,597 | Alibaba | Qianwen LICENSE |
Data is for reference only. Official sources are authoritative. Click model names to view DataLearner model profiles.
FAQ
What is Text Generation Arena (LMArena)?
Text Generation Arena, formerly LMSYS Chatbot Arena, is one of the most widely followed anonymous LLM evaluation platforms. Users compare answers from two hidden models and vote for the better response; Elo-style scoring aggregates those votes into a dynamic leaderboard.
How is the Arena Elo score calculated?
Arena Elo is adapted from chess rating systems. After each head-to-head comparison, the preferred model gains rating points and the other model loses points, with the size of the change depending on the rating gap: beating a much higher-rated model moves the ratings more than beating a lower-rated one. The 95% confidence interval reflects statistical uncertainty in the estimate; it narrows as a model accumulates more votes.
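The update described above can be sketched in a few lines. This is the textbook online Elo rule, not LMArena's exact implementation (which fits a Bradley-Terry model over all battles at once); the K-factor of 32 is a conventional example value.

```python
def elo_update(r_a, r_b, score_a, k=32):
    """One head-to-head Elo update.

    score_a is 1.0 if model A wins, 0.0 if it loses, 0.5 for a tie.
    The expected score comes from the logistic curve over the rating gap,
    so an underdog win produces a larger rating swing than a favorite win.
    """
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# An upset: a 1000-rated model beats a 1200-rated one and gains ~24 points,
# while the favorite loses the same amount.
new_a, new_b = elo_update(1000, 1200, 1.0)
```

Note that the two rating changes are equal and opposite, so the total rating mass in the pool is conserved by each comparison.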
Why do some models have both Thinking and regular versions?
Some models offer an extended-thinking mode that spends more inference time reasoning before producing the final answer. This can improve scores on reasoning, math, and coding tasks, but usually increases latency and cost, so Arena tracks these variants separately.
How should I choose an LLM from this leaderboard?
Consider overall Elo, cost, language coverage, open-source availability, and latency. The top-ranked model is not always the best fit for every workflow.