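A leaderboard of the latest AI text generation models, based on anonymous user votes in the Text Generation Arena. It lists each model's Elo score, 95% confidence interval, vote count, organization, and license.
Data version: January 24, 2026
Data source: LM Arena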
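The data are for reference only; official sources take precedence. Links next to model names lead to the DataLearner model detail pages.

The leaderboard data come from LMSYS Chatbot Arena, which uses a widely recognized crowdsourced "battle" format: model capability is evaluated through blind tests by a large number of real users.

Without knowing the model names, users compare the responses that two models generate for the same prompt side by side (SBS) and vote for the better one, which removes brand bias from the vote.

The Elo rating is computed with the Bradley-Terry model. It reflects each model's relative strength across these pairwise battles and is among the most widely used evaluation standards for LLMs.

| Rank | Model | Score | 95% CI | Votes | Organization | License |
|---|---|---|---|---|---|---|
| 1 | gemini-3-pro | 1,490 | ±5 | 27,827 | | Proprietary |
| 2 | grok-4.1-thinking | 1,477 | ±5 | 27,985 | xAI | Proprietary |
| 3 | gemini-3-flash | 1,472 | ±6 | 13,245 | | Proprietary |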
| 4 | claude-opus-4.5-20251101-thinking-32k | 1,470 | ±5 | 19,898 | Anthropic | Proprietary |
| 5 | claude-opus-4.5-20251101 | 1,467 | ±5 | 21,241 | Anthropic | Proprietary |
| 6 | grok-4.1 | 1,465 | ±5 | 32,015 | xAI | Proprietary |
| 7 | gemini-3-flash (thinking-minimal) | 1,462 | ±7 | 9,644 | | Proprietary |
| 8 | ernie-5.0-0110 | 1,459 | ±9 | 4,829 | Baidu | Proprietary |
| 9 | gpt-5.1-high | 1,458 | ±5 | 24,439 | OpenAI | Proprietary |
| 10 | gemini-2.5-pro | 1,451 | ±3 | 87,641 | | Proprietary |
| 11 | claude-sonnet-4.5-20250929-thinking-32k | 1,451 | ±4 | 38,441 | Anthropic | Proprietary |
| 12 | ernie-5.0-preview-1203 | 1,450 | ±7 | 9,709 | Baidu | Proprietary |
| 13 | claude-sonnet-4.5-20250929 | 1,450 | ±4 | 35,025 | Anthropic | Proprietary |
| 14 | claude-opus-4.1-20250805-thinking-16k | 1,449 | ±4 | 50,061 | Anthropic | Proprietary |
| 15 | claude-opus-4.1-20250805 | 1,445 | ±3 | 67,599 | Anthropic | Proprietary |
| 16 | gpt-5.2 | 1,445 | ±9 | 5,187 | OpenAI | Proprietary |
| 17 | gpt-4.5-preview-2025-02-27 | 1,444 | ±6 | 14,549 | OpenAI | Proprietary |
| 18 | chatgpt-4o-latest-20250326 | 1,442 | ±3 | 74,853 | OpenAI | Proprietary |
| 19 | glm-4.7 | 1,441 | ±7 | 9,556 | Z.ai | MIT |
| 20 | gpt-5.2-high | 1,436 | ±8 | 4,594 | OpenAI | Proprietary |
| 21 | gpt-5.1 | 1,435 | ±5 | 26,241 | OpenAI | Proprietary |
| 22 | gpt-5-high | 1,435 | ±5 | 32,008 | OpenAI | Proprietary |
| 23 | qwen2-max-preview | 1,434 | ±5 | 27,894 | Alibaba | Proprietary |
| 24 | o3-2025-04-16 | 1,433 | ±4 | 81,435 | OpenAI | Proprietary |
| 25 | grok-4.1-fast-reasoning | 1,430 | ±5 | 21,701 | xAI | Proprietary |
| 26 | kimi-k1-thinking-turbo | 1,429 | ±5 | 26,054 | Moonshot | Proprietary |
| 27 | gpt-5-chat | 1,426 | ±6 | 21,883 | OpenAI | Proprietary |
| 28 | glm-4.6 | 1,425 | ±4 | 33,537 | Z.ai | MIT |
| 29 | qwen2-max-2025-09-19 | 1,424 | ±6 | 9,225 | Alibaba | Proprietary |
| 30 | claude-opus-4-20250514-thinking-10k | 1,424 | ±4 | 38,020 | Anthropic | Proprietary |
| 31 | deepseek-v3.2-exp | 1,423 | ±7 | 11,072 | DeepSeek | MIT |
| 32 | deepseek-v3.2-exp-thinking | 1,423 | ±7 | 8,017 | DeepSeek | MIT |
| 33 | qwen2-200b-v22b-instruct-2507 | 1,422 | ±3 | 62,599 | Alibaba | Apache 2.0 |
| 34 | grok-4-fast-chat | 1,422 | ±8 | 7,601 | xAI | Proprietary |
| 35 | deepseek-v3.2-thinking | 1,420 | ±5 | 15,802 | DeepSeek | MIT |
| 36 | deepseek-v3.2 | 1,418 | ±5 | 20,503 | DeepSeek | MIT |
| 37 | deepseek-v3-0928 | 1,418 | ±6 | 16,306 | DeepSeek | MIT |
| 38 | kimi-k1-0905-preview | 1,418 | ±6 | 11,381 | Moonshot | Proprietary |
| 39 | ernie-5.0-preview-1022 | 1,417 | ±9 | 4,843 | Baidu | Proprietary |
| 40 | kimi-k1-0711-preview | 1,417 | ±5 | 28,603 | Moonshot | Proprietary |
| 41 | deepseek-v3.1-thinking | 1,417 | ±7 | 11,364 | DeepSeek | MIT |
| 42 | deepseek-v3.1 | 1,417 | ±6 | 15,294 | DeepSeek | MIT |
| 43 | deepseek-v3.1-twinInUse | 1,416 | ±10 | 3,766 | DeepSeek | MIT |
| 44 | deepseek-v3.1-twinInUse-thinking | 1,416 | ±10 | 3,352 | DeepSeek | MIT |
| 45 | qwen2-vl-236b-a23b-instruct | 1,415 | ±6 | 11,700 | Alibaba | Apache 2.0 |
| 46 | claude-opus-4-20250514 | 1,413 | ±4 | 45,596 | Anthropic | Proprietary |
| 47 | gpt-4.5-2025-04-14 | 1,413 | ±4 | 52,274 | OpenAI | Proprietary |
| 48 | mistral-medium-2506 | 1,412 | ±3 | 55,668 | Mistral | Proprietary |
| 49 | mistral-large-3 | 1,411 | ±5 | 16,762 | Mistral | Apache 2.0 |
| 50 | grok-3-preview-02-24 | 1,410 | ±6 | 13,301 | xAI | Proprietary |
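The test data cover real, high-frequency scenarios such as coding, creative writing, mathematical reasoning, and role-play, which supports the ranking's general applicability and reference value.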
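
To make the Bradley-Terry-based rating mentioned above more concrete, here is a minimal sketch of how pairwise votes could be turned into Elo-scale scores. It is not LM Arena's actual pipeline. The function names (`fit_bradley_terry`, `to_elo`, `win_probability`), the base of 1000, and the 400-point scale are illustrative assumptions. The sketch also assumes there are no ties and that every model has at least one win and one loss.

```python
import math
from collections import Counter


def fit_bradley_terry(votes, iters=500, tol=1e-10):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Uses the classic minorization-maximization update. Assumes no ties and
    that every model has at least one win and one loss (hypothetical setup).
    """
    models = sorted({m for pair in votes for m in pair})
    wins = Counter(winner for winner, _ in votes)
    games = Counter(frozenset(pair) for pair in votes)  # battles per unordered pair
    p = {m: 1.0 for m in models}
    for _ in range(iters):
        new_p = {}
        for i in models:
            denom = 0.0
            for j in models:
                if j == i:
                    continue
                n = games[frozenset((i, j))]
                if n:
                    denom += n / (p[i] + p[j])
            new_p[i] = wins[i] / denom
        # Fix the overall scale: normalize the geometric mean to 1.
        log_mean = sum(math.log(v) for v in new_p.values()) / len(models)
        new_p = {m: v / math.exp(log_mean) for m, v in new_p.items()}
        delta = max(abs(new_p[m] - p[m]) for m in models)
        p = new_p
        if delta < tol:
            break
    return p


def to_elo(strengths, base=1000.0, scale=400.0):
    """Map strengths to an Elo-like scale (base and scale are assumptions)."""
    return {m: base + scale * math.log10(v) for m, v in strengths.items()}


def win_probability(rating_a, rating_b, scale=400.0):
    """Predicted probability that A beats B under the Elo/Bradley-Terry model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


# Toy votes with hypothetical model names: model-a wins 3 of 4 battles.
votes = [("model-a", "model-b")] * 3 + [("model-b", "model-a")]
ratings = to_elo(fit_bradley_terry(votes))
```

With these toy votes, the fitted ratings imply that `model-a` beats `model-b` with probability 0.75, matching the observed 3-to-1 record. Read the same way, the small score gaps between neighboring rows in the table correspond to win probabilities only slightly above 50%.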