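Compare large language models on benchmarks such as MMLU Pro, HLE, and SWE-Bench; select a benchmark to view its ranking.
For a detailed introduction to each benchmark, see: LLM Benchmark List and Introductions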
| 3 | Claude Opus 4.1 | 88.00 | 81.00 | 79.40 | 0.00 | 0.00 | 65.00 |
| 4 | Hunyuan-T1 | 87.20 | 69.30 | 0.00 | 96.20 | 78.20 | 64.90 |
| 5 | Grok 4 | 87.00 | 87.00 | 58.60 | 0.00 | 0.00 | 82.00 |
| 6 | Gemini 2.5-Pro | 86.00 | 86.40 | 67.20 | 98.80 | 92.00 | 77.10 |
| 7 | Qwen3-Max-Thinking | 85.70 | 87.40 | 75.30 | 0.00 | 0.00 | 85.90 |
| 8 | OpenAI o3 | 85.60 | 83.30 | 69.10 | 98.10 | 91.60 | 75.80 |
| 9 | Claude Opus 4 | 85.00 | 79.60 | 72.50 | 98.20 | 76.00 | 56.60 |
| 10 | DeepSeek-R1-0528 | 85.00 | 81.00 | 57.60 | 98.00 | 91.40 | 73.30 |
| 11 | DeepSeek V3.2-Exp | 85.00 | 79.90 | 67.80 | 0.00 | 0.00 | 74.10 |
| 12 | Grok 4.1 Fast | 85.00 | 85.00 | 0.00 | 0.00 | 0.00 | 82.00 |
| 13 | GLM-4.5 | 84.60 | 79.10 | 64.20 | 98.20 | 91.00 | 72.90 |
| 14 | Kimi K2 Thinking | 84.60 | 84.50 | 71.30 | 0.00 | 0.00 | 83.10 |
| 15 | Qwen3-235B-A22B-Thinking-2507 | 84.40 | 81.10 | 0.00 | 0.00 | 0.00 | 74.10 |
| 16 | Qwen3-235B-A22B-Thinking | 84.40 | 81.10 | 0.00 | 0.00 | 0.00 | 74.10 |
| 17 | DeepSeek-R1 | 84.00 | 71.50 | 49.20 | 97.30 | 79.80 | 65.90 |
| 18 | Claude Sonnet 4 | 84.00 | 83.80 | 80.20 | 0.00 | 43.40 | 66.00 |
| 19 | GLM-4.5-Air | 81.40 | 75.00 | 57.60 | 98.10 | 89.40 | 70.70 |
| 20 | MiniMax-M1-80k | 81.10 | 70.00 | 56.00 | 96.80 | 86.00 | 65.00 |
| 21 | OpenAI o4-mini | 80.60 | 81.40 | 68.10 | 0.00 | 98.70 | 0.00 |
| 22 | MiniMax-M1-40k | 80.60 | 69.20 | 55.60 | 96.00 | 83.30 | 62.30 |
| 23 | OpenAI o1-mini | 80.30 | 60.00 | 0.00 | 90.00 | 63.60 | 52.00 |
| 24 | Hunyuan-TurboS | 79.00 | 57.50 | 0.00 | 0.00 | 0.00 | 32.00 |
| 25 | GPT OSS 120B | 79.00 | 80.10 | 60.10 | 0.00 | 96.60 | 0.00 |
| 26 | QwQ-32B | 76.00 | 58.00 | 0.00 | 91.00 | 79.50 | 0.00 |
| 27 | GPT OSS 20B | 74.00 | 71.50 | 34.00 | 0.00 | 96.00 | 0.00 |
| 28 | Qwen3-235B-A22B | 72.90 | 71.10 | 34.40 | 98.00 | 85.70 | 70.70 |
| 29 | Qwen3-8B | 72.50 | 62.00 | 0.00 | 97.40 | 79.40 | 61.80 |
| 30 | QwQ-32B-Preview | 70.97 | 0.00 | 0.00 | 90.60 | 50.00 | 0.00 |
| 31 | Qwen3-30B-A3B | 69.10 | 54.80 | 0.00 | 0.00 | 0.00 | 29.00 |
| 32 | OpenAI o3-mini | 0.00 | 70.60 | 40.80 | 95.80 | 60.00 | 0.00 |
| 33 | QwQ-Max-Preview | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 65.60 |
| 34 | Kimi-k1.6-IOI | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 65.90 |
| 35 | OpenAI o3-mini (medium) | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 67.40 |
| 36 | Kimi-k1.6-IOI-high | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 73.80 |
| 37 | Gemini 2.5 Pro Deep Think | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 80.40 |
| 38 | Kimi k1.5 (Short-CoT) | 0.00 | 0.00 | 0.00 | 94.60 | 0.00 | 0.00 |
| 39 | Kimi k1.5 (Long-CoT) | 0.00 | 0.00 | 0.00 | 96.20 | 0.00 | 0.00 |
| 40 | Grok 3.5 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 41 | Phi-4-instruct (reasoning-trained) | 0.00 | 49.00 | 0.00 | 90.40 | 50.00 | 0.00 |
| 42 | DeepSeek-R1-Distill-Qwen-7B | 0.00 | 49.50 | 0.00 | 91.40 | 53.30 | 0.00 |
| 43 | DeepSeek-R1-Distill-Llama-70B | 0.00 | 65.20 | 0.00 | 94.50 | 0.00 | 0.00 |
| 44 | Gemini 2.5 Flash-Lite | 0.00 | 66.70 | 27.60 | 0.00 | 0.00 | 34.30 |
| 45 | Magistral-Small-2506 | 0.00 | 68.18 | 0.00 | 0.00 | 70.68 | 55.84 |
| 46 | Qwen3-32B | 0.00 | 68.40 | 0.00 | 97.20 | 81.40 | 65.70 |
| 47 | GPT-5.2 Pro | 0.00 | 93.20 | 0.00 | 0.00 | 0.00 | 0.00 |
| 48 | Magistral-Medium-2506 | 0.00 | 70.83 | 0.00 | 0.00 | 73.59 | 59.36 |
| 49 | GLM-4.7-Flash | 0.00 | 75.20 | 59.20 | 0.00 | 0.00 | 0.00 |
| 50 | OpenAI o3-mini (high) | 0.00 | 79.70 | 49.30 | 97.90 | 87.00 | 69.50 |
| 51 | DeepSeek V3.2 | 0.00 | 82.40 | 73.10 | 0.00 | 0.00 | 83.30 |
| 52 | Gemini 2.5 Flash | 0.00 | 82.80 | 50.00 | 0.00 | 88.00 | 55.40 |
| 53 | Gemini-2.5-Pro-Preview-05-06 | 0.00 | 83.00 | 63.20 | 98.80 | 92.00 | 77.10 |
| 54 | o3-pro | 0.00 | 84.00 | 75.00 | 0.00 | 93.00 | 0.00 |
| 55 | Gemini 2.5 Pro Experimental 03-25 | 0.00 | 84.00 | 63.80 | 0.00 | 92.00 | 70.40 |
| 56 | Grok-3 mini - Reasoning | 0.00 | 84.00 | 0.00 | 0.00 | 96.00 | 0.00 |
| 57 | Grok-3 - Reasoning Beta | 0.00 | 84.60 | 0.00 | 0.00 | 93.30 | 79.40 |
| 58 | Claude Sonnet 3.7-64K Extended Thinking | 0.00 | 84.80 | 0.00 | 96.20 | 80.00 | 0.00 |
| 59 | GPT-5.1 | 0.00 | 88.10 | 76.30 | 0.00 | 0.00 | 0.00 |
| 60 | GPT-5-Pro | 0.00 | 89.40 | 0.00 | 0.00 | 0.00 | 0.00 |
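
To re-rank the table by a different benchmark column offline, a minimal Python sketch is shown below. It rests on two assumptions that the page does not state: that a score of `0.00` means "no reported result" rather than a real zero, and that score columns can be addressed by position (0 is the first score column), since the column headers did not survive in the table. The helper names `parse_row` and `rank_by` are hypothetical, and the sample rows are copied from the table above.

```python
from typing import List, Optional, Tuple

# A few rows copied verbatim from the table above, used as sample input.
ROWS = """\
| 6 | Gemini 2.5-Pro | 86.00 | 86.40 | 67.20 | 98.80 | 92.00 | 77.10 |
| 7 | Qwen3-Max-Thinking | 85.70 | 87.40 | 75.30 | 0.00 | 0.00 | 85.90 |
| 8 | OpenAI o3 | 85.60 | 83.30 | 69.10 | 98.10 | 91.60 | 75.80 |
| 47 | GPT-5.2 Pro | 0.00 | 93.20 | 0.00 | 0.00 | 0.00 | 0.00 |
"""


def parse_row(line: str) -> Tuple[str, List[Optional[float]]]:
    """Split one pipe-delimited row into (model name, list of scores)."""
    cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
    name = cells[1]  # cells[0] is the rank number
    # ASSUMPTION: 0.00 marks a missing score, so it is mapped to None.
    scores = [None if float(c) == 0.0 else float(c) for c in cells[2:]]
    return name, scores


def rank_by(rows_text: str, column: int) -> List[Tuple[str, float]]:
    """Rank models by one score column (0-based), skipping missing scores."""
    parsed = [parse_row(line) for line in rows_text.splitlines() if line.strip()]
    scored = [(name, s[column]) for name, s in parsed if s[column] is not None]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for position, (name, score) in enumerate(rank_by(ROWS, 1), start=1):
        print(f"{position}. {name}: {score:.2f}")
```

Dropping the placeholder rows before sorting keeps models without a reported score from appearing at the bottom of a ranking as if they had scored zero.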