A multiple-choice benchmark covering 57 subjects, used to evaluate the knowledge and reasoning abilities of large language models.
Source: DataLearnerAI
Data sourced primarily from official releases (GitHub, Hugging Face, papers), then benchmark leaderboards, then third-party evaluators.
| Rank | Model | Score | Release date | Parameters (×100M) |
|---|---|---|---|---|
| 1 | DeepSeek-V3.1 (default) | 93.4 | 2025-08-20 | 6710 |
| 2 | OpenAI o4-mini (default) | 93 | 2025-04-16 | Unknown |
| 3 | OpenAI o1 (default) | 91.8 | 2024-12-05 | Unknown |
| 4 | DeepSeek-V3.1 (default) | 91.8 | 2025-08-20 | 6710 |
| 5 | DeepSeek-R1 (default) | 90.8 | 2025-01-20 | 6710 |
| 6 | GPT-4.1 (default) | 90.2 | 2025-04-14 | Unknown |
| 7 | GPT OSS 120B (default) | 90 | 2025-08-06 | 117 |
| 8 | Hunyuan-TurboS (default) | 89.5 | 2025-03-10 | Unknown |
| 9 | Kimi K2 (default) | 89.5 | 2025-07-11 | 10000 |
| 10 | Pangu Pro MoE (default) | 89.3 | 2025-06-30 | 719 |
| 11 | GPT-4o (default) | 88.7 | 2024-05-13 | Unknown |
| 12 | Llama3.1-405B Instruct (default) | 88.6 | 2024-07-23 | 4050 |
| 13 | DeepSeek-V3 (default) | 88.5 | 2024-12-26 | 6810 |
| 14 | Claude 3.5 Sonnet (default) | 88.3 | 2024-06-21 | Unknown |
| 15 | Claude 3.5 Sonnet New (default) | 88.3 | 2024-10-22 | Unknown |
| 16 | Hunyuan-A13B-Instruct (default) | 88.17 | 2025-06-27 | 800 |
| 17 | Qwen2.5-Max (default) | 87.9 | 2025-01-28 | Unknown |
| 18 | Grok 2 (default) | 87.5 | 2024-08-13 | 2690 |
| 19 | GPT-4.1 mini (default) | 87.5 | 2025-04-14 | Unknown |
| 20 | Kimi k1.5 (Short-CoT) (default) | 87.4 | 2025-01-22 | Unknown |
| 21 | Gemini 1.5 Pro (default) | 87.1 | 2024-02-15 | Unknown |
| 22 | OpenAI o3-mini (high) (default) | 86.9 | 2025-01-31 | Unknown |
| 23 | Claude3-Opus (default) | 86.8 | 2024-03-04 | Unknown |
| 24 | Gemini 2.0 Pro Experimental (default) | 86.5 | 2025-02-05 | Unknown |
| 25 | DeepSeek-V3-0324 (default) | 86.5 | 2025-03-24 | 6710 |
| 26 | ERNIE-4.5-300B-A47B (default) | 86.5 | 2025-06-30 | 3000 |
| 27 | Qwen2.5-72B (default) | 86.1 | 2024-09-18 | 727 |
| 28 | Llama3.1-70B-Instruct (default) | 86 | 2024-07-23 | 700 |
| 29 | Llama3.3-70B-Instruct (default) | 86 | 2024-12-06 | 700 |
| 30 | Amazon Nova Pro (default) | 85.9 | 2024-12-03 | Unknown |