LLM Benchmark Scores and Performance Comparison

This page presents the scores of major large language models on standard evaluation benchmarks, including MMLU, GSM8K, HumanEval, and others. The results are updated regularly to help developers and researchers understand how different models perform across task types. You can also select custom combinations of models and benchmarks to quickly compare their relative strengths for practical applications.

For detailed descriptions of each benchmark, see: LLM Benchmark List and Introduction

Custom Comparison Selection

Benchmark categories: MMLU Pro and MMLU test knowledge QA; GSM8K, MATH, and MATH-500 test math reasoning; GPQA Diamond tests graduate-level science QA; HumanEval and LiveCodeBench test code generation. A dash (–) marks a score that is not reported (shown as 0.00 in the raw data). Parameter counts are given in billions; "Unknown" means the vendor has not disclosed the count.

| Model | MMLU Pro | MMLU | GSM8K | MATH | GPQA Diamond | HumanEval | MATH-500 | LiveCodeBench | Params (B) | Organization |
|---|---|---|---|---|---|---|---|---|---|---|
| OpenAI o1 | 91.04 | 91.80 | – | 96.40 | 77.30 | – | 96.40 | 71.00 | Unknown | OpenAI |
| Hunyuan-T1 | 87.20 | – | – | – | 69.30 | – | 96.20 | 64.90 | Unknown | Tencent AI Lab |
| OpenAI o3 | 85.60 | – | – | – | 83.30 | – | – | – | Unknown | OpenAI |
| DeepSeek-R1 | 84.00 | 90.80 | – | – | 71.50 | – | 97.30 | 65.90 | 671 | DeepSeek-AI |
| OpenAI o4-mini | 80.60 | – | – | – | 81.40 | – | – | – | Unknown | OpenAI |
| OpenAI o1-mini | 80.30 | 85.20 | – | – | 60.00 | 92.40 | 90.00 | 52.00 | Unknown | OpenAI |
| Hunyuan-TurboS | 79.00 | 89.50 | – | 89.70 | 57.50 | 91.00 | – | 32.00 | Unknown | Tencent AI Lab |
| QwQ-32B | 76.00 | – | – | – | 58.00 | 19.00 | 91.00 | – | 32.5 | Alibaba |
| QwQ-32B-Preview | 70.97 | – | – | – | – | – | 90.60 | – | 32 | Alibaba |
| Qwen3-235B-A22B | 68.18 | – | 94.39 | – | – | – | – | 70.70 | 235 | Alibaba |
| Grok-3 (Reasoning Beta) | – | – | – | – | 84.60 | – | – | 79.40 | Unknown | xAI |
| Claude Sonnet 3.7 (64K Extended Thinking) | – | – | – | – | 84.80 | – | 96.20 | – | Unknown | Anthropic |
| Phi-4-instruct (reasoning-trained) | – | – | – | – | 49.00 | – | 90.40 | – | 3.8 | Microsoft |
| DeepSeek-R1-Distill-Qwen-7B | – | – | – | – | 49.50 | – | 91.40 | – | 7 | DeepSeek-AI |
| Grok-3 mini (Reasoning) | – | – | – | – | 84.00 | – | – | – | Unknown | xAI |
| Kimi-k1.6-IOI-high | – | – | – | – | – | – | – | 73.80 | Unknown | Moonshot AI |
| Kimi-k1.6-IOI | – | – | – | – | – | – | – | 65.90 | Unknown | Moonshot AI |
| QwQ-Max-Preview | – | – | – | – | – | – | – | 65.60 | Unknown | Alibaba |
| Kimi k1.5 (Long-CoT) | – | – | – | – | – | – | 96.20 | – | Unknown | Moonshot AI |
| Kimi k1.5 (Short-CoT) | – | 87.40 | – | – | – | – | 94.60 | – | Unknown | Moonshot AI |
| Gemini 2.5 Pro Experimental 03-25 | – | – | – | – | 84.00 | – | – | 70.40 | Unknown | Google DeepMind |
| Gemini 2.5 Flash | – | – | – | – | 78.30 | – | – | 63.40 | Unknown | Google DeepMind |
| OpenAI o3-mini (high) | – | 86.90 | – | 97.90 | 79.70 | 97.60 | 97.90 | 69.50 | Unknown | OpenAI |
| OpenAI o3-mini (medium) | – | – | – | – | – | – | – | 67.40 | Unknown | OpenAI |
| Qwen3-32B | – | – | – | – | – | – | – | 65.70 | 22 | Alibaba |
| DeepSeek-R1-Distill-Llama-70B | – | – | – | – | 65.20 | – | 94.50 | – | 70 | DeepSeek-AI |
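The custom model-vs-benchmark comparison described above can be sketched in a few lines of Python. This is an illustrative sketch, not the site's actual backend: the `SCORES` dict is a small excerpt copied from the table, and the `compare` helper is a hypothetical name. Unreported scores (shown as 0.00 in the raw data) are stored as missing entries and simply omitted from the result.

```python
# A minimal sketch of a custom model/benchmark comparison over the
# leaderboard data. SCORES is a small excerpt of the table above;
# benchmarks a model has no reported score for are simply absent.
SCORES = {
    "OpenAI o1":   {"MMLU Pro": 91.04, "MMLU": 91.80, "MATH-500": 96.40, "LiveCodeBench": 71.00},
    "DeepSeek-R1": {"MMLU Pro": 84.00, "MMLU": 90.80, "MATH-500": 97.30, "LiveCodeBench": 65.90},
    "QwQ-32B":     {"MMLU Pro": 76.00, "MATH-500": 91.00},
}

def compare(models, benchmarks):
    """Return {benchmark: {model: score}}, keeping only reported scores."""
    table = {}
    for bench in benchmarks:
        row = {}
        for model in models:
            score = SCORES.get(model, {}).get(bench)
            if score is not None:  # skip unreported (0.00) entries
                row[model] = score
        table[bench] = row
    return table

result = compare(["OpenAI o1", "DeepSeek-R1"], ["MATH-500"])
print(result)  # {'MATH-500': {'OpenAI o1': 96.4, 'DeepSeek-R1': 97.3}}
```

Storing missing scores as absent keys rather than 0.00 avoids a model appearing to score zero on a benchmark it was never evaluated on, which would badly skew any averaging or ranking built on top.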