Large Language Model Benchmarks and Performance Comparison

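This page shows how several mainstream large language models perform on standard evaluation benchmarks, including MMLU, GSM8K, and HumanEval. The results are updated continuously to help developers and researchers understand how different models perform across tasks. Users can choose their own set of models and benchmarks to compare, and quickly see each model's relative strengths and weaknesses in practical use.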

For a detailed introduction to each benchmark, see: LLM Evaluation Benchmark List and Introduction

| Model | MMLU Pro (knowledge Q&A) | MMLU (knowledge Q&A) | GSM8K (math reasoning) | MATH (math reasoning) | GPQA Diamond (commonsense reasoning) | HumanEval (code generation) | MATH-500 (math reasoning) | LiveCodeBench (code generation) | Parameters (×100M) | Open source | Organization |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OpenAI o1 | 91.04 | 91.80 | 0.00 | 96.40 | 77.30 | 0.00 | 96.40 | 71.00 | Unknown | | OpenAI |
| Hunyuan-T1 | 87.20 | 0.00 | 0.00 | 0.00 | 69.30 | 0.00 | 96.20 | 64.90 | Unknown | | Tencent AI Lab |
| GPT-4.5 | 86.10 | 0.00 | 0.00 | 0.00 | 71.40 | 0.00 | 90.70 | 46.40 | Unknown | | OpenAI |
| OpenAI o3 | 85.60 | 0.00 | 0.00 | 0.00 | 83.30 | 0.00 | 0.00 | 0.00 | Unknown | | OpenAI |
| DeepSeek-R1 | 84.00 | 90.80 | 0.00 | 0.00 | 71.50 | 0.00 | 97.30 | 65.90 | 6710.0 | | DeepSeek-AI |
| Llama 4 Behemoth Instruct | 82.20 | 0.00 | 0.00 | 0.00 | 73.70 | 0.00 | 95.00 | 49.40 | 20000.0 | | Facebook AI Research |
| DeepSeek-V3-0324 | 81.20 | 0.00 | 0.00 | 0.00 | 68.40 | 0.00 | 94.00 | 49.20 | 6810.0 | | DeepSeek-AI |
| OpenAI o4 - mini | 80.60 | 0.00 | 0.00 | 0.00 | 81.40 | 0.00 | 0.00 | 0.00 | Unknown | | OpenAI |
| GPT-4.1 | 80.50 | 0.00 | 0.00 | 0.00 | 66.30 | 0.00 | 0.00 | 0.00 | Unknown | | OpenAI |
| Llama 4 Maverick Instruct | 80.50 | 0.00 | 0.00 | 0.00 | 69.80 | 0.00 | 0.00 | 43.40 | 4000.0 | | Facebook AI Research |
| OpenAI o1-mini | 80.30 | 85.20 | 0.00 | 0.00 | 60.00 | 92.40 | 90.00 | 52.00 | Unknown | | OpenAI |
| Gemini 2.0 Pro Experimental | 79.10 | 86.50 | 0.00 | 91.80 | 64.70 | 0.00 | 0.00 | 0.00 | Unknown | | DeepMind |
| Hunyuan-TurboS | 79.00 | 89.50 | 0.00 | 89.70 | 57.50 | 91.00 | 0.00 | 32.00 | Unknown | | Tencent AI Lab |
| Claude 3.5 Sonnet New | 78.00 | 88.30 | 0.00 | 78.30 | 65.00 | 93.70 | 78.00 | 38.70 | Unknown | | Anthropic |
| GPT-4o | 77.90 | 88.70 | 0.00 | 75.90 | 53.60 | 90.00 | 75.90 | 35.10 | Unknown | | OpenAI |
| GPT-4o(2024-11-20) | 77.90 | 85.70 | 0.00 | 68.50 | 0.00 | 90.20 | 0.00 | 0.00 | Unknown | | OpenAI |
| Claude 3.5 Sonnet | 77.64 | 88.30 | 0.00 | 71.10 | 59.40 | 92.00 | 0.00 | 0.00 | Unknown | | Anthropic |
| Gemini 2.0 Flash Experimental | 76.24 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | Unknown | | DeepMind |
| Gemini 1.5 Pro | 76.10 | 87.10 | 0.00 | 82.90 | 53.50 | 89.00 | 0.00 | 0.00 | Unknown | | Google DeepMind |
| Qwen2.5-Max | 76.10 | 87.90 | 94.50 | 68.50 | 0.00 | 73.20 | 0.00 | 0.00 | Unknown | | Alibaba |
| QwQ-32B | 76.00 | 0.00 | 0.00 | 0.00 | 58.00 | 19.00 | 91.00 | 0.00 | 325.0 | | Alibaba |
| DeepSeek-V3 | 75.90 | 88.50 | 0.00 | 87.80 | 59.10 | 89.00 | 87.80 | 34.60 | 6810.0 | | DeepSeek-AI |
| Grok 2 | 75.50 | 87.50 | 0.00 | 76.10 | 56.00 | 88.40 | 0.00 | 0.00 | Unknown | | xAI |
| Llama 4 Scout Instruct | 74.30 | 0.00 | 0.00 | 0.00 | 57.20 | 0.00 | 0.00 | 32.80 | 1090.0 | | Facebook AI Research |
| Llama3.1-405B Instruct | 73.40 | 88.60 | 0.00 | 73.90 | 49.00 | 89.00 | 0.00 | 30.20 | 4050.0 | | Facebook AI Research |
| QwQ-32B-Preview | 70.97 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 90.60 | 0.00 | 320.0 | | Alibaba |
| Phi 4 - 14B | 70.40 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 140.0 | | Microsoft |
| Qwen2.5-32B | 69.23 | 83.30 | 95.90 | 83.10 | 0.00 | 88.40 | 0.00 | 51.20 | 320.0 | | Alibaba |
| Llama3.3-70B-Instruct | 68.90 | 86.00 | 0.00 | 77.00 | 50.50 | 88.40 | 0.00 | 33.30 | 700.0 | | Facebook AI Research |
| Claude3-Opus | 68.45 | 86.80 | 95.00 | 60.10 | 50.40 | 84.90 | 0.00 | 0.00 | Unknown | | Anthropic |
| Qwen3-235B-A22B | 68.18 | 0.00 | 94.39 | 0.00 | 0.00 | 0.00 | 0.00 | 70.70 | 2350.0 | | Alibaba |
| Gemma 3 - 27B (IT) | 67.50 | 76.90 | 0.00 | 89.00 | 42.40 | 87.80 | 0.00 | 29.70 | 270.0 | | Google DeepMind |
| Mistral-Small-3.1-24B-Instruct-2503 | 66.76 | 80.62 | 0.00 | 69.30 | 45.96 | 88.41 | 0.00 | 0.00 | 240.0 | | MistralAI |
| Llama3.1-70B-Instruct | 66.40 | 86.00 | 0.00 | 67.80 | 48.00 | 80.50 | 0.00 | 33.30 | 700.0 | | Facebook AI Research |
| Claude 3.5 Haiku | 65.00 | 77.60 | 0.00 | 69.20 | 41.60 | 88.10 | 0.00 | 0.00 | Unknown | | Anthropic |
| Qwen2.5-14B | 63.69 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 140.0 | | Alibaba |
| Llama 4 Maverick | 62.90 | 85.50 | 0.00 | 61.20 | 0.00 | 0.00 | 0.00 | 0.00 | 4000.0 | | Facebook AI Research |
| GPT-4o mini | 61.70 | 82.00 | 91.30 | 70.20 | 41.10 | 87.20 | 0.00 | 0.00 | Unknown | | OpenAI |
| Llama3.1-405B | 61.60 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 4050.0 | | Facebook AI Research |
| Gemma 3 - 12B (IT) | 60.60 | 0.00 | 0.00 | 83.80 | 40.90 | 0.00 | 0.00 | 24.60 | 120.0 | | Google DeepMind |
| Llama 4 Scout | 58.20 | 79.60 | 0.00 | 50.30 | 0.00 | 0.00 | 0.00 | 0.00 | 1090.0 | | Facebook AI Research |
| Qwen2.5-72B | 58.10 | 86.10 | 91.50 | 62.10 | 45.90 | 59.10 | 0.00 | 0.00 | 727.0 | | Alibaba |
| Claude3-Sonnet | 56.80 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | Unknown | | Anthropic |
| Gemma2-27B | 56.54 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 270.0 | | Google DeepMind |
| Mixtral-8x22B-Instruct-v0.1 | 56.33 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1410.0 | | MistralAI |
| Llama3-70B-Instruct | 56.20 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 700.0 | | Facebook AI Research |
| Phi-4-mini-instruct (3.8B) | 52.80 | 67.30 | 88.60 | 64.00 | 36.00 | 74.40 | 71.80 | 0.00 | 38.0 | | Microsoft |
| Llama3-70B | 52.78 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 700.0 | | Facebook AI Research |
| Llama3.1-70B | 52.47 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 700.0 | | Facebook AI Research |
| Grok-1.5 | 51.00 | 81.30 | 0.00 | 50.60 | 35.90 | 74.10 | 0.00 | 0.00 | Unknown | | xAI |
| C4AI Aya Vision 32B | 47.16 | 72.14 | 0.00 | 69.30 | 33.84 | 62.20 | 0.00 | 0.00 | 320.0 | | CohereAI |
| Qwen2.5-7B | 45.00 | 74.20 | 85.40 | 49.80 | 36.40 | 57.90 | 0.00 | 0.00 | 70.0 | | Alibaba |
| Gemma 2 - 9B | 44.70 | 71.30 | 70.70 | 37.70 | 32.80 | 37.80 | 0.00 | 0.00 | 90.0 | | Google Research |
| Llama3.1-8B-Instruct | 44.00 | 68.10 | 82.40 | 47.60 | 26.30 | 66.50 | 0.00 | 0.00 | 80.0 | | Facebook AI Research |
| Moonlight-16B-A3B-Instruct | 42.40 | 70.00 | 77.40 | 45.30 | 0.00 | 48.10 | 0.00 | 0.00 | 160.0 | | Moonshot AI |
| Llama3.1-8B | 35.40 | 66.60 | 55.30 | 20.50 | 25.80 | 33.50 | 0.00 | 0.00 | 80.0 | | Facebook AI Research |
| Qwen2.5-3B | 34.60 | 65.60 | 79.10 | 42.60 | 24.30 | 42.10 | 0.00 | 0.00 | 30.0 | | Alibaba |
| Mistral-7B-Instruct-v0.3 | 30.90 | 64.20 | 36.20 | 10.20 | 24.70 | 29.30 | 0.00 | 0.00 | 70.0 | | MistralAI |
| Llama-3.2-3B | 25.00 | 54.75 | 34.00 | 8.50 | 26.60 | 28.00 | 0.00 | 0.00 | 32.0 | | Facebook AI Research |
| Grok 3 mini | 0.00 | 0.00 | 0.00 | 0.00 | 65.00 | 0.00 | 0.00 | 0.00 | Unknown | | xAI |
| Kimi-k1.6-IOI-high | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 73.80 | Unknown | | Moonshot AI |
| Qwen3-32B | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 65.70 | 220.0 | | Alibaba |
| Claude Sonnet 3.7-64K Extended Thinking | 0.00 | 0.00 | 0.00 | 0.00 | 84.80 | 0.00 | 96.20 | 0.00 | Unknown | | Anthropic |
| Gemini 2.5 Flash | 0.00 | 0.00 | 0.00 | 0.00 | 78.30 | 0.00 | 0.00 | 63.40 | Unknown | | Google DeepMind |
| Claude Sonnet 3.7 | 0.00 | 0.00 | 0.00 | 0.00 | 68.00 | 0.00 | 82.20 | 0.00 | Unknown | | Anthropic |
| Kimi-k1.6-IOI | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 65.90 | Unknown | | Moonshot AI |
| GPT-4.1 mini | 0.00 | 87.50 | 0.00 | 0.00 | 65.00 | 0.00 | 0.00 | 0.00 | Unknown | | OpenAI |
| GPT-4.1 nano | 0.00 | 80.10 | 0.00 | 0.00 | 50.30 | 0.00 | 0.00 | 0.00 | Unknown | | OpenAI |
| QwQ-Max-Preview | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 65.60 | Unknown | | Alibaba |
| Grok-3 mini - Reasoning | 0.00 | 0.00 | 0.00 | 0.00 | 84.00 | 0.00 | 0.00 | 0.00 | Unknown | | xAI |
| Amazon Nova Pro | 0.00 | 85.90 | 0.00 | 76.60 | 0.00 | 89.00 | 0.00 | 0.00 | Unknown | | Amazon |
| Phi-4-instruct (reasoning-trained) | 0.00 | 0.00 | 0.00 | 0.00 | 49.00 | 0.00 | 90.40 | 0.00 | 38.0 | | Microsoft |
| DeepSeek-R1-Distill-Qwen-7B | 0.00 | 0.00 | 0.00 | 0.00 | 49.50 | 0.00 | 91.40 | 0.00 | 70.0 | | DeepSeek-AI |
| DeepSeek-R1-Distill-Llama-70B | 0.00 | 0.00 | 0.00 | 0.00 | 65.20 | 0.00 | 94.50 | 0.00 | 700.0 | | DeepSeek-AI |
| Kimi k1.5 (Long-CoT) | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 96.20 | 0.00 | Unknown | | Princeton University |
| Grok 3 | 0.00 | 0.00 | 0.00 | 0.00 | 80.20 | 0.00 | 0.00 | 70.60 | Unknown | | xAI |
| Grok-3 - Reasoning Beta | 0.00 | 0.00 | 0.00 | 0.00 | 84.60 | 0.00 | 0.00 | 79.40 | Unknown | | xAI |
| Gemini 2.5 Pro Experimental 03-25 | 0.00 | 0.00 | 0.00 | 0.00 | 84.00 | 0.00 | 0.00 | 70.40 | Unknown | | Google DeepMind |
| OpenAI o3-mini (medium) | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 67.40 | Unknown | | OpenAI |
| OpenAI o3-mini (high) | 0.00 | 86.90 | 0.00 | 97.90 | 79.70 | 97.60 | 97.90 | 69.50 | Unknown | | OpenAI |
| Kimi k1.5 (Short-CoT) | 0.00 | 87.40 | 0.00 | 0.00 | 0.00 | 0.00 | 94.60 | 0.00 | Unknown | | Moonshot AI |
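
When reading the table above, two conventions appear to be in play, and both are assumptions rather than anything the page states. First, a score of 0.00 most likely means "not reported" rather than a true zero (for example, GPT-4o shows 0.00 on GSM8K). Second, the parameter-count column seems to be in units of 100 million, since DeepSeek-R1 is listed as 6710.0. The minimal Python sketch below copies a few rows from the table and shows one way to rank models on a benchmark and compare two models while skipping the 0.00 entries. The helper names (`score`, `rank_by`, `compare`, `params_in_billions`) are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Benchmark columns, in the same order as the table.
BENCHMARKS = ["MMLU Pro", "MMLU", "GSM8K", "MATH", "GPQA Diamond",
              "HumanEval", "MATH-500", "LiveCodeBench"]

# A few rows copied verbatim from the table (scores in column order).
ROWS: Dict[str, List[float]] = {
    "OpenAI o1": [91.04, 91.80, 0.00, 96.40, 77.30, 0.00, 96.40, 71.00],
    "DeepSeek-R1": [84.00, 90.80, 0.00, 0.00, 71.50, 0.00, 97.30, 65.90],
    "GPT-4o": [77.90, 88.70, 0.00, 75.90, 53.60, 90.00, 75.90, 35.10],
    "Qwen2.5-32B": [69.23, 83.30, 95.90, 83.10, 0.00, 88.40, 0.00, 51.20],
    "OpenAI o3-mini (high)": [0.00, 86.90, 0.00, 97.90, 79.70, 97.60, 97.90, 69.50],
}


def score(model: str, benchmark: str) -> Optional[float]:
    """Return the score, or None when the table shows 0.00.

    Assumption: 0.00 marks a missing result, not a real score of zero.
    """
    value = ROWS[model][BENCHMARKS.index(benchmark)]
    return None if value == 0.0 else value


def rank_by(benchmark: str) -> List[Tuple[str, float]]:
    """Rank models on one benchmark, highest first, skipping missing scores."""
    present = [(m, s) for m in ROWS if (s := score(m, benchmark)) is not None]
    return sorted(present, key=lambda pair: pair[1], reverse=True)


def compare(model_a: str, model_b: str) -> Dict[str, Tuple[float, float]]:
    """Return scores only for benchmarks that both models report."""
    shared = {}
    for bench in BENCHMARKS:
        a, b = score(model_a, bench), score(model_b, bench)
        if a is not None and b is not None:
            shared[bench] = (a, b)
    return shared


def params_in_billions(raw: float) -> float:
    """Convert the table's parameter figure to billions.

    Assumption: the column is in units of 100 million parameters.
    """
    return raw / 10


if __name__ == "__main__":
    for name, value in rank_by("MMLU Pro"):
        print(f"{name}: {value:.2f}")
    print(compare("OpenAI o1", "DeepSeek-R1"))
```

Filtering before ranking matters here: sorting the raw column would place every model with a missing score at the bottom as if it had failed the benchmark, and averaging across columns would penalize models that simply report fewer results.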