Server-rendered comparison results
Models: Hunyuan-TurboS, GPT-4o, Llama3.1-405B Instruct, Claude 3.5 Sonnet New, DeepSeek-V3. Number of benchmarks: 9.
| Model | MMLU | MMLU Pro | HumanEval | BBH | GPQA Diamond | LiveCodeBench | SimpleQA | MATH | MATH-500 |
|---|---|---|---|---|---|---|---|---|---|
| Hunyuan-TurboS | 89.5 | 79 | 91 | 92.2 | 57.5 | 32 | 22.8 | 89.7 | - |
| GPT-4o | 88.7 | 77.9 | 90 | 91.7 | 70.1 | 35.1 | 38.2 | 75.9 | 75.9 |
| Llama3.1-405B Instruct | 88.6 | 73.4 | 89 | 89.2 | 49 | 30.2 | 17.1 | 73.9 | - |
| Claude 3.5 Sonnet New | 88.3 | 78 | 93.7 | 92.6 | 65 | 38.7 | 28.4 | 78.3 | 78 |
| DeepSeek-V3 | 88.5 | 75.9 | 89 | 92.3 | 59.1 | 34.6 | 24.9 | 87.8 | 87.8 |