Server-rendered comparison results
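Models: GPT OSS 20B, Kimi K2, Qwen3-235B-A22B-Thinking, GPT OSS 120B. Number of benchmarks: 5. Each score is followed by the mode it was reported under (thinking or normal); a dash means no score is listed.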
| Model | MMLU | GPQA Diamond | AIME 2024 | AIME 2025 | HLE |
|---|---|---|---|---|---|
| GPT OSS 20B | 85.3 thinking | 71.5 thinking | - | 79 thinking | 10.9 thinking |
| Kimi K2 | 89.5 normal | 75.1 normal | 69.6 normal | 54 normal | 4.7 normal |
| Qwen3-235B-A22B-Thinking | - | 81.1 thinking | - | 92.3 thinking | 18.2 thinking |
| GPT OSS 120B | 90 thinking | 80.1 thinking | - | 83 thinking | 14.9 thinking |