Gemma 4 31BvsKimi K2.5

Across 5 shared benchmarks, Kimi K2.5 leads overall: Gemma 4 31B wins 1, Kimi K2.5 wins 4, with 0 ties and an average score difference of -5.72.

DeepMind
Gemma 4 31B

DeepMind · 2026-04-02 · Chat model

Moonshot AI
Kimi K2.5

Moonshot AI · 2026-01-27 · Multimodal model

Gemma 4 31B1 win(20%)(80%)4 winsKimi K2.5

Benchmark scores

Grouped by capability, sorted by largest gap within each. 5 shared benchmarks.

General Knowledge

Kimi K2.5 2/3
BenchmarkGemma 4 31BKimi K2.5Diff
HLE26.5083 / 157Thinking (With Tools + Internet)50.2020 / 157Thinking (With Tools)-23.70
MMLU Pro85.2023 / 126Thinking (No Tools)78.5066 / 126Thinking (No Tools)+6.70
GPQA Diamond84.3053 / 178Thinking (No Tools)87.6034 / 178Thinking (No Tools)-3.30

Coding and Software Engineer

Kimi K2.5 1/1
BenchmarkGemma 4 31BKimi K2.5Diff
LiveCodeBench8030 / 120Thinking (No Tools)8516 / 120Thinking (No Tools)-5

Math and Reasoning

Kimi K2.5 1/1
BenchmarkGemma 4 31BKimi K2.5Diff
AIME 202689.2013 / 14Thinking (No Tools)92.5010 / 14Thinking (No Tools)-3.30

Specs

FieldGemma 4 31BKimi K2.5
PublisherDeepMindMoonshot AI
Release date2026-04-022026-01-27
Model typeChat modelMultimodal model
ArchitectureDenseMoE
Parameters3.1B1T
Context length256K256K
Max output32K16K

Summary

  • Kimi K2.5leads in:General Knowledge (2/3), Coding and Software Engineer (1/1), Math and Reasoning (1/1)

On average across the 5 shared benchmarks, Kimi K2.5 scores 5.72 higher.

Largest single-benchmark gap: HLE — Gemma 4 31B 26.50 vs Kimi K2.5 50.20 (-23.70).

Page generated from structured model, pricing and benchmark records. No real-time LLM is used to write the prose.