Gemma 4 31B vs Qwen3.5-27B

Across 5 shared benchmarks, Qwen3.5-27B leads overall: Gemma 4 31B wins 0, Qwen3.5-27B wins 5, with 0 ties and an average score difference of -5.38 (Gemma 4 31B score minus Qwen3.5-27B score).

Gemma 4 31B

DeepMind · 2026-04-02 · Chat model

Qwen3.5-27B

Alibaba · 2026-02-25 · Reasoning model

Gemma 4 31B: 0 wins (0%) · Qwen3.5-27B: 5 wins (100%)

Benchmark scores

Grouped by capability, sorted by largest gap within each. 5 shared benchmarks.

General Knowledge

Qwen3.5-27B 3/3
| Benchmark | Gemma 4 31B | Qwen3.5-27B | Diff |
| --- | --- | --- | --- |
| HLE | 26.50 · 83 / 157 · Thinking (With Tools + Internet) | 48.50 · 26 / 157 · Thinking (With Tools) | -22 |
| GPQA Diamond | 84.30 · 53 / 178 · Thinking (No Tools) | 85.50 · 47 / 178 · Thinking (No Tools) | -1.20 |
| MMLU Pro | 85.20 · 23 / 126 · Thinking (No Tools) | 86.10 · 18 / 126 · Thinking (No Tools) | -0.90 |

Agent Level Benchmark

Qwen3.5-27B 1/1
| Benchmark | Gemma 4 31B | Qwen3.5-27B | Diff |
| --- | --- | --- | --- |
| τ²-Bench | 76.90 · 19 / 40 · Thinking (With Tools) | 79 · 17 / 40 · Thinking (With Tools) | -2.10 |

Coding and Software Engineer

Qwen3.5-27B 1/1
| Benchmark | Gemma 4 31B | Qwen3.5-27B | Diff |
| --- | --- | --- | --- |
| LiveCodeBench | 80 · 30 / 120 · Thinking (No Tools) | 80.70 · 27 / 120 · Thinking (With Tools) | -0.70 |
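The grouping and ordering used in the benchmark tables above (group by capability, then sort by largest gap within each group) can be sketched in a few lines of Python. This is an illustration only, not the page's actual generator: the `ROWS` layout and the `group_and_sort` name are assumptions, and only the scores are taken from the tables.

```python
from collections import defaultdict

# (capability, benchmark, Gemma 4 31B score, Qwen3.5-27B score).
# Scores are copied from the tables; the tuple layout is an assumption.
ROWS = [
    ("General Knowledge", "GPQA Diamond", 84.30, 85.50),
    ("General Knowledge", "MMLU Pro", 85.20, 86.10),
    ("General Knowledge", "HLE", 26.50, 48.50),
    ("Agent Level Benchmark", "τ²-Bench", 76.90, 79.00),
    ("Coding and Software Engineer", "LiveCodeBench", 80.00, 80.70),
]


def group_and_sort(rows):
    """Group rows by capability and sort each group by absolute gap, largest first."""
    groups = defaultdict(list)
    for capability, name, gemma, qwen in rows:
        # Diff follows the page's sign convention: Gemma 4 31B minus Qwen3.5-27B.
        diff = round(gemma - qwen, 2)
        groups[capability].append((name, gemma, qwen, diff))
    for entries in groups.values():
        entries.sort(key=lambda entry: abs(entry[3]), reverse=True)
    return dict(groups)


grouped = group_and_sort(ROWS)
for capability, entries in grouped.items():
    print(capability)
    for name, gemma, qwen, diff in entries:
        print(f"  {name}: {gemma:.2f} vs {qwen:.2f} ({diff:+.2f})")
```

Sorting on the absolute value of the difference keeps the ordering independent of which model happens to lead in a given row.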

Specs

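| Field | Gemma 4 31B | Qwen3.5-27B |
| --- | --- | --- |
| Publisher | DeepMind | Alibaba |
| Release date | 2026-04-02 | 2026-02-25 |
| Model type | Chat model | Reasoning model |
| Architecture | Dense | Dense |
| Parameters | 31B | 27B |
| Context length | 256K | 1010K |
| Max output | 32K | 248320 |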

Summary

  • Qwen3.5-27B leads in: General Knowledge (3/3), Agent Level Benchmark (1/1), Coding and Software Engineer (1/1)

On average across the 5 shared benchmarks, Qwen3.5-27B scores 5.38 higher.

Largest single-benchmark gap: HLE, where Gemma 4 31B scores 26.50 vs Qwen3.5-27B 48.50 (-22).
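The summary figures can be re-derived from the five score pairs. The sketch below is a minimal check under stated assumptions: the `SCORES` mapping and the `summarize` helper are hypothetical names, and the difference is taken as Gemma 4 31B minus Qwen3.5-27B, which matches the sign of the reported values.

```python
# benchmark -> (Gemma 4 31B score, Qwen3.5-27B score), copied from the page.
SCORES = {
    "HLE": (26.50, 48.50),
    "GPQA Diamond": (84.30, 85.50),
    "MMLU Pro": (85.20, 86.10),
    "τ²-Bench": (76.90, 79.00),
    "LiveCodeBench": (80.00, 80.70),
}


def summarize(scores):
    """Return win counts, ties, mean difference, and the largest-gap benchmark."""
    # Round each difference to two decimals to avoid floating-point noise.
    diffs = {name: round(a - b, 2) for name, (a, b) in scores.items()}
    largest = max(diffs, key=lambda name: abs(diffs[name]))
    return {
        "gemma_wins": sum(d > 0 for d in diffs.values()),
        "qwen_wins": sum(d < 0 for d in diffs.values()),
        "ties": sum(d == 0 for d in diffs.values()),
        "mean_diff": round(sum(diffs.values()) / len(diffs), 2),
        "largest_gap": (largest, diffs[largest]),
    }


summary = summarize(SCORES)
print(summary)
```

With the scores listed on this page, the five differences are -22, -1.20, -0.90, -2.10 and -0.70, which sum to -26.90 and average to -5.38, in agreement with the figures stated above.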

Page generated from structured model, pricing and benchmark records. No real-time LLM is used to write the prose.