DeepSeek V3.2 Benchmark Details

DeepSeek V3.2's leading benchmark results are currently LiveCodeBench (score 83.30, rank 21 / 123), AIME2025 (score 93.10, rank 30 / 107), and τ²-Bench (score 80.30, rank 14 / 43). This page also tracks comparisons against 3 predecessor or same-series models. 1 source link is attached for reference.

Benchmark Results

General Knowledge

6 evaluations

Benchmark	Mode	Score	Rank / total
GPQA Diamond	Thinking Mode	82.40	69 / 187
LiveBench	Standard Mode	51.84	87 / 115
LiveBench	Thinking Mode	62.20	58 / 115
ARC-AGI	Thinking Mode	--	41 / 68
HLE	Thinking Mode	25.10	102 / 172
ARC-AGI-2	Thinking Mode	--	51 / 62

Coding and Software Engineering

5 evaluations

Benchmark	Mode	Score	Rank / total
CodeForces	Thinking Mode	2386	11 / 16
LiveCodeBench	Thinking Mode	83.30	21 / 123
SWE-bench Verified	--	73.10	49 / 112
SWE-bench Verified	Thinking Mode	70.20	60 / 112
SWE-Bench Pro - Public	Thinking Mode	40.90	49 / 54

Math and Reasoning

3 evaluations

Benchmark	Mode	Score	Rank / total
AIME2025	Thinking Mode	93.10	30 / 107
AIME 2026	Thinking Mode	92.70	9 / 18
FrontierMath - Tier 4	Thinking Mode	2.10	56 / 80

Agent Level Benchmark

1 evaluation

Benchmark	Mode	Score	Rank / total
τ²-Bench	--	80.30	14 / 43

AI Agent - Information Search

1 evaluation

Benchmark	Mode	Score	Rank / total
BrowseComp	Thinking Mode	51.40	42 / 53

AI Agent - Tool Usage

1 evaluation

Benchmark	Mode	Score	Rank / total
Terminal Bench 2.0	--	46.40	40 / 47

Claw-style Agent Evaluation

2 evaluations

Benchmark	Mode	Score	Rank / total
Pinch Bench	Thinking Mode + Tools	84.30	18 / 37
Claw Bench	Thinking Mode + Tools	--	21 / 29
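
Because each benchmark lists a different number of ranked entries, raw ranks are hard to compare across benchmarks. The following is a minimal sketch, not part of the source page, that converts a rank / total pair into a "top N%" figure. The helper name `rank_percentile` is hypothetical, and it assumes rank 1 is the best entry.

```python
def rank_percentile(rank: int, total: int) -> float:
    """Return rank / total as a percentage (smaller means closer to the top)."""
    if total <= 0 or not 1 <= rank <= total:
        raise ValueError("rank must be between 1 and total")
    return 100.0 * rank / total


# Rank / total pairs copied from the benchmark results on this page.
ranks = {
    "LiveCodeBench": (21, 123),
    "AIME2025": (30, 107),
    "τ²-Bench": (14, 43),
    "SWE-Bench Pro - Public": (49, 54),
}

for name, (rank, total) in ranks.items():
    print(f"{name}: top {rank_percentile(rank, total):.1f}% of listed entries")
```

Read this way, a rank of 14 / 43 and a rank of 30 / 107 sit closer together than the raw rank numbers suggest.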

Compare with other models

Version History

How each version of the DeepSeek V3.2 series stacks up on benchmark tests

The chart shows each model’s highest score per benchmark within the current filter. Out-of-100 benchmarks use raw heights; out-of-range benchmarks are scaled within that benchmark while labels keep the original scores.
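
The page does not state the exact scaling rule, so the following is only a sketch of one plausible reading: scores on a 0 to 100 scale are used directly as bar heights, and a benchmark whose scores exceed 100 (such as a CodeForces rating) is rescaled so that its highest score maps to 100. The function name `bar_heights` and the max-based rule are assumptions.

```python
from typing import Dict, Optional


def bar_heights(scores: Dict[str, Optional[float]],
                max_height: float = 100.0) -> Dict[str, Optional[float]]:
    """Map raw scores for one benchmark to bar heights.

    Assumed rule (not confirmed by the source): if every score fits within
    0..max_height, use raw scores; otherwise scale relative to the highest
    score in that benchmark. Missing scores (None) stay missing.
    """
    present = [s for s in scores.values() if s is not None]
    if not present:
        return dict(scores)
    top = max(present)
    if top <= max_height:
        return dict(scores)  # out-of-100 benchmark: raw heights
    return {m: None if s is None else s / top * max_height
            for m, s in scores.items()}


# GPQA Diamond scores from the comparison table stay unchanged.
gpqa = {"DeepSeek V3.2": 82.40, "DeepSeek-V3.1": 80.10,
        "DeepSeek-V3-0324": 68.40, "DeepSeek-V3": 59.10}
print(bar_heights(gpqa))

# CodeForces is a rating, not a percentage. "other-model" and its rating
# are made up purely to illustrate the scaling.
codeforces = {"DeepSeek V3.2": 2386, "other-model": 1193}
print(bar_heights(codeforces))
```

Under this rule the bar labels would still show the original scores (for example 2386) while only the bar height is rescaled.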

7 benchmarks have comparable scores. Each model shows its best score, with the mode label next to it.

Benchmark	Category	DeepSeek V3.2 (current)	DeepSeek-V3.1	DeepSeek-V3-0324	DeepSeek-V3
ARC-AGI	General Knowledge	57.00 (Thinking Enabled)	--	9.00 (Standard Mode)	--
GPQA Diamond	General Knowledge	82.40 (Thinking Enabled)	80.10 (Thinking Enabled)	68.40 (Standard Mode)	59.10 (Standard Mode)
HLE	General Knowledge	25.10 (Thinking Enabled)	15.90 (Thinking Enabled)	5.20 (Standard Mode)	--
LiveCodeBench	Coding and Software Engineering	83.30 (Thinking Enabled)	74.80 (Thinking Enabled)	49.20 (Standard Mode)	34.60 (Standard Mode)
SWE-bench Verified	Coding and Software Engineering	73.10 (Thinking Enabled + Tools)	66.00 (Standard Mode)	38.80 (Standard Mode)	--
AIME2025	Math and Reasoning	93.10 (Thinking Enabled)	88.40 (Thinking Enabled)	47.70 (Standard Mode)	--
τ²-Bench	Agent Level Benchmark	80.30 (Thinking Enabled + Tools)	--	38.80 (Standard Mode + Tools)	--

Single-Benchmark Version Trend

Viewing: ARC-AGI (General Knowledge)

Modes: Normal, Normal + Tools, Thinking, Thinking + Tools, Deep, Deep + Tools

X-axis shows model and release date, Y-axis shows score; solid lines connect the same mode across versions, while dotted guides align modes within the same generation.

Standard API Pricing Across the DeepSeek V3.2 Series

Shows standard text input and output pricing side by side for each model. If extended-context pricing exists, the chart keeps the base rate and explains the threshold below.

Source: DataLearnerAI. Standard text prices shown here use the default supplier. Prices are in USD per 1M tokens.

Model	Supplier	Standard input	Standard output	Base price applies to
DeepSeek V3.2	DeepSeek-AI	$0.28 / 1M tokens	$0.42 / 1M tokens	--
DeepSeek-V3.1	Fireworks AI	$0.56 / 1M tokens	$1.68 / 1M tokens	--
DeepSeek-V3-0324	DeepInfra	$0.20 / 1M tokens	$0.88 / 1M tokens	--
DeepSeek-V3	DeepSeek-AI	$0.27 / 1M tokens	$1.10 / 1M tokens	--
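
To make the per-1M-token prices concrete, here is a small sketch that estimates the cost of a workload from the listed rates. The `PRICES` dictionary copies the table above. The function name `workload_cost` and the example token counts are hypothetical, and the calculation ignores any extended-context or cached-input pricing a supplier may apply.

```python
# (input USD per 1M tokens, output USD per 1M tokens), copied from the table.
PRICES = {
    "DeepSeek V3.2": (0.28, 0.42),
    "DeepSeek-V3.1": (0.56, 1.68),
    "DeepSeek-V3-0324": (0.20, 0.88),
    "DeepSeek-V3": (0.27, 1.10),
}


def workload_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate cost in USD at the standard (base) rates for one model."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# Hypothetical workload: 2M input tokens and 0.5M output tokens.
for model in PRICES:
    cost = workload_cost(model, 2_000_000, 500_000)
    print(f"{model}: ${cost:.2f}")
```

Because output tokens are priced higher than input tokens for every model listed, the ranking by cost can change with the input-to-output ratio of the workload.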

Sources

arcprize.org