DeepSeek-V4-Flash Benchmark Details

DeepSeek-V4-Flash's strongest results are currently on LiveCodeBench (rank 4 / 120, score 91.60), MMLU Pro (rank 15 / 126, score 86.40), and GPQA Diamond (rank 29 / 179, score 88.10).
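
The page does not say how the leading results are chosen. One reading that matches the figures is that benchmarks are ordered by relative rank (rank divided by leaderboard size, lower is better). The sketch below works through that arithmetic for a few of the best-ranked entries from the tables on this page; the ordering rule and the names `relative_rank` and `top_benchmarks` are assumptions made for illustration, not something the source states.

```python
# Best (rank, total) per benchmark, copied from the tables on this page.
best_ranks = {
    "SWE-bench Verified": (19, 108),
    "GPQA Diamond": (29, 179),
    "IMO-AnswerBench": (4, 20),
    "LiveCodeBench": (4, 120),
    "MMLU Pro": (15, 126),
}


def relative_rank(rank, total):
    """Position on the leaderboard as a fraction of its size (lower is better)."""
    return rank / total


def top_benchmarks(ranks, n=3):
    """Hypothetical selection rule: the n benchmarks with the smallest relative rank."""
    ordered = sorted(ranks, key=lambda name: relative_rank(*ranks[name]))
    return ordered[:n]


if __name__ == "__main__":
    for name in top_benchmarks(best_ranks):
        rank, total = best_ranks[name]
        print(f"{name}: {rank} / {total} = {relative_rank(rank, total):.3f}")
```

Under this assumption the three benchmarks named in the summary come out first, ahead of entries such as IMO-AnswerBench (4 / 20), which has the same absolute rank as LiveCodeBench but on a much smaller leaderboard.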

Benchmark Results

General Knowledge

11 evaluations

Benchmark               Mode                                Score   Rank / total
GPQA Diamond            Standard Mode                       71.20   107 / 179
GPQA Diamond            (not labeled)                       87.40   36 / 179
GPQA Diamond            (not labeled)                       88.10   29 / 179
MMLU Pro                Standard Mode                       83      43 / 126
MMLU Pro                (not labeled)                       86.40   15 / 126
MMLU Pro                (not labeled)                       86.20   16 / 126
HLE                     Standard Mode                       8.10    142 / 159
HLE                     High                                29.40   79 / 159
HLE                     High, Tools                         40.30   52 / 159
HLE                     Max                                 34.80   63 / 159
HLE                     Thinking Level · Extra High, Tools  45.10   34 / 159

Coding and Software Engineering

14 evaluations

Benchmark               Mode                                Score   Rank / total
(not labeled)           (not labeled)                       2816    5 / 16
(not labeled)           (not labeled)                       3052    3 / 16
LiveCodeBench           Standard Mode                       55.20   82 / 120
LiveCodeBench           (not labeled)                       88.40   8 / 120
LiveCodeBench           (not labeled)                       91.60   4 / 120
SWE-bench Verified      Standard Mode, Tools                73.70   40 / 108
SWE-bench Verified      (not labeled)                       78.60   22 / 108
SWE-bench Verified      Thinking Level · Extra High, Tools  79      19 / 108
SWE-bench Multilingual  Standard Mode, Tools                69.70   16 / 20
SWE-bench Multilingual  (not labeled)                       70.20   14 / 20
SWE-bench Multilingual  Thinking Level · Extra High, Tools  73.30   10 / 20
SWE-Bench Pro - Public  Standard Mode, Tools                49.10   35 / 44
SWE-Bench Pro - Public  (not labeled)                       52.30   28 / 44
SWE-Bench Pro - Public  Thinking Level · Extra High, Tools  52.60   26 / 44

AI Agent - Information Search

2 evaluations

Benchmark               Mode                                Score   Rank / total
BrowseComp              High, Tools                         53.50   33 / 45
BrowseComp              Thinking Level · Extra High, Tools  73.20   21 / 45

AI Agent - Tool Usage

3 evaluations

Benchmark               Mode                                Score   Rank / total
Terminal Bench 2.0      Standard Mode, Tools                49.10   34 / 46
Terminal Bench 2.0      (not labeled)                       56.60   27 / 46
Terminal Bench 2.0      Thinking Level · Extra High, Tools  56.90   25 / 46

Math and Reasoning

3 evaluations

Benchmark               Mode                                Score   Rank / total
IMO-AnswerBench         Standard Mode                       41.90   19 / 20
IMO-AnswerBench         (not labeled)                       85.10   9 / 20
IMO-AnswerBench         (not labeled)                       88.40   4 / 20

Productivity Knowledge

1 evaluation

Benchmark               Mode                                Score   Rank / total
GDPval-AA               Thinking Level · Extra High, Tools  1395    6 / 21