Claude Sonnet 4 Benchmark Details

Claude Sonnet 4's benchmark results are currently led by SWE-bench Verified (score 80.20, rank 13 / 108), Terminal-Bench (score 41.30, rank 10 / 35), and MMLU Pro (score 84, rank 37 / 126). One source link is attached for reference.

Benchmark Results

General Knowledge

12 evaluations
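Benchmark / mode                        Score    Rank/total
MMLU Pro                                84       37 / 126
[name missing]                          83.80    58 / 179
[name missing]                          75.40    92 / 179
[name missing]                          68       123 / 179
LiveBench (Standard Mode)               50.98    89 / 115
[name missing]                          61.27    65 / 115
[name missing]                          40       46 / 65
[name missing]                          23.80    53 / 65
[name missing]                          9.60     136 / 159
[name missing]                          5.52     150 / 159
[name missing]                          5.90     43 / 59
[name missing]                          1.30     52 / 59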

Coding and Software Engineering

6 evaluations
Benchmark / mode                        Score    Rank/total
CodeClash (Standard Mode, Tools)        1223     4 / 8
SWE-bench Verified                      80.20    13 / 108
[name missing]                          72.70    47 / 108
[name missing]                          66       58 / 120
[name missing]                          48.50    94 / 120

Math and Reasoning

12 evaluations
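Benchmark / mode                        Score    Rank/total
[name missing]                          85       50 / 106
[name missing]                          70.50    71 / 106
[name missing]                          38       95 / 106
[name missing]                          43.40    50 / 62
[name missing]                          27.10    8 / 16
[name missing]                          9.70     5 / 10
[name missing]                          5.20     8 / 10
[name missing]                          4.10     41 / 60
[name missing]                          4        5 / 9
[name missing]                          3.30     6 / 9
[name missing]                          0        72 / 80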

Writing and Creative Capabilities

1 evaluation
Benchmark / mode                        Score    Rank/total
[name missing]                          83.05    14 / 23

AI Agent - Tool Usage

4 evaluations
Benchmark / mode                        Score    Rank/total
[name missing]                          42.20    16 / 18
Terminal-Bench                          41.30    10 / 35
[name missing]                          35.50    18 / 35

Multimodal Understanding

1 evaluation
Benchmark / mode                        Score    Rank/total
[name missing]                          76.50    16 / 28

Common-Sense Reasoning

1 evaluation
Benchmark / mode                        Score    Rank/total
Simple Bench (Thinking Enabled)         45.50    34 / 63

Agent Level Benchmark

4 evaluations
Benchmark / mode                        Score    Rank/total
Aider-Polyglot (Standard Mode)          56.40    26 / 59
[name missing]                          61.30    20 / 59
[name missing]                          52       33 / 40

Instruction Following

1 evaluation
Benchmark / mode                        Score    Rank/total
[name missing]                          55       22 / 29

Productivity Knowledge

1 evaluation
Benchmark / mode                        Score    Rank/total
[name missing]                          33       19 / 21

Long Context

1 evaluation
Benchmark / mode                        Score    Rank/total
[name missing]                          65       10 / 13

Claw-style Agent Evaluation

2 evaluations
Benchmark / mode                        Score    Rank/total
Pinch Bench (Thinking Enabled, Tools)   80.50    22 / 37
Claw Bench (Thinking Enabled, Tools)    77.80    23 / 29

Sources