Claude Sonnet 4 Benchmark Details

Claude Sonnet 4's current benchmark results are led by SWE-bench Verified (rank 13 / 108, score 80.20), Terminal-Bench (rank 10 / 35, score 41.30), and MMLU Pro (rank 37 / 126, score 84). One source link is attached for reference.
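The page does not state how the three leading benchmarks were selected. One reading that is consistent with the listed numbers is that results are ordered by relative placement, that is, rank divided by the number of ranked entries (lower is better). The sketch below checks that reading on a handful of rows copied from this page; the helper name `relative_rank` and the choice of rows are assumptions made for illustration, not something the source describes.

```python
# Hypothetical check: order benchmarks by relative placement (rank / total).
# Rows are (benchmark, rank, total), copied from the results on this page.
rows = [
    ("GPQA Diamond", 58, 179),
    ("Creative Writing", 14, 23),
    ("MMLU Pro", 37, 126),
    ("SWE-bench Verified", 13, 108),
    ("Aider-Polyglot", 20, 59),
    ("Terminal-Bench", 10, 35),
]


def relative_rank(rank: int, total: int) -> float:
    """Return rank as a fraction of the leaderboard size (lower is better)."""
    return rank / total


# Sort from best to worst relative placement.
ranked = sorted(rows, key=lambda row: relative_rank(row[1], row[2]))
top_three = [name for name, _, _ in ranked[:3]]

for name, rank, total in ranked:
    print(f"{name}: {rank} / {total} = {relative_rank(rank, total):.3f}")
```

Under this assumption the first three entries come out as SWE-bench Verified, Terminal-Bench, and MMLU Pro, matching the order in the summary. Raw scores are not comparable across benchmarks (CodeClash, for example, reports 1223 on a different scale), which is why the sketch uses placement rather than score.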

Benchmark Results

General Knowledge

12 evaluations

| Benchmark / mode | Score | Rank / total |
| --- | --- | --- |
| MMLU Pro | 84 | 37 / 126 |
| GPQA Diamond | 83.80 | 58 / 179 |
| GPQA Diamond | 75.40 | 92 / 179 |
| GPQA Diamond |  | 123 / 179 |
| LiveBench (Standard Mode) | 50.98 | 89 / 115 |
| LiveBench (64K) | 61.27 | 65 / 115 |
| ARC-AGI |  | 46 / 65 |
| ARC-AGI | 23.80 | 53 / 65 |
| HLE | 9.60 | 136 / 159 |
| HLE | 5.52 | 150 / 159 |
| ARC-AGI-2 | 5.90 | 43 / 59 |
| ARC-AGI-2 | 1.30 | 52 / 59 |

Coding and Software Engineering

6 evaluations

| Benchmark / mode | Score | Rank / total |
| --- | --- | --- |
| CodeClash (Standard Mode, Tools) | 1223 | 4 / 8 |
| SWE-bench Verified | 80.20 | 13 / 108 |
| SWE-bench Verified | 72.70 | 47 / 108 |
| LiveCodeBench |  | 58 / 120 |
| LiveCodeBench | 48.50 | 94 / 120 |
| SWE-Bench Pro - Public | 42.70 | 38 / 44 |

Math and Reasoning

12 evaluations

| Benchmark / mode | Score | Rank / total |
| --- | --- | --- |
| AIME2025 |  | 50 / 106 |
| AIME2025 | 70.50 | 71 / 106 |
| AIME2025 |  | 95 / 106 |
| AIME 2024 | 43.40 | 50 / 62 |
| IMO-ProofBench | 27.10 | 8 / 16 |
| IMO 2024 | 9.70 | 5 / 10 |
| IMO 2024 | 5.20 | 8 / 10 |
| IMO-ProofBench Advanced | 4.80 | 6 / 8 |
| FrontierMath | 4.10 | 41 / 60 |
| IMO 2025 |  | 5 / 9 |
| IMO 2025 | 3.30 | 6 / 9 |
| FrontierMath - Tier 4 (Standard Mode) |  | 72 / 80 |

Writing and Creative Capabilities

1 evaluation

| Benchmark / mode | Score | Rank / total |
| --- | --- | --- |
| Creative Writing | 83.05 | 14 / 23 |

AI Agent - Tool Usage

4 evaluations

| Benchmark / mode | Score | Rank / total |
| --- | --- | --- |
| OSWorld-Verified | 42.20 | 16 / 18 |
| Terminal-Bench | 41.30 | 10 / 35 |
| Terminal-Bench | 35.50 | 18 / 35 |
| Terminal-Bench |  | 26 / 35 |

Multimodal Understanding

1 evaluation

| Benchmark / mode | Score | Rank / total |
| --- | --- | --- |
| MMMU | 76.50 | 16 / 28 |

Commonsense Reasoning

1 evaluation

| Benchmark / mode | Score | Rank / total |
| --- | --- | --- |
| Simple Bench (Thinking Enabled) | 45.50 | 34 / 63 |

Agent Level Benchmark

4 evaluations

| Benchmark / mode | Score | Rank / total |
| --- | --- | --- |
| τ²-Bench - Telecom |  | 29 / 35 |
| Aider-Polyglot (Standard Mode) | 56.40 | 26 / 59 |
| Aider-Polyglot (32K) | 61.30 | 20 / 59 |
| τ²-Bench |  | 33 / 40 |

Instruction Following

1 evaluation

| Benchmark / mode | Score | Rank / total |
| --- | --- | --- |
| IF Bench |  | 22 / 29 |

Productivity Knowledge

1 evaluation

| Benchmark / mode | Score | Rank / total |
| --- | --- | --- |
| GDPval-AA |  | 19 / 21 |

Long Context

1 evaluation

| Benchmark / mode | Score | Rank / total |
| --- | --- | --- |
| AA-LCR |  | 10 / 13 |

Claw-style Agent Evaluation

2 evaluations

| Benchmark / mode | Score | Rank / total |
| --- | --- | --- |
| Pinch Bench (Thinking Enabled, Tools) | 80.50 | 22 / 37 |
| Claw Bench (Thinking Enabled, Tools) | 77.80 | 23 / 29 |

Sources

artificialanalysis.ai