Grok 4.1 Benchmark Details

Grok 4.1's top listed benchmark result is SWE-bench Verified, with a score of 54.60 (rank 87 of 111). This page also tracks comparisons against 3 predecessor or same-series models. One source link is attached for reference.

Benchmark Results

Coding and Software Engineering

1 evaluation

Benchmark	Mode	Score	Rank / total
SWE-bench Verified	Standard Mode	54.60	87 / 111

Version History

How each version of the Grok 4.1 series stacks up on benchmark tests

Models compared: Grok 4.1, GPT-4o (2024-11-20), GPT-4o, GPT-4

The chart shows each model’s highest score per benchmark within the current filter. Out-of-100 benchmarks use raw heights; out-of-range benchmarks are scaled within that benchmark while labels keep the original scores.
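The scaling rule described above can be sketched roughly as follows. This is a minimal illustration, not the site's actual implementation: the function name `bar_heights`, the `out_of_100` flag, and the choice to scale by the highest score within the benchmark are all assumptions, and the out-of-range example values are hypothetical.

```python
def bar_heights(scores, out_of_100=True):
    """Return chart bar heights on a 0-100 axis for one benchmark.

    scores: dict mapping model name -> original score (used for labels).
    out_of_100: True if the benchmark is already scored out of 100.
    """
    if out_of_100:
        # Out-of-100 benchmarks: the raw score is used directly as the height.
        return dict(scores)
    # Assumed rule: scale relative to the best score within this benchmark.
    top = max(scores.values())
    if top <= 0:
        return {model: 0.0 for model in scores}
    return {model: 100.0 * score / top for model, score in scores.items()}


# Scores listed on this page (SWE-bench Verified, out of 100).
swe_bench = {"Grok 4.1": 54.60, "GPT-4o": 31.00}
print(bar_heights(swe_bench))

# Hypothetical out-of-range benchmark (for example, a rating-style score).
hypothetical = {"Model A": 1200, "Model B": 600}
print(bar_heights(hypothetical, out_of_100=False))
```

In either case the labels would still show the original scores; only the bar heights change.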

1 benchmark with comparable scores. Each model shows its best score, with the mode label displayed below it.

Benchmark	Category	Grok 4.1 (current)	GPT-4o
SWE-bench Verified	Coding and Software Engineering	54.60 (Standard Mode)	31.00 (Standard Mode)

Single-Benchmark Version Trend

Viewing: SWE-bench Verified · Coding and Software Engineering

Modes: Normal, Normal + Tools, Thinking, Thinking + Tools, Deep, Deep + Tools

X-axis shows model and release date, Y-axis shows score; solid lines connect the same mode across versions, while dotted guides align modes within the same generation.

Standard API Pricing Across the Grok 4.1 Series

Shows standard text input and output pricing side by side for each model. If extended-context pricing exists, the chart keeps the base rate and explains the threshold below.

Source: DataLearnerAI. Standard text prices shown here use the default supplier. · USD / 1M tokens

Model	Supplier	Standard input	Standard output	Base price applies to
GPT-4o	Microsoft Azure	$2.5 / 1M tokens	$10 / 1M tokens	—
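
As a rough illustration of how the per-1M-token rates in the table translate into a request cost, the sketch below uses the GPT-4o row ($2.5 input, $10 output). The function name `estimate_cost_usd` and the token counts are hypothetical, and the calculation ignores any extended-context or discount pricing.

```python
def estimate_cost_usd(input_tokens, output_tokens,
                      input_price_per_m=2.5, output_price_per_m=10.0):
    """Estimate the cost in USD from token counts and USD-per-1M-token rates.

    Defaults are the GPT-4o standard rates listed in the table above.
    """
    input_cost = input_tokens * input_price_per_m / 1_000_000
    output_cost = output_tokens * output_price_per_m / 1_000_000
    return input_cost + output_cost


# Hypothetical workload: 200,000 input tokens and 50,000 output tokens.
cost = estimate_cost_usd(200_000, 50_000)
print(f"Estimated cost: ${cost:.2f}")
```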

Sources

openai.com