Composer 2 Benchmark Details

Composer 2's headline benchmark results are Terminal Bench 2.0 (score 61.70, rank 15 / 46) and SWE-bench Multilingual (score 73.70, rank 9 / 20). This page also compares it with 3 competitor models and 2 earlier models in the same series, with performance and pricing views where available. 1 source link is attached for reference.

Benchmark Results

AI Agent - Tool Usage

1 evaluation

| Benchmark | Mode | Score | Rank / total |
|---|---|---|---|
| Terminal Bench 2.0 | Thinking Mode | 61.70 | 15 / 46 |

Coding and Software Engineering

1 evaluation

| Benchmark | Mode | Score | Rank / total |
|---|---|---|---|
| SWE-bench Multilingual | Thinking Mode | 73.70 | 9 / 20 |
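
To put the two rank/total figures on a common footing, the following minimal sketch converts a rank into a "top N%" share. The function name `top_fraction` is hypothetical and not part of the source page; the rank and total values are the ones reported above.

```python
def top_fraction(rank: int, total: int) -> float:
    """Return rank / total as a percentage (lower means closer to the top)."""
    if not 1 <= rank <= total:
        raise ValueError("rank must be between 1 and total")
    return 100.0 * rank / total


# Rank and total as reported on this page.
results = {
    "Terminal Bench 2.0": (15, 46),
    "SWE-bench Multilingual": (9, 20),
}

for name, (rank, total) in results.items():
    print(f"{name}: rank {rank} of {total}, top {top_fraction(rank, total):.1f}%")
```

Because the two leaderboards have different sizes (46 and 20 entries), the percentage is only a rough way to compare positions across them.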

Competitor Comparison

Benchmark scores for Composer 2 compared against top models in its class

The chart shows each model’s highest score per benchmark within the current filter. Out-of-100 benchmarks use raw heights; out-of-range benchmarks are scaled within that benchmark while labels keep the original scores.
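
The scaling rule described above could be sketched as follows. The page does not state the exact formula, so this is an assumption: "scaled within that benchmark" is read here as dividing each score by the largest score shown for that benchmark. The function name `bar_heights` and the out-of-range example values are hypothetical.

```python
def bar_heights(scores, out_of_100=True):
    """Map raw scores to bar heights on a 0-100 axis.

    ASSUMPTION: for out-of-range benchmarks, heights are scaled relative to
    the largest score in that benchmark. The source page does not give the
    formula. Labels would still show the original scores.
    """
    present = {model: s for model, s in scores.items() if s is not None}
    if out_of_100 or not present:
        # Out-of-100 benchmarks use the raw score as the height.
        return dict(present)
    top = max(present.values())
    if top <= 0:
        return {model: 0.0 for model in present}
    return {model: 100.0 * s / top for model, s in present.items()}


# Terminal Bench 2.0 is an out-of-100 benchmark, so heights equal the scores.
terminal_bench = {
    "Composer 2": 61.70,
    "GPT-5.4": 75.10,
    "Claude Opus 4.6": 65.40,
    "Kimi K2.5": 50.80,
}
print(bar_heights(terminal_bench))

# Hypothetical out-of-range benchmark (values invented for illustration only).
print(bar_heights({"Model A": 1200.0, "Model B": 900.0}, out_of_100=False))
```

Models without a score (shown as "--" on the page) are represented as `None` and simply get no bar in this sketch.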

2 benchmarks with comparable scores. Each model shows its best score; mode label is displayed below.

| Benchmark | Category | Composer 2 (current) | GPT-5.4 | Claude Opus 4.6 | Kimi K2.5 |
|---|---|---|---|---|---|
| Terminal Bench 2.0 | AI Agent - Tool Usage | 61.70 (Thinking Enabled) | 75.10 (Thinking Level · Extra High, Tools) | 65.40 (Extended Thinking, Tools) | 50.80 (Thinking Enabled, Tools) |
| SWE-bench Multilingual | Coding and Software Engineering | 73.70 (Thinking Enabled) | -- | 72.00 (Extended Thinking, Tools) | -- |

Standard API Pricing: Composer 2 vs. Peer Models

Shows standard text input and output pricing side by side for each model. If extended-context pricing exists, the chart keeps the base rate and explains the threshold below.

Source: DataLearnerAI. Standard text prices shown here use the default supplier. · USD / 1M tokens

When a context threshold exists, the charted base price only applies within these limits:

GPT-5.4: Base price applies to <= 272K
Claude Opus 4.6: Base price applies to <= 200K

| Model | Supplier | Standard input | Standard output | Base price applies to |
|---|---|---|---|---|
| Composer 2 | Cursor | $0.5 / 1M tokens | $2.5 / 1M tokens | -- |
| GPT-5.4 | OpenAI | $2.5 / 1M tokens | $15 / 1M tokens | <= 272K |
| Claude Opus 4.6 | Anthropic | $5 / 1M tokens | $25 / 1M tokens | <= 200K |
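
As a rough illustration of how the standard rates translate into per-request cost, here is a minimal sketch using the prices listed above. Two assumptions are made that the page does not confirm: the context threshold is compared against the input token count, and "272K" / "200K" are read as 272,000 / 200,000 tokens. The names `PRICES` and `request_cost` are hypothetical.

```python
from typing import Optional

# model -> (input USD per 1M tokens, output USD per 1M tokens,
#           base-price context limit in tokens, or None if no limit is listed)
PRICES = {
    "Composer 2": (0.5, 2.5, None),
    "GPT-5.4": (2.5, 15.0, 272_000),
    "Claude Opus 4.6": (5.0, 25.0, 200_000),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> Optional[float]:
    """Estimate the USD cost of one request at the standard (base) rate.

    Returns None when the input exceeds the base-price limit, because the
    extended-context rates are not listed on this page.
    ASSUMPTION: the limit applies to the input token count.
    """
    input_price, output_price, limit = PRICES[model]
    if limit is not None and input_tokens > limit:
        return None
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Example workload: 100K input tokens and 10K output tokens.
for model in PRICES:
    cost = request_cost(model, 100_000, 10_000)
    print(f"{model}: ${cost:.3f}")
```

For this example workload the estimate comes to about $0.075 for Composer 2, $0.40 for GPT-5.4, and $0.75 for Claude Opus 4.6.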

Version History

How each version of the Composer 2 series stacks up on benchmark tests

2 benchmarks with comparable scores. Each model shows its best score; mode label is displayed below.

| Benchmark | Category | Composer 2 (current) | Composer 1.5 | Composer 1 |
|---|---|---|---|---|
| Terminal Bench 2.0 | AI Agent - Tool Usage | 61.70 (Thinking Enabled) | 47.90 (Thinking Enabled) | 40.00 (Thinking Enabled) |
| SWE-bench Multilingual | Coding and Software Engineering | 73.70 (Thinking Enabled) | 65.90 (Thinking Enabled) | 56.90 (Thinking Enabled) |
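
The version-to-version improvement can be read off the scores above with simple subtraction. The sketch below does that arithmetic; the names `VERSIONS`, `SCORES`, and `gains` are hypothetical, and the scores are the Thinking Enabled values from this page.

```python
# Oldest to newest, matching the order of each score list below.
VERSIONS = ["Composer 1", "Composer 1.5", "Composer 2"]

SCORES = {
    "Terminal Bench 2.0": [40.00, 47.90, 61.70],
    "SWE-bench Multilingual": [56.90, 65.90, 73.70],
}


def gains(scores):
    """Return the point gain between each pair of consecutive versions."""
    return [round(later - earlier, 2) for earlier, later in zip(scores, scores[1:])]


for benchmark, scores in SCORES.items():
    for (old, new), gain in zip(zip(VERSIONS, VERSIONS[1:]), gains(scores)):
        print(f"{benchmark}: {old} -> {new}: +{gain:.2f} points")
```

On these numbers, Terminal Bench 2.0 rises by 7.90 and then 13.80 points, while SWE-bench Multilingual rises by 9.00 and then 7.80 points.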

Single-Benchmark Version Trend

[Chart: Terminal Bench 2.0 (AI Agent - Tool Usage) score by model version. Available modes: Normal, Normal + Tools, Thinking, Thinking + Tools, Deep, Deep + Tools.]

X-axis shows model and release date, Y-axis shows score; solid lines connect the same mode across versions, while dotted guides align modes within the same generation.

Standard API Pricing Across the Composer 2 Series

Source: DataLearnerAI. Standard text prices shown here use the default supplier. · USD / 1M tokens

| Model | Supplier | Standard input | Standard output | Base price applies to |
|---|---|---|---|---|
| Composer 2 | Cursor | $0.5 / 1M tokens | $2.5 / 1M tokens | -- |
| Composer 1.5 | Cursor | $3.5 / 1M tokens | $17.5 / 1M tokens | -- |
| Composer 1 | Cursor | $1.25 / 1M tokens | $10 / 1M tokens | -- |

Sources