Composer 2.5 Benchmark Details

Composer 2.5's listed benchmark results are led by SWE-bench Multilingual (score 79.80, rank 2 of 20) and Terminal Bench 2.0 (score 69.30, rank 7 of 46). This page also compares it with 3 competitor models and 3 earlier models in the same series, including performance and pricing views where available. 1 source link is attached for reference.

Benchmark Results

Composer 2.5

Thinking

AI Agent - Tool Usage

1 evaluation
Benchmark / mode | Score | Rank/total
Terminal Bench 2.0 (Thinking Mode) | 69.30 | 7 / 46

Coding and Software Engineering

1 evaluation
Benchmark / mode | Score | Rank/total
SWE-bench Multilingual (Thinking Mode) | 79.80 | 2 / 20

Competitor Comparison

Benchmark scores for Composer 2.5 compared against top models in its class

The chart shows each model’s highest score per benchmark within the current filter. Out-of-100 benchmarks use raw heights; out-of-range benchmarks are scaled within that benchmark while labels keep the original scores.

2 benchmarks with comparable scores. Each model shows its best score; mode label is displayed below.
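
The selection and scaling rules described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `best_scores` and `bar_height` are hypothetical, and the page does not say exactly how out-of-range benchmarks are scaled, so the sketch assumes division by the largest score shown for that benchmark.

```python
def best_scores(rows):
    """Keep each model's highest score per benchmark.

    rows: iterable of (benchmark, model, mode, score) tuples.
    Returns {(benchmark, model): (score, mode)}.
    """
    best = {}
    for benchmark, model, mode, score in rows:
        key = (benchmark, model)
        if key not in best or score > best[key][0]:
            best[key] = (score, mode)
    return best


def bar_height(score, benchmark_scores, out_of_100=True):
    """Return a bar height on a 0-100 axis.

    Out-of-100 benchmarks use the raw score. For other scales this sketch
    ASSUMES scaling by the largest score shown for that benchmark; the
    page does not state its exact rule.
    """
    if out_of_100:
        return score
    return 100.0 * score / max(benchmark_scores)


# Terminal Bench 2.0 scores as listed on this page.
rows = [
    ("Terminal Bench 2.0", "Composer 2.5", "Thinking Enabled", 69.30),
    ("Terminal Bench 2.0", "Opus 4.7", "Extended Thinking | Tools", 69.40),
    ("Terminal Bench 2.0", "Kimi K2.6", "Thinking Enabled | Tools", 66.70),
]
best = best_scores(rows)
```

In both cases the label printed on the chart stays the original score; only the bar height changes for benchmarks that are not scored out of 100.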

Benchmark | Category | Composer 2.5 (current) | Opus 4.7 | Kimi K2.6
Terminal Bench 2.0 | AI Agent - Tool Usage | 69.30 (Thinking Enabled) | 69.40 (Extended Thinking, Tools) | 66.70 (Thinking Enabled, Tools)
SWE-bench Multilingual | Coding and Software Engineering | 79.80 (Thinking Enabled) | -- | 76.70 (Thinking Enabled, Tools)

Standard API Pricing: Composer 2.5 vs. Peer Models

Shows standard text input and output pricing side by side for each model. If extended-context pricing exists, the chart keeps the base rate and explains the threshold below.

Source: DataLearnerAI. Standard text prices shown here use the default supplier. · USD / 1M tokens

Model | Supplier | Standard input | Standard output | Base price applies to
Opus 4.7 | Anthropic | $5 / 1M tokens | $25 / 1M tokens | --
GPT-5.5 | OpenAI | $5 / 1M tokens | $30 / 1M tokens | --
Kimi K2.6 | Moonshot AI | $0.95 / 1M tokens | $4 / 1M tokens | --
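
To make the per-million-token prices easier to compare, here is a minimal sketch that converts them into the cost of a single workload. The prices are the standard rates listed above; the helper name `request_cost` and the workload size (2M input tokens, 500k output tokens) are hypothetical and chosen only for illustration.

```python
# Standard prices listed on this page, in USD per 1M tokens: (input, output).
PRICES = {
    "Opus 4.7": (5.00, 25.00),
    "GPT-5.5": (5.00, 30.00),
    "Kimi K2.6": (0.95, 4.00),
}


def request_cost(model, input_tokens, output_tokens):
    """Return the USD cost of a workload at the model's standard rates."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# Hypothetical workload: 2M input tokens and 500k output tokens.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 2_000_000, 500_000):.2f}")
```

For this workload the sketch gives $22.50 for Opus 4.7, $25.00 for GPT-5.5, and $3.90 for Kimi K2.6. It ignores any extended-context pricing tiers, which this page does not list.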

Version History

How each version of the Composer 2.5 series stacks up on benchmark tests

Benchmark | Category | Composer 2.5 (current) | Composer 2 | Composer 1.5 | Composer 1
Terminal Bench 2.0 | AI Agent - Tool Usage | 69.30 (Thinking Enabled) | 61.70 (Thinking Enabled) | 47.90 (Thinking Enabled) | 40.00 (Thinking Enabled)
SWE-bench Multilingual | Coding and Software Engineering | 79.80 (Thinking Enabled) | 73.70 (Thinking Enabled) | 65.90 (Thinking Enabled) | 56.90 (Thinking Enabled)
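
The version-to-version gains implied by these scores can be worked out with a short sketch. The scores are the ones listed above; the helper name `version_gains` is hypothetical, and the ordering Composer 1, 1.5, 2, 2.5 is assumed from the version names.

```python
# Scores listed on this page, ordered from oldest to newest version.
SCORES = {
    "Terminal Bench 2.0": [
        ("Composer 1", 40.00), ("Composer 1.5", 47.90),
        ("Composer 2", 61.70), ("Composer 2.5", 69.30),
    ],
    "SWE-bench Multilingual": [
        ("Composer 1", 56.90), ("Composer 1.5", 65.90),
        ("Composer 2", 73.70), ("Composer 2.5", 79.80),
    ],
}


def version_gains(series):
    """Return (version, gain over the previous version) for each step."""
    return [
        (name, round(score - prev_score, 2))
        for (_, prev_score), (name, score) in zip(series, series[1:])
    ]


for benchmark, series in SCORES.items():
    print(benchmark, version_gains(series))
```

On Terminal Bench 2.0 the gains are 7.9, 13.8, and 7.6 points; on SWE-bench Multilingual they are 9.0, 7.8, and 6.1 points. In total, Composer 2.5 scores 29.3 points above Composer 1 on Terminal Bench 2.0 and 22.9 points above it on SWE-bench Multilingual.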

Single-Benchmark Version Trend

[Chart: Terminal Bench 2.0 (AI Agent - Tool Usage) score by model version. X-axis: model and release date. Y-axis: score. Modes: Normal, Normal + Tools, Thinking, Thinking + Tools, Deep, Deep + Tools. Solid lines connect the same mode across versions; dotted guides align modes within the same generation.]

Standard API Pricing Across the Composer 2.5 Series

Source: DataLearnerAI. Standard text prices shown here use the default supplier. · USD / 1M tokens

Model | Supplier | Standard input | Standard output | Base price applies to
Composer 2 | Cursor | $0.5 / 1M tokens | $2.5 / 1M tokens | --
Composer 1.5 | Cursor | $3.5 / 1M tokens | $17.5 / 1M tokens | --
Composer 1 | Cursor | $1.25 / 1M tokens | $10 / 1M tokens | --

Sources