This page aggregates mainstream LLM evaluation benchmarks, including AIME 2025, SWE-bench Verified, MMLU, GSM8K, and HumanEval, among others. It serves as a reference for researchers and developers comparing model performance across evaluation datasets.

Detailed evaluation results on benchmark leaderboards: View Benchmark Leaderboards


Industry LLM Evaluation Benchmarks
