What is first-frame fidelity?

First-frame fidelity measures how closely the generated video preserves the uploaded reference image at the beginning of the clip.
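
As a rough illustration, first-frame fidelity can be approximated by comparing the reference image with the first frame of the clip, pixel by pixel. The sketch below uses PSNR, which is an assumption made here for illustration only: the leaderboard does not state that it computes any automatic metric, and Arena rankings come from human votes. The function name first_frame_psnr and the nested-list grayscale image format are hypothetical.

```python
import math


def first_frame_psnr(reference, first_frame, max_value=255.0):
    """Return PSNR in dB between two equally sized grayscale images.

    Both images are nested lists of pixel intensities (rows of values).
    Higher values mean the first frame is closer to the reference image.
    """
    same_rows = len(reference) == len(first_frame)
    if not same_rows or any(len(r) != len(f) for r, f in zip(reference, first_frame)):
        raise ValueError("images must have the same dimensions")

    pixel_count = sum(len(row) for row in reference)
    if pixel_count == 0:
        raise ValueError("images must not be empty")

    # Mean squared error over all pixel pairs.
    squared_error = sum(
        (a - b) ** 2
        for ref_row, frame_row in zip(reference, first_frame)
        for a, b in zip(ref_row, frame_row)
    )
    mse = squared_error / pixel_count

    # Identical images have zero error, so PSNR is unbounded.
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    reference = [[10, 20], [30, 40]]
    frame = [[12, 18], [30, 41]]
    print(f"PSNR: {first_frame_psnr(reference, frame):.2f} dB")
```

In practice the first frame would be decoded from the video and resized to the reference resolution before comparison. A perceptual metric may track human judgment better than PSNR.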

Image-to-Video Arena Leaderboard

Name: Image-to-Video Arena Leaderboard
Creator: DataLearner
License: https://creativecommons.org/licenses/by/4.0/

The latest AI image-to-video leaderboard, based on anonymous Arena voting. It covers Elo scores, confidence intervals, and vote counts for leading video animation models.

Top Model

happyhorse-1.0

Top Score

1,445

Model Count

Data version

2026-05-12

Data source: LMArena

About This Leaderboard

This leaderboard ranks AI image-to-video models by animation quality. Data comes from LMArena's Image-to-Video Arena track, evaluated through anonymous blind testing by real users.

Methodology Overview

Blind testing: Users upload an image, two anonymous models generate animated videos, and users vote for the more natural result.

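Elo scoring: Scores are Elo-style ratings based on the Bradley-Terry model, fitted to the pairwise votes. Each score expresses a model's strength in image-to-video tasks relative to the other models, not an absolute quality level.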
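
To make the scoring idea concrete, here is a minimal sketch of fitting Bradley-Terry strengths to pairwise votes and mapping them onto an Elo-like scale. It is not LMArena's actual pipeline, which is not described on this page. The function names, the iterative update used for fitting, the base of 1000, and the scale of 400 are all assumptions chosen for illustration.

```python
import math
from collections import defaultdict


def fit_bradley_terry(votes, iterations=200):
    """Estimate Bradley-Terry strengths from (winner, loser) vote pairs.

    Uses the standard minorization-maximization update
        p_i <- wins_i / sum_j n_ij / (p_i + p_j)
    where n_ij is the number of comparisons between models i and j.
    Assumes every model has at least one win; otherwise its strength
    collapses to zero and cannot be placed on a log scale.
    """
    models = sorted({model for pair in votes for model in pair})
    wins = defaultdict(int)
    comparisons = defaultdict(int)
    for winner, loser in votes:
        wins[winner] += 1
        comparisons[frozenset((winner, loser))] += 1

    strengths = {model: 1.0 for model in models}
    for _ in range(iterations):
        updated = {}
        for i in models:
            denominator = 0.0
            for j in models:
                if i == j:
                    continue
                n_ij = comparisons.get(frozenset((i, j)), 0)
                if n_ij:
                    denominator += n_ij / (strengths[i] + strengths[j])
            updated[i] = wins[i] / denominator if denominator else strengths[i]
        # Rescale so the strengths sum to the number of models.
        total = sum(updated.values())
        strengths = {m: s * len(models) / total for m, s in updated.items()}
    return strengths


def to_elo_scale(strengths, base=1000.0, scale=400.0):
    """Map strengths to Elo-like scores (base and scale are assumptions)."""
    return {m: base + scale * math.log10(s) for m, s in strengths.items()}


def win_probability(score_a, score_b, scale=400.0):
    """Expected probability that model A is preferred over model B."""
    return 1.0 / (1.0 + 10 ** ((score_b - score_a) / scale))


if __name__ == "__main__":
    # Hypothetical votes: each tuple is (preferred model, other model).
    votes = [("model-a", "model-b")] * 3 + [("model-b", "model-a")]
    scores = to_elo_scale(fit_bradley_terry(votes))
    for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(f"{name}: {score:.0f}")
    print(win_probability(scores["model-a"], scores["model-b"]))
```

Under this mapping only score differences matter: two models with equal scores are each expected to win half of their matchups, and the base value just shifts every score by the same amount.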

Ranking Table

Rank	Model	Score	95% CI	Votes	Organization	License
1	happyhorse-1.0	1,445	+/-15	14,889	Alibaba-ATH	Proprietary

Data is for reference only. Official sources are authoritative.

2026-05 Market Signals

Current Best (SOTA)

Grok Imagine Video 720p

Veo 3.1 Audio 1080p

Veo 3.1 Audio

Best China Model

Vidu-Q3-Pro

Wan2.5-I2V-Preview

Kling-2.6-Pro

Best Open Model

Wan-V2.2-A14B

LTX-2-19B

Pika-V2.2

FAQ

What is the difference between image-to-video and text-to-video?

Text-to-video generates a clip from a prompt alone. Image-to-video starts from a reference image, which gives stronger control over subject identity, composition, and visual style.

Which model should I use to animate old photos?

For portrait animation, compare models on facial expression stability, motion naturalness, and identity preservation. Specialized lip-sync tools may be better when speech alignment is the main requirement.

How can I keep characters consistent?

Use a strong reference image as the first frame, keep the prompt specific, and avoid large changes in clothing, camera angle, or style unless the model supports identity conditioning.