Qwen3-VL-4B-Instruct

Name: Qwen3-VL-4B-Instruct
Availability: InStock
Author: 阿里巴巴

多模态大模型

Release date: 2025-10-15更新于: 2025-10-15 08:23:57985

Live demoGitHub Hugging Face Compare

Parameters

40.0亿

Context length

256K

Chinese support

Not supported

Reasoning ability

Qwen3-VL-4B-Instruct is an AI model published by 阿里巴巴, released on 2025-10-15, for 多模态大模型, with 40.0B parameters, and 256K tokens context length, requiring about 8.89 GB storage, under the Apache 2.0 license.

Data sourced primarily from official releases (GitHub, Hugging Face, papers), then benchmark leaderboards, then third-party evaluators. Learn about our data methodology

Qwen3-VL-4B-Instruct

Model basics

Reasoning traces

Not supported

Thinking modes

Thinking modes not supported

Context length

256K tokens

Max output length

No data

Model type

Qwen3-VL-4B-Instruct

Open source & experience

Code license

Apache 2.0

Weights license

Apache 2.0- 免费商用授权

GitHub repo

https://github.com/QwenLM/Qwen3-VL

Hugging Face

https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct

Qwen3-VL-4B-Instruct

Official resources

Paper

Qwen3 Technical Report

DataLearnerAI blog

No blog post yet

Qwen3-VL-4B-Instruct

API details

API speed

3/5

No public API pricing yet.

Qwen3-VL-4B-Instruct

Benchmark Results

Qwen3-VL-4B-Instruct currently shows benchmark results led by DocVQA (3 / 5, score 95.30), MMMU (25 / 28, score 67.40). This page also consolidates core specs, context limits, and API pricing so you can evaluate the model from benchmark results and deployment constraints together.

多模态理解

2 evaluations

Benchmark / mode

Score

Rank/total

DocVQA

Off

95.30

3 / 5

MMMU

Off

67.40

25 / 28

View benchmark analysis Compare with other models

Qwen3-VL-4B-Instruct

Publisher

阿里巴巴

View publisher details

Qwen3-VL-4B-Instruct

Model Overview

Qwen3-VL 4B 简介

Qwen3-VL 是阿里巴巴 Qwen 团队在 Qwen3 代系下推出的新一代视觉-语言模型，面向文本、图像与视频的联合理解与生成。该代系在长上下文、多模态融合与时空理解等方面进行了系统升级：模型原生支持 256K token 上下文，并可扩展至 1M；在视频理解中强调时间戳对齐，能够对长时序视频进行秒级片段定位；在跨模态对齐方面引入多层次视觉特征融合。

架构与技术要点

多模态骨干：文本与视觉双分支，模型卡与实现代码指向 Qwen3-VL 专用架构；视觉侧采用多层特征融合（DeepStack）以捕获细粒度信息，文本侧采用 Interleaved-MRoPE 以增强长时序/空间位置建模。
上下文窗口：原生 256K，可按官方说明扩展至 1M。
输入/输出模态：支持图像与视频作为输入，输出为文本；支持 OCR 与版面/结构化文档解析、空间/遮挡关系判断与时序事件定位等。
许可：Apache-2.0 开源许可；权重与模型卡已在 Hugging Face 发布。

性能与资源

官方模型卡提供多模态与纯文本基准图表与使用样例；权重与推理代码可通过 Transformers/ModelScope 直接调用。

访问与获取

GitHub：提供 Qwen3-VL 代码与使用示例。
Hugging Face：提供 Qwen3-VL-4B-Instruct 权重与模型卡（仓库文件体积约 8.89 GB）。

DataLearner 官方微信

欢迎关注 DataLearner 官方微信，获得最新 AI 技术推送