ai/qwen3-vl

Verified Publisher

By Docker

Updated 3 months ago

The most advanced Qwen model yet, with major gains in text, vision, video, and reasoning.

Model
7

100K+

ai/qwen3-vl repository overview

Qwen3 VL

GGUF version by Unsloth

logo

Meet Qwen3-VL — the most powerful vision-language model in the Qwen series to date.

This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities.

Key Enhancements:

  • Visual Agent: Operates PC/mobile GUIs—recognizes elements, understands functions, invokes tools, completes tasks.
  • Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos.
  • Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI.
  • Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing.
  • Enhanced Multimodal Reasoning: Excels in STEM/Math—causal analysis and logical, evidence-based answers.
  • Upgraded Visual Recognition: Broader, higher-quality pretraining is able to “recognize everything”—celebrities, anime, products, landmarks, flora/fauna, etc.
  • Expanded OCR: Supports 32 languages (up from 19): robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing.
  • Text Understanding on par with pure LLMs: Seamless text–vision fusion for lossless, unified comprehension.

Model Architecture Updates:

arc

  1. Interleaved-MRoPE: Full‑frequency allocation over time, width, and height via robust positional embeddings, enhancing long‑horizon video reasoning.
  2. DeepStack: Fuses multi‑level ViT features to capture fine‑grained details and sharpen image–text alignment.
  3. Text–Timestamp Alignment: Moves beyond T‑RoPE to precise, timestamp‑grounded event localization for stronger video temporal modeling.

This is the weight repository for Qwen3-VL-8B-Instruct.


Available model variants

Model variantParametersQuantizationContext windowVRAM¹Size
ai/qwen3-vl:8B

ai/qwen3-vl:8B-UD-Q4_K_XL

ai/qwen3-vl:latest
8BMOSTLY_Q4_K_M262K tokens5.91 GiB4.79 GB
ai/qwen3-vl:2B-BF162BMOSTLY_BF16262K tokens4.38 GiB3.21 GB
ai/qwen3-vl:2B-Q8_K_XL2BMOSTLY_Q8_0262K tokens3.34 GiB2.17 GB
ai/qwen3-vl:2B-UD-Q4_K_XL2BMOSTLY_Q4_K_M262K tokens2.22 GiB1.05 GB
ai/qwen3-vl:4B-Q8_K_XL4BMOSTLY_Q8_0262K tokens6.13 GiB4.70 GB
ai/qwen3-vl:8B-Q8_K_XL8BMOSTLY_Q8_0262K tokens10.36 GiB10.08 GB
ai/qwen3-vl:32B-Q8_K_XL32BMOSTLY_Q8_0262K tokens37.46 GiB36.76 GB
ai/qwen3-vl:32B-UD-Q4_K_XL32BMOSTLY_Q4_K_M262K tokens20.41 GiB18.67 GB
ai/qwen3-vl:4B-BF164BMOSTLY_BF16262K tokens8.92 GiB7.49 GB
ai/qwen3-vl:4B-UD-Q4_K_XL4BMOSTLY_Q4_K_M262K tokens3.80 GiB2.37 GB
ai/qwen3-vl:8B-BF168BMOSTLY_BF16262K tokens15.54 GiB15.26 GB

¹: VRAM estimated based on model characteristics.

latest8B

🐳 Using this model with Docker Model Runner

Run the model:

docker model run ai/qwen3-vl

For more information, check out the Docker Model Runner docs.


Tag summary

Content type

Model

Digest

sha256:a18971a77

Size

5.9 GB

Last updated

3 months ago

docker model pull ai/qwen3-vl

This week's pulls

Pulls:

5,521

Last week