Flagship models (Singapore region)
Flagship model |
Ideal for complex tasks, most powerful. |
Balanced performance, speed, and cost. |
Ideal for simple tasks, fast and low-cost. |
Excellent code model, excels at tool calling and environment interaction. |
Maximum context window (Tokens) | 262,144 | 1,000,000 | 1,000,000 | 1,000,000 |
Minimum input price (Million tokens) | $1.6 | $0.4 | $0.05 | $0.3 |
Minimum output price (Million tokens) | $6.4 | $1.2 | $0.4 | $1.5 |
Model overview
Category | Subcategory | Description |
Text generation | Qwen large language models: Commercial models (Qwen-Max, Qwen-Plus, Qwen-Flash), open-source models (Qwen3, Qwen2.5) | |
Multimodal models | Visual understanding model Qwen-VL, visual reasoning model QVQ, omni-modal model Qwen-Omni, and real-time multi-modal model Qwen-Omni-Realtime | |
Image generation |
| |
Qwen-Image-Edit: Supports Chinese and English prompts and performs complex image and text editing operations, such as style transfer, text modification, and object editing. | ||
Video generation | Generates videos from a single sentence, offering rich styles and fine image quality. | |
| ||
General-purpose video editing: Performs various video editing tasks based on input text, images, and videos. For example, it can generate a new video by extracting motion features from an input video and combining them with a prompt. | ||
Embedding | Converts text into a set of numbers that can represent the text, suitable for search, clustering, recommendation, and classification tasks. |
Text generation - Qwen
The following are the Qwen commercial models. Compared to the open-source versions, the commercial models offer the latest capabilities and improvements.
The parameter sizes of the commercial models are not disclosed.
Each model is updated and upgraded periodically. To use a fixed version, you can select a snapshot. A snapshot is typically maintained for one month after the release of the next snapshot.
You can use the stable or latest version for more lenient rate limiting conditions.
Qwen-Max
This is the best-performing model in the Qwen series. It is suitable for complex, multi-step tasks. Usage | API reference | Try it online
The Qwen-Max model does not support deep thinking.
Qwen3-Max
Model | Version | Context window | Maximum input | Maximum output | Input price | Output price | Free quota |
(Tokens) | (Million tokens) | ||||||
qwen3-max Currently same capabilties as qwen3-max-2025-09-23 | Stable | 262,144 | 258,048 | 65,536 | Tiered pricing, see the description below the table. | 1 million tokens Valid for 90 days after activating Alibaba Cloud Model Studio | |
qwen3-max-2025-09-23 | Snapshot | ||||||
qwen3-max-preview | Preview |
Qwen3-Max uses tiered pricing based on the number of input tokens (left-open, right-closed intervals).
Input tokens | Input price (Million tokens) qwen3-max and qwen3-max-preview support context cache. | Output price (Million tokens) |
0–32K | $1.2 | $6 |
32K–128K | $2.4 | $12 |
128K–252K | $3 | $15 |
Qwen-Max
Model | Version | Context window | Maximum input | Maximum output | Input price | Output price | Free quota |
(Tokens) | (Million tokens) | ||||||
qwen-max Provides the same capabilities as qwen-max-2025-01-25. | Stable | 32,768 | 30,720 | 8,192 | $1.6 50% discount for batch calls | $6.4 50% discount for batch calls | 1 million tokens for input and 1 million for output Valid for 90 days after you activate Alibaba Cloud Model Studio. |
qwen-max-latest Corresponds to the latest snapshot. | Latest | $1.6 | $6.4 | ||||
qwen-max-2025-01-25 Also known as qwen-max-0125, Qwen2.5-Max | Snapshot |
Qwen-Plus
This model provides a balance of capabilities. Its inference performance, cost, and speed fall between Qwen-Max and Qwen-Flash, which makes it ideal for moderately complex tasks. Usage | API reference | Try it online | Deep thinking
Model | Version | Context window | Maximum input | Maximum output | Input price | Output price | Free quota |
(Tokens) | (Million tokens) | ||||||
qwen-plus Has the same capabilities as qwen-plus-2025-07-28. Part of the Qwen3 series. | Stable | 1,000,000 | Thinking mode 995,904 Non-thinking mode 997,952 The default values are 262,144. You can adjust this value using the max_input_tokens parameter. | 32,768 Maximum chain-of-thought: 81,920 | Tiered pricing, see the description below the table. | 1 million tokens Valid for 90 days after you activate Alibaba Cloud Model Studio. | |
qwen-plus-latest Has the same capabilities as qwen-plus-2025-07-28. Part of the Qwen3 series. | Latest | Thinking mode 995,904 Non-thinking mode 997,952 | |||||
qwen-plus-2025-09-11 Part of the Qwen3 series. | Snapshot | Thinking mode 995,904 Non-thinking mode 997,952 | |||||
qwen-plus-2025-07-28 Also known as qwen-plus-0728. Part of the Qwen3 series. | |||||||
qwen-plus-2025-07-14 Also known as qwen-plus-0714. Part of the Qwen3 series. | 131,072 | Thinking mode 98,304 Non-thinking mode 129,024 | 16,384 Maximum chain-of-thought: 38,912 | $0.4 | Thinking mode $4 Non-thinking mode $1.2 | ||
qwen-plus-2025-04-28 Also known as qwen-plus-0428. Part of the Qwen3 series. | |||||||
qwen-plus-2025-01-25 Also known as qwen-plus-0125. | 129,024 | 8,192 | $1.2 |
The qwen-plus, qwen-plus-latest, qwen-plus-2025-09-11, and qwen-plus-2025-07-28 models use tiered pricing based on the number of input tokens per request (left-open, right-closed intervals).
Input tokens | Input price (Million tokens) | Mode | Output price (Million tokens) |
0–256K | $0.4 | Non-thinking mode | $1.2 |
Thinking mode | $4 | ||
256K–1M | $1.2 | Non-thinking mode | $3.6 |
Thinking mode | $12 |
The qwen-plus-2025-09-11, qwen-plus-2025-07-28, qwen-plus-2025-07-14, qwen-plus-2025-04-28, qwen-plus-latest, and qwen-plus models support both thinking and non-thinking modes. You can switch between these modes using the enable_thinking
parameter. In addition, the capabilities of these models have been significantly improved:
Reasoning capability: In evaluations for math, code, and logical reasoning, it significantly outperforms QwQ and non-reasoning models of a similar size, which achieves top-tier performance in the industry for its scale.
Human preference alignment: It features greatly enhanced capabilities in creative writing, role assumption, multi-turn conversation, and instruction following. Its general capabilities significantly exceed those of models of a similar size.
Agent capability: It achieves industry-leading performance in both thinking and non-thinking modes and can accurately invoke external tools.
Multilingual capability: It supports over 100 languages and dialects, with significantly improved capabilities in multilingual translation, instruction understanding, and common-sense reasoning.
Response format: This version fixes response format issues from previous versions, such as abnormal Markdown, intermediate truncation, and incorrect boxed output.
Qwen-Flash
Qwen-Flash is the fastest and most cost-effective model in the Qwen series and is suitable for simple jobs. It uses flexible tiered pricing. Usage | API reference | Try it online | Thinking mode
Model | Version | Context window | Maximum input | Maximum output | Input price | Output price | Free quota |
(Tokens) | (Million tokens) | ||||||
qwen-flash This model has the same capabilities as qwen-flash-2025-07-28. Part of the Qwen3 series. A 50% discount applies to batch calls. | Stable | 1,000,000 | Thinking mode 995,904 Non-thinking mode 997,952 | 32,768 Maximum chain-of-thought: 81,920. | Tiered pricing, see the description below this table. | 1 million input and 1 million output tokens Valid for 90 days after you activate Alibaba Cloud Model Studio. | |
qwen-flash-2025-07-28 Part of the Qwen3 series. | Snapshot |
The qwen-flash and qwen-flash-2025-07-28 models use tiered pricing based on the number of input tokens in each request (left-open, right-closed intervals). The qwen-flash model supports caching and batch calling.
Input token count | Input price (Million tokens) | Output price (Million tokens) |
0–256K | $0.05 | $0.40 |
256K–1M | $0.25 | $2.00 |
Qwen-Turbo
Qwen-Turbo is deprecated. We recommend Qwen-Flash instead. Qwen-Flash offers flexible tiered pricing for more cost-effective billing. Usage | API reference | Try it online | Deep thinking
Model | Version | Context window | Maximum input | Maximum output | Input price | Output price | Free quota |
(Tokens) | (Million tokens) | ||||||
qwen-turbo Provides the same capabilities as qwen-turbo-2025-04-28. Part of the Qwen3 series | Stable | Thinking mode 131,072 Non-thinking mode 1,000,000 | Thinking mode 98,304 Non-thinking mode 1,000,000 | 16,384 Maximum chain-of-thought: 38,912 | $0.05 Half price for batch calling | Thinking mode: $0.5 Non-thinking mode: $0.2 Half price for batch calling | 1 million tokens each Valid for 90 days after you activate Alibaba Cloud Model Studio. |
qwen-turbo-latest Provides the same capabilities as the latest snapshot. Part of the Qwen3 series | Latest | $0.05 | Thinking mode: $0.5 Non-thinking mode: $0.2 | ||||
qwen-turbo-2025-04-28 Also known as qwen-turbo-0428 Part of the Qwen3 series | Snapshot | ||||||
qwen-turbo-2024-11-01 Also known as qwen-turbo-1101 | 1,000,000 | 1,000,000 | 8,192 | $0.2 |
The latest qwen-turbo-2025-04-28 and qwen-turbo-latest models have thinking and non-thinking mode response capabilities. You can switch between the two modes using the enable_thinking
parameter. In addition, the model's capabilities have been significantly improved:
Reasoning capability: In evaluations for math, code, and logical reasoning, it significantly outperforms QwQ and non-reasoning models of a similar size, which reaches the top tier in the industry for its scale.
Human preference alignment: Capabilities in creative writing, role assumption, multi-turn conversation, and instruction following are greatly enhanced. Its general capabilities significantly exceed those of models of a similar size.
Agent capability: This model reaches industry-leading levels in both reasoning and non-reasoning modes. It can achieve precise external tool invocation.
Multilingual capability: This model supports over 100 languages and dialects. Capabilities in multilingual translation, instruction understanding, and common-sense reasoning are significantly improved.
Response format fixes: Fixes response format issues from previous versions, such as abnormal Markdown, intermediate truncation, and incorrect boxed output.
QwQ
QwQ is a reasoning model trained based on the Qwen2.5 model. Its reasoning capability has been significantly improved through reinforcement learning. The model's core metrics for math and code (AIME 24/25, LiveCodeBench) and some general metrics (IFEval, LiveBench) are on par with the full-power version of DeepSeek-R1. Usage
Model | Version | Context window | Maximum input | Maximum chain-of-thought | Maximum response | Input price | Output price | Free quota |
(Tokens) | (Million tokens) | |||||||
qwq-plus | Stable | 131,072 | 98,304 | 32,768 | 8,192 | $0.8 | $2.4 | 1 million tokens Validity: 90 days after you activate Model Studio |
Qwen-Omni
Qwen-Omni accepts multimodal inputs, such as text, images, audio, and video. It generates text or speech responses. The model provides a variety of expressive, human-like voices and supports speech output in multiple languages and dialects. It can be used in audio and video chat scenarios, such as visual recognition, emotion detection, education, and training. Usage | API reference
Qwen3-Omni-Flash
Model | Version | Mode | Context window | Maximum input | Maximum CoT | Maximum output | Free quota |
(Tokens) | |||||||
qwen3-omni-flash Currently same capability as qwen3-omni-flash-2025-09-15 | Stable | Thinking | 65,536 | 16,384 | 32,768 | 16,384 | 1 million tokens each (regardless of modality) Valid for 90 days after activation |
Non-thinking | 49,152 | - | |||||
qwen3-omni-flash-2025-09-15 Also qwen3-omni-flash-0915 | Snapshot | Thinking | 65,536 | 16,384 | 32,768 | 16,384 | |
Non-thinking | 49,152 | - |
After you use up your free quota, inputs and outputs are billed as follows. The billing is the same for both thinking and non-thinking modes. Audio output is not supported in thinking mode.
|
|
Qwen-Omni-Turbo (based on Qwen2.5)
Model | Version | Context window | Maximum input | Maximum output | Free quota |
(Tokens) | |||||
qwen-omni-turbo Currently has the same capabilities as qwen-omni-turbo-2025-03-26. | Stable | 32,768 | 30,720 | 2,048 | 1 million tokens each (regardless of modality) This quota is valid for 90 days after you activate Model Studio. |
qwen-omni-turbo-latest Always has the same capabilities as the latest snapshot version. | Latest | ||||
qwen-omni-turbo-2025-03-26 Also known as qwen-omni-turbo-0326. | Snapshot |
After you use up the free quota for the commercial model, the billing rules for inputs and outputs are as follows:
|
|
Qwen3-Omni-Flash is recommended. It offers significant improvements in capabilities compared to Qwen-Omni-Turbo, which is no longer updated:
It is a hybrid model that supports both thinking and non-thinking modes. You can switch between the two modes using the
enable_thinking
parameter. The thinking mode is disabled by default.Audio output is not supported in thinking mode. In non-thinking mode, the model's audio output has the following features:
The number of supported voices is increased to 17. Qwen-Omni-Turbo supports only 4.
The number of supported languages is increased to 10. Qwen-Omni-Turbo supports only 2.
Qwen-Omni-Realtime
Unlike Qwen-Omni, Qwen-Omni-Realtime supports audio stream inputs. It has a built-in Voice Activity Detection (VAD) feature that automatically detects the start and end of user speech. Usage|Client events|Sever events
Qwen3-Omni-Flash-Realtime
Model | Version | Context window | Maximum input | Maximum output | Free quota |
(Tokens) | |||||
qwen3-omni-flash-realtime Current capabilities are equivalent to qwen3-omni-flash-realtime-2025-09-15 | Stable | 65,536 | 49,152 | 16,384 | 1 million tokens each (regardless of modality) Valid for 90 days after you activate Model Studio. |
qwen3-omni-flash-realtime-2025-09-15 | Snapshot |
After you use up the free quota, the billing rules for inputs and outputs are as follows:
|
|
Qwen-Omni-Turbo-Realtime (based on Qwen2.5)
Model | Version | Context window | Maximum input | Maximum output | Free quota |
(Tokens) | |||||
qwen-omni-turbo-realtime Currently has the same capabilities as qwen-omni-turbo-realtime-2025-05-08. | Stable | 32,768 | 30,720 | 2,048 | 1 million tokens each (regardless of modality) Valid for 90 days after you activate Model Studio. |
qwen-omni-turbo-realtime-latest Always has the same capabilities the latest snapshot version. | Latest | ||||
qwen-omni-turbo-realtime-2025-05-08 | Snapshot |
After you use up the free quota, the billing rules for inputs and outputs are as follows:
|
|
Qwen3-Omni-Flash-Realtime is recommended. It provides significant improvements over Qwen-Omni-Turbo-Realtime, which will no longer be updated. For audio output from the model:
Supports 17 voices, whereas Qwen-Omni-Turbo-Realtime supports only 4.
Supports 10 languages, whereas Qwen-Omni-Turbo-Realtime supports only 2.
QVQ
QVQ is a visual reasoning model that supports visual input and chain-of-thought output. It demonstrates enhanced capabilities in math, programming, visual analysis, creation, and general tasks. Usage
Model | Version | Context window | Maximum input | Maximum CoT | Maximum response | Input price | Output price | Free quota |
(Tokens) | (Million tokens) | |||||||
qvq-max Currently same performance as qvq-max-2025-03-25 | Stable | 131,072 | 106,496 Up to 16,384 per image | 16,384 | 8,192 | $1.2 | $4.8 | 1 million tokens each Valid for 180 days after activation |
qvq-max-latest Always same performance as the latest snapshot | Latest | |||||||
qvq-max-2025-03-25 Also qvq-max-0325 | Snapshot |
Qwen-VL
Qwen-VL is a text generation model with visual (image) understanding capabilities. It comes in two series: QwenVL-Max and QwenVL-Plus. It can perform OCR and also summarize and reason. For example, it can extract properties from product photos or solve problems based on exercise diagrams. Usage | API reference | Try it online
Qwen-VL models are billed based on the total number of input and output tokens.
Image token calculation rule: Visual understanding.
Qwen3-VL-Plus
Model | Version | Mode | Context window | Maximum input | Maximum chain-of-thought | Maximum output | Input price | Output price Chain-of-thought + output | Free quota |
(Tokens) | (Per 1,000 tokens) | ||||||||
qwen3-vl-plus Currently has the same capabilities as qwen3-vl-plus-2025-09-23 | Stable | Thinking | 262,144 | 258,048 Max 16,384 per image | 81,920 | 32,768 | Tiered pricing. For more information, see the notes below the table. | 1 million tokens for input and output each Validity: 90 days after you activate Alibaba Cloud Model Studio | |
Non-thinking | 262,144 | 260,096 Max 16,384 per image | - | ||||||
qwen3-vl-plus-2025-09-23 | Snapshot | Thinking | 262,144 | 258,048 Max 16,384 per image | 81,920 | 32,768 | |||
Non-thinking | 262,144 | 260,096 Max 16,384 per image | - |
The qwen3-vl-plus and qwen3-vl-plus-2025-09-23 models use a tiered billing method based on the number of input tokens in each request. The input and output prices are the same for both the thinking and non-thinking modes.
Number of input tokens | Input price (Million tokens) | Output price (Million tokens) |
0 to 32K | $0.2 | $1.6 |
32K to 128K | $0.3 | $2.4 |
128K to 256K | $0.6 | $4.8 |
QwenVL-Max
This is the most powerful model in the Qwen-VL series. The following models belong to the Qwen2.5-VL series.
Model | Version | Context window | Maximum input | Maximum output | Input price | Output price | Free quota |
(Tokens) | (Million tokens) | ||||||
qwen-vl-max Offers further improvements in visual reasoning and instruction following capabilities compared to qwen-vl-plus, delivering optimal performance on more complex tasks. Currently has the same capabilities as qwen-vl-max-2025-08-13 | Stable | 131,072 | 129,024 Max 16,384 per image | 8,192 | $0.8 50% off for batch calls | $3.2 50% off for batch calls | 1 million tokens for input and output each Validity: 90 days after you activate Alibaba Cloud Model Studio |
qwen-vl-max-latest Always has the same capabilities as the latest snapshot | Latest | $0.8 | $3.2 | ||||
qwen-vl-max-2025-08-13 Also known as qwen-vl-max-0813 Features comprehensive improvements in visual understanding metrics, with significantly enhanced capabilities in mathematics, reasoning, object detection, and multilingual processing. | Snapshot | ||||||
qwen-vl-max-2025-04-08 Also known as qwen-vl-max-0408 Belongs to the Qwen2.5-VL series. The context is extended to 128k, and the mathematics and reasoning capabilities are significantly enhanced. |
QwenVL-Plus
The QwenVL-Plus model offers a balance between performance and cost. The following models belong to the Qwen2.5-VL series.
Model | Version | Context window | Maximum input | Maximum output | Input price | Output price | Free quota |
(Tokens) | (Million tokens) | ||||||
qwen-vl-plus Currently has the same capabilities as qwen-vl-plus-2025-08-15 | Stable | 131,072 | 129,024 Max 16,384 per image | 8,192 | $0.21 50% off for batch calls | $0.63 50% off for batch calls | 1 million tokens for input and output each Validity: 90 days after you activate Alibaba Cloud Model Studio |
qwen-vl-plus-latest Always has the same capabilities as the latest snapshot | Latest | $0.21 | $0.63 | ||||
qwen-vl-plus-2025-08-15 Also known as qwen-vl-plus-0815 Significantly improved capabilities in object detection and localization, and multilingual processing | Snapshot | ||||||
qwen-vl-plus-2025-05-07 Also known as qwen-vl-plus-0507 Significantly improves the ability to understand mathematics, reasoning, and content from monitoring videos | |||||||
qwen-vl-plus-2025-01-25 Also known as qwen-vl-plus-0125 Belongs to the Qwen2.5-VL series. The context is extended to 128k, and the image and video understanding capabilities are significantly enhanced. |
Qwen-OCR
The Qwen-OCR model is specialized for text extraction. Compared to the Qwen-VL model, it focuses more on extracting text from images such as documents, forms, exam questions, and handwritten text. It can recognize multiple languages, including English, French, Japanese, Korean, German, Russian, and Italian. Usage | API reference | Try it online
Model | Version | Context window | Maximum input | Maximum output | Input and output unit price | Free quota |
(Tokens) | (Million tokens) | |||||
qwen-vl-ocr | Stable | 34,096 | 30,000 A maximum of 30,000 tokens per image. | 4,096 | $0.72 | 1 million input tokens and 1 million output tokens Valid for 90 days after you activate Alibaba Cloud Model Studio. |
Qwen-ASR
Based on Qwen's multimodal model, Qwen-ASR supports multilingual recognition, singing recognition, and noise rejection. Usage
Model | Version | Supported languages | Supported sample rates | Unit price | Free quota (Note) |
qwen3-asr-flash Currently equivalent to qwen3-asr-flash-2025-09-08 | Stable | Chinese (Mandarin, Sichuanese, Minnan, Wu, Cantonese), English, Japanese, German, Korean, Russian, French, Portuguese, Arabic, Italian, Spanish | 16 kHz | $0.000035/second | 36,000 seconds (10 hours) Validity: 90 days after you activate Model Studio |
qwen3-asr-flash-2025-09-08 | Snapshot |
Qwen-Coder
This is the Qwen code model. The latest Qwen3-Coder series models are code generation models based on Qwen3. They have powerful coding Agent capabilities, excel at tool calling and environment interaction, and can perform autonomous programming. They combine excellent coding skills with general-purpose capabilities. Usage | API reference
Model | Version | Context window | Maximum input | Maximum output | Input price | Output price | Free quota |
(Tokens) | (Million tokens) | ||||||
qwen3-coder-plus Currently has the same capabilities as qwen3-coder-plus-2025-07-22 | Stable | 1,000,000 | 997,952 | 65,536 | Tiered pricing. See the description below the table. | 1 million tokens each Valid for 90 days after you activate Model Studio | |
qwen3-coder-plus-2025-09-23 | Snapshot | ||||||
qwen3-coder-plus-2025-07-22 | Snapshot | ||||||
qwen3-coder-flash Currently has the same capabilities as qwen3-coder-flash-2025-07-28 | Stable | ||||||
qwen3-coder-flash-2025-07-28 | Snapshot |
The preceding models use a tiered billing method based on the number of input tokens in each request (left-open, right-closed intervals).
qwen3-coder-plus
The prices for qwen3-coder-plus, qwen3-coder-plus-2025-09-23, and qwen3-coder-plus-2025-07-22 are as follows. qwen3-coder-plus supports context cache. Input text that hits the implicit cache is billed at 20% of the unit price. Input text that hits the explicit cache is billed at 10% of the unit price.
Input tokens | Input cost (Million tokens) | Output cost (Million tokens) |
0–32K | $1 | $5 |
32K–128K | $1.8 | $9 |
128K–256K | $3 | $15 |
256K–1M | $6 | $60 |
qwen3-coder-flash series
The prices for qwen3-coder-flash and qwen3-coder-flash-2025-07-28 are as follows. qwen3-coder-flash supports context cache. Input text that hits the implicit cache is billed at 20% of the unit price.
Input tokens | Input cost (Million tokens) | Output cost (Million tokens) |
0–32K | $0.3 | $1.5 |
32K–128K | $0.5 | $2.5 |
128K–256K | $0.8 | $4 |
256K–1M | $1.6 | $9.6 |
Qwen-MT
This is a flagship large translation model fully upgraded based on Qwen 3. It supports mutual translation across 92 languages, including Chinese, English, Japanese, Korean, French, Spanish, German, Thai, Indonesian, Vietnamese, and Arabic. The model's performance and translation quality are comprehensively upgraded. It provides more stable term customization, format retention, and domain-specific prompt capabilities, which makes translations more accurate and natural. Usage
Model | Context window | Maximum input | Maximum output | Input price | Output price | Free quota |
(Tokens) | (Million tokens) | |||||
qwen-mt-plus Qwen3-MT | 16,384 | 8,192 | 8,192 | $2.46 | $7.37 | 1 million tokens per model Valid for 90 days after activating Alibaba Cloud Model Studio |
qwen-mt-turbo Qwen3-MT | $0.16 | $0.49 |
Text generation - Qwen open-source versions
In the model names, `xxb` indicates the parameter size. For example, `qwen2-72b-instruct` indicates a parameter size of 72 billion (72B).
Alibaba Cloud Model Studio supports calling the open-source versions of Qwen. You do not need to deploy the models locally. For open-source versions, we recommend using the Qwen3 and Qwen2.5 models.
Qwen3
The qwen3-next-80b-a3b-thinking model, released in September 2025, supports only thinking mode. It features improved instruction-following capabilities compared to the qwen3-235b-a22b-thinking-2507 model, resulting in more concise summary responses.
The qwen3-next-80b-a3b-instruct model, released in September 2025, supports only non-thinking mode. It offers enhanced Chinese understanding, logical reasoning, and text generation capabilities compared to qwen3-235b-a22b-instruct-2507.
The qwen3-235b-a22b-thinking-2507 and qwen3-30b-a3b-thinking-2507 models, released in July 2025 and supporting only the thinking mode, are upgrades to the thinking mode of the qwen3-235b-a22b and qwen3-30b-a3b models.
The qwen3-235b-a22b-instruct-2507 and qwen3-30b-a3b-instruct-2507 models, released in July 2025 and supporting only the non-thinking mode, are upgrades to the non-thinking mode of the qwen3-235b-a22b and qwen3-30b-a3b models.
The Qwen3 models released in April 2025 support thinking and non-thinking modes. You can switch between the two modes using the enable_thinking
parameter. In addition, the capabilities of the Qwen3 models have been significantly improved:
Reasoning capability: In evaluations for math, code, and logical reasoning, it significantly outperforms QwQ and non-reasoning models of a similar size, which reaches the top tier in the industry for its scale.
Human preference alignment: Capabilities in creative writing, role assumption, multi-turn conversation, and instruction following are greatly enhanced. Its general capabilities significantly exceed those of models of a similar size.
Agent capability: This model reaches industry-leading levels in both reasoning and non-reasoning modes. It can achieve precise external tool invocation.
Multilingual capability: This model supports over 100 languages and dialects. Capabilities in multilingual translation, instruction understanding, and common-sense reasoning are significantly improved.
Response format fixes: Fixes response format issues from previous versions, such as abnormal Markdown, intermediate truncation, and incorrect boxed output.
The Qwen3 open-source models released in April 2025 do not support non-streaming output in thinking mode.
If a Qwen3 open-source model is in thinking mode but does not output a thinking process, it is billed at the non-thinking mode price.
Thinking mode | Non-thinking mode | Usage
Model | Mode | Context window | Maximum input | Maximum chain-of-thought | Maximum response | Input price | Output price | Free quota |
(Tokens) | (Million tokens) | |||||||
qwen3-next-80b-a3b-thinking | Thinking only | 131,072 | 126,976 | 81,920 | 32,768 | $0.5 | $6 | 1 million tokens per model Valid for 90 days after you activate Alibaba Cloud Model Studio |
qwen3-next-80b-a3b-instruct | Non-thinking only | 129,024 | - | $0.5 | $2 | |||
qwen3-235b-a22b-thinking-2507 | Thinking only | 126,976 | 81,920 | $0.7 | $8.4 | |||
qwen3-235b-a22b-instruct-2507 | Non-thinking only | 129,024 | - | $0.7 | $2.8 | |||
qwen3-30b-a3b-thinking-2507 | Thinking only | 126,976 | 81,920 | $0.2 | $2.4 | |||
qwen3-30b-a3b-instruct-2507 | Non-thinking only | 129,024 | - | $0.8 | ||||
qwen3-235b-a22b This model and the following models are scheduled for release in April 2025. | Non-thinking | 129,024 | - | 16,384 | $0.7 | $2.8 | ||
Thinking | 98,304 | 38,912 | $8.4 | |||||
qwen3-32b | Non-thinking | 129,024 | - | $2.8 | ||||
Thinking | 98,304 | 38,912 | $8.4 | |||||
qwen3-30b-a3b | Non-thinking | 129,024 | - | $0.2 | $0.8 | |||
Thinking | 98,304 | 38,912 | $2.4 | |||||
qwen3-14b | Non-thinking | 129,024 | - | 8,192 | $0.35 | $1.4 | ||
Thinking | 98,304 | 38,912 | $4.2 | |||||
qwen3-8b | Non-thinking | 129,024 | - | $0.18 | $0.7 | |||
Thinking | 98,304 | 38,912 | $2.1 | |||||
qwen3-4b | Non-thinking | 129,024 | - | $0.11 | $0.42 | |||
Thinking | 98,304 | 38,912 | $1.26 | |||||
qwen3-1.7b | Non-thinking | 32,768 | 30,720 | - | $0.42 | |||
Thinking | 28,672 | The total number of input and output tokens cannot exceed 30,720. | $1.26 | |||||
qwen3-0.6b | Non-thinking | 30,720 | - | $0.42 | ||||
Thinking | 28,672 | The total number of input and output tokens cannot exceed 30,720. | $1.26 |
Qwen2.5
Qwen-Omni
A new multimodal large model for understanding and generation, trained on Qwen2.5. It supports text, image, speech, and video inputs, and can generate text and speech simultaneously in a stream. The speed of multimodal content understanding is significantly improved.Usage | API reference
Model | Context window | Maximum input | Maximum output | Free quota (Note) |
(Tokens) | ||||
qwen2.5-omni-7b | 32,768 | 30,720 | 2,048 | 1 million tokens (regardless of modality) Valid for 90 days after activation |
After the free quota is used up, the following billing rules apply to inputs and outputs:
|
|
Qwen3-Omni-Captioner
Qwen3-Omni-Captioner is an open-source model based on Qwen3-Omni. Without requiring prompts, it automatically generates accurate and comprehensive descriptions for complex audio that includes speech, ambient sounds, music, and sound effects. The model can detect speaker emotions, music elements such as style and instruments, and sensitive information. It is ideal for applications such as audio content analysis, security audits, intent recognition, and audio editing. Usage | API reference
Model name | Context length | Max input | Max output | Input cost | Output cost | Free quota |
(Tokens) | (Per 1,000,000 tokens) | |||||
qwen3-omni-30b-a3b-captioner | 65,536 | 32,768 | 32,768 | $3.81 | $3.06 | 1,000,000 tokens Valid for 90 days after you activate Alibaba Cloud Model Studio. |
Qwen-VL
This is the open-source version of Alibaba Cloud's Qwen-VL. Usage | API reference
The Qwen3-VL model offers significant improvements over Qwen2.5-VL:
Agent interaction: It operates computer or mobile phone interfaces, detects GUI elements, understands features, and invokes tools to perform tasks. It achieves top-tier performance in evaluations such as OS World.
Visual encoding: It generates code from images or videos. This feature can be used to create HTML, CSS, and JS code from design drafts or website screenshots.
Spatial intelligence: It supports 2D and 3D positioning and accurately determines object orientation, perspective changes, and occlusion relationships.
Long video understanding: It understands video content up to 20 minutes long and can pinpoint specific moments with second-level accuracy.
Deep thinking: It excels at capturing details and analyzing causality, achieving top-tier performance in evaluations such as MathVista and MMMU.
OCR: It supports 33 languages and performs more stably in scenarios with complex lighting, blur, or tilt. It also significantly improves the accuracy of recognizing rare characters, ancient script, and technical terms.
Qwen3-VL
Model | Mode | Context window | Maximum input | Maximum chain-of-thought | Maximum response length | Input price | Output price Chain-of-thought + output | Free quota |
(Tokens) | (Million tokens) | |||||||
qwen3-vl-30b-a3b-thinking | Thinking mode only | 131,072 | 126,976 | 81,920 | 32,768 | $0.2 | $2.4 | 1 million tokens each Valid for 90 days after Model Studio activation. |
qwen3-vl-30b-a3b-instruct | Non-thinking mode only | 129,024 | - | $0.8 | ||||
qwen3-vl-235b-a22b-thinking | Thinking only | 126,976 | 81,920 | $0.7 | $8.4 | |||
qwen3-vl-235b-a22b-instruct | Non-thinking mode only | 129,024 | - | $2.8 |
Qwen2.5-VL
Model | Context window | Maximum input | Maximum output | Input price | Output price | Free quota |
(Tokens) | (Million tokens) | |||||
qwen2.5-vl-72b-instruct | 131,072 | 129,024 Max 16,384 per image | 8,192 | $2.8 | $8.4 | 1 million tokens for input and output each Validity: 90 days after you activate Alibaba Cloud Model Studio |
qwen2.5-vl-32b-instruct | $1.4 | $4.2 | ||||
qwen2.5-vl-7b-instruct | $0.35 | $1.05 | ||||
qwen2.5-vl-3b-instruct | $0.21 | $0.63 |
Qwen-Coder
Qwen-Coder is an open source code model from Qwen. The latest, qwen3-coder-480b-a35b-instruct, is a code generation model based on Qwen3 with powerful Coding Agent capabilities. It excels at tool calling, environment interaction, and autonomous programming. The model combines excellent coding skills with general-purpose capabilities.Usage | API reference
Model | Context window | Maximum input | Maximum output | Input price | Output price | Free quota |
(Tokens) | ||||||
qwen3-coder-480b-a35b-instruct | 262,144 | 204,800 | 65,536 | Tiered pricing applies, see the description below the table. | 1 million tokens each Validity: Within 90 days after you activate Model Studio | |
qwen3-coder-30b-a3b-instruct |
The qwen3-coder-480b-a35b-instruct and qwen3-coder-30b-a3b-instruct models use tiered billing based on the number of input tokens per request (left-open, right-closed intervals).
Model | Number of input tokens | Input price (Million tokens) | Output price (Million tokens) |
qwen3-coder-480b-a35b-instruct | 0–32K | $1.5 | $7.5 |
32K–128K | $2.7 | $13.5 | |
128K–200K | $4.5 | $22.5 | |
qwen3-coder-30b-a3b-instruct | 0–32K | $0.45 | $2.25 |
32K–128K | $0.75 | $3.75 | |
128K–200K | $1.2 | $6 |
Image generation
Qwen text-to-image
The Qwen text-to-image model excels at complex text rendering, especially for Chinese and English text. Currently, qwen-image-plus has the same capabilities as qwen-image, but qwen-image-plus has lower price. API reference
Model | Unit price | Free quota |
qwen-image-plus | $0.03 per image | Free quota: 100 images for each model Validity period: Within 90 days after you activate Alibaba Cloud Model Studio. |
qwen-image | $0.035 per image |
Input prompt | Output image |
Healing-style hand-drawn poster featuring three puppies playing with a ball on lush green grass, adorned with decorative elements such as birds and stars. The main title “Come Play Ball!” is prominently displayed at the top in bold, blue cartoon font. Below it, the subtitle “Come [Show Off Your Skills]!” appears in green font. A speech bubble adds playful charm with the text: “Hehe, watch me amaze my little friends next!” At the bottom, supplementary text reads: “We get to play ball with our friends again!” The color palette centers on fresh greens and blues, accented with bright pink and yellow tones to highlight a cheerful, childlike atmosphere. |
Qwen image editing
The Qwen image editing model supports precise text editing in Chinese and English. It also supports operations such as color adjustment, detail enhancement, style transfer, adding or removing objects, and changing positions and actions. These features enable complex editing of images and text. API reference
Model | Unit price | Free quota |
qwen-image-edit | $0.045/image | Free quota: 100 images Valid for 90 days after activating Alibaba Cloud Model Studio. |
Original image | Change the person to a standing position, bending over to hold the dog's front paws. | Original image | Replace the words 'HEALTH INSURANCE' on the letter blocks with '明天会更好' (Tomorrow will be better). |
Original image | Replace the dotted shirt with a light blue shirt. | Original image | Change the background in the image to Antarctica. |
Original image | Generate a cartoon profile picture of the person. | Original image | Remove the hair from the dinner plate. |
Wan text-to-image
The Wan text-to-image model generates exquisite images from text. API reference | Try it online
Model | Description | Unit price | Free quota (Note) The free quota is valid for 90 days after you activate Alibaba Cloud Model Studio. |
wan2.5-t2i-preview | The Wan 2.5 preview removes the single-side limitation, allowing you to freely select image dimensions within the constraints of total pixel area and aspect ratio. | $0.03/image | 50 images |
wan2.2-t2i-plus | The Wan 2.2 Professional Edition features comprehensive upgrades that enhance creativity, stability, and photorealistic quality. | $0.05/image | 100 images |
wan2.2-t2i-flash | The Wan 2.2 Express Edition features comprehensive upgrades that enhance creativity, stability, and photorealistic quality. | $0.025/image | 100 images |
wan2.1-t2i-plus | The Wan 2.1 Professional Edition generates images with richer details. | $0.05/image | 200 images |
wan2.1-t2i-turbo | The Wan 2.1 Turbo Edition offers balanced performance and high cost-effectiveness. | $0.025/image | 200 images |
Input prompt | Output image |
A needle-felted Santa Claus holding a gift and a white cat standing next to him, with a background of colorful gifts and green plants creating a cute, warm, and cozy scene. |
Wan2.5 general image editing
The Wan2.5 general image editing model supports entity-consistent image editing and multi-image fusion. It accepts text, a single image, or multiple images as input. API reference
Model | Unit price | Free quota(Note) Valid for 90 days after you activate Alibaba Cloud Model Studio. |
wan2.5-i2i-preview | $0.03/image | 50 images |
Feature | Input example | Output image |
Single-image editing | Replace the floral dress with a vintage-style lace gown that has delicate embroidery on the collar and cuffs. | |
Multi-image fusion | Place the alarm clock from Image 1 next to the vase on the dining table in Image 2. |
Video generation - Wan
Text-to-video
The Wan text-to-video model generates videos from a single sentence. The videos feature rich artistic styles and cinematic quality. API reference | Try it online
Model | Description | Unit price | Free quota(Claim) Valid for 90 days after activating Alibaba Cloud Model Studio |
wan2.5-t2v-preview | Wan 2.5 preview supports automatic dubbing and custom audio files. | 480P: $0.05/second 720P: $0.10/second 1080P: $0.15/second | 50 seconds |
wan2.2-t2v-plus | Wan 2.2 professional edition. This model provides significant improvements in image detail and motion stability. | 480P: $0.02/second 1080P: $0.10/second | 50 seconds |
wan2.1-t2v-turbo | Wan 2.1 speed edition. This model provides fast generation and balanced performance. | $0.036/second | 200 seconds |
wan2.1-t2v-plus | Wan 2.1 professional edition. This model generates videos with rich details and enhanced texture. | $0.10/second | 200 seconds |
Sample prompt | Generated video |
Prompt: A kitten running in the moonlight |
Image-to-video - based on the first frame
The Wan image-to-video model uses an input image as the first frame of a video. It then generates the rest of the video based on a prompt. The videos feature rich artistic styles and cinematic quality. API reference | Try it online
Model | Description | Unit price | Free quota (Note) Valid for 90 days after activating Model Studio |
wan2.5-i2v-preview | Wan 2.5 preview supports automatic dubbing and custom audio files. | 480P: $0.05/second 720P: $0.10/second 1080P: $0.15/second | 50 seconds |
wan2.2-i2v-flash | Wan 2.2 Turbo Edition. This model offers extremely fast generation speeds with significant improvements in image detail and motion stability. | 480P: $0.015/second 720P: $0.036/second | 50 seconds |
wan2.2-i2v-plus | Wan 2.2 Professional Edition. This model provides significant improvements in image detail and motion stability. | 480P: $0.02/second 1080P: $0.10/second | 50 seconds |
wan2.1-i2v-turbo | Wan 2.1 Turbo Edition. This model offers fast generation speeds and balanced performance. | $0.036/second | 200 seconds |
wan2.1-i2v-plus | Wan 2.1 Professional Edition. This model generates videos with rich details and enhanced textures. | $0.10/second | 200 seconds |
Input example | Output video |
Input prompt: A cat running on the grass Input image: | The model generates a video based on the prompt, using the input image as the first frame. Model: wanx2.1-i2v-turbo. |
Image-to-video - based on the first and last frames
The Wan first-and-last-frame video model generates a smooth, dynamic video from a prompt. You only need to provide the first and last frame images. The videos feature rich artistic styles and cinematic quality. API reference | Try it online
Model | Price | Free quota (Note) |
wan2.1-kf2v-plus | $0.10 per second | 200 seconds Valid for 90 days after you activate Model Studio |
Example input | Output video | ||
First frame | Last frame | Prompt | |
In a realistic style, the camera starts at eye level on a small black cat looking up at the sky with curiosity, then gradually moves upward to end in a top-down shot focused on the cat's curious eyes. |
General video editing
The Wan unified video editing model supports multimodal inputs, including text, images, and videos. It can perform video generation and general editing tasks. API reference | Try it online
Model | Price | Free quota |
wan2.1-vace-plus | $0.10 per second | 50 seconds Valid for 90 days after you activate Model Studio. |
The unified video editing model supports the following features:
Feature | Input reference image | Input prompt | Output video |
Multi-image reference | Reference image 1 (reference entity) Reference image 2 (reference background) | In the video, a girl gracefully walks out from a misty, ancient forest. Her steps are light, and the camera captures her every nimble moment. When the girl stops and looks around at the lush woods, a smile of surprise and joy blossoms on her face. This scene, frozen in a moment of interplay between light and shadow, records the girl's wonderful encounter with nature. | Output video |
Video repainting | The video shows a black steampunk-style car driven by a gentleman. The car is decorated with gears and copper pipes. The background features a steam-powered candy factory and retro elements, creating a vintage and playful scene. | ||
Local editing | Input video Input mask image (The white area indicates the editing area) | The video shows a Parisian-style French cafe where a lion in a suit is elegantly sipping coffee. It holds a coffee cup in one hand, taking a gentle sip with a relaxed expression. The cafe is tastefully decorated, with soft hues and warm lighting illuminating the area where the lion is. | The content in the editing area is modified based on the prompt. |
Video extension | Input first clip (1 second) | A dog wearing sunglasses is skateboarding on the street, 3D cartoon. | Output extended video (5 seconds) |
Video outpainting | An elegant lady is passionately playing the violin, with a full symphony orchestra behind her. |
Speech synthesis (text-to-speech)
Qwen-TTS
Model | Version | Unit price | Maximum input characters | Supported languages | Free quota(Note) |
qwen3-tts-flash Currently same capabilities as qwen3-tts-flash-2025-09-18 | Stable | $0.1/10,000 characters | 600 | Chinese (Mandarin, Beijing, Shanghai, Sichuan, Nanjing, Shaanxi, Minnan, Tianjin, Cantonese), English, Spanish, Russian, Italian, French, Korean, Japanese, German, Portuguese | 2,000 characters for each Validity: 90 days after activating Model Studio |
qwen3-tts-flash-2025-09-18 | Snapshot |
Qwen-TTS-Realtime
Model | Version | Price | Supported languages | Free quota(Note) |
qwen3-tts-flash-realtime Currently same performance as qwen3-tts-flash-realtime-2025-09-18 | Stable | $0.13 per 10,000 characters | Chinese (Mandarin, Beijing, Shanghai, Sichuan, Nanjing, Shaanxi, Minnan, Tianjin, Cantonese), English, Spanish, Russian, Italian, French, Korean, Japanese, German, Portuguese | 2,000 characters for each Validity period: 90 days after you activate Model Studio |
qwen3-tts-flash-realtime-2025-09-18 | Snapshot |
Speech recognition and translation (speech-to-text)
Qwen3-LiveTranslate-Flash-Realtime
qwen3-livetranslate-flash-realtime is a multilingual, real-time audio and video translation model. It recognizes 18 languages and translates them into audio in 10 languages in real time.
Core features:
Multilingual support: Supports 18 languages, such as Chinese, English, French, German, Russian, Japanese, and Korean, and 6 Chinese dialects, such as Mandarin, Cantonese, and Sichuanese.
Visual enhancement: Improves translation accuracy using visual content. The model analyzes lip movements, actions, and on-screen text to improve translation in noisy environments or for words with multiple meanings.
Low latency: Achieves a simultaneous interpretation latency as low as 3 seconds.
Lossless simultaneous interpretation: Uses semantic unit prediction technology to resolve cross-language word order issues. This ensures that the quality of real-time translation is nearly identical to that of offline translation.
Natural voice: Generates human-like speech with a natural voice. The model adapts its tone and emotion based on the source audio content.
Model | Version | Context window | Maximum input | Maximum output | Free quota |
(Tokens) | |||||
qwen3-livetranslate-flash-realtime Current capabilities are equivalent to qwen3-livetranslate-flash-realtime-2025-09-22 | Stable | 53248 | 49,152 | 4,096 | 1 million tokens for each Validity: 90 days after activating Alibaba Cloud Model Studio |
qwen3-livetranslate-flash-realtime-2025-09-22 | Snapshot |
After the free quota is exhausted, inputs and outputs are billed as follows:
|
|
Fun-ASR
Fun-ASR is an end-to-end, large-scale automatic speech recognition (ASR) model from Qwen Lab. It is built on advanced, self-developed speech technology and provides excellent contextual awareness and high-accuracy transcription. API reference.
Audio file recognition
Model | Version | Supported languages | Supported sample rates | Scenarios | Supported audio formats | Price | Free quota |
fun-asr Currently equivalent to fun-asr-2025-08-25 | Stable | Chinese, English | Any | ApsaraVideo Live, voice calls, real-time conference interpretation, and more | aac, amr, avi, flac, flv, m4a, mkv, mov, mp3, mp4, mpeg, ogg, opus, wav, webm, wma, wmv | $0.000035/second | 36,000 seconds (10 hours) Valid for 90 days |
fun-asr-2025-08-25 | Snapshot |
Text embedding
Text embedding models convert text into numerical representations for tasks such as search, clustering, recommendation, and classification. Billing for these models is based on the number of input tokens. API reference
Model | Embedding dimensions | Batch size | Maximum tokens per row | Supported languages | Price (Million input tokens) | Free quota |
text-embedding-v3 | 1,024 (default), 768, or 512 | 10 | 8,192 | Over 50 languages, such as Chinese, English, Spanish, French, Portuguese, Indonesian, Japanese, Korean, German, and Russian | $0.07 | 500,000 tokens Valid for 90 days after Model Studio activation. |
Role-playing
Qwen's role-playing model is ideal for scenarios that require human-like conversation, such as virtual social interactions, game NPCs, IP character replication, hardware, toys, and in-vehicle systems. This model offers enhanced capabilities in character fidelity, conversation progression, and empathetic listening compared to other Qwen models. Usage
Model | Context window | Maximum input | Maximum output | Input price | Output price |
(Tokens) | (Million tokens) | ||||
qwen-plus-character-ja | 8,192 | 7,680 | 512 | $0.5 | $1.4 |
Retired models (Singapore region)
Retired on August 20, 2025
Qwen2
Qwen1.5
Flagship models (Beijing region)
Most powerful general-purpose models |
Suitable for complex tasks, most powerful |
Balanced performance, speed, and cost |
Suitable for simple jobs, fast and low cost |
Excellent at coding and proficient in tool calling and environment interaction |
Maximum context window (Tokens) | 262,144 | 1,000,000 | 1,000,000 | 1,000,000 |
Input price (Million tokens) | $0.345 | $0.115 | $0.044 | $0.287 |
Output price (Million tokens) | $1.377 | $0.287 | $0.087 | $0.861 |
For detailed parameters and more models, see the tables that follow.
Model overview
Category | Model | Description |
Text generation | ||
The visual understanding model Qwen-VL, the visual reasoning model QVQ, and the omni-modal model Qwen-Omni | ||
Code model, Math model, Translation model, Data mining model, Intention recognition model, Role assumption model | ||
Image generation |
| |
General-purpose models:
More models: Qwen Image Translation, OutfitAnyone | ||
Speech synthesis and recognition | Qwen-TTS and CosyVoice convert text to speech for scenarios such as voice-based customer service, audiobooks, in-car navigation, and educational tutoring. | |
Paraformer converts speech to text for scenarios such as real-time meeting transcription, real-time live stream captioning, and customer service calls. | ||
Video editing and generation |
| |
| ||
| ||
Embedding | Converts text into a numerical vector representation. These embeddings are used for search, clustering, recommendation, and classification. | |
Converts text, images, and speech into numerical vectors. These embeddings are used for audio and video classification, image classification, and image-text retrieval. | ||
Industry | The Intention Recognition Model parses user intent in milliseconds and selects the appropriate tools to resolve user issues. |
Text generation - Qwen
The following are the Qwen commercial models. Compared to the open-source editions, the commercial models have the latest capabilities and improvements.
Models are updated and upgraded periodically. To use a fixed version, you can select a snapshot. Snapshots are typically maintained for one month after the release of the next snapshot.
We recommend that you use the stable or latest version because their rate limits are looser.
Qwen-Max
This is the best-performing model in the Qwen series. The model is suitable for complex and multi-step tasks. Usage | API reference | Try it online
Qwen3-Max
Model | Version | Context window | Maximum input | Maximum output | Input price | Output price |
(Tokens) | (Million tokens) | |||||
qwen3-max Currently same capabilties as qwen3-max-2025-09-23 | Stable | 262,144 | 258,048 | 65,536 | Tiered pricing. See the notes below this table. | |
qwen3-max-2025-09-23 | Snapshot | |||||
qwen3-max-preview | Preview |
Qwen3-Max uses tiered pricing based on the number of input tokens (left-open, right-closed intervals).
Input Tokens | Input Price (Million tokens) qwen3-max and qwen3-max-preview support context cache. | Output Price (Million tokens) |
0-32K | $0.861 | $3.441 |
32K-128K | $1.434 | $5.735 |
128K-252K | $2.151 | $8.602 |
qwen3-max and qwen3-max-2025-09-23 support search agent, see Web search.
Qwen-Max
Model | Version | Context window | Maximum input | Maximum output | Input price | Output price |
(Tokens) | (Million tokens) | |||||
qwen-max Offers the same capabilities as qwen-max-2024-09-19. | Stable | 32,768 | 30,720 | 8,192 | $0.345 | $1.377 |
qwen-max-latest Always points to the latest snapshot. | Latest | 131,072 | 129,024 | |||
qwen-max-2025-01-25 Also known as qwen-max-0125, Qwen2.5-Max | Snapshot | |||||
qwen-max-2024-09-19 Also known as qwen-max-0919. | 32,768 | 30,720 | $2.868 | $8.602 |
Qwen-Plus
This is a balanced model. Its inference performance, cost, and speed are between those of Qwen-Max and Qwen-Turbo. It is ideal for moderately complex tasks.
Usage | API reference | Try it online | Thinking mode
Model | Version | Context window | Maximum input | Maximum output | Input price | Output price |
(Tokens) | (Million tokens) | |||||
qwen-plus Same capabilities as qwen-plus-2025-07-28. Part of the Qwen3 series. | Stable | 1,000,000 | Thinking mode 995,904 Non-thinking mode 997,952 The default values are 131,072. You can adjust this value using the max_input_tokens parameter. | 32,768 Maximum CoT is 81,920. | Tiered pricing, see the description below the table. | |
qwen-plus-latest Same capabilities as qwen-plus-2025-07-28. Part of the Qwen3 series. | Latest | Thinking mode 995,904 Non-thinking mode 997,952 | ||||
qwen-plus-2025-09-11 Part of the Qwen3 series. | Snapshot | Thinking mode 995,904 Non-thinking mode 997,952 | ||||
qwen-plus-2025-07-28 Also known as qwen-plus-0728. Part of the Qwen3 series. | ||||||
qwen-plus-2025-07-14 Also known as qwen-plus-0714. Part of the Qwen3 series. | 131,072 | Thinking mode 98,304 Non-thinking mode 129,024 | 16,384 Maximum CoT is 38,912. | $0.115 | Thinking mode $1.147 Non-thinking mode $0.287 | |
qwen-plus-2025-04-28 Also known as qwen-plus-0428. Part of the Qwen3 series |
The qwen-plus, qwen-plus-latest, qwen-plus-2025-09-11, and qwen-plus-2025-07-28 models use tiered pricing based on the number of input tokens per request (left-open, right-closed intervals).
Input tokens | Input price (Million tokens) | Mode | Output price (Million tokens) |
0-128K | $0.115 | Non-thinking mode | $0.287 |
Thinking mode | $1.147 | ||
128K-256K | $0.345 | Non-thinking mode | $2.868 |
Thinking mode | $3.441 | ||
256K-1M | $0.689 | Non-thinking mode | $6.881 |
Thinking mode | $9.175 |
These models support thinking and non-thinking modes. You can switch between the two modes using the enable_thinking
parameter. In addition, the model's capabilities have been significantly enhanced:
Inference capability: In evaluations of math, code, and logical reasoning, it significantly surpasses QwQ and non-reasoning models of the same size, which reaches the top tier in the industry for its scale.
Human preference capability: Creative writing, role-play, multi-turn conversation, and instruction-following capabilities have all been greatly improved. Its general capabilities significantly exceed those of models of the same size.
Agent capability: It achieves industry-leading levels in both thinking and non-thinking modes and can accurately call external tools.
Multilingual capability: It supports over 100 languages and dialects, with significant improvements in multilingual translation, instruction understanding, and common-sense reasoning.
Response format: This version fixes response format issues from previous versions, such as abnormal Markdown, intermediate truncation, and incorrect boxed output.
For these models, if you enable thinking mode but no thought process is output, you are charged at the non-thinking mode rate.
Qwen-Flash
This is the fastest and most cost-effective model in the Qwen series. It is ideal for simple tasks. Qwen-Flash uses flexible tiered pricing for more reasonable billing. Usage| API reference | Deep thinking
Model | Version | Context window | Maximum input | Maximum chain-of-thought | Maximum response | Input price | Output price |
(Tokens) | (Million tokens) | ||||||
qwen-flash Provides the same capabilities as qwen-flash-2025-07-28. A model in the Qwen3 series. | Stable | 1,000,000 | 1,044,480 | 32,768 | 81,920 | Tiered pricing applies. For more information, see the description below this table. | |
qwen-flash-2025-07-28 Also known as qwen-flash-0728. | Snapshot |
The qwen-flash and qwen-flash-2025-07-28 models use tiered pricing based on the number of input tokens per request (left-open, right-closed intervals). The qwen-flash model supports context cache.
Input tokens | Input price (Million tokens) | Output price (Million tokens) |
0–128K | $0.022 | $0.216 |
128K–256K | $0.087 | $0.861 |
256K–1M | $0.173 | $1.721 |
Qwen-Turbo
Qwen-Turbo is no longer updated. We recommend replacing it with Qwen-Flash. Qwen-Flash uses flexible tiered pricing for more reasonable billing. Usage | API reference | Try it online | Thinking mode
Model | Version | Context window | Maximum input | Maximum output | Input price | Output price |
(Tokens) | (Million tokens) | |||||
qwen-turbo Functionally equivalent to qwen-turbo-2025-04-28. Part of the Qwen3 series. | Stable | Thinking mode 131,072 Non-thinking mode 1,000,000 | Thinking mode 98,304 Non-thinking mode 1,000,000 | 16,384 The maximum length for Chain-of-Thought (CoT) is 38,912 tokens. | $0.044 | Thinking mode $0.431 Non-thinking mode $0.087 |
qwen-turbo-latest Functionally equivalent to the latest snapshot version. Part of the Qwen3 series. | Latest | |||||
qwen-turbo-2025-07-15 Also known as qwen-turbo-0715. Part of the Qwen3 series. | Snapshot | |||||
qwen-turbo-2025-04-28 Also known as qwen-turbo-0428. Part of the Qwen3 series. |
These models support thinking and non-thinking modes. You can switch between the two modes using the enable_thinking
parameter. In addition, the model's capabilities have been significantly enhanced:
Inference capability: In evaluations of math, code, and logical reasoning, it significantly surpasses QwQ and non-reasoning models of the same size, which reaches the top tier in the industry for its scale.
Human preference capability: Creative writing, role-play, multi-turn conversation, and instruction-following capabilities have all been greatly improved. Its general capabilities significantly exceed those of models of the same size.
Agent capability: It achieves industry-leading levels in both thinking and non-thinking modes and can accurately call external tools.
Multilingual capability: It supports over 100 languages and dialects, with significant improvements in multilingual translation, instruction understanding, and common-sense reasoning.
Response format: This version fixes response format issues from previous versions, such as abnormal Markdown, intermediate truncation, and incorrect boxed output.
For these models, if you enable thinking mode but no thought process is output, you are charged at the non-thinking mode rate.
QwQ
The QwQ reasoning model, trained on the Qwen2.5 model, significantly improves model inference capabilities through reinforcement learning. Core metrics such as math and code (AIME 24/25, LiveCodeBench) and some general metrics (IFEval, LiveBench) reach the level of the full-power DeepSeek-R1. Usage
Model | Version | Context window | Maximum input | Maximum CoT | Maximum response | Input price | Output price |
(Tokens) | (Million tokens) | ||||||
qwq-plus Provides the same capabilities as qwq-plus-2025-03-05. | Stable | 131,072 | 98,304 | 32,768 | 8,192 | $0.230 | $0.574 |
qwq-plus-latest Always points to the latest snapshot. | Latest | ||||||
qwq-plus-2025-03-05 Also known as qwq-plus-0305. | Snapshot |
Qwen-Long
This is the model in the Qwen series with the longest context window. It offers balanced capabilities and a low cost. It is ideal for tasks such as long-text analysis, information extraction, summarization, and classification and tagging. Usage | Try it online
Model name | Version | Context window | Maximum input | Maximum output | Input price | Output price |
(Tokens) | (Million tokens) | |||||
qwen-long-latest Same capabilities as the latest snapshot version. | Stable | 10,000,000 | 10,000,000 | 8,192 | $0.072 | $0.287 |
qwen-long-2025-01-25 Also known as qwen-long-0125. | Snapshot |
Qwen Omni
This is a new multimodal understanding and generation large model from Qwen. It supports text, image, speech, and video input, and outputs text and audio. It provides four natural conversational voices. Usage | API reference
Model | Version | Context window | Maximum input | Maximum output |
(Tokens) | ||||
qwen-omni-turbo Offers the same capabilities as the qwen-omni-turbo-2025-03-26 snapshot. | Stable | 32,768 | 30,720 | 2,048 |
qwen-omni-turbo-latest Offers the same capabilities as the latest snapshot. | Latest | |||
qwen-omni-turbo-2025-03-26 Also known as qwen-omni-turbo-0326. | Snapshot | |||
qwen-omni-turbo-2025-01-19 Also known as qwen-omni-turbo-0119. |
The billing rules for input and output are as follows:
|
| ||||||||||||||
Billing example: If a request includes 1,000 text tokens and 1,000 image tokens in the input, and generates 1,000 text tokens and 1,000 audio tokens in the output, the total cost is: $0.000058 (text input) + $0.000216 (image input) + $0.007168 (audio output) |
Qwen Omni-Realtime
Compared to Qwen-Omni, this model supports streaming audio input and has a built-in Voice Activity Detection (VAD) feature to automatically detect the start and end of user speech. Usage
Model | Version | Context window | Maximum input | Maximum Output |
(Tokens) | ||||
qwen-omni-turbo-realtime Offers the same capabilities as the qwen-omni-turbo-2025-05-08 snapshot. | Stable | 32,768 | 30,720 | 2,048 |
qwen-omni-turbo-realtime-latest This model is an alias for the latest snapshot. | Latest | |||
qwen-omni-turbo-realtime-2025-05-08 | Snapshot |
The billing rules for input and output are as follows:
|
|
QVQ
QVQ is a visual reasoning model that supports visual input and chain-of-thought output. It demonstrates stronger capabilities in math, programming, visual analysis, creation, and general tasks. Usage | Try it online
Model | Version | Context window | Maximum input | Maximum CoT | Maximum response | Input price | Output price |
(Tokens) | (Million tokens) | ||||||
qvq-max This model provides stronger visual reasoning and instruction-following capabilities than qvq-plus and delivers optimal performance for complex tasks. This model has the same capabilities as qvq-max-2025-03-25. | Stable | 131,072 | 106,496 Maximum of 16,384 per image. | 16,384 | 8,192 | $1.147 | $4.588 |
qvq-max-latest This model always provides the same capabilities as the latest snapshot. | Latest | ||||||
qvq-max-2025-05-15 Also known as qvq-max-0515. | Snapshot | ||||||
qvq-max-2025-03-25 Also known as qvq-max-0325. | |||||||
qvq-plus This model has the same capabilities as qvq-plus-2025-05-15. | Stable | $0.287 | $0.717 | ||||
qvq-plus-latest This model always provides the same capabilities as the latest snapshot. | Latest | ||||||
qvq-plus-2025-05-15 Also known as qvq-plus-0515. | Snapshot |
Qwen-VL
Qwen-VL is a text generation model with visual (image) understanding capabilities. It comes in two series: Qwen-VL-MAX and Qwen-VL-PLUS. It can perform Optical Character Recognition (OCR) and also summarize and reason. For example, it can extract attributes from product photos or solve problems based on exercise diagrams. Usage | API reference | Try it online
Qwen-VL models are billed based on the total number of input and output tokens.
Image token calculation rule: Visual understanding.
Qwen3-VL-Plus
Model | Version | Mode | Context window | Maximum input | Maximum CoT | Maximum output | Input price | Output price | Free quota |
(Tokens) | (1,000 tokens) | ||||||||
qwen3-vl-plus Currently has the same capabilities as qwen3-vl-plus-2025-09-23 | Stable | Thinking | 262,144 | 258,048 Max 16,384 per image | 81,920 | 32,768 | Tiered pricing. For more information, see the notes below the table. | No free quota | |
Non-thinking | 262,144 | 260,096 Max 16,384 per image | - | ||||||
qwen3-vl-plus-2025-09-23 | Snapshot | Thinking | 262,144 | 258,048 Max 16,384 per image | 81,920 | 32,768 | |||
Non-thinking | 262,144 | 260,096 Max 16,384 per image | - |
The qwen3-vl-plus and qwen3-vl-plus-2025-09-23 models use a tiered billing method based on the number of input tokens in each request. The tier ranges are left-open and right-closed. The input and output prices are the same for both the thinking and non-thinking modes.
Number of input tokens | Input price (Million tokens) | Output price (Million tokens) |
0 to 32K | $0.143353 | $1.433525 |
32K to 128K | $0.215029 | $2.150288 |
128K to 256K | $0.430058 | $4.300576 |
Qwen-VL-Max
This is the most powerful model in the Qwen-VL series. The following models belong to the Qwen2.5-VL series.
Model | Version | Context window | Maximum input | Maximum output | Input price | Output price |
(Tokens) | (Million tokens) | |||||
qwen-vl-max Offers further improvements in visual reasoning and instruction following capabilities compared to qwen-vl-plus, delivering optimal performance on more complex tasks. Currently has the same capabilities as qwen-vl-max-2025-08-13 | Stable | 131,072 | 129,024 Max 16,384 per image | 8,192 | $0.23 | $0.574 |
qwen-vl-max-latest Always has the same capabilities as the latest snapshot | Latest | |||||
qwen-vl-max-2025-08-13 Also known as qwen-vl-max-0813 Features comprehensive improvements in visual understanding metrics, with significantly enhanced capabilities in mathematics, reasoning, object detection, and multilingual processing. | Snapshot | |||||
qwen-vl-max-2025-04-08 Also known as qwen-vl-max-0408 Enhanced mathematics and reasoning capabilities | $0.431 | $1.291 | ||||
qwen-vl-max-2025-04-02 Also known as qwen-vl-max-0402 Significantly improves accuracy in solving complex mathematical problems | ||||||
qwen-vl-max-2025-01-25 Also known as qwen-vl-max-0125 Upgraded to the Qwen2.5-VL series. The context is extended to 128k, and the image and video understanding capabilities are significantly enhanced |
Qwen-VL-Plus
The Qwen-VL-Plus model offers a balance between performance and cost. The following models belong to the Qwen2.5-VL series.
Model | Version | Context window | Maximum input | Maximum output | Input price | Output price |
(Tokens) | (Million tokens) | |||||
qwen-vl-plus Currently has the same capabilities as qwen-vl-plus-2025-08-15 | Stable | 131,072 | 129,024 Max 16,384 per image | 8,192 | $0.115 | $0.287 |
qwen-vl-plus-latest Always has the same capabilities as the latest snapshot | Latest | |||||
qwen-vl-plus-2025-08-15 Also known as qwen-vl-plus-0815 Significantly improved capabilities in object detection and localization, and multilingual processing | Snapshot | |||||
qwen-vl-plus-2025-07-10 Also known as qwen-vl-plus-0710 Further improves the ability to understand content from monitoring videos | 32,768 | 30,720 Max 16,384 per image | $0.022 | $0.216 | ||
qwen-vl-plus-2025-05-07 Also known as qwen-vl-plus-0507 Significantly improves the ability to understand mathematics, reasoning, and content from monitoring videos | 131,072 | 129,024 Max 16,384 per image | $0.216 | $0.646 | ||
qwen-vl-plus-2025-01-25 Also known as qwen-vl-plus-0125 Upgraded to the Qwen2.5-VL series. The context is extended to 128k, and the image and video understanding capabilities are significantly enhanced |
Qwen-OCR
Qwen-OCR is a specialized model for text extraction. Compared with the Qwen-VL model, Qwen-OCR is more suitable for extracting text from images of documents, tables, test questions, handwritten notes, and other sources. It recognizes multiple languages, such as English, French, Japanese, Korean, German, Russian, and Italian. Usage | API reference | Try it online
Model | Version | Context window | Maximum input | Maximum output | Unit price for input and output |
(Tokens) | (Million tokens) | ||||
qwen-vl-ocr Provides the same capabilities as qwen-vl-ocr-2025-04-13. | Stable | 34,096 | 30,000 A maximum of 30,000 tokens per image. | 4,096 | $0.717 |
qwen-vl-ocr-latest Always provides the same capabilities as the latest snapshot. | Latest | ||||
qwen-vl-ocr-2025-04-13 Also known as qwen-vl-ocr-0413. Provides significantly improved text recognition, six built-in OCR tasks, and features such as custom prompts and image rotation correction. | Snapshot | ||||
qwen-vl-ocr-2024-10-28 Also known as qwen-vl-ocr-1028. | Snapshot |
Qwen-Math
The Qwen-Math model is a language model specialized for solving mathematical problems. Usage | API reference | Try it online
Model | Version | Context window | Maximum input | Maximum output | Input price | Output price |
(Tokens) | (Million tokens) | |||||
qwen-math-plus Equivalent to qwen-math-plus-2024-09-19. | Stable | 4,096 | 3,072 | 3,072 | $0.574 | $1.721 |
qwen-math-plus-latest Equivalent to the latest snapshot. | Latest | |||||
qwen-math-plus-2024-09-19 Also known as qwen-math-plus-0919. | Snapshot | |||||
qwen-math-plus-2024-08-16 Also known as qwen-math-plus-0816. | ||||||
qwen-math-turbo Equivalent to qwen-math-turbo-2024-09-19. | Stable | $0.287 | $0.861 | |||
qwen-math-turbo-latest Equivalent to the latest snapshot. | Latest | |||||
qwen-math-turbo-2024-09-19 Also known as qwen-math-turbo-0919. | Snapshot |
Qwen-Coder
This is the Qwen code model. The latest Qwen3-Coder-Plus series model is a code generation model based on Qwen3. It has powerful coding agent capabilities, excels at tool calling and environment interaction, and can perform autonomous programming. It combines excellent coding skills with general-purpose abilities. Usage | API reference | Try it online
Model | Version | Context window | Maximum input | Maximum output | Input price | Output price |
(Tokens) | (Million tokens) | |||||
qwen3-coder-plus Currently has the same capabilities as qwen3-coder-plus-2025-07-22 | Stable | 1,000,000 | 997,952 | 65,536 | Tiered pricing applies, see the description below the table. | |
qwen3-coder-plus-2025-09-23 | Snapshot | |||||
qwen3-coder-plus-2025-07-22 | Snapshot | |||||
qwen3-coder-flash Currently has the same capabilities as qwen3-coder-flash-2025-07-28 | Stable | |||||
qwen3-coder-flash-2025-07-28 | Snapshot |
These models use tiered pricing based on the number of input tokens per request (left-open, right-closed intervals).
qwen3-coder-plus series
The prices for qwen3-coder-plus, qwen3-coder-plus-2025-09-23, and qwen3-coder-plus-2025-07-22 are as follows. qwen3-coder-plus supports context cache.
Number of input tokens | Input price (Million tokens) | Output price (Million tokens) |
0–32K | $0.574 | $2.294 |
32K–128K | $0.861 | $3.441 |
128K–256K | $1.434 | $5.735 |
256K–1M | $2.868 | $28.671 |
qwen3-coder-flash series
The prices for qwen3-coder-flash and qwen3-coder-flash-2025-07-28 are as follows. qwen3-coder-flash supports context cache.
Number of input tokens | Input price (Million tokens) | Output price (Million tokens) |
0–32K | $0.144 | $0.574 |
32K–128K | $0.216 | $0.861 |
128K–256K | $0.359 | $1.434 |
256K–1M | $0.717 | $3.584 |
Qwen-MT
This flagship large translation model is a comprehensive upgrade of Qwen 3. It supports translation between 92 languages, including Chinese, English, Japanese, Korean, French, Spanish, German, Thai, Indonesian, Vietnamese, and Arabic. With significantly improved performance and translation quality, the model provides more stable term customization, format retention, and domain-specific prompting capabilities for more accurate and natural translations. Usage | Try online
Model | Context window | Maximum input | Maximum output | Input price | Output price |
(Tokens) | (Million tokens) | ||||
qwen-mt-plus Part of Qwen3-MT | 16,384 | 8,192 | 8,192 | $0.259 | $0.775 |
qwen-mt-turbo Part of Qwen3-MT | $0.101 | $0.280 |
Qwen-ASR
Based on the Qwen multimodal model, Qwen-ASR supports multilingual recognition, singing recognition, customized speech recognition, and noise rejection. Usage
Model | Version | Supported languages | Supported sample rate | Unit price |
qwen3-asr-flash Offers the same capabilities as qwen3-asr-flash-2025-09-08. | Stable | Chinese (Mandarin, Sichuanese, Min Nan, Wu, and Cantonese), English, Japanese, German, Korean, Russian, French, Portuguese, Arabic, Italian, and Spanish | 16 kHz | $0.000032/second |
qwen3-asr-flash-2025-09-08 | Snapshot |
Qwen data mining model
The Qwen data mining model extracts structured information from documents for applications such as data annotation and content moderation. Usage | API reference
Model | Context window | Maximum input | Maximum output | Input price | Output price |
(Tokens) | (Million tokens) | ||||
qwen-doc-turbo | 131,072 | 129,024 | 8,192 | $0.087 | $0.144 |
Qwen deep research model
The Qwen deep research model breaks down complex problems, performs inference and analysis using web searches, and generates research reports. Usage | API reference | Try online
Model | Context window | Maximum input | Maximum output | Input price | Output price | Free quota |
(Tokens) | (Thousand tokens) | |||||
qwen-deep-research | 1,000,000 | 997,952 | 32,768 | $0.007742 | $0.023367 | No free quota |
Text generation - Qwen - Open-source
In model names, `xxb` indicates the number of parameters. For example, `qwen2-72b-instruct` has 72 billion (72B) parameters.
Model Studio supports calls to the open-source editions of Qwen, so you do not need to deploy the models locally. For open-source editions, we recommend using the Qwen3 or Qwen2.5 models.
Qwen3
The qwen3-next-80b-a3b-thinking model, released in September 2025, supports only thinking mode. It features improved instruction-following capabilities compared to the qwen3-235b-a22b-thinking-2507 model, resulting in more concise summary responses.
The qwen3-next-80b-a3b-instruct model, released in September 2025, supports only non-thinking mode. It offers enhanced Chinese understanding, logical reasoning, and text generation capabilities compared to qwen3-235b-a22b-instruct-2507.
The qwen3-235b-a22b-thinking-2507 and qwen3-30b-a3b-thinking-2507 models, released in July 2025 and supporting only the thinking mode, are upgrades to the thinking mode of the qwen3-235b-a22b and qwen3-30b-a3b models.
The qwen3-235b-a22b-instruct-2507 and qwen3-30b-a3b-instruct-2507 models, released in July 2025 and supporting only the non-thinking mode, are upgrades to the non-thinking mode of the qwen3-235b-a22b and qwen3-30b-a3b models.
The Qwen3 model, released in April 2025, supports thinking and non-thinking modes. You can switch between the two modes using the enable_thinking
parameter. In addition, the Qwen3 model features significant enhancements to its capabilities:
Inference capability: In evaluations of math, code, and logical reasoning, it significantly surpasses QwQ and other models of the same size, which reaches the top tier in the industry for its scale.
Human preference capability: Its capabilities for creative writing, role-play, multi-turn conversation, and instruction following have been greatly improved. Its general capabilities significantly exceed those of other models of the same size.
Agent capability: It achieves industry-leading levels in both thinking and non-thinking modes and can accurately call external tools.
Multilingual capability: It supports over 100 languages and dialects, with significant improvements in multilingual translation, instruction understanding, and common-sense reasoning.
Response format: Addresses response format issues from previous versions, such as malformed Markdown, truncated responses, and incorrect `boxed` output.
The Qwen3 open-source model, scheduled for release in April 2025, supports only streaming output in thinking mode.
Thinking Mode | Non-thinking Mode | API Reference
Model | Mode | Context window | Maximum input | Maximum chain-of-thought | Maximum response | Input price | Output price |
(Tokens) | (Million tokens) | ||||||
qwen3-next-80b-a3b-thinking | Thinking only | 131,072 | 126,976 | 81,920 | 32,768 | $0.144 | $1.434 |
qwen3-next-80b-a3b-instruct | Non-thinking only | 129,024 | - | $0.574 | |||
qwen3-235b-a22b-thinking-2507 | Thinking only | 126,976 | 81,920 | $0.287 | $2.868 | ||
qwen3-235b-a22b-instruct-2507 | Non-thinking only | 129,024 | - | $1.147 | |||
qwen3-30b-a3b-thinking-2507 | Thinking only | 126,976 | 81,920 | $0.108 | $1.076 | ||
qwen3-30b-a3b-instruct-2507 | Non-thinking only | 129,024 | - | $0.431 | |||
qwen3-235b-a22b | Non-thinking | 129,024 | - | 16,384 | $0.287 | $1.147 | |
Thinking | 98,304 | 38,912 | $2.868 | ||||
qwen3-32b | Non-thinking | 129,024 | - | $0.287 | $1.147 | ||
Thinking | 98,304 | 38,912 | $2.868 | ||||
qwen3-30b-a3b | Non-thinking | 129,024 | - | $0.108 | $0.431 | ||
Thinking | 98,304 | 38,912 | $1.076 | ||||
qwen3-14b | Non-thinking | 129,024 | - | 8,192 | $0.144 | $0.574 | |
Thinking | 98,304 | 38,912 | $1.434 | ||||
qwen3-8b | Non-thinking | 129,024 | - | $0.072 | $0.287 | ||
Thinking | 98,304 | 38,912 | $0.717 | ||||
qwen3-4b | Non-thinking | 129,024 | - | $0.044 | $0.173 | ||
Thinking | 98,304 | 38,912 | $0.431 | ||||
qwen3-1.7b | Non-thinking | 32,768 | 30,720 | - | $0.173 | ||
Thinking | 28,672 | The sum of this value and the input must not exceed 30,720. | $0.431 | ||||
qwen3-0.6b | Non-thinking | 30,720 | - | $0.173 | |||
Thinking | 28,672 | The combined value of this item and the input cannot exceed 30,720. | $0.431 |
For the Qwen3 model, if thinking mode is enabled but no thinking process is generated, you are charged the non-thinking mode price.
QwQ-Open source
The QwQ reasoning model is trained on the Qwen2.5-32B model and uses reinforcement learning to significantly improve its inference capabilities. The model's performance matches that of the full version of DeepSeek-R1 on core math and code metrics, such as AIME 24/25 and LiveCodeBench, and on general metrics, such as IFEval and LiveBench. Its performance on all metrics significantly exceeds that of DeepSeek-R1-Distill-Qwen-32B, another model based on the Qwen2.5-32B model. Usage | API reference
Model | Context window | Maximum input | Maximum chain-of-thought | Maximum response | Input price | Output price |
(Tokens) | (Million tokens) | |||||
qwq-32b | 131,072 | 98,304 | 32,768 | 8,192 | $0.287 | $0.861 |
QwQ-Preview
The qwq-32b-preview model is an experimental model developed by the Qwen team in 2024. It is designed to enhance AI inference capabilities, particularly in mathematics and programming. For information about the model's limitations, see the official QwQ blog. Usage | API Reference | Try online
Model | Context window | Maximum input | Maximum output | Input price | Output price |
(Tokens) | (Million tokens) | ||||
qwq-32b-preview | 32,768 | 30,720 | 16,384 | $0.287 | $0.861 |
Qwen2.5
Qwen2.5 is a series of Qwen large language models that includes base and instruction-tuned models with parameter sizes ranging from 500 million to 72 billion. Qwen2.5 offers the following improvements over Qwen2:
It is pre-trained on a large-scale dataset of up to 18 trillion tokens.
It has a significantly expanded knowledge base and greatly improved encoding and math abilities.
It has significant improvements in following instructions, generating long text (over 8K tokens), understanding structured data such as tables, and generating structured outputs, especially JSON. The model is more resilient to diverse system prompts, which enhances chatbot role-play and conditional settings.
It supports over 29 languages, including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, and Arabic.
Usage | API reference | Try online
Model name | Context window | Maximum input | Maximum output | Input price | Output price |
(Tokens) | (Million tokens) | ||||
qwen2.5-14b-instruct-1m | 1,000,000 | 1,000,000 | 8,192 | $0.144 | $0.431 |
qwen2.5-7b-instruct-1m | $0.072 | $0.144 | |||
qwen2.5-72b-instruct | 131,072 | 129,024 | $0.574 | $1.721 | |
qwen2.5-32b-instruct | $0.287 | $0.861 | |||
qwen2.5-14b-instruct | $0.144 | $0.431 | |||
qwen2.5-7b-instruct | $0.072 | $0.144 | |||
qwen2.5-3b-instruct | 32,768 | 30,720 | $0.044 | $0.130 | |
qwen2.5-1.5b-instruct | Free for a limited time | ||||
qwen2.5-0.5b-instruct |
QVQ
The qvq-72b-preview model is an experimental model from the Qwen team that focuses on improving visual reasoning, particularly for mathematical inference. For more information about the model's limitations, see the official QVQ blog. Usage | API reference
To have the model output its thinking process before the final answer, you can use the QVQ commercial model.
Model | Context window | Maximum input | Maximum output | Input price | Output price |
(Tokens) | (Million tokens) | ||||
qvq-72b-preview | 32,768 | 16,384 Maximum 16,384 for a single image | 16,384 | $1.721 | $5.161 |
Qwen-Omni
Qwen-Omni is a large multimodal model trained on Qwen2.5. It understands text, image, audio, and video inputs. The model can simultaneously stream text and audio outputs and provides significantly faster multimodal content understanding. Usage | API Reference
Model | Context window | Maximum input | Maximum output |
Tokens | |||
qwen2.5-omni-7b | 32,768 | 30,720 | 2,048 |
Billing for inputs and outputs is as follows:
|
| ||||||||||||||
Billing example: If a request has an input of 1,000 text tokens and 1,000 image tokens, and an output of 1,000 text tokens and 1,000 audio tokens, the total cost is $0.000087 (text input) + $0.000287 (image input) + $0.010895 (audio output). |
Qwen-VL
This is the open-source version of Alibaba Cloud's Qwen-VL. Usage | API reference
The Qwen3-VL model offers significant improvements over Qwen2.5-VL:
Agent interaction: It operates computer or mobile phone interfaces, detects GUI elements, understands features, and invokes tools to perform tasks. It achieves top-tier performance in evaluations such as OS World.
Visual encoding: It generates code from images or videos. This feature can be used to create HTML, CSS, and JS code from design drafts or website screenshots.
Spatial intelligence: It supports 2D and 3D positioning and accurately determines object orientation, perspective changes, and occlusion relationships.
Long video understanding: It understands video content up to 20 minutes long and can pinpoint specific moments with second-level accuracy.
Deep thinking: It excels at capturing details and analyzing causality, achieving top-tier performance in evaluations such as MathVista and MMMU.
OCR: It supports 33 languages and performs more stably in scenarios with complex lighting, blur, or tilt. It also significantly improves the accuracy of recognizing rare characters, ancient script, and technical terms.
Qwen3-VL
Model | Mode | Context window | Maximum input | Maximum CoT | Maximum response | Input price | Output price CoT + responses | Free quota |
(Tokens) | (1,000 tokens) | |||||||
qwen3-vl-30b-a3b-thinking | Thinking only | 131,072 | 126,976 | 81,920 | 32,768 | $0.108 | $1.076 | No free quota |
qwen3-vl-30b-a3b-instruct | Non-thinking only | 129,024 | - | $0.431 | ||||
qwen3-vl-235b-a22b-thinking | Thinking only | 131,072 | 126,976 | 81,920 | 32,768 | $0.286705 | $2.867051 | |
qwen3-vl-235b-a22b-instruct | Non-thinking only | 129,024 | - | $1.146820 |
Qwen2.5-VL
Model | Context window | Maximum input | Maximum output | Input price | Output price |
(Tokens) | (Million tokens) | ||||
qwen2.5-vl-72b-instruct | 131,072 | 129,024 Max 16,384 per image | 8,192 | $2.294 | $6.881 |
qwen2.5-vl-32b-instruct | $1.147 | $3.441 | |||
qwen2.5-vl-7b-instruct | $0.287 | $0.717 | |||
qwen2.5-vl-3b-instruct | $0.173 | $0.517 | |||
qwen2-vl-72b-instruct | 32,768 | 30,720 Max 16,384 per image | 2,048 | $2.294 | $6.881 |
qwen2-vl-7b-instruct | 32,000 | 30,000 Max 16,384 per image | 2,000 | Free for a limited time | |
qwen2-vl-2b-instruct |
Qwen-Math
Qwen2.5-Math, a language model based on Qwen, is designed to solve math problems. It supports Chinese and English and integrates multiple inference methods, such as Chain of Thought (CoT), Program of Thought (PoT), and Tool-Integrated Reasoning (TIR). Usage | API reference | Try online
Model name | Context window | Maximum input | Maximum output | Input price | Output price |
(Tokens) | (Million tokens) | ||||
qwen2.5-math-72b-instruct | 4,096 | 3,072 | 3,072 | $0.574 | $1.721 |
qwen2.5-math-7b-instruct | $0.144 | $0.287 | |||
qwen2.5-math-1.5b-instruct | Free for a limited time |
Qwen-Coder
Qwen-Coder is an open-source code model from Qwen. The latest model, qwen3-coder-480b-a35b-instruct, is a code generation model built on Qwen3. It has powerful agent capabilities for coding and excels at tool calling and environment interaction. The model supports autonomous programming and combines advanced coding skills with general-purpose abilities. Usage | API reference | Try online
Model | Context window | Maximum input | Maximum output | Input price | Output price |
(Tokens) | (Million tokens) | ||||
qwen3-coder-480b-a35b-instruct | 262,144 | 204,800 | 65,536 | Tiered pricing. See the notes below. | |
qwen3-coder-30b-a3b-instruct | |||||
qwen2.5-coder-32b-instruct | 131,072 | 129,024 | 8,192 | $0.287 | $0.861 |
qwen2.5-coder-14b-instruct | |||||
qwen2.5-coder-7b-instruct | $0.144 | $0.287 | |||
qwen2.5-coder-3b-instruct | 32,768 | 30,720 | Limited-time free trial | ||
qwen2.5-coder-1.5b-instruct | |||||
qwen2.5-coder-0.5b-instruct |
Billing for qwen3-coder-480b-a35b-instruct and qwen3-coder-30b-a3b-instruct is tiered based on the number of input tokens per request (left-open, right-closed intervals).
Model | Input tokens | Input price (Million tokens) | Output price (Million tokens) |
qwen3-coder-480b-a35b-instruct | 0-32K | $0.861 | $3.441 |
32K-128K | $1.291 | $5.161 | |
128K-200K | $2.151 | $8.602 | |
qwen3-coder-30b-a3b-instruct | 0-32K | $0.216 | $0.861 |
32K-128K | $0.323 | $1.291 | |
128K-200K | $0.538 | $2.151 |
Text generation - third-party models
DeepSeek
DeepSeek is an LLM series from the DeepSeek company. API reference | Try online
Model | Context window | Maximum input | Maximum CoT | Maximum response | Input price | Output price |
(tokens) | (Million tokens) | |||||
deepseek-v3.1 A 685B full-parameter model. | 65,536 | 32,768 | 98,304 | 131,072 | $0.574 | $1.721 |
deepseek-r1 A 685B full-parameter model. | 16,384 | $2.294 | ||||
deepseek-r1-0528 A 685B full-parameter model. | ||||||
deepseek-v3 A 671B full-parameter model. | 65,536 | 57,344 | Not applicable | 8,192 | $0.287 | $1.147 |
deepseek-r1-distill-qwen-1.5b Based on Qwen2.5-Math-1.5B. | 32,768 | 32,768 | 16,384 | 16,384 | Limited-time free trial | |
deepseek-r1-distill-qwen-7b Based on Qwen2.5-Math-7B. | $0.072 | $0.144 | ||||
deepseek-r1-distill-qwen-14b Based on Qwen2.5-14B. | $0.144 | $0.431 | ||||
deepseek-r1-distill-qwen-32b Based on Qwen2.5-32B. | $0.287 | $0.861 | ||||
deepseek-r1-distill-llama-8b Based on Llama-3.1-8B. | Limited-time free trial | |||||
deepseek-r1-distill-llama-70b Based on Llama-3.3-70B. |
Kimi
Kimi-K2, developed by Moonshot AI, is the first open-source trillion-parameter Mixture of Experts (MoE) model from China. It has 32 billion active parameters and excels at encoding and tool calling. Usage | Try online
Model | Context window | Input price | Output price |
(Tokens) | (Million tokens) | ||
Moonshot-Kimi-K2-Instruct | 131,072 | $0.574 | $2.294 |
Image generation
Qwen text-to-image
This model excels at complex text rendering, especially for the Chinese and English languages. Currently, qwen-image-plus and qwen-image have the same capabilties, but qwen-image-plus has lower price. API reference.
Model | Unit price |
qwen-image-plus | $0.028671 per image |
qwen-image | $0.035 per image |
Input prompt | Output image |
Healing-style hand-drawn poster featuring three puppies playing with a ball on lush green grass, adorned with decorative elements such as birds and stars. The main title “Come Play Ball!” is prominently displayed at the top in bold, blue cartoon font. Below it, the subtitle “Come [Show Off Your Skills]!” appears in green font. A speech bubble adds playful charm with the text: “Hehe, watch me amaze my little friends next!” At the bottom, supplementary text reads: “We get to play ball with our friends again!” The color palette centers on fresh greens and blues, accented with bright pink and yellow tones to highlight a cheerful, childlike atmosphere. |
Qwen image editing
The Qwen image editing model offers a wide range of features for advanced image and text editing. You can perform precise text editing in Chinese and English, adjust colors, enhance details, apply style transfers, add or delete objects, and modify positions and actions. API reference.
Model | Price |
qwen-image-edit | $0.043 per image |
Original image | Change the person's pose to bending over and holding the dog's front paws. | Original image | Replace the words 'HEALTH INSURANCE' on the letter blocks with '明天会更好' (Tomorrow will be better). |
Original image | Replace the polka-dot shirt with a light blue shirt. | Original image | Change the background to Antarctica. |
Original image | Generate a cartoon profile picture of the person. | Original image | Remove the hair from the plate. |
Qwen image translation
The Qwen image translation model translates text in images from 11 languages into Chinese or English, accurately preserving the original layout and content while offering custom features such as glossary definitions, sensitive word filtering, and image entity detection. API reference.
Model | Unit price |
qwen-mt-image | $0.000431/image |
Original image | Japanese |
Portuguese | Arabic |
Wan Text-to-Image
Text-to-Image V2
The V2 series features advanced text-to-image models to generate images from text. API reference | Online Experience
Model | Description | Unit price |
wan2.5-t2i-preview | Wan 2.5 preview removes the single-side limitation, allowing users to freely select image dimensions within the constraints of total pixel area and aspect ratio. | $0.028671/image |
wan2.2-t2i-plus | Wan 2.2 professional edition. Offers comprehensive upgrades in creativity, stability, and realistic textures. | $0.02007/image |
wan2.2-t2i-flash | Wan 2.2 speed edition. Offers comprehensive upgrades in creativity, stability, and realistic textures. | $0.028671/image |
wanx2.1-t2i-plus | Wan 2.1 professional edition. Generates highly detailed images in multiple styles. | $0.028671/image |
wanx2.1-t2i-turbo | Wan 2.1 turbo edition. Generates images quickly in multiple styles. | $0.020070/image |
wanx2.0-t2i-turbo | Wan 2.0 turbo edition. Excels at creating high-texture portraits and creative designs. This model is highly cost-effective. | $0.005735/image |
Scenario 1: Text generation Prompt: Generate a New Year's greeting card with a snowy background, children setting off firecrackers, a snake forming the number 2025, and the text "HAPPY NEW YEAR". Comparison: The v2.2 model generates text more effectively and is ideal for creative designs. | |||
wan2.2-t2i-plus | wanx2.1-t2i-plus | wanx2.1-t2i-turbo | wanx2.0-t2i-turbo |
Scenario 2: Portrait generation Prompt: Chinese girl, round face, looking at the camera, elegant ethnic clothing, commercial photography, outdoor, cinematic lighting, medium close-up shot, delicate light makeup, sharp edges. Effect comparison: The 2.2 model offers improved image stability, while the 2.0 model excels at generating high-quality portraits. Both are excellent choices. | |||
wan2.2-t2i-plus | wanx2.1-t2i-plus | wanx2.1-t2i-turbo | wanx2.0-t2i-turbo |
Wan2.5 general image editing
The Wan2.5 general image editing model supports entity-consistent image editing and multi-image fusion. It accepts text, a single image, or multiple images as input. API reference.
模型名称 | 单价 |
wan2.5-i2i-preview | $0.028671/张 |
Feature | Input example | Output image |
Single-image editing | Replace the floral dress with a vintage-style lace gown that has delicate embroidery on the collar and cuffs. | |
Multi-image fusion | Place the alarm clock from Image 1 next to the vase on the dining table in Image 2. |
Wan 2.1 general image editing
The Wan general image editing model performs diverse image edits based on simple instructions and is suitable for applications such as image outpainting, watermark removal, style transfer, image restoration, and image enhancement .Usage | API reference
Model | Unit price |
wanx2.1-imageedit | $0.020070 per image |
General image editing currently supports the following features:
Feature | Input image | Input prompt | Output image |
Global stylization | Apply a French picture book style. | ||
Local stylization | Change the house to a wooden-plank style. | ||
Instruction-based editing | Change the girl's hair to red. | ||
Inpainting | Input image Masked image (The white area is the mask) | A ceramic rabbit holding a ceramic flower. | Output image |
Text watermark removal | Remove the text from the image. | ||
Outpainting | A green fairy. | ||
Image super resolution | Low-resolution image | Image super resolution. | High-resolution image |
Image colorization | Blue background, yellow leaves. | ||
Line art to image | A minimalist, Nordic-style living room. | ||
Underlay graph | A cartoon character cautiously peeking its head out, looking at a brilliant blue gem inside a channel. |
OutfitAnyone
The OutfitAnyone-Plus Edition offers higher image definition, finer clothing texture details, and better logo restoration than the Basic Edition model. However, the longer image generation time makes this edition suitable for scenarios that are not time-sensitive. API reference | Try online
The OutfitAnyone-Image Parsing service parses model and clothing images for pre-processing and post-processing. API reference
Model | Description | Sample Input | Sample Output |
aitryon-plus | OutfitAnyone-Plus | ||
aitryon-parsing-v1 | Image parsing for OutfitAnyone |
OutfitAnyone billing
Model Service | Model | Unit Price | Discount | Tier |
OutfitAnyone-Plus Edition | aitryon-plus | $0.071677/image | None | None |
OutfitAnyone-Image Parsing | aitryon-parsing-v1 | $0.000574/image | None | None |
Speech synthesis (text-to-speech)
Qwen-TTS
Qwen-TTS, a speech synthesis model from the Qwen series, converts text in Chinese, English, or a mix of both into streaming audio output. Usage | API reference
Model | Version | Context window | Maximum input | Maximum output | Input price | Output price |
(Tokens) | (Million tokens) | |||||
qwen-tts Functionally equivalent to qwen-tts-2025-04-10. | Stable | 8,192 | 512 | 7,680 | $0.230 | $1.434 |
qwen-tts-latest Functionally equivalent to the latest snapshot. | Latest | |||||
qwen-tts-2025-05-22 | Snapshot | |||||
qwen-tts-2025-04-10 |
The audio output is tokenized at a rate of 50 tokens per second. Audio clips shorter than 1 second are also counted as 50 tokens.
CosyVoice
CosyVoice is a next-generation large generative model for speech synthesis developed by Tongyi Lab. Powered by large-scale pre-trained language models, it integrates text understanding with speech generation and supports real-time streaming text-to-speech synthesis. Usage | Try online | Voice list
Model | Unit price |
cosyvoice-v3-plus | $0.286706 per 10,000 characters |
cosyvoice-v3 | $0.0573412 per 10,000 characters |
cosyvoice-v2 | $0.286706 per 10,000 characters |
A Chinese character is counted as two characters, while English letters, punctuation marks, and spaces are each counted as one character.
Speech recognition (speech-to-text) and translation (speech-to-translation)
Paraformer
The Paraformer speech recognition service transcribes only spoken content from audio, and you are billed only for this transcribed content. Therefore, the billable duration is typically shorter than the total length of the audio file. Because the service uses AI to interpret audio, minor transcription errors may occur.
By default, only the first track of a multi-track audio file is transcribed and billed. If you enable multi-track transcription, each track is billed separately based on its duration.
The actual billing duration is specified in the content_duration field of the response.
Audio file recognition
Model | Supported languages | Supported sample rates | Scenarios | Supported audio formats | Unit price |
paraformer-v2 | Mandarin Chinese, Chinese dialects (Cantonese, Wu, Minnan, Northeastern, Gansu, Guizhou, Henan, Hubei, Hunan, Ningxia, Shanxi, Shaanxi, Shandong, Sichuan, Tianjin, Jiangxi, Yunnan, and Shanghainese), English, Japanese, Korean, German, French, and Russian | Any | ApsaraVideo Live | aac, amr, avi, flac, flv, m4a, mkv, mov, mp3, mp4, mpeg, ogg, opus, wav, webm, wma, and wmv | $0.000012/second |
paraformer-8k-v2 | Mandarin Chinese | 8 kHz | Telephony |
Real-time speech recognition
Model | Supported languages | Supported sample rates | Scenarios | Supported audio formats | Unit price |
paraformer-realtime-v2 | Mandarin Chinese, Chinese dialects (Cantonese, Wu, Minnan, Northeastern, Gansu, Guizhou, Henan, Hubei, Hunan, Ningxia, Shanxi, Shaanxi, Shandong, Sichuan, Tianjin, Jiangxi, Yunnan, and Shanghainese), English, Japanese, Korean, German, French, and Russian Supports real-time language switching. | Any | ApsaraVideo Live, online meetings, and other real-time applications. | pcm, wav, mp3, opus, speex, aac, and amr | $0.000035 per second |
paraformer-realtime-8k-v2 | 8 kHz | Telephone customer service and other telephony applications. |
Fun-ASR
Fun-ASR is a speech recognition model in the Tongyi Fun series that supports Chinese (Mandarin, Cantonese), English, Japanese, Thai, Vietnamese, and Indonesian.
Model | Version | Supported languages | Supported sample rates | Scenarios | Supported audio formats | Unit price | Free quota |
fun-asr-mtl Same as fun-asr-2025-08-25 | Stable version | Chinese (Mandarin, Cantonese), English, Japanese, Thai, Vietnamese, and Indonesian | Any | ApsaraVideo Live, phone calls, and conference interpretation | aac, amr, avi, flac, flv, m4a, mkv, mov, mp3, mp4, mpeg, ogg, opus, wav, webm, wma, and wmv | $0.000032/second | None |
fun-asr-mtl-2025-08-25 | Snapshot version |
Video generation: Wan and video editing
Text-to-Video
The Wan text-to-video model generates a video from a single sentence. The resulting video features rich artistic styles and cinematic quality. API reference | Try online
Model | Description | Price |
wan2.5-t2v-preview | Wan 2.5 preview supports automatic dubbing and custom audio files. | 480P: $0.043006/second 720P: $0.086012/second 1080P: $0.143353/second |
wan2.2-t2v-plus | Wan 2.2 professional edition. Significantly improves image detail and motion stability. | 480p: $0.02007/second 1080p: $0.100347/second |
wanx2.1-t2v-turbo | Offers faster generation speed and balanced performance. | $0.034405/second |
wanx2.1-t2v-plus | Offers richer details and higher-quality images. | $0.100347/second |
Example input | Output video |
Prompt: A kitten runs in the moonlight |
Image-to-video: first frame
The Wan image-to-video model uses an input image as the first frame and a prompt to generate a video. The resulting video features rich artistic styles and cinematic quality. API reference | Try online
Model | Description | Price |
wan2.5-i2v-preview | Wan 2.5 preview supports automatic dubbing and custom audio files. | 480P: $0.043006/second 720P: $0.086012/second 1080P: $0.143353/second |
wan2.2-i2v-plus | Wan 2.2 professional edition. It offers significant improvements in image detail and motion stability. | 480p: $0.02007/second 1080p: $0.100347/second |
wanx2.1-i2v-turbo | This model offers a faster generation speed and is more cost-effective. It takes only one-third of the time required by the plus model. | $0.034405/second |
wanx2.1-i2v-plus | This model offers richer details and higher-quality images. | $0.100347/second |
Input example | Output video |
Input prompt: A cat runs on the grass Input image: | Output video: The input image serves as the first frame of the video. The remaining frames are generated based on the prompt. Model: wanx2.1-i2v-turbo. |
Image-to-video: first and last frames
The Wan first-and-last-frame video generation model uses a prompt, a first frame image, and a last frame image to generate a smooth, dynamic video. API reference | Try online
Model | Price |
wanx2.1-kf2v-plus | $0.100347 per second |
Input example | Output video | ||
First frame image | Last frame image | Prompt | |
Realistic style. A small black cat looks up at the sky. The camera starts at eye level and gradually rises to a top-down shot of the cat's curious eyes. |
General video editing
The Wan unified video editing model supports multimodal inputs, such as text, images, and videos. It can perform video generation and general editing tasks. API reference
Model | Price |
wanx2.1-vace-plus | $0.100347 per second |
The unified video editing model supports the following features:
Model feature | Input reference image | Input prompt | Output video |
Multi-image reference | Reference image 1 (reference entity) Reference image 2 (reference background) | A girl walks out from the depths of an ancient, misty forest. She moves with light steps as the camera captures her graceful movements. When she stops to look at the lush woods, a smile of surprise and joy appears on her face. This moment, captured in the interplay of light and shadow, records the wonderful encounter between the girl and nature. | Output video |
Video redrawing | A gentleman drives a black, steampunk-style car decorated with gears and copper pipes. The background features a steam-powered candy factory with retro elements, creating a vintage and fun atmosphere. | ||
Local editing | Input video Input mask image (The white area indicates the editing area) | In a Parisian cafe, a lion in a suit elegantly sips coffee. It holds a coffee cup in one hand and looks content. The cafe is tastefully decorated with soft hues and warm lighting that illuminates the lion. | The content in the editing area is modified according to the prompt. |
Video extension | Input initial video segment (1 second) | A dog wearing sunglasses is skateboarding on the street in a 3D cartoon style. | Output extended video (5 seconds) |
Video frame expansion | An elegant lady passionately plays the violin with a full symphony orchestra behind her. |
Wan - digital human
You can generate a video of a person speaking, singing, or performing with natural movements based on a single character image and an audio file. To use this feature, call the following models in sequence. wan2.2-s2v image detection | wan2.2-s2v video generation
Model | Description | Price |
wan2.2-s2v-detect | Checks if an input image meets requirements, such as definition, a single person, and a frontal view. | $0.000574 per image |
wan2.2-s2v | Generates a dynamic character video from a validated image and an audio clip. | 480p: $0.071677 per second 720p: $0.129018 per second |
Sample input | Output video |
Input image: Input audio: |
AnimateAnyone
You can generate a character action video based on a character image and a character action template. To use this feature directly, call the following three models in sequence. AnimateAnyone image detection API details | AnimateAnyone action template generation | AnimateAnyone video generation API details
Model | Description | Price |
animate-anyone-detect-gen2 | Checks whether the input image meets the requirements. | $0.000574 per image |
animate-anyone-template-gen2 | Extracts character actions from a character motion video and generates an action template. | $0.011469 per second |
animate-anyone-gen2 | Generates a character action video based on a character image and an action template. |
Input: Character image | Input: Action video | Output: Image background | Output: Video background |
The preceding examples were generated by the Tongyi App, which integrates AnimateAnyone.
The output of the AnimateAnyone model contains only video frames and does not include audio.
EMO
You can generate a dynamic portrait video based on a portrait image and a human voice audio file. To use this feature, call the following models in sequence. EMO image detection | EMO video generation
Model | Description | Price |
emo-detect-v1 | Checks whether the input image meets the requirements. This model can be called directly without deployment. | $0.000574 per image |
emo-v1 | Generates a dynamic portrait video. This model can be called directly without deployment. |
|
Inputs: Portrait image and human voice audio file | Outputs: Dynamic portrait video |
Portrait image: Human voice audio: See the video on the right. | Dynamic portrait video: Action style level: Active ("style_level": "active") |
LivePortrait
You can generate a dynamic portrait video from a portrait image and a human voice audio file quickly and in a lightweight manner. Compared to the EMO model, this model offers faster generation and lower prices, but with lower output quality. To use this feature, call the following two models in sequence. LivePortrait image detection | LivePortrait video generation
Model | Description | Price |
liveportrait-detect | Verifies that an input image meets the required specifications. | $0.000574 per image |
liveportrait | Generates a dynamic portrait video. | $0.002868 per second |
Inputs: A portrait image and a voice audio file | Outputs: A dynamic portrait video |
Portrait: Voice audio: See the video on the right. | Portrait video: |
Emoji
You can generate a dynamic facial video from a face image and a preset dynamic face template. This feature can be used in scenarios such as creating emojis and generating video materials. To use this feature, call the following models in sequence. Emoji image detection | Emoji video generation
Model | Description | Price |
emoji-detect-v1 | Checks whether an input image meets specified requirements. | $0.000574 per image |
emoji-v1 | Generates a character expression from a portrait image that matches a specified emoji template. | $0.011469 per second |
Input: Portrait image | Output: Dynamic portrait video |
The template sequence for the "Happy" expression is ("input.driven_id": "mengwa_kaixin"). |
VideoRetalk
This feature uses a character video and a human voice audio file to generate a new video in which the character's lip movements match the input audio. To use this feature, call the following model. API reference
Model | Description | Price |
videoretalk | Generates a new video where a character's lip movements are synchronized with the input audio. | $0.011469 per second |
Video restyling
You can generate videos in different styles that match the semantic description of an input text. You can also use this feature to restyle an input video. API reference
Model | Description | Price | |
video-style-transform | Transforms an input video into various styles, such as Japanese anime and American comics. | 720p | $0.071677 per second |
540p | $0.028671 per second |
Input video | Output video (Japanese anime) |
Text embedding
A text embedding model converts text into a numerical representation used for tasks such as search, clustering, recommendation, and classification. Billing for the model is based on the number of input tokens. Synchronous API details.
Model | Embedding dimensions | Batch size | Maximum tokens per row | Supported languages | Price (Million input tokens) |
text-embedding-v4 Part of the Qwen3-Embedding series | 2,048, 1,536, 1,024 (default), 768, 512, 256, 128, or 64 | 10 | 8,192 | Chinese, English, Spanish, French, Portuguese, Indonesian, Japanese, Korean, German, Russian, more than 100 other major languages, and various programming languages | $0.072 |
Multimodal embedding
Multimodal embedding models transform text, images, or videos into floating-point vectors to enable applications such as video classification, image classification, and image-text retrieval. API reference.
Model | Data type | Embedding dimension | Price | Rate limit |
multimodal-embedding-v1 | float(32) | 1,024 | Free trial | 120 requests per minute (RPM) |
Text classification, extraction, and ranking
Text Rerank
This feature is typically used for semantic retrieval, which sorts documents by their semantic relevance to a query. API reference.
Model | Maximum number of documents | Maximum input tokens per document | Maximum total input tokens | Supported languages | Price (Million input tokens) |
gte-rerank-v2 | 500 | 4,000 | 30,000 | Over 50 languages, including Chinese, English, Japanese, Korean, Thai, Spanish, French, Portuguese, German, Indonesian, and Arabic | $0.115 |
Maximum tokens per Query or Document: A single Query or Document is limited to 4,000 tokens. Input that exceeds this limit is truncated.
Maximum number of Documents: A single request is limited to 500 Documents.
Maximum input tokens: The total number of tokens for all Queries and Documents in a single request is limited to 30,000.
Industry
Intention recognition
The Tongyi intention recognition model quickly and accurately parses user intents and selects the appropriate tools to solve user problems, all within a few hundred milliseconds. API reference | Usage
Model | Context window | Maximum input | Maximum output | Input price | Output price |
(Tokens) | (Million tokens) | ||||
tongyi-intent-detect-v3 | 8,192 | 8,192 | 1,024 | $0.058 | $0.144 |
Role-playing
The Qwen role-playing model is ideal for creating lifelike conversational experiences in various scenarios, such as virtual social interactions, games with non-player characters (NPCs), and emulating intellectual property (IP) characters. It is also well-suited for integration into hardware, toys, and in-vehicle systems. Compared to other Qwen models, this model provides enhanced persona consistency, conversation progression, and empathetic listening. Usage
Model | Context window | Maximum input | Maximum output | Input price | Output price |
(Tokens) | (Million tokens) | ||||
qwen-plus-character | 32,768 | 32,000 | 4,096 | $0.115 | $0.287 |