This topic describes the parameters and interface details of the CosyVoice speech synthesis Python SDK.
This document applies only to the China (Beijing) region. To use the model, you must use an API key from the China (Beijing) region.
User guide: For model introductions and selection recommendations, see Speech synthesis - CosyVoice.
Prerequisites
You have activated Alibaba Cloud Model Studio and obtained an API key. Configure the API key as an environment variable instead of hard coding it in your code to prevent security risks that can result from code exposure.
Note: To grant temporary access to third-party applications or users, or to strictly control high-risk operations such as accessing or deleting sensitive data, we recommend that you use temporary authentication tokens.
Compared with long-term API keys, temporary authentication tokens are more secure. They are valid for only 60 seconds, which makes them suitable for temporary call scenarios and effectively reduces the risk of API key leakage.
Usage: In your code, replace the API key with the temporary authentication token to authenticate.
Text and format limitations for speech synthesis
Text length limits
Non-streaming calls (Synchronous call or Asynchronous call): The text length in a single request cannot exceed 2,000 characters.
Streaming calls: The text length in a single request cannot exceed 2,000 characters, and the total text length cannot exceed 200,000 characters.
Character calculation rules
Chinese character: 2 characters
English letter, number, punctuation mark, or space: 1 character
The content of SSML tags is included when calculating the text length.
Examples:
"你好" → 2 + 2 = 4 characters
"中A文123" → 2 + 1 + 2 + 1 + 1 + 1 = 8 characters
"中文。" → 2 + 2 + 1 = 5 characters
"中 文。" → 2 + 1 + 2 + 1 = 6 characters
"<speak>你好</speak>" → 7 + 2 + 2 + 8 = 19 characters
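The counting rules above can be expressed as a small helper. This is a hypothetical sketch, not part of the SDK; it approximates "Chinese character" with the CJK Unified Ideographs range, which matches all of the examples above:

```python
def billed_length(text: str) -> int:
    """Billed character count: a Chinese character counts as 2, everything else as 1."""
    # Assumption: a "Chinese character" is any CJK Unified Ideograph (U+4E00-U+9FFF).
    # Chinese punctuation such as 。 counts as 1, consistent with the examples above.
    return sum(2 if "\u4e00" <= ch <= "\u9fff" else 1 for ch in text)

print(billed_length("中A文123"))  # 8, matching the example above
```

Note that SSML tags are counted character by character, so `<speak>你好</speak>` is 7 + 4 + 8 = 19 characters.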
Encoding format
Use UTF-8 encoding.
Support for mathematical expressions
The mathematical expression parsing feature is currently available only for the cosyvoice-v2 model. It supports common mathematical expressions found in primary and secondary school, including but not limited to basic arithmetic, algebra, and geometry.
For more information, see Support for LaTeX (Chinese language only).
SSML support
The Speech Synthesis Markup Language (SSML) feature is currently available only for some voices of the cosyvoice-v2 model. You can check the voice list to confirm SSML support. The following conditions must be met:
You must use DashScope SDK 1.23.4 or a later version.
Only synchronous and asynchronous calls (that is, the call method of the SpeechSynthesizer class) are supported. Streaming calls (that is, the streaming_call method of the SpeechSynthesizer class) are not supported.
The usage is the same as for standard speech synthesis. You can pass the text containing SSML to the call method of the SpeechSynthesizer class.
Get started
The SpeechSynthesizer class provides key interfaces for speech synthesis and supports the following call methods:
Synchronous call: After you submit text, the server processes it immediately and returns the complete synthesized audio. This process is blocking, which means the client must wait for the server to finish before proceeding. This method is suitable for short text synthesis.
Asynchronous call: You send the entire text to the server in one request and receive the synthesized audio in real time through callbacks. Sending text in segments is not supported. This method is suitable for short text synthesis scenarios that require low latency.
Streaming call: You can send text to the server incrementally and receive the synthesized audio in real time. You can send long text in segments, and the server begins processing as soon as it receives a portion of the text. This method is suitable for long text synthesis scenarios that require low latency.
Synchronous call
You can submit a single speech synthesis task and obtain the full result at once without using a callback function. No intermediate results are streamed.
Instantiate the SpeechSynthesizer class, bind the request parameters, and call the call method to synthesize the audio and obtain the binary data.
The length of the sent text cannot exceed 2,000 characters. For more information, see the call method of the SpeechSynthesizer class.
You must re-initialize the SpeechSynthesizer instance before each call to the call method.
Asynchronous call
You can submit a single speech synthesis task and stream the intermediate results through a callback. The synthesized result is obtained through the callback functions in ResultCallback.
Instantiate the SpeechSynthesizer class, bind the request parameters and the ResultCallback interface, call the call method to start synthesis, and receive the result in real time through the on_data method of the ResultCallback interface.
The length of the sent text cannot exceed 2,000 characters. For more information, see the call method of the SpeechSynthesizer class.
You must re-initialize the SpeechSynthesizer instance before each call to the call method.
Streaming calls
You can submit text multiple times within a single speech synthesis task and receive the synthesized result in real time through a callback.
For streaming input, you can call streaming_call multiple times to submit text fragments in sequence. The server automatically segments sentences after receiving the fragments:
Complete sentences are synthesized immediately.
Incomplete sentences are buffered until they are complete, then synthesized.
When you call streaming_complete, the server forcibly synthesizes all received but unprocessed text fragments, including incomplete sentences.
The interval between sending text fragments cannot exceed 23 seconds. Otherwise, a "request timeout after 23 seconds" exception is triggered. If there is no more text to send, call streaming_complete promptly to end the task. The server enforces a 23-second timeout that cannot be modified by the client.
Instantiate the SpeechSynthesizer class.
Instantiate the SpeechSynthesizer class and bind the request parameters and the ResultCallback interface.
Stream the text.
Call the streaming_call method of the SpeechSynthesizer class multiple times to send the text to be synthesized to the server in segments. While you send the text, the server returns the synthesized result to the client in real time through the on_data method of the ResultCallback interface. The length of the text fragment sent in each call to the streaming_call method (the text parameter) cannot exceed 2,000 characters, and the total length of all text sent cannot exceed 200,000 characters.
End the process.
Call the streaming_complete method of the SpeechSynthesizer class to end the speech synthesis. This method blocks the current thread until the on_complete or on_error callback of the ResultCallback interface is triggered. After the callback is triggered, the thread is unblocked. Make sure to call this method. Otherwise, the final part of the text may not be converted to speech.
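The fragment-length limits above can be enforced client-side before each streaming_call. The following is a hypothetical helper, not part of the SDK; billed_len uses the same counting rule as the text length limits (a CJK character counts as 2):

```python
def billed_len(ch: str) -> int:
    # A Chinese (CJK) character counts as 2; everything else counts as 1.
    return 2 if "\u4e00" <= ch <= "\u9fff" else 1

def segment_text(text: str, limit: int = 2000) -> list:
    """Greedily pack characters into fragments whose billed length is <= limit."""
    fragments, current, current_len = [], [], 0
    for ch in text:
        width = billed_len(ch)
        if current and current_len + width > limit:
            fragments.append("".join(current))
            current, current_len = [], 0
        current.append(ch)
        current_len += width
    if current:
        fragments.append("".join(current))
    return fragments

print(segment_text("abcd", limit=2))  # ['ab', 'cd']
```

A production version would prefer to split at sentence boundaries so that the server's automatic sentence segmentation works well; this sketch splits purely by length.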
Request parameters
You can set the request parameters through the constructor of the SpeechSynthesizer class.
| Parameter | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| model | str | - | Yes | Specifies the model. Supported values are |
| voice | str | - | Yes | Specifies the voice to use for speech synthesis. The following two types of voices are supported: |
| format | enum | Varies by voice | No | Specifies the audio coding format and sample rate. If you do not specify this parameter, the default for the current voice is used. Note: The default sample rate is the optimal sample rate for the current voice. If this parameter is not set, the output uses this sample rate by default. Downsampling and upsampling are also supported. The following audio coding formats and sample rates are supported: |
| volume | int | 50 | No | The volume of the synthesized audio. Valid values: 0 to 100. Important: This field differs in various versions of the DashScope SDK: |
| speech_rate | float | 1.0 | No | The speech rate of the synthesized audio. Valid values: 0.5 to 2. |
| pitch_rate | float | 1.0 | No | The pitch of the synthesized audio. Valid values: 0.5 to 2. |
| bit_rate | int | 32 | No | Specifies the audio bitrate. Valid values: 6 to 510 kbps. A higher bitrate results in better audio quality and a larger file size. This parameter is available only when the audio format is Opus. |
| callback | ResultCallback | - | No | The callback interface instance for receiving results. See Callback interface (ResultCallback). |
Key interfaces
The SpeechSynthesizer class
SpeechSynthesizer is imported through from dashscope.audio.tts_v2 import * and provides key interfaces for speech synthesis.
| Method | Parameters | Return value | Description |
| --- | --- | --- | --- |
| call | text: the text to be synthesized | Binary audio data if no ResultCallback is bound; otherwise None | Converts a whole piece of text (either plain text or text with SSML) into speech. If you bind a ResultCallback when you create the SpeechSynthesizer instance, the result is instead returned through its callbacks (see Asynchronous call). Important: You must re-initialize the SpeechSynthesizer instance before each call to the call method. |
| streaming_call | text: the text fragment to send | None | Streams the text to be synthesized. Text with SSML is not supported. You can call this interface multiple times to send the text to be synthesized to the server in multiple parts. The synthesis result is obtained through the ResultCallback interface. For usage, see Streaming calls. |
| streaming_complete | Timeout, in milliseconds | None | Ends the streaming speech synthesis. This method blocks the current thread until synthesis finishes or the timeout elapses. By default, if the waiting time exceeds 10 minutes, the waiting stops. For usage, see Streaming calls. Important: For streaming calls, make sure to call this method. Otherwise, parts of the synthesized speech may be missing. |
| get_last_request_id | None | The request_id of the last task | Gets the request_id of the last task. |
| get_first_package_delay | None | First package delay | Gets the first package delay, which is typically around 500 ms. The first package delay is the time between when the text is sent and when the first audio packet is received, measured in milliseconds. Use this method after the task is complete. |
| get_response | None | The last message | Gets the last message, which is in JSON format. You can use this to get task-failed errors. |
Callback interface (ResultCallback)
For asynchronous or streaming calls, the server uses callbacks to return key process information and data to the client. You need to implement the callback methods to handle the information or data returned by the server.
You can import it using from dashscope.audio.tts_v2 import *.
| Method | Parameters | Return value | Description |
| --- | --- | --- | --- |
| on_open | None | None | This method is called immediately after a connection is established with the server. |
| on_event | message | None | This method is called when there is a response from the service. The message parameter contains the service response as a JSON string. |
| on_complete | None | None | This method is called after all synthesized data has been returned (speech synthesis is complete). |
| on_error | message | None | This method is called when an exception occurs. |
| on_data | data: bytes | None | This method is called when the server returns synthesized audio. You can combine the binary audio data into a complete audio file and play it with a player, or play it in real time with a player that supports streaming playback. |
| on_close | None | None | This method is called after the service has closed the connection. |
Response
The server returns binary audio data:
Synchronous call: Process the binary audio data returned by the call method of the SpeechSynthesizer class.
Asynchronous call or streaming call: Process the parameter (bytes type data) of the on_data method of the ResultCallback interface.
Error codes
If you encounter an error, see Error messages for troubleshooting.
If the problem is still not resolved, join the developer group to provide feedback on the problem and include the Request ID for further investigation.
More examples
For more examples, see GitHub.
Voice list
The default supported voices are listed in the following table. If you need a more personalized voice, you can use the voice cloning feature to create a custom voice for free. For more information, see Use a cloned voice for speech synthesis.
cosyvoice-v2 voice list
Important: The following is the cosyvoice-v2 voice list. When using these voices, you must set the model request parameter to cosyvoice-v2. Otherwise, the call will fail.
| Scenario | Voice | Voice characteristics | voice parameter | Language | SSML |
| --- | --- | --- | --- | --- | --- |
| Customer service | Long Yingcui | Serious male | longyingcui | Chinese, English | ✅ |
| Customer service | Long Yingda | Cheerful high-pitched female | longyingda | Chinese, English | ✅ |
| Customer service | Long Yingjing | Low-key and calm female | longyingjing | Chinese, English | ✅ |
| Customer service | Long Yingyan | Righteous and stern female | longyingyan | Chinese, English | ✅ |
| Customer service | Long Yingtian | Gentle and sweet female | longyingtian | Chinese, English | ✅ |
| Customer service | Long Yingbing | Sharp and assertive female | longyingbing | Chinese, English | ✅ |
| Customer service | Long Yingtao | Gentle and calm female | longyingtao | Chinese, English | ✅ |
| Customer service | Long Yingling | Gentle and empathetic female | longyingling | Chinese, English | ✅ |
| Voice assistant | YUMI | Formal young female | longyumi_v2 | Chinese, English | ✅ |
| Voice assistant | Long Xiaochun | Intellectual and positive female | longxiaochun_v2 | Chinese, English | ✅ |
| Voice assistant | Long Xiaoxia | Calm and authoritative female | longxiaoxia_v2 | Chinese, English | ✅ |
| Livestreaming and e-commerce | Long Anran | Lively and textured female | longanran | Chinese, English | ✅ |
| Livestreaming and e-commerce | Long Anxuan | Classic female livestreamer | longanxuan | Chinese, English | ✅ |
| Audiobooks | Long Sanshu | Calm and textured male | longsanshu | Chinese, English | ✅ |
| Audiobooks | Long Xiu | Knowledgeable male storyteller | longxiu_v2 | Chinese, English | ✅ |
| Audiobooks | Long Miao | Rhythmic female | longmiao_v2 | Chinese, English | ✅ |
| Audiobooks | Long Yue | Warm and magnetic female | longyue_v2 | Chinese, English | ✅ |
| Audiobooks | Long Nan | Wise young male | longnan_v2 | Chinese, English | ✅ |
| Audiobooks | Long Yuan | Warm and healing female | longyuan_v2 | Chinese, English | ✅ |
| Social companion | Long Anrou | Gentle best friend female | longanrou | Chinese, English | ✅ |
| Social companion | Long Qiang | Romantic and charming female | longqiang_v2 | Chinese, English | ✅ |
| Social companion | Long Han | Warm and devoted male | longhan_v2 | Chinese, English | ✅ |
| Social companion | Long Xing | Gentle girl-next-door | longxing_v2 | Chinese, English | ✅ |
| Social companion | Long Hua | Energetic and sweet female | longhua_v2 | Chinese, English | ✅ |
| Social companion | Long Wan | Positive and intellectual female | longwan_v2 | Chinese, English | ✅ |
| Social companion | Long Cheng | Intelligent young male | longcheng_v2 | Chinese, English | ✅ |
| Social companion | Long Feifei | Sweet and delicate female | longfeifei_v2 | Chinese, English | ✅ |
| Social companion | Long Xiaocheng | Magnetic low-pitched male | longxiaocheng_v2 | Chinese, English | ✅ |
| Social companion | Long Zhe | Awkward and warm-hearted male | longzhe_v2 | Chinese, English | ✅ |
| Social companion | Long Yan | Warm and gentle female | longyan_v2 | Chinese, English | ✅ |
| Social companion | Long Tian | Magnetic and rational male | longtian_v2 | Chinese, English | ✅ |
| Social companion | Long Ze | Warm and energetic male | longze_v2 | Chinese, English | ✅ |
| Social companion | Long Shao | Positive and upwardly mobile male | longshao_v2 | Chinese, English | ✅ |
| Social companion | Long Hao | Emotional and melancholic male | longhao_v2 | Chinese, English | ✅ |
| Social companion | Long Shen | Talented male singer | kabuleshen_v2 | Chinese, English | ✅ |
| Children's voice | Long Jielidou | Sunny and mischievous male | longjielidou_v2 | Chinese, English | ✅ |
| Children's voice | Long Ling | Childish and stiff female | longling_v2 | Chinese, English | ✅ |
| Children's voice | Long Ke | Innocent and well-behaved female | longke_v2 | Chinese, English | ✅ |
| Children's voice | Long Xian | Bold and cute female | longxian_v2 | Chinese, English | ✅ |
| Dialect | Long Laotie | Straightforward Northeastern dialect male | longlaotie_v2 | Chinese (Northeastern), English | ✅ |
| Dialect | Long Jiayi | Intellectual Cantonese female | longjiayi_v2 | Chinese (Cantonese), English | ✅ |
| Dialect | Long Tao | Positive Cantonese female | longtao_v2 | Chinese (Cantonese), English | ✅ |
| Poetry recitation | Long Fei | Passionate and magnetic male | longfei_v2 | Chinese, English | ✅ |
| Poetry recitation | Li Bai | Ancient male poet | libai_v2 | Chinese, English | ✅ |
| Poetry recitation | Long Jin | Elegant and gentle male | longjin_v2 | Chinese, English | ✅ |
| News broadcast | Long Shu | Calm young male | longshu_v2 | Chinese, English | ✅ |
| News broadcast | Bella2.0 | Precise and capable female | loongbella_v2 | Chinese, English | ✅ |
| News broadcast | Long Shuo | Knowledgeable and capable male | longshuo_v2 | Chinese, English | ✅ |
| News broadcast | Long Xiaobai | Calm female announcer | longxiaobai_v2 | Chinese, English | ✅ |
| News broadcast | Long Jing | Typical female announcer | longjing_v2 | Chinese, English | ✅ |
| News broadcast | loongstella | Confident and crisp female | loongstella_v2 | Chinese, English | ✅ |
| Overseas marketing | loongeva | Intellectual British English female | loongeva_v2 | British English | ❌ |
| Overseas marketing | loongbrian | Calm British English male | loongbrian_v2 | British English | ❌ |
| Overseas marketing | loongluna | British English female | loongluna_v2 | British English | ❌ |
| Overseas marketing | loongluca | British English male | loongluca_v2 | British English | ❌ |
| Overseas marketing | loongemily | British English female | loongemily_v2 | British English | ❌ |
| Overseas marketing | loongeric | British English male | loongeric_v2 | British English | ❌ |
| Overseas marketing | loongabby | American English female | loongabby_v2 | American English | ❌ |
| Overseas marketing | loongannie | American English female | loongannie_v2 | American English | ❌ |
| Overseas marketing | loongandy | American English male | loongandy_v2 | American English | ❌ |
| Overseas marketing | loongava | American English female | loongava_v2 | American English | ❌ |
| Overseas marketing | loongbeth | American English female | loongbeth_v2 | American English | ❌ |
| Overseas marketing | loongbetty | American English female | loongbetty_v2 | American English | ❌ |
| Overseas marketing | loongcindy | American English female | loongcindy_v2 | American English | ❌ |
| Overseas marketing | loongcally | American English female | loongcally_v2 | American English | ❌ |
| Overseas marketing | loongdavid | American English male | loongdavid_v2 | American English | ❌ |
| Overseas marketing | loongdonna | American English female | loongdonna_v2 | American English | ❌ |
| Overseas marketing | loongkyong | Korean female | loongkyong_v2 | Korean | ❌ |
| Overseas marketing | loongtomoka | Japanese female | loongtomoka_v2 | Japanese | ❌ |
| Overseas marketing | loongtomoya | Japanese male | loongtomoya_v2 | Japanese | ❌ |
FAQ
Features, billing, and rate limiting
Q: Where can I find information about CosyVoice's features, billing, and rate limiting?
A: For more information, see Speech synthesis - CosyVoice.
Q: What should I do if the pronunciation is inaccurate?
A: You can use SSML to customize the speech synthesis output.
Q: The current requests per second (RPS) limit cannot meet my business needs. How do I increase it, and is there a charge?
A: Submit a ticket or join the developer group to request an increase. The increase is free of charge.
Q: How do I specify the language of the synthesized speech?
A: You cannot specify the language of the synthesized speech through request parameters. To synthesize speech in a specific language, refer to the voice list and select a voice that supports the desired language.
Troubleshooting
If a code error occurs, see the information in Error codes for troubleshooting.
Q: How do I get the Request ID?
A: You can obtain it in one of the following two ways:
Parse the JSON string message in the on_event method of the ResultCallback interface.
Call the get_last_request_id method of the SpeechSynthesizer class.
Q: Why does the SSML feature fail?
A: You can follow these steps to troubleshoot:
Make sure the current voice supports SSML. Cloned voices do not support SSML.
Make sure the
model
parameter is set tocosyvoice-v2
.Make sure you are using the correct interface. Only the
call
method of the SpeechSynthesizer class supports SSML.Make sure the text to be synthesized is in plain text format and meets the format requirements. For more information, see Introduction to the SSML markup language.
Q: Why won't the audio play?
A: Check the following scenarios one by one:
The audio is saved as a complete file, such as xx.mp3
Audio format consistency: Make sure the audio format set in the request parameters matches the file extension. For example, if the audio format is set to wav but the file extension is mp3, playback may fail.
Player compatibility: Confirm whether the player you are using supports the format and sample rate of the audio file. For example, some players may not support high sample rates or specific audio encodings.
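For the format-consistency check above, one quick diagnostic is to inspect the file's leading magic bytes. The following is a minimal stdlib-only sketch, not part of the SDK; it covers only the WAV and MP3 cases:

```python
def sniff_audio_format(data: bytes) -> str:
    # Illustrative magic-byte check covering only the two most common cases.
    if data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "wav"
    if data[:3] == b"ID3":
        return "mp3"  # MP3 file starting with an ID3v2 tag
    if len(data) >= 2 and data[0] == 0xFF and data[1] & 0xE0 == 0xE0:
        return "mp3"  # raw MPEG audio frame sync header
    return "unknown"
```

If the detected format disagrees with the file extension, rename the file or correct the format request parameter.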
The audio is played in a stream
Save the audio stream as a complete file and try to play it with a player. If the file cannot be played, refer to the troubleshooting methods for scenario 1.
If the file can be played normally, the problem may be with the streaming playback implementation. Confirm whether the player you are using supports streaming playback.
Common tools and libraries that support streaming playback include ffmpeg, pyaudio (Python), AudioFormat (Java), and MediaSource (JavaScript).
Q: Why is the audio playback stuttering?
A: Check the following scenarios one by one:
The audio is saved as a complete file, such as xx.mp3
Please join the developer group and provide the Request ID so we can troubleshoot the issue for you.
The audio is played in a stream
Check the text sending speed: Make sure the interval for sending text is reasonable. Avoid situations where the next sentence is not sent promptly after the previous audio has finished playing.
Check the callback function performance:
Check if there is too much business logic in the callback function, causing it to block.
The callback function runs in the WebSocket thread. If it is blocked, it may affect the WebSocket's ability to receive network packets, leading to stuttering in audio reception.
Write the audio data to a separate audio buffer and then read and process it in another thread to avoid blocking the WebSocket thread.
Check network stability: Make sure the network connection is stable to avoid audio transmission interruptions or delays due to network fluctuations.
Further troubleshooting: If the preceding solutions do not resolve the issue, join the developer group and provide the Request ID so we can investigate the issue further for you.
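The buffering advice above can be sketched with a standard-library queue. Here, on_data is a stand-in for the real callback (the names and the final print are illustrative, not part of the SDK):

```python
import queue
import threading

audio_buffer = queue.Queue()

def on_data(data: bytes) -> None:
    # Runs in the WebSocket thread: enqueue only, never block or do heavy work here.
    audio_buffer.put(data)

def playback_worker(received: list) -> None:
    # Runs in a separate thread: drain the buffer and hand chunks to the player.
    while True:
        chunk = audio_buffer.get()
        if chunk is None:  # sentinel posted when synthesis completes
            break
        received.append(chunk)  # replace with actual streaming playback

received = []
worker = threading.Thread(target=playback_worker, args=(received,))
worker.start()
for piece in (b"\x01\x02", b"\x03"):  # simulated audio chunks from the server
    on_data(piece)
audio_buffer.put(None)  # signal end of synthesis (e.g. from on_complete)
worker.join()
print(b"".join(received))  # b'\x01\x02\x03'
```

This keeps the WebSocket thread free to receive network packets while playback happens elsewhere.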
Q: Why is speech synthesis slow (long synthesis time)?
A: Check the following items:
Check the input interval
For streaming speech synthesis, check if the interval between sending text segments is too long. For example, a delay of several seconds after the previous segment is sent. A long interval will increase the total synthesis time.
Analyze performance metrics
If the first packet delay does not meet the following requirements, submit the Request ID to the technical team for investigation.
First packet delay: Normally around 500 ms.
RTF (RTF = Total synthesis time / Audio duration): Normally around 0.3.
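As a worked example of the RTF formula above (a trivial sketch; timing values are illustrative):

```python
def rtf(total_synthesis_time_s: float, audio_duration_s: float) -> float:
    # RTF = total synthesis time / duration of the synthesized audio.
    return total_synthesis_time_s / audio_duration_s

# Synthesizing 10 s of audio in 3 s gives an RTF of 0.3, in line with the benchmark.
print(rtf(3.0, 10.0))  # 0.3
```

An RTF well above 0.3, or a first packet delay far beyond 500 ms, is worth reporting with the Request ID.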
Q: How do I handle pronunciation errors in the synthesized speech?
A: We recommend the cosyvoice-v2 model, which provides better results and supports SSML.
If you are using cosyvoice-v2, use the SSML <phoneme> tag to specify the correct pronunciation.
Q: Why is no audio returned? Why is the end of the text not converted to speech? (Missing synthesized audio)
A: Check whether you forgot to call the streaming_complete method of the SpeechSynthesizer class. During speech synthesis, the server starts synthesizing only after buffering enough text. If you do not call the streaming_complete method, the text at the end of the buffer may not be synthesized into speech.
Q: How do I resolve an SSL certificate verification failure?
A: You can install the system root certificate:
sudo yum install -y ca-certificates
sudo update-ca-trust enable
Then add the following content to your code:
import os
os.environ["SSL_CERT_FILE"] = "/etc/ssl/certs/ca-bundle.crt"
Q: What causes the "SSL: CERTIFICATE_VERIFY_FAILED" exception on macOS? (websocket closed due to [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1000))
When connecting to a WebSocket, you may encounter an OpenSSL certificate verification failure with a message indicating that the certificate cannot be found. This is usually because the certificate configuration in your Python environment is incorrect. You can manually locate and fix the certificate issue by following these steps:
Export system certificates and set environment variables You can run the following commands to export all certificates from your macOS system to a file and set it as the default certificate path for Python and related libraries:
security find-certificate -a -p > ~/all_mac_certs.pem
export SSL_CERT_FILE=~/all_mac_certs.pem
export REQUESTS_CA_BUNDLE=~/all_mac_certs.pem
Create a symbolic link to fix Python's OpenSSL configuration If Python's OpenSSL configuration is missing certificates, you can create a symbolic link manually. Make sure to replace the path in the command with the actual installation directory of your local Python version:
# 3.9 is an example version number. Adjust the path according to your locally installed Python version.
ln -s /etc/ssl/* /Library/Frameworks/Python.framework/Versions/3.9/etc/openssl
Restart the terminal and clear the cache After completing the above operations, you can close and reopen the terminal to ensure the environment variables take effect. You can clear any possible cache and try connecting to the WebSocket again.
These steps can resolve connection issues caused by incorrect certificate configurations. If the problem persists, check if the target server's certificate configuration is correct.
Q: What causes the "AttributeError: module 'websocket' has no attribute 'WebSocketApp'. Did you mean: 'WebSocket'?" error when running the code?
A: This is because websocket-client is not installed or the version is mismatched. You can run the following commands in sequence:
pip uninstall websocket-client
pip uninstall websocket
pip install websocket-client
Permissions and authentication
Q: I want my API key to be used only for the CosyVoice speech synthesis service and not by other Model Studio models (permission isolation). How can I do this?
A: You can limit the scope of an API key by creating a new workspace and granting it access to only specific models. For more information, see Workspace Management.
More questions
For more questions and answers, see the GitHub QA.