This topic describes the parameters and interface details of the CosyVoice speech synthesis Python SDK.
This document applies only to the China (Beijing) region. To use the model, you must use an API key from the China (Beijing) region.
User guide: For model introductions and selection recommendations, see Speech synthesis - CosyVoice.
Prerequisites
You have activated Alibaba Cloud Model Studio and obtained an API key. Configure the API key as an environment variable instead of hard coding it in your code to prevent security risks that can result from code exposure.
Note: To grant temporary access to third-party applications or users, or to strictly control high-risk operations such as accessing or deleting sensitive data, we recommend that you use temporary authentication tokens.
Compared with long-term API keys, temporary authentication tokens are more secure. They are valid for only 60 seconds, which makes them suitable for temporary call scenarios and effectively reduces the risk of API key leakage.
Usage: In your code, replace the API key with the temporary authentication token to authenticate.
Text and format limitations for speech synthesis
Text length limits
Non-streaming calls (Synchronous call or Asynchronous call): The text length in a single request cannot exceed 2,000 characters.
Streaming calls: The text length in a single request cannot exceed 2,000 characters, and the total text length cannot exceed 200,000 characters.
Character calculation rules
Chinese character: 2 characters
English letter, number, punctuation mark, or space: 1 character
The content of SSML tags is included when calculating the text length.
Examples:
"你好" → 2 + 2 = 4 characters
"中A文123" → 2 + 1 + 2 + 1 + 1 + 1 = 8 characters
"中文。" → 2 + 2 + 1 = 5 characters
"中 文。" → 2 + 1 + 2 + 1 = 6 characters
"<speak>你好</speak>" → 7 + 2 + 2 + 8 = 19 characters
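The counting rules above can be expressed as a small helper. This is a hypothetical sketch, not part of the SDK; it approximates "Chinese character" with the CJK Unified Ideographs range, which matches all of the examples above:

```python
def billed_length(text: str) -> int:
    """Billed character count: a Chinese character counts as 2, everything else as 1."""
    # Assumption: a "Chinese character" is any CJK Unified Ideograph (U+4E00-U+9FFF).
    # Chinese punctuation such as 。 counts as 1, consistent with the examples above.
    return sum(2 if "\u4e00" <= ch <= "\u9fff" else 1 for ch in text)

print(billed_length("中A文123"))  # 8, matching the example above
```

Note that SSML tags are counted character by character, so `<speak>你好</speak>` is 7 + 4 + 8 = 19 characters.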
Encoding format
Use UTF-8 encoding.
Support for mathematical expressions
The mathematical expression parsing feature is currently available only for the cosyvoice-v2 model. It supports common mathematical expressions found in primary and secondary school, including but not limited to basic arithmetic, algebra, and geometry.
For more information, see Support for LaTeX (Chinese language only).
SSML support
The Speech Synthesis Markup Language (SSML) feature is currently available only for some voices of the cosyvoice-v2 model. You can check the voice list to confirm SSML support. The following conditions must be met:
You must use DashScope SDK 1.23.4 or a later version.
Only synchronous and asynchronous calls (that is, the call method of the SpeechSynthesizer class) are supported. Streaming calls (that is, the streaming_call method of the SpeechSynthesizer class) are not supported.
The usage is the same as for standard speech synthesis. You can pass the text containing SSML to the call method of the SpeechSynthesizer class.
Get started
The SpeechSynthesizer class provides key interfaces for speech synthesis and supports the following call methods:
Synchronous call: After you submit text, the server processes it immediately and returns the complete synthesized audio. This process is blocking, which means the client must wait for the server to finish before proceeding. This method is suitable for short text synthesis.
Asynchronous call: You send the entire text to the server in one request and receive the synthesized audio in real time through callbacks. Sending text in segments is not supported. This method is suitable for short text synthesis scenarios that require low latency.
Streaming call: You can send text to the server incrementally and receive the synthesized audio in real time. You can send long text in segments, and the server begins processing as soon as it receives a portion of the text. This method is suitable for long text synthesis scenarios that require low latency.
Synchronous call
You can submit a single speech synthesis task and obtain the full result at once without using a callback function. No intermediate results are streamed.
Instantiate the SpeechSynthesizer class, bind the request parameters, and call the call method to synthesize the audio and obtain the binary data.
The length of the sent text cannot exceed 2,000 characters. For more information, see the call method of the SpeechSynthesizer class.
You must re-initialize the SpeechSynthesizer instance before each call to the call method.
Asynchronous call
You can submit a single speech synthesis task and stream the intermediate results through a callback. The synthesized result is obtained through the callback functions in ResultCallback.
Instantiate the SpeechSynthesizer class, bind the request parameters and the ResultCallback interface, call the call method to start synthesis, and receive the result in real time through the on_data method of the ResultCallback interface.
The length of the sent text cannot exceed 2,000 characters. For more information, see the call method of the SpeechSynthesizer class.
You must re-initialize the SpeechSynthesizer instance before each call to the call method.
Streaming calls
You can submit text multiple times within a single speech synthesis task and receive the synthesized result in real time through a callback.
For streaming input, you can call streaming_call multiple times to submit text fragments in sequence. The server automatically segments sentences after receiving the fragments:
Complete sentences are synthesized immediately.
Incomplete sentences are buffered until they are complete, then synthesized.
When you call streaming_complete, the server forcibly synthesizes all received but unprocessed text fragments, including incomplete sentences.
The interval between sending text fragments cannot exceed 23 seconds. Otherwise, a "request timeout after 23 seconds" exception is triggered. If there is no more text to send, call streaming_complete promptly to end the task. The server enforces a 23-second timeout that cannot be modified by the client.
Instantiate the SpeechSynthesizer class.
Instantiate the SpeechSynthesizer class and bind the request parameters and the ResultCallback interface.
Stream the text.
Call the streaming_call method of the SpeechSynthesizer class multiple times to send the text to be synthesized to the server in segments. While you send the text, the server returns the synthesized result to the client in real time through the on_data method of the ResultCallback interface. The length of the text fragment sent in each call to the streaming_call method (the text parameter) cannot exceed 2,000 characters, and the total length of all text sent cannot exceed 200,000 characters.
End the process.
Call the streaming_complete method of the SpeechSynthesizer class to end the speech synthesis. This method blocks the current thread until the on_complete or on_error callback of the ResultCallback interface is triggered. After the callback is triggered, the thread is unblocked. Make sure to call this method. Otherwise, the final part of the text may not be converted to speech.
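The fragment-length limits above can be enforced client-side before each streaming_call. The following is a hypothetical helper, not part of the SDK; billed_len uses the same counting rule as the text length limits (a CJK character counts as 2):

```python
def billed_len(ch: str) -> int:
    # A Chinese (CJK) character counts as 2; everything else counts as 1.
    return 2 if "\u4e00" <= ch <= "\u9fff" else 1

def segment_text(text: str, limit: int = 2000) -> list:
    """Greedily pack characters into fragments whose billed length is <= limit."""
    fragments, current, current_len = [], [], 0
    for ch in text:
        width = billed_len(ch)
        if current and current_len + width > limit:
            fragments.append("".join(current))
            current, current_len = [], 0
        current.append(ch)
        current_len += width
    if current:
        fragments.append("".join(current))
    return fragments

print(segment_text("abcd", limit=2))  # ['ab', 'cd']
```

A production version would prefer to split at sentence boundaries so that the server's automatic sentence segmentation works well; this sketch splits purely by length.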
Request parameters
You can set the request parameters through the constructor of the SpeechSynthesizer class.
| Parameter | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| model | str | - | Yes | Specifies the model. Supported values are |
| voice | str | - | Yes | Specifies the voice to use for speech synthesis. The following two types of voices are supported: |
| format | enum | Varies by voice | No | Specifies the audio coding format and sample rate. If you do not specify this parameter, the default for the current voice is used. Note: The default sample rate is the optimal sample rate for the current voice. If this parameter is not set, the output uses this sample rate by default. Downsampling and upsampling are also supported. The following audio coding formats and sample rates are supported: |
| volume | int | 50 | No | The volume of the synthesized audio. Valid values: 0 to 100. Important: This field differs in various versions of the DashScope SDK: |
| speech_rate | float | 1.0 | No | The speech rate of the synthesized audio. Valid values: 0.5 to 2. |
| pitch_rate | float | 1.0 | No | The pitch of the synthesized audio. Valid values: 0.5 to 2. |
| bit_rate | int | 32 | No | Specifies the audio bitrate. Valid values: 6 to 510 kbps. A higher bitrate results in better audio quality and a larger file size. This parameter is available only when the audio format is Opus. |
| callback | ResultCallback | - | No | The callback interface instance for receiving results. See Callback interface (ResultCallback). |
Key interfaces
The SpeechSynthesizer class
SpeechSynthesizer is imported through from dashscope.audio.tts_v2 import * and provides key interfaces for speech synthesis.
| Method | Parameters | Return value | Description |
| --- | --- | --- | --- |
| call | text: the text to be synthesized | Binary audio data if no ResultCallback is bound; otherwise None | Converts a whole piece of text (either plain text or text with SSML) into speech. If you bind a ResultCallback when you create the SpeechSynthesizer instance, the result is instead returned through its callbacks (see Asynchronous call). Important: You must re-initialize the SpeechSynthesizer instance before each call to the call method. |
| streaming_call | text: the text fragment to send | None | Streams the text to be synthesized. Text with SSML is not supported. You can call this interface multiple times to send the text to be synthesized to the server in multiple parts. The synthesis result is obtained through the ResultCallback interface. For usage, see Streaming calls. |
| streaming_complete | Timeout, in milliseconds | None | Ends the streaming speech synthesis. This method blocks the current thread until synthesis finishes or the timeout elapses. By default, if the waiting time exceeds 10 minutes, the waiting stops. For usage, see Streaming calls. Important: For streaming calls, make sure to call this method. Otherwise, parts of the synthesized speech may be missing. |
| get_last_request_id | None | The request_id of the last task | Gets the request_id of the last task. |
| get_first_package_delay | None | First package delay | Gets the first package delay, which is typically around 500 ms. The first package delay is the time between when the text is sent and when the first audio packet is received, measured in milliseconds. Use this method after the task is complete. |
| get_response | None | The last message | Gets the last message, which is in JSON format. You can use this to get task-failed errors. |
Callback interface (ResultCallback)
For asynchronous or streaming calls, the server uses callbacks to return key process information and data to the client. You need to implement the callback methods to handle the information or data returned by the server.
You can import it using from dashscope.audio.tts_v2 import *.
| Method | Parameters | Return value | Description |
| --- | --- | --- | --- |
| on_open | None | None | This method is called immediately after a connection is established with the server. |
| on_event | message | None | This method is called when there is a response from the service. The message parameter contains the service response as a JSON string. |
| on_complete | None | None | This method is called after all synthesized data has been returned (speech synthesis is complete). |
| on_error | message | None | This method is called when an exception occurs. |
| on_data | data: bytes | None | This method is called when the server returns synthesized audio. You can combine the binary audio data into a complete audio file and play it with a player, or play it in real time with a player that supports streaming playback. |
| on_close | None | None | This method is called after the service has closed the connection. |
Response
The server returns binary audio data:
Synchronous call: Process the binary audio data returned by the call method of the SpeechSynthesizer class.
Asynchronous call or streaming call: Process the parameter (bytes type data) of the on_data method of the ResultCallback interface.
Error codes
If you encounter an error, see Error messages for troubleshooting.
If the problem is still not resolved, join the developer group to provide feedback on the problem and include the Request ID for further investigation.
More examples
For more examples, see GitHub.
Voice list
The default supported voices are listed in the following table. If you need a more personalized voice, you can use the voice cloning feature to create a custom voice for free. For more information, see Use a cloned voice for speech synthesis.
cosyvoice-v2 voice list
Important: The following is the cosyvoice-v2 voice list. When using these voices, you must set the model request parameter to cosyvoice-v2. Otherwise, the call will fail.
| Scenario | Voice | Voice characteristics | voice parameter | Language | SSML |
| --- | --- | --- | --- | --- | --- |
| Customer service | Long Yingcui | Serious male | longyingcui | Chinese, English | ✅ |
| Customer service | Long Yingda | Cheerful high-pitched female | longyingda | Chinese, English | ✅ |
| Customer service | Long Yingjing | Low-key and calm female | longyingjing | Chinese, English | ✅ |
| Customer service | Long Yingyan | Righteous and stern female | longyingyan | Chinese, English | ✅ |
| Customer service | Long Yingtian | Gentle and sweet female | longyingtian | Chinese, English | ✅ |
| Customer service | Long Yingbing | Sharp and assertive female | longyingbing | Chinese, English | ✅ |
| Customer service | Long Yingtao | Gentle and calm female | longyingtao | Chinese, English | ✅ |
| Customer service | Long Yingling | Gentle and empathetic female | longyingling | Chinese, English | ✅ |
| Voice assistant | YUMI | Formal young female | longyumi_v2 | Chinese, English | ✅ |
| Voice assistant | Long Xiaochun | Intellectual and positive female | longxiaochun_v2 | Chinese, English | ✅ |
| Voice assistant | Long Xiaoxia | Calm and authoritative female | longxiaoxia_v2 | Chinese, English | ✅ |
| Livestreaming and e-commerce | Long Anran | Lively and textured female | longanran | Chinese, English | ✅ |
| Livestreaming and e-commerce | Long Anxuan | Classic female livestreamer | longanxuan | Chinese, English | ✅ |
| Audiobooks | Long Sanshu | Calm and textured male | longsanshu | Chinese, English | ✅ |
| Audiobooks | Long Xiu | Knowledgeable male storyteller | longxiu_v2 | Chinese, English | ✅ |
| Audiobooks | Long Miao | Rhythmic female | longmiao_v2 | Chinese, English | ✅ |
| Audiobooks | Long Yue | Warm and magnetic female | longyue_v2 | Chinese, English | ✅ |
| Audiobooks | Long Nan | Wise young male | longnan_v2 | Chinese, English | ✅ |
| Audiobooks | Long Yuan | Warm and healing female | longyuan_v2 | Chinese, English | ✅ |
| Social companion | Long Anrou | Gentle best friend female | longanrou | Chinese, English | ✅ |
| Social companion | Long Qiang | Romantic and charming female | longqiang_v2 | Chinese, English | ✅ |
| Social companion | Long Han | Warm and devoted male | longhan_v2 | Chinese, English | ✅ |
| Social companion | Long Xing | Gentle girl-next-door | longxing_v2 | Chinese, English | ✅ |
| Social companion | Long Hua | Energetic and sweet female | longhua_v2 | Chinese, English | ✅ |
| Social companion | Long Wan | Positive and intellectual female | longwan_v2 | Chinese, English | ✅ |
| Social companion | Long Cheng | Intelligent young male | longcheng_v2 | Chinese, English | ✅ |
| Social companion | Long Feifei | Sweet and delicate female | longfeifei_v2 | Chinese, English | ✅ |
| Social companion | Long Xiaocheng | Magnetic low-pitched male | longxiaocheng_v2 | Chinese, English | ✅ |
| Social companion | Long Zhe | Awkward and warm-hearted male | longzhe_v2 | Chinese, English | ✅ |
| Social companion | Long Yan | Warm and gentle female | longyan_v2 | Chinese, English | ✅ |
| Social companion | Long Tian | Magnetic and rational male | longtian_v2 | Chinese, English | ✅ |
| Social companion | Long Ze | Warm and energetic male | longze_v2 | Chinese, English | ✅ |
| Social companion | Long Shao | Positive and upwardly mobile male | longshao_v2 | Chinese, English | ✅ |
| Social companion | Long Hao | Emotional and melancholic male | longhao_v2 | Chinese, English | ✅ |
| Social companion | Long Shen | Talented male singer | kabuleshen_v2 | Chinese, English | ✅ |
| Children's voice | Long Jielidou | Sunny and mischievous male | longjielidou_v2 | Chinese, English | ✅ |
| Children's voice | Long Ling | Childish and stiff female | longling_v2 | Chinese, English | ✅ |
| Children's voice | Long Ke | Innocent and well-behaved female | longke_v2 | Chinese, English | ✅ |
| Children's voice | Long Xian | Bold and cute female | longxian_v2 | Chinese, English | ✅ |
| Dialect | Long Laotie | Straightforward Northeastern dialect male | longlaotie_v2 | Chinese (Northeastern), English | ✅ |
| Dialect | Long Jiayi | Intellectual Cantonese female | longjiayi_v2 | Chinese (Cantonese), English | ✅ |
| Dialect | Long Tao | Positive Cantonese female | longtao_v2 | Chinese (Cantonese), English | ✅ |
| Poetry recitation | Long Fei | Passionate and magnetic male | longfei_v2 | Chinese, English | ✅ |
| Poetry recitation | Li Bai | Ancient male poet | libai_v2 | Chinese, English | ✅ |
| Poetry recitation | Long Jin | Elegant and gentle male | longjin_v2 | Chinese, English | ✅ |
| News broadcast | Long Shu | Calm young male | longshu_v2 | Chinese, English | ✅ |
| News broadcast | Bella2.0 | Precise and capable female | loongbella_v2 | Chinese, English | ✅ |
| News broadcast | Long Shuo | Knowledgeable and capable male | longshuo_v2 | Chinese, English | ✅ |
| News broadcast | Long Xiaobai | Calm female announcer | longxiaobai_v2 | Chinese, English | ✅ |
| News broadcast | Long Jing | Typical female announcer | longjing_v2 | Chinese, English | ✅ |
| News broadcast | loongstella | Confident and crisp female | loongstella_v2 | Chinese, English | ✅ |
| Overseas marketing | loongeva | Intellectual British English female | loongeva_v2 | British English | ❌ |
| Overseas marketing | loongbrian | Calm British English male | loongbrian_v2 | British English | ❌ |
| Overseas marketing | loongluna | British English female | loongluna_v2 | British English | ❌ |
| Overseas marketing | loongluca | British English male | loongluca_v2 | British English | ❌ |
| Overseas marketing | loongemily | British English female | loongemily_v2 | British English | ❌ |
| Overseas marketing | loongeric | British English male | loongeric_v2 | British English | ❌ |
| Overseas marketing | loongabby | American English female | loongabby_v2 | American English | ❌ |
| Overseas marketing | loongannie | American English female | loongannie_v2 | American English | ❌ |
| Overseas marketing | loongandy | American English male | loongandy_v2 | American English | ❌ |
| Overseas marketing | loongava | American English female | loongava_v2 | American English | ❌ |
| Overseas marketing | loongbeth | American English female | loongbeth_v2 | American English | ❌ |
| Overseas marketing | loongbetty | American English female | loongbetty_v2 | American English | ❌ |
| Overseas marketing | loongcindy | American English female | loongcindy_v2 | American English | ❌ |
| Overseas marketing | loongcally | American English female | loongcally_v2 | American English | ❌ |
| Overseas marketing | loongdavid | American English male | loongdavid_v2 | American English | ❌ |
| Overseas marketing | loongdonna | American English female | loongdonna_v2 | American English | ❌ |
| Overseas marketing | loongkyong | Korean female | loongkyong_v2 | Korean | ❌ |
| Overseas marketing | loongtomoka | Japanese female | loongtomoka_v2 | Japanese | ❌ |
| Overseas marketing | loongtomoya | Japanese male | loongtomoya_v2 | Japanese | ❌ |
FAQ
Features, billing, and rate limiting
Q: Where can I find information about CosyVoice's features, billing, and rate limiting?
A: For more information, see Speech synthesis - CosyVoice.
Q: What should I do if the pronunciation is inaccurate?
A: You can use SSML to customize the speech synthesis output.
Q: The current requests per second (RPS) limit cannot meet my business needs. How do I increase it, and is there a charge?
A: Submit a ticket or join the developer group to request an increase. The increase is free of charge.
Q: How do I specify the language of the synthesized speech?
A: You cannot specify the language of the synthesized speech through request parameters. To synthesize speech in a specific language, refer to the voice list and select a voice that supports the desired language.
Troubleshooting
If a code error occurs, see the information in Error codes for troubleshooting.
Q: How do I get the Request ID?
A: You can obtain it in one of the following two ways:
Parse the JSON string message in the on_event method of the ResultCallback interface.
Call the get_last_request_id method of the SpeechSynthesizer class.
Q: Why does the SSML feature fail?
A: You can follow these steps to troubleshoot:
Make sure the current voice supports SSML. Cloned voices do not support SSML.
Make sure the
model
parameter is set tocosyvoice-v2
.Make sure you are using the correct interface. Only the
call
method of the SpeechSynthesizer class supports SSML.Make sure the text to be synthesized is in plain text format and meets the format requirements. For more information, see Introduction to the SSML markup language.
Q: Why won't the audio play?
A: Check the following scenarios one by one:
The audio is saved as a complete file, such as xx.mp3
Audio format consistency: Make sure the audio format set in the request parameters matches the file extension. For example, if the audio format is set to wav but the file extension is mp3, playback may fail.
Player compatibility: Confirm whether the player you are using supports the format and sample rate of the audio file. For example, some players may not support high sample rates or specific audio encodings.
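For the format-consistency check above, one quick diagnostic is to inspect the file's leading magic bytes. The following is a minimal stdlib-only sketch, not part of the SDK; it covers only the WAV and MP3 cases:

```python
def sniff_audio_format(data: bytes) -> str:
    # Illustrative magic-byte check covering only the two most common cases.
    if data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "wav"
    if data[:3] == b"ID3":
        return "mp3"  # MP3 file starting with an ID3v2 tag
    if len(data) >= 2 and data[0] == 0xFF and data[1] & 0xE0 == 0xE0:
        return "mp3"  # raw MPEG audio frame sync header
    return "unknown"
```

If the detected format disagrees with the file extension, rename the file or correct the format request parameter.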
The audio is played in a stream
Save the audio stream as a complete file and try to play it with a player. If the file cannot be played, refer to the troubleshooting methods for scenario 1.
If the file can be played normally, the problem may be with the streaming playback implementation. Confirm whether the player you are using supports streaming playback.
Common tools and libraries that support streaming playback include ffmpeg, pyaudio (Python), AudioFormat (Java), and MediaSource (JavaScript).
Q: Why is the audio playback stuttering?
A: Check the following scenarios one by one:
The audio is saved as a complete file, such as xx.mp3
Please join the developer group and provide the Request ID so we can troubleshoot the issue for you.
The audio is played in a stream
Check the text sending speed: Make sure the interval for sending text is reasonable. Avoid situations where the next sentence is not sent promptly after the previous audio has finished playing.
Check the callback function performance:
Check if there is too much business logic in the callback function, causing it to block.
The callback function runs in the WebSocket thread. If it is blocked, it may affect the WebSocket's ability to receive network packets, leading to stuttering in audio reception.
Write the audio data to a separate audio buffer and then read and process it in another thread to avoid blocking the WebSocket thread.
Check network stability: Make sure the network connection is stable to avoid audio transmission interruptions or delays due to network fluctuations.
Further troubleshooting: If the preceding solutions do not resolve the issue, join the developer group and provide the Request ID so we can investigate the issue further for you.
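The buffering advice above can be sketched with a standard-library queue. Here, on_data is a stand-in for the real callback (the names and the final print are illustrative, not part of the SDK):

```python
import queue
import threading

audio_buffer = queue.Queue()

def on_data(data: bytes) -> None:
    # Runs in the WebSocket thread: enqueue only, never block or do heavy work here.
    audio_buffer.put(data)

def playback_worker(received: list) -> None:
    # Runs in a separate thread: drain the buffer and hand chunks to the player.
    while True:
        chunk = audio_buffer.get()
        if chunk is None:  # sentinel posted when synthesis completes
            break
        received.append(chunk)  # replace with actual streaming playback

received = []
worker = threading.Thread(target=playback_worker, args=(received,))
worker.start()
for piece in (b"\x01\x02", b"\x03"):  # simulated audio chunks from the server
    on_data(piece)
audio_buffer.put(None)  # signal end of synthesis (e.g. from on_complete)
worker.join()
print(b"".join(received))  # b'\x01\x02\x03'
```

This keeps the WebSocket thread free to receive network packets while playback happens elsewhere.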
Q: Why is speech synthesis slow (long synthesis time)?
A: Check the following items:
Check the input interval
For streaming speech synthesis, check if the interval between sending text segments is too long. For example, a delay of several seconds after the previous segment is sent. A long interval will increase the total synthesis time.
Analyze performance metrics
If the first packet delay does not meet the following requirements, submit the Request ID to the technical team for investigation.
First packet delay: Normally around 500 ms.
RTF (RTF = Total synthesis time / Audio duration): Normally around 0.3.
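As a worked example of the RTF formula above (a trivial sketch; timing values are illustrative):

```python
def rtf(total_synthesis_time_s: float, audio_duration_s: float) -> float:
    # RTF = total synthesis time / duration of the synthesized audio.
    return total_synthesis_time_s / audio_duration_s

# Synthesizing 10 s of audio in 3 s gives an RTF of 0.3, in line with the benchmark.
print(rtf(3.0, 10.0))  # 0.3
```

An RTF well above 0.3, or a first packet delay far beyond 500 ms, is worth reporting with the Request ID.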
Q: How do I handle pronunciation errors in the synthesized speech?
A: We recommend the cosyvoice-v2 model, which provides better results and supports SSML.
If you are using cosyvoice-v2, use the SSML <phoneme> tag to specify the correct pronunciation.
Q: Why is no audio returned? Why is the end of the text not converted to speech? (Missing synthesized audio)
A: Check whether you forgot to call the streaming_complete method of the SpeechSynthesizer class. During speech synthesis, the server starts synthesizing only after buffering enough text. If you do not call the streaming_complete method, the text at the end of the buffer may not be synthesized into speech.
Q: How do I resolve an SSL certificate verification failure?
A: You can install the system root certificate:
sudo yum install -y ca-certificates
sudo update-ca-trust enable
Then add the following content to your code:
import os
os.environ["SSL_CERT_FILE"] = "/etc/ssl/certs/ca-bundle.crt"
Q: What causes the "SSL: CERTIFICATE_VERIFY_FAILED" exception on macOS? (websocket closed due to [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1000))
When connecting to a WebSocket, you may encounter an OpenSSL certificate verification failure with a message indicating that the certificate cannot be found. This is usually because the certificate configuration in your Python environment is incorrect. You can manually locate and fix the certificate issue by following these steps:
Export system certificates and set environment variables You can run the following commands to export all certificates from your macOS system to a file and set it as the default certificate path for Python and related libraries:
security find-certificate -a -p > ~/all_mac_certs.pem
export SSL_CERT_FILE=~/all_mac_certs.pem
export REQUESTS_CA_BUNDLE=~/all_mac_certs.pem
Create a symbolic link to fix Python's OpenSSL configuration If Python's OpenSSL configuration is missing certificates, you can create a symbolic link manually. Make sure to replace the path in the command with the actual installation directory of your local Python version:
# 3.9 is an example version number. Adjust the path according to your locally installed Python version.
ln -s /etc/ssl/* /Library/Frameworks/Python.framework/Versions/3.9/etc/openssl
Restart the terminal and clear the cache After completing the above operations, you can close and reopen the terminal to ensure the environment variables take effect. You can clear any possible cache and try connecting to the WebSocket again.
These steps can resolve connection issues caused by incorrect certificate configurations. If the problem persists, check if the target server's certificate configuration is correct.
Q: What causes the "AttributeError: module 'websocket' has no attribute 'WebSocketApp'. Did you mean: 'WebSocket'?" error when running the code?
A: This is because websocket-client is not installed or the version is mismatched. You can run the following commands in sequence:
pip uninstall websocket-client
pip uninstall websocket
pip install websocket-client
Permissions and authentication
Q: I want my API key to be used only for the CosyVoice speech synthesis service and not by other Model Studio models (permission isolation). How can I do this?
A: You can limit the scope of an API key by creating a new workspace and granting it access to only specific models. For more information, see Workspace Management.
More questions
For more questions and answers, see the GitHub QA.