Alibaba Cloud Model Studio: CosyVoice speech synthesis Python SDK

Last Updated: Aug 21, 2025

This topic describes the parameters and interface details of the CosyVoice speech synthesis Python SDK.

Important

This document applies only to the China (Beijing) region. To use the model, you must use an API key from the China (Beijing) region.

User guide: For model introductions and selection recommendations, see Speech synthesis - CosyVoice.

Prerequisites

  • You have activated Alibaba Cloud Model Studio and obtained an API key. Configure the API key as an environment variable instead of hard coding it in your code to prevent security risks that can result from code exposure.

    Note

    When you need to provide temporary access permissions to third-party applications or users, or want to strictly control high-risk operations such as accessing or deleting sensitive data, we recommend that you use temporary authentication tokens.

    Compared with long-term API keys, temporary authentication tokens provide higher security. They have shorter validity periods of 60 seconds, which makes them suitable for temporary call scenarios and effectively reduces the risk of API key leakage.

    Usage: In your code, replace the API key with the temporary authentication token to perform authentication.

  • Install the latest version of the DashScope SDK.
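As a quick sanity check before running the examples in this topic, you can verify that the key is visible to the SDK. The DASHSCOPE_API_KEY environment variable is the one the SDK reads automatically:

import os

# The DashScope SDK falls back to this environment variable when
# dashscope.api_key is not set in code.
assert os.getenv("DASHSCOPE_API_KEY"), "Set the DASHSCOPE_API_KEY environment variable first."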

Text and format limitations for speech synthesis

Text length limits

  • Non-streaming calls (Synchronous call or Asynchronous call): The text length in a single request cannot exceed 2,000 characters.

  • Streaming calls: The text length in a single request cannot exceed 2,000 characters, and the total text length cannot exceed 200,000 characters.

Character calculation rules

  • Chinese character: 2 characters

  • English letter, number, punctuation mark, or space: 1 character

  • The content of SSML tags is included when calculating the text length.

  • Examples:

    • "你好" → 2 + 2 = 4 characters

    • "中A文123" → 2 + 1 + 2 + 1 + 1 + 1 = 8 characters

    • "中文。" → 2 + 2 + 1 = 5 characters

    • "中 文。" → 2 + 1 + 2 + 1 = 6 characters

    • "<speak>你好<speak/>" → 7 + 2 + 2 + 8 = 19 characters

Encoding format

Use UTF-8 encoding.

Support for mathematical expressions

The mathematical expression parsing feature is currently available only for the cosyvoice-v2 model. It supports common mathematical expressions found in primary and secondary school, including but not limited to basic arithmetic, algebra, and geometry.

For more information, see Support for LaTeX (Chinese language only).

SSML support

The Speech Synthesis Markup Language (SSML) feature is currently available only for some voices of the cosyvoice-v2 model. Before you use SSML, check the voice list to confirm that the selected voice supports it.

Get started

The SpeechSynthesizer class provides key interfaces for speech synthesis and supports the following call methods:

  • Synchronous call: After you submit text, the server processes it immediately and returns the complete synthesized audio. This process is blocking, which means the client must wait for the server to finish before proceeding. This method is suitable for short text synthesis.

  • Asynchronous call: You can send the entire text to the server in one go and receive the synthesized audio in real time. Sending text in segments is not allowed. This method is suitable for short text synthesis scenarios that require low latency.

  • Streaming call: You can send text to the server incrementally and receive the synthesized audio in real time. You can send long text in segments, and the server begins processing as soon as it receives a portion of the text. This method is suitable for long text synthesis scenarios that require low latency.

Synchronous call

You can submit a single speech synthesis task and obtain the full result at once without using a callback function. No intermediate results are streamed.

You can instantiate the SpeechSynthesizer class, bind the request parameters, and call the call method to synthesize the audio and obtain the binary data.

The length of the sent text cannot exceed 2,000 characters. For more information, see the call method of the SpeechSynthesizer class.

Important

You must re-initialize the SpeechSynthesizer instance before each call to the call method.

Sample code:

# coding=utf-8

import dashscope
from dashscope.audio.tts_v2 import *

# If you have not configured the API key in your environment variables, replace "your-api-key" with your own API key.
# dashscope.api_key = "your-api-key"

# Model
model = "cosyvoice-v2"
# Voice
voice = "longxiaochun_v2"

# Instantiate SpeechSynthesizer and pass request parameters such as model and voice in the constructor.
synthesizer = SpeechSynthesizer(model=model, voice=voice)
# Send the text to be synthesized and get the binary audio data.
audio = synthesizer.call("What is the weather like today?")
print('[Metric] Request ID: {}, First package delay: {} ms'.format(
    synthesizer.get_last_request_id(),
    synthesizer.get_first_package_delay()))

# Save the audio to a local file.
with open('output.mp3', 'wb') as f:
    f.write(audio)
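Because an instance cannot be reused across calls, synthesizing several texts means constructing a new SpeechSynthesizer for each one. A minimal sketch:

texts = ["First sentence.", "Second sentence."]
for i, text in enumerate(texts):
    # Re-initialize before every call to the call method.
    synthesizer = SpeechSynthesizer(model=model, voice=voice)
    with open(f"output_{i}.mp3", "wb") as f:
        f.write(synthesizer.call(text))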

Asynchronous call

You can submit a single speech synthesis task and stream the intermediate results through a callback. The synthesized result is obtained through the callback functions in ResultCallback.

You can instantiate the SpeechSynthesizer class, bind the request parameters and the ResultCallback interface, call the call method to start synthesis, and receive the result in real time through the on_data method of the ResultCallback interface.

The length of the sent text cannot exceed 2,000 characters. For more information, see the call method of the SpeechSynthesizer class.

Important

You must re-initialize the SpeechSynthesizer instance before each call to the call method.

Sample code:

# coding=utf-8

import dashscope
from dashscope.audio.tts_v2 import *

from datetime import datetime

def get_timestamp():
    now = datetime.now()
    formatted_timestamp = now.strftime("[%Y-%m-%d %H:%M:%S.%f]")
    return formatted_timestamp

# If you have not configured the API key in your environment variables, replace "your-api-key" with your own API key.
# dashscope.api_key = "your-api-key"

# Model
model = "cosyvoice-v2"
# Voice
voice = "longxiaochun_v2"


# Define the callback interface.
class Callback(ResultCallback):

    def on_open(self):
        self.file = open("output.mp3", "wb")
        print("Connection established: " + get_timestamp())

    def on_complete(self):
        print("Speech synthesis complete. All results received: " + get_timestamp())

    def on_error(self, message: str):
        print(f"An error occurred during speech synthesis: {message}")

    def on_close(self):
        print("Connection closed: " + get_timestamp())
        self.file.close()

    def on_event(self, message):
        pass

    def on_data(self, data: bytes) -> None:
        print(get_timestamp() + " Binary audio data length: " + str(len(data)))
        self.file.write(data)


callback = Callback()

# Instantiate SpeechSynthesizer and pass request parameters such as model and voice in the constructor.
synthesizer = SpeechSynthesizer(
    model=model,
    voice=voice,
    callback=callback,
)

# Send the text to be synthesized and receive the binary audio data in real time in the on_data method of the callback interface.
synthesizer.call("What is the weather like today?")
print('[Metric] Request ID: {}, First package delay: {} ms'.format(
    synthesizer.get_last_request_id(),
    synthesizer.get_first_package_delay()))

Streaming calls

You can submit text multiple times within a single speech synthesis task and receive the synthesized result in real time through a callback.

Note
  • For streaming input, you can call streaming_call multiple times to submit text fragments in sequence. The server automatically segments sentences after receiving the fragments:

    • Complete sentences are synthesized immediately.

    • Incomplete sentences are buffered until they are complete, then synthesized.

    When you call streaming_complete, the server forcibly synthesizes all received but unprocessed text fragments, including incomplete sentences.

  • The interval between sending text fragments cannot exceed 23 seconds, or a "request timeout after 23 seconds" exception is triggered.

    If there is no more text to send, you can call streaming_complete promptly to end the task.

    The server enforces a 23-second timeout mechanism that cannot be modified by the client.

  1. Instantiate the SpeechSynthesizer class.

    Instantiate the SpeechSynthesizer class and bind the request parameters and the ResultCallback interface.

  2. Stream the text.

    Stream the text by calling the streaming_call method of the SpeechSynthesizer class multiple times to send the text to be synthesized to the server in segments.

    While you send the text, the server returns the synthesized result to the client in real time through the on_data method of the ResultCallback interface.

    The length of the text fragment sent in each call to the streaming_call method (the text parameter) cannot exceed 2,000 characters, and the total length of all text sent cannot exceed 200,000 characters.

  3. End the synthesis.

    Call the streaming_complete method of the SpeechSynthesizer class to end the speech synthesis.

    This method blocks the current thread until the on_complete or on_error callback of the ResultCallback interface is triggered. After the callback is triggered, the thread is unblocked.

    Make sure to call this method. Otherwise, the final part of the text may not be converted to speech.

Sample code:

# coding=utf-8
#
# pyaudio installation instructions:
# For macOS, run the following commands:
#   brew install portaudio
#   pip install pyaudio
# For Debian/Ubuntu, run the following commands:
#   sudo apt-get install python-pyaudio python3-pyaudio
#   or
#   pip install pyaudio
# For CentOS, run the following commands:
#   sudo yum install -y portaudio portaudio-devel && pip install pyaudio
# For Microsoft Windows, run the following command:
#   python -m pip install pyaudio

import time
import pyaudio
import dashscope
from dashscope.audio.tts_v2 import *

from datetime import datetime

def get_timestamp():
    now = datetime.now()
    formatted_timestamp = now.strftime("[%Y-%m-%d %H:%M:%S.%f]")
    return formatted_timestamp

# If you have not configured the API key in your environment variables, replace "your-api-key" with your own API key.
# dashscope.api_key = "your-api-key"

# Model
model = "cosyvoice-v2"
# Voice
voice = "longxiaochun_v2"


# Define the callback interface.
class Callback(ResultCallback):
    _player = None
    _stream = None

    def on_open(self):
        print("Connection established: " + get_timestamp())
        self._player = pyaudio.PyAudio()
        self._stream = self._player.open(
            format=pyaudio.paInt16, channels=1, rate=22050, output=True
        )

    def on_complete(self):
        print("Speech synthesis complete. All results received: " + get_timestamp())

    def on_error(self, message: str):
        print(f"An error occurred during speech synthesis: {message}")

    def on_close(self):
        print("Connection closed: " + get_timestamp())
        # Stop the player.
        self._stream.stop_stream()
        self._stream.close()
        self._player.terminate()

    def on_event(self, message):
        pass

    def on_data(self, data: bytes) -> None:
        print(get_timestamp() + " Binary audio data length: " + str(len(data)))
        self._stream.write(data)


callback = Callback()

test_text = [
    "The streaming text-to-speech SDK ",
    "can convert input text ",
    "into binary speech data. ",
    "Compared to non-streaming speech synthesis, ",
    "the advantage of streaming synthesis is its real-time performance, ",
    "which is much stronger. Users can hear ",
    "nearly synchronized speech output while ",
    "inputting text, which greatly improves the ",
    "interactive experience and reduces user waiting time. ",
    "It is suitable for scenarios that call large ",
    "language models (LLMs) to perform ",
    "speech synthesis by streaming ",
    "text input.",
]

# Instantiate SpeechSynthesizer and pass request parameters such as model and voice in the constructor.
synthesizer = SpeechSynthesizer(
    model=model,
    voice=voice,
    format=AudioFormat.PCM_22050HZ_MONO_16BIT,  # 16-bit PCM, mono, 22.05 kHz, matching the pyaudio stream
    callback=callback,
)


# Stream the text to be synthesized. Receive the binary audio data in real time in the on_data method of the callback interface.
for text in test_text:
    synthesizer.streaming_call(text)
    time.sleep(0.1)
# End the streaming speech synthesis.
synthesizer.streaming_complete()

print('[Metric] Request ID: {}, First package delay: {} ms'.format(
    synthesizer.get_last_request_id(),
    synthesizer.get_first_package_delay()))

Request parameters

You can set the request parameters through the constructor of the SpeechSynthesizer class.

model (str; required; no default)

Specifies the model. Supported value: cosyvoice-v2.

voice (str; required; no default)

Specifies the voice to use for speech synthesis.

The following two types of voices are supported:

  • Default voices. For more information, see Voice list.

  • A custom voice that is created using the voice cloning feature. When you use a cloned voice (ensure that the voice cloning and speech synthesis services are under the same account), you must set the voice parameter to the ID of the voice. For the complete procedure, see Sample code: Use cloned voice for speech synthesis.

format (enum; optional; default varies by voice)

Specifies the audio coding format and sample rate.

If you do not specify the format, the synthesized audio has a sample rate of 22.05 kHz and is in MP3 format.

Note

The default sample rate is the optimal sample rate for the current voice. If this parameter is not set, the output uses this sample rate by default. Downsampling and upsampling are also supported.

The following audio coding formats and sample rates are supported:

  • Audio coding formats and sample rates supported by both cosyvoice-v2 and cosyvoice-v1:

    • AudioFormat.WAV_8000HZ_MONO_16BIT: WAV format, 8 kHz sample rate

    • AudioFormat.WAV_16000HZ_MONO_16BIT: WAV format, 16 kHz sample rate

    • AudioFormat.WAV_22050HZ_MONO_16BIT: WAV format, 22.05 kHz sample rate

    • AudioFormat.WAV_24000HZ_MONO_16BIT: WAV format, 24 kHz sample rate

    • AudioFormat.WAV_44100HZ_MONO_16BIT: WAV format, 44.1 kHz sample rate

    • AudioFormat.WAV_48000HZ_MONO_16BIT: WAV format, 48 kHz sample rate

    • AudioFormat.MP3_8000HZ_MONO_128KBPS: MP3 format, 8 kHz sample rate

    • AudioFormat.MP3_16000HZ_MONO_128KBPS: MP3 format, 16 kHz sample rate

    • AudioFormat.MP3_22050HZ_MONO_256KBPS: MP3 format, 22.05 kHz sample rate

    • AudioFormat.MP3_24000HZ_MONO_256KBPS: MP3 format, 24 kHz sample rate

    • AudioFormat.MP3_44100HZ_MONO_256KBPS: MP3 format, 44.1 kHz sample rate

    • AudioFormat.MP3_48000HZ_MONO_256KBPS: MP3 format, 48 kHz sample rate

    • AudioFormat.PCM_8000HZ_MONO_16BIT: PCM format, 8 kHz sample rate

    • AudioFormat.PCM_16000HZ_MONO_16BIT: PCM format, 16 kHz sample rate

    • AudioFormat.PCM_22050HZ_MONO_16BIT: PCM format, 22.05 kHz sample rate

    • AudioFormat.PCM_24000HZ_MONO_16BIT: PCM format, 24 kHz sample rate

    • AudioFormat.PCM_44100HZ_MONO_16BIT: PCM format, 44.1 kHz sample rate

    • AudioFormat.PCM_48000HZ_MONO_16BIT: PCM format, 48 kHz sample rate

  • Audio coding formats and sample rates supported only by cosyvoice-v2:

    If the audio format is Opus, you can adjust the bitrate using the bit_rate parameter. This is applicable only to DashScope SDK 1.24.0 and later versions.

    • AudioFormat.OGG_OPUS_8KHZ_MONO_32KBPS: Opus format, 8 kHz sample rate, 32 kbps bitrate

    • AudioFormat.OGG_OPUS_16KHZ_MONO_16KBPS: Opus format, 16 kHz sample rate, 16 kbps bitrate

    • AudioFormat.OGG_OPUS_16KHZ_MONO_32KBPS: Opus format, 16 kHz sample rate, 32 kbps bitrate

    • AudioFormat.OGG_OPUS_16KHZ_MONO_64KBPS: Opus format, 16 kHz sample rate, 64 kbps bitrate

    • AudioFormat.OGG_OPUS_24KHZ_MONO_16KBPS: Opus format, 24 kHz sample rate, 16 kbps bitrate

    • AudioFormat.OGG_OPUS_24KHZ_MONO_32KBPS: Opus format, 24 kHz sample rate, 32 kbps bitrate

    • AudioFormat.OGG_OPUS_24KHZ_MONO_64KBPS: Opus format, 24 kHz sample rate, 64 kbps bitrate

    • AudioFormat.OGG_OPUS_48KHZ_MONO_16KBPS: Opus format, 48 kHz sample rate, 16 kbps bitrate

    • AudioFormat.OGG_OPUS_48KHZ_MONO_32KBPS: Opus format, 48 kHz sample rate, 32 kbps bitrate

    • AudioFormat.OGG_OPUS_48KHZ_MONO_64KBPS: Opus format, 48 kHz sample rate, 64 kbps bitrate

volume (int; optional; default: 50)

The volume of the synthesized audio. Valid values: 0 to 100.

Important

This field differs in various versions of the DashScope SDK:

  • SDK versions 1.20.10 and later: volume

  • SDK versions earlier than 1.20.10: volumn

speech_rate (float; optional; default: 1.0)

The speech rate of the synthesized audio. Valid values: 0.5 to 2.

  • 0.5: 0.5 times the default speech rate.

  • 1: The default speech rate. This is the default output rate of the model, which may vary slightly depending on the voice. The rate is approximately four characters per second.

  • 2: 2 times the default speech rate.

pitch_rate (float; optional; default: 1.0)

The pitch of the synthesized audio. Valid values: 0.5 to 2.

bit_rate (int; optional; default: 32)

Specifies the audio bitrate. Valid values: 6 to 510 kbps.

A higher bitrate results in better audio quality and a larger file size.

This parameter is available only when the model is cosyvoice-v2 and the audio format is Opus.

Note

Set the bit_rate parameter through the additional_params parameter:

synthesizer = SpeechSynthesizer(model="cosyvoice-v2",
                                voice="longxiaochun_v2",
                                format=AudioFormat.OGG_OPUS_16KHZ_MONO_16KBPS,
                                additional_params={"bit_rate": 32})

callback (ResultCallback; optional)

The callback interface. For more information, see Callback interface (ResultCallback).
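Putting several of these parameters together, a constructor call might look like the following (the values shown are illustrative):

synthesizer = SpeechSynthesizer(
    model="cosyvoice-v2",
    voice="longxiaochun_v2",
    format=AudioFormat.MP3_22050HZ_MONO_256KBPS,
    volume=50,        # 0 to 100
    speech_rate=1.0,  # 0.5 to 2
    pitch_rate=1.0,   # 0.5 to 2
)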

Key interfaces

The SpeechSynthesizer class

SpeechSynthesizer is imported through from dashscope.audio.tts_v2 import * and provides key interfaces for speech synthesis.

def call(self, text: str, timeout_millis=None)

Parameters:

  • text: The text to be synthesized.

  • timeout_millis: The timeout period for blocking the thread, in milliseconds. If this parameter is not set or is set to 0, it does not take effect.

Returns: binary audio data if ResultCallback is not specified; otherwise, None.

Converts a whole piece of text (either plain text or text with SSML) into speech.

When you create a SpeechSynthesizer instance, two cases exist:

  • ResultCallback is not specified: The call method blocks the current thread until speech synthesis is complete and returns the binary audio data. For usage, see Synchronous call.

  • ResultCallback is specified: The call method immediately returns None and returns the speech synthesis result through the on_data method of the ResultCallback interface. For usage, see Asynchronous call.

Important

You must re-initialize the SpeechSynthesizer instance before each call to the call method.

def streaming_call(self, text: str)

Parameters:

  • text: The text fragment to be synthesized.

Returns: None.

Streams the text to be synthesized. Text with SSML is not supported.

You can call this interface multiple times to send the text to be synthesized to the server in multiple parts. The synthesis result is obtained through the on_data method of the ResultCallback interface.

For usage, see Streaming calls.

def streaming_complete(self, complete_timeout_millis=600000)

Parameters:

  • complete_timeout_millis: The waiting time, in milliseconds. Default: 600000 (10 minutes).

Returns: None.

Ends the streaming speech synthesis.

This method blocks the current thread for N milliseconds, where N is determined by complete_timeout_millis, until the task is complete. If complete_timeout_millis is set to 0, the method waits indefinitely.

By default, if the waiting time exceeds 10 minutes, the waiting stops.

For usage, see Streaming calls.

Important

For streaming calls, make sure to call this method. Otherwise, parts of the synthesized speech may be missing.

def get_last_request_id(self)

Parameters: none.

Returns: the request_id of the last task.

def get_first_package_delay(self)

Parameters: none.

Returns: the first package delay.

Gets the first package delay, which is typically around 500 ms.

The first package delay is the time between when the text is sent and when the first audio packet is received, measured in milliseconds. Use this method after the task is complete.

def get_response(self)

Parameters: none.

Returns: the last message from the service.

Gets the last message, which is in JSON format. You can use this to get task-failed errors.
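Used together after a task finishes (or fails), the getters look like the following, where synthesizer is the instance from one of the earlier examples:

# Inspect the finished task.
print("Request ID:", synthesizer.get_last_request_id())
print("First package delay (ms):", synthesizer.get_first_package_delay())
print("Last message:", synthesizer.get_response())  # JSON; useful for task-failed errors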

Callback interface (ResultCallback)

For asynchronous or streaming calls, the server uses callbacks to return key process information and data to the client. You need to implement the callback methods to handle the information or data returned by the server.

You can import it using from dashscope.audio.tts_v2 import *.

Sample code:

class Callback(ResultCallback):
    def on_open(self) -> None:
        print('Connection successful')

    def on_data(self, data: bytes) -> None:
        # Implement the logic to receive the synthesized binary audio result.
        pass

    def on_complete(self) -> None:
        print('Synthesis complete')

    def on_error(self, message) -> None:
        print('An exception occurred: ', message)

    def on_close(self) -> None:
        print('Connection closed')


callback = Callback()

def on_open(self) -> None

Parameters: none. Returns: None.

This method is called immediately after a connection is established with the server.

def on_event(self, message: str) -> None

Parameters:

  • message: The information returned by the server.

Returns: None.

This method is called when there is a response from the service. The message is a JSON string that you can parse to get the Request ID.

def on_complete(self) -> None

Parameters: none. Returns: None.

This method is called after all synthesized data has been returned (speech synthesis is complete).

def on_error(self, message) -> None

Parameters:

  • message: The exception information.

Returns: None.

This method is called when an exception occurs.

def on_data(self, data: bytes) -> None

Parameters:

  • data: The binary audio data returned by the server.

Returns: None.

This method is called when the server returns synthesized audio.

You can combine the binary audio data into a complete audio file and play it with a player, or play it in real time with a player that supports streaming playback.

Important
  • In streaming speech synthesis, compressed formats such as MP3 and Opus must be played with a streaming player. Do not decode and play them frame by frame, or decoding may fail.

    Players that support streaming playback include ffmpeg, pyaudio (Python), AudioFormat (Java), and MediaSource (JavaScript).
  • When combining audio data into a complete audio file, write to the same file in append mode.

  • For WAV and MP3 formats in streaming speech synthesis, only the first frame contains header information. Subsequent frames contain only audio data.

def on_close(self) -> None

Parameters: none. Returns: None.

This method is called after the service has closed the connection.

Response

The server returns binary audio data.

Error codes

If you encounter an error, see Error messages for troubleshooting.

If the problem is still not resolved, join the developer group to provide feedback on the problem and include the Request ID for further investigation.

More examples

For more examples, see GitHub.

Voice list

The default supported voices are listed in the following table. If you need a more personalized voice, you can use the voice cloning feature to create a custom voice for free. For more information, see Use a cloned voice for speech synthesis.

  • cosyvoice-v2 voice list

    Important

    The following is the cosyvoice-v2 voice list. When using these voices, you must set the model request parameter to cosyvoice-v2. Otherwise, the call will fail.

    Each entry lists the voice name, voice characteristics, the value to pass in the voice parameter, and the supported languages.

    Customer service:

      • Long Yingcui (serious male): longyingcui. Languages: Chinese, English.
      • Long Yingda (cheerful high-pitched female): longyingda. Languages: Chinese, English.
      • Long Yingjing (low-key and calm female): longyingjing. Languages: Chinese, English.
      • Long Yingyan (righteous and stern female): longyingyan. Languages: Chinese, English.
      • Long Yingtian (gentle and sweet female): longyingtian. Languages: Chinese, English.
      • Long Yingbing (sharp and assertive female): longyingbing. Languages: Chinese, English.
      • Long Yingtao (gentle and calm female): longyingtao. Languages: Chinese, English.
      • Long Yingling (gentle and empathetic female): longyingling. Languages: Chinese, English.

    Voice assistant:

      • YUMI (formal young female): longyumi_v2. Languages: Chinese, English.
      • Long Xiaochun (intellectual and positive female): longxiaochun_v2. Languages: Chinese, English.
      • Long Xiaoxia (calm and authoritative female): longxiaoxia_v2. Languages: Chinese, English.

    Livestreaming and e-commerce:

      • Long Anran (lively and textured female): longanran. Languages: Chinese, English.
      • Long Anxuan (classic female livestreamer): longanxuan. Languages: Chinese, English.

    Audiobooks:

      • Long Sanshu (calm and textured male): longsanshu. Languages: Chinese, English.
      • Long Xiu (knowledgeable male storyteller): longxiu_v2. Languages: Chinese, English.
      • Long Miao (rhythmic female): longmiao_v2. Languages: Chinese, English.
      • Long Yue (warm and magnetic female): longyue_v2. Languages: Chinese, English.
      • Long Nan (wise young male): longnan_v2. Languages: Chinese, English.
      • Long Yuan (warm and healing female): longyuan_v2. Languages: Chinese, English.

    Social companion:

      • Long Anrou (gentle best friend female): longanrou. Languages: Chinese, English.
      • Long Qiang (romantic and charming female): longqiang_v2. Languages: Chinese, English.
      • Long Han (warm and devoted male): longhan_v2. Languages: Chinese, English.
      • Long Xing (gentle girl-next-door): longxing_v2. Languages: Chinese, English.
      • Long Hua (energetic and sweet female): longhua_v2. Languages: Chinese, English.
      • Long Wan (positive and intellectual female): longwan_v2. Languages: Chinese, English.
      • Long Cheng (intelligent young male): longcheng_v2. Languages: Chinese, English.
      • Long Feifei (sweet and delicate female): longfeifei_v2. Languages: Chinese, English.
      • Long Xiaocheng (magnetic low-pitched male): longxiaocheng_v2. Languages: Chinese, English.
      • Long Zhe (awkward and warm-hearted male): longzhe_v2. Languages: Chinese, English.
      • Long Yan (warm and gentle female): longyan_v2. Languages: Chinese, English.
      • Long Tian (magnetic and rational male): longtian_v2. Languages: Chinese, English.
      • Long Ze (warm and energetic male): longze_v2. Languages: Chinese, English.
      • Long Shao (positive and upwardly mobile male): longshao_v2. Languages: Chinese, English.
      • Long Hao (emotional and melancholic male): longhao_v2. Languages: Chinese, English.
      • Long Shen (talented male singer): kabuleshen_v2. Languages: Chinese, English.

    Children's voice:

      • Long Jielidou (sunny and mischievous male): longjielidou_v2. Languages: Chinese, English.
      • Long Ling (childish and stiff female): longling_v2. Languages: Chinese, English.
      • Long Ke (innocent and well-behaved female): longke_v2. Languages: Chinese, English.
      • Long Xian (bold and cute female): longxian_v2. Languages: Chinese, English.

    Dialect:

      • Long Laotie (straightforward Northeastern dialect male): longlaotie_v2. Languages: Chinese (Northeastern), English.
      • Long Jiayi (intellectual Cantonese female): longjiayi_v2. Languages: Chinese (Cantonese), English.
      • Long Tao (positive Cantonese female): longtao_v2. Languages: Chinese (Cantonese), English.

    Poetry recitation:

      • Long Fei (passionate and magnetic male): longfei_v2. Languages: Chinese, English.
      • Li Bai (ancient male poet): libai_v2. Languages: Chinese, English.
      • Long Jin (elegant and gentle male): longjin_v2. Languages: Chinese, English.

    News broadcast:

      • Long Shu (calm young male): longshu_v2. Languages: Chinese, English.
      • Bella2.0 (precise and capable female): loongbella_v2. Languages: Chinese, English.
      • Long Shuo (knowledgeable and capable male): longshuo_v2. Languages: Chinese, English.
      • Long Xiaobai (calm female announcer): longxiaobai_v2. Languages: Chinese, English.
      • Long Jing (typical female announcer): longjing_v2. Languages: Chinese, English.
      • loongstella (confident and crisp female): loongstella_v2. Languages: Chinese, English.

    Overseas marketing:

      • loongeva (intellectual British English female): loongeva_v2. Language: British English.
      • loongbrian (calm British English male): loongbrian_v2. Language: British English.
      • loongluna (British English female): loongluna_v2. Language: British English.
      • loongluca (British English male): loongluca_v2. Language: British English.
      • loongemily (British English female): loongemily_v2. Language: British English.
      • loongeric (British English male): loongeric_v2. Language: British English.
      • loongabby (American English female): loongabby_v2. Language: American English.
      • loongannie (American English female): loongannie_v2. Language: American English.
      • loongandy (American English male): loongandy_v2. Language: American English.
      • loongava (American English female): loongava_v2. Language: American English.
      • loongbeth (American English female): loongbeth_v2. Language: American English.
      • loongbetty (American English female): loongbetty_v2. Language: American English.
      • loongcindy (American English female): loongcindy_v2. Language: American English.
      • loongcally (American English female): loongcally_v2. Language: American English.
      • loongdavid (American English male): loongdavid_v2. Language: American English.
      • loongdonna (American English female): loongdonna_v2. Language: American English.
      • loongkyong (Korean female): loongkyong_v2. Language: Korean.
      • loongtomoka (Japanese female): loongtomoka_v2. Language: Japanese.
      • loongtomoya (Japanese male): loongtomoya_v2. Language: Japanese.

FAQ

Features, billing, and rate limiting

Q: Where can I find information about CosyVoice's features, billing, and rate limiting?

A: For more information, see Speech synthesis - CosyVoice.

Q: What should I do if the pronunciation is inaccurate?

A: You can use SSML to customize the speech synthesis output.

Q: The current Requests Per Second (RPS) limit cannot meet my business needs. How do I increase it? Do I need to pay for the increase?

A: You can make a request by submitting a ticket or joining the developer group. The scale-out is free.

Q: How do I specify the language of the synthesized speech?

A: You cannot specify the language of the synthesized speech through request parameters. To synthesize speech in a specific language, refer to the voice list and select a voice that supports the desired language.

Troubleshooting

If a code error occurs, see the information in Error codes for troubleshooting.

Q: How do I get the Request ID?

A: You can obtain it in one of the following two ways:

  • Call the get_last_request_id method of the SpeechSynthesizer class after the request is sent.

  • Parse the message received in the on_event method of the ResultCallback interface. The message is a JSON string that contains the Request ID.
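For the first approach, a single call on the SpeechSynthesizer instance that made the request is enough:

print("Request ID:", synthesizer.get_last_request_id())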

Q: Why does the SSML feature fail?

A: You can follow these steps to troubleshoot:

  1. Make sure the current voice supports SSML. Cloned voices do not support SSML.

  2. Make sure the model parameter is set to cosyvoice-v2.

  3. Install the latest version of the DashScope SDK.

  4. Make sure you are using the correct interface. Only the call method of the SpeechSynthesizer class supports SSML.

  5. Make sure the text to be synthesized is in plain text format and meets the format requirements. For more information, see Introduction to the SSML markup language.

Q: Why won't the audio play?

A: Check the following scenarios one by one:

  1. The audio is saved as a complete file, such as xx.mp3

    1. Audio format consistency: Make sure the audio format set in the request parameters matches the file extension. For example, if the audio format is set to wav but the file extension is mp3, playback may fail.

    2. Player compatibility: Confirm whether the player you are using supports the format and sample rate of the audio file. For example, some players may not support high sample rates or specific audio encodings.

  2. The audio is played in a stream

    1. Save the audio stream as a complete file and try to play it with a player. If the file cannot be played, refer to the troubleshooting methods for scenario 1.

    2. If the file can be played normally, the problem may be with the streaming playback implementation. Confirm whether the player you are using supports streaming playback.

      Common tools and libraries that support streaming playback include ffmpeg, pyaudio (Python), AudioFormat (Java), and MediaSource (JavaScript).

Q: Why is the audio playback stuttering?

A: Check the following scenarios one by one:

  1. The audio is saved as a complete file, such as xx.mp3

    Please join the developer group and provide the Request ID so we can troubleshoot the issue for you.

  2. The audio is played in a stream

    1. Check the text sending speed: Make sure the interval for sending text is reasonable. Avoid situations where the next sentence is not sent promptly after the previous audio has finished playing.

    2. Check the callback function performance:

      • Check if there is too much business logic in the callback function, causing it to block.

      • The callback function runs in the WebSocket thread. If it is blocked, it may affect the WebSocket's ability to receive network packets, leading to stuttering in audio reception.

      • Write the audio data to a separate audio buffer and then read and process it in another thread to avoid blocking the WebSocket thread (see the sketch after this list).

    3. Check network stability: Make sure the network connection is stable to avoid audio transmission interruptions or delays due to network fluctuations.

    4. Further troubleshooting: If the preceding solutions do not resolve the issue, join the developer group and provide the Request ID so we can investigate the issue further for you.
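A minimal sketch of the buffering approach from step 2. It plays 16-bit, 22.05 kHz mono PCM, matching the streaming example above; adapt the parameters to your audio format:

import queue
import threading

import pyaudio

player = pyaudio.PyAudio()
stream = player.open(format=pyaudio.paInt16, channels=1, rate=22050, output=True)
audio_queue = queue.Queue()

def playback_worker():
    # Runs in its own thread so the WebSocket callback never blocks on playback.
    while True:
        chunk = audio_queue.get()
        if chunk is None:  # sentinel enqueued after streaming_complete() returns
            break
        stream.write(chunk)

threading.Thread(target=playback_worker, daemon=True).start()

# In the ResultCallback, on_data only enqueues the chunk:
#     def on_data(self, data: bytes) -> None:
#         audio_queue.put(data)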

Q: Why is speech synthesis slow (long synthesis time)?

A: Check the following items:

  1. Check the input interval

    For streaming speech synthesis, check if the interval between sending text segments is too long. For example, a delay of several seconds after the previous segment is sent. A long interval will increase the total synthesis time.

  2. Analyze performance metrics

    If the following metrics deviate significantly from the normal values, submit the Request ID to the technical team for investigation.

    • First packet delay: Normally around 500 ms.

    • RTF (RTF = Total synthesis time / Audio duration): Normally around 0.3.
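    For example, if 10 seconds of audio takes 3 seconds to synthesize end to end, the RTF is 3 / 10 = 0.3, which is within the normal range.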

Q: How do I handle pronunciation errors in the synthesized speech?

A:

  • We recommend cosyvoice-v2, which produces better results and supports SSML.

  • If the current model is cosyvoice-v2, use the SSML <phoneme> tag to specify the correct pronunciation, as sketched below.
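As an illustration only (the attribute names below are placeholders; see Introduction to the SSML markup language for the exact syntax that CosyVoice supports):

# Hypothetical SSML input: a <phoneme> tag overrides the pronunciation of one character.
ssml_text = '<speak>这个字念<phoneme alphabet="py" ph="yue4">乐</phoneme></speak>'
synthesizer = SpeechSynthesizer(model="cosyvoice-v2", voice="longxiaochun_v2")
audio = synthesizer.call(ssml_text)  # only the call method supports SSML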

Q: Why is no audio returned? Why is the end of the text not converted to speech? (Missing synthesized audio)

A: You can check if you forgot to call the streaming_complete method of the SpeechSynthesizer class. During speech synthesis, the server starts synthesizing only after buffering enough text. If you do not call the streaming_complete method, the text at the end of the buffer may not be synthesized into speech.

Q: How do I resolve an SSL certificate verification failure?

  1. You can install the system root certificates. The following commands apply to CentOS and other RHEL-based systems:

    sudo yum install -y ca-certificates
    sudo update-ca-trust enable
  2. You can add the following content to your code.

    import os
    os.environ["SSL_CERT_FILE"] = "/etc/ssl/certs/ca-bundle.crt"

Q: What causes the "SSL: CERTIFICATE_VERIFY_FAILED" exception on macOS? (websocket closed due to [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1000))

A: When connecting to a WebSocket, you may encounter an OpenSSL certificate verification failure with a message indicating that the certificate cannot be found. This usually means the certificate configuration in your Python environment is incorrect. You can locate and fix the certificate issue by following these steps:

  1. Export system certificates and set environment variables. You can run the following commands to export all certificates from your macOS system to a file and set it as the default certificate path for Python and related libraries:

    security find-certificate -a -p > ~/all_mac_certs.pem
    export SSL_CERT_FILE=~/all_mac_certs.pem
    export REQUESTS_CA_BUNDLE=~/all_mac_certs.pem
  2. Create a symbolic link to fix Python's OpenSSL configuration. If Python's OpenSSL configuration is missing certificates, you can create a symbolic link manually. Make sure to replace the path in the command with the actual installation directory of your local Python version:

    # 3.9 is an example version number. Adjust the path according to your locally installed Python version.
    ln -s /etc/ssl/* /Library/Frameworks/Python.framework/Versions/3.9/etc/openssl
  3. Restart the terminal and clear the cache. After completing the preceding operations, close and reopen the terminal to ensure the environment variables take effect. Clear any possible cache and try connecting to the WebSocket again.

These steps can resolve connection issues caused by incorrect certificate configurations. If the problem persists, check if the target server's certificate configuration is correct.

Q: What causes the "AttributeError: module 'websocket' has no attribute 'WebSocketApp'. Did you mean: 'WebSocket'?" error when running the code?

A: This is because websocket-client is not installed or the version is mismatched. You can run the following commands in sequence:

pip uninstall websocket-client
pip uninstall websocket
pip install websocket-client

Permissions and authentication

Q: I want my API key to be used only for the CosyVoice speech synthesis service and not by other Model Studio models (permission isolation). How can I do this?

A: You can limit the scope of an API key by creating a new workspace and granting it access to only specific models. For more information, see Workspace Management.

More questions

For more questions, see the GitHub QA.