Alibaba Cloud Model Studio: Server-side events

Last Updated: Sep 30, 2025

This topic describes the server-side events for the Qwen-Omni Real-time API.

For more information, see Real-time multimodal.

Server-side events

error

The server returns an error message for both client-side and server-side errors. Most errors are recoverable and do not affect the session.

Parameter

Type

Description

type

string

The event type. For this event, the value is fixed to error.

error

object

Details about the error.

error.type

string

The error type.

error.code

string

The error code.

error.message

string

The error message.

error.param

string

The parameter related to the error, such as session.modalities.

{
  "event_id": "event_RoUu4T8yExPMI37GKwaOC",
  "type": "error",
  "error": {
    "type": "invalid_request_error",
    "code": "invalid_value",
    "message": "Invalid modalities: ['audio']. Supported combinations are: ['text'] and ['audio', 'text'].",
    "param": "session.modalities"
  }
}
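Because most errors are recoverable, a client usually inspects the error fields and then carries on with the session. The following is a minimal sketch, not part of any official SDK: the function name describe_error is hypothetical, and it assumes the event arrives as a JSON string shaped like the example above.

```python
import json


def describe_error(raw_message: str) -> str:
    """Return a one-line summary of an `error` event, or "" for other events."""
    event = json.loads(raw_message)
    if event.get("type") != "error":
        return ""
    err = event.get("error", {})
    # Assumption: `param` may be missing when the error is not tied to a
    # specific parameter, so it is treated as optional here.
    param = err.get("param")
    summary = f"[{err.get('type')}/{err.get('code')}] {err.get('message')}"
    return f"{summary} (param: {param})" if param else summary


sample_error = json.dumps({
    "event_id": "event_RoUu4T8yExPMI37GKwaOC",
    "type": "error",
    "error": {
        "type": "invalid_request_error",
        "code": "invalid_value",
        "message": "Invalid modalities: ['audio'].",
        "param": "session.modalities",
    },
})
print(describe_error(sample_error))
```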

session.created

This is the first event that the server sends after a client connects. The event contains the default configurations for the connection.

Parameter

Type

Description

type

string

The event type. For this event, the value is set to session.created.

session

object

The session configuration.

session.modalities

array

The output modalities of the model. You can set this to ["text"] or ["text", "audio"]. Setting this parameter to ["audio"] is not supported.

session.voice

string

The voice of the model's generated audio. For a list of supported voices, see Voice list.

Default voices:

  • Qwen3-Omni-Flash-Realtime: "Cherry"

  • Qwen3-Omni-Turbo-Realtime: "Chelsie"

session.input_audio_format

string

The format of the user's input audio. Currently, only "pcm16" is supported.

session.output_audio_format

string

The format of the model's output audio. Currently, only "pcm24" is supported.

session.input_audio_transcription

object

The configuration for enabling automatic speech recognition (ASR) for the user's input audio.

session.input_audio_transcription.model

string

The model for ASR. The value is set to "gummy-realtime-v1".

session.turn_detection

object

The configuration for voice activity detection (VAD).

If this is set to null, VAD is disabled, and the client must commit the audio buffer manually by sending input_audio_buffer.commit.

session.turn_detection.type

string

The server-side VAD type. The value is set to "server_vad".

session.turn_detection.threshold

float

The VAD threshold. Increase this value in noisy environments and decrease it in quiet environments.

  • The closer the value is to -1.0, the more likely it is that noise is detected as speech.

  • The closer the value is to 1.0, the less likely it is that noise is detected as speech.

Default: 0.2. Valid values: [-1.0, 1.0].

session.turn_detection.silence_duration_ms

integer

The duration of silence in milliseconds (ms) that triggers a model response. Default: 800. Valid values: [200, 6000].

{
    "event_id": "event_XlIT7ohkMGIsuSz154lbA",
    "type": "session.created",
    "session": {
        "object": "realtime.session",
        "model": "qwen-omni-turbo-realtime",
        "modalities": [
            "text",
            "audio"
        ],
        "voice": "Chelsie",
        "input_audio_format": "pcm16",
        "output_audio_format": "pcm24",
        "input_audio_transcription": {
            "model": "gummy-realtime-v1"
        },
        
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.5,
            "prefix_padding_ms": 300,
            "silence_duration_ms": 800,
            "create_response": true,
            "interrupt_response": true
        },
        "tools": [],
        "tool_choice": "auto",
        "temperature": 0.8,
        "max_response_output_tokens": "inf",
        "id": "sess_ECq2bEeDvXHEPLH9qCW3W"
    }
}
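A client can read the defaults it cares about from this first event before it decides whether to send a session.update request. The sketch below is illustrative only: summarize_session and the keys of the returned dictionary are hypothetical names, and the event is assumed to be already parsed into a Python dict.

```python
from typing import Any, Dict


def summarize_session(event: Dict[str, Any]) -> Dict[str, Any]:
    """Pick out the session defaults a client usually needs up front."""
    if event.get("type") != "session.created":
        raise ValueError("expected a session.created event")
    session = event["session"]
    # A missing or null turn_detection object is treated as "VAD disabled".
    turn_detection = session.get("turn_detection")
    return {
        "session_id": session.get("id"),
        "voice": session.get("voice"),
        "audio_out": "audio" in session.get("modalities", []),
        "vad_enabled": turn_detection is not None,
        "silence_ms": turn_detection.get("silence_duration_ms") if turn_detection else None,
    }


sample_created = {
    "type": "session.created",
    "session": {
        "id": "sess_ECq2bEeDvXHEPLH9qCW3W",
        "modalities": ["text", "audio"],
        "voice": "Chelsie",
        "turn_detection": {"type": "server_vad", "silence_duration_ms": 800},
    },
}
print(summarize_session(sample_created))
```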

session.updated

When a session.update request is processed successfully, the server returns a session.updated event. If the request fails, the server returns an error event instead.

Parameter

Type

Description

type

string

The event type. For this event, the value is session.updated.

session

object

The session configuration.

session.temperature

float

The temperature parameter of the model. The valid value range is [0, 2).

session.modalities

array

The model's output modalities. You can set this parameter to ["text"] for text-only output or ["text", "audio"] for both text and audio output.

session.voice

string

The voice for the model's audio output. For a list of supported voices, see Voice list.

Default voices:

  • Qwen3-Omni-Flash-Realtime: "Cherry"

  • Qwen3-Omni-Turbo-Realtime: "Chelsie"

session.instructions

string

Specifies the system role for the model.

session.input_audio_format

string

The format of the user's input audio. Currently, only pcm16 is supported.

session.output_audio_format

string

The format of the model's output audio. Currently, only pcm24 is supported.

session.input_audio_transcription

object

The configuration for Automatic Speech Recognition (ASR) of the user's input audio.

session.input_audio_transcription.model

string

The model for ASR. The value is fixed at gummy-realtime-v1.

session.turn_detection

object

The configuration for Voice Activity Detection (VAD).

If this parameter is set to null, VAD is disabled, and the client must commit the audio buffer manually by sending input_audio_buffer.commit.

session.turn_detection.type

string

The server-side VAD type. The value is fixed at server_vad.

session.turn_detection.threshold

float

The VAD threshold. You can increase this value in noisy environments and decrease it in quiet environments.

  • A value closer to -1.0 makes it more likely for noise to be detected as speech.

  • A value closer to 1.0 makes it less likely for noise to be detected as speech.

Default value: 0.2. Valid value range: [-1.0, 1.0].

session.turn_detection.silence_duration_ms

integer

The duration of silence, in milliseconds (ms), that triggers a model response. Default value: 800. Valid value range: [200, 6000].

{
    "event_id": "event_XrsEpDePgfEfHakfTwYoy",
    "type": "session.updated",
    "session": {
        "temperature": 0.8,
        "id": "sess_ECq2bEeDvXHEPLH9qCW3W",
        "object": "realtime.session",
        "model": "qwen-omni-turbo-realtime",
        "modalities": [
            "text",
            "audio"
        ],
        "instructions": "Identity:\nYou are Xiaoyun, a personal assistant developed by Alibaba Cloud.\nResponse requirements:\nYour tone should be professional, precise, and confident.",
        "voice": "Chelsie",
        "input_audio_format": "pcm16",
        "output_audio_format": "pcm24",
        "input_audio_transcription": {
            "model": "gummy-realtime-v1"
        },
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.1,
            "prefix_padding_ms": 500,
            "silence_duration_ms": 900,
            "create_response": true,
            "interrupt_response": true
        },
        "max_response_output_tokens": "inf"
    }
}
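The value ranges listed in the table above can be checked on the client before a session.update request is sent, which avoids a round trip that ends in an error event. This is a sketch under stated assumptions: check_session_values is a hypothetical helper, only the ranges documented above are checked, and the order of the entries in modalities is assumed not to matter.

```python
from typing import Any, Dict, List


def check_session_values(session: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means every checked value is in range."""
    problems: List[str] = []

    modalities = session.get("modalities")
    if modalities is not None and sorted(modalities) not in (["text"], ["audio", "text"]):
        problems.append("modalities must be ['text'] or ['text', 'audio']")

    temperature = session.get("temperature")
    if temperature is not None and not 0 <= temperature < 2:
        problems.append("temperature must be in [0, 2)")

    # A null turn_detection (VAD disabled) has nothing to validate.
    turn_detection = session.get("turn_detection") or {}

    threshold = turn_detection.get("threshold")
    if threshold is not None and not -1.0 <= threshold <= 1.0:
        problems.append("turn_detection.threshold must be in [-1.0, 1.0]")

    silence_ms = turn_detection.get("silence_duration_ms")
    if silence_ms is not None and not 200 <= silence_ms <= 6000:
        problems.append("turn_detection.silence_duration_ms must be in [200, 6000]")

    return problems


good_session = {
    "temperature": 0.8,
    "modalities": ["text", "audio"],
    "turn_detection": {"type": "server_vad", "threshold": 0.1, "silence_duration_ms": 900},
}
bad_session = {
    "temperature": 2,
    "modalities": ["audio"],
    "turn_detection": {"type": "server_vad", "threshold": 1.5, "silence_duration_ms": 100},
}
print(check_session_values(good_session))
print(check_session_values(bad_session))
```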

input_audio_buffer.speech_started

In server_vad mode, the server returns the input_audio_buffer.speech_started event when it detects the start of speech in the audio buffer.

This event can occur whenever audio is added to the buffer, unless speech has already been detected.

Parameter

Type

Description

event_id

string

The ID of the event.

type

string

The event type. The value is input_audio_buffer.speech_started.

audio_start_ms

integer

The offset, in milliseconds, from when audio was first written to the buffer in this session to when speech was first detected.

item_id

string

The ID of the user message item that will be created.

{
  "event_id": "event_MOcdMTKH1QQRP5mbGWPHA",
  "type": "input_audio_buffer.speech_started",
  "audio_start_ms": 2022,
  "item_id": "item_Fu4bF8iduL8nfJVsbKb3L"
}

input_audio_buffer.speech_stopped

In server_vad mode, the server returns the input_audio_buffer.speech_stopped event when it detects the end of speech in the audio buffer.

The server also sends a conversation.item.created event, which contains the user message item created from the audio buffer.

Parameter

Type

Description

event_id

string

The event ID.

type

string

The type of the event. The value is always input_audio_buffer.speech_stopped.

audio_end_ms

integer

The time in milliseconds from the start of the session until speech stops.

item_id

string

The ID of the user message item created when speech stops.

{
  "event_id": "event_YmcGFfICPRXBDfgqcpKit",
  "type": "input_audio_buffer.speech_stopped",
  "audio_end_ms": 2823,
  "item_id": "item_Fu4bF8iduL8nfJVsbKb3L"
}
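The speech_started and speech_stopped events share an item_id, so a client can pair them to estimate how long the user spoke. The sketch below uses a hypothetical SpeechTimer class and assumes that audio_start_ms and audio_end_ms are measured from the same origin, which the field descriptions do not state explicitly.

```python
from typing import Any, Dict, Optional


class SpeechTimer:
    """Pair speech_started / speech_stopped events by item_id."""

    def __init__(self) -> None:
        self._starts: Dict[str, int] = {}

    def on_event(self, event: Dict[str, Any]) -> Optional[int]:
        """Return the speech duration in ms when a segment ends, else None."""
        kind = event.get("type")
        if kind == "input_audio_buffer.speech_started":
            self._starts[event["item_id"]] = event["audio_start_ms"]
        elif kind == "input_audio_buffer.speech_stopped":
            start = self._starts.pop(event["item_id"], None)
            if start is not None:
                return event["audio_end_ms"] - start
        return None


timer = SpeechTimer()
started = {
    "type": "input_audio_buffer.speech_started",
    "audio_start_ms": 2022,
    "item_id": "item_Fu4bF8iduL8nfJVsbKb3L",
}
stopped = {
    "type": "input_audio_buffer.speech_stopped",
    "audio_end_ms": 2823,
    "item_id": "item_Fu4bF8iduL8nfJVsbKb3L",
}
first = timer.on_event(started)
duration_ms = timer.on_event(stopped)
print(duration_ms)
```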

input_audio_buffer.committed

In server_vad mode, the server automatically submits the buffer and returns this event when it detects that the user has stopped speaking. In non-server_vad mode, the server returns this event in response to the client's input_audio_buffer.commit event after the client finishes sending audio.

Parameter

Type

Description

event_id

string

The ID of the event.

type

string

The event type. The value is input_audio_buffer.committed.

item_id

string

The ID of the user message item to be created.

{
  "event_id": "event_WwJkVCuJE8CNaRlMWdn8U",
  "type": "input_audio_buffer.committed",
  "item_id": "item_Fu4bF8iduL8nfJVsbKb3L"
}

input_audio_buffer.cleared

After the client sends an input_audio_buffer.clear event, the server returns an input_audio_buffer.cleared event.

Parameter

Type

Description

event_id

string

The event ID.

type

string

The event type. The value is always input_audio_buffer.cleared.

{
  "event_id": "event_1121",
  "type": "input_audio_buffer.cleared"
}

conversation.item.created

This event is returned when a conversation item is created.

Parameter

Type

Description

event_id

string

The ID of the event.

type

string

The event type. The value is always conversation.item.created.

item

object

The item that is added to the conversation.

item.id

string

The unique ID of the conversation item.

item.object

string

The value is always realtime.item.

item.status

string

The status of the conversation item.

item.role

string

The role of the message sender.

item.content

array

The content of the message.

{
    "event_id": "event_FecZDYhYi4LlVyjsbtyMa",
    "type": "conversation.item.created",
    "item": {
        "id": "item_Dm13TFlfnCx9IrosLFyeX",
        "object": "realtime.item",
        "type": "message",
        "status": "in_progress",
        "role": "assistant",
        "content": [
            {
                "type": "input_audio"
            }
        ]
    }
}

conversation.item.input_audio_transcription.completed

This event contains the audio transcription that is generated after the user's audio has been written to the audio buffer. Although the Realtime model accepts audio input, the transcription is handled by a separate process. This process runs on a dedicated automatic speech recognition (ASR) model, which is currently gummy-realtime-v1. The transcribed text may differ from the model's interpretation and is for reference only.

Parameter

Type

Description

event_id

string

The ID of the event.

type

string

The event type. The value is always conversation.item.input_audio_transcription.completed.

item_id

string

The ID of the user message item that contains the audio.

content_index

integer

The index of the content part that contains the audio.

transcript

string

The transcribed text.

{
    "event_id": "event_OHMnbeeCHHoVrJDhFBMNY",
    "type": "conversation.item.input_audio_transcription.completed",
    "item_id": "item_Fu4bF8iduL8nfJVsbKb3L",
    "content_index": 0,
    "transcript": "Hello, hello."
}

conversation.item.input_audio_transcription.failed

If input audio transcription is enabled and fails, the server returns the conversation.item.input_audio_transcription.failed event. This event is separate from other error events so that the client can identify the related item.

Parameter

Type

Description

type

string

The type of the event. The value is always conversation.item.input_audio_transcription.failed.

item_id

string

The ID of the user message item.

content_index

integer

The index of the content part that contains the audio.

error

object

Details about the transcription error.

error.code

string

The error code.

error.message

string

The error message.

error.param

string

The parameter related to the error.

{
  "type": "conversation.item.input_audio_transcription.failed",
  "item_id": "<item_id>",
  "content_index": 0,
  "error": {
    "code": "<code>",
    "message": "<message>",
    "param": "<param>"
  }
}
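Both transcription events carry an item_id, so a client can record the outcome for each user message item in one place. The following is a rough sketch: record_transcription and the two dictionaries are hypothetical, and the error code and message in the sample failed event are placeholders, not documented values.

```python
from typing import Any, Dict, Optional

PREFIX = "conversation.item.input_audio_transcription."


def record_transcription(event: Dict[str, Any],
                         transcripts: Dict[str, str],
                         failures: Dict[str, str]) -> Optional[str]:
    """Store the outcome keyed by item_id; return "completed", "failed", or None."""
    kind = event.get("type", "")
    if kind == PREFIX + "completed":
        transcripts[event["item_id"]] = event["transcript"]
        return "completed"
    if kind == PREFIX + "failed":
        error = event.get("error", {})
        failures[event["item_id"]] = f"{error.get('code')}: {error.get('message')}"
        return "failed"
    return None  # Not a transcription event.


transcripts: Dict[str, str] = {}
failures: Dict[str, str] = {}
ok = record_transcription(
    {"type": PREFIX + "completed", "item_id": "item_a", "content_index": 0,
     "transcript": "Hello, hello."},
    transcripts, failures)
bad = record_transcription(
    {"type": PREFIX + "failed", "item_id": "item_b", "content_index": 0,
     # Placeholder error values for illustration only.
     "error": {"code": "example_code", "message": "example message"}},
    transcripts, failures)
print(transcripts, failures)
```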

response.created

The server sends this event when it starts generating a new model response.

Field

Type

Description

type

string

The event type. The value is fixed at response.created.

event_id

string

The ID of the event.

response

object

The response object.

response.id

string

The unique ID of the response.

response.conversation_id

string

The unique ID of the current session.

response.object

string

The object type. The value is fixed at realtime.response.

response.status

string

The status of the response. Valid values: completed, failed, in_progress, or incomplete.

response.modalities

array

The modalities of the response.

response.voice

string

The voice of the audio generated by the model.

response.output

array

This field is currently empty for this event.

{
    "event_id": "event_JIyHMxVNc9gWgflLBiH1w",
    "type": "response.created",
    "response": {
        "id": "resp_Vb3496XSAdbX732ybCL17",
        "object": "realtime.response",
        "conversation_id": "conv_NGFEtyikW1PRDopyZ52Yv",
        "status": "in_progress",
        "modalities": [
            "text",
            "audio"
        ],
        "voice": "Chelsie",
        "output_audio_format": "pcm16",
        "output": []
    }
}

response.done

The server returns this event when response generation is complete. The Response object in this event contains all output items from the response but excludes the raw audio data that was already sent in previous events.

Field

Type

Description

type

string

The value is fixed at response.done.

response

object

The response object.

response.id

string

The unique ID of the response.

response.object

string

The object type. For this event, the value is fixed at realtime.response.

response.conversation_id

string

The unique ID of the current session.

response.status

string

The final status of the response. Valid values are completed, failed, in_progress, or incomplete.

response.modalities

array

The modalities of the response.

response.voice

string

The voice used for the audio generated by the model.

response.output

array

The output of the response.

response.output.id

string

The ID of the response output.

response.output.object

string

The object type of the output item. The value is fixed at "realtime.item".

response.output.type

string

The type of the output item. The value is fixed at "message".

response.output.status

string

The status of the output item. Valid values are "completed", "incomplete", or "canceled".

response.output.role

string

The role of the output item. Valid values are "user", "assistant", or "system".

response.output.content

array

The content of the output item.

For text output, the format is: type=text, text={model inference result}.

For audio output, the format is: type=audio, transcript={model inference result}.

response.usage

object

The usage information for the response.

{
    "event_id": "event_W7QDmx8EWyInRnRp1O7Df",
    "type": "response.done",
    "response": {
        "id": "resp_Vb3496XSAdbX732ybCL17",
        "object": "realtime.response",
        "conversation_id": "conv_NGFEtyikW1PRDopyZ52Yv",
        "status": "completed",
        "modalities": [
            "text",
            "audio"
        ],
        "voice": "Chelsie",
        "output_audio_format": "pcm16",
        "output": [
            {
                "id": "item_Ln9VTSx895BCbxuSKWmtg",
                "object": "realtime.item",
                "type": "message",
                "status": "completed",
                "role": "assistant",
                "content": [
                    {
                        "type": "audio",
                        "transcript": "Sure, feel free to share any thoughts or questions with me."
                    }
                ]
            }
        ],
        "usage": {
            "total_tokens": 261,
            "cached_tokens": 0,
            "input_tokens": 127,
            "output_tokens": 134,
            "input_token_details": {
                "text_tokens": 48,
                "audio_tokens": 79
            },
            "output_token_details": {
                "text_tokens": 14,
                "audio_tokens": 120
            }
        }
    }
}
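Since response.done carries the full output items and the usage object, it is a convenient place to log the final text and the token counts. This is only a sketch: summarize_response_done is a hypothetical helper, and the sample event is trimmed to the fields it reads.

```python
from typing import Any, Dict, List


def summarize_response_done(event: Dict[str, Any]) -> Dict[str, Any]:
    """Collect the final text and token counts from a response.done event."""
    response = event["response"]
    texts: List[str] = []
    for item in response.get("output", []):
        for part in item.get("content", []):
            # Audio parts carry `transcript`; text parts carry `text`.
            value = part.get("transcript") or part.get("text")
            if value:
                texts.append(value)
    usage = response.get("usage", {})
    return {
        "status": response.get("status"),
        "text": " ".join(texts),
        "input_tokens": usage.get("input_tokens", 0),
        "output_tokens": usage.get("output_tokens", 0),
        "total_tokens": usage.get("total_tokens", 0),
    }


sample_done = {
    "type": "response.done",
    "response": {
        "status": "completed",
        "output": [{
            "content": [{
                "type": "audio",
                "transcript": "Sure, feel free to share any thoughts or questions with me.",
            }],
        }],
        "usage": {"total_tokens": 261, "input_tokens": 127, "output_tokens": 134},
    },
}
summary = summarize_response_done(sample_done)
print(summary["status"], summary["total_tokens"])
```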

response.text.delta

The server returns the response.text.delta event as the model incrementally generates new text.

Field

Type

Description

type

string

The value is fixed at response.text.delta.

response_id

string

The ID of the response.

item_id

string

The ID of the message item. You can use this ID to associate events with the same message item.

output_index

integer

The index of the output item in the response. The value is fixed at 0.

content_index

integer

The index of the content within the output item. The value is fixed at 0.

delta

string

The returned incremental text.

{
  "event_id": "event_B1lIeyOXR7qJMEExbqtTG",
  "type": "response.text.delta",
  "response_id": "resp_B1lIdtjF4Noqpn5NOjznj",
  "item_id": "item_B1lIdJsAJlJiFs8ztWpJt",
  "output_index": 0,
  "content_index": 0,
  "delta": "How"
}

response.text.done

The server returns the response.text.done event when the model finishes generating text.

This event is also returned when a response is interrupted, incomplete, or canceled.

Field

Type

Description

type

string

The value is fixed at response.text.done.

response_id

string

The ID of the response.

item_id

string

The ID of the message item.

output_index

integer

The index of the response output item.

content_index

integer

The index of the content part in the response output item.

text

string

The complete text output by the model.

{
  "event_id": "event_B1lIeE2Nac33zn5V7h2mm",
  "type": "response.text.done",
  "response_id": "resp_B1lIdtjF4Noqpn5NOjznj",
  "item_id": "item_B1lIdJsAJlJiFs8ztWpJt",
  "output_index": 0,
  "content_index": 0,
  "text": "How can I assist you today?"
}
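A client that shows text while it streams can append each delta to a per-item buffer and replace the buffer with the complete text when response.text.done arrives. The sketch below uses a hypothetical TextAssembler class and assumes that deltas for one item_id arrive in order.

```python
from collections import defaultdict
from typing import Any, Dict, List


class TextAssembler:
    """Accumulate response.text.delta events per message item."""

    def __init__(self) -> None:
        self._parts: Dict[str, List[str]] = defaultdict(list)
        self.finished: Dict[str, str] = {}

    def partial(self, item_id: str) -> str:
        """Text streamed so far for an item that is still in progress."""
        return "".join(self._parts.get(item_id, []))

    def on_event(self, event: Dict[str, Any]) -> None:
        kind = event.get("type")
        if kind == "response.text.delta":
            self._parts[event["item_id"]].append(event["delta"])
        elif kind == "response.text.done":
            item_id = event["item_id"]
            streamed = "".join(self._parts.pop(item_id, []))
            # Prefer the server's complete text; fall back to what was streamed.
            self.finished[item_id] = event.get("text", streamed)


assembler = TextAssembler()
item = "item_B1lIdJsAJlJiFs8ztWpJt"
assembler.on_event({"type": "response.text.delta", "item_id": item, "delta": "How"})
assembler.on_event({"type": "response.text.delta", "item_id": item, "delta": " can I assist you today?"})
streamed_so_far = assembler.partial(item)
assembler.on_event({"type": "response.text.done", "item_id": item,
                    "text": "How can I assist you today?"})
print(assembler.finished[item])
```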

response.audio.delta

The server returns the response.audio.delta event as the model incrementally generates new audio data.

Field

Type

Description

type

string

The event type. The value is always response.audio.delta.

response_id

string

The response ID, which is used to associate all outputs from the same response.

item_id

string

The message item ID, which is used to associate events with the same message item.

output_index

integer

The index of the output item in the response. The value is always 0.

content_index

integer

The index of the content within the output item in the response. The value is always 0.

delta

string

The incremental audio data output by the model, encoded in Base64.

{
  "event_id": "event_B1osWMZBtrEQbiIwW0qHQ",
  "type": "response.audio.delta",
  "response_id": "resp_P79OOMs8LnrXVpiIHUCKR",
  "item_id": "item_OFaPGtzfWCPyGzxnuEX9i",
  "output_index": 0,
  "content_index": 0,
  "delta": "{base64 audio}"
}

response.audio.done

The server returns the response.audio.done event when the model finishes generating audio data.

This event is also returned when a response is interrupted, incomplete, or canceled.

Field

Type

Description

type

string

The value is fixed as response.audio.done.

response_id

string

The response ID. This ID is used to associate all outputs of the same response.

item_id

string

The message item ID. This ID is used to associate events with the same message item.

output_index

integer

The index of the output item in the response. The value is fixed as 0.

content_index

integer

The index of the content part of the output item in the response. The value is fixed as 0.

{
  "event_id": "event_B1osWMWoDRYyITDyNYcBu",
  "type": "response.audio.done",
  "response_id": "resp_P79OOMs8LnrXVpiIHUCKR",
  "item_id": "item_OFaPGtzfWCPyGzxnuEX9i",
  "output_index": 0,
  "content_index": 0
}
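Audio arrives as Base64 text in response.audio.delta events and ends with response.audio.done, so a client has to decode and buffer the chunks itself. The following sketch uses a hypothetical AudioCollector class and assumes that each delta is an independently decodable Base64 string. It makes no assumption about sample rate or channel layout and only collects raw bytes.

```python
import base64
from typing import Any, Dict


class AudioCollector:
    """Decode and buffer response.audio.delta chunks per message item."""

    def __init__(self) -> None:
        self._chunks: Dict[str, bytearray] = {}
        self.completed: Dict[str, bytes] = {}

    def on_event(self, event: Dict[str, Any]) -> None:
        kind = event.get("type")
        if kind == "response.audio.delta":
            buffer = self._chunks.setdefault(event["item_id"], bytearray())
            buffer.extend(base64.b64decode(event["delta"]))
        elif kind == "response.audio.done":
            # The done event has no payload; it only marks the end of the stream.
            item_id = event["item_id"]
            self.completed[item_id] = bytes(self._chunks.pop(item_id, b""))


collector = AudioCollector()
fake_audio = bytes(range(8))  # Stand-in bytes, not real audio.
for chunk in (fake_audio[:4], fake_audio[4:]):
    collector.on_event({
        "type": "response.audio.delta",
        "item_id": "item_1",
        "delta": base64.b64encode(chunk).decode("ascii"),
    })
collector.on_event({"type": "response.audio.done", "item_id": "item_1"})
print(len(collector.completed["item_1"]))
```

In a real client the buffered bytes would typically be handed to an audio player as they arrive instead of waiting for response.audio.done.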

response.audio_transcript.delta

The server returns the response.audio_transcript.delta event as the model incrementally generates text corresponding to the new audio.

Field

Type

Description

type

string

The value is fixed as response.audio_transcript.delta.

response_id

string

The ID of the response. This ID is used to associate all outputs with the same response.

item_id

string

The ID of the message item. This ID is used to associate events with the same message item.

output_index

integer

The index of the output item in the response. The value is fixed as 0.

content_index

integer

The index of the content part within the output item in the response. The value is fixed as 0.

delta

string

The incremental text.

{
    "event_id": "event_OcoAVmmbMQnirKeVFag9x",
    "type": "response.audio_transcript.delta",
    "response_id": "resp_P79OOMs8LnrXVpiIHUCKR",
    "item_id": "item_OFaPGtzfWCPyGzxnuEX9i",
    "output_index": 0,
    "content_index": 0,
    "delta": "Hello"
}

response.audio_transcript.done

The server returns the response.audio_transcript.done event when the model finishes generating the text that corresponds to the new audio.

Field

Type

Description

type

string

The value is always response.audio_transcript.done.

response_id

string

The ID of the response. You can use this ID to associate all outputs from the same response.

item_id

string

The ID of the message item. You can use this ID to associate events that belong to the same message item.

output_index

integer

The index of the output item in the response. The value is always 0.

content_index

integer

The index of the content part of the output item in the response. The value is always 0.

transcript

string

The final generated text.

{
    "event_id": "event_VN4Q4GJugLcc1S23viW8E",
    "type": "response.audio_transcript.done",
    "response_id": "resp_P79OOMs8LnrXVpiIHUCKR",
    "item_id": "item_JvJauNH2CTXb1D9WV6pD4",
    "output_index": 0,
    "content_index": 0,
    "transcript": "Hello, I am a large language model developed by Alibaba Cloud. My name is Qwen. How can I help you?"
}

response.output_item.added

The server sends this event when a new output item is added.

Field

Type

Description

type

string

The value is fixed as response.output_item.added.

response_id

string

The response ID. You can use this ID to associate all outputs of the same response.

output_index

integer

The index of the output item in the response. The value is fixed as 0.

item

object

Information about the output item.

item.id

string

The unique ID of the output item.

item.object

string

The value is always realtime.item.

item.status

string

The status of the output item.

item.role

string

The role of the message sender.

item.content

array

The content of the message.

{
    "event_id": "event_B4O5yPt3Gjnjy5eYH3plG",
    "type": "response.output_item.added",
    "response_id": "resp_P79OOMs8LnrXVpiIHUCKR",
    "output_index": 0,
    "item": {
        "id": "item_OFaPGtzfWCPyGzxnuEX9i",
        "object": "realtime.item",
        "type": "message",
        "status": "in_progress",
        "role": "assistant",
        "content": []
    }
}

response.output_item.done

The server returns this event when the output for the new item is complete.

Field

Type

Description

type

string

The value is fixed at response.output_item.done.

response_id

string

The ID of the response.

output_index

integer

The index of the output item in the response. The value is fixed at 0.

item

object

Information about the output item.

item.id

string

The unique ID of the output item.

item.object

string

The value is always realtime.item.

item.status

string

The status of the output item.

item.role

string

The role of the message sender.

item.content

array

The content of the message.

{
    "event_id": "event_XkiwbYTBC9Wcdwy6uYJ2G",
    "type": "response.output_item.done",
    "response_id": "resp_P79OOMs8LnrXVpiIHUCKR",
    "output_index": 0,
    "item": {
        "id": "item_JvJauNH2CTXb1D9WV6pD4",
        "object": "realtime.item",
        "type": "message",
        "status": "completed",
        "role": "assistant",
        "content": [
            {
                "type": "audio",
                "text": "Hello, I am a large language model developed by Alibaba Cloud. My name is Qwen. How can I help you?"
            }
        ]
    }
}

response.content_part.added

The server sends this event when a new content part is added.

Field

Type

Description

type

string

The value is always response.content_part.added.

response_id

string

The ID of the response.

item_id

string

The message item ID.

output_index

integer

The index of the response output item. The value is always 0.

content_index

integer

The index of the content part in the response output item. The value is always 0.

part

object

The content part.

part.type

string

The type of the content part.

part.text

string

The text of the content part.

{
    "event_id": "event_J2UixwYKZsXg7c9YXZetL",
    "type": "response.content_part.added",
    "response_id": "resp_P79OOMs8LnrXVpiIHUCKR",
    "item_id": "item_OFaPGtzfWCPyGzxnuEX9i",
    "output_index": 0,
    "content_index": 0,
    "part": {
        "type": "audio",
        "text": ""
    }
}

response.content_part.done

The server returns this event when the output for the new content part is complete.

Field

Type

Description

type

string

The value is set to response.content_part.done.

response_id

string

The ID of the response.

item_id

string

The message item ID.

output_index

integer

The index of the response output item. The value is set to 0.

content_index

integer

The index of the content part within the response output item. The value is set to 0.

part

object

The completed content part.

part.type

string

The type of the content part.

part.text

string

The text of the content part.

{
    "event_id": "event_FdVUyXIa8WVk4BZJv8swq",
    "type": "response.content_part.done",
    "response_id": "resp_QeZcSlvzRmmjIURRMafY8",
    "item_id": "item_HvJYzNHXC1MnzvgBfIxJD",
    "output_index": 0,
    "content_index": 0,
    "part": {
        "type": "audio",
        "text": "I'm not sure what time it is either. You could check your phone or a clock. If there's anything else you'd like to talk about, just let me know."
    }
}