Server-side events for the Realtime API - Alibaba Cloud Model Studio

This topic describes the server-side events for the Qwen-Omni Real-time API.

For more information, see Real-time multimodal.

Server-side events

error

The server returns an error message for both client-side and server-side errors. Most errors are recoverable and do not affect the session.

Parameter	Type	Description
type	string	The event type. For this event, the value is fixed to `error`.
error	object	Details about the error.
error.type	string	The error type.
error.code	string	The error code.
error.message	string	The error message.
error.param	string	The parameter related to the error, such as `session.modalities`.

{
  "event_id": "event_RoUu4T8yExPMI37GKwaOC",
  "type": "error",
  "error": {
    "type": "invalid_request_error",
    "code": "invalid_value",
    "message": "Invalid modalities: ['audio']. Supported combinations are: ['text'] and ['audio', 'text'].",
    "param": "session.modalities"
  }
}
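
As a minimal sketch of how a client might surface this event, the following Python snippet turns an `error` payload into a one-line log message. The function name `describe_error` is hypothetical and not part of any SDK, and the sample message is abbreviated from the example above.

```python
import json


def describe_error(raw_event: str) -> str:
    """Return a one-line summary of an `error` event, or "" for any other event."""
    event = json.loads(raw_event)
    if event.get("type") != "error":
        return ""
    err = event.get("error", {})
    summary = f"{err.get('type')}/{err.get('code')}: {err.get('message')}"
    # `param` names the offending parameter, such as session.modalities.
    if err.get("param"):
        summary += f" (param: {err['param']})"
    return summary


# Sample payload modeled on the documented example (message abbreviated).
sample = json.dumps({
    "event_id": "event_RoUu4T8yExPMI37GKwaOC",
    "type": "error",
    "error": {
        "type": "invalid_request_error",
        "code": "invalid_value",
        "message": "Invalid modalities: ['audio'].",
        "param": "session.modalities",
    },
})
print(describe_error(sample))
```

Because most errors are recoverable, a client would usually log the summary and keep the session open rather than disconnect.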

session.created

This is the first event that the server sends after a client connects. The event contains the default configurations for the connection.

Parameter	Type	Description
type	string	The event type. For this event, the value is set to `session.created`.
session	object	The session configuration.
session.modalities	array	The output modalities of the model. You can set this to ["text"] or ["text", "audio"]. Setting this parameter to ["audio"] is not supported.
session.voice	string	The voice of the model's generated audio. For a list of supported voices, see Voice list. Default voices: "Cherry" for Qwen3-Omni-Flash-Realtime; "Chelsie" for Qwen3-Omni-Turbo-Realtime.
session.input_audio_format	string	The format of the user's input audio. Currently, only "pcm16" is supported.
session.output_audio_format	string	The format of the model's output audio. Currently, only "pcm24" is supported.
session.input_audio_transcription	object	The configuration for enabling automatic speech recognition (ASR) for the user's input audio.
session.input_audio_transcription.model	string	The model for ASR. The value is set to "gummy-realtime-v1".
session.turn_detection	object	The configuration for voice activity detection (VAD). If you set this to None, VAD is disabled, and the user must send audio manually.
session.turn_detection.type	string	The server-side VAD type. The value is set to "server_vad".
session.turn_detection.threshold	float	The VAD threshold. Increase this value in noisy environments and decrease it in quiet environments. The closer the value is to -1.0, the more likely it is that noise is detected as speech. The closer the value is to 1.0, the less likely it is that noise is detected as speech. Default: 0.2. Valid values: [-1.0, 1.0].
session.turn_detection.silence_duration_ms	integer	The duration of silence in milliseconds (ms) that triggers a model response. Default: 800. Valid values: [200, 6000].

{
    "event_id": "event_XlIT7ohkMGIsuSz154lbA",
    "type": "session.created",
    "session": {
        "object": "realtime.session",
        "model": "qwen-omni-turbo-realtime",
        "modalities": [
            "text",
            "audio"
        ],
        "voice": "Chelsie",
        "input_audio_format": "pcm16",
        "output_audio_format": "pcm24",
        "input_audio_transcription": {
            "model": "gummy-realtime-v1"
        },
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.5,
            "prefix_padding_ms": 300,
            "silence_duration_ms": 800,
            "create_response": true,
            "interrupt_response": true
        },
        "tools": [],
        "tool_choice": "auto",
        "temperature": 0.8,
        "max_response_output_tokens": "inf",
        "id": "sess_ECq2bEeDvXHEPLH9qCW3W"
    }
}

session.updated

When a session.update request is processed successfully, the server returns a session.updated event. If the request fails, the server returns an error event instead.

Parameter	Type	Description
type	string	The event type. For this event, the value is `session.updated`.
session	object	The session configuration.
session.temperature	float	The temperature parameter of the model. The valid value range is [0, 2).
session.modalities	array	The model's output modalities. You can set this parameter to ["text"] for text-only output or ["text", "audio"] for both text and audio output.
session.voice	string	The voice for the model's audio output. For a list of supported voices, see Voice list. Default voices: "Cherry" for Qwen3-Omni-Flash-Realtime; "Chelsie" for Qwen3-Omni-Turbo-Realtime.
session.instructions	string	Specifies the system role for the model.
session.input_audio_format	string	The format of the user's input audio. Currently, only pcm16 is supported.
session.output_audio_format	string	The format of the model's output audio. Currently, only pcm24 is supported.
session.input_audio_transcription	object	The configuration for Automatic Speech Recognition (ASR) of the user's input audio.
session.input_audio_transcription.model	string	The model for ASR. The value is fixed at gummy-realtime-v1.
session.turn_detection	object	The configuration for Voice Activity Detection (VAD). If you set this parameter to None, VAD is disabled and the user must manually send audio.
session.turn_detection.type	string	The server-side VAD type. The value is fixed at server_vad.
session.turn_detection.threshold	float	The VAD threshold. You can increase this value in noisy environments and decrease it in quiet environments. A value closer to -1.0 makes it more likely for noise to be detected as speech. A value closer to 1.0 makes it less likely for noise to be detected as speech. Default value: 0.2. Valid value range: [-1.0, 1.0].
session.turn_detection.silence_duration_ms	integer	The duration of silence, in milliseconds (ms), that triggers a model response. Default value: 800. Valid value range: [200, 6000].

{
    "event_id": "event_XrsEpDePgfEfHakfTwYoy",
    "type": "session.updated",
    "session": {
        "temperature": 0.8,
        "id": "sess_ECq2bEeDvXHEPLH9qCW3W",
        "object": "realtime.session",
        "model": "qwen-omni-turbo-realtime",
        "modalities": [
            "text",
            "audio"
        ],
        "instructions": "Identity:\nYou are Xiaoyun, a personal assistant developed by Alibaba Cloud.\nResponse requirements:\nYour tone should be professional, precise, and confident.",
        "voice": "Chelsie",
        "input_audio_format": "pcm16",
        "output_audio_format": "pcm24",
        "input_audio_transcription": {
            "model": "gummy-realtime-v1"
        },
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.1,
            "prefix_padding_ms": 500,
            "silence_duration_ms": 900,
            "create_response": true,
            "interrupt_response": true
        },
        "max_response_output_tokens": "inf"
    }
}
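
The value ranges listed in the table above can be checked on the client before a session.update request is sent, which avoids a round trip that would end in an error event. The following Python sketch is one way to do that; the function name `validate_session` is hypothetical, and the checks cover only the ranges stated in this topic.

```python
from typing import List


def validate_session(session: dict) -> List[str]:
    """Return a list of problems found in a session configuration (empty if none)."""
    problems = []

    modalities = session.get("modalities")
    # Only text, or text plus audio, is documented as supported.
    if modalities is not None and set(modalities) not in ({"text"}, {"text", "audio"}):
        problems.append("modalities must be ['text'] or ['text', 'audio']")

    temperature = session.get("temperature")
    if temperature is not None and not (0 <= temperature < 2):
        problems.append("temperature must be in [0, 2)")

    vad = session.get("turn_detection")
    if vad is not None:  # None means VAD is disabled, so there is nothing to check
        threshold = vad.get("threshold")
        if threshold is not None and not (-1.0 <= threshold <= 1.0):
            problems.append("turn_detection.threshold must be in [-1.0, 1.0]")
        silence = vad.get("silence_duration_ms")
        if silence is not None and not (200 <= silence <= 6000):
            problems.append("turn_detection.silence_duration_ms must be in [200, 6000]")

    return problems


good = {
    "modalities": ["text", "audio"],
    "temperature": 0.8,
    "turn_detection": {"threshold": 0.1, "silence_duration_ms": 900},
}
bad = {
    "modalities": ["audio"],
    "temperature": 2,
    "turn_detection": {"threshold": 1.5, "silence_duration_ms": 100},
}
print(validate_session(good))
print(validate_session(bad))
```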

input_audio_buffer.speech_started

In server_vad mode, the server returns the input_audio_buffer.speech_started event when it detects the start of speech in the audio buffer.

This event can occur whenever audio is added to the buffer, unless speech has already been detected.

Parameter	Type	Description
event_id	string	The ID of the event.
type	string	The event type. The value is `input_audio_buffer.speech_started`.
audio_start_ms	integer	The time, in milliseconds, from when audio was first written to the buffer in this session to when speech was first detected.
item_id	string	The ID of the user message item that will be created.

{
  "event_id": "event_MOcdMTKH1QQRP5mbGWPHA",
  "type": "input_audio_buffer.speech_started",
  "audio_start_ms": 2022,
  "item_id": "item_Fu4bF8iduL8nfJVsbKb3L"
}

input_audio_buffer.speech_stopped

In server_vad mode, the server returns the input_audio_buffer.speech_stopped event when it detects the end of speech in the audio buffer.

The server also sends a conversation.item.created event, which contains the user message item created from the audio buffer.

Parameter	Type	Description
event_id	string	The event ID.
type	string	The type of the event. The value is always `input_audio_buffer.speech_stopped`.
audio_end_ms	integer	The time in milliseconds from the start of the session until speech stops.
item_id	string	The ID of the user message item created when speech stops.

{
  "event_id": "event_YmcGFfICPRXBDfgqcpKit",
  "type": "input_audio_buffer.speech_stopped",
  "audio_end_ms": 2823,
  "item_id": "item_Fu4bF8iduL8nfJVsbKb3L"
}
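
Because the speech_started and speech_stopped events share an item_id, a client can pair them to estimate how long the user spoke. The Python sketch below assumes that audio_start_ms and audio_end_ms are measured from the same origin, which this topic does not state explicitly; the class name `SpeechTracker` is hypothetical.

```python
from typing import Optional


class SpeechTracker:
    """Pairs speech_started and speech_stopped events by item_id."""

    def __init__(self) -> None:
        self._starts = {}  # item_id -> audio_start_ms

    def handle(self, event: dict) -> Optional[int]:
        """Return the speech duration in ms when a segment ends, otherwise None."""
        kind = event.get("type")
        if kind == "input_audio_buffer.speech_started":
            self._starts[event["item_id"]] = event["audio_start_ms"]
        elif kind == "input_audio_buffer.speech_stopped":
            start = self._starts.pop(event["item_id"], None)
            if start is not None:
                return event["audio_end_ms"] - start
        return None


tracker = SpeechTracker()
started = {
    "type": "input_audio_buffer.speech_started",
    "audio_start_ms": 2022,
    "item_id": "item_Fu4bF8iduL8nfJVsbKb3L",
}
stopped = {
    "type": "input_audio_buffer.speech_stopped",
    "audio_end_ms": 2823,
    "item_id": "item_Fu4bF8iduL8nfJVsbKb3L",
}
tracker.handle(started)
print(tracker.handle(stopped))  # 2823 - 2022 = 801
```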

input_audio_buffer.committed

In server_vad mode, the server automatically submits the buffer and returns this event when it detects that the user has stopped speaking. In non-server_vad mode, the server returns this event in response to the client's input_audio_buffer.commit event after the client finishes sending audio.

Parameter	Type	Description
event_id	string	The ID of the event.
type	string	The event type. The value is `input_audio_buffer.committed`.
item_id	string	The ID of the user message item to be created.

{
  "event_id": "event_WwJkVCuJE8CNaRlMWdn8U",
  "type": "input_audio_buffer.committed",
  "item_id": "item_Fu4bF8iduL8nfJVsbKb3L"
}

input_audio_buffer.cleared

After the client sends an input_audio_buffer.clear event, the server returns an input_audio_buffer.cleared event.

Parameter	Type	Description
event_id	string	The event ID.
type	string	The event type. The value is always `input_audio_buffer.cleared`.

{
  "event_id": "event_1121",
  "type": "input_audio_buffer.cleared"
}

conversation.item.created

This event is returned when a conversation item is created.

Parameter	Type	Description
event_id	string	The ID of the event.
type	string	The event type. The value is always `conversation.item.created`.
item	object	The item that is added to the conversation.
item.id	string	The unique ID of the conversation item.
item.object	string	The value is always `realtime.item`.
item.status	string	The status of the conversation item.
item.role	string	The role of the message sender.
item.content	array	The content of the message.

{
    "event_id": "event_FecZDYhYi4LlVyjsbtyMa",
    "type": "conversation.item.created",
    "item": {
        "id": "item_Dm13TFlfnCx9IrosLFyeX",
        "object": "realtime.item",
        "type": "message",
        "status": "in_progress",
        "role": "assistant",
        "content": [
            {
                "type": "input_audio"
            }
        ]
    }
}

conversation.item.input_audio_transcription.completed

This event contains the audio transcription that is generated after the user's audio has been written to the audio buffer. Although the Realtime model accepts audio input, the transcription is handled by a separate process. This process runs on a dedicated automatic speech recognition (ASR) model, which is currently gummy-realtime-v1. The transcribed text may differ from the model's interpretation and is for reference only.

Parameter	Type	Description
event_id	string	The ID of the event.
type	string	The event type. The value is always `conversation.item.input_audio_transcription.completed`.
item_id	string	The ID of the user message item that contains the audio.
content_index	integer	The index of the content part that contains the audio.
transcript	string	The transcribed text.

{
    "event_id": "event_OHMnbeeCHHoVrJDhFBMNY",
    "type": "conversation.item.input_audio_transcription.completed",
    "item_id": "item_Fu4bF8iduL8nfJVsbKb3L",
    "content_index": 0,
    "transcript": "Hello, hello."
}

conversation.item.input_audio_transcription.failed

If input audio transcription is enabled and fails, the server returns the conversation.item.input_audio_transcription.failed event. This event is separate from other error events so that the client can identify the related item.

Parameter	Type	Description
type	string	The type of the event. The value is always `conversation.item.input_audio_transcription.failed`.
item_id	string	The ID of the user message item.
content_index	integer	The index of the content part that contains the audio.
error	object	Details about the error.
error.code	string	The error code.
error.message	string	The error message.
error.param	string	The parameter related to the error.

{
  "type": "conversation.item.input_audio_transcription.failed",
  "item_id": "<item_id>",
  "content_index": 0,
  "error": {
    "code": "<code>",
    "message": "<message>",
    "param": "<param>"
  }
}
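
Since the completed and failed transcription events both carry item_id and content_index, a client can record the outcome per content part. The following Python sketch shows one possible approach; `record_transcription` and the shape of the `store` dictionary are illustrative assumptions, not part of the API.

```python
PREFIX = "conversation.item.input_audio_transcription."


def record_transcription(store: dict, event: dict) -> None:
    """Store the transcription outcome keyed by (item_id, content_index)."""
    kind = event.get("type", "")
    if not kind.startswith(PREFIX):
        return  # not a transcription event
    key = (event["item_id"], event["content_index"])
    if kind == PREFIX + "completed":
        store[key] = {"ok": True, "transcript": event["transcript"]}
    elif kind == PREFIX + "failed":
        error = event.get("error", {})
        store[key] = {
            "ok": False,
            "code": error.get("code"),
            "message": error.get("message"),
        }


results = {}
record_transcription(results, {
    "type": PREFIX + "completed",
    "item_id": "item_Fu4bF8iduL8nfJVsbKb3L",
    "content_index": 0,
    "transcript": "Hello, hello.",
})
record_transcription(results, {
    "type": PREFIX + "failed",
    "item_id": "item_example",  # placeholder ID for illustration
    "content_index": 0,
    "error": {"code": "<code>", "message": "<message>", "param": "<param>"},
})
print(results)
```

Keying by item rather than logging a generic error keeps a failed transcription tied to the user message it belongs to.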

response.created

The server sends this event when it starts generating a new model response.

Field	Type	Description
type	string	The event type. The value is fixed at `response.created`.
event_id	string	The ID of the event.
response	object	The response object.
response.id	string	The unique ID of the response.
response.conversation_id	string	The unique ID of the current session.
response.object	string	The object type. The value is fixed at `realtime.response`.
response.status	string	The status of the response. Valid values: completed, failed, in_progress, or incomplete. The example below shows in_progress.
response.modalities	array	The modalities of the response.
response.voice	string	The voice of the audio generated by the model.
response.output	array	This field is currently empty for this event.

{
    "event_id": "event_JIyHMxVNc9gWgflLBiH1w",
    "type": "response.created",
    "response": {
        "id": "resp_Vb3496XSAdbX732ybCL17",
        "object": "realtime.response",
        "conversation_id": "conv_NGFEtyikW1PRDopyZ52Yv",
        "status": "in_progress",
        "modalities": [
            "text",
            "audio"
        ],
        "voice": "Chelsie",
        "output_audio_format": "pcm16",
        "output": []
    }
}

response.done

The server returns this event when response generation is complete. The Response object in this event contains all output items from the response but excludes the raw audio data that was already sent in previous events.

Field	Type	Description
type	string	The value is fixed at `response.done`.
response	object	The response object.
response.id	string	The unique ID of the response.
response.object	string	The object type. For this event, the value is fixed at `realtime.response`.
response.conversation_id	string	The unique ID of the current session.
response.status	string	The final status of the response. Valid values are completed, failed, in_progress, or incomplete.
response.modalities	array	The modalities of the response.
response.voice	string	The voice used for the audio generated by the model.
response.output	array	The output of the response.
response.output.id	string	The ID of the response output.
response.output.object	string	The object type of the output item. The value is fixed at `"realtime.item"`.
response.output.type	string	The type of the output item. The value is fixed at `"message"`.
response.output.status	string	The status of the output item. Valid values are "completed", "incomplete", or "canceled".
response.output.role	string	The role of the output item. Valid values are "user", "assistant", or "system".
response.output.content	array	The content of the output item. For text output, each element has the form type=text, text={model inference result}. For audio output, each element has the form type=audio, transcript={model inference result}.
response.usage	object	The usage information for the response.

{
    "event_id": "event_W7QDmx8EWyInRnRp1O7Df",
    "type": "response.done",
    "response": {
        "id": "resp_Vb3496XSAdbX732ybCL17",
        "object": "realtime.response",
        "conversation_id": "conv_NGFEtyikW1PRDopyZ52Yv",
        "status": "completed",
        "modalities": [
            "text",
            "audio"
        ],
        "voice": "Chelsie",
        "output_audio_format": "pcm16",
        "output": [
            {
                "id": "item_Ln9VTSx895BCbxuSKWmtg",
                "object": "realtime.item",
                "type": "message",
                "status": "completed",
                "role": "assistant",
                "content": [
                    {
                        "type": "audio",
                        "transcript": "Sure, feel free to share any thoughts or questions with me."
                    }
                ]
            }
        ],
        "usage": {
            "total_tokens": 261,
            "cached_tokens": 0,
            "input_tokens": 127,
            "output_tokens": 134,
            "input_token_details": {
                "text_tokens": 48,
                "audio_tokens": 79
            },
            "output_token_details": {
                "text_tokens": 14,
                "audio_tokens": 120
            }
        }
    }
}
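
As an illustration of reading this event, the Python sketch below collects the generated text and the token usage from a response.done payload. The helper name `summarize_response_done` is hypothetical, and the sample is trimmed from the example above.

```python
def summarize_response_done(event: dict) -> dict:
    """Collect status, generated text, and token usage from a response.done event."""
    response = event["response"]
    texts = []
    for item in response.get("output", []):
        for part in item.get("content", []):
            # Audio parts carry "transcript"; text parts carry "text".
            value = part.get("transcript") or part.get("text")
            if value:
                texts.append(value)
    usage = response.get("usage", {})
    return {
        "status": response.get("status"),
        "text": " ".join(texts),
        "input_tokens": usage.get("input_tokens"),
        "output_tokens": usage.get("output_tokens"),
        "total_tokens": usage.get("total_tokens"),
    }


done_event = {
    "type": "response.done",
    "response": {
        "id": "resp_Vb3496XSAdbX732ybCL17",
        "status": "completed",
        "output": [{
            "id": "item_Ln9VTSx895BCbxuSKWmtg",
            "role": "assistant",
            "content": [{
                "type": "audio",
                "transcript": "Sure, feel free to share any thoughts or questions with me.",
            }],
        }],
        "usage": {"total_tokens": 261, "input_tokens": 127, "output_tokens": 134},
    },
}
summary = summarize_response_done(done_event)
print(summary["status"], summary["total_tokens"])
```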

response.text.delta

The server returns the response.text.delta event as the model incrementally generates new text.

Field	Type	Description
type	string	The value is fixed at `response.text.delta`.
response_id	string	The ID of the response.
item_id	string	The ID of the message item. You can use this ID to associate events with the same message item.
output_index	integer	The index of the output item in the response. The value is fixed at 0.
content_index	integer	The index of the content within the output item. The value is fixed at 0.
delta	string	The returned incremental text.

{
  "event_id": "event_B1lIeyOXR7qJMEExbqtTG",
  "type": "response.text.delta",
  "response_id": "resp_B1lIdtjF4Noqpn5NOjznj",
  "item_id": "item_B1lIdJsAJlJiFs8ztWpJt",
  "output_index": 0,
  "content_index": 0,
  "delta": "How"
}

response.text.done

The server returns the response.text.done event when the model finishes generating text.

This event is also returned when a response is interrupted, incomplete, or canceled.

Field	Type	Description
type	string	The value is fixed at `response.text.done`.
response_id	string	The ID of the response.
item_id	string	The ID of the message item.
output_index	integer	The index of the response output item.
content_index	integer	The index of the content part in the response output item.
text	string	The complete text output by the model.

{
  "event_id": "event_B1lIeE2Nac33zn5V7h2mm",
  "type": "response.text.done",
  "response_id": "resp_B1lIdtjF4Noqpn5NOjznj",
  "item_id": "item_B1lIdJsAJlJiFs8ztWpJt",
  "output_index": 0,
  "content_index": 0,
  "text": "How can I assist you today?"
}
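
The relationship between response.text.delta and response.text.done can be sketched as a small accumulator: deltas are appended per message item for live display, and the done event supplies the complete text. The class name `TextAssembler` is hypothetical. Because response.text.done is also returned for interrupted or canceled responses, the `matches_stream` flag is only a consistency check, not a guarantee made by the API.

```python
class TextAssembler:
    """Accumulates response.text.delta fragments per (response_id, item_id)."""

    def __init__(self) -> None:
        self._buffers = {}

    def handle(self, event: dict):
        key = (event.get("response_id"), event.get("item_id"))
        kind = event.get("type")
        if kind == "response.text.delta":
            # Append the new fragment to whatever has been streamed so far.
            self._buffers[key] = self._buffers.get(key, "") + event["delta"]
            return None
        if kind == "response.text.done":
            streamed = self._buffers.pop(key, "")
            return {"text": event["text"], "matches_stream": streamed == event["text"]}
        return None

    def partial(self, response_id: str, item_id: str) -> str:
        """Return the text streamed so far for one message item."""
        return self._buffers.get((response_id, item_id), "")


ids = {"response_id": "resp_B1lIdtjF4Noqpn5NOjznj", "item_id": "item_B1lIdJsAJlJiFs8ztWpJt"}
assembler = TextAssembler()
assembler.handle({"type": "response.text.delta", "delta": "How", **ids})
assembler.handle({"type": "response.text.delta", "delta": " can I assist you today?", **ids})
streamed_so_far = assembler.partial(ids["response_id"], ids["item_id"])
final = assembler.handle({"type": "response.text.done", "text": "How can I assist you today?", **ids})
print(streamed_so_far)
print(final)
```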

response.audio.delta

The server returns the response.audio.delta event as the model incrementally generates new audio data.

Field	Type	Description
type	string	The value is fixed at `response.audio.delta`.
response_id	string	The response ID, which is used to associate all outputs from the same response.
item_id	string	The message item ID, which is used to associate events with the same message item.
output_index	integer	The index of the output item in the response. The value is always 0.
content_index	integer	The index of the content within the output item in the response. The value is always 0.
delta	string	The incremental audio data output by the model, encoded in Base64.

{
  "event_id": "event_B1osWMZBtrEQbiIwW0qHQ",
  "type": "response.audio.delta",
  "response_id": "resp_P79OOMs8LnrXVpiIHUCKR",
  "item_id": "item_OFaPGtzfWCPyGzxnuEX9i",
  "output_index": 0,
  "content_index": 0,
  "delta": "{base64 audio}"
}

response.audio.done

The server returns the response.audio.done event when the model finishes generating audio data.

This event is also returned when a response is interrupted, incomplete, or canceled.

Field	Type	Description
type	string	The value is fixed at `response.audio.done`.
response_id	string	The response ID. This ID is used to associate all outputs of the same response.
item_id	string	The message item ID. This ID is used to associate events with the same message item.
output_index	integer	The index of the output item in the response. The value is fixed at 0.
content_index	integer	The index of the content part of the output item in the response. The value is fixed at 0.

{
  "event_id": "event_B1osWMWoDRYyITDyNYcBu",
  "type": "response.audio.done",
  "response_id": "resp_P79OOMs8LnrXVpiIHUCKR",
  "item_id": "item_OFaPGtzfWCPyGzxnuEX9i",
  "output_index": 0,
  "content_index": 0
}
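
Because each response.audio.delta event carries a Base64-encoded chunk, a client has to decode and concatenate the chunks to recover the audio stream. The Python sketch below shows one way to do that per message item; the class name `AudioCollector` is hypothetical, and the sample bytes are placeholder data rather than real model output.

```python
import base64
from typing import Optional


class AudioCollector:
    """Decodes Base64 audio deltas and joins them per message item."""

    def __init__(self) -> None:
        self._chunks = {}  # item_id -> bytearray of decoded audio

    def handle(self, event: dict) -> Optional[bytes]:
        """Return the complete audio bytes on response.audio.done, otherwise None."""
        kind = event.get("type")
        if kind == "response.audio.delta":
            buffer = self._chunks.setdefault(event["item_id"], bytearray())
            buffer.extend(base64.b64decode(event["delta"]))
        elif kind == "response.audio.done":
            return bytes(self._chunks.pop(event["item_id"], b""))
        return None


collector = AudioCollector()
item = "item_OFaPGtzfWCPyGzxnuEX9i"
# Placeholder bytes standing in for audio samples.
for chunk in (b"\x01\x00\x02\x00", b"\x03\x00"):
    collector.handle({
        "type": "response.audio.delta",
        "item_id": item,
        "delta": base64.b64encode(chunk).decode("ascii"),
    })
audio = collector.handle({"type": "response.audio.done", "item_id": item})
print(len(audio))
```

A playback-oriented client would more likely feed each decoded chunk to the audio device as it arrives; buffering until response.audio.done is the simpler choice when the audio is saved to a file.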

response.audio_transcript.delta

The server returns the response.audio_transcript.delta event as the model incrementally generates text corresponding to the new audio.

Field	Type	Description
type	string	The value is fixed at `response.audio_transcript.delta`.
response_id	string	The ID of the response. This ID is used to associate all outputs with the same response.
item_id	string	The ID of the message item. This ID is used to associate events with the same message item.
output_index	integer	The index of the output item in the response. The value is fixed at 0.
content_index	integer	The index of the content part of the output item in the response. The value is fixed at 0.
delta	string	The incremental text.

{
    "event_id": "event_OcoAVmmbMQnirKeVFag9x",
    "type": "response.audio_transcript.delta",
    "response_id": "resp_P79OOMs8LnrXVpiIHUCKR",
    "item_id": "item_OFaPGtzfWCPyGzxnuEX9i",
    "output_index": 0,
    "content_index": 0,
    "delta": "Hello"
}

response.audio_transcript.done

The server returns the response.audio_transcript.done event when the model finishes generating the text that corresponds to the new audio.

Field	Type	Description
type	string	The value is always response.audio_transcript.done.
response_id	string	The ID of the response. You can use this ID to associate all outputs from the same response.
item_id	string	The ID of the message item. You can use this ID to associate events that belong to the same message item.
output_index	integer	The index of the output item in the response. The value is always 0.
content_index	integer	The index of the content part of the output item in the response. The value is always 0.
transcript	string	The final generated text.

{
    "event_id": "event_VN4Q4GJugLcc1S23viW8E",
    "type": "response.audio_transcript.done",
    "response_id": "resp_P79OOMs8LnrXVpiIHUCKR",
    "item_id": "item_JvJauNH2CTXb1D9WV6pD4",
    "output_index": 0,
    "content_index": 0,
    "transcript": "Hello, I am a large language model developed by Alibaba Cloud. My name is Qwen. How can I help you?"
}

response.output_item.added

The server sends this event when a new output item is added.

Field	Type	Description
type	string	The value is fixed at `response.output_item.added`.
response_id	string	The response ID. You can use this ID to associate all outputs of the same response.
output_index	integer	The index of the output item in the response. The value is fixed at 0.
item	object	Information about the output item.
item.id	string	The unique ID of the output item.
item.object	string	The value is always `realtime.item`.
item.status	string	The status of the output item.
item.role	string	The role of the message sender.
item.content	array	The content of the message.

{
    "event_id": "event_B4O5yPt3Gjnjy5eYH3plG",
    "type": "response.output_item.added",
    "response_id": "resp_P79OOMs8LnrXVpiIHUCKR",
    "output_index": 0,
    "item": {
        "id": "item_OFaPGtzfWCPyGzxnuEX9i",
        "object": "realtime.item",
        "type": "message",
        "status": "in_progress",
        "role": "assistant",
        "content": []
    }
}

response.output_item.done

The server returns this event when the output for the new item is complete.

Field	Type	Description
type	string	The value is fixed at `response.output_item.done`.
response_id	string	The ID of the response.
output_index	integer	The index of the output item in the response. The value is fixed at 0.
item	object	Information about the output item.
item.id	string	The unique ID of the output item.
item.object	string	The value is always `realtime.item`.
item.status	string	The status of the output item.
item.role	string	The role of the message sender.
item.content	array	The content of the message.

{
    "event_id": "event_XkiwbYTBC9Wcdwy6uYJ2G",
    "type": "response.output_item.done",
    "response_id": "resp_P79OOMs8LnrXVpiIHUCKR",
    "output_index": 0,
    "item": {
        "id": "item_JvJauNH2CTXb1D9WV6pD4",
        "object": "realtime.item",
        "type": "message",
        "status": "completed",
        "role": "assistant",
        "content": [
            {
                "type": "audio",
                "text": "Hello, I am a large language model developed by Alibaba Cloud. My name is Qwen. How can I help you?"
            }
        ]
    }
}

response.content_part.added

The server sends this event when a new content part is added.

Field	Type	Description
type	string	The value is always `response.content_part.added`.
response_id	string	The ID of the response.
item_id	string	The message item ID.
output_index	integer	The index of the response output item. The value is always 0.
content_index	integer	The index of the content part in the response output item. The value is always 0.
part	object	The content part.
part.type	string	The type of the content part.
part.text	string	The text of the content part.

{
    "event_id": "event_J2UixwYKZsXg7c9YXZetL",
    "type": "response.content_part.added",
    "response_id": "resp_P79OOMs8LnrXVpiIHUCKR",
    "item_id": "item_OFaPGtzfWCPyGzxnuEX9i",
    "output_index": 0,
    "content_index": 0,
    "part": {
        "type": "audio",
        "text": ""
    }
}

response.content_part.done

The server returns this event when the output for the new content part is complete.

Field	Type	Description
type	string	The value is set to `response.content_part.done`.
response_id	string	The ID of the response.
item_id	string	The message item ID.
output_index	integer	The index of the response output item. The value is set to 0.
content_index	integer	The index of the content part within the response output item. The value is set to 0.
part	object	The completed content part.
part.type	string	The type of the content part.
part.text	string	The text of the content part.

{
    "event_id": "event_FdVUyXIa8WVk4BZJv8swq",
    "type": "response.content_part.done",
    "response_id": "resp_QeZcSlvzRmmjIURRMafY8",
    "item_id": "item_HvJYzNHXC1MnzvgBfIxJD",
    "output_index": 0,
    "content_index": 0,
    "part": {
        "type": "audio",
        "text": "I'm not sure what time it is either. You could check your phone or a clock. If there's anything else you'd like to talk about, just let me know."
    }
}
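
Every server-side event in this topic carries a `type` field, so a client can route incoming messages through a lookup table keyed by that field. The following Python sketch shows this pattern; the names `on`, `dispatch`, and `handlers` are hypothetical, and the WebSocket transport that delivers the raw messages is left out.

```python
import json
from typing import Callable, Dict

# Maps an event type, such as "response.text.delta", to its handler.
handlers: Dict[str, Callable[[dict], None]] = {}


def on(event_type: str):
    """Decorator that registers a handler for one server-side event type."""
    def register(func: Callable[[dict], None]):
        handlers[event_type] = func
        return func
    return register


def dispatch(raw_message: str) -> bool:
    """Parse one server message and call its handler. Return False if none is registered."""
    event = json.loads(raw_message)
    handler = handlers.get(event.get("type"))
    if handler is None:
        return False
    handler(event)
    return True


seen = []


@on("response.text.delta")
def collect_delta(event: dict) -> None:
    seen.append(event["delta"])


@on("error")
def log_error(event: dict) -> None:
    seen.append("error: " + event["error"]["message"])
```

Calling `dispatch` with each raw message received from the connection runs the matching handler. Unregistered event types return False, so a client can ignore them or log them without failing.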