Perform chat completion inference
Added in 8.18.0
The chat completion inference API enables real-time responses for chat completion tasks by delivering answers incrementally, reducing response times during computation.
It only works with the chat_completion
task type for openai
and elastic
inference services.
IMPORTANT: The inference APIs enable you to use certain services, such as built-in machine learning models (ELSER, E5), models uploaded through Eland, Cohere, OpenAI, Azure, Google AI Studio, Google Vertex AI, Anthropic, Watsonx.ai, or Hugging Face. For built-in models and models uploaded through Eland, the inference APIs offer an alternative way to use and manage trained models. However, if you do not plan to use the inference APIs to use these models or if you want to use non-NLP models, use the machine learning trained model APIs.
NOTE: The chat_completion
task type is only available within the _stream API and only supports streaming.
The Chat completion inference API and the Stream inference API differ in their response structure and capabilities.
The Chat completion inference API provides more comprehensive customization options through more fields and function calling support.
If you use the openai
service or the elastic
service, use the Chat completion inference API.
Path parameters
-
inference_id
string Required The inference Id
Query parameters
-
timeout
string Specifies the amount of time to wait for the inference request to complete.
Body
Required
-
messages
array[object] Required A list of objects representing the conversation. Requests should generally only add new messages from the user (role
user
). The other message roles (assistant
,system
, ortool
) should generally only be copied from the response to a previous completion request, such that the messages array is built up throughout a conversation. -
model
string The ID of the model to use.
-
max_completion_tokens
number The upper bound limit for the number of tokens that can be generated for a completion request.
-
stop
array[string] A sequence of strings to control when the model should stop generating additional tokens.
-
temperature
number The sampling temperature to use.
tool_choice
string | object -
tools
array[object] A list of tools that the model can call.
-
top_p
number Nucleus sampling, an alternative to sampling with temperature.
curl \
--request POST 'http://api.example.com/_inference/chat_completion/{inference_id}/_stream' \
--header "Authorization: $API_KEY" \
--header "Content-Type: application/json" \
--data '"{\n \"model\": \"gpt-4o\",\n \"messages\": [\n {\n \"role\": \"user\",\n \"content\": \"What is Elastic?\"\n }\n ]\n}"'
{
"model": "gpt-4o",
"messages": [
{
"role": "user",
"content": "What is Elastic?"
}
]
}
{
"messages": [
{
"role": "assistant",
"content": "Let's find out what the weather is",
"tool_calls": [
{
"id": "call_KcAjWtAww20AihPHphUh46Gd",
"type": "function",
"function": {
"name": "get_current_weather",
"arguments": "{\"location\":\"Boston, MA\"}"
}
}
]
},
{
"role": "tool",
"content": "The weather is cold",
"tool_call_id": "call_KcAjWtAww20AihPHphUh46Gd"
}
]
}
{
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What's the price of a scarf?"
}
]
}
],
"tools": [
{
"type": "function",
"function": {
"name": "get_current_price",
"description": "Get the current price of a item",
"parameters": {
"type": "object",
"properties": {
"item": {
"id": "123"
}
}
}
}
}
],
"tool_choice": {
"type": "function",
"function": {
"name": "get_current_price"
}
}
}
event: message
data: {"chat_completion":{"id":"chatcmpl-Ae0TWsy2VPnSfBbv5UztnSdYUMFP3","choices":[{"delta":{"content":"","role":"assistant"},"index":0}],"model":"gpt-4o-2024-08-06","object":"chat.completion.chunk"}}
event: message
data: {"chat_completion":{"id":"chatcmpl-Ae0TWsy2VPnSfBbv5UztnSdYUMFP3","choices":[{"delta":{"content":Elastic"},"index":0}],"model":"gpt-4o-2024-08-06","object":"chat.completion.chunk"}}
event: message
data: {"chat_completion":{"id":"chatcmpl-Ae0TWsy2VPnSfBbv5UztnSdYUMFP3","choices":[{"delta":{"content":" is"},"index":0}],"model":"gpt-4o-2024-08-06","object":"chat.completion.chunk"}}
(...)
event: message
data: {"chat_completion":{"id":"chatcmpl-Ae0TWsy2VPnSfBbv5UztnSdYUMFP3","choices":[],"model":"gpt-4o-2024-08-06","object":"chat.completion.chunk","usage":{"completion_tokens":28,"prompt_tokens":16,"total_tokens":44}}}
event: message
data: [DONE]