
Minh/s2s context summary #1813


Merged
merged 21 commits into from
May 10, 2025
Changes from 1 commit
Improved explanations
minh-hoque committed May 5, 2025
commit 74976bbc4a6c9978789c02c72ee865267ca26723
166 changes: 105 additions & 61 deletions examples/Context_summarization_with_realtime_api.ipynb
@@ -26,7 +26,7 @@
"|-------|----------------|\n",
"| Capture audio with `sounddevice` | Low‑latency input is critical for natural UX |\n",
"| Use WebSockets with the OpenAI **Realtime** API | Streams beats polling for speed & simplicity |\n",
"| Track token usage and detect “context pressure” | Prevents quality loss in long chats |\n",
"| Track token usage and detect when to summarize context | Prevents quality loss in long chats |\n",
"| Summarise & prune history on‑the‑fly | Keeps conversations coherent without manual resets |\n",
"\n",
"---\n",
@@ -35,14 +35,19 @@
"\n",
"| Requirement | Details |\n",
"|-------------|---------|\n",
"| **Python ≥ 3.10** | `async` / typing improvements |\n",
"| **Python ≥ 3.10** | Will ensure that you don't hit any issues |\n",
"| **OpenAI API key** | Set `OPENAI_API_KEY` in your shell or paste inline (*not ideal for prod*) |\n",
"| Mic + speakers | Grant OS permission if prompted |\n",
"\n",
"\n",
"**Need help setting up the key?** \n",
"> Follow the [official quick‑start guide](https://platform.openai.com/docs/quickstart#step-2-set-your-api-key).\n",
"\n",
"\n",
"*Notes:*\n",
"1. Why 32 k? OpenAI’s public guidance notes that quality begins to decline well before the full 128 k token limit; 32 k is a conservative threshold observed in practice.\n",
"\n",
"2. Token window = all tokens (words and audio tokens) the model currently keeps in memory for the session.x\n",
"---\n",
"\n",
"### 🚀 One‑liner install (run in a fresh cell)"
@@ -69,6 +74,11 @@
"from dataclasses import dataclass, field\n",
"from typing import List, Literal\n",
"\n",
"import asyncio, base64, io, json, os, sys, wave, pathlib\n",
"from typing import List\n",
"\n",
"import numpy as np, soundfile as sf, resampy, websockets, openai\n",
"\n",
"import sounddevice as sd # microphone capture\n",
"import simpleaudio # speaker playback\n",
"import websockets # WebSocket client\n",
@@ -132,18 +142,18 @@
"\n",
"---\n",
"\n",
"### 2.3 Context Windows & Token Economics\n",
"### 2.3 Token Context Windows\n",
"\n",
"* GPT‑4o Realtime accepts **up to 128 K tokens** in theory. \n",
"* In practice, answer quality starts to drift around **≈ 32 K tokens**. \n",
"* Every user/assistant turn consumes tokens → the window **only grows**.\n",
"* **Fix**: Summarise older turns into a single assistant message, keep the last few verbatim turns, and continue.\n",
"* **Strategy**: Summarise older turns into a single assistant message, keep the last few verbatim turns, and continue.\n",
"\n",
"---\n",
"\n",
"### 2.4 Conversation State\n",
"\n",
"Instead of scattered globals, the notebook travels with one **state object**:"
"Instead of scattered globals, the notebook uses with one **state object**:"
]
},
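The state-object cell is collapsed in this diff; a minimal sketch of what it might contain is below. Field names are illustrative, chosen to line up with the `state.history` / `state.latest_tokens` references in section 3.3.

```python
from dataclasses import dataclass, field
from typing import List, Literal

@dataclass
class Turn:
    role: Literal["user", "assistant"]
    text: str
    item_id: str = ""        # Realtime conversation item id, used when pruning

@dataclass
class ConversationState:
    history: List[Turn] = field(default_factory=list)  # verbatim turns kept so far
    summary: str = ""                                   # rolling summary of pruned turns
    latest_tokens: int = 0                              # usage from the last response.done
```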
{
@@ -174,7 +184,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is a helper to quickly display current conversation history between the user and assistant."
"A quick helper to peek at the transcript:"
]
},
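That helper cell is collapsed here as well; a possible minimal version, assuming the `ConversationState` sketched above:

```python
def print_history(state: ConversationState) -> None:
    """Print the conversation so far, one line per turn."""
    print("-" * 60)
    for turn in state.history:
        print(f"{turn.role:>9}: {turn.text}")
    print("-" * 60)
```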
{
@@ -208,48 +218,37 @@
"\n",
"### 3.1 Hands‑on comparison 📊\n",
"\n",
"The cell below:\n",
"The cells below:\n",
"\n",
"1. **Sends `TEXT` to Chat Completions** → reads `prompt_tokens`. \n",
"2. **Turns the same `TEXT` into speech** with TTS. \n",
"3. **Feeds the speech back into the Transcription endpoint** → reads `total_tokens`. \n",
"4. Prints a ratio so you can see the multiplier on *your* hardware / account.\n",
"\n",
"**Note** \n",
"> * Transcription counts *audio tokens*, not text tokens. \n",
"> * Network latency depends on audio length. Keep the sample short for a snappy demo."
"3. **Feeds the speech back into the Realtime API Transcription endpoint** → reads `audio input tokens`. \n",
"4. Prints a ratio so you can see the multiplier on *your* hardware / account."
]
},
{
"cell_type": "code",
"execution_count": 16,
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"📄 Text prompt tokens : 42\n",
"🔊 Audio length (s) : 10.90\n"
"🔊 Audio length (s) : 10.75\n"
]
}
],
"source": [
"# ╔══════════════════════════════════════════════════════════════════╗\n",
"# ║ 3 · Token Utilisation – Text vs Voice (Realtime S2S demo) ║\n",
"# ║ Requires: pip install openai websockets soundfile resampy ║\n",
"# ║ 3 · Token Utilisation – Text vs Voice ║\n",
"# ╚══════════════════════════════════════════════════════════════════╝\n",
"\n",
"import asyncio, base64, io, json, os, sys, wave, pathlib\n",
"from typing import List\n",
"\n",
"import numpy as np, soundfile as sf, resampy, websockets, openai\n",
"\n",
"# ─── Config ──────────────────────────────────────────────────────────────\n",
"TEXT = (\n",
" \"Hello there, I am measuring tokens for text versus voice because we want to better compare the number of tokens used when sending a message as text versus when converting it to speech..\"\n",
")\n",
"CHAT_MODEL = \"gpt-4o-mini\"\n",
"STT_MODEL = \"gpt-4o-transcribe\"\n",
"TTS_MODEL = \"gpt-4o-mini-tts\"\n",
"RT_MODEL = \"gpt-4o-realtime-preview\" # S2S model\n",
"VOICE = \"shimmer\"\n",
@@ -258,8 +257,6 @@
"PCM_SCALE = 32_767\n",
"CHUNK_MS = 120 # stream step\n",
"\n",
"if not os.getenv(\"OPENAI_API_KEY\"):\n",
" sys.exit(\"❌ Please export OPENAI_API_KEY before running.\")\n",
"\n",
"HEADERS = {\n",
" \"Authorization\": f\"Bearer {openai.api_key}\",\n",
@@ -294,21 +291,20 @@
"with wave.open(io.BytesIO(wav_bytes)) as w:\n",
" pcm_bytes = w.readframes(w.getnframes())\n",
"duration_sec = len(pcm_bytes) / (2 * TARGET_SR)\n",
"show(\"🔊 Audio length (s)\", f\"{duration_sec:.2f}\")\n"
"show(\"🔊 Audio length (s)\", f\"{duration_sec:.2f}\")"
]
},
{
"cell_type": "code",
"execution_count": 19,
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'type': 'response.done', 'event_id': 'event_BTt4GCm5hrzW0tyejZTFI', 'response': {'object': 'realtime.response', 'id': 'resp_BTt4A0rNrAbJjcPWAAwzY', 'status': 'completed', 'status_details': None, 'output': [{'id': 'item_BTt4As3XS4ZhxsRHhMGzQ', 'object': 'realtime.item', 'type': 'message', 'status': 'completed', 'role': 'assistant', 'content': [{'type': 'audio', 'transcript': \"Sure thing! When you're measuring tokens, keep in mind that text and voice can indeed differ in token usage. For text, tokens are counted based on words, punctuation, and formatting, while for voice, the conversion to speech may lead to more tokens due to the nuances of pronunciation and emphasis. Comparing them will help you optimize your communication strategy. How can I assist you further with this?\"}]}], 'conversation_id': 'conv_BTt48QSC6gI4XX9ZlDRVU', 'modalities': ['text', 'audio'], 'voice': 'shimmer', 'output_audio_format': 'pcm16', 'temperature': 0.8, 'max_output_tokens': 'inf', 'usage': {'total_tokens': 849, 'input_tokens': 227, 'output_tokens': 622, 'input_token_details': {'text_tokens': 119, 'audio_tokens': 108, 'cached_tokens': 64, 'cached_tokens_details': {'text_tokens': 64, 'audio_tokens': 0}}, 'output_token_details': {'text_tokens': 110, 'audio_tokens': 512}}, 'metadata': None}}\n",
"🎤 Audio input tokens : 108\n",
"⚖️ Audio/Text ratio : 2.6×\n",
"🎤 Audio input tokens : 105\n",
"⚖️ Audio/Text ratio : 2.5×\n",
"\n",
"≈9 audio‑tokens / sec vs ≈1 token / word.\n"
]
@@ -335,7 +331,7 @@
" \"voice\": VOICE,\n",
" \"input_audio_format\": \"pcm16\",\n",
" \"output_audio_format\": \"pcm16\",\n",
" \"input_audio_transcription\": {\"model\": \"gpt-4o-transcribe\"},\n",
" \"input_audio_transcription\": {\"model\": STT_MODEL},\n",
" }\n",
" }))\n",
"\n",
@@ -351,7 +347,6 @@
" t = ev.get(\"type\")\n",
"\n",
" if t == \"response.done\":\n",
" print(ev)\n",
" return ev[\"response\"][\"usage\"]\\\n",
" [\"input_token_details\"][\"audio_tokens\"]\n",
"\n",
@@ -368,7 +363,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"This small example is using a short toy example. When you have large transcripts, the discrepancy text tokens versus voice tokens increases."
"This toy example uses a short input, but as transcripts get longer, the difference between text token count and voice token count grows substantially."
]
},
{
@@ -378,49 +373,81 @@
"\n",
"---\n",
"\n",
"## 3 · Streaming Audio: Mic → WebSocket → OpenAI\n",
"## 3 · Streaming Audio\n",
"We’ll stream raw PCM‑16 microphone data straight into the Realtime API.\n",
"\n",
"The pipeline is: mic ─► async.Queue ─► WebSocket ─► Realtime API\n",
"\n",
"### 3.1 Capture Microphone Input"
"### 3.1 Capture Microphone Input\n",
"We’ll start with a coroutine that:\n",
"\n",
"* Opens the default mic at **24 kHz, mono, PCM‑16** (one of the [format](https://platform.openai.com/docs/api-reference/realtime-sessions/create#realtime-sessions-create-input_audio_format) Realtime accepts). \n",
"* Slices the stream into **≈ 40 ms** blocks. \n",
"* Dumps each block into an `asyncio.Queue` so another task (next section) can forward it to OpenAI.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"async def mic_to_queue(pcm_queue: asyncio.Queue[bytes]):\n",
" \"\"\"Capture raw PCM‑16 audio and push 40 ms chunks to an async queue.\"\"\"\n",
"# ╔══════════════════════════════════════════════════════════════════╗\n",
"# ║ 3.1 · Microphone → async.Queue ║\n",
"# ╚══════════════════════════════════════════════════════════════════╝\n",
"\n",
"import asyncio, sys\n",
"import sounddevice as sd\n",
"\n",
"# ── Audio constants (match Realtime requirements) ──────────────────\n",
"SAMPLE_RATE_HZ = 24_000 # 24‑kHz mono\n",
"CHUNK_DURATION_MS = 40 # ≈40‑ms frames\n",
"QUEUE_MAXSIZE = 32 # Back‑pressure buffer\n",
"\n",
"async def mic_to_queue(pcm_queue: asyncio.Queue[bytes]) -> None:\n",
" \"\"\"\n",
" Capture raw PCM‑16 microphone audio and push ~CHUNK_DURATION_MS chunks\n",
" to *pcm_queue* until the surrounding task is cancelled.\n",
"\n",
" Parameters\n",
" ----------\n",
" pcm_queue : asyncio.Queue[bytes]\n",
" Destination queue for PCM‑16 frames (little‑endian int16).\n",
" \"\"\"\n",
" blocksize = int(SAMPLE_RATE_HZ * CHUNK_DURATION_MS / 1000)\n",
"\n",
" def _callback(indata, _frames, _time, status):\n",
" if status:\n",
" print(status, file=sys.stderr)\n",
" if status: # XRuns, device changes, etc.\n",
" print(\"⚠️\", status, file=sys.stderr)\n",
" try:\n",
" pcm_queue.put_nowait(bytes(indata))\n",
" pcm_queue.put_nowait(bytes(indata)) # 1‑shot enqueue\n",
" except asyncio.QueueFull:\n",
" pass # drop if network slower than mic\n",
" # Drop frame if upstream (WebSocket) can’t keep up.\n",
" pass\n",
"\n",
" # RawInputStream is synchronous; wrap in context manager to auto‑close.\n",
" with sd.RawInputStream(\n",
" samplerate=SAMPLE_RATE_HZ,\n",
" blocksize=blocksize,\n",
" dtype=\"int16\",\n",
" channels=1,\n",
" callback=_callback,\n",
" ):\n",
" await asyncio.Event().wait() # run until cancelled"
" try:\n",
" # Keep coroutine alive until cancelled by caller.\n",
" await asyncio.Event().wait()\n",
" finally:\n",
" print(\"⏹️ Mic stream closed.\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3.2 Send Audio Chunks to the API"
"### 3.2 Send Audio Chunks to the API\n",
"\n",
"Our mic task is now filling an `asyncio.Queue` with raw PCM‑16 blocks. \n",
"Next step: pull chunks off that queue, **base‑64 encode** them (the protocol requires JSON‑safe text), and ship each block to the Realtime WebSocket as an `input_audio_buffer.append` event.\n"
]
},
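The sender cell is collapsed in this diff; a minimal sketch of the forwarding coroutine follows. The function name and the `None` sentinel are assumptions; the event shape follows the Realtime API reference.

```python
import asyncio, base64, json

async def queue_to_websocket(pcm_queue: asyncio.Queue[bytes], ws) -> None:
    """Pull PCM-16 chunks off the queue and forward them to the Realtime API."""
    while True:
        chunk = await pcm_queue.get()
        if chunk is None:              # optional sentinel → stop streaming
            break
        await ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        }))
```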
{
@@ -448,16 +475,21 @@
"metadata": {},
"source": [
"### 3.3 Handle Incoming Events \n",
"A conversation round‑trip produces these notable events:\n",
"\n",
"| Event Type | When It Arrives | Purpose |\n",
"|-----------------------------------------|----------------------------|----------------------------------------------------------|\n",
"| session.created | upon connection | confirms handshake & auth |\n",
"| conversation.item.created (user) | after user finishes speaking | placeholder user item; may lack transcript initially |\n",
"| conversation.item.retrieved | when transcript ready | fills missing user text |\n",
"| response.audio.delta | streaming | assistant audio bytes (incremental) |\n",
"| response.done | assistant finished | final assistant text & usage metrics |\n",
"| conversation.item.deleted | after pruning | acknowledges deletion |"
"Once audio reaches the server, the Realtime API pushes a stream of JSON events back over the **same** WebSocket. \n",
"Understanding these events is critical for:\n",
"\n",
"* Printing live transcripts \n",
"* Playing incremental audio back to the user \n",
"* Keeping an accurate `ConversationState` so context trimming works later \n",
"\n",
"| Event type | Typical timing | What you should do with it |\n",
"|------------|----------------|----------------------------|\n",
"| **`session.created`** | Immediately after connection | Verify the handshake; stash the `session_id` if you need it for server logs. |\n",
"| **`conversation.item.created`** (user) | Right after the user stops talking | Place a *placeholder* `Turn` in `state.history`. Transcript may still be `null`. |\n",
"| **`conversation.item.retrieved`** | A few hundred ms later | Fill in any missing user transcript once STT completes. |\n",
"| **`response.audio.delta`** | Streaming chunks while the assistant speaks | Append bytes to a local buffer, play them (low‑latency) as they arrive. |\n",
"| **`response.done`** | After final assistant token | Add assistant text + usage stats, update `state.latest_tokens`. |\n",
"| **`conversation.item.deleted`** | Whenever you prune old turns | Remove superseded items from `conversation.item`. |\n"
]
},
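To tie the table together, here is a rough sketch of a receive loop that dispatches on the event `type`; the real handler lives in the collapsed cell below, and the placeholder branches only hint at what each event should trigger.

```python
import json

async def handle_events(ws, state) -> None:
    """Dispatch incoming Realtime events roughly as the table above describes."""
    async for message in ws:
        ev = json.loads(message)
        match ev.get("type"):
            case "session.created":
                print("session id:", ev["session"]["id"])
            case "conversation.item.created":
                ...  # append a placeholder user turn to state.history
            case "conversation.item.retrieved":
                ...  # fill in the user transcript once STT completes
            case "response.audio.delta":
                ...  # decode ev["delta"] (base64 PCM-16) and play it back
            case "response.done":
                state.latest_tokens = ev["response"]["usage"]["total_tokens"]
            case "conversation.item.deleted":
                ...  # drop the pruned item from local history
```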
{
@@ -847,7 +879,19 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Next Steps & Further Reading"
"## Next Steps & Further Reading\n",
"Try out the notebook and try integrating context summary into your application.\n",
"\n",
"Few things you can try:\n",
"- Evaluate if context summary helps with your eval and use case\n",
"- Try various methods of summarizing\n",
"- ect\n",
"\n",
"Resources:\n",
"- https://platform.openai.com/docs/guides/realtime \n",
"- https://platform.openai.com/docs/guides/realtime-conversations\n",
"- https://platform.openai.com/docs/api-reference/realtime\n",
"- https://voiceaiandvoiceagents.com/"
]
},
{