
OpenAI-Compatible Edge-TTS API 🗣️


This project provides a local, OpenAI-compatible text-to-speech (TTS) API using edge-tts. It emulates the OpenAI TTS endpoint (/v1/audio/speech), enabling users to generate speech from text with various voice options and playback speeds, just like the OpenAI API.

edge-tts uses Microsoft Edge's online text-to-speech service, so it is completely free.

View this project on Docker Hub

Please ⭐️ star this repo if you find it helpful

Features

  • OpenAI-Compatible Endpoint: /v1/audio/speech with similar request structure and behavior.
  • SSE Streaming Support: Real-time audio streaming via Server-Sent Events when stream_format: "sse" is specified.
  • Raw Audio Streaming: Low-latency progressive playback with stream_format: "audio_stream".
  • Supported Voices: Maps OpenAI voices (alloy, echo, fable, onyx, nova, shimmer) to edge-tts equivalents.
  • Flexible Formats: Supports multiple audio formats (mp3, opus, aac, flac, wav, pcm).
  • Adjustable Speed: Option to modify playback speed (0.25x to 4.0x).
  • Optional Direct Edge-TTS Voice Selection: Use either OpenAI voice mappings or specify any edge-tts voice directly.

⚡️ Quick start

The simplest way to get started, without having to configure anything, is to run the command below:

docker run -d -p 5050:5050 travisvn/openai-edge-tts:latest

This will run the service on port 5050 with all of the default settings.

(Docker required, obviously)

Setup

Prerequisites

  • Docker (recommended): Docker and Docker Compose for containerized setup.
  • Python (optional): For local development, install dependencies in requirements.txt.
  • ffmpeg (optional): Required for converting to audio formats other than mp3; not needed if you stick to mp3.

Installation

  1. Clone the Repository:
git clone https://github.com/travisvn/openai-edge-tts.git
cd openai-edge-tts
  2. Environment Variables: Create a .env file in the root directory with the following variables:
API_KEY=your_api_key_here
PORT=5050

DEFAULT_VOICE=en-US-AvaNeural
DEFAULT_RESPONSE_FORMAT=mp3
DEFAULT_SPEED=1.0

DEFAULT_LANGUAGE=en-US

REQUIRE_API_KEY=True
REMOVE_FILTER=False
EXPAND_API=True
DETAILED_ERROR_LOGGING=True

Or, copy the default .env.example with the following:

cp .env.example .env
  3. Run with Docker Compose (recommended):
docker compose up --build

Run with -d to run docker compose in "detached mode", meaning it will run in the background and free up your terminal.

docker compose up -d

Building Locally with FFmpeg using Docker Compose

By default, docker compose up --build creates a minimal image without ffmpeg. If you're building locally (after cloning this repository) and need ffmpeg for audio format conversions (beyond MP3), you can include it in the build.

This is controlled by the INSTALL_FFMPEG_ARG build argument. Set this environment variable to true in one of these ways:

  1. Prefixing the command:
    INSTALL_FFMPEG_ARG=true docker compose up --build
  2. Adding to your .env file: Add this line to the .env file in the project root:
    INSTALL_FFMPEG_ARG=true
    Then, run docker compose up --build.
  3. Exporting in your shell environment: Add export INSTALL_FFMPEG_ARG=true to your shell configuration (e.g., ~/.zshrc, ~/.bashrc) and reload your shell. Then docker compose up --build will use it.

This applies to local builds only. For the pre-built Docker Hub images, use the latest-ffmpeg tag instead:

docker run -d -p 5050:5050 -e API_KEY=your_api_key_here -e PORT=5050 travisvn/openai-edge-tts:latest-ffmpeg

Alternatively, run directly with Docker:

docker build -t openai-edge-tts .
docker run -p 5050:5050 --env-file .env openai-edge-tts

To run the container in the background, add the -d flag to the docker run command:

docker run -d -p 5050:5050 --env-file .env openai-edge-tts
  4. Access the API: Your server will be accessible at http://localhost:5050.

Building Multi-Platform Images (ARM64 & AMD64)

To build Docker images for both ARM64 and AMD64 architectures:

Using the build script:

# Build and push multi-platform to registry
./build-multi-platform.sh --push

# Build with FFmpeg support and push
./build-multi-platform.sh --ffmpeg --push

# Build for local use (current platform only)
./build-multi-platform.sh

# See all options
./build-multi-platform.sh --help

Note: Multi-platform builds require pushing to a registry. Local builds without --push will only build for your current platform.

Using Docker Buildx directly:

# Create builder if needed
docker buildx create --name multiarch-builder --driver docker-container --use --bootstrap

# Build and push multi-platform image
docker buildx build --platform linux/amd64,linux/arm64 \
  --build-arg INSTALL_FFMPEG=true \
  -t yourusername/openai-edge-tts:latest \
  --push .

# Verify platforms
docker buildx imagetools inspect yourusername/openai-edge-tts:latest

Running with Python

If you prefer to run this project directly with Python, follow these steps to set up a virtual environment, install dependencies, and start the server.

1. Clone the Repository

git clone https://github.com/travisvn/openai-edge-tts.git
cd openai-edge-tts

2. Set Up a Virtual Environment

Create and activate a virtual environment to isolate dependencies:

# For macOS/Linux
python3 -m venv venv
source venv/bin/activate

# For Windows
python -m venv venv
venv\Scripts\activate

3. Install Dependencies

Use pip to install the required packages listed in requirements.txt:

pip install -r requirements.txt

If you want to run the Python tests locally, install the test-only dependencies:

pip install -r requirements-test.txt

Note: For local transcription (the test suite uses Whisper) and for some audio format conversions, ffmpeg must be installed and available on your PATH. Verify by running:

ffmpeg -version

Note: The Docker image installs only requirements.txt to keep the runtime image minimal; test-only packages are excluded from the image.

4. Configure Environment Variables

Create a .env file in the root directory and set the following variables:

API_KEY=your_api_key_here
PORT=5050

DEFAULT_VOICE=en-US-AvaNeural
DEFAULT_RESPONSE_FORMAT=mp3
DEFAULT_SPEED=1.0

DEFAULT_LANGUAGE=en-US

REQUIRE_API_KEY=True
REMOVE_FILTER=False
EXPAND_API=True
DETAILED_ERROR_LOGGING=True

5. Run the Server

Once configured, start the server with:

python app/server.py

The server will start running at http://localhost:5050.

6. Test the API

You can now interact with the API at http://localhost:5050/v1/audio/speech and other available endpoints. See the Usage section for request examples.

Usage

Endpoint: /v1/audio/speech

Generates audio from the input text. Available parameters:

Required Parameter:

  • input (string): The text to be converted to audio (up to 4096 characters).

Optional Parameters:

  • model (string): Set to "tts-1" or "tts-1-hd" (default: "tts-1").
  • voice (string): One of the OpenAI-compatible voices (alloy, echo, fable, onyx, nova, shimmer) or any valid edge-tts voice (default: "en-US-AvaNeural").
  • response_format (string): Audio format. Options: mp3, opus, aac, flac, wav, pcm (default: mp3).
  • speed (number): Playback speed (0.25 to 4.0). Default is 1.0.
  • stream_format (string): Response format. Options: "audio" (raw audio data, default), "audio_stream" (streaming raw audio with chunked transfer), or "sse" (Server-Sent Events streaming with JSON events).

Note: The API follows OpenAI's TTS API specification. The instructions parameter (for fine-tuning voice characteristics) is not currently supported; all other parameters work identically to OpenAI's implementation.
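
Because this endpoint mirrors OpenAI's, the official openai Python SDK can be pointed straight at it. Here is a minimal sketch, assuming the openai package (v1.x) is installed and the server is running locally:

from openai import OpenAI

# Point the SDK at the local server; the key only has to match the API_KEY env var
client = OpenAI(base_url="http://localhost:5050/v1", api_key="your_api_key_here")

with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="alloy",
    input="Hello from a local, OpenAI-compatible endpoint!",
) as response:
    response.stream_to_file("speech.mp3")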

Standard Audio Generation

Example request with curl and saving the output to an mp3 file:

curl -X POST http://localhost:5050/v1/audio/speech \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your_api_key_here" \
  -d '{
    "input": "Hello, I am your AI assistant! Just let me know how I can help bring your ideas to life.",
    "voice": "echo",
    "response_format": "mp3",
    "speed": 1.1
  }' \
  --output speech.mp3

Direct Audio Playback (like OpenAI)

You can pipe the audio directly to ffplay for immediate playback, just like OpenAI's API:

curl -X POST http://localhost:5050/v1/audio/speech \
  -H "Authorization: Bearer your_api_key_here" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-1",
    "input": "Today is a wonderful day to build something people love!",
    "voice": "alloy",
    "response_format": "mp3"
  }' | ffplay -i -

Or for immediate playback without saving to file:

curl -X POST http://localhost:5050/v1/audio/speech \
  -H "Authorization: Bearer your_api_key_here" \
  -H "Content-Type: application/json" \
  -d '{
    "input": "This will play immediately without saving to disk!",
    "voice": "shimmer"
  }' | ffplay -autoexit -nodisp -i -

Or, to be in line with the OpenAI API endpoint parameters:

curl -X POST http://localhost:5050/v1/audio/speech \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your_api_key_here" \
  -d '{
    "model": "tts-1",
    "input": "Hello, I am your AI assistant! Just let me know how I can help bring your ideas to life.",
    "voice": "alloy"
  }' \
  --output speech.mp3

Raw Audio Streaming (Low-Latency Playback)

For low-latency browser playback with immediate streaming, use audio_stream format:

curl -X POST http://localhost:5050/v1/audio/speech \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your_api_key_here" \
  -d '{
    "input": "This will stream raw audio chunks for immediate playback!",
    "voice": "alloy",
    "stream_format": "audio_stream"
  }' | ffplay -i -

Benefits of audio_stream:

  • Lowest latency for browser playback
  • Starts playing within 1-2 seconds
  • Uses HTTP chunked transfer encoding
  • Raw audio bytes (no base64 encoding overhead)
  • Best for real-time applications
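
The same progressive stream can be consumed outside the browser. Here is a minimal Python sketch, assuming the requests library is installed, that writes chunks to disk as they arrive:

import requests

# Request streamed raw audio and handle each chunk as soon as it arrives
response = requests.post(
    "http://localhost:5050/v1/audio/speech",
    headers={"Authorization": "Bearer your_api_key_here"},
    json={
        "input": "This will stream raw audio chunks for immediate playback!",
        "voice": "alloy",
        "stream_format": "audio_stream",
    },
    stream=True,  # avoid buffering the whole response in memory
)
response.raise_for_status()

with open("speech.mp3", "wb") as f:
    for chunk in response.iter_content(chunk_size=4096):
        if chunk:
            f.write(chunk)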

Server-Sent Events (SSE) Streaming

For applications that need structured streaming events (like web applications), use SSE format:

curl -X POST http://localhost:5050/v1/audio/speech \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your_api_key_here" \
  -d '{
    "model": "tts-1",
    "input": "This will stream as Server-Sent Events with JSON data containing base64-encoded audio chunks.",
    "voice": "alloy",
    "stream_format": "sse"
  }'

SSE Response Format:

data: {"type": "speech.audio.delta", "audio": "base64-encoded-audio-chunk"}

data: {"type": "speech.audio.delta", "audio": "base64-encoded-audio-chunk"}

data: {"type": "speech.audio.done", "usage": {"input_tokens": 12, "output_tokens": 0, "total_tokens": 12}}
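
As a comparison to the browser example below, here is a minimal Python sketch, assuming the requests library is installed, that parses the SSE events and reassembles the base64-encoded audio:

import base64
import json
import requests

response = requests.post(
    "http://localhost:5050/v1/audio/speech",
    headers={"Authorization": "Bearer your_api_key_here"},
    json={"input": "Hello from SSE streaming!", "voice": "alloy", "stream_format": "sse"},
    stream=True,
)
response.raise_for_status()

audio = bytearray()
for line in response.iter_lines(decode_unicode=True):
    if not line or not line.startswith("data: "):
        continue
    event = json.loads(line[len("data: "):])
    if event["type"] == "speech.audio.delta":
        # Each delta carries a base64-encoded audio chunk
        audio.extend(base64.b64decode(event["audio"]))
    elif event["type"] == "speech.audio.done":
        print("Usage:", event.get("usage"))

with open("speech.mp3", "wb") as f:
    f.write(bytes(audio))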

JavaScript/Web Usage

Example using fetch API for raw audio streaming (lowest latency):

async function streamTTSWithAudioStream(text) {
  const response = await fetch('http://localhost:5050/v1/audio/speech', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: 'Bearer your_api_key_here',
    },
    body: JSON.stringify({
      input: text,
      voice: 'alloy',
      stream_format: 'audio_stream',
    }),
  });

  const reader = response.body.getReader();
  const chunks = [];

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    chunks.push(value);
  }

  // Combine and play
  const audioBlob = new Blob(chunks, { type: 'audio/mpeg' });
  const audioUrl = URL.createObjectURL(audioBlob);
  const audio = new Audio(audioUrl);
  audio.play();
}

// Usage
streamTTSWithAudioStream('Hello from raw audio streaming!');

Example using fetch API for SSE streaming:

async function streamTTSWithSSE(text) {
  const response = await fetch('http://localhost:5050/v1/audio/speech', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: 'Bearer your_api_key_here',
    },
    body: JSON.stringify({
      input: text,
      voice: 'alloy',
      stream_format: 'sse',
    }),
  });

  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  const audioChunks = [];
  let buffer = '';

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    // Buffer text so an SSE line split across network chunks still parses
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split('\n');
    buffer = lines.pop(); // keep any incomplete trailing line for the next read

    for (const line of lines) {
      if (line.startsWith('data: ')) {
        const data = JSON.parse(line.slice(6));

        if (data.type === 'speech.audio.delta') {
          // Decode base64 audio chunk
          const audioData = atob(data.audio);
          const audioArray = new Uint8Array(audioData.length);
          for (let i = 0; i < audioData.length; i++) {
            audioArray[i] = audioData.charCodeAt(i);
          }
          audioChunks.push(audioArray);
        } else if (data.type === 'speech.audio.done') {
          console.log('Speech synthesis complete:', data.usage);

          // Combine all chunks and play
          const totalLength = audioChunks.reduce(
            (sum, chunk) => sum + chunk.length,
            0
          );
          const combinedArray = new Uint8Array(totalLength);
          let offset = 0;
          for (const chunk of audioChunks) {
            combinedArray.set(chunk, offset);
            offset += chunk.length;
          }

          const audioBlob = new Blob([combinedArray], { type: 'audio/mpeg' });
          const audioUrl = URL.createObjectURL(audioBlob);
          const audio = new Audio(audioUrl);
          audio.play();
          return;
        }
      }
    }
  }
}

// Usage
streamTTSWithSSE('Hello from SSE streaming!');

International Language Example

Here is an example using a language other than English:

curl -X POST http://localhost:5050/v1/audio/speech \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your_api_key_here" \
  -d '{
    "model": "tts-1",
    "input": "じゃあ、行く。電車の時間、調べておくよ。",
    "voice": "ja-JP-KeitaNeural"
  }' \
  --output speech.mp3

Additional Endpoints

  • POST/GET /v1/models: Lists available TTS models.
  • POST/GET /v1/voices: Lists edge-tts voices for a given language / locale.
  • POST/GET /v1/voices/all: Lists all edge-tts voices, with language support information.
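
A quick Python sketch, assuming the requests library is installed, that queries two of these endpoints (exact response shapes may vary):

import requests

BASE = "http://localhost:5050"
HEADERS = {"Authorization": "Bearer your_api_key_here"}

# List the available TTS models
print(requests.get(f"{BASE}/v1/models", headers=HEADERS).json())

# List every edge-tts voice along with its language support information
voices = requests.get(f"{BASE}/v1/voices/all", headers=HEADERS).json()
print(voices)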

Contributing

Contributions are welcome! Please fork the repository and create a pull request for any improvements.

License

This project is licensed under GNU General Public License v3.0 (GPL-3.0), and the acceptable use-case is intended to be personal use. For enterprise or non-personal use of openai-edge-tts, contact me at [email protected]


Example Use Case

Tip

Swap localhost for your local IP (e.g. 192.168.0.1) if you have issues

If you access this endpoint from a different server or computer, or if the call comes from another application (like Open WebUI), you may need to change the URL from localhost to your machine's local IP (something like 192.168.0.1).

Open WebUI

Open up the Admin Panel and go to Settings -> Audio

Below is a screenshot of the correct configuration for using this project in place of the OpenAI endpoint

Screenshot of Open WebUI Admin Settings for Audio adding the correct endpoints for this project

If you're running both Open WebUI and this project in Docker, the API endpoint URL is probably http://host.docker.internal:5050/v1

Note

View the official docs for Open WebUI integration with OpenAI Edge TTS

AnythingLLM

In version 1.6.8, AnythingLLM added support for "generic OpenAI TTS providers" — meaning we can use this project as the TTS provider in AnythingLLM

Open up settings and go to Voice & Speech (Under AI Providers)

Below is a screenshot of the correct configuration for using this project in place of the OpenAI endpoint

Screenshot of AnythingLLM settings for Voice adding the correct endpoints for this project


Quick Info

  • your_api_key_here never needs to be replaced — No "real" API key is required. Use whichever string you'd like.
  • The quickest way to get this up and running is to install Docker and run the command below:
docker run -d -p 5050:5050 -e API_KEY=your_api_key_here -e PORT=5050 travisvn/openai-edge-tts:latest

Voice Samples 🎙️

Play voice samples and see all available Edge TTS voices


Nginx proxy configuration for HLS (Range/HEAD)

When proxying through Nginx, ensure byte range requests are preserved and advertised for HLS init/segment files. Example:

location ~ /v1/audio/speech/hls/.+\.(m4s|m4a|mp3)$ {
  proxy_pass http://127.0.0.1:5050;
  proxy_http_version 1.1;
  proxy_set_header Connection "";
  proxy_pass_request_headers on;
  proxy_force_ranges on;
  add_header Accept-Ranges bytes always;
}

Playlist responses should always be served as 200 and should not advertise Accept-Ranges; segment and init files may be served as 200 or 206 with proper Content-Range and Accept-Ranges: bytes headers.
