Supertonic — Lightning Fast, On-Device TTS

Supertonic is a lightning-fast, on-device text-to-speech system designed for extreme performance with minimal computational overhead. Powered by ONNX Runtime, it runs entirely on your device—no cloud, no API calls, no privacy concerns.

Demo

Raspberry Pi

Watch Supertonic running on a Raspberry Pi, demonstrating on-device, real-time text-to-speech synthesis:

supertonic_raspberry-pi_480.mov

E-Reader

Experience Supertonic on an Onyx Boox Go 6 e-reader in airplane mode, achieving an average RTF of 0.3× with zero network dependency:

supertonic_ebook.mp4

🎧 Try it now: Experience Supertonic in your browser with our Interactive Demo, or get started with pre-trained models from Hugging Face Hub

📰 Update News

2025.11.24 - Added Flutter SDK support with macOS compatibility

Why Supertonic?

  • ⚡ Blazingly Fast: Generates speech up to 167× faster than real-time on consumer hardware (M4 Pro), ahead of every other system benchmarked in the Performance section
  • 🪶 Ultra Lightweight: Only 66M parameters, optimized for efficient on-device performance with minimal footprint
  • 📱 On-Device Capable: Complete privacy and no network latency, since all processing happens locally on your device
  • 🎨 Natural Text Handling: Seamlessly processes numbers, dates, currency, abbreviations, and complex expressions without pre-processing
  • ⚙️ Highly Configurable: Adjust inference steps, batch processing, and other parameters to match your specific needs
  • 🧩 Flexible Deployment: Deploy seamlessly across servers, browsers, and edge devices with multiple runtime backends
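
The "167× faster than real-time" figure can be related to the real-time factor (RTF) used in the Performance section: a speedup of N× corresponds to an RTF of 1/N. The sketch below shows that conversion. It assumes the headline number corresponds to the best M4 Pro result reported in this README (RTF 0.006, WebGPU, long text); the function name is illustrative and not part of Supertonic.

```python
def speedup_from_rtf(rtf: float) -> float:
    """Convert a real-time factor into a 'times faster than real-time' multiplier."""
    if rtf <= 0:
        raise ValueError("RTF must be positive")
    return 1.0 / rtf


# 0.006 is the M4 Pro WebGPU RTF for long text listed in this README.
print(round(speedup_from_rtf(0.006)))  # 167
```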

Language Support

We provide ready-to-use TTS inference examples across multiple ecosystems:

| Language/Platform | Path | Description |
| --- | --- | --- |
| Python | py/ | ONNX Runtime inference |
| Node.js | nodejs/ | Server-side JavaScript |
| Browser | web/ | WebGPU/WASM inference |
| Java | java/ | Cross-platform JVM |
| C++ | cpp/ | High-performance C++ |
| C# | csharp/ | .NET ecosystem |
| Go | go/ | Go implementation |
| Swift | swift/ | macOS applications |
| iOS | ios/ | Native iOS apps |
| Rust | rust/ | Memory-safe systems |
| Flutter | flutter/ | Cross-platform apps |

For detailed usage instructions, please refer to the README.md in each language directory.

Getting Started

First, clone the repository:

git clone https://github.com/supertone-inc/supertonic.git
cd supertonic

Prerequisites

Before running the examples, download the ONNX models and preset voices, and place them in the assets directory:

Note: The Hugging Face repository uses Git LFS. Please ensure Git LFS is installed and initialized before cloning or pulling large model files.

  • macOS: brew install git-lfs && git lfs install
  • Generic: see https://git-lfs.com for installers

git clone https://huggingface.co/Supertone/supertonic assets
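
If Git LFS was not initialized before cloning, the large files in the assets directory are left as small text pointer stubs rather than real model weights. The following is a minimal sketch of a check for that situation. It relies only on the fact that every Git LFS pointer file starts with the line `version https://git-lfs.github.com/spec/v1`. The function name and the default `assets` path are assumptions for illustration; this helper is not part of the repository.

```python
from pathlib import Path

# First line of every Git LFS pointer file, per the Git LFS specification.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def find_lfs_pointers(assets_dir: str = "assets") -> list[Path]:
    """Return files under assets_dir that are still Git LFS pointer stubs."""
    root = Path(assets_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"{root} not found; clone the model repository first")
    stubs = []
    for path in sorted(root.rglob("*")):
        # Skip directories and Git's own metadata.
        if not path.is_file() or ".git" in path.relative_to(root).parts:
            continue
        with path.open("rb") as f:
            if f.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX:
                stubs.append(path)
    return stubs
```

An empty list from `find_lfs_pointers("assets")` suggests the large files were downloaded properly. Any paths it returns can usually be repaired by running `git lfs pull` inside the assets directory.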

Quick Start

Python Example (Details)

cd py
uv sync
uv run example_onnx.py

Node.js Example (Details)

cd nodejs
npm install
npm start

Browser Example (Details)

cd web
npm install
npm run dev

Java Example (Details)

cd java
mvn clean install
mvn exec:java

C++ Example (Details)

cd cpp
mkdir build && cd build
cmake .. && cmake --build . --config Release
./example_onnx

C# Example (Details)

cd csharp
dotnet restore
dotnet run

Go Example (Details)

cd go
go mod download
go run example_onnx.go helper.go

Swift Example (Details)

cd swift
swift build -c release
.build/release/example_onnx

Rust Example (Details)

cd rust
cargo build --release
./target/release/example_onnx

iOS Example (Details)

cd ios/ExampleiOSApp
xcodegen generate
open ExampleiOSApp.xcodeproj
  • In Xcode: Targets → ExampleiOSApp → Signing: select your Team
  • Choose your iPhone as run destination → Build & Run

Technical Details

  • Runtime: ONNX Runtime for cross-platform inference (CPU-optimized; GPU mode is not tested)
  • Browser Support: onnxruntime-web for client-side inference
  • Batch Processing: Supports batch inference for improved throughput
  • Audio Output: Outputs 16-bit WAV files
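
As an illustration of the 16-bit WAV output format, here is a minimal sketch that writes floating-point samples to a 16-bit PCM file using only the Python standard library. It is not the repository's own writer. The function name, the mono layout, and the caller-supplied sample rate are assumptions, since the README does not state the model's output sample rate.

```python
import struct
import wave


def write_wav_16bit(path: str, samples: list[float], sample_rate: int) -> None:
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file."""
    frames = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, s))  # clip so the value fits in a signed 16-bit integer
        frames += struct.pack("<h", int(round(s * 32767)))  # little-endian int16
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # mono (assumption)
        wav.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))
```

Scaling by 32767 rather than 32768 keeps the positive and negative extremes symmetric, which is a common convention when converting float audio to 16-bit integers.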

Performance

We evaluated Supertonic's performance (with 2 inference steps) using two key metrics across input texts of varying lengths: Short (59 chars), Mid (152 chars), and Long (266 chars).

Metrics:

  • Characters per Second: Measures throughput by dividing the number of input characters by the time required to generate audio. Higher is better.
  • Real-time Factor (RTF): Measures the time taken to synthesize audio relative to its duration. Lower is better (e.g., RTF of 0.1 means it takes 0.1 seconds to generate one second of audio).
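
The two metrics above can be computed from a single timed synthesis call. The sketch below is one way to do that, not the benchmark harness used for the published numbers. `synthesize` stands in for any function that maps text to a sequence of audio samples, and counting characters with `len(text)` is an assumption about how input length was measured.

```python
import time
from typing import Callable, Sequence


def chars_per_second(num_chars: int, elapsed_s: float) -> float:
    """Throughput: input characters divided by synthesis time. Higher is better."""
    return num_chars / elapsed_s


def real_time_factor(elapsed_s: float, audio_s: float) -> float:
    """Synthesis time divided by the duration of the audio produced. Lower is better."""
    return elapsed_s / audio_s


def benchmark(synthesize: Callable[[str], Sequence[float]], text: str, sample_rate: int) -> dict:
    """Time one call to a caller-supplied synthesize function and report both metrics."""
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    return {
        "chars_per_second": chars_per_second(len(text), elapsed),
        "rtf": real_time_factor(elapsed, len(samples) / sample_rate),
    }


# The example from the text: 0.1 s of compute per 1 s of audio gives an RTF of 0.1.
print(real_time_factor(0.1, 1.0))  # 0.1
```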

Characters per Second

| System | Short (59 chars) | Mid (152 chars) | Long (266 chars) |
| --- | --- | --- | --- |
| Supertonic (M4 Pro - CPU) | 912 | 1048 | 1263 |
| Supertonic (M4 Pro - WebGPU) | 996 | 1801 | 2509 |
| Supertonic (RTX4090) | 2615 | 6548 | 12164 |
| API ElevenLabs Flash v2.5 | 144 | 209 | 287 |
| API OpenAI TTS-1 | 37 | 55 | 82 |
| API Gemini 2.5 Flash TTS | 12 | 18 | 24 |
| API Supertone Sona speech 1 | 38 | 64 | 92 |
| Open Kokoro | 104 | 107 | 117 |
| Open NeuTTS Air | 37 | 42 | 47 |

Notes:

  • API = Cloud-based API services (measured from Seoul)
  • Open = Open-source models
  • Supertonic (M4 Pro - CPU) and (M4 Pro - WebGPU): Tested with ONNX
  • Supertonic (RTX4090): Tested with PyTorch model
  • Kokoro: Tested on M4 Pro CPU with ONNX
  • NeuTTS Air: Tested on M4 Pro CPU with Q8-GGUF

Real-time Factor

| System | Short (59 chars) | Mid (152 chars) | Long (266 chars) |
| --- | --- | --- | --- |
| Supertonic (M4 Pro - CPU) | 0.015 | 0.013 | 0.012 |
| Supertonic (M4 Pro - WebGPU) | 0.014 | 0.007 | 0.006 |
| Supertonic (RTX4090) | 0.005 | 0.002 | 0.001 |
| API ElevenLabs Flash v2.5 | 0.133 | 0.077 | 0.057 |
| API OpenAI TTS-1 | 0.471 | 0.302 | 0.201 |
| API Gemini 2.5 Flash TTS | 1.060 | 0.673 | 0.541 |
| API Supertone Sona speech 1 | 0.372 | 0.206 | 0.163 |
| Open Kokoro | 0.144 | 0.124 | 0.126 |
| Open NeuTTS Air | 0.390 | 0.338 | 0.343 |

Additional Performance Data (5-step inference)

Characters per Second (5-step)

| System | Short (59 chars) | Mid (152 chars) | Long (266 chars) |
| --- | --- | --- | --- |
| Supertonic (M4 Pro - CPU) | 596 | 691 | 850 |
| Supertonic (M4 Pro - WebGPU) | 570 | 1118 | 1546 |
| Supertonic (RTX4090) | 1286 | 3757 | 6242 |

Real-time Factor (5-step)

| System | Short (59 chars) | Mid (152 chars) | Long (266 chars) |
| --- | --- | --- | --- |
| Supertonic (M4 Pro - CPU) | 0.023 | 0.019 | 0.018 |
| Supertonic (M4 Pro - WebGPU) | 0.024 | 0.012 | 0.010 |
| Supertonic (RTX4090) | 0.011 | 0.004 | 0.002 |
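
The 2-step and 5-step results can be compared directly to see what the extra inference steps cost in throughput. The short sketch below uses the M4 Pro CPU characters-per-second figures reported in this README; the dictionary layout and function name are illustrative only.

```python
# Characters per second on M4 Pro CPU, copied from the tables in this README.
CPS_2_STEP = {"short": 912, "mid": 1048, "long": 1263}
CPS_5_STEP = {"short": 596, "mid": 691, "long": 850}


def throughput_ratio(two_step: dict, five_step: dict) -> dict:
    """Ratio of 2-step to 5-step throughput for each text length."""
    return {k: round(two_step[k] / five_step[k], 2) for k in two_step}


print(throughput_ratio(CPS_2_STEP, CPS_5_STEP))
# {'short': 1.53, 'mid': 1.52, 'long': 1.49}
```

On this hardware, running 2.5 times as many inference steps reduces throughput by roughly a factor of 1.5 rather than 2.5. That suggests a sizeable share of the runtime does not scale with the step count.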

Natural Text Handling

Supertonic is designed to handle complex, real-world text inputs that contain numbers, currency symbols, abbreviations, dates, and proper nouns.

🎧 Listen to the audio samples: Our Interactive Demo offers an easier way to browse and play all of the audio examples

Overview of Test Cases:

| Category | Key Challenges | Supertonic | ElevenLabs | OpenAI | Gemini |
| --- | --- | --- | --- | --- | --- |
| Financial Expression | Decimal currency, abbreviated magnitudes (M, K), currency symbols, currency codes | | | | |
| Time and Date | Time notation, abbreviated weekdays/months, date formats | | | | |
| Phone Number | Area codes, hyphens, extensions (ext.) | | | | |
| Technical Unit | Decimal numbers with units, abbreviated technical notations | | | | |

Example 1: Financial Expression

Text:

"The startup secured $5.2M in venture capital, a huge leap from their initial $450K seed round."

Challenges:

  • Decimal point in currency ($5.2M should be read as "five point two million")
  • Abbreviated magnitude units (M for million, K for thousand)
  • Currency symbol ($) that needs to be properly pronounced as "dollars"
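
For contrast, a conventional TTS front end typically needs explicit normalization rules for cases like these before synthesis. The sketch below is a hypothetical, deliberately narrow rule for dollar amounts with a K or M suffix. It is not part of Supertonic, which is described here as handling such input without pre-processing, and the function name and output wording are assumptions.

```python
import re

# Magnitude suffixes named in the challenge list above.
MAGNITUDES = {"K": "thousand", "M": "million"}

# Matches a dollar sign, a number with an optional decimal part, and a K/M suffix.
_AMOUNT = re.compile(r"\$(\d+(?:\.\d+)?)([KM])\b")


def expand_dollar_amounts(text: str) -> str:
    """Rewrite amounts such as '$5.2M' into a spoken-style form."""

    def repl(match: re.Match) -> str:
        number, suffix = match.group(1), match.group(2)
        spoken = number.replace(".", " point ")  # '5.2' -> '5 point 2'
        return f"{spoken} {MAGNITUDES[suffix]} dollars"

    return _AMOUNT.sub(repl, text)


print(expand_dollar_amounts("secured $5.2M in venture capital"))
# secured 5 point 2 million dollars in venture capital
```

Even this single rule leaves digits unspelled and ignores other currencies, which illustrates why rule-based normalization tends to grow quickly in real systems.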

Audio Samples:

| System | Result | Audio Sample |
| --- | --- | --- |
| Supertonic | | 🎧 Play Audio |
| ElevenLabs Flash v2.5 | | 🎧 Play Audio |
| OpenAI TTS-1 | | 🎧 Play Audio |
| Gemini 2.5 Flash TTS | | 🎧 Play Audio |

Example 2: Time and Date

Text:

"The train delay was announced at 4:45 PM on Wed, Apr 3, 2024 due to track maintenance."

Challenges:

  • Time expression with PM notation (4:45 PM)
  • Abbreviated weekday (Wed)
  • Abbreviated month (Apr)
  • Full date format (Apr 3, 2024)

Audio Samples:

| System | Result | Audio Sample |
| --- | --- | --- |
| Supertonic | | 🎧 Play Audio |
| ElevenLabs Flash v2.5 | | 🎧 Play Audio |
| OpenAI TTS-1 | | 🎧 Play Audio |
| Gemini 2.5 Flash TTS | | 🎧 Play Audio |

Example 3: Phone Number

Text:

"You can reach the hotel front desk at (212) 555-0142 ext. 402 anytime."

Challenges:

  • Area code in parentheses that should be read as separate digits
  • Phone number with hyphen separator (555-0142)
  • Abbreviated extension notation (ext.)
  • Extension number (402)

Audio Samples:

| System | Result | Audio Sample |
| --- | --- | --- |
| Supertonic | | 🎧 Play Audio |
| ElevenLabs Flash v2.5 | | 🎧 Play Audio |
| OpenAI TTS-1 | | 🎧 Play Audio |
| Gemini 2.5 Flash TTS | | 🎧 Play Audio |

Example 4: Technical Unit

Text:

"Our drone battery lasts 2.3h when flying at 30kph with full camera payload."

Challenges:

  • Decimal time duration with abbreviation (2.3h = two point three hours)
  • Speed unit with abbreviation (30kph = thirty kilometers per hour)
  • Technical abbreviations (h for hours, kph for kilometers per hour)
  • Technical/engineering context requiring proper pronunciation

Audio Samples:

| System | Result | Audio Sample |
| --- | --- | --- |
| Supertonic | | 🎧 Play Audio |
| ElevenLabs Flash v2.5 | | 🎧 Play Audio |
| OpenAI TTS-1 | | 🎧 Play Audio |
| Gemini 2.5 Flash TTS | | 🎧 Play Audio |

Note: These samples demonstrate how each system handles text normalization and pronunciation of complex expressions without requiring pre-processing or phonetic annotations.

Citation

The following papers describe the core technologies used in Supertonic. If you use this system in your research or find these techniques useful, please consider citing the relevant papers:

SupertonicTTS: Main Architecture

This paper introduces the overall architecture of SupertonicTTS, including the speech autoencoder, flow-matching based text-to-latent module, and efficient design choices.

@article{kim2025supertonic,
  title={SupertonicTTS: Towards Highly Efficient and Streamlined Text-to-Speech System},
  author={Kim, Hyeongju and Yang, Jinhyeok and Yu, Yechan and Ji, Seunghun and Morton, Jacob and Bous, Frederik and Byun, Joon and Lee, Juheon},
  journal={arXiv preprint arXiv:2503.23108},
  year={2025},
  url={https://arxiv.org/abs/2503.23108}
}

Length-Aware RoPE: Text-Speech Alignment

This paper presents Length-Aware Rotary Position Embedding (LARoPE), which improves text-speech alignment in cross-attention mechanisms.

@article{kim2025larope,
  title={Length-Aware Rotary Position Embedding for Text-Speech Alignment},
  author={Kim, Hyeongju and Lee, Juheon and Yang, Jinhyeok and Morton, Jacob},
  journal={arXiv preprint arXiv:2509.11084},
  year={2025},
  url={https://arxiv.org/abs/2509.11084}
}

Self-Purifying Flow Matching: Training with Noisy Labels

This paper describes the self-purification technique for training flow matching models robustly with noisy or unreliable labels.

@article{kim2025spfm,
  title={Training Flow Matching Models with Reliable Labels via Self-Purification},
  author={Kim, Hyeongju and Yu, Yechan and Yi, June Young and Lee, Juheon},
  journal={arXiv preprint arXiv:2509.19091},
  year={2025},
  url={https://arxiv.org/abs/2509.19091}
}

License

This project's sample code is released under the MIT License; see the LICENSE file for details.

The accompanying model is released under the OpenRAIL-M License; see the LICENSE file for details.

This model was trained using PyTorch, which is licensed under the BSD 3-Clause License but is not redistributed with this project; see the LICENSE for details.

Copyright (c) 2025 Supertone Inc.