# Audio API

A FastAPI service for Speech-to-Text, Text-to-Speech, and Conversation, powered by LiquidAI LFM2-Audio-1.5B.
## Features

- Speech-to-Text (STT): Transcribe audio at ~12.9x realtime
- Text-to-Speech (TTS): High-quality speech synthesis
- Conversation (speech-to-speech): Full voice-to-voice conversations
- End-to-end model: A single model handles all audio tasks
- GPU accelerated: CUDA support, including Blackwell GPUs
## Installation

```bash
# Clone the repository
git clone https://github.com/yourusername/audio-api.git
cd audio-api

# Create conda environment
conda create -n liquid-audio python=3.12
conda activate liquid-audio

# Install dependencies
pip install torch --index-url https://download.pytorch.org/whl/cu128
pip install liquid-audio fastapi uvicorn librosa soundfile python-multipart

# Copy and configure environment
cp .env.example .env
# Edit .env with your settings
```
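Before launching the service, it is worth confirming that the cu128 PyTorch wheel actually sees your GPU. A quick sanity check:

```python
import torch

# Verify the CUDA-enabled PyTorch build can see the GPU before launching the API.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```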
## Running

```bash
# Using the start script
./start_api.sh

# Or manually
conda activate liquid-audio
python liquid_audio_api.py
```

Once running, the service is available at:

- API: http://localhost:5006
- Swagger docs: http://localhost:5006/docs
- ReDoc: http://localhost:5006/redoc
## API Usage

Transcribe audio (STT):

```bash
curl -X POST http://localhost:5006/transcribe \
  -F "audio=@your_audio.wav" \
  -F "text_prompt=Transcribe the audio." \
  -F "max_tokens=256"
```

Synthesize speech (TTS):

```bash
curl -X POST http://localhost:5006/synthesize \
  -F "text=Hello, this is a test." \
  -F "max_tokens=512"
```

Hold a voice conversation (speech-to-speech):

```bash
curl -X POST http://localhost:5006/converse \
  -F "audio=@your_audio.wav" \
  -F "system_prompt=Respond briefly." \
  -F "max_tokens=128"
```

Check service health:

```bash
curl http://localhost:5006/health | python -m json.tool
```

## Performance

- Transcription (STT): ~12.9x realtime
- Synthesis (TTS): ~2.95s for short text
- Conversation: ~3.36s for a full response
- VRAM usage: ~3 GB allocated, ~6 GB reserved
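The same endpoints can be called from Python with `requests`. A minimal client sketch; the response shapes below (JSON for /transcribe, raw audio bytes for /synthesize) are assumptions, so confirm them against the Swagger docs at /docs:

```python
import requests

BASE = "http://localhost:5006"

# Transcribe a local file. The response is assumed to be JSON containing
# the transcription; check /docs for the actual schema.
with open("your_audio.wav", "rb") as f:
    resp = requests.post(
        f"{BASE}/transcribe",
        files={"audio": f},
        data={"text_prompt": "Transcribe the audio.", "max_tokens": 256},
    )
resp.raise_for_status()
print(resp.json())

# Synthesize speech. Assumed: the endpoint returns audio bytes (e.g. WAV)
# in the response body. /converse is called like /transcribe but with a
# system_prompt form field instead of text_prompt.
resp = requests.post(
    f"{BASE}/synthesize",
    data={"text": "Hello, this is a test.", "max_tokens": 512},
)
resp.raise_for_status()
with open("output.wav", "wb") as f:
    f.write(resp.content)
```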
## Configuration

Configuration is done via environment variables (see .env.example):

| Variable | Default | Description |
|---|---|---|
| API_HOST | 0.0.0.0 | Listen address |
| API_PORT | 5006 | Listen port |
| GPU_DEVICE | cuda:0 | GPU to use |
| STORAGE_BASE | ./data | Data storage directory |
| MODEL_BASE | ./models/cache | Model cache directory |
## Running as a systemd Service

To run as a system service:

```bash
# Create service file
sudo tee /etc/systemd/system/audio-api.service << 'EOF'
[Unit]
Description=LiquidAI Audio API
After=network.target

[Service]
Type=simple
User=your_user
WorkingDirectory=/path/to/audio-api
Environment="PATH=/path/to/miniconda3/envs/liquid-audio/bin"
ExecStart=/path/to/miniconda3/envs/liquid-audio/bin/python liquid_audio_api.py
Restart=on-failure
RestartSec=15

[Install]
WantedBy=multi-user.target
EOF

# Enable and start
sudo systemctl daemon-reload
sudo systemctl enable audio-api
sudo systemctl start audio-api
```

## Data Layout

Uploads, generated audio, and logs are organized under STORAGE_BASE by day:

```
./data/                      # STORAGE_BASE
├── temp/                    # Temporary uploads
├── YYYY-MM-DD/              # Daily directories
│   ├── raw/                 # Uploaded audio files
│   ├── processed/           # Generated audio files
│   ├── transcriptions.jsonl
│   ├── synthesis.jsonl
│   └── conversations.jsonl
```
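Each daily JSONL file holds one record per line. A minimal sketch for reading them; the field names inside each record are not documented here, so the example just prints whatever each entry contains:

```python
import json
from datetime import date
from pathlib import Path

# Read today's transcription log, one JSON record per line.
log = Path("./data") / date.today().isoformat() / "transcriptions.jsonl"
if log.exists():
    with log.open() as f:
        for line in f:
            record = json.loads(line)
            print(record)
```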
## Dictation App

A lightweight desktop application for voice-to-text with clipboard integration.

Location: ./dictation-app/

```bash
cd dictation-app
./start_dictation.sh   # CLI version
./start_webui.sh       # Web UI version (port 7870)
```

Features:

- Press SPACE to record/stop (CLI)
- Auto-copy to clipboard
- JSONL logging
- Audio archive
- 1-2 second latency

See dictation-app/README.md for details.
## Streaming

Real-time streaming via WebSocket and WebRTC.

Location: ./streaming/

```bash
cd streaming
python conversation_stream.py   # Port 5008
```

See streaming/docs/ for setup guides.
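The WebSocket protocol details live in streaming/docs/. As a rough sketch of what a client loop can look like with the `websockets` package; the /ws path and the send-bytes/receive-reply framing here are assumptions for illustration, not the actual protocol:

```python
import asyncio
import websockets  # pip install websockets

async def stream_once(url: str = "ws://localhost:5008/ws") -> None:
    # Assumed framing: send raw audio bytes, receive one reply
    # (text or audio). See streaming/docs/ for the real protocol.
    async with websockets.connect(url) as ws:
        with open("your_audio.wav", "rb") as f:
            await ws.send(f.read())
        reply = await ws.recv()
        print(f"received {type(reply).__name__} of length {len(reply)}")

asyncio.run(stream_once())
```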
## Model Details

| Property | Value |
|---|---|
| Model | LiquidAI/LFM2-Audio-1.5B |
| Parameters | 1.45B |
| VRAM | ~3 GB |
| Capabilities | STT + TTS + Conversation |
## Monitoring & Troubleshooting

Check service health:

```bash
curl http://localhost:5006/health
```

View logs:

```bash
# If using systemd
journalctl -u audio-api -f

# Or check the data directory
tail -f ./data/$(date +%Y-%m-%d)/*.jsonl
```

Common issues:

- CUDA out of memory: Reduce `max_tokens` or use smaller batch sizes
- Model not found: Run `python download_models.py` first
- Audio format error: Ensure input is WAV, MP3, or FLAC
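For automated checks, a small poll loop sketch; it assumes only that /health returns HTTP 200 with a JSON body when the service is up, which is useful while the model is still loading:

```python
import time
import requests

# Poll /health until the service responds OK.
for attempt in range(30):
    try:
        r = requests.get("http://localhost:5006/health", timeout=5)
        if r.ok:
            print("Service is up:", r.json())
            break
    except requests.ConnectionError:
        pass
    time.sleep(2)
else:
    print("Service did not become healthy in time")
```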
## License

MIT License - see the LICENSE file for details.

Version: 1.1.0