Real-time speech-to-text with Voice Activity Detection (VAD), powered by JigsawStack's STT API.
- Voice Activity Detection: Automatically detects when you start/stop speaking
- Audio Accumulation: Sends the growing audio clip every second for evolving transcriptions
- Real-time Waveform: Visual feedback showing speech activity with a gradient animation (sketched after this list)
- Segment Management: Finalizes segments after 3 seconds of silence
- Modern UI: Clean interface with final (black) and interim (gray italic) text display
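
A minimal sketch of how such a waveform could be drawn, assuming an `AnalyserNode` wired to the microphone stream; the function name, canvas wiring, and gradient colors are illustrative, not taken from `index-vad.html`:

```ts
// Draws the analyser's time-domain signal on a canvas each animation
// frame. Ids, buffer handling, and colors are illustrative only.
function drawWaveform(analyser: AnalyserNode, canvas: HTMLCanvasElement): void {
  const ctx = canvas.getContext("2d")!;
  const samples = new Uint8Array(analyser.fftSize);

  const render = () => {
    analyser.getByteTimeDomainData(samples); // values 0..255, 128 = silence
    ctx.clearRect(0, 0, canvas.width, canvas.height);

    // Gradient stroke for the "speech activity" look
    const gradient = ctx.createLinearGradient(0, 0, canvas.width, 0);
    gradient.addColorStop(0, "#4f46e5");
    gradient.addColorStop(1, "#06b6d4");
    ctx.strokeStyle = gradient;
    ctx.lineWidth = 2;

    ctx.beginPath();
    const step = canvas.width / samples.length;
    samples.forEach((value, i) => {
      const y = (value / 255) * canvas.height;
      i === 0 ? ctx.moveTo(0, y) : ctx.lineTo(i * step, y);
    });
    ctx.stroke();

    requestAnimationFrame(render);
  };
  render();
}
```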
- Install dependencies:

  ```bash
  npm install   # or: yarn install
  ```

- Set your JigsawStack API key in `.env`:

  ```bash
  JIGSAWSTACK_API_KEY=your_api_key_here
  ```

- Run both servers in separate terminals:

  Terminal 1: WebSocket Server (Port 8080)

  ```bash
  npm start   # or: yarn start
  ```

  Terminal 2: HTTP Server (Port 3000)

  ```bash
  npm run start:http   # or: yarn start:http
  ```

- Open http://localhost:3000/index.html and allow microphone access.
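
Once both servers are up, you can sanity-check the WebSocket endpoint before opening the browser. A minimal sketch, assuming the `ws` package is available and the server accepts connections on the root path:

```ts
import WebSocket from "ws";

// Quick connectivity check against the STT WebSocket server.
const socket = new WebSocket("ws://localhost:8080");

socket.on("open", () => {
  console.log("WebSocket server is reachable");
  socket.close();
});

socket.on("error", (err) => {
  console.error("Could not reach the server:", err.message);
});
```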
- VAD monitors audio energy levels to detect speech
- When speech starts, audio is accumulated in a buffer
- Every 1 second, accumulated audio is sent for transcription
- Transcriptions evolve and improve as more audio context is added
- After 3 seconds of silence, the segment is finalized
- The process restarts fresh for the next speech segment (see the sketch below)
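
A minimal sketch of this loop, assuming `Float32Array` PCM chunks arrive from the Web Audio API; the energy threshold, helper names, and the `send` callback are illustrative stand-ins for the real client code in `index-vad.html`:

```ts
// Energy-based VAD with audio accumulation. Thresholds and names
// are illustrative; the actual client lives in index-vad.html.
const ENERGY_THRESHOLD = 0.01;   // RMS level treated as "speech"
const SEND_INTERVAL_MS = 1000;   // send the growing clip every second
const SILENCE_TIMEOUT_MS = 3000; // finalize after 3s of silence

let buffer: Float32Array[] = [];
let speaking = false;
let lastVoiceAt = 0;
let sendTimer: ReturnType<typeof setInterval> | null = null;

function rms(chunk: Float32Array): number {
  let sum = 0;
  for (const s of chunk) sum += s * s;
  return Math.sqrt(sum / chunk.length);
}

function onAudioChunk(
  chunk: Float32Array,
  send: (clip: Float32Array[], final: boolean) => void
): void {
  const voiced = rms(chunk) > ENERGY_THRESHOLD;
  if (voiced) lastVoiceAt = Date.now();

  if (voiced && !speaking) {
    // Speech started: begin accumulating and sending interim clips.
    speaking = true;
    buffer = [];
    sendTimer = setInterval(() => send(buffer, false), SEND_INTERVAL_MS);
  }

  if (speaking) {
    buffer.push(chunk);
    if (Date.now() - lastVoiceAt > SILENCE_TIMEOUT_MS) {
      // 3s of silence: finalize the segment and reset for the next one.
      if (sendTimer) clearInterval(sendTimer);
      send(buffer, true);
      speaking = false;
      buffer = [];
    }
  }
}
```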
Edit the config in index.html (line 382):
```js
{
  language: "en",        // Language code
  encoding: "wav",       // Audio format (wav, webm, pcm16)
  interimResults: true,  // Show evolving transcriptions
  format: "text"         // Output format (text or json)
}
```

- `server.ts` - WebSocket server handling STT requests
- `http-server.ts` - Serves the web interface
- `index-vad.html` - VAD-based client with waveform visualization
- `index.html` - Simple interval-based client (no VAD)
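
For orientation, a rough sketch of how a client might hand this config to the server after connecting; the message shapes (`type: "config"`, `isFinal`, `text`) are assumptions for illustration, not the documented protocol of `server.ts`:

```ts
// Hypothetical client flow: open the socket, send the config, then
// log interim and final transcriptions. The exact message schema used
// by server.ts may differ; treat this as a shape sketch only.
const socket = new WebSocket("ws://localhost:8080");

socket.addEventListener("open", () => {
  socket.send(JSON.stringify({
    type: "config",          // assumed envelope field
    language: "en",
    encoding: "wav",
    interimResults: true,
    format: "text",
  }));
});

socket.addEventListener("message", (event) => {
  const result = JSON.parse(event.data as string);
  // Interim results render gray/italic in the UI; final ones black.
  console.log(result.isFinal ? "FINAL:" : "interim:", result.text);
});
```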
- JigsawStack STT API
- Web Audio API for VAD and PCM capture
- WebSockets for real-time communication
- TypeScript + Node.js
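
As an example of the Web Audio API piece, a minimal PCM capture sketch using a `ScriptProcessorNode` (deprecated but broadly supported; the actual clients may use an `AudioWorklet` instead). The buffer size and `onChunk` callback are illustrative:

```ts
// Capture mono PCM from the microphone via the Web Audio API.
async function capturePcm(onChunk: (chunk: Float32Array) => void) {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const ctx = new AudioContext();
  const source = ctx.createMediaStreamSource(stream);

  // 4096-sample buffers at the context sample rate (often 48 kHz).
  const processor = ctx.createScriptProcessor(4096, 1, 1);
  processor.onaudioprocess = (e) => {
    // Copy, because the underlying buffer is reused between callbacks.
    onChunk(new Float32Array(e.inputBuffer.getChannelData(0)));
  };

  source.connect(processor);
  processor.connect(ctx.destination); // required for processing to run
}
```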