silijon/whispers
Whispers - Voice-to-Command System

Transform your voice into executable bash commands with AI-powered intelligence

Whispers is a sophisticated voice-controlled command-line interface that captures your speech, transcribes it using Whisper, and converts natural language into precise bash commands using AI. Say goodbye to typing complex commands!

Features

Core Capabilities

  • Real-time Speech Recognition - Ultra-low latency audio capture and transcription
  • AI-Powered Command Generation - Natural language to bash conversion using multiple AI providers
  • Intelligent Autocorrect - Automatically fixes file/directory names and paths
  • Multi-Platform Audio - Works with PulseAudio, PipeWire, and ALSA
  • Shell Integration - Seamless integration with zsh, tmux, and clipboard

AI Provider Support

  • Anthropic Claude (Haiku/Sonnet) - Fast and accurate
  • OpenAI GPT - Industry-standard performance
  • Ollama - Local, private inference
  • Custom APIs - Bring your own AI endpoint

Advanced Features

  • Voice Activity Detection - Smart silence detection
  • Pre-buffering - Captures speech from the very beginning
  • Confirmation Mode - Optional safety prompts before execution
  • Command History - Automatic zsh history integration
  • Multi-format Audio - Supports various sample rates and formats
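
Pre-buffering keeps a short rolling window of recent audio so that, when speech is first detected, the frames just before the trigger are not lost. A minimal sketch of the idea (the class name and frame sizes here are illustrative, not the project's actual API):

```python
from collections import deque

class PreBuffer:
    """Rolling window of recent audio frames, flushed when speech starts."""

    def __init__(self, frames_to_keep):
        # A deque with maxlen silently drops the oldest frame when full
        self.ring = deque(maxlen=frames_to_keep)

    def push(self, frame):
        self.ring.append(frame)

    def flush(self):
        # Return buffered frames (oldest first) and clear the buffer
        frames = list(self.ring)
        self.ring.clear()
        return frames

# At 16 kHz with 20 ms frames, a 0.3 s pre-buffer is 15 frames
buf = PreBuffer(frames_to_keep=15)
for i in range(20):              # feed 20 frames of leading audio
    buf.push(f"frame-{i}")
# Speech detected: prepend the buffered tail to the recording
recovered = buf.flush()
```

When speech starts, the buffered frames are emitted ahead of the live stream, so the first syllable survives even though detection lags the actual onset.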

Quick Start

Prerequisites

  • Python 3.13+
  • A Whisper server running on port 8080
  • AI API key (Anthropic, OpenAI, etc.)
  • Audio device (microphone)

Installation

  1. Clone and set up:
git clone <repository-url>
cd whispers
pip install -e .
  2. Configure your AI provider:
cp ai_inference_config.json.example ai_inference.json
# Edit ai_inference.json with your API key
  3. Test audio capture:
./audio_capture.py --list  # List audio devices
./audio_capture.py -d 0    # Test with device 0

Basic Usage

Voice-to-Command Pipeline

# Generate commands (safe mode)
./voice-to-command.sh

# Execute commands automatically (use with caution!)
./voice-to-command.sh --execute

Zsh Integration

Add to your .zshrc:

source ~/whispers/voice-inject.plugin.zsh

# Enable confirmation prompts (recommended)
export VOICE_CONFIRM=1
export VOICE_TIMEOUT=5

Tmux Integration

# Monitor and auto-inject commands into tmux
./tmux-voice-inject.sh monitor &

Voice Command Examples

Just speak naturally:

Say This                         Gets Converted To
"List all Python files"          find . -name "*.py"
"Show disk usage"                df -h
"Count lines in text files"      find . -name "*.txt" -exec wc -l {} +
"Find large files over 100MB"    find . -size +100M -type f
"Show running processes"         ps aux
"Compress this directory"        tar -czf archive.tar.gz .
"Watch system logs"              tail -f /var/log/syslog

Configuration

Audio Settings

# Low-latency capture with PipeWire
./capture-lowlat.sh --device 0

# Custom audio parameters  
python3 audio_capture.py \
  --sample-rate 44100 \
  --gain 20 \
  --device 2

AI Provider Configuration

{
  "provider": "anthropic",
  "api_key": "your-key-here",
  "model": "claude-3-haiku-20240307",
  "temperature": 0.0,
  "max_tokens": 1000
}
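
A config like the one above can be sanity-checked before any request is sent. The loader below is a sketch, not the project's actual code; the required keys and defaults are taken from the example config:

```python
import json

REQUIRED_KEYS = {"provider", "api_key", "model"}
KNOWN_PROVIDERS = {"anthropic", "openai", "ollama", "custom"}

def load_ai_config(path):
    """Load and validate an AI provider config file."""
    with open(path) as f:
        cfg = json.load(f)
    missing = REQUIRED_KEYS - cfg.keys()
    if missing:
        raise ValueError(f"missing config keys: {sorted(missing)}")
    if cfg["provider"] not in KNOWN_PROVIDERS:
        raise ValueError(f"unknown provider: {cfg['provider']}")
    # Sensible defaults for the optional tuning knobs
    cfg.setdefault("temperature", 0.0)
    cfg.setdefault("max_tokens", 1000)
    return cfg
```

Failing fast on a bad config beats discovering a missing API key mid-dictation.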

Voice Detection Tuning

python3 streaming_transcriber.py \
  --silence-threshold 0.02 \
  --silence-duration 1.5 \
  --pre-buffer 0.3 \
  --show-levels  # Visual audio level display
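
The two silence flags interact roughly like this: a frame counts as silent when its RMS level falls below --silence-threshold, and the recording ends once silence has persisted for --silence-duration. A simplified sketch of that logic (not the actual streaming_transcriber.py implementation):

```python
import math

def rms(frame):
    """Root-mean-square level of a frame of float samples in [-1, 1]."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def find_end_of_speech(frames, threshold=0.02, silence_frames=3):
    """Return the index of the frame where sustained silence is confirmed,
    or None if speech never ends. silence_frames is roughly
    silence_duration / frame_duration (e.g. 1.5 s of 0.5 s frames = 3)."""
    quiet = 0
    for i, frame in enumerate(frames):
        quiet = quiet + 1 if rms(frame) < threshold else 0
        if quiet >= silence_frames:
            return i
    return None

loud = [0.5, -0.5] * 80          # well above the threshold
quiet_f = [0.001, -0.001] * 80   # below the threshold
idx = find_end_of_speech([loud, loud, quiet_f, quiet_f, quiet_f])
```

Note that a single loud frame resets the silence counter, so a pause mid-sentence shorter than --silence-duration does not cut off the recording.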

Architecture

 Microphone  →  Audio Capture  →  Whisper  →  AI Provider
                (Low Latency)     Server      (Claude/GPT)
                                                   ↓
 Shell/Tmux  ←  Command Executor  ←  Autocorrect  ←  Bash Command
 Integration     & History           & Validate      Generation

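End to end, the diagram above is a straightforward function pipeline. A stub sketch with the four stages as plain functions (all bodies here are placeholders, not the project's real implementations):

```python
def transcribe(audio):             # Whisper server stage (stub)
    return "list all python files"

def generate_command(text):        # AI provider stage (stub)
    return 'find . -name "*.py"'

def autocorrect(cmd):              # path/filename correction stage (stub)
    return cmd

def execute(cmd):                  # shell/tmux integration stage (stub)
    return f"$ {cmd}"

def pipeline(audio):
    """Compose the four stages in diagram order."""
    return execute(autocorrect(generate_command(transcribe(audio))))
```

Each stage only consumes the previous stage's output, which is why the components below can also be run and tested independently.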
Components Deep Dive

audio_capture.py

  • High-performance audio capture using sounddevice
  • Configurable gain, sample rates, and block sizes
  • Device enumeration and selection
  • Real-time audio streaming to stdout
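
The configurable gain implies scaling raw samples before they are streamed out, and int16 audio must be clamped so boosted samples don't wrap around. A sketch of that step (illustrative only; audio_capture.py itself uses sounddevice and NumPy, and whether its --gain is linear or in dB is not specified here):

```python
def apply_gain(samples, gain):
    """Scale int16 samples by a linear gain factor, clamping to the int16 range."""
    out = []
    for s in samples:
        v = int(s * gain)
        # Clamp instead of letting the value wrap, which would sound like harsh clicks
        out.append(max(-32768, min(32767, v)))
    return out
```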

streaming_transcriber.py

  • Intelligent voice activity detection
  • Pre-buffering for complete speech capture
  • Real-time audio level monitoring
  • Whisper server communication with retry logic
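
Retry logic for the Whisper server call can be as simple as exponential backoff around the HTTP request. A generic sketch (the wrapper name and delays are illustrative, not the project's actual code):

```python
import time

def with_retry(fn, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call fn(), retrying on exceptions with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise                      # out of attempts: surface the error
            sleep(base_delay * (2 ** attempt))  # 0.5 s, 1 s, 2 s, ...

# Usage sketch:
# with_retry(lambda: requests.post("http://localhost:8080/inference", files=...))
```

Injecting `sleep` keeps the wrapper testable without real delays.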

ai_inference.py

  • Multi-provider AI client (Anthropic, OpenAI, Ollama)
  • Configurable prompts and parameters
  • Request/response handling with error recovery
  • Configuration management
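
A multi-provider client mostly differs in how it shapes the request body for each backend. A dispatch sketch (these payload shapes follow the public Anthropic, OpenAI, and Ollama HTTP APIs in broad strokes; ai_inference.py's real structure may differ):

```python
def build_request(provider, model, prompt, max_tokens=1000):
    """Build a request body for the given backend's HTTP API."""
    if provider == "anthropic":        # Messages API
        return {"model": model, "max_tokens": max_tokens,
                "messages": [{"role": "user", "content": prompt}]}
    if provider == "openai":           # Chat Completions API
        return {"model": model,
                "messages": [{"role": "user", "content": prompt}]}
    if provider == "ollama":           # /api/generate endpoint
        return {"model": model, "prompt": prompt, "stream": False}
    raise ValueError(f"unsupported provider: {provider}")
```

Everything downstream (prompting, response parsing, error recovery) can then be provider-agnostic.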

voice-inject.plugin.zsh

  • Intelligent path and filename autocorrection
  • Case-insensitive file matching
  • Automatic zsh history integration
  • Configurable confirmation prompts
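
Case-insensitive filename correction can be done segment by segment: walk the spoken path, and at each level look for a directory entry that matches ignoring case. A Python sketch of the idea (the actual plugin implements this in zsh; this function is illustrative):

```python
import os

def autocorrect_path(path, root="."):
    """Fix the case of each path segment against what actually exists on disk."""
    current = root
    fixed = []
    for segment in path.split(os.sep):
        try:
            entries = os.listdir(current)
        except OSError:
            fixed.append(segment)      # directory doesn't exist; keep as spoken
            current = os.path.join(current, segment)
            continue
        # Prefer a case-insensitive match; fall back to the spoken segment
        match = next((e for e in entries if e.lower() == segment.lower()), segment)
        fixed.append(match)
        current = os.path.join(current, match)
    return os.sep.join(fixed)
```

So a transcription like "documents/notes.TXT" resolves to the real "Documents/Notes.txt" when such a file exists.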

Troubleshooting

Audio Issues

# List all audio devices
pactl list short sources

# Test PipeWire/PulseAudio  
pw-record --format=s16 --rate=16000 test.wav

# Check device permissions
groups $USER  # Should include 'audio'

Whisper Server

# Verify server is running
curl http://localhost:8080/health

# Check server logs
docker logs whisper-server

AI Provider Issues

# Test AI inference directly
echo "list files" | python3 ai_inference.py --verbose

# Validate configuration
python3 ai_inference.py --save-config test-config.json

Requirements

System:

  • Linux (Fedora/Ubuntu tested)
  • Python 3.13+
  • PulseAudio/PipeWire
  • Audio input device

Python Dependencies:

  • sounddevice>=0.5.2 - Audio capture
  • numpy>=2.3.2 - Audio processing
  • requests>=2.32.5 - HTTP client
  • pyperclip>=1.9.0 - Clipboard integration

External Services:

  • Whisper server (OpenAI Whisper API compatible)
  • AI provider (Anthropic/OpenAI/Ollama)

Contributing

Contributions welcome! Areas for improvement:

  • Additional AI provider integrations
  • Windows/macOS audio support
  • Web interface for configuration
  • Voice command training/customization
  • Performance optimizations

License

MIT License - See LICENSE file for details

Acknowledgments

  • OpenAI for Whisper speech recognition
  • Anthropic for Claude AI models
  • The sounddevice and NumPy communities
  • PipeWire/PulseAudio teams for low-latency audio

Security Note: This tool can execute arbitrary commands. Use confirmation mode (VOICE_CONFIRM=1) in production environments and review generated commands before execution.
