Installing the command-line executable

Assuming you have Rust/Cargo installed, run this command in a terminal:

cargo install sensevoice-cli

This makes the sensevoice-cli command available in your PATH, provided you allowed the Rust installer to modify PATH. To remove it, run cargo uninstall sensevoice-cli.


SenseVoice CLI

A lightweight command-line front end for the SenseVoice multilingual speech recognition model.

Installation

Prerequisites

Rust 1.75 or later
Cargo package manager
pkg-config
cmake
opus (only for the default build with Opus/.ogg support)

Linux:

apt-get install -y cmake pkg-config

Mac:

brew install cmake

Then install with Cargo:

cargo install sensevoice-cli

# or without Opus (.ogg) support
cargo install sensevoice-cli --no-default-features

Usage

SenseVoice Rust CLI (ORT + Symphonia + HF Hub)

Usage: sensevoice-cli [OPTIONS] [AUDIO]

Arguments:
  [AUDIO]  Input audio file (wav/mp3/ogg/flac/opus/vorbis)

Options:
      --models-path <MODELS_PATH>      Download/cache directory for models and resources [default: ~/.sensevoice-models]
  -t, --threads <NUM_THREADS>          Intra-op threads for ONNX Runtime [default: 1]
  -l, --language <LANGUAGE>            Language code: auto, zh, en, yue, ja, ko, nospeech [default: auto]
      --use-itn                        Use ITN post-processing
      --vad-int8                       Use int8 Silero VAD model
      --no-vad                         Disable Silero VAD segmentation
      --vad-threshold <VAD_THRESHOLD>  VAD probability threshold (0.0-1.0) [default: 0.5]
      --vad-min-speech-ms <VAD_MIN_SPEECH_MS>
                                       Minimum speech duration in milliseconds [default: 400]
      --vad-min-silence-ms <VAD_MIN_SILENCE_MS>
                                       Minimum silence duration in milliseconds [default: 200]
      --vad-speech-pad-ms <VAD_SPEECH_PAD_MS>
                                       Additional padding in milliseconds around segments [default: 120]
      --vad-merge-gap-ms <VAD_MERGE_GAP_MS>
                                       Merge adjacent segments separated by <= gap milliseconds [default: 1200]
      --hf-endpoint <HF_ENDPOINT>      Optional HF endpoint/mirror (overrides env HF_ENDPOINT/HF_MIRROR)
      --log <LOG>                      Log level
  -o, --output <OUTPUT>                Output JSON file path
  -c, --channels <CHANNELS>            Maximum number of audio channels to transcribe (0 = all) [default: 1]
      --download-only                  Download models only and exit
  -h, --help                           Print help
  -V, --version                        Print version

Quick start

sensevoice-cli path/to/audio.wav
sensevoice-cli -o transcript.json path/to/audio.wav

Output:

[
  {
    "channel": 0,
    "duration_sec": 7.152,
    "rtf": 0.019359846,
    "segments": [
      {
        "start_sec": 1.09,
        "end_sec": 3.614,
        "text": "THE DRIBL TEETHIN CALLD FOR THE BOY",
        "tags": []
      },
      {
        "start_sec": 3.842,
        "end_sec": 6.59,
        "text": "AND PRESENTED HIM WITH FIFTY PIECES OF COATD",
        "tags": []
      }
    ]
  }
]

Input formats: WAV, MP3, OGG (Vorbis or Opus), and FLAC.
Default output: JSON written to stdout with per-channel segments.
Models download into ~/.sensevoice-models on first run (override with --models-path).
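
The JSON output is easy to post-process. Here is a minimal Python sketch, assuming the output keeps the shape of the sample above (a list of per-channel objects, each with duration_sec, rtf, and segments). The helper names format_transcript and processing_seconds are hypothetical, and reading rtf as processing time divided by audio duration is an assumption based on the usual meaning of "real-time factor", not something the README states.

```python
import json

# Sample copied from the Output section above.
SAMPLE = """
[
  {
    "channel": 0,
    "duration_sec": 7.152,
    "rtf": 0.019359846,
    "segments": [
      {"start_sec": 1.09, "end_sec": 3.614,
       "text": "THE DRIBL TEETHIN CALLD FOR THE BOY", "tags": []},
      {"start_sec": 3.842, "end_sec": 6.59,
       "text": "AND PRESENTED HIM WITH FIFTY PIECES OF COATD", "tags": []}
    ]
  }
]
"""


def format_transcript(raw: str) -> list[str]:
    """Return one 'chN [start-end] text' line per segment (hypothetical helper)."""
    lines = []
    for channel in json.loads(raw):
        for seg in channel["segments"]:
            lines.append(
                f"ch{channel['channel']} "
                f"[{seg['start_sec']:.2f}-{seg['end_sec']:.2f}] {seg['text']}"
            )
    return lines


def processing_seconds(channel: dict) -> float:
    """Estimate wall-clock processing time for one channel.

    Assumption: rtf = processing time / audio duration.
    """
    return channel["rtf"] * channel["duration_sec"]


if __name__ == "__main__":
    for line in format_transcript(SAMPLE):
        print(line)
```

To work with a saved transcript instead, read the file written by -o/--output and pass its contents to format_transcript.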

Handy flags

sensevoice-cli -l zh --use-itn -c 2 samples/demo.wav

-l/--language: explicit language hint (auto, zh, en, yue, ja, ko, nospeech).
--use-itn: enable inverse text normalization for cleaner numbers and dates.
-c/--channels: limit the number of channels to transcribe (default 1, set 0 for all).
-o/--output: write JSON to a file instead of stdout.
--log: set log verbosity (e.g. info, debug).
--download-only: prefetch model assets without running inference.
--no-vad: bypass voice activity detection and transcribe each channel as a whole.
--vad-*: tune Silero VAD behaviour (threshold, speech/silence durations, padding, merge gap) without editing code.
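
When calling the CLI from a script, the flags above can be assembled programmatically. The sketch below is one possible wrapper, not part of the crate: build_command and transcribe are hypothetical names. It assumes sensevoice-cli is on PATH and that stdout carries only the JSON document when -o is not given.

```python
import json
import shlex
import subprocess


def build_command(audio, language="auto", use_itn=False, channels=1, output=None):
    """Assemble an argument list using only flags documented above."""
    cmd = ["sensevoice-cli", "-l", language, "-c", str(channels)]
    if use_itn:
        cmd.append("--use-itn")
    if output is not None:
        cmd += ["-o", output]
    cmd.append(audio)
    return cmd


def transcribe(audio, language="auto", use_itn=False, channels=1):
    """Run the CLI and parse its stdout.

    Assumptions: the binary is on PATH and stdout contains only JSON.
    No -o flag is passed, so the JSON goes to stdout.
    """
    result = subprocess.run(
        build_command(audio, language, use_itn, channels),
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


if __name__ == "__main__":
    # Shows the command line without running it.
    cmd = build_command("samples/demo.wav", language="zh", use_itn=True, channels=2)
    print(shlex.join(cmd))
```

Passing a list to subprocess.run avoids shell quoting problems with file names that contain spaces.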

Advanced tips

Mirror-friendly downloads: add --hf-endpoint https://hf-mirror.com (or set HF_ENDPOINT/HF_MIRROR) to speed up model fetches from mainland China.
Multi-channel aware: every audio channel is decoded separately; the results are collected into a single JSON array with one entry per channel, each carrying its channel index and its VAD segments.
VAD precision: append --vad-int8 to prefer the quantized Silero VAD model when CPU resources are limited.
VAD controls: fine-tune segmentation with the --vad-* flags (threshold, speech/silence durations, padding, merge gap).
Performance tuning: adjust -t/--threads to match the available CPU cores. GPU execution currently requires rebuilding with a CUDA-enabled ONNX Runtime.
Session warm-up: the first run saves optimized .ort graphs next to the downloaded models; later runs reuse them to avoid ONNX Runtime re-optimization costs.
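
The threading and mirror tips can be combined when launching the CLI from a script. This is a rough sketch: advanced_args is a hypothetical helper, and using the logical CPU count as the thread count is only a starting point, not a recommendation from the README.

```python
import os


def advanced_args(threads=None, mirror=None, vad_int8=False):
    """Return (extra_args, env) for a tuned run, using only documented flags."""
    if threads is None:
        # os.cpu_count() reports logical CPUs and may return None.
        threads = os.cpu_count() or 1
    args = ["-t", str(threads)]
    if vad_int8:
        args.append("--vad-int8")
    env = dict(os.environ)
    if mirror:
        # Alternative to passing --hf-endpoint on the command line.
        env["HF_ENDPOINT"] = mirror
    return args, env


if __name__ == "__main__":
    args, env = advanced_args(mirror="https://hf-mirror.com", vad_int8=True)
    print(["sensevoice-cli", *args, "path/to/audio.wav"])
    print("HF_ENDPOINT =", env["HF_ENDPOINT"])
```

The returned env can be passed to subprocess.run(..., env=env) so the mirror setting applies only to that invocation.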