delayed-streams-modeling

Delayed Streams Modeling (DSM) is a flexible formulation for streaming, multimodal sequence-to-sequence learning.

Speech-to-text

DSM can be used to build streaming speech-to-text models. These models can be batched for efficiency, return word-level timestamps, and are well suited to interactive applications. They are characterized by their size and by their delay, that is, how long it takes for audio to be transcribed into text. We provide two such models:

An English and French model with ~1B parameters using a 0.5 second delay, kyutai/stt-1b-en_fr.
An English-only model with ~2.6B parameters using a 2.5 second delay, kyutai/stt-2.6b-en.
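The trade-off between the two models (language coverage, size, delay) can be sketched as a small lookup. This is an illustrative helper, not part of the repository: the `SttModel` class and `pick_model` function are hypothetical names, and preferring the larger model whenever the delay budget allows is an assumption rather than a recommendation from the project.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class SttModel:
    hf_repo: str                  # Hugging Face repository name
    languages: Tuple[str, ...]    # supported language codes
    approx_params_billions: float
    delay_seconds: float          # delay between audio and transcribed text


# Values taken from the model list above.
MODELS = (
    SttModel("kyutai/stt-1b-en_fr", ("en", "fr"), 1.0, 0.5),
    SttModel("kyutai/stt-2.6b-en", ("en",), 2.6, 2.5),
)


def pick_model(language: str, max_delay_seconds: float) -> Optional[SttModel]:
    """Return the largest model that supports `language` within the delay budget.

    Choosing the largest candidate is an assumption made for illustration.
    """
    candidates = [
        m for m in MODELS
        if language in m.languages and m.delay_seconds <= max_delay_seconds
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda m: m.approx_params_billions)
```

For example, `pick_model("en", 1.0)` falls back to the 1B model because the 2.6B model's 2.5 second delay exceeds the budget.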

More details can be found on the project page.

You can retrieve the sample files used in the following snippets via:

wget https://github.com/metavoiceio/metavoice-src/raw/main/assets/bria.mp3
wget https://github.com/kyutai-labs/moshi/raw/refs/heads/main/data/sample_fr_hibiki_crepes.mp3

PyTorch implementation

This requires the moshi package with version 0.2.5 or later, which can be installed via pip.

python -m moshi.run_inference --hf-repo kyutai/stt-2.6b-en bria.mp3

If you have uv installed, you can skip the installation step and run directly:

uvx --with moshi python -m moshi.run_inference --hf-repo kyutai/stt-2.6b-en bria.mp3

It will install the moshi package in a temporary environment and run the speech-to-text.

MLX implementation

This requires the moshi-mlx package with version 0.2.5 or later, which can be installed via pip.

python -m moshi_mlx.run_inference --hf-repo kyutai/stt-2.6b-en-mlx bria.mp3 --temp 0

If you have uv installed, you can skip the installation step and run directly:

uvx --with moshi-mlx python -m moshi_mlx.run_inference --hf-repo kyutai/stt-2.6b-en-mlx bria.mp3 --temp 0

It will install the moshi-mlx package in a temporary environment and run the speech-to-text.
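The PyTorch and MLX commands differ only in the module name, the model repository, and the extra `--temp 0` argument used for MLX. If you want to launch either backend from a script, a minimal sketch could build the argument list as follows. The `BACKENDS` table and `build_command` function are hypothetical helpers; they only use the modules and flags shown in the commands above.

```python
import sys

# Module, repository and extra arguments copied from the commands above.
BACKENDS = {
    "pytorch": {
        "module": "moshi.run_inference",
        "hf_repo": "kyutai/stt-2.6b-en",
        "extra": [],
    },
    "mlx": {
        "module": "moshi_mlx.run_inference",
        "hf_repo": "kyutai/stt-2.6b-en-mlx",
        "extra": ["--temp", "0"],
    },
}


def build_command(backend: str, audio_path: str, python: str = sys.executable) -> list:
    """Build the argument list for running speech-to-text on one audio file."""
    spec = BACKENDS[backend]  # raises KeyError for an unknown backend
    return [
        python, "-m", spec["module"],
        "--hf-repo", spec["hf_repo"],
        audio_path,
        *spec["extra"],
    ]


# Usage (requires the corresponding package to be installed):
#   import subprocess
#   subprocess.run(build_command("pytorch", "bria.mp3"), check=True)
```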

Rust implementation

A standalone Rust example is provided in the stt-rs directory in this repo. This can be used as follows:

cd stt-rs
cargo run --features cuda -r -- bria.mp3

Rust server

The Rust implementation provides a server that can process multiple streaming queries in parallel. Depending on the amount of memory on your GPU, you may have to adjust the batch size in the config file. For an L40S GPU, a batch size of 64 works well and requests can be processed at 3x real-time speed.
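To get a feel for what these numbers mean, here is a back-of-the-envelope calculation. It assumes that the 3x real-time figure applies to each stream in a full batch, which is one reading of the sentence above and not a measured result; the function names are illustrative.

```python
def aggregate_audio_seconds_per_second(batch_size: int, per_stream_rtf: float) -> float:
    """Seconds of audio transcribed per wall-clock second across the whole batch.

    Assumes every slot in the batch is busy and each stream runs at
    `per_stream_rtf` times real time.
    """
    return batch_size * per_stream_rtf


def wall_clock_seconds(audio_seconds: float, per_stream_rtf: float) -> float:
    """Wall-clock time needed to process one stream of the given duration."""
    return audio_seconds / per_stream_rtf


# Batch size and speed quoted above for an L40S GPU.
print(aggregate_audio_seconds_per_second(64, 3.0))  # 192.0
print(wall_clock_seconds(60.0, 3.0))                # 20.0
```

Under that assumption, a fully loaded server would transcribe about 192 seconds of audio per second, and a one-minute file would take about 20 seconds.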

In order to run the server, install the moshi-server crate via the following command. The server code can be found in the kyutai-labs/moshi repository.

cargo install --features cuda moshi-server

Then the server can be started via the following command using the config file from this repository.

moshi-server worker --config configs/config-stt-hf.toml

Once the server has started you can run a streaming inference with the following script.

uv run scripts/asr-streaming-query.py bria.mp3

The script limits the decoding speed to simulate real-time processing of the audio. Faster processing can be triggered by setting the real-time factor, e.g. --rtf 500 will process the data as fast as possible.
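The idea of pacing audio by a real-time factor can be illustrated with a short sketch. This is not the code of scripts/asr-streaming-query.py: the `paced_chunks` generator and its parameters are hypothetical, and the 0.08 second chunk duration in the usage comment is an arbitrary example value.

```python
import time
from typing import Callable, Iterable, Iterator


def paced_chunks(
    chunks: Iterable[bytes],
    chunk_seconds: float,
    rtf: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[bytes]:
    """Yield audio chunks no faster than `rtf` times real time.

    With rtf=1.0 chunks are released at the rate they would be recorded;
    a large rtf makes the waits negligible.
    """
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    start = clock()
    for index, chunk in enumerate(chunks):
        # Chunk `index` starts `index * chunk_seconds` into the audio,
        # which maps to that offset divided by rtf in wall-clock time.
        target = start + index * chunk_seconds / rtf
        remaining = target - clock()
        if remaining > 0:
            sleep(remaining)
        yield chunk


# Usage: release three chunks of 0.08 s of audio at real-time speed.
#   for chunk in paced_chunks([b"a", b"b", b"c"], chunk_seconds=0.08):
#       send(chunk)  # `send` stands in for whatever transmits the audio
```

Scheduling against an absolute start time, rather than sleeping a fixed amount per chunk, keeps the pacing from drifting when sending a chunk takes a noticeable amount of time.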

Text-to-Speech

We're in the process of open-sourcing our TTS models. Check back for updates!

License

The code is provided under the MIT license for the Python parts and the Apache license for the Rust backend. The web client code is provided under the MIT license. Note that parts of this code are based on AudioCraft, released under the MIT license.

The weights for the speech-to-text models are released under the CC-BY 4.0 license.
