This project is a real-time Automatic Speech Recognition (ASR) client using WebSockets to stream audio to a server for transcription. It captures audio from a microphone, sends it to a WebSocket server, and prints the received transcriptions.
Before setting up the project, ensure you have the following installed:
- Python 3.10
- Conda (optional, recommended for environment management)
- Docker (optional)
- NeMo (NVIDIA's conversational AI toolkit)
To avoid dependency conflicts, create a new conda environment:

```bash
conda create --name rt_asr python=3.10 -y
conda activate rt_asr
```

NeMo is required for speech recognition processing. You can install it using the following command:

```bash
pip install "nemo_toolkit[all]"
```

If you encounter issues, follow the official NeMo installation guide: NVIDIA NeMo GitHub. Alternatively, install the latest version directly from the repository:

```bash
pip install git+https://github.com/NVIDIA/NeMo.git
```

PyAudio may require additional system dependencies; on Linux, install them using:

```bash
sudo apt-get install sox libsndfile1 ffmpeg portaudio19-dev
```
After activating the environment, install the required Python packages:

```bash
pip install -r requirements.txt
```

Alternatively, you can use a Docker container for the ASR server.
To build and run the Docker container for this project, follow these steps:

```bash
docker build . -t ws_asr
docker run --gpus all -p 8766:8766 --rm ws_asr
```

Ensure your WebSocket ASR server is running at `ws://localhost:8766` before starting the client.
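Before launching the client, you can sanity-check that the server port is reachable. A minimal stdlib-only sketch (the host and port match the `ws://localhost:8766` default above; the helper name is not from this repo):

```python
import socket

def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# The client expects the ASR server at ws://localhost:8766.
print(is_port_open("localhost", 8766))
```

This only verifies TCP reachability, not that the WebSocket handshake will succeed, but it catches the most common "server not started" mistake quickly.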
Start the server and then, in a separate terminal, the client:

```bash
python server.py
python client.py
```

After running the client, it will list available audio input devices. Choose the appropriate device by entering its corresponding ID.
Once the connection is established, the system will capture audio and send it to the WebSocket server for transcription in real-time.
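Each captured chunk arrives from the microphone as raw 16-bit PCM bytes, and ASR models such as NeMo's typically consume floating-point samples. A minimal stdlib-only sketch of that per-chunk conversion (the function name is hypothetical, not from this repo):

```python
from array import array

def pcm16_to_float(chunk: bytes) -> list[float]:
    """Convert a raw little-endian int16 PCM chunk to floats in [-1.0, 1.0)."""
    samples = array("h")           # 'h' = signed 16-bit integers
    samples.frombytes(chunk)
    return [s / 32768.0 for s in samples]

# Example: a 4-sample chunk covering zero, half-scale, and near full-scale.
chunk = array("h", [0, 16384, -16384, 32767]).tobytes()
print(pcm16_to_float(chunk))  # [0.0, 0.5, -0.5, ~0.99997]
```

In the real client the converted chunk would then be handed to the model (or sent over the WebSocket) instead of printed.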
This project utilizes NeMo models trained for streaming applications, as described in the paper: Noroozi et al. "Stateful FastConformer with Cache-based Inference for Streaming Automatic Speech Recognition" (accepted to ICASSP 2024).
- Trained with limited left and right-side context to enable low-latency streaming transcription.
- Implements caching to avoid recomputation of previous activations, reducing latency further.
- `stt_en_fastconformer_hybrid_large_streaming_80ms` - 80ms lookahead / 160ms chunk size
- `stt_en_fastconformer_hybrid_large_streaming_480ms` - 480ms lookahead / 540ms chunk size
- `stt_en_fastconformer_hybrid_large_streaming_1040ms` - 1040ms lookahead / 1120ms chunk size
- `stt_en_fastconformer_hybrid_large_streaming_multi` - 0ms/80ms/480ms/1040ms lookahead / 80ms/160ms/540ms/1120ms chunk size
- Audio is continuously recorded in chunks and fed into the ASR model.
- Using `pyaudio`, an audio input stream passes the audio to a `stream_callback` function at set intervals.
- The `transcribe` function processes the audio chunk and returns transcriptions in real-time.
- Chunk size determines the duration of audio processed per step.
- Lookahead size is calculated as `chunk size - 80 ms` (since FastConformer models have a fixed 80ms output timestep duration).
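The arithmetic above can be sketched as follows (a 16 kHz sample rate is assumed, matching the NeMo streaming models; the helper names are illustrative, not from this repo):

```python
SAMPLE_RATE = 16000          # Hz; assumed, matching the NeMo streaming models
ENCODER_STEP_LENGTH = 80     # ms; fixed FastConformer output timestep duration

def frames_per_chunk(chunk_size_ms: int, sample_rate: int = SAMPLE_RATE) -> int:
    """Number of audio samples PyAudio should deliver per callback."""
    return sample_rate * chunk_size_ms // 1000

def lookahead_ms(chunk_size_ms: int) -> int:
    """Lookahead = chunk size minus the fixed 80 ms encoder step."""
    return chunk_size_ms - ENCODER_STEP_LENGTH

print(frames_per_chunk(160))  # 2560 samples per 160 ms chunk
print(lookahead_ms(160))      # 80 ms lookahead
```

Note that the chunk sizes in the model table above (160, 540, 1120 ms) each correspond to their model's advertised lookahead via this relation.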
- WebSocket Server URL: change the WebSocket server address in `server.py` by modifying the `WS_URL` variable.
- Audio Parameters: adjust `SAMPLE_RATE`, `chunk_size`, and `ENCODER_STEP_LENGTH` in `client.py` to fine-tune the audio streaming behavior.
- If NeMo installation fails, refer to the NeMo Installation Guide for specific dependencies and troubleshooting steps.
- Ensure your WebSocket server is running and accessible at the correct URL.
- If no audio input devices are found, check your microphone settings and ensure that `pyaudio` is correctly installed.
This project is open-source. Feel free to modify and improve it!
- The WebSocket server handling ASR is expected to be running separately.
