The PengChengStarling project is a multilingual ASR system development toolkit built upon the icefall project. Compared to the original icefall, it incorporates several task-specific optimizations for ASR. Firstly, we replaced the recipe-based approach with a more flexible design, decoupling parameter configurations from functional code. This allows a unified codebase to support ASR tasks across multiple languages. Secondly, we integrated language IDs into the RNN-Transducer architecture, significantly improving the performance of multilingual ASR systems.
To evaluate the capabilities of PengChengStarling, we developed a multilingual streaming ASR model supporting eight languages: Chinese, English, Russian, Vietnamese, Japanese, Thai, Indonesian, and Arabic. The model was trained on approximately 2,000 hours of audio per language, primarily sourced from open datasets. It achieves comparable or superior streaming ASR performance to Whisper-Large v3 in six of these languages while being only 20% of its size, and its inference is roughly 7x faster.
| Language | Testset | Whisper-Large v3 | Ours |
|---|---|---|---|
| Chinese | wenetspeech test meeting | 22.99 | 22.67 |
| Vietnamese | gigaspeech2-vi test | 17.94 | 7.09 |
| Japanese | reazonspeech test | 16.3 | 13.34 |
| Thai | gigaspeech2-th test | 20.44 | 17.39 |
| Indonesian | gigaspeech2-id test | 20.03 | 20.54 |
| Arabic | mgb2 test | 30.3 | 24.37 |
Our model checkpoint is open-sourced on Hugging Face and provided in two formats: PyTorch state dict for fine-tuning and testing, and ONNX format for deployment.
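If you want to start from the released weights, they can be fetched programmatically. The sketch below uses `huggingface_hub`; the repository ID and file names are placeholders, so substitute the actual ones from the model card.

```python
from huggingface_hub import hf_hub_download

# repo_id and filenames are placeholders; check the model card for the real ones.
ckpt_path = hf_hub_download(
    repo_id="PengCheng/Starling-ASR",  # hypothetical repository ID
    filename="pretrained.pt",          # PyTorch state dict for fine-tuning/testing
)
onnx_path = hf_hub_download(
    repo_id="PengCheng/Starling-ASR",
    filename="encoder.onnx",           # ONNX model for deployment
)
print(ckpt_path, onnx_path)
```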
Please refer to the document for installation instructions. If the installation test is successful, PengChengStarling is ready for use.
Before starting the training process, you must first preprocess the raw data into the required input format. Typically, this involves adapting the `make_*_list` method in `zipformer/prepare.py` to your dataset to generate the `data.list` file.
The subsequent steps are applicable to all datasets. Once completed, the script will produce the corresponding cuts and fbank features for each dataset, which serve as the input data for PengChengStarling.
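As a rough illustration of what such an adaptation might look like, the sketch below walks a hypothetical corpus of `.wav`/`.txt` pairs and writes one JSON line per utterance. The field names and corpus layout are assumptions, not the schema `prepare.py` actually expects, so adjust them to match your data.

```python
import json
from pathlib import Path

def make_mycorpus_list(corpus_dir: str, output_path: str, lang: str = "zh") -> None:
    """Walk a (hypothetical) corpus of <utt>.wav / <utt>.txt pairs and write one
    JSON entry per utterance. The field names are illustrative only; match them
    to whatever zipformer/prepare.py actually expects."""
    corpus = Path(corpus_dir)
    with open(output_path, "w", encoding="utf-8") as fout:
        for wav in sorted(corpus.rglob("*.wav")):
            txt = wav.with_suffix(".txt")
            if not txt.exists():
                continue  # skip utterances without a transcript
            entry = {
                "audio": str(wav),
                "text": txt.read_text(encoding="utf-8").strip(),
                "language": lang,
            }
            fout.write(json.dumps(entry, ensure_ascii=False) + "\n")

make_mycorpus_list("corpora/my_dataset", "data/my_dataset/data.list", lang="zh")
```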
The script's parameters are configured through YAML files located in the `config_data` directory. Once you have prepared the YAML file for your dataset, you can run the script using the following command:
export CUDA_VISIBLE_DEVICES="0"
python zipformer/prepare.py --config-file config_data/<your_dataset_config>.yaml
For more details about cut, please refer to the document.
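Since the toolkit builds on icefall, the cuts are Lhotse manifests. Assuming the prepare step wrote a manifest like the one below (the file name is an assumption), you can inspect the cuts and their fbank features directly:

```python
from lhotse import CutSet

# The manifest path is an assumption; use whatever name your prepare step produced.
cuts = CutSet.from_file("data/fbank/my_dataset_cuts_train.jsonl.gz")

for cut in cuts.subset(first=3):
    sup = cut.supervisions[0]
    feats = cut.load_features()  # precomputed fbank, shape: (num_frames, num_mel_bins)
    print(f"{cut.id}: {cut.duration:.2f}s, lang={sup.language}, "
          f"feats={feats.shape}, text={sup.text}")
```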
A multilingual ASR model must be capable of generating transcriptions in multiple languages. To achieve this, the BPE model should be trained on text data from various languages. Since transcriptions are embedded in cuts, you can use the training cuts from all the target languages to train the BPE model.
The script to train the BPE model is located in `zipformer/prepare_bpe.py`. To run the script, you need to first configure the parameters in a YAML file and place it in the `config_bpe` directory. Please note that the `langtags` parameter should cover all target languages. For non-space-delimited languages such as Chinese, Japanese, and Thai, word segmentation should be performed before BPE training to assist the process. Once the YAML file for your BPE model is prepared, you can run the script with the following command:
python zipformer/prepare_bpe.py --config-file config_bpe/<your_bpe_config>.yaml
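To make the two ideas above concrete, here is a minimal sketch of pre-segmenting Chinese text and reserving language tags as whole BPE tokens. It assumes a SentencePiece BPE model (as icefall-style recipes typically use), uses `jieba` purely as an example segmenter, and the tag spellings, vocabulary size, and file paths are placeholders; `prepare_bpe.py` handles all of this for you.

```python
import sentencepiece as spm
import jieba  # example segmenter for Chinese; Japanese/Thai need their own tools

# Pre-segment non-space-delimited text so BPE sees word-like units (illustrative).
with open("data/bpe/text_zh.txt", encoding="utf-8") as fin, \
     open("data/bpe/text_zh_seg.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        fout.write(" ".join(jieba.cut(line.strip())) + "\n")

# Train a multilingual BPE model with language tags reserved as whole tokens.
# The tag spellings and vocab size are placeholders.
langtags = ["<zh>", "<en>", "<ru>", "<vi>", "<ja>", "<th>", "<id>", "<ar>"]
spm.SentencePieceTrainer.train(
    input="data/bpe/text_all_languages.txt",  # concatenated multilingual transcripts
    model_prefix="data/bpe/bpe_2000",
    vocab_size=2000,
    model_type="bpe",
    user_defined_symbols=langtags,
)
```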
We adopt a Transducer architecture with Zipformer as the Encoder for our multilingual ASR model. To mitigate cross-lingual interference, we explicitly specify the target language by replacing the `<SOS>` token with the corresponding `<langtag>` token in the Decoder.
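Conceptually, this means the prediction network's input sequence starts with the language tag's token ID rather than a generic start symbol. The sketch below illustrates the idea only; the token IDs, context size, and helper name are hypothetical, and the real logic lives in the toolkit's model code.

```python
import torch

def build_decoder_input(token_ids, langtag_id, context_size=2, blank_id=0):
    """Illustrative only: start the prediction-network input with the language
    tag instead of a generic <SOS>/blank, so decoding is conditioned on the
    target language."""
    # Left-pad with blanks to fill the decoder context, then the langtag, then labels.
    ys = [blank_id] * (context_size - 1) + [langtag_id] + list(token_ids)
    return torch.tensor(ys, dtype=torch.int64)

# e.g. langtag_id is the BPE id of "<zh>" (placeholder spelling)
print(build_decoder_input([15, 87, 203], langtag_id=5))
```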
The script for training the multilingual ASR model is located in `zipformer/train.py`. Before running the script, you need to configure the training parameters in YAML format. An example configuration is provided in the `config_train` directory.
The parameters for model training are categorized into six sections: record-related, data-related, training-related, finetuning-related, model-related, and decoding-related. Each section includes both active and default parameters. Active parameters frequently change depending on the specific training setup and should be the main focus. Default parameters, on the other hand, typically remain unchanged.
After setting up the parameters, you can begin training as follows:
export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"
python zipformer/train.py --config-file config_train/<your_train_config>.yaml
To fine-tune the model, set the `do_finetune` parameter to `true`. This will load the pretrained checkpoint and fine-tune it using the specified fine-tuning dataset. Keep in mind that fine-tuning is highly sensitive to both the data and the learning rate: to improve performance on a specific domain, ensure that the training data has a distribution consistent with the test set, and use a learning rate that is not too large. Since fine-tuning uses the same configuration file as training, you can execute the fine-tuning process with the same command as above.
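In effect, `do_finetune` initializes the model from the pretrained weights instead of training from scratch and then continues training with a small learning rate. The generic PyTorch sketch below illustrates that idea with a stand-in model; the checkpoint path, wrapping key, and learning rate are assumptions, and the actual loading is done by `train.py` from the YAML config.

```python
import torch
import torch.nn as nn

# Stand-in model; in the toolkit this would be the Zipformer transducer
# constructed by zipformer/train.py from the YAML config.
model = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 512))

# What do_finetune effectively does: initialize from the pretrained weights
# (path is hypothetical) instead of training from scratch.
checkpoint = torch.load("exp/pretrained.pt", map_location="cpu")
state_dict = checkpoint.get("model", checkpoint)  # some checkpoints wrap the weights
model.load_state_dict(state_dict, strict=False)   # tolerate renamed/extra keys

# Fine-tuning is sensitive to the learning rate: start small.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
```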
Once the training process is complete, you should evaluate the performance of the multilingual ASR model on the test set. The evaluation uses the same configuration file as the training. For a streaming model, you can use the following command for evaluation:
export CUDA_VISIBLE_DEVICES="0"
python zipformer/streaming_decode.py \
--epoch 43 \
--avg 15 \
--config-file config_train/<your_train_config>.yaml
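The `--epoch`/`--avg` pair follows the usual icefall convention of averaging the weights of several consecutive epoch checkpoints before decoding. As a rough illustration (the checkpoint file names and the exact averaging strategy are assumptions), the averaging could look like this:

```python
import torch

def average_checkpoints(paths):
    """Average model weights across several epoch checkpoints (a simplified
    sketch of what --epoch/--avg selects; icefall's own averaging utilities
    are more involved)."""
    avg = None
    for path in paths:
        ckpt = torch.load(path, map_location="cpu")
        state = ckpt.get("model", ckpt)
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k in avg:
                avg[k] += state[k].float()
    return {k: v / len(paths) for k, v in avg.items()}

# --epoch 43 --avg 15 corresponds (roughly) to averaging epochs 29..43.
paths = [f"exp/epoch-{e}.pt" for e in range(29, 44)]
averaged = average_checkpoints(paths)
```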
If you have evaluated multiple checkpoints, the best results can be obtained by running `local/get_best_results.py`. To export the checkpoint with the best results for fine-tuning (e.g., epoch 43, avg 15), use the following command:
python zipformer/export.py \
--epoch 43 \
--avg 15 \
--config-file config_train/<your_train_config>.yaml
To export the best checkpoint in ONNX format for deployment, use the following command:
python zipformer/export-onnx-streaming.py \
--epoch 43 \
--avg 15 \
--config-file config_train/<your_train_config>.yaml
To deploy the ONNX checkpoint, please refer to this repository.
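Before deployment, you can sanity-check the exported model with `onnxruntime`. The file name below is a placeholder (a streaming transducer export typically produces separate encoder, decoder, and joiner models), and the full streaming runtime with chunking and state handling belongs to the deployment repository referenced above.

```python
import onnxruntime as ort

# The file name is a placeholder; a streaming transducer export usually yields
# separate encoder/decoder/joiner ONNX files.
session = ort.InferenceSession("exp/encoder.onnx", providers=["CPUExecutionProvider"])

print("inputs:")
for inp in session.get_inputs():
    print(" ", inp.name, inp.shape, inp.type)
print("outputs:")
for out in session.get_outputs():
    print(" ", out.name, out.shape, out.type)
```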