This model is also hosted on Hugging Face for easy access and inference.
If you find this project helpful or inspiring, please consider giving it a star 🌟 on GitHub!
DeepVoiceGuard is a robust solution for detecting spoofed audio in Automatic Speaker Verification (ASV) systems. This project utilizes the RawNet2 model, trained on the ASVspoof 2019 dataset, and deploys the trained model using FastAPI for real-time inference. The repository also includes ONNX model conversion and sliding window inference for efficient processing of long audio files.
- Introduction
- Features
- Dataset
- Model Training
- Model Conversion
- Sliding Window Inference
- Deployment
- How to Use
- Results
- References
Automatic Speaker Verification (ASV) systems are vulnerable to spoofing attacks, such as voice conversion and speech synthesis. DeepVoiceGuard leverages the RawNet2 architecture to detect and mitigate such attacks. The project also includes a FastAPI-based deployment for real-time inference, ensuring practical usability.
- Training the RawNet2 model on the ASVspoof 2019 dataset.
- Conversion of the PyTorch-trained model to ONNX format for efficient deployment.
- Real-time inference via FastAPI.
- Sliding window inference for processing long audio files.
- Supports both genuine and spoofed audio classification.
The ASVspoof 2019 dataset is a benchmark dataset for ASV anti-spoofing tasks. It includes genuine (bonafide) and spoofed audio samples split into training, development, and evaluation partitions. Preprocessing applied in this project:
- Audio normalization.
- Log-Mel spectrogram extraction.
- Label encoding for binary classification (genuine/spoofed).
For more details, visit the ASVspoof 2019 official website.
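As a rough illustration of these preprocessing steps, here is a minimal sketch (the function name, `n_mels`, and the exact label encoding are assumptions, not the repository's code):

```python
import librosa
import numpy as np

LABELS = {"bonafide": 0, "spoof": 1}  # assumed binary label encoding

def preprocess(path: str, sr: int = 16000, n_mels: int = 64) -> np.ndarray:
    """Load an utterance, peak-normalize it, and extract a log-Mel spectrogram."""
    audio, _ = librosa.load(path, sr=sr, mono=True)
    audio = audio / (np.max(np.abs(audio)) + 1e-9)  # peak normalization
    mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel)  # log-Mel spectrogram
```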
- Model: RawNet2
- Optimizer: Adam
- Loss Function: Binary Cross-Entropy (BCE)
- Batch Size: 32
- Learning Rate Scheduler: Cosine Annealing
- Epochs: 50
The training process is handled by main.py. Run the following command to train the model:
python main.py --data_path LA --epochs 50 --batch_size 32
- Run TensorBoard to visualize logs:
tensorboard --logdir=path/to/logs
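Pieced together from the settings above, the core of the training loop likely resembles the following sketch (the `RawNet` constructor arguments, the learning rate, and `train_loader` are assumptions; main.py is the authoritative implementation):

```python
import torch
from torch import nn, optim

from model import RawNet

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = RawNet(...).to(device)  # construct with the project's model config

criterion = nn.BCEWithLogitsLoss()                   # binary cross-entropy on raw logits
optimizer = optim.Adam(model.parameters(), lr=1e-4)  # assumed learning rate
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)

for epoch in range(50):
    model.train()
    for audio, labels in train_loader:  # assumed DataLoader yielding (waveform, 0/1 label)
        audio, labels = audio.to(device), labels.float().to(device)
        optimizer.zero_grad()
        logits = model(audio).squeeze(-1)
        loss = criterion(logits, labels)
        loss.backward()
        optimizer.step()
    scheduler.step()  # cosine annealing of the learning rate, once per epoch
```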
After training, the PyTorch model was converted to the ONNX format for deployment. The conversion process is as follows:
```python
import torch
from model import RawNet


def export_to_onnx(model, onnx_path, device):
    """Export the RawNet model to ONNX format."""
    model.to(device).eval()  # inference mode: freezes dropout and batch-norm statistics
    dummy_input = torch.randn(1, 64600, device=device)  # dummy input tensor for tracing
    torch.onnx.export(
        model,
        dummy_input,
        onnx_path,
        export_params=True,        # store the trained parameters in the graph
        opset_version=11,          # ONNX opset version
        do_constant_folding=True,  # optimize constants
        input_names=['input'],     # input tensor name
        output_names=['output'],   # output tensor name
        dynamic_axes={'input': {0: 'batch_size'}, 'output': {0: 'batch_size'}},  # dynamic batch size
    )
    print(f"Model successfully exported to {onnx_path}")
```
The project includes a sliding window inference method to process long audio files efficiently. This method divides the audio into smaller, fixed-length segments, processes each segment, and aggregates the results using majority voting and average probabilities. This approach ensures faster and more accurate predictions for long audio inputs.
- Processes audio in fixed-length windows (e.g., 64600 samples).
- Pads or trims audio segments as required.
- Aggregates predictions using majority voting.
- Computes average confidence probabilities.
The predict_with_sliding_window function in inference_onnx.py handles the inference.
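For intuition, the aggregation might look roughly like the sketch below (the softmax step and the class-index convention 0 = Real / 1 = Fake are assumptions; see inference_onnx.py for the actual implementation):

```python
import librosa
import numpy as np
import onnxruntime as ort

WINDOW = 64600  # fixed window length in samples (~4 s at 16 kHz)

def predict_with_sliding_window(onnx_path: str, audio_path: str, window: int = WINDOW):
    """Classify a long audio file window by window, then aggregate."""
    session = ort.InferenceSession(onnx_path)
    input_name = session.get_inputs()[0].name
    audio, _ = librosa.load(audio_path, sr=16000, mono=True)

    votes, confidences = [], []
    for start in range(0, max(len(audio), 1), window):
        segment = audio[start:start + window]
        if len(segment) < window:  # pad the final short segment with zeros
            segment = np.pad(segment, (0, window - len(segment)))
        x = segment[np.newaxis, :].astype(np.float32)
        logits = session.run(None, {input_name: x})[0][0]
        e = np.exp(logits - logits.max())    # numerically stable softmax
        probs = e / e.sum()
        votes.append(int(np.argmax(probs)))  # assumed: 0 = Real, 1 = Fake
        confidences.append(float(probs.max()))

    label = "Real" if votes.count(0) >= votes.count(1) else "Fake"  # majority vote
    return label, float(np.mean(confidences))  # average confidence across windows
```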
Here is how you can perform inference with the ONNX model and a sliding window:
python inference_onnx.py --model_path <path_to_model.onnx> --audio_path <path_to_audio_file>
Segment 1: Real, Probability: 98.34%
Segment 2: Fake, Probability: 87.42%
Segment 3: Real, Probability: 92.15%
...
Final Result: Real, Average Probability: 94.56%
The ONNX model was deployed using FastAPI for real-time inference. The FastAPI server accepts audio files as input, processes them, and returns classification results (genuine or spoofed).
- Install required packages (python-multipart is needed by FastAPI for file uploads):
pip install fastapi uvicorn onnxruntime librosa python-multipart
- Save the following script as app.py:

```python
from fastapi import FastAPI, File, UploadFile
import onnxruntime as ort
import numpy as np
import librosa

app = FastAPI()
session = ort.InferenceSession("model.onnx")

@app.post("/predict/")
async def predict(audio_file: UploadFile = File(...)):
    # Load the uploaded file as 16 kHz mono audio.
    audio, sr = librosa.load(audio_file.file, sr=16000)
    # Pad or trim to the 64600-sample length the model was exported with.
    if len(audio) < 64600:
        audio = np.pad(audio, (0, 64600 - len(audio)))
    audio = np.expand_dims(audio[:64600], axis=0).astype(np.float32)
    input_data = {session.get_inputs()[0].name: audio}
    output = session.run(None, input_data)
    prediction = "Genuine" if np.argmax(output[0]) == 0 else "Spoofed"
    return {"prediction": prediction}
```
- Run the server:
uvicorn app:app --host 0.0.0.0 --port 8000
- POST /predict/
  - Input: audio file (WAV format, 16 kHz, mono).
  - Output: JSON with the prediction result (Genuine or Spoofed).
- Train the model using the provided script.
- Convert the trained model to ONNX format.
- Deploy the model using FastAPI.
- Test the deployment by sending audio files to the /predict/ endpoint.
Example using curl (the multipart field name must match the endpoint's audio_file parameter):
curl -X POST "http://localhost:8000/predict/" -F "audio_file=@sample.wav"
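The same request from Python (a sketch; assumes the server is running locally and a sample.wav file exists):

```python
import requests

with open("sample.wav", "rb") as f:
    # The multipart field name must match the endpoint's `audio_file` parameter.
    response = requests.post(
        "http://localhost:8000/predict/",
        files={"audio_file": ("sample.wav", f, "audio/wav")},
    )
print(response.json())  # e.g. {"prediction": "Genuine"}
```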
| Metric | Development Set | Evaluation Set |
|---|---|---|
| EER (%) | 4.21 | 5.03 |
| Accuracy (%) | 95.8 | 94.7 |
| ROC-AUC | 0.986 | 0.975 |

EER (Equal Error Rate) is the operating point at which the false acceptance and false rejection rates are equal; lower is better.