The goal of this project is to build a voice recording digit classifier using the spoken digits dataset available here.
The classifier takes any WAV recording as input and performs speech/keyword detection, outputting which digit is being spoken in the recording.
The key part is using computer-vision techniques, such as a convolutional neural network, to perform the prediction. By leveraging the power and utility of mel spectrograms, as seen in the figure below, we are able to teach the machine to "see the sound recording" (a form of synesthesia).
Create a virtual environment (preferable, to avoid any dependency problems) and install the required packages listed in the requirements.txt file.
pip install -r requirements.txt
Once all the required packages are installed, open the spoken_digits notebook and follow the instructions to either train the model from scratch or load and test the trained model.
To prepare the dataset, first download it with tensorflow_datasets and split it as needed, as sketched below.
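A minimal loading sketch, assuming the TFDS name of the dataset is "spoken_digit" and carving the single train split into train/validation/test (adjust the percentages to your needs):

```python
# Download the spoken digits dataset and split it (assumed TFDS name: "spoken_digit").
import tensorflow_datasets as tfds

(train_ds, val_ds, test_ds), ds_info = tfds.load(
    "spoken_digit",
    split=["train[:70%]", "train[70%:85%]", "train[85%:]"],
    as_supervised=True,   # yields (audio, label) pairs
    with_info=True,
)
print(ds_info.features)
```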
Take a moment to explore the dataset, after reading more about it, using the utils.py functions or on your own.
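For example, continuing from the loading sketch above and without relying on utils.py, you can peek at a few raw examples:

```python
# Inspect a few raw (audio, label) pairs: the spoken digit and the waveform length.
for audio, label in train_ds.take(3):
    print("digit:", int(label), "| samples:", audio.shape[0], "| dtype:", audio.dtype)
```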
The preprocess.py module will be particularly useful for converting all the recordings to uniformly sized spectrogram images that will serve as input to the CNN model.
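A minimal sketch of that kind of preprocessing (the helper name, sample rate, and spectrogram size are assumptions, not necessarily what preprocess.py does): each recording is padded or trimmed to a fixed length, then converted to a mel spectrogram with a channel axis for the CNN.

```python
import numpy as np
import librosa

SAMPLE_RATE = 8000        # assumption: the spoken digit recordings are 8 kHz
FIXED_LEN = SAMPLE_RATE   # pad/trim every clip to one second

def to_mel_spectrogram(waveform: np.ndarray) -> np.ndarray:
    """Hypothetical helper: waveform -> fixed-size mel spectrogram image."""
    waveform = librosa.util.fix_length(waveform, size=FIXED_LEN)
    mel = librosa.feature.melspectrogram(y=waveform, sr=SAMPLE_RATE, n_mels=64)
    mel_db = librosa.power_to_db(mel, ref=np.max)
    return mel_db[..., np.newaxis]  # shape: (n_mels, time_frames, 1)
```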
Using the Keras library interface, we then build a sequential model with multiple Conv2D + MaxPooling blocks.
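An illustrative sketch of such a model (the layer sizes are assumptions, not the notebook's exact architecture; the input shape must match your spectrogram size):

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(64, 16, 1)),   # (n_mels, time_frames, channels); adjust to your data
    layers.Conv2D(32, 3, activation="relu", padding="same"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu", padding="same"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),   # one output per digit 0-9
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```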
Feel free to customize the layers and hyperparameters to see if you can get a higher accuracy!
When satisfied, save your model as done in the notebook. You will be able to load it again and use it for prediction. You can also add a checkpoint callback so that you can continue training it later.
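A sketch of saving, reloading, and checkpointing (the file names are assumptions):

```python
# Save the trained model and load it back for prediction.
model.save("spoken_digits_model.keras")
restored = keras.models.load_model("spoken_digits_model.keras")

# Optional: a ModelCheckpoint callback keeps the best weights so training can resume later.
checkpoint = keras.callbacks.ModelCheckpoint(
    "checkpoints/best.keras", monitor="val_accuracy", save_best_only=True)
# model.fit(train_ds, validation_data=val_ds, epochs=20, callbacks=[checkpoint])
```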
Finally, you can test the model on your own recordings for fun.
As described in the last part of the notebook, upload your sample recordings into the directory and load them with librosa.load('path_to_filename', sr=sample_rate), specifying the sample rate accordingly.
Then load your model, test it, and enjoy :).
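An end-to-end prediction sketch (the file names are placeholders, and to_mel_spectrogram is the hypothetical helper sketched earlier):

```python
import numpy as np
import librosa
from tensorflow import keras

# Load your recording, resampling to the rate the model was trained on (assumed 8 kHz).
waveform, _ = librosa.load("my_recording.wav", sr=8000)
spec = to_mel_spectrogram(waveform)

model = keras.models.load_model("spoken_digits_model.keras")
probs = model.predict(spec[np.newaxis, ...])   # add a batch dimension
print("Predicted digit:", int(np.argmax(probs)))
```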