Exploratory data analysis and data preparation scripts for model training, run on the `ug-language-parallel-text` dataset.
- Concatenate any new chunks of data into the publicly available JSON dataset by running the `Data EDA and cleaning.ipynb` notebook (a conceptual sketch of this merge step follows the list)
- Create the model training dataset files and put them in the right folder structure by running the `Model Training Data Prep.ipynb` and `Multilingual Data Prep.ipynb` notebooks, in that order
- Upload the dataset to the `sunbird-translate` bucket on AWS S3 (as a versioned dataset, for example `v4-dataset.zip`) and make sure the resource is public (see the `boto3` sketch below)
- Update the dataset URL in the SunbirdAI/datasets repository with the new dataset link (as shown in the next section of this README). The dataset loading code picks up the data from the S3 bucket whose link we added as the dataset URL
- Run the SunbirdAI language model training notebook on AWS SageMaker. The training notebook refers to SunbirdAI/datasets for the datasets to be used in training
- Save the checkpoints and upload them to the `sunbird-translate` AWS S3 bucket, in the `models` folder
- Upload the models to Hugging Face (see the `huggingface_hub` sketch below)
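Conceptually, the concatenation step amounts to merging new chunk files into the main public JSON file. This is only a sketch of the idea, not the notebook's actual code; it assumes the dataset is a JSON array of records, and the file names here are illustrative:

```python
import json
from pathlib import Path

# Merge any new chunk files into the main public JSON dataset.
# "dataset.json" and "new_chunks/" are illustrative names, and the dataset
# is assumed to be a top-level JSON array of records.
dataset = json.loads(Path("dataset.json").read_text(encoding="utf-8"))

for chunk_path in sorted(Path("new_chunks").glob("*.json")):
    dataset.extend(json.loads(chunk_path.read_text(encoding="utf-8")))

Path("dataset.json").write_text(
    json.dumps(dataset, ensure_ascii=False, indent=2), encoding="utf-8"
)
```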
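For the S3 upload step, a minimal `boto3` sketch; the bucket and archive names follow the example above, AWS credentials are assumed to already be configured, and the bucket must permit public ACLs for the object to become publicly readable:

```python
import boto3

s3 = boto3.client("s3")

bucket = "sunbird-translate"
archive = "v4-dataset.zip"  # versioned dataset archive, as in the example above

# Upload the archive and mark it publicly readable so the dataset loading
# code can download it without credentials.
s3.upload_file(
    Filename=archive,  # local path to the zipped dataset
    Bucket=bucket,
    Key=archive,
    ExtraArgs={"ACL": "public-read"},
)

print(f"Public URL: https://{bucket}.s3.amazonaws.com/{archive}")
```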
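For the Hugging Face step, one way to publish the trained model is with the `huggingface_hub` client. The repo id and checkpoint path below are placeholders rather than the project's actual names, and a prior `huggingface-cli login` (or an `HF_TOKEN`) is assumed:

```python
from huggingface_hub import HfApi

api = HfApi()

repo_id = "SunbirdAI/example-translation-model"  # placeholder repo id

# Create the model repo if it does not exist yet, then push the whole
# checkpoint directory in a single call.
api.create_repo(repo_id=repo_id, exist_ok=True)
api.upload_folder(
    folder_path="models/checkpoint-best",  # placeholder checkpoint folder
    repo_id=repo_id,
)
```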
Find the `_URL` constant in the `datasets/sunbird/sunbird.py` file on the `init-sunbird-dataset` branch of the SunbirdAI/datasets repository.
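In practice the update is a small edit: point `_URL` at the new versioned archive. A sketch, assuming the standard public S3 URL form for the bucket and archive named above:

```python
# datasets/sunbird/sunbird.py (init-sunbird-dataset branch)
# The loading script downloads and extracts the archive at _URL, so for a
# new dataset version this constant is updated to the new link.
_URL = "https://sunbird-translate.s3.amazonaws.com/v4-dataset.zip"
```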
The image below shows an example of this:
