Exploratory data analysis and data preparation scripts for model training, run on the `ug-language-parallel-text` dataset.
- Concatenate any new chunks of data into the publicly available JSON dataset by running the `Data EDA and cleaning.ipynb` notebook (a conceptual sketch of this merge step follows the list)
- Create the model training dataset files and put them in the right folder structure by running the `Model Training Data Prep.ipynb` and `Multilingual Data Prep.ipynb` notebooks, in that order
- Upload the dataset to the `sunbird-translate` bucket on AWS S3 (as a versioned dataset, for example `v4-dataset.zip`) and make sure the resource is public (see the `boto3` sketch below)
- Update the dataset URL in the SunbirdAI/datasets repository with the new dataset link (as shown in the next section of this README). The dataset loading code picks up the data from the S3 bucket whose link we added as the dataset URL
- Run the SunbirdAI language model training notebook on AWS SageMaker. The training notebook refers to SunbirdAI/datasets for the datasets to be used in training
- Save the checkpoints and upload them to the `sunbird-translate` AWS S3 bucket, in the `models` folder
- Upload the models to Hugging Face (see the `huggingface_hub` sketch below)
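Conceptually, the concatenation step amounts to merging new chunk files into the main public JSON file. This is only a sketch of the idea, not the notebook's actual code; it assumes the dataset is a JSON array of records, and the file names here are illustrative:

```python
import json
from pathlib import Path

# Merge any new chunk files into the main public JSON dataset.
# "dataset.json" and "new_chunks/" are illustrative names, and the dataset
# is assumed to be a top-level JSON array of records.
dataset = json.loads(Path("dataset.json").read_text(encoding="utf-8"))

for chunk_path in sorted(Path("new_chunks").glob("*.json")):
    dataset.extend(json.loads(chunk_path.read_text(encoding="utf-8")))

Path("dataset.json").write_text(
    json.dumps(dataset, ensure_ascii=False, indent=2), encoding="utf-8"
)
```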
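For the S3 upload step, a minimal `boto3` sketch; the bucket and archive names follow the example above, AWS credentials are assumed to already be configured, and the bucket must permit public ACLs for the object to become publicly readable:

```python
import boto3

s3 = boto3.client("s3")

bucket = "sunbird-translate"
archive = "v4-dataset.zip"  # versioned dataset archive, as in the example above

# Upload the archive and mark it publicly readable so the dataset loading
# code can download it without credentials.
s3.upload_file(
    Filename=archive,  # local path to the zipped dataset
    Bucket=bucket,
    Key=archive,
    ExtraArgs={"ACL": "public-read"},
)

print(f"Public URL: https://{bucket}.s3.amazonaws.com/{archive}")
```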
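For the Hugging Face step, one way to publish the trained model is with the `huggingface_hub` client. The repo id and checkpoint path below are placeholders rather than the project's actual names, and a prior `huggingface-cli login` (or an `HF_TOKEN`) is assumed:

```python
from huggingface_hub import HfApi

api = HfApi()

repo_id = "SunbirdAI/example-translation-model"  # placeholder repo id

# Create the model repo if it does not exist yet, then push the whole
# checkpoint directory in a single call.
api.create_repo(repo_id=repo_id, exist_ok=True)
api.upload_folder(
    folder_path="models/checkpoint-best",  # placeholder checkpoint folder
    repo_id=repo_id,
)
```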
Find the `_URL` constant in the `datasets/sunbird/sunbird.py` file on the `init-sunbird-dataset` branch of the SunbirdAI/datasets repository.
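In practice the update is a small edit: point `_URL` at the new versioned archive. A sketch, assuming the standard public S3 URL form for the bucket and archive named above:

```python
# datasets/sunbird/sunbird.py (init-sunbird-dataset branch)
# The loading script downloads and extracts the archive at _URL, so for a
# new dataset version this constant is updated to the new link.
_URL = "https://sunbird-translate.s3.amazonaws.com/v4-dataset.zip"
```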
The image below shows an example of this:
