Data manipulation using the datasets library
Datasets are usually divided into several subsets, and the split parameter of load_dataset decides which subset(s), or which portion of a subset, is loaded. If split is None (the default), load_dataset returns a dataset dictionary containing all subsets (train, test, validation, or any other combination). If split is specified, it returns a single dataset rather than a dictionary. In the following example, we retrieve only the train split of the cola dataset:
cola_train = load_dataset('glue', 'cola', split='train')
We can get a mixture of the train and validation subsets as follows:
cola_sel = load_dataset('glue', 'cola',
                        split='train[:300]+validation[-30:]')
The split expression means that the first 300 examples of train and the last 30 examples of validation are obtained as cola_sel.
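The split expression reads much like ordinary Python slicing followed by concatenation. The following is only an analogy, sketched with plain lists standing in for the two subsets. The names train, validation, and selection, as well as the subset sizes, are made up for illustration and are not the real cola sizes, and this is not how the datasets library implements the expression:

```python
# Toy stand-ins for the two subsets (hypothetical sizes, not the real cola sizes).
train = [{"split": "train", "idx": i} for i in range(1000)]
validation = [{"split": "validation", "idx": i} for i in range(100)]

# Analogue of split='train[:300]+validation[-30:]':
# take the first 300 train rows, then append the last 30 validation rows.
selection = train[:300] + validation[-30:]

print(len(selection))   # 330
print(selection[0])     # {'split': 'train', 'idx': 0}
print(selection[-1])    # {'split': 'validation', 'idx': 99}
```

The order of the pieces in the expression is the order of the rows in the result: here all of the selected train rows come before the selected validation rows.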
We can apply different combinations, as shown in the following...