
Adding validation splits to (experimental) text_classification datasets that do not have vocabulary built over them #690

Closed

@bentrevett

🚀 Feature

The experimental text_classification datasets should provide a way to create a validation split without the vocabulary being built over it.

Motivation

In ML, you should always have training, validation, and test sets. In NLP, the vocabulary should be built from the training set only, not from the validation or test sets.

The current experimental text classification (IMDB) dataset has no validation set and automatically builds the vocabulary whilst loading the train/test sets. After loading the train and test sets, we would need to construct a validation set with torch.utils.data.random_split, but by then the vocabulary has already been built over the examples that will end up in the validation set. There is currently no way to avoid this.
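For concreteness, here is a minimal sketch of the workaround and where it leaks (assuming the experimental IMDB API, which returns a train/test pair by default):

import torch
from torchtext.experimental import datasets

# The vocabulary is built over the *entire* training set while loading.
train_data, test_data = datasets.IMDB()

# Splitting afterwards cannot undo that: the examples that end up in
# valid_data were already seen when the vocabulary was built.
n_train = int(len(train_data) * 0.8)
train_data, valid_data = torch.utils.data.random_split(
    train_data, [n_train, len(train_data) - n_train])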

Pitch

'valid' should be accepted as a data_select argument and should create a validation set before the vocabulary has been created over the training set. As the IMDB dataset does not have a standardized validation split, we can do something like taking the last 20% of the training set.

I am proposing something like the following after the iters_group is created here:

from itertools import islice, tee

if 'valid' in iters_group.keys():
    # Duplicate the training iterator: one copy to count examples,
    # one for the validation slice, one for the new training slice.
    train_iter_a, train_iter_b, train_iter_c = tee(iters_group['train'], 3)
    len_train = int(sum(1 for _ in train_iter_a) * 0.8)
    iters_group['valid'] = islice(train_iter_b, len_train, None)
    iters_group['train'] = islice(train_iter_c, 0, len_train)
    iters_group['vocab'] = islice(iters_group['vocab'], 0, len_train)

tee duplicates a generator and islice slices into one. We need three copies of the training iterator because it is used three times: the first is consumed to count the training examples, which tells us where the 80/20 boundary falls. islice then takes the last 20% of the training examples to form the validation set and the first 80% to use as the new training set. The "vocab" iterator is sliced to the same first 80% so that it matches the new training set, as that is what we want to build our vocab from.
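As a toy illustration of the tee/islice pattern (plain Python, independent of torchtext):

from itertools import islice, tee

gen = (i for i in range(10))
a, b, c = tee(gen, 3)                  # three independent copies

n = int(sum(1 for _ in a) * 0.8)       # consume one copy just to count: n == 8
valid_part = list(islice(b, n, None))  # last 20%  -> [8, 9]
train_part = list(islice(c, 0, n))     # first 80% -> [0, 1, ..., 7]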

We can now correctly load train, valid, and test sets, with the vocabulary built only over the training set:

from torchtext.experimental import datasets

train_data, valid_data, test_data = datasets.IMDB(data_select=('train', 'valid', 'test'))
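As a sanity check (assuming IMDB's usual 25,000 training examples and the 80/20 split above), the sizes would come out as:

print(len(train_data), len(valid_data), len(test_data))
# 20000 5000 25000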

We can also load a custom vocabulary built from the original vocabulary, like so (note that 'valid' needs to be in data_select when building the original vocabulary):

import os

from torchtext import vocab
from torchtext.experimental import datasets

def get_IMDB(root, tokenizer, vocab_max_size, vocab_min_freq):
    os.makedirs(root, exist_ok=True)

    # First pass: load only train/valid so the vocabulary is built
    # over the reduced (80%) training set.
    train_data, _ = datasets.IMDB(tokenizer=tokenizer,
                                  data_select=('train', 'valid'))

    old_vocab = train_data.get_vocab()

    # Re-build the vocabulary with the desired size/frequency constraints.
    new_vocab = vocab.Vocab(old_vocab.freqs,
                            max_size=vocab_max_size,
                            min_freq=vocab_min_freq)

    train_data, valid_data, test_data = datasets.IMDB(tokenizer=tokenizer,
                                                      vocab=new_vocab,
                                                      data_select=('train', 'valid', 'test'))

    return train_data, valid_data, test_data
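Example usage, with hypothetical hyperparameter values:

from torchtext.data.utils import get_tokenizer

tokenizer = get_tokenizer('basic_english')
train_data, valid_data, test_data = get_IMDB('.data', tokenizer,
                                             vocab_max_size=25_000,
                                             vocab_min_freq=2)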

Happy to make the PR if this is given the go-ahead.
