Import torchtext #1325 57a1df3

mthrok · facebook-github-bot · commit c56bfbd5b74c · 2021-06-09T10:22:19.000-07:00
Reviewed By: NicolasHug

Differential Revision: D28994054

fbshipit-source-id: 4c679f56ef37b18f6d2acaaaed8518facbeaa41c
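
Note on the ``build_vocab_from_iterator`` change in ``torchtext/vocab.py``: the diff deletes the special tokens from the counter before sorting, then sorts by token and re-sorts by frequency. The sketch below reproduces only those counting and ordering steps with the standard library, to show the apparent effect: tokens with equal counts come out in lexicographic order instead of first-seen order. The helper names ``order_tokens`` and ``order_tokens_old`` are hypothetical and are not part of torchtext. How the specials are re-inserted afterwards is not shown here.

```python
from collections import Counter


def order_tokens(token_lists, specials=None):
    """Hypothetical helper mirroring the new counting/ordering steps."""
    counter = Counter()
    for tokens in token_lists:
        counter.update(tokens)

    # Counter.__delitem__ does not raise for a missing key, so the
    # membership check used by the old code is unnecessary.
    if specials is not None:
        for tok in specials:
            del counter[tok]

    # Sort by token first, then stable-sort by descending frequency.
    # list.sort keeps equal-count items in their previous (alphabetical) order.
    pairs = sorted(counter.items(), key=lambda x: x[0])
    pairs.sort(key=lambda x: x[1], reverse=True)
    return pairs


def order_tokens_old(token_lists):
    """Hypothetical helper mirroring the old single sort by frequency."""
    counter = Counter()
    for tokens in token_lists:
        counter.update(tokens)
    # Ties keep the counter's insertion (first-seen) order.
    return sorted(counter.items(), key=lambda x: x[1], reverse=True)


data = [["d", "b", "a", "c", "a", "d"], ["c", "b", "a", "<unk>", "d"]]
print(order_tokens(data, specials=["<unk>", "<pad>"]))
# [('a', 3), ('d', 3), ('b', 2), ('c', 2)]
print(order_tokens_old(data))
# [('d', 3), ('a', 3), ('b', 2), ('c', 2), ('<unk>', 1)]
```

With the new ordering, the result depends only on the token counts and not on the order in which the iterator yields tokens.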
diff --git a/README.rst b/README.rst
@@ -15,20 +15,22 @@ This repository consists of:
 * `torchtext.datasets <https://github.com/pytorch/text/tree/master/torchtext/datasets>`_: The raw text iterators for common NLP datasets
 * `torchtext.data <https://github.com/pytorch/text/tree/master/torchtext/data>`_: Some basic NLP building blocks (tokenizers, metrics, functionals etc.)
 * `torchtext.nn <https://github.com/pytorch/text/tree/master/torchtext/nn>`_: NLP related modules
+* `torchtext.vocab <https://github.com/pytorch/text/tree/master/torchtext/vocab.py>`_: Vocab and Vectors related classes and factory functions
 * `examples <https://github.com/pytorch/text/tree/master/examples>`_: Example NLP workflows with PyTorch and torchtext library.
 
-Note: the legacy code discussed in `torchtext v0.7.0 release note <https://github.com/pytorch/text/releases/tag/v0.7.0-rc3>`_ has been retired to `torchtext.legacy <https://github.com/pytorch/text/tree/master/torchtext/legacy>`_ folder. Those legacy code will not be maintained by the development team, and we plan to fully remove them in the future release. See `torchtext.legacy <https://github.com/pytorch/text/tree/master/torchtext/legacy>`_ folder for more details.
+Note: The legacy code discussed in `torchtext v0.7.0 release note <https://github.com/pytorch/text/releases/tag/v0.7.0-rc3>`_ has been retired to `torchtext.legacy <https://github.com/pytorch/text/tree/master/torchtext/legacy>`_ folder. Those legacy code will not be maintained by the development team, and we plan to fully remove them in the future release. See `torchtext.legacy <https://github.com/pytorch/text/tree/master/torchtext/legacy>`_ folder for more details.
 
 Installation
 ============
 
-We recommend Anaconda as Python package management system. Please refer to `pytorch.org <https://pytorch.org/>`_ for the detail of PyTorch installation. The following is the corresponding ``torchtext`` versions and supported Python versions.
+We recommend Anaconda as a Python package management system. Please refer to `pytorch.org <https://pytorch.org/>`_ for the details of PyTorch installation. The following are the corresponding ``torchtext`` versions and supported Python versions.
 
 .. csv-table:: Version Compatibility
    :header: "PyTorch version", "torchtext version", "Supported Python version"
    :widths: 10, 10, 10
 
    nightly build, master, 3.6+
+   1.9, 0.10, 3.6+
    1.8, 0.9, 3.6+
    1.7, 0.8, 3.6+
    1.6, 0.7, 3.6+
@@ -93,7 +95,7 @@ Datasets
 The datasets module currently contains:
 
 * Language modeling: WikiText2, WikiText103, PennTreebank, EnWik9
-* Machine translation: IWSLT2016, IWSLT2017
+* Machine translation: IWSLT2016, IWSLT2017, Multi30k
 * Sequence tagging (e.g. POS/NER): UDPOS, CoNLL2000Chunking
 * Question answering: SQuAD1, SQuAD2 
 * Text classification: AG_NEWS, SogouNews, DBpedia, YelpReviewPolarity, YelpReviewFull, YahooAnswers, AmazonReviewPolarity, AmazonReviewFull, IMDB
@@ -113,15 +115,22 @@ For example, to access the raw text from the AG_NEWS dataset:
       >>> train_iter = AG_NEWS(split='train')
       >>> dataloader = DataLoader(train_iter, batch_size=8, shuffle=False)
 
-A tutorial for the end-to-end text classification workflow can be found in `PyTorch tutorial <https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html>`_
+Tutorials
+=========
+
+To get started with torchtext, users may refer to the following tutorials available on PyTorch website.
+
+* `Text classification with AG_NEWS dataset <https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html>`_
+* `Translation trained with Multi30k dataset using transformers and torchtext <https://pytorch.org/tutorials/beginner/translation_transformer.html>`_
+* `Language modeling using transforms and torchtext <https://pytorch.org/tutorials/beginner/transformer_tutorial.html>`_
+
 
 [Prototype] Experimental Code
 =============================
 
 We have re-written several building blocks under ``torchtext.experimental``:
 
 * `Transforms <https://github.com/pytorch/text/blob/master/torchtext/experimental/transforms.py>`_: some basic data processing building blocks
-* `Vocabulary <https://github.com/pytorch/text/blob/master/torchtext/experimental/vocab.py>`_: a vocabulary to numericalize tokens
 * `Vectors <https://github.com/pytorch/text/blob/master/torchtext/experimental/vectors.py>`_: the vectors to convert tokens into tensors.
 
 These prototype building blocks in the experimental folder are available in the nightly release only. The nightly packages are accessible via Pip and Conda for Windows, Mac, and Linux. For example, Linux users can install the nightly wheels with the following command::
@@ -133,7 +142,7 @@ For more detailed instructions, please refer to `Install PyTorch <https://pytorc
 [BC Breaking] Legacy
 ====================
 
-In v0.9.0 release, we move the following legacy code to `torchtext.legacy <https://github.com/pytorch/text/tree/master/torchtext/legacy>`_. This is part of the work to revamp the torchtext library and the motivation has been discussed in `Issue #664 <https://github.com/pytorch/text/issues/664>`_:
+In the v0.9.0 release, we moved the following legacy code to `torchtext.legacy <https://github.com/pytorch/text/tree/master/torchtext/legacy>`_. This is part of the work to revamp the torchtext library and the motivation has been discussed in `Issue #664 <https://github.com/pytorch/text/issues/664>`_:
 
 * ``torchtext.legacy.data.field``
 * ``torchtext.legacy.data.batch``
@@ -144,6 +153,8 @@ In v0.9.0 release, we move the following legacy code to `torchtext.legacy <https
 
 We have a `migration tutorial <https://colab.research.google.com/github/pytorch/text/blob/master/examples/legacy_tutorial/migration_tutorial.ipynb>`_ to help users switch to the torchtext datasets in ``v0.9.0`` release. For the users who still want the legacy components, they can add ``legacy`` to the import path.  
 
+In the v0.10.0 release, we retire the Vocab class to `torchtext.legacy <https://github.com/pytorch/text/tree/master/torchtext/legacy>`_. Users can still access the legacy Vocab via ``torchtext.legacy.vocab``. This class has been replaced by a Vocab module that is backed by efficient C++ implementation and provides common functional APIs for NLP workflows. 
+
 Disclaimer on Datasets
 ======================
 
diff --git a/test/data/test_builtin_datasets.py b/test/data/test_builtin_datasets.py
@@ -207,23 +207,14 @@ def test_next_method_dataset(self):
 
     def test_imdb(self):
         from torchtext.experimental.datasets import IMDB
-        from torchtext.legacy.vocab import Vocab
         # smoke test to ensure imdb works properly
         train_dataset, test_dataset = IMDB()
         self._helper_test_func(len(train_dataset), 25000, train_dataset[0][1][:10],
                                [13, 1568, 13, 246, 35468, 43, 64, 398, 1135, 92])
         self._helper_test_func(len(test_dataset), 25000, test_dataset[0][1][:10],
                                [13, 125, 1051, 5, 246, 1652, 8, 277, 66, 20])
 
-        # Test API with a vocab input object
-        old_vocab = train_dataset.get_vocab()
-        new_vocab = Vocab(counter=old_vocab.freqs, max_size=2500)
-        new_train_data, new_test_data = IMDB(vocab=new_vocab)
-
         # Add test for the subset of the standard datasets
-        train_dataset = IMDB(split='train')
-        self._helper_test_func(len(train_dataset), 25000, train_dataset[0][1][:10],
-                               [13, 1568, 13, 246, 35468, 43, 64, 398, 1135, 92])
         train_iter, test_iter = torchtext.datasets.IMDB()
         self._helper_test_func(len(train_iter), 25000, next(train_iter)[1][:25], 'I rented I AM CURIOUS-YEL')
         self._helper_test_func(len(test_iter), 25000, next(test_iter)[1][:25], 'I love sci-fi and am will')
@@ -241,8 +232,8 @@ def test_iwslt2017(self):
         de_vocab, en_vocab = train_dataset.get_vocab()
 
         def assert_nth_pair_is_equal(n, expected_sentence_pair):
-            de_sentence = [de_vocab.itos[index] for index in train_dataset[n][0]]
-            en_sentence = [en_vocab.itos[index] for index in train_dataset[n][1]]
+            de_sentence = [de_vocab.lookup_token(index) for index in train_dataset[n][0]]
+            en_sentence = [en_vocab.lookup_token(index) for index in train_dataset[n][1]]
 
             expected_de_sentence, expected_en_sentence = expected_sentence_pair
 
@@ -267,8 +258,8 @@ def test_iwslt2016(self):
         de_vocab, en_vocab = train_dataset.get_vocab()
 
         def assert_nth_pair_is_equal(n, expected_sentence_pair):
-            de_sentence = [de_vocab.itos[index] for index in train_dataset[n][0]]
-            en_sentence = [en_vocab.itos[index] for index in train_dataset[n][1]]
+            de_sentence = [de_vocab.lookup_token(index) for index in train_dataset[n][0]]
+            en_sentence = [en_vocab.lookup_token(index) for index in train_dataset[n][1]]
             expected_de_sentence, expected_en_sentence = expected_sentence_pair
 
             self.assertEqual(de_sentence, expected_de_sentence)
@@ -462,7 +453,6 @@ def test_conll_sequence_tagging(self):
 
     def test_squad1(self):
         from torchtext.experimental.datasets import SQuAD1
-        from torchtext.legacy.vocab import Vocab
         # smoke test to ensure imdb works properly
         train_dataset, dev_dataset = SQuAD1()
         context, question, answers, ans_pos = train_dataset[100]
@@ -472,16 +462,8 @@ def test_squad1(self):
         self._helper_test_func(len(dev_dataset), 10570, (question, ans_pos[0]),
                                ([42, 27, 669, 7438, 17, 2, 1950, 3273, 17252, 389, 16], [45, 48]))
 
-        # Test API with a vocab input object
-        old_vocab = train_dataset.get_vocab()
-        new_vocab = Vocab(counter=old_vocab.freqs, max_size=2500)
-        new_train_data, new_test_data = SQuAD1(vocab=new_vocab)
-
         # Add test for the subset of the standard datasets
         train_dataset = SQuAD1(split='train')
-        context, question, answers, ans_pos = train_dataset[100]
-        self._helper_test_func(len(train_dataset), 87599, (question[:5], ans_pos[0]),
-                               ([7, 24, 86, 52, 2], [72, 72]))
         train_iter, dev_iter = torchtext.datasets.SQuAD1()
         self._helper_test_func(len(train_iter), 87599, next(train_iter)[0][:50],
                                'Architecturally, the school has a Catholic charact')
@@ -491,7 +473,6 @@ def test_squad1(self):
 
     def test_squad2(self):
         from torchtext.experimental.datasets import SQuAD2
-        from torchtext.legacy.vocab import Vocab
         # smoke test to ensure imdb works properly
         train_dataset, dev_dataset = SQuAD2()
         context, question, answers, ans_pos = train_dataset[200]
@@ -501,16 +482,8 @@ def test_squad2(self):
         self._helper_test_func(len(dev_dataset), 11873, (question, ans_pos[0]),
                                ([41, 29, 2, 66, 17016, 30, 0, 1955, 16], [40, 46]))
 
-        # Test API with a vocab input object
-        old_vocab = train_dataset.get_vocab()
-        new_vocab = Vocab(counter=old_vocab.freqs, max_size=2500)
-        new_train_data, new_test_data = SQuAD2(vocab=new_vocab)
-
         # Add test for the subset of the standard datasets
         train_dataset = SQuAD2(split='train')
-        context, question, answers, ans_pos = train_dataset[200]
-        self._helper_test_func(len(train_dataset), 130319, (question[:5], ans_pos[0]),
-                               ([84, 50, 1421, 12, 5439], [9, 9]))
         train_iter, dev_iter = torchtext.datasets.SQuAD2()
         self._helper_test_func(len(train_iter), 130319, next(train_iter)[0][:50],
                                'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-Y')
diff --git a/torchtext/experimental/datasets/language_modeling.py b/torchtext/experimental/datasets/language_modeling.py
@@ -1,7 +1,7 @@
 import torch
 import logging
 from torchtext.data.utils import get_tokenizer
-from torchtext.legacy.vocab import build_vocab_from_iterator
+from torchtext.vocab import build_vocab_from_iterator
 from torchtext import datasets as raw
 from torchtext.experimental.datasets import raw as experimental_raw
 from torchtext.data.datasets_utils import _check_default_set
@@ -15,7 +15,9 @@ def apply_transforms(data):
         for line in data:
             tokens = transforms(line)
             yield tokens
-    return build_vocab_from_iterator(apply_transforms(data), len(data))
+    vocab = build_vocab_from_iterator(apply_transforms(data), specials=['<unk>', '<pad>'])
+    vocab.set_default_index(vocab['<unk>'])
+    return vocab
 
 
 class LanguageModelingDataset(torch.utils.data.Dataset):
diff --git a/torchtext/experimental/datasets/question_answer.py b/torchtext/experimental/datasets/question_answer.py
@@ -1,7 +1,7 @@
 import torch
 import logging
 from torchtext.data.utils import get_tokenizer
-from torchtext.legacy.vocab import build_vocab_from_iterator
+from torchtext.vocab import build_vocab_from_iterator
 from torchtext import datasets as raw
 from torchtext.data.datasets_utils import _check_default_set
 from torchtext.data.datasets_utils import _wrap_datasets
@@ -81,7 +81,8 @@ def apply_transform(data):
                     tok_ans += text_transform(item)
                 yield text_transform(_context) + text_transform(_question) + tok_ans
         logger_.info('Building Vocab based on train data')
-        vocab = build_vocab_from_iterator(apply_transform(raw_data['train']), len(raw_data['train']))
+        vocab = build_vocab_from_iterator(apply_transform(raw_data['train']), specials=['<unk>', '<pad>'])
+        vocab.set_default_index(vocab['<unk>'])
     logger_.info('Vocab has %d entries', len(vocab))
     text_transform = sequential_transforms(text_transform, vocab_func(vocab), totensor(dtype=torch.long))
     transforms = {'context': text_transform, 'question': text_transform,
diff --git a/torchtext/experimental/datasets/sequence_tagging.py b/torchtext/experimental/datasets/sequence_tagging.py
@@ -3,7 +3,7 @@
 from torchtext.data.datasets_utils import _check_default_set
 from torchtext.data.datasets_utils import _wrap_datasets
 from torchtext import datasets as raw
-from torchtext.legacy.vocab import build_vocab_from_iterator
+from torchtext.vocab import build_vocab_from_iterator
 from torchtext.experimental.functional import (
     vocab_func,
     totensor,
@@ -22,7 +22,9 @@ def build_vocab(data):
         for idx, col in enumerate(line):
             data_list[idx].append(col)
     for it in data_list:
-        vocabs.append(build_vocab_from_iterator(it, len(it)))
+        vocab = build_vocab_from_iterator(it, specials=['<unk>', '<pad>'])
+        vocab.set_default_index(vocab['<unk>'])
+        vocabs.append(vocab)
 
     return vocabs
 
diff --git a/torchtext/experimental/datasets/text_classification.py b/torchtext/experimental/datasets/text_classification.py
@@ -1,7 +1,7 @@
 import torch
 import logging
 from torchtext.data.utils import get_tokenizer
-from torchtext.legacy.vocab import build_vocab_from_iterator
+from torchtext.vocab import build_vocab_from_iterator
 from torchtext import datasets as raw
 from torchtext.data.datasets_utils import _check_default_set
 from torchtext.data.datasets_utils import _wrap_datasets
@@ -19,7 +19,9 @@ def build_vocab(data, transforms):
     def apply_transforms(data):
         for _, line in data:
             yield transforms(line)
-    return build_vocab_from_iterator(apply_transforms(data), len(data))
+    vocab = build_vocab_from_iterator(apply_transforms(data), specials=['<unk>', '<pad>'])
+    vocab.set_default_index(vocab['<unk>'])
+    return vocab
 
 
 class TextClassificationDataset(torch.utils.data.Dataset):
diff --git a/torchtext/experimental/datasets/translation.py b/torchtext/experimental/datasets/translation.py
@@ -4,7 +4,7 @@
 from torchtext.data.datasets_utils import _wrap_datasets
 from torchtext import datasets as raw
 from torchtext.experimental.datasets import raw as experimental_raw
-from torchtext.legacy.vocab import Vocab, build_vocab_from_iterator
+from torchtext.vocab import Vocab, build_vocab_from_iterator
 from torchtext.data.utils import get_tokenizer
 from ..functional import vocab_func, totensor, sequential_transforms
 
@@ -15,7 +15,9 @@ def build_vocab(data, transforms, index):
     def apply_transforms(data):
         for line in data:
             yield transforms(line[index])
-    return build_vocab_from_iterator(apply_transforms(data), len(data))
+    vocab = build_vocab_from_iterator(apply_transforms(data), specials=['<unk>', '<pad>'])
+    vocab.set_default_index(vocab['<unk>'])
+    return vocab
 
 
 def _setup_datasets(dataset_name,
diff --git a/torchtext/vocab.py b/torchtext/vocab.py
@@ -258,14 +258,16 @@ def build_vocab_from_iterator(iterator: Iterable, min_freq: int = 1, specials: O
     counter = Counter()
     for tokens in iterator:
         counter.update(tokens)
-    sorted_by_freq_tuples = sorted(counter.items(), key=lambda x: x[1], reverse=True)
-    ordered_dict = OrderedDict(sorted_by_freq_tuples)
 
     if specials is not None:
-        for symbol in specials:
-            if symbol in ordered_dict:
-                del ordered_dict[symbol]
+        for tok in specials:
+            del counter[tok]
 
+    sorted_by_freq_tuples = sorted(counter.items(), key=lambda x: x[0])
+    sorted_by_freq_tuples.sort(key=lambda x: x[1], reverse=True)
+    ordered_dict = OrderedDict(sorted_by_freq_tuples)
+
+    if specials is not None:
         if special_first:
             specials = specials[::-1]
         for symbol in specials: