
Experimental translation datasets #751


Merged
merged 44 commits on Jun 5, 2020
3ab5c1e
Merge pull request #1 from pytorch/master
akurniawan Feb 13, 2019
db42b7f
Merge remote-tracking branch 'upstream/master'
Apr 28, 2020
20545fb
first commit for both WMT14 and Multi30k
Apr 30, 2020
f57adf0
add IWSLT dataset
May 1, 2020
ba23a50
update argument documentation for all datasets
May 2, 2020
1359d4b
add unit test for all datasets and updating behaviour for conditional…
May 2, 2020
dc39604
add utilities for loading translation data
May 2, 2020
2b34d39
Merge branch 'master' of https://github.com/pytorch/text into new_tra…
May 2, 2020
b064941
fix linting
May 2, 2020
c9fabc5
remove slow unittest
May 2, 2020
c794d9e
update spacy model names and add them to requirements
May 2, 2020
5d0eb66
add dependency for spacy models
May 2, 2020
48cbb63
restore unnecessary changes
May 18, 2020
1cd21f5
move functionalities to specific dataset
May 18, 2020
21f49fd
Merge branch 'master' of https://github.com/pytorch/text into new_tra…
May 18, 2020
2264e16
adding raw module for translation dataset
May 19, 2020
756e746
fix documentation
May 19, 2020
d4deab4
keeping the generator without converting them to list in the first place
May 20, 2020
19e9c0f
remove data_select and add missing import
May 20, 2020
d9e0324
add translation dataset to __init__
May 20, 2020
4ccbfdc
first working example for translation dataset
May 20, 2020
f861120
remove iter method and fix wrong cache variable
May 20, 2020
85ae041
fix order in train, test and valid
May 20, 2020
83ccfc4
revert changes on test_functional
May 20, 2020
03b7e53
fix revert leftover changes for test_functional
May 20, 2020
c437f05
add spacy model requirement in windows
May 20, 2020
2e285ef
remove default tokenizer parameters as it's already covered on _setup…
May 20, 2020
d12b64f
revert unnecessary changes on test_builtin_dataset
May 20, 2020
36f41b0
update multi30k to new datasets
May 23, 2020
c7d559e
add new functionality to extract .gz files
May 26, 2020
fa05946
change multi30k dataset links to github
May 26, 2020
95ccf0c
Merge branch 'master' of https://github.com/pytorch/text into new_tra…
May 27, 2020
47ae37a
add documentation for IWSLT and Multi30k
Jun 2, 2020
69901dc
remove unnecessary code
Jun 2, 2020
98dceb4
Merge branch 'master' of https://github.com/pytorch/text into new_tra…
Jun 2, 2020
bf0820e
remove spacy specific model from requirements
Jun 2, 2020
b203b57
use modules already available in functional and remove languages para…
Jun 3, 2020
0628072
change return of get_iterator to return both src and tgt
Jun 3, 2020
7e768a1
add data_select parameter
Jun 3, 2020
906f792
add more documentation for tokenizer and vocab
Jun 4, 2020
ade3cda
add doc string
Jun 4, 2020
a758d6c
fix excessive underline
Jun 4, 2020
0726cbf
fix indentation
Jun 5, 2020
2e64b4c
Merge branch 'master' of https://github.com/pytorch/text into new_tra…
Jun 5, 2020
2 changes: 2 additions & 0 deletions .circleci/unittest/linux/scripts/environment.yml
@@ -17,3 +17,5 @@ dependencies:
- sphinx
- sphinx-rtd-theme
- tqdm
- https://github.com/explosion/spacy-models/releases/download/de_core_news_sm-2.2.5/de_core_news_sm-2.2.5.tar.gz#egg=de_core_news_sm==2.2.5
- https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz#egg=en_core_web_sm==2.2.5
2 changes: 2 additions & 0 deletions .circleci/unittest/windows/scripts/environment.yml
@@ -18,3 +18,5 @@ dependencies:
- tqdm
- sentencepiece
- future
- https://github.com/explosion/spacy-models/releases/download/de_core_news_sm-2.2.5/de_core_news_sm-2.2.5.tar.gz#egg=de_core_news_sm==2.2.5
- https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz#egg=en_core_web_sm==2.2.5
28 changes: 28 additions & 0 deletions docs/source/experimental_datasets.rst
@@ -139,3 +139,31 @@ PennTreebank

.. autoclass:: PennTreebank
:members: __init__


Machine Translation
^^^^^^^^^^^^^^^^^^^

Machine translation datasets are subclasses of the ``TranslationDataset`` class.

.. autoclass:: TranslationDataset
:members: __init__

Multi30k
~~~~~~~~

.. autoclass:: Multi30k
:members: __init__

IWSLT
~~~~~

.. autoclass:: IWSLT
:members: __init__

WMT14
~~~~~

.. autoclass:: WMT14
:members: __init__

42 changes: 38 additions & 4 deletions test/data/test_builtin_datasets.py
@@ -1,6 +1,7 @@
#!/usr/bin/env python3
# Note that all the tests in this module require dataset (either network access or cached)
import os
import glob
import shutil
import torchtext.data as data
from torchtext.datasets import AG_NEWS
@@ -10,10 +11,11 @@


 def conditional_remove(f):
-    if os.path.isfile(f):
-        os.remove(f)
-    elif os.path.isdir(f):
-        shutil.rmtree(f)
+    for path in glob.glob(f):
+        if os.path.isfile(path):
+            os.remove(path)
+        elif os.path.isdir(path):
+            shutil.rmtree(path)
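Since the helper now expands glob patterns, one call can sweep every extracted file and directory that shares a prefix. A minimal standalone sketch of the same logic (using a temporary directory instead of the test fixture's .data folder):

```python
import glob
import os
import shutil
import tempfile


def conditional_remove(pattern):
    # Remove every file or directory matching the glob pattern.
    for path in glob.glob(pattern):
        if os.path.isfile(path):
            os.remove(path)
        elif os.path.isdir(path):
            shutil.rmtree(path)


root = tempfile.mkdtemp()
for name in ('train.de', 'train.en'):
    open(os.path.join(root, name), 'w').close()
os.makedirs(os.path.join(root, 'train.cache'))

# One call cleans up all three artifacts.
conditional_remove(os.path.join(root, 'train*'))
print(os.listdir(root))  # []
```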


class TestDataset(TorchtextTestCase):
@@ -122,3 +124,35 @@ def test_imdb(self):
        old_vocab = train_dataset.get_vocab()
        new_vocab = Vocab(counter=old_vocab.freqs, max_size=2500)
        new_train_data, new_test_data = IMDB(vocab=new_vocab)

    def test_multi30k(self):
        from torchtext.experimental.datasets.translation import Multi30k
        # smoke test to ensure multi30k works properly
        train_dataset, valid_dataset, test_dataset = Multi30k()
        self.assertEqual(len(train_dataset), 29000)
        self.assertEqual(len(valid_dataset), 1000)
        self.assertEqual(len(test_dataset), 1014)

        de_vocab, en_vocab = train_dataset.get_vocab()
        de_tokens_ids = [
            de_vocab[token] for token in
            'Zwei Männer verpacken Donuts in Kunststofffolie'.split()
        ]
        self.assertEqual(de_tokens_ids, [19, 29, 18703, 4448, 5, 6240])

        en_tokens_ids = [
            en_vocab[token] for token in
            'Two young White males are outside near many bushes'.split()
        ]
        self.assertEqual(en_tokens_ids,
                         [17, 23, 1167, 806, 15, 55, 82, 334, 1337])

        datafile = os.path.join(self.project_root, ".data", "train*")
        conditional_remove(datafile)
        datafile = os.path.join(self.project_root, ".data", "val*")
        conditional_remove(datafile)
        datafile = os.path.join(self.project_root, ".data", "test*")
        conditional_remove(datafile)
        datafile = os.path.join(self.project_root, ".data",
                                "multi30k_task*.tar.gz")
        conditional_remove(datafile)
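The vocab lookups in the test above are just token-to-index mapping with an out-of-vocabulary fallback. A self-contained sketch with a toy vocabulary (the tokens and ids here are illustrative only, not the real Multi30k vocab):

```python
from collections import defaultdict

# Toy German vocabulary; index 0 stands in for '<unk>'.
# Ids are made up for illustration, not the real Multi30k ids.
itos = ['<unk>', 'Zwei', 'Männer', 'verpacken']
stoi = defaultdict(int, {tok: i for i, tok in enumerate(itos)})

sentence = 'Zwei Männer verpacken Donuts'
ids = [stoi[token] for token in sentence.split()]
print(ids)  # [1, 2, 3, 0] since 'Donuts' is out-of-vocabulary
```

torchtext's `Vocab` behaves much like this mapping when an unknown-token default is configured.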
28 changes: 28 additions & 0 deletions test/test_utils.py
@@ -42,6 +42,34 @@ def test_download_extract_tar(self):
            conditional_remove(f)
        conditional_remove(archive_path)

    def test_download_extract_gz(self):
        # create root directory for downloading data
        root = '.data'
        if not os.path.exists(root):
            os.makedirs(root)

        # ensure archive is not already downloaded; if it is, delete it
        url = 'https://raw.githubusercontent.com/multi30k/dataset/master/data/task2/raw/val.5.en.gz'
        target_archive_path = os.path.join(root, 'val.5.en.gz')
        conditional_remove(target_archive_path)

        # download archive and ensure it is in the correct location
        archive_path = utils.download_from_url(url)
        assert target_archive_path == archive_path

        # extract files and ensure they are correct
        files = utils.extract_archive(archive_path)
        assert files == [os.path.join(root, 'val.5.en')]

        # extract files with overwrite option True
        files = utils.extract_archive(archive_path, overwrite=True)
        assert files == [os.path.join(root, 'val.5.en')]

        # remove files and archive
        for f in files:
            conditional_remove(f)
        conditional_remove(archive_path)
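The gz test exercises `extract_archive` on a single-file gzip archive; conceptually that extraction is just streaming decompression to a path with the `.gz` suffix stripped. A rough standalone sketch of the idea (a simplified assumption, not torchtext's actual `extract_archive` implementation):

```python
import gzip
import os
import shutil
import tempfile


def extract_gz(gz_path, overwrite=False):
    # Strip the trailing '.gz' to get the output filename,
    # then stream-decompress into it.
    out_path = gz_path[:-3]
    if os.path.exists(out_path) and not overwrite:
        return [out_path]
    with gzip.open(gz_path, 'rb') as fin, open(out_path, 'wb') as fout:
        shutil.copyfileobj(fin, fout)
    return [out_path]


# Round-trip demo with a locally created archive.
root = tempfile.mkdtemp()
gz_path = os.path.join(root, 'val.5.en.gz')
with gzip.open(gz_path, 'wb') as f:
    f.write(b'Two men are packing donuts.\n')

extracted = extract_gz(gz_path)
text = open(extracted[0], 'rb').read()
print(extracted, text)
```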

    def test_download_extract_zip(self):
        # create root directory for downloading data
        root = '.data'
4 changes: 4 additions & 0 deletions torchtext/experimental/datasets/raw/__init__.py
@@ -1,6 +1,7 @@
from .text_classification import AG_NEWS, SogouNews, DBpedia, YelpReviewPolarity, \
    YelpReviewFull, YahooAnswers, \
    AmazonReviewPolarity, AmazonReviewFull, IMDB
from .translation import Multi30k, IWSLT, WMT14
from .language_modeling import WikiText2, WikiText103, PennTreebank, WMTNewsCrawl

__all__ = ['IMDB',
@@ -12,6 +13,9 @@
           'YahooAnswers',
           'AmazonReviewPolarity',
           'AmazonReviewFull',
           'Multi30k',
           'IWSLT',
           'WMT14',
           'WikiText2',
           'WikiText103',
           'PennTreebank',