Return raw tokens for experimental IMDB dataset #696


Closed
wants to merge 8 commits

Conversation


@zhangguanheng66 zhangguanheng66 commented Feb 19, 2020

This PR lets the experimental IMDB dataset return raw text data (tokens or untokenized strings) instead of numericalized tensors.

Return tokens

from torchtext.experimental.datasets import IMDB

# Identity "vocab": skip numericalization, so each example comes back as a token list.
vocab = lambda x: x
train_dataset, test_dataset = IMDB(vocab=vocab)

Return raw text

from torchtext.experimental.datasets import IMDB
from torchtext.data.utils import get_tokenizer

# Identity vocab plus the new pass-through tokenizer: each example comes back
# as a raw, untokenized string.
vocab = lambda x: x
tokenizer = get_tokenizer('empty_tokenizer')
train_dataset, test_dataset = IMDB(vocab=vocab, tokenizer=tokenizer)
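The two examples above rely on one pattern: the dataset composes a tokenizer and a vocab over each raw string, so an identity function at either stage passes the data through unchanged. A rough standalone sketch (the `build_pipeline` helper and the toy `stoi` mapping are hypothetical, not torchtext API):

```python
# Hypothetical sketch of the pipeline this PR exposes: apply tokenizer,
# then vocab, to each raw string.

def build_pipeline(tokenizer, vocab):
    """Return a function mapping a raw string through tokenizer and vocab."""
    return lambda text: vocab(tokenizer(text))

# Normal numericalizing pipeline: toy tokenizer and toy vocab lookup.
stoi = {"good": 0, "movie": 1}
numericalize = build_pipeline(str.split, lambda toks: [stoi[t] for t in toks])
print(numericalize("good movie"))  # [0, 1]

# "Return tokens": identity vocab leaves the token list untouched.
raw_tokens = build_pipeline(str.split, lambda x: x)
print(raw_tokens("good movie"))  # ['good', 'movie']

# "Return raw text": identity tokenizer and identity vocab.
raw_text = build_pipeline(lambda x: x, lambda x: x)
print(raw_text("good movie"))  # good movie
```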

@zhangguanheng66

zhangguanheng66 commented Feb 19, 2020

Fixes #690
@bentrevett

if callable(vocab):
    yield cls, iter(map(lambda x: vocab(x),
                        filter(lambda x: x not in removed_tokens, tokens)))
else:

The assumption here is that if vocab is not a callable it'll be an instance of Vocab? We could add a type assertion here at the beginning.

We could also create a version that accepts a callable and then pass in the callable lambda x: vocab[x].
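The reviewer's alternative could look roughly like the following standalone sketch, which normalizes a mapping-style `Vocab` into a callable up front so downstream code handles a single case. `as_callable` and `ToyVocab` are hypothetical names for illustration, not torchtext code:

```python
# Hedged sketch of the suggestion: wrap a Vocab-like object in a callable
# once, instead of branching on callable(vocab) at every use site.

def as_callable(vocab):
    """Normalize vocab into a callable over a token list (hypothetical helper)."""
    if callable(vocab):
        return vocab
    # Assume a Vocab-like object supports per-token __getitem__ lookup.
    return lambda tokens: [vocab[t] for t in tokens]

class ToyVocab:
    """Minimal stand-in for torchtext's Vocab: __getitem__ with a default."""
    def __init__(self, stoi):
        self.stoi = stoi
    def __getitem__(self, token):
        return self.stoi.get(token, 0)

lookup = as_callable(ToyVocab({"good": 1, "movie": 2}))
print(lookup(["good", "movie", "unk"]))  # [1, 2, 0]

identity = as_callable(lambda x: x)
print(identity(["good", "movie"]))  # ['good', 'movie']
```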

torch.tensor([token_id for token_id in tokens])))
tok_list = [tok_id for tok_id in tokens]
try:
    data[item]['data'].append((torch.tensor(cls),

In which cases will this conversion not succeed? Using try except is a pretty big hammer. ValueError could be raised by many functions.

@bentrevett commented Feb 27, 2020
This will fail when tokens is a list of strings: for example, when using an "empty" vocab (vocab = lambda x: x) together with a tokenizer, so the data comes back as a list of strings, or when using the empty vocab with the new empty_tokenizer, so the data comes back as a single untokenized string.
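A minimal sketch of the point being made, with the tensor conversion replaced by plain lists and the `maybe_numericalize` helper purely hypothetical: an explicit type check names the actual condition (string tokens from an identity vocab), where a blanket try/except ValueError would also swallow unrelated errors.

```python
# Hedged sketch: convert to ids only when the vocab actually produced
# integers; pass raw string tokens through untouched.

def maybe_numericalize(tokens):
    """Return integer ids for conversion; leave string tokens as-is."""
    if all(isinstance(t, int) for t in tokens):
        # In the real code this branch would build torch.tensor(tokens).
        return list(tokens)
    # Raw-tokens / raw-text path: the strings stay untouched.
    return tokens

print(maybe_numericalize([3, 17, 42]))        # [3, 17, 42]
print(maybe_numericalize(["good", "movie"]))  # ['good', 'movie']
```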

@zhangguanheng66

merged in #701
