TREC dataset #92

bmccann · 2017-08-10T02:48:17Z

import torch
from torchtext import data
from torchtext import datasets
inputs = data.Field(lower=True, include_lengths=True, batch_first=True)
answers = data.Field(sequential=False)
print('Generating train, dev, test splits')
train, test = datasets.TREC.splits(inputs, answers)
print(train[0].text, train[0].label)

yields

Generating train, dev, test splits
['how', 'did', 'serfdom', 'develop', 'in', 'and', 'then', 'leave', 'russia', '?'] DESC

When fine_grained=True:

Generating train, dev, test splits
['how', 'did', 'serfdom', 'develop', 'in', 'and', 'then', 'leave', 'russia', '?'] DESC:manner

Let me know if there is anything else you'd like to see. I'll leave the WIP tag on the title until I hear back.

jekbradbury · 2017-08-10T08:50:42Z

Does the dataset contain corrupted UTF-8? If not, I’d rather just load it with Python than add another dependency. (If it does, then this is probably fine, although a built-in error strategy might also work.)
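
For reference, a minimal sketch of what a "built-in error strategy" could look like, assuming the phrase refers to the `errors` argument of Python's `bytes.decode`. The sample bytes and the `decode_line` helper are hypothetical and are not taken from the TREC files or from this PR.

```python
# Hypothetical raw line containing one byte that is not valid UTF-8 in this position.
raw = b"DESC:manner How did serfdom develop \xf0 ?"


def decode_line(line_bytes, errors="replace"):
    """Decode bytes as UTF-8, letting the codec's error handler deal with bad bytes."""
    return line_bytes.decode("utf-8", errors=errors)


replaced = decode_line(raw)            # the bad byte becomes U+FFFD
ignored = decode_line(raw, "ignore")   # the bad byte is silently dropped

try:
    decode_line(raw, "strict")         # the default strategy raises instead
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

Whether `replace` or `ignore` is appropriate depends on whether the stray byte carries meaning in the question text; `replace` at least leaves a visible marker in the token stream.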

torchtext/datasets/trec.py

+        examples = []
+
+        def get_label_str(label):
+                return label.split(':')[0] if not fine_grained else label
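
A standalone sketch of what this helper does. In the diff, `fine_grained` is taken from the enclosing scope; here it is passed as an explicit argument (an assumption made only so the snippet runs on its own).

```python
def get_label_str(label, fine_grained=False):
    # Coarse labels keep only the part before the colon, e.g. "DESC";
    # fine-grained labels keep the full "COARSE:fine" string.
    return label if fine_grained else label.split(':')[0]


coarse = get_label_str('DESC:manner')
fine = get_label_str('DESC:manner', fine_grained=True)
```

The two results correspond to the `DESC` and `DESC:manner` labels printed in the PR description.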


torchtext/datasets/trec.py

+            text_field: The field that will be used for the sentence.
+            label_field: The field that will be used for label data.
+            root: The root directory that the dataset's zip archive will be
+                expanded into; therefore the directory in whose trees


torchtext/datasets/trec.py

+from six.moves import urllib
+
+
+class TREC(data.ZipDataset):


adding trec data

91afe1b

bmccann self-assigned this Aug 10, 2017

bmccann requested a review from jekbradbury August 10, 2017 02:48

whitespace and indentation

2aa3854

bmccann force-pushed the trec branch from 1ac9f36 to 2aa3854 Compare August 10, 2017 03:06

bmccann added 2 commits August 10, 2017 03:21

typo in docstring

9451528

unicode enforcing

ffeb215

bmccann added 4 commits August 10, 2017 19:46

handling for the one problem byte

8ba5a62

adding test and sort for TREC

3d1911f

removing unnecessary imports

e3c473e

lines

527a7e5

jekbradbury reviewed Aug 11, 2017

Dataset, spacing, rm trees in comments

eb2465d

jekbradbury merged commit 086bcc2 into master Aug 11, 2017

bmccann changed the title ~~[WIP] TREC dataset~~ TREC dataset Aug 15, 2017

jekbradbury deleted the trec branch September 7, 2017 22:42
