TREC dataset #92

Merged
merged 9 commits into from
Aug 11, 2017

Conversation

bmccann
Contributor

@bmccann bmccann commented Aug 10, 2017

import torch
from torchtext import data
from torchtext import datasets
inputs = data.Field(lower=True, include_lengths=True, batch_first=True)
answers = data.Field(sequential=False)
print('Generating train, dev, test splits')
train, test = datasets.TREC.splits(inputs, answers)
print(train[0].text, train[0].label)

yields

Generating train, dev, test splits
['how', 'did', 'serfdom', 'develop', 'in', 'and', 'then', 'leave', 'russia', '?'] DESC

With fine_grained=True, the full label (coarse class plus subclass) is kept:

Generating train, dev, test splits
['how', 'did', 'serfdom', 'develop', 'in', 'and', 'then', 'leave', 'russia', '?'] DESC:manner

Let me know if there is anything else you'd like to see. I'll leave the [WIP] tag on until I hear back.

@bmccann bmccann self-assigned this Aug 10, 2017
@bmccann bmccann requested a review from jekbradbury August 10, 2017 02:48
@jekbradbury
Contributor

Does the dataset contain corrupted UTF-8? If not, I'd rather load it with plain Python than add another dependency. (If it does, then this is probably fine, although a built-in error strategy might also work.)
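
For reference, a minimal sketch of what a built-in error strategy could look like, using only the standard library. The helper name `decode_lenient` and the sample bytes are hypothetical, and whether the TREC files actually contain invalid bytes is not established in this thread.

```python
def decode_lenient(raw, encoding="utf-8"):
    """Decode bytes, substituting U+FFFD for any byte that is not valid in `encoding`."""
    return raw.decode(encoding, errors="replace")


# Hypothetical input: 0xF0 starts a multi-byte UTF-8 sequence that is never completed.
raw_line = b"DESC:def What is a caf\xf0 ?\n"
print(repr(decode_lenient(raw_line)))
```

When reading from a file, `io.open(path, encoding="utf-8", errors="replace")` applies the same strategy, so no extra dependency would be needed for that case.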

examples = []

def get_label_str(label):
    return label.split(':')[0] if not fine_grained else label
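
A self-contained sketch of how the label handling above might fit into per-line parsing, assuming each line of the TREC files has the form `COARSE:fine question text`. `parse_trec_line` is a hypothetical helper and not the code in this PR, and `fine_grained` is passed explicitly here rather than taken from the enclosing scope. Unlike the `lower=True` field in the example above, this sketch does not lowercase tokens.

```python
def get_label_str(label, fine_grained=False):
    # Keep only the coarse class (the part before ':') unless fine-grained labels are requested.
    return label if fine_grained else label.split(':')[0]


def parse_trec_line(line, fine_grained=False):
    """Split one line into (tokens, label). Hypothetical helper for illustration."""
    # The label is everything before the first space; the rest is the question.
    label, _, text = line.strip().partition(' ')
    return text.split(), get_label_str(label, fine_grained)


sample = "DESC:manner How did serfdom develop in and then leave Russia ?\n"
print(parse_trec_line(sample))
print(parse_trec_line(sample, fine_grained=True))
```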

        text_field: The field that will be used for the sentence.
        label_field: The field that will be used for label data.
        root: The root directory that the dataset's zip archive will be
            expanded into; therefore the directory in whose trees

from six.moves import urllib


class TREC(data.ZipDataset):

@jekbradbury jekbradbury merged commit 086bcc2 into master Aug 11, 2017
@bmccann bmccann changed the title [WIP] TREC dataset TREC dataset Aug 15, 2017
@jekbradbury jekbradbury deleted the trec branch September 7, 2017 22:42