-
Notifications
You must be signed in to change notification settings - Fork 813
Return raw tokens for experimental IMDB dataset #696
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
fix #690 |
if callable(vocab): | ||
yield cls, iter(map(lambda x: vocab(x), | ||
filter(lambda x: x not in removed_tokens, tokens))) | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The assumption here is that if vocab is not a callable it'll be an instance of Vocab? We could add a type assertion here at the beginning.
We could also create a version that accepts a callable and then pass in the callable lambda x: vocab[x]
.
torch.tensor([token_id for token_id in tokens]))) | ||
tok_list = [tok_id for tok_id in tokens] | ||
try: | ||
data[item]['data'].append((torch.tensor(cls), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In which cases will this conversion not succeed? Using try except is a pretty big hammer. ValueError could be raised by many functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will fail when tokens
is a list of string(s), e.g. using an "empty" vocab (vocab = lambda x : x
) and a tokenizer to get the data as a list of strings or using the empty vocab with the new empty_tokenizer
when you want your data as a string, untokenized.
merged in #701 |
Return raw text data.
Return tokens
Return raw text