Modifying and moving Vocab factory functions #1304
Conversation
test/experimental/test_with_asset.py
Outdated
'pension', 'representing', 'say', 'stricken', 't', 'they', 'turner',
'unions', 'with', 'workers']
v = build_vocab_from_text_file(asset_path)
expected_itos = ['after', 'talks', "'disappointed'", 'Fears', 'Federal', |
What happened to the "'" in the previous expected_itos?
Actually, we changed the tokenizer to Python's split() function to remove the dependency on the JIT'd tokenizer, since it is only available in experimental. This changed the order and the expected words for the new vocab constructed with the split() tokenizer.
The basic_english_normalize() tokenizer/normalizer broke "'disappointed'" into "'", "disappointed", and "'", which is not the case with split().
Hm, but using _build_vocab_from_text_file_using_python_tokenizer below, this should work now even if it's not JIT'd, right?
Yes, we now support both JIT'd and pure Python objects as tokenizers. Added corresponding tests to check that it works in both cases.
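A condensed sketch of what such a test might look like; the import paths, get_itos, and the tokenizer keyword are assumptions based on the signature discussed below, with basic_english_normalize as the experimental tokenizer:

import torch
from torchtext.experimental.transforms import basic_english_normalize  # assumed path
from torchtext.experimental.vocab_factory import build_vocab_from_text_file  # assumed path

def test_vocab_from_text_file(asset_path):
    # Pure Python tokenizer: the default is equivalent to str.split.
    with open(asset_path, encoding='utf-8') as f:
        v_py = build_vocab_from_text_file(f)
    # JIT'd tokenizer module.
    with open(asset_path, encoding='utf-8') as f:
        jit_tokenizer = torch.jit.script(basic_english_normalize())
        v_jit = build_vocab_from_text_file(f, tokenizer=jit_tokenizer)
    # The two tokenizers split lines differently, so only assert that
    # both code paths produce a non-empty vocab.
    assert len(v_py.get_itos()) > 0 and len(v_jit.get_itos()) > 0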
torchtext/csrc/vocab.cpp
Outdated
auto counter_ptr = std::make_shared<IndexDict>();
thread_count++;
at::launch([&, file_path, num_lines, chunk_size, j, i, counter_ptr]() {
parse_raw_text_file_chunk_using_python_tokenizer( |
This will likely cause hefty issues because of the global interpreter lock. You can't parallelize this without multiple processes.
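A quick way to see the effect from the Python side; this is a self-contained sketch, not the PR's code, with str.split standing in for the tokenizer:

# A GIL-bound tokenizer does not scale across threads: only one thread
# executes Python bytecode at a time, so wall time stays roughly flat as
# the worker count grows. Real speedup needs multiple processes.
import time
from concurrent.futures import ThreadPoolExecutor

lines = ['the quick brown fox jumps over the lazy dog'] * 200_000

def tokenize_all(workers):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as ex:
        list(ex.map(str.split, lines))
    return time.perf_counter() - start

print(tokenize_all(1), tokenize_all(4))  # expect similar timings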
Ahh great catch!
torchtext/vocab.py
Outdated
>>> #generating vocab from text file
>>> import io
>>> from torchtext.vocab import build_vocab_from_iterator
>>> def yield_tokens_batch(file_path): |
Why "batch"?
We expect it to yield a list/iterator of tokens rather than individual tokens. We could use yield_tokens_list if 'batch' is misleading?
Yes, I only worry that someone might think this means "batch of sentences" whereas it's just a list of tokens for a single line.
Right, got you! I renamed it to yield_tokens; hopefully the doc now clarifies for users what we expect as input to build_vocab_from_iterator.
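For reference, the completed example reads roughly like this (the file name is illustrative):

>>> import io
>>> from torchtext.vocab import build_vocab_from_iterator
>>> def yield_tokens(file_path):
>>>     with io.open(file_path, encoding='utf-8') as f:
>>>         for line in f:
>>>             yield line.strip().split()
>>> vocab = build_vocab_from_iterator(yield_tokens('data.txt'))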
Args:
    file_object: A file object to read data from.
    tokenizer: A Python callable that splits an input sentence into tokens. It can also be a JIT'd module. By default, the function tokenizes with Python's split() function.
I'd add an example of this to the docstring
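Something along these lines could work; the file name and the import path for basic_english_normalize are assumptions:

>>> import io
>>> import torch
>>> from torchtext.experimental.transforms import basic_english_normalize
>>> f = io.open('vocab.txt', encoding='utf-8')
>>> v = build_vocab_from_text_file(f)  # default: Python's split()
>>> f = io.open('vocab.txt', encoding='utf-8')
>>> jit_tokenizer = torch.jit.script(basic_english_normalize())
>>> v = build_vocab_from_text_file(f, tokenizer=jit_tokenizer)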
torchtext/csrc/vocab.cpp
Outdated
@@ -356,6 +357,54 @@ Vocab _build_vocab_from_text_file(const std::string &file_path,
  return Vocab(std::move(tokens));
}

Vocab _build_vocab_from_text_file_using_python_tokenizer(
    const std::string &file_path, const int64_t min_freq,
const int64_t num_cpus, py::object tokenizer) { |
num_cpus is unused?
ahh my bad! will fix it before merge
file_object: A file object to read data from.
tokenizer: A Python callable that splits an input sentence into tokens. It can also be a JIT'd module. By default, the function tokenizes with Python's split() function.
min_freq: The minimum frequency needed to include a token in the vocabulary.
num_cpus: The number of CPUs to use when loading the vectors from file.
I'd add that num_cpus only applies when the tokenizer is JIT'd
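For example, the line could read (wording is only a suggestion): "num_cpus: The number of CPUs to use when building the vocab. Only applies when the tokenizer is a JIT'd module; a pure Python tokenizer runs single-threaded because of the GIL."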
This PR moves two of the factory functions for creating a Vocab object out of experimental. It also removes the unknown token argument; issue #1305 discusses the behavior in further detail.
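In practice, dropping the unknown-token argument makes out-of-vocabulary handling explicit on the caller's side. A minimal sketch, assuming the new Vocab exposes insert_token and set_default_index, and reusing yield_tokens from above:

# With no built-in unk argument, the caller wires up OOV handling itself.
v = build_vocab_from_iterator(yield_tokens('data.txt'))
v.insert_token('<unk>', 0)       # add the unknown token explicitly
v.set_default_index(v['<unk>'])  # OOV lookups now return this index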