
Modifying and moving Vocab factory functions #1304


Merged: 11 commits into pytorch:master on May 14, 2021

Conversation

parmeet (Contributor) commented May 12, 2021

This PR moves two of the factory functions for creating a Vocab object out of experimental. It also removes the unknown-token argument; issue #1305 discusses the behavior in further detail.

@parmeet changed the title from "[WIP] Modifying and moving Vocab factory functions" to "Modifying and moving Vocab factory functions" on May 13, 2021
'pension', 'representing', 'say', 'stricken', 't', 'they', 'turner',
'unions', 'with', 'workers']
v = build_vocab_from_text_file(asset_path)
expected_itos = ['after', 'talks', "'disappointed'", 'Fears', 'Federal',
Contributor:

What happened to the "'" in the previous expected_itos?

Contributor (Author):

Actually, we changed the tokenizer to Python's split() function to remove the dependency on the JIT'd tokenizers, since those are only available in experimental. This changed the order and the expected words for the new vocab constructed with the split() tokenizer.

Contributor (Author):

The basic_english_normalize() tokenizer/normalizer broke "'disappointed'" into "'", "disappointed", and "'", which is not the case with split().
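The difference is easy to reproduce with plain Python. The regex tokenizer below is only a rough stand-in for basic_english_normalize, written for illustration; it is not torchtext's actual implementation:

```python
import re

def basic_english_like(line):
    # Rough approximation of a basic-english-style normalizer:
    # lowercase the line and split punctuation into separate tokens.
    line = line.lower()
    line = re.sub(r"([.,!?'\"()])", r" \1 ", line)
    return line.split()

sentence = "'disappointed' Fears"
print(sentence.split())             # ["'disappointed'", 'Fears']
print(basic_english_like(sentence))  # ["'", 'disappointed', "'", 'fears']
```

With str.split() the surrounding quotes stay attached to the word, so the resulting itos list keeps "'disappointed'" as a single entry.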

Contributor:

Hm, but using _build_vocab_from_text_file_using_python_tokenizer below, this should work now even if it's not JIT'd, right?

Contributor (Author):

Yes, we now support both JIT'd and pure-Python objects as tokenizers. Added corresponding tests to check that it works in both cases.

auto counter_ptr = std::make_shared<IndexDict>();
thread_count++;
at::launch([&, file_path, num_lines, chunk_size, j, i, counter_ptr]() {
parse_raw_text_file_chunk_using_python_tokenizer(
Contributor:

This will likely cause hefty issues because of the global interpreter lock. You can't parallelize this without multiple processes.
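To make the concern concrete, here is a stdlib sketch of the chunk-and-merge counting that the C++ code attempts: each chunk builds its own Counter, and the partial counts are merged at the end. The chunks are processed sequentially here, since running a Python tokenizer from multiple threads would be serialized by the GIL; true parallelism would need worker processes (e.g. multiprocessing). Names below are illustrative, not the PR's code:

```python
from collections import Counter

def count_chunk(lines, tokenizer):
    # Count token frequencies for one chunk of lines.
    c = Counter()
    for line in lines:
        c.update(tokenizer(line))
    return c

lines = ["a b a", "b c", "a"]
chunks = [lines[:2], lines[2:]]  # pretend each chunk goes to a worker
partials = [count_chunk(chunk, str.split) for chunk in chunks]
total = sum(partials, Counter())  # merge the per-chunk counts
```

Counter merging is associative, so the final vocabulary is the same regardless of how the file is chunked or which worker finishes first.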

Contributor (Author):

Ahh great catch!

>>> # generating vocab from a text file
>>> import io
>>> from torchtext.vocab import build_vocab_from_iterator
>>> def yield_tokens_batch(file_path):
>>>     with io.open(file_path, encoding='utf-8') as f:
>>>         for line in f:
>>>             yield line.strip().split()
Contributor:

Why "batch"?

Contributor (Author):

We are expecting it to yield a list/iterator of tokens instead of individual tokens. We could potentially use yield_tokens_list if 'batch' is misleading?

Contributor:

Yes, I only worry that someone might think this means "batch of sentences" whereas it's just a list of tokens for a single line.

Contributor (Author):

Right, got you! I renamed it to yield_tokens; hopefully the doc will make clear to users what we expect as input to build_vocab_from_iterator.
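As a sketch of the expected input shape, here is a stdlib stand-in (Counter-based, not torchtext's implementation) that consumes an iterator yielding one token list per line, which is exactly what yield_tokens is meant to produce:

```python
from collections import Counter

def yield_tokens(lines):
    # One *list* of tokens per line, not one token at a time.
    for line in lines:
        yield line.split()

def build_vocab(token_lists, min_freq=1):
    # Illustration of the semantics: sort by frequency, drop rare tokens.
    freqs = Counter()
    for tokens in token_lists:
        freqs.update(tokens)
    return [tok for tok, n in freqs.most_common() if n >= min_freq]

lines = ["the cat sat", "the cat ran", "a dog ran"]
itos = build_vocab(yield_tokens(lines), min_freq=2)
print(itos)  # ['the', 'cat', 'ran']
```

Yielding per-line token lists (rather than flat tokens) lets the consumer update frequency counts one batch at a time without buffering the whole corpus.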


Args:
file_object: A file object to read data from.
tokenizer: A Python callable to split an input sentence into tokens. It can also be a JIT'd module. By default, tokenization is done with Python's split() function.
Contributor:

I'd add an example of this to the docstring.
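One possible shape for such an example, written against a stdlib stand-in so it runs without torchtext. The function name and signature below are hypothetical, modeled on the PR's build_vocab_from_text_file; the example words are taken from the test fixture discussed above:

```python
import tempfile
from collections import Counter

def build_vocab_from_text_file_standin(file_path, tokenizer=str.split, min_freq=1):
    # Stand-in illustrating the semantics: tokenize each line with the
    # given callable, count frequencies, keep tokens meeting min_freq.
    freqs = Counter()
    with open(file_path, encoding="utf-8") as f:
        for line in f:
            freqs.update(tokenizer(line))
    return [tok for tok, n in freqs.most_common() if n >= min_freq]

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as f:
    f.write("Fears for pension after talks\npension talks stricken\n")
    path = f.name

itos = build_vocab_from_text_file_standin(path, tokenizer=str.split, min_freq=2)
print(itos)  # ['pension', 'talks']
```

Passing the tokenizer as a plain callable is what the PR enables; a JIT'd module would slot into the same parameter.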

@@ -356,6 +357,54 @@ Vocab _build_vocab_from_text_file(const std::string &file_path,
return Vocab(std::move(tokens));
}

Vocab _build_vocab_from_text_file_using_python_tokenizer(
const std::string &file_path, const int64_t min_freq,
const int64_t num_cpus, py::object tokenizer) {
Contributor:

num_cpus is unused?

Contributor (Author):

Ahh, my bad! Will fix it before merge.

file_object: A file object to read data from.
tokenizer: A Python callable to split an input sentence into tokens. It can also be a JIT'd module. By default, tokenization is done with Python's split() function.
min_freq: The minimum frequency needed to include a token in the vocabulary.
num_cpus: The number of CPUs to use when loading the vectors from file.
Contributor:

I'd add that num_cpus only applies when the tokenizer is JIT'd.

@parmeet parmeet merged commit 1087109 into pytorch:master May 14, 2021
@parmeet parmeet deleted the vocabfactory branch May 14, 2021 01:23