Modifying and moving Vocab factory functions #1304
Conversation
test/experimental/test_with_asset.py
Outdated
'pension', 'representing', 'say', 'stricken', 't', 'they', 'turner',
'unions', 'with', 'workers']
v = build_vocab_from_text_file(asset_path)
expected_itos = ['after', 'talks', "'disappointed'", 'Fears', 'Federal', |
What happened to the "'" in the previous expected_itos?
Actually, we changed the tokenizer to Python's split() function to remove the dependency on the JIT'd tokenizer, since it is only available in experimental. This changed the order and the expected words for the new vocab constructed with the split() tokenizer.
The basic_english_normalize() tokenizer/normalizer broke "'disappointed'" into "'", "disappointed", and "'", which is not the case with split().
Hm, but using _build_vocab_from_text_file_using_python_tokenizer below, this should work now even if it's not JIT'd, right?
Yes, we now support both JIT'd and pure Python objects as tokenizers. Added corresponding tests to check that it works in both cases.
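A condensed sketch of what such a test might look like; the import paths, get_itos, and the tokenizer keyword are assumptions based on the signature discussed below, with basic_english_normalize as the experimental tokenizer:

import torch
from torchtext.experimental.transforms import basic_english_normalize  # assumed path
from torchtext.experimental.vocab_factory import build_vocab_from_text_file  # assumed path

def test_vocab_from_text_file(asset_path):
    # Pure Python tokenizer: the default is equivalent to str.split.
    with open(asset_path, encoding='utf-8') as f:
        v_py = build_vocab_from_text_file(f)
    # JIT'd tokenizer module.
    with open(asset_path, encoding='utf-8') as f:
        jit_tokenizer = torch.jit.script(basic_english_normalize())
        v_jit = build_vocab_from_text_file(f, tokenizer=jit_tokenizer)
    # The two tokenizers split lines differently, so only assert that
    # both code paths produce a non-empty vocab.
    assert len(v_py.get_itos()) > 0 and len(v_jit.get_itos()) > 0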
torchtext/csrc/vocab.cpp
Outdated
auto counter_ptr = std::make_shared<IndexDict>();
thread_count++;
at::launch([&, file_path, num_lines, chunk_size, j, i, counter_ptr]() {
parse_raw_text_file_chunk_using_python_tokenizer( |
This will likely cause hefty issues because of the global interpreter lock. You can't parallelize this without multiple processes.
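A quick way to see the effect from the Python side; this is a self-contained sketch, not the PR's code, with str.split standing in for the tokenizer:

# A GIL-bound tokenizer does not scale across threads: only one thread
# executes Python bytecode at a time, so wall time stays roughly flat as
# the worker count grows. Real speedup needs multiple processes.
import time
from concurrent.futures import ThreadPoolExecutor

lines = ['the quick brown fox jumps over the lazy dog'] * 200_000

def tokenize_all(workers):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as ex:
        list(ex.map(str.split, lines))
    return time.perf_counter() - start

print(tokenize_all(1), tokenize_all(4))  # expect similar timings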
Ahh great catch!
torchtext/vocab.py
Outdated
>>> #generating vocab from text file
>>> import io
>>> from torchtext.vocab import build_vocab_from_iterator
>>> def yield_tokens_batch(file_path): |
Why "batch"?
We expect it to yield a list/iterator of tokens rather than individual tokens. We could use yield_tokens_list if 'batch' is misleading?
Yes, I only worry that someone might think this means "batch of sentences" whereas it's just a list of tokens for a single line.
Right, got you! I renamed it to yield_tokens; hopefully the doc now clarifies for users what we expect as input to build_vocab_from_iterator.
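For reference, the completed example reads roughly like this (the file name is illustrative):

>>> import io
>>> from torchtext.vocab import build_vocab_from_iterator
>>> def yield_tokens(file_path):
>>>     with io.open(file_path, encoding='utf-8') as f:
>>>         for line in f:
>>>             yield line.strip().split()
>>> vocab = build_vocab_from_iterator(yield_tokens('data.txt'))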
Args:
    file_object: A file object to read data from.
    tokenizer: A Python callable that splits an input sentence into tokens. It can also be a JIT'd module. By default, the function tokenizes with Python's split() function.
I'd add an example of this to the docstring
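Something along these lines could work; the file name and the import path for basic_english_normalize are assumptions:

>>> import io
>>> import torch
>>> from torchtext.experimental.transforms import basic_english_normalize
>>> f = io.open('vocab.txt', encoding='utf-8')
>>> v = build_vocab_from_text_file(f)  # default: Python's split()
>>> f = io.open('vocab.txt', encoding='utf-8')
>>> jit_tokenizer = torch.jit.script(basic_english_normalize())
>>> v = build_vocab_from_text_file(f, tokenizer=jit_tokenizer)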
torchtext/csrc/vocab.cpp
Outdated
@@ -356,6 +357,54 @@ Vocab _build_vocab_from_text_file(const std::string &file_path,
  return Vocab(std::move(tokens));
}

Vocab _build_vocab_from_text_file_using_python_tokenizer(
    const std::string &file_path, const int64_t min_freq,
const int64_t num_cpus, py::object tokenizer) { |
num_cpus is unused?
ahh my bad! will fix it before merge
file_object: A file object to read data from.
tokenizer: A Python callable that splits an input sentence into tokens. It can also be a JIT'd module. By default, the function tokenizes with Python's split() function.
min_freq: The minimum frequency needed to include a token in the vocabulary.
num_cpus: The number of CPUs to use when loading the vectors from file.
I'd add that num_cpus only applies when the tokenizer is JIT'd
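For example, the line could read (wording is only a suggestion): "num_cpus: The number of CPUs to use when building the vocab. Only applies when the tokenizer is a JIT'd module; a pure Python tokenizer runs single-threaded because of the GIL."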
This PR moves two of the factory functions for creating a Vocab object out of experimental. It also removes the unknown token argument; issue #1305 discusses the behavior in further detail.
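In practice, dropping the unknown-token argument makes out-of-vocabulary handling explicit on the caller's side. A minimal sketch, assuming the new Vocab exposes insert_token and set_default_index, and reusing yield_tokens from above:

# With no built-in unk argument, the caller wires up OOV handling itself.
v = build_vocab_from_iterator(yield_tokens('data.txt'))
v.insert_token('<unk>', 0)       # add the unknown token explicitly
v.set_default_index(v['<unk>'])  # OOV lookups now return this index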