fix parameter specials in Field.build_vocab #495

Merged: 5 commits merged into pytorch:master from fix/build_vocab on Feb 1, 2019

Conversation

speedcell4 (Contributor)

No description provided.

@mttk (Contributor) commented Jan 31, 2019

I believe that the specials argument was off-limits to users intentionally. Do you have any use-case not covered by pad, bos, eos or unk?

@speedcell4 (Contributor, Author) commented Feb 1, 2019

case 1

Say we have the sentence [I, have, an, apple]. When I want to convert it to its contiguous character-level representation [I, <space>, h, a, v, e, <space>, a, n, <space>, a, p, p, l, e], I need a special token for <space>.
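
For example (just a sketch assuming the `specials` kwarg this PR enables; the `char_tokenize` helper, the `CHAR` field and the `train` dataset are only illustrative):

# illustrative sketch: character-level tokenizer that maps spaces to a '<space>' token
def char_tokenize(text):
    return ['<space>' if ch == ' ' else ch for ch in text]

CHAR = Field(tokenize=char_tokenize)
CHAR.build_vocab(train, specials=['<space>'])  # guarantee '<space>' ends up in the vocab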

case 2

Another case is building a separate bi-LSTM. I want the forward field forward_word to use <fwd_init> and <fwd_eos> while the backward field backward_word uses <bwd_init> and <bwd_eos>, and their vocabularies should be shared. If torchtext supports that parameter, we can do this:

forward_word = Field(..., init_token='<fwd_init>', eos_token='<fwd_eos>')
backward_word = Field(..., init_token='<bwd_init>', eos_token='<bwd_eos>')

forward_word.build_vocab(train, specials=['<bwd_init>', '<bwd_eos>'])
backward_word.vocab = forward_word.vocab  # share one vocabulary between both fields

case 3

One more case is BERT-style masking: we randomly replace words with <mask> in the postprocessing stage. So at build_vocab time the field has not seen any <mask>, yet <mask> is needed for the later postprocessing work:

import random


def postprocessing(batch, vocab):
    res_batch = []
    for ex in batch:
        res_ex = []
        for token in ex:
            if random.random() < 0.2:
                # replace the token index with the index of '<mask>'
                res_ex.append(vocab.stoi['<mask>'])
            else:
                res_ex.append(token)
        res_batch.append(res_ex)
    return res_batch


word = Field(..., postprocessing=postprocessing)
word.build_vocab(train, specials=['<mask>'])

@mttk (Contributor) commented Feb 1, 2019

Makes sense. AFAIK, case 3 would be better handled while creating batches (because postprocessing is used just once, and you would want different tokens masked between epochs).
Could you fix the Travis errors and write a test for one of those use-cases (e.g. add a special token and then validate that it's added)?
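
By batch-time masking I mean something roughly like this (a sketch only; `WORD`, `train_iter`, `num_epochs` and the 0.2 probability are placeholders):

import torch

mask_index = WORD.vocab.stoi['<mask>']

for epoch in range(num_epochs):
    for batch in train_iter:
        data = batch.word
        # draw a fresh mask every batch, so the masked positions differ between epochs
        mask = torch.rand_like(data, dtype=torch.float) < 0.2
        data = data.masked_fill(mask, mask_index)
        # ... train on the masked batch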

@speedcell4 (Contributor, Author)

No, postprocessing is called once for every batch:

import random

from torchtext.data import Dataset, Example, Field, Iterator

sentence = 'I have an apple'


def postprocessing(batch, vocab):
    print(f'postprocessing is called')
    res_batch = []
    for ex in batch:
        res_ex = []
        for token in ex:
            if random.random() < 0.5:
                res_ex.append(vocab.stoi['<unk>'])
            else:
                res_ex.append(token)
        res_batch.append(res_ex)
    return res_batch


class Dummy(Dataset):
    def __init__(self, examples, fields):
        super(Dummy, self).__init__(examples, fields)

    @classmethod
    def iters(cls):
        WORD = Field(postprocessing=postprocessing)
        fields = [('word', WORD)]
        examples = [
            Example.fromlist([sentence], fields=fields),
        ]
        dataset = cls(examples, fields)
        WORD.build_vocab(dataset)
        return Iterator(dataset, batch_size=1)


if __name__ == '__main__':
    train = Dummy.iters()
    for _ in range(2):
        for batch in train:
            print(batch.word.tolist())

# postprocessing is called
# [[2], [0], [3], [4]]
# postprocessing is called
# [[2], [5], [0], [4]]

I will finish the unit tests soon
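
It will probably look something like this (just a sketch of the idea, not necessarily the exact test I will push):

def test_build_vocab_specials():
    # build a vocab with an extra special token and check that it is registered
    field = Field()
    fields = [('word', field)]
    examples = [Example.fromlist(['I have an apple'], fields)]
    dataset = Dataset(examples, fields)
    field.build_vocab(dataset, specials=['<mask>'])
    assert '<mask>' in field.vocab.stoi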

@speedcell4 (Contributor, Author) commented Feb 1, 2019

This project should provide an editor configuration file; my PyCharm always fixes non-PEP8 formatting automatically.

@mttk (Contributor) commented Feb 1, 2019

Thanks!

Could you elaborate on the editor configuration file? The flake8 config is in .flake8, but I assume you're not referring to that.

@mttk merged commit a6e520e into pytorch:master on Feb 1, 2019
@speedcell4 deleted the fix/build_vocab branch on February 1, 2019 at 13:32