fix parameter specials in Field.build_vocab #495
I believe that the specials argument was off-limits to users intentionally. Do you have any use case not covered by pad, bos, eos or unk?
Case 1: say we have a sentence

Case 2: another case is building a separated bi-LSTM. I want the forward field

    forward_word = Field(..., init_token='<fwd_init>', eos_token='<fwd_eos>')
    backward_word = Field(..., init_token='<bwd_init>', eos_token='<bwd_eos>')
    # register the backward markers too, so both fields can share one vocab
    forward_word.build_vocab(train, specials=['<bwd_init>', '<bwd_eos>'])
    backward_word.vocab = forward_word.vocab

Case 3: one more case is like BERT, where we randomly replace words with a special token:

    def postprocessing(batch, vocab):
        res_batch = []
        for ex in batch:
            res_ex = []
            for token in ex:
                # replace roughly 20% of the token ids with the mask id
                if random.random() < 0.2:
                    res_ex.append(vocab.stoi['<mask>'])
                else:
                    res_ex.append(token)
            res_batch.append(res_ex)
        return res_batch

    word = Field(..., postprocessing=postprocessing)
    word.build_vocab(train, specials=['<mask>'])
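To make the role of extra specials concrete, here is a minimal, torchtext-free sketch. The `build_vocab` function below is a hypothetical stand-in written for illustration, not the library's implementation; it only assumes that special tokens are placed at the front of the index table and that ordinary words follow by frequency.

```python
from collections import Counter


def build_vocab(sentences, specials=('<unk>', '<pad>')):
    """Toy vocabulary builder (hypothetical, not torchtext's code)."""
    itos = []
    # Specials come first, in the order given, without duplicates.
    for tok in specials:
        if tok not in itos:
            itos.append(tok)
    counts = Counter(tok for sent in sentences for tok in sent)
    # Remaining words: descending frequency, ties broken alphabetically.
    for tok, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if tok not in itos:
            itos.append(tok)
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi


train = [['I', 'have', 'an', 'apple']]
itos, stoi = build_vocab(
    train,
    specials=['<unk>', '<pad>', '<fwd_init>', '<fwd_eos>',
              '<bwd_init>', '<bwd_eos>', '<mask>'],
)
```

Because every special gets its own index, a forward and a backward field sharing this table can tell their boundary markers apart, and a masking step can look up `stoi['<mask>']` without the token ever appearing in the training data.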
Makes sense. AFAIK, case 3 would be better handled while creating batches (because post-processing is used just once, and you would want different tokens masked between epochs).
No, see this example:

    import random

    from torchtext.data import Dataset, Example, Field, Iterator

    sentence = 'I have an apple'

    def postprocessing(batch, vocab):
        print('postprocessing is called')
        res_batch = []
        for ex in batch:
            res_ex = []
            for token in ex:
                # replace roughly half of the token ids with the <unk> id
                if random.random() < 0.5:
                    res_ex.append(vocab.stoi['<unk>'])
                else:
                    res_ex.append(token)
            res_batch.append(res_ex)
        return res_batch

    class Dummy(Dataset):
        def __init__(self, examples, fields):
            super(Dummy, self).__init__(examples, fields)

        @classmethod
        def iters(cls):
            WORD = Field(postprocessing=postprocessing)
            fields = [('word', WORD)]
            examples = [
                Example.fromlist([sentence], fields=fields),
            ]
            dataset = cls(examples, fields)
            WORD.build_vocab(dataset)
            return Iterator(dataset, batch_size=1)

    if __name__ == '__main__':
        train = Dummy.iters()
        for _ in range(2):
            for batch in train:
                print(batch.word.tolist())

    # postprocessing is called
    # [[2], [0], [3], [4]]
    # postprocessing is called
    # [[2], [5], [0], [4]]

I will finish the unit tests soon.
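The point under discussion, that masked positions should be re-sampled every time a batch is produced, can also be sketched without torchtext. The helper names `mask_batch` and `epochs` and the `mask_id` value are assumptions made for this illustration only.

```python
import random


def mask_batch(batch, mask_id, prob=0.2, rng=random):
    """Return a copy of `batch` where each token id is replaced by
    `mask_id` with probability `prob`. The input is left untouched."""
    return [[mask_id if rng.random() < prob else tok for tok in ex]
            for ex in batch]


def epochs(batches, n_epochs, mask_id, prob=0.2, seed=0):
    """Yield (epoch, masked_batch) pairs, drawing a fresh mask on every pass."""
    rng = random.Random(seed)
    for epoch in range(n_epochs):
        for batch in batches:
            yield epoch, mask_batch(batch, mask_id, prob, rng)


# One batch of two already-numericalized examples (ids are made up).
batches = [[[2, 3, 4, 5], [6, 7, 8, 9]]]
masked = list(epochs(batches, n_epochs=3, mask_id=1, prob=0.5))
```

Since the mask is drawn inside the loop over batches, each epoch sees its own random pattern, while the stored examples stay unchanged.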
This project should provide an edit configuration file; my PyCharm always fixes non-PEP8 formatting automatically.
Thanks! Could you elaborate on the edit configuration file? The flake config is written in