
[Question]: How do I generate a tokenizer from a custom vocab.txt? #5508


Closed · zouhan6806504 opened this issue Apr 3, 2023 · 1 comment
Labels: question (Further information is requested), triage

Comments

@zouhan6806504

Please describe your question

My text is encrypted, so what I actually receive is a sequence of numbers, e.g. 14 108 28 30 15 13 294 29 20 18 23 21 25.
There are roughly 1300 distinct words in total. I generated my own vocab.txt and added the special tokens [PAD] [CLS] [SEP] [MASK] [UNK].
Normally one would write tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0').
How can I make the tokenizer use my own vocab.txt instead?
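A minimal sketch of the setup described above: write a vocab.txt containing the special tokens followed by one numeric "word" per line, then point the tokenizer at that file. The `ErnieTokenizer(vocab_file=...)` call at the end is an assumption based on the BERT-style tokenizer API in PaddleNLP; verify the exact signature against your installed version.

```python
# Build a vocab.txt for ~1300 encrypted-number "words" plus the
# special tokens mentioned in the question.
def write_vocab(path, num_words=1300):
    """Write special tokens first, then the numeric word ids, one per line."""
    specials = ["[PAD]", "[CLS]", "[SEP]", "[MASK]", "[UNK]"]
    with open(path, "w", encoding="utf-8") as f:
        for tok in specials:
            f.write(tok + "\n")
        for i in range(1, num_words + 1):  # tokens "1" .. "1300"
            f.write(str(i) + "\n")

write_vocab("vocab.txt")

# Assumed usage (hypothetical; ErnieTokenizer is BERT-like, so it should
# accept a vocab_file argument -- check your PaddleNLP version's docs):
# from paddlenlp.transformers import ErnieTokenizer
# tokenizer = ErnieTokenizer(vocab_file="vocab.txt")
# ids = tokenizer("14 108 28 30")["input_ids"]
```

Since each "word" is already a whitespace-separated token, no subword splitting is needed; the vocab lookup alone is enough.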

@zouhan6806504 zouhan6806504 added the question Further information is requested label Apr 3, 2023
@github-actions github-actions bot added the triage label Apr 3, 2023
@w5688414 (Contributor)

w5688414 commented May 8, 2024

If the tokenizer's vocabulary is constructed from scratch, the model has to be pretrained from scratch as well; this is generally not recommended.
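The reason re-pretraining is needed can be shown with a toy illustration (plain Python, not PaddleNLP code): pretrained weights such as the embedding matrix are indexed by the original vocabulary's token ids, so ids drawn from a new vocabulary point at rows learned for unrelated tokens.

```python
# Toy illustration: pretrained weights are tied to the original vocab's
# id assignment, so swapping in a new vocab scrambles every lookup.
old_vocab = {"[PAD]": 0, "[UNK]": 1, "hello": 2, "world": 3}
# "Pretrained" embedding matrix: row i was learned for old token id i.
embeddings = [[0.0], [0.1], [0.7], [0.9]]

new_vocab = {"[PAD]": 0, "[UNK]": 1, "14": 2, "108": 3}
# Looking up the new token "14" returns the row learned for "hello":
row = embeddings[new_vocab["14"]]
assert row == embeddings[old_vocab["hello"]]  # semantically unrelated row
```

Hence a custom vocab.txt only changes how text maps to ids; the pretrained weights behind those ids carry no meaning for the new tokens until the model is retrained.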

@paddle-bot paddle-bot bot closed this as completed May 13, 2025

3 participants