
[Question]: How do I generate a tokenizer from a custom vocab.txt? #5508


Closed · zouhan6806504 opened this issue Apr 3, 2023 · 1 comment
Labels: question (Further information is requested), triage

Comments

@zouhan6806504

Please describe your question

My text is encrypted, so what I actually receive is a sequence of numbers, e.g. 14 108 28 30 15 13 294 29 20 18 23 21 25.
There are roughly 1300 distinct words in total. I generated my own vocab.txt and added the special tokens [PAD] [CLS] [SEP] [MASK] [UNK].
Normally one would write tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0').
How can I make the tokenizer use my own vocab.txt instead?
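A minimal sketch of the setup described above: write a vocab.txt containing the special tokens followed by one numeric "word" per line, then point the tokenizer at that file. The `ErnieTokenizer(vocab_file=...)` call at the end is an assumption based on the BERT-style tokenizer API in PaddleNLP; verify the exact signature against your installed version.

```python
# Build a vocab.txt for ~1300 encrypted-number "words" plus the
# special tokens mentioned in the question.
def write_vocab(path, num_words=1300):
    """Write special tokens first, then the numeric word ids, one per line."""
    specials = ["[PAD]", "[CLS]", "[SEP]", "[MASK]", "[UNK]"]
    with open(path, "w", encoding="utf-8") as f:
        for tok in specials:
            f.write(tok + "\n")
        for i in range(1, num_words + 1):  # tokens "1" .. "1300"
            f.write(str(i) + "\n")

write_vocab("vocab.txt")

# Assumed usage (hypothetical; ErnieTokenizer is BERT-like, so it should
# accept a vocab_file argument -- check your PaddleNLP version's docs):
# from paddlenlp.transformers import ErnieTokenizer
# tokenizer = ErnieTokenizer(vocab_file="vocab.txt")
# ids = tokenizer("14 108 28 30")["input_ids"]
```

Since each "word" is already a whitespace-separated token, no subword splitting is needed; the vocab lookup alone is enough.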

@zouhan6806504 zouhan6806504 added the question Further information is requested label Apr 3, 2023
@github-actions github-actions bot added the triage label Apr 3, 2023
@w5688414 (Contributor)

w5688414 commented May 8, 2024

If the tokenizer's vocabulary is constructed from scratch, the model has to be pretrained from scratch as well; this is generally not recommended.
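The reason re-pretraining is needed can be shown with a toy illustration (plain Python, not PaddleNLP code): pretrained weights such as the embedding matrix are indexed by the original vocabulary's token ids, so ids drawn from a new vocabulary point at rows learned for unrelated tokens.

```python
# Toy illustration: pretrained weights are tied to the original vocab's
# id assignment, so swapping in a new vocab scrambles every lookup.
old_vocab = {"[PAD]": 0, "[UNK]": 1, "hello": 2, "world": 3}
# "Pretrained" embedding matrix: row i was learned for old token id i.
embeddings = [[0.0], [0.1], [0.7], [0.9]]

new_vocab = {"[PAD]": 0, "[UNK]": 1, "14": 2, "108": 3}
# Looking up the new token "14" returns the row learned for "hello":
row = embeddings[new_vocab["14"]]
assert row == embeddings[old_vocab["hello"]]  # semantically unrelated row
```

Hence a custom vocab.txt only changes how text maps to ids; the pretrained weights behind those ids carry no meaning for the new tokens until the model is retrained.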

@paddle-bot paddle-bot bot closed this as completed May 13, 2025

3 participants