[Question]: Why does vocab.txt contain two identical characters? #5974

agoodpp · 2023-05-19T10:33:36Z

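Please ask your question

I took the vocab.txt download URL from line 154 of /paddlenlp/transformers/ernie/tokenizer.py:
https://bj.bcebos.com/paddlenlp/models/transformers/ernie_3.0/ernie_3.0_nano_zh_vocab.txt
Lines 12085 and 18006 of this file contain the same character, "$". Why does this happen?
This affects the tokenizer in the Hugging Face transformers library. Because vocab.txt contains a duplicated character, the tokenizer loses token id 12084, and when a new special token is added, it is not assigned a correct new id (in practice it receives the same id as the last entry, [UNK]).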
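The effect described above can be reproduced on a toy vocabulary. This is a minimal sketch, assuming the vocab file is loaded one token per line into a dict that maps each token to its zero-based line index, as BERT-style loaders commonly do. The helper names `find_duplicate_tokens` and `load_vocab_as_dict` and the five-entry `toy_lines` list are hypothetical and only serve the illustration; they are not part of PaddleNLP or transformers.

```python
from collections import OrderedDict, defaultdict


def find_duplicate_tokens(lines):
    """Return {token: [1-based line numbers]} for tokens that occur more than once."""
    positions = defaultdict(list)
    for line_no, line in enumerate(lines, start=1):
        positions[line.rstrip("\n")].append(line_no)
    return {tok: nums for tok, nums in positions.items() if len(nums) > 1}


def load_vocab_as_dict(lines):
    """Map token -> zero-based line index (assumed loader behaviour).

    A later duplicate overwrites the id stored for the earlier occurrence.
    """
    vocab = OrderedDict()
    for index, line in enumerate(lines):
        vocab[line.rstrip("\n")] = index
    return vocab


# Toy stand-in for vocab.txt: "$" appears on lines 2 and 4, "[UNK]" is last.
toy_lines = ["[PAD]", "$", "a", "$", "[UNK]"]

duplicates = find_duplicate_tokens(toy_lines)
vocab = load_vocab_as_dict(toy_lines)

# Ids that no token maps to any more (the earlier "$" lost its id).
missing_ids = sorted(set(range(len(toy_lines))) - set(vocab.values()))

# If a new token is given the id len(vocab), it collides with the last entry.
next_new_id = len(vocab)

print(duplicates, missing_ids, next_new_id, vocab["[UNK]"])
```

To check the real file, the same functions can be applied to the lines read from a local copy of the vocab opened with `encoding="utf-8"`. In the toy case the dict holds one entry fewer than the file has lines, which is why an id computed from the dict size lands on the id of the last token.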

w5688414 · 2024-05-08T09:30:44Z

Acknowledged. This is a known issue.

agoodpp added the question Further information is requested label May 19, 2023

github-actions bot added the triage label May 19, 2023

gongel assigned LiuChiachi May 23, 2023

paddle-bot bot assigned wawltor Mar 8, 2024

paddle-bot bot closed this as completed May 13, 2025
