
[Bug]: convert_ids_to_tokens() got an unexpected keyword argument 'skip_special_tokens' #6106


Closed
1 task done
xiehuanyi opened this issue Jun 6, 2023 · 1 comment


xiehuanyi commented Jun 6, 2023

Software environment

paddle-bfloat                  0.1.2
paddle2onnx                    1.0.6
paddlefsl                      1.1.0
paddlehub                      2.0.4
paddlenlp                      2.5.2
paddlepaddle-gpu               2.3.0rc0.post112
tb-paddle                      0.3.6

Duplicate check

  • I have searched the existing issues

Bug description

I ran into two problems with the GPT-CPM tokenizer.
The first is that no pad_token (pad_id) is set:

[2023-06-06 09:06:23,496] [   ERROR] - Using pad_token, but it is not set yet.
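
Assigning a pad token makes this first error go away; the repro below keeps this line commented out, but it reuses eos_token as the pad token:

dec_tokenizer.add_special_tokens({'pad_token': dec_tokenizer.eos_token})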

After adding the pad_token this way, a new problem appears when decoding:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_11555/2309472343.py in <module>
     32     print(res['input_ids'])
     33     print(len(res['input_ids']))
---> 34     print(dec_tokenizer.decode(res['input_ids']))
     35     print()
     36     print()

/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddlenlp/transformers/tokenizer_utils_base.py in decode(self, token_ids, skip_special_tokens, clean_up_tokenization_spaces, **kwargs)
   3124             skip_special_tokens=skip_special_tokens,
   3125             clean_up_tokenization_spaces=clean_up_tokenization_spaces,
-> 3126             **kwargs,
   3127         )
   3128 

/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddlenlp/transformers/tokenizer_utils.py in _decode(self, token_ids, skip_special_tokens, clean_up_tokenization_spaces, spaces_between_special_tokens, **kwargs)
   1443         self._decode_use_source_tokenizer = kwargs.pop("use_source_tokenizer", False)
   1444 
-> 1445         filtered_tokens = self.convert_ids_to_tokens(token_ids, skip_special_tokens=skip_special_tokens)
   1446 
   1447         # To avoid mixing byte-level and unicode for byte-level BPT

TypeError: convert_ids_to_tokens() got an unexpected keyword argument 'skip_special_tokens'

This appears to be an interface inconsistency: the base class's _decode passes skip_special_tokens down to convert_ids_to_tokens, but the GPT-CPM tokenizer's override of that method does not accept the keyword.
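
Until the signatures are aligned upstream, a monkey-patch along these lines might unblock decoding. This is only a sketch: it assumes the tokenizer exposes the HF-style all_special_ids property and that decode always hands convert_ids_to_tokens a list of ids.

# Hedged workaround, not an official fix: wrap convert_ids_to_tokens so the
# skip_special_tokens keyword passed down by the base class is handled here.
_orig_convert = dec_tokenizer.convert_ids_to_tokens

def _convert_ids_to_tokens(ids, skip_special_tokens=False):
    if skip_special_tokens:
        # all_special_ids is assumed to exist, mirroring the HF-style API
        ids = [i for i in ids if i not in set(dec_tokenizer.all_special_ids)]
    return _orig_convert(ids)

dec_tokenizer.convert_ids_to_tokens = _convert_ids_to_tokens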

Steps to reproduce & code

import json

from paddlenlp.transformers import AutoTokenizer

enc_tokenizer = AutoTokenizer.from_pretrained('ernie-3.0-medium-zh')
dec_tokenizer = AutoTokenizer.from_pretrained('gpt-cpm-small-cn-distill')
# Uncommenting the next line fixes the missing-pad_token error but then
# triggers the decode TypeError described above.
# dec_tokenizer.add_special_tokens({'pad_token': dec_tokenizer.eos_token})

# `data`, `template`, `get_*_result`, and `get_label` come from the author's
# own data pipeline and are not included in this snippet.
for i in range(5):
    j_dic = json.loads(data[i])
    feat = template(get_title_result(j_dic), get_asr_result(j_dic), get_ocr_result(j_dic), get_faces_result(j_dic))
    label = get_label(j_dic)
    print(label)
    # res = enc_tokenizer(
    #     text=feat, 
    #     max_length=512, 
    #     padding=True, 
    #     truncation=True, 
    #     return_token_type_ids=True, 
    #     return_attention_mask=True
    #     )
    # print(res['input_ids'])
    # print(len(res['input_ids']))
    # print(enc_tokenizer.decode(res['input_ids']))
    # print()
    res = dec_tokenizer(
        text=label, 
        max_length=256, 
        padding=True, 
        truncation=True, 
        return_token_type_ids=False, 
        return_attention_mask=True
    )
    print(res['input_ids'])
    print(len(res['input_ids']))
    print(dec_tokenizer.decode(res['input_ids']))
    print()
    print()

# help(tokenizer)
xiehuanyi added the bug label on Jun 6, 2023
github-actions bot added the triage label on Jun 6, 2023
w5688414 (Contributor) commented May 8, 2024

The code you provided is incomplete; please provide a minimal reproducible example. The following pieces are undefined:

j_dic = json.loads(data[i])
feat = template(get_title_result(j_dic), get_asr_result(j_dic), get_ocr_result(j_dic), get_faces_result(j_dic))
label = get_label(j_dic)
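
For example, something along these lines, with the undefined pieces replaced by a literal string, should be enough to confirm whether the same decode path fails (untested sketch):

from paddlenlp.transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained('gpt-cpm-small-cn-distill')
tok.add_special_tokens({'pad_token': tok.eos_token})

res = tok(text='你好', max_length=256, padding=True, truncation=True)
print(tok.decode(res['input_ids']))  # raises TypeError on paddlenlp 2.5.2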

paddle-bot closed this as completed on May 13, 2025