Software Environment

paddle-bfloat        0.1.2
paddle2onnx          1.0.6
paddlefsl            1.1.0
paddlehub            2.0.4
paddlenlp            2.5.2
paddlepaddle-gpu     2.3.0rc0.post112
tb-paddle            0.3.6
Bug Description

I ran into two problems with the gpt-cpm tokenizer. The first is that it has no pad_id:

[2023-06-06 09:06:23,496] [ ERROR] - Using pad_token, but it is not set yet.
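A possible way around the missing pad_id is to register the existing eos token as the pad token (the same line appears commented out in the reproduction code below); this is a sketch, assuming the gpt-cpm tokenizer supports add_special_tokens like other PaddleNLP tokenizers:

from paddlenlp.transformers import AutoTokenizer

dec_tokenizer = AutoTokenizer.from_pretrained('gpt-cpm-small-cn-distill')
# Reuse the eos token as the pad token so padding no longer errors out:
dec_tokenizer.add_special_tokens({'pad_token': dec_tokenizer.eos_token})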
After adding a pad_token, a decoding problem appeared, as shown below:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_11555/2309472343.py in <module>
     32 print(res['input_ids'])
     33 print(len(res['input_ids']))
---> 34 print(dec_tokenizer.decode(res['input_ids']))
     35 print()
     36 print()

/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddlenlp/transformers/tokenizer_utils_base.py in decode(self, token_ids, skip_special_tokens, clean_up_tokenization_spaces, **kwargs)
   3124             skip_special_tokens=skip_special_tokens,
   3125             clean_up_tokenization_spaces=clean_up_tokenization_spaces,
-> 3126             **kwargs,
   3127         )
   3128

/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddlenlp/transformers/tokenizer_utils.py in _decode(self, token_ids, skip_special_tokens, clean_up_tokenization_spaces, spaces_between_special_tokens, **kwargs)
   1443         self._decode_use_source_tokenizer = kwargs.pop("use_source_tokenizer", False)
   1444
-> 1445         filtered_tokens = self.convert_ids_to_tokens(token_ids, skip_special_tokens=skip_special_tokens)
   1446
   1447         # To avoid mixing byte-level and unicode for byte-level BPT

TypeError: convert_ids_to_tokens() got an unexpected keyword argument 'skip_special_tokens'
This seems to be caused by an interface inconsistency: _decode in tokenizer_utils.py passes skip_special_tokens to convert_ids_to_tokens(), but the gpt-cpm tokenizer's implementation of that method does not accept this keyword argument.
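Until the two signatures are aligned upstream, a user-side patch might paper over the mismatch. This is only a sketch, assuming convert_ids_to_tokens accepts a list of ids and the tokenizer exposes all_special_tokens as other PaddleNLP tokenizers do:

_orig_convert = dec_tokenizer.convert_ids_to_tokens

def _patched_convert_ids_to_tokens(ids, skip_special_tokens=False):
    # The underlying method does not accept skip_special_tokens,
    # so call it without the keyword and filter specials here instead.
    tokens = _orig_convert(ids)
    if skip_special_tokens:
        specials = set(dec_tokenizer.all_special_tokens)
        tokens = [t for t in tokens if t not in specials]
    return tokens

dec_tokenizer.convert_ids_to_tokens = _patched_convert_ids_to_tokens

With the instance attribute shadowing the class method, dec_tokenizer.decode(res['input_ids']) should no longer raise the TypeError.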
Stable Reproduction Steps & Code

import json

from paddlenlp.transformers import AutoTokenizer

enc_tokenizer = AutoTokenizer.from_pretrained('ernie-3.0-medium-zh')
dec_tokenizer = AutoTokenizer.from_pretrained('gpt-cpm-small-cn-distill')
# dec_tokenizer.add_special_tokens({'pad_token': dec_tokenizer.eos_token})

# data and the helpers (template, get_*_result, get_label) are defined elsewhere
for i in range(5):
    j_dic = json.loads(data[i])
    feat = template(get_title_result(j_dic), get_asr_result(j_dic),
                    get_ocr_result(j_dic), get_faces_result(j_dic))
    label = get_label(j_dic)
    print(label)

    # res = enc_tokenizer(
    #     text=feat,
    #     max_length=512,
    #     padding=True,
    #     truncation=True,
    #     return_token_type_ids=True,
    #     return_attention_mask=True
    # )
    # print(res['input_ids'])
    # print(len(res['input_ids']))
    # print(enc_tokenizer.decode(res['input_ids']))
    # print()

    res = dec_tokenizer(
        text=label,
        max_length=256,
        padding=True,
        truncation=True,
        return_token_type_ids=False,
        return_attention_mask=True
    )
    print(res['input_ids'])
    print(len(res['input_ids']))
    print(dec_tokenizer.decode(res['input_ids']))
    print()
    print()

# help(tokenizer)
The code you provided is incomplete; please provide minimal reproduction code. The following lines depend on definitions that were not included:
j_dic = json.loads(data[i])
feat = template(get_title_result(j_dic), get_asr_result(j_dic),
                get_ocr_result(j_dic), get_faces_result(j_dic))
label = get_label(j_dic)
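For instance, a self-contained snippet along the following lines, with the unavailable helpers replaced by a placeholder string, should be enough to reproduce the decode error:

from paddlenlp.transformers import AutoTokenizer

dec_tokenizer = AutoTokenizer.from_pretrained('gpt-cpm-small-cn-distill')
dec_tokenizer.add_special_tokens({'pad_token': dec_tokenizer.eos_token})

res = dec_tokenizer(
    text='样例文本',  # placeholder; any short string should do
    max_length=256,
    padding=True,
    truncation=True,
    return_token_type_ids=False,
    return_attention_mask=True,
)
print(res['input_ids'])
print(dec_tokenizer.decode(res['input_ids']))  # TypeError raised here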