[Question]: ernie3 pretraining error #6716

Closed
ZTurboX opened this issue Aug 14, 2023 · 1 comment
Labels: question, triage

ZTurboX commented Aug 14, 2023

Please ask your question

Training ernie-3.0-tiny-micro-v2-zh by following the ernie-1.0 documentation fails with this error:
Traceback (most recent call last):
  File "run_pretrain.py", line 762, in <module>
    do_train(config)
  File "run_pretrain.py", line 459, in do_train
    train_data_loader, valid_data_loader, test_data_loader = create_pretrained_dataset(
  File "run_pretrain.py", line 73, in create_pretrained_dataset
    train_ds, valid_ds, test_ds = build_train_valid_test_datasets(
  File "/opt/llm_pretrain/data_tools/dataset_utils.py", line 621, in build_train_valid_test_datasets
    output = get_datasets_weights_and_num_samples(data_prefix, train_valid_test_num_samples)
  File "/opt/llm_pretrain/data_tools/dataset_utils.py", line 140, in get_datasets_weights_and_num_samples
    assert weight_sum > 0.0
AssertionError
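
The assertion fires when the dataset weights derived from `data_prefix` sum to zero. Below is a minimal sketch of that kind of check. It assumes, without confirmation from the traceback, that `data_prefix` is a flat list alternating weight and path, as in Megatron-LM-style data utilities. `normalize_weights` is a hypothetical name used only for illustration and is not the PaddleNLP function.

```python
def normalize_weights(data_prefix):
    """Hypothetical stand-in for the weight check; not the PaddleNLP code.

    Assumes data_prefix alternates weight and path:
    [weight_1, prefix_1, weight_2, prefix_2, ...].
    """
    if len(data_prefix) % 2 != 0:
        raise ValueError("expected alternating weight/prefix entries")
    weights = [float(w) for w in data_prefix[0::2]]
    prefixes = [str(p).strip() for p in data_prefix[1::2]]
    weight_sum = sum(weights)
    # Same condition as the failing assertion in the traceback.
    assert weight_sum > 0.0
    # Normalize so the weights sum to 1.
    return [w / weight_sum for w in weights], prefixes
```

Under this assumption, an empty list or all-zero weights would raise the same AssertionError, while `["1.0", "llm_data"]` would pass. That would point at the dataset prefix list being empty or carrying zero weights, which is worth checking before anything else.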

The data was built with this script:
python create_pretraining_data.py \
    --model_name ernie-3.0-tiny-micro-v2-zh \
    --tokenizer_name ErnieTokenizer \
    --input_path ./data/llm_data.jsonl \
    --split_sentences \
    --chinese \
    --cn_whole_word_segment \
    --cn_seg_func jieba \
    --output_prefix llm_data \
    --workers 32 \
    --log_interval 10000

The documentation says the output files are in npy and npz format, but mine are bin and idx. Is something wrong with my data processing?
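
To confirm which formats the preprocessing step actually wrote, a small sketch like the following lists the file suffixes found for a given output prefix. The helper name `output_suffixes` and the directory argument are illustrative assumptions. It only inspects file names and does not validate the contents.

```python
from pathlib import Path


def output_suffixes(directory, prefix):
    """Return sorted suffixes of files in `directory` whose names start with `prefix`."""
    return sorted(
        {
            p.suffix
            for p in Path(directory).iterdir()
            if p.is_file() and p.name.startswith(prefix)
        }
    )


if __name__ == "__main__":
    # "llm_data" is the --output_prefix from the command above; "." is assumed
    # to be the directory where the script was run.
    print(output_suffixes(".", "llm_data"))
```

If the result contains `.bin` and `.idx` rather than `.npy` and `.npz`, the preprocessing script and the documentation being followed may come from different versions.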

ZTurboX added the question label Aug 14, 2023

w5688414 commented May 8, 2024

Which versions of paddle and paddlenlp are you using?
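
One way to collect the versions is sketched below, using only the standard library. The distribution names passed in (`paddlepaddle`, `paddlepaddle-gpu`, `paddlenlp`) are assumptions about how the packages were installed. Any that are missing are reported as `None`.

```python
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Assumed distribution names; adjust to match the local installation.
    print(installed_versions(["paddlepaddle", "paddlepaddle-gpu", "paddlenlp"]))
```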

paddle-bot closed this as completed May 13, 2025