Description
Describe the bug
What the bug is and how to reproduce it, preferably with screenshots
[rank0]: Traceback (most recent call last):
[rank0]: File "/gpfs/home/int/qiufengwang/xinlong_fu/programs/swift/swift/cli/pt.py", line 5, in <module>
[rank0]: pt_main()
[rank0]: File "/gpfs/home/int/qiufengwang/xinlong_fu/programs/swift/swift/llm/train/pt.py", line 24, in pt_main
[rank0]: return SwiftPt(args).main()
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/gpfs/home/int/qiufengwang/xinlong_fu/programs/swift/swift/llm/base.py", line 49, in main
[rank0]: result = self.run()
[rank0]: ^^^^^^^^^^
[rank0]: File "/gpfs/home/int/qiufengwang/xinlong_fu/programs/swift/swift/llm/train/sft.py", line 96, in run
[rank0]: train_dataset, val_dataset = self._encode_dataset(train_dataset, val_dataset)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/gpfs/home/int/qiufengwang/xinlong_fu/programs/swift/swift/llm/train/sft.py", line 224, in _encode_dataset
[rank0]: train_dataset = packing_dataset_cls(
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/gpfs/home/int/qiufengwang/xinlong_fu/programs/swift/swift/llm/dataset/utils.py", line 307, in __init__
[rank0]: self.create_packed_dataset()
[rank0]: File "/gpfs/home/int/qiufengwang/xinlong_fu/programs/swift/swift/llm/dataset/utils.py", line 324, in create_packed_dataset
[rank0]: self.packing_dataset()
[rank0]: File "/gpfs/home/int/qiufengwang/xinlong_fu/programs/swift/swift/llm/dataset/utils.py", line 350, in packing_dataset
[rank0]: res, data = self.calculate_matched_group(self.template, data, is_finished=is_finished)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/gpfs/home/int/qiufengwang/xinlong_fu/programs/swift/swift/llm/dataset/utils.py", line 145, in calculate_matched_group
[rank0]: packed = template.packing_row(row)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/gpfs/home/int/qiufengwang/xinlong_fu/programs/swift/swift/llm/template/base.py", line 522, in packing_row
[rank0]: packed.update(self._data_collator_mm_data([r[0] for r in row]))
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/gpfs/home/int/qiufengwang/xinlong_fu/programs/swift/swift/llm/template/base.py", line 1601, in _data_collator_mm_data
[rank0]: res['pixel_values'] = torch.concat(pixel_values)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: TypeError: expected Tensor as element 0 in argument 0, but got list
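The final frame suggests that the `pixel_values` sequence passed to `torch.concat` holds a Python list at index 0 instead of a tensor. As a minimal diagnostic sketch (not part of ms-swift), the hypothetical helper `find_list_pixel_values` below scans encoded rows and reports which ones carry a list-typed `pixel_values`. It assumes each row is a dict with an optional `pixel_values` key; that layout is inferred from the traceback and has not been checked against the library source.

```python
from typing import Any, Dict, List


def find_list_pixel_values(rows: List[Dict[str, Any]]) -> List[int]:
    """Return indices of rows whose 'pixel_values' is a plain list or tuple.

    Hypothetical helper: the row layout (a dict with an optional
    'pixel_values' key) is an assumption inferred from the traceback.
    """
    bad = []
    for i, row in enumerate(rows):
        value = row.get("pixel_values")
        # torch.concat expects every element to be a Tensor; a list or
        # tuple in this position would trigger the TypeError shown above.
        if isinstance(value, (list, tuple)):
            bad.append(i)
    return bad


if __name__ == "__main__":
    # Toy rows with placeholder values, not real encoded samples.
    sample = [
        {"input_ids": [1, 2], "pixel_values": [[0.0, 1.0]]},  # list-typed
        {"input_ids": [3]},  # text-only row without pixel_values
    ]
    print(find_list_pixel_values(sample))
```

Running a check like this on a few encoded samples before packing could help tell whether the list comes from the template's encode step or from the packing step itself.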
#!/bin/bash
NPROC_PER_NODE=4 \
CUDA_VISIBLE_DEVICES=0,1,2,3 \
swift pt \
    --model $XLF/downloads/models/AIDC-AI/Ovis2-2B \
    --dataset $XLF/downloads/datasets/VisualStar/visualwebinstruct/pt-event-0608-01-230M.jsonl \
    --output_dir $XLF/scripts/logs/Ovis2-2B-CPT/C230 \
    --add_version false \
    --train_type full \
    --torch_dtype bfloat16 \
    --learning_rate 1e-5 \
    --warmup_ratio 0.05 \
    --freeze_llm false \
    --freeze_vit false \
    --freeze_aligner false \
    --num_train_epochs 3 \
    --gradient_accumulation_steps 2 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --dataset_num_proc 8 \
    --dataloader_num_workers 4 \
    --split_dataset_ratio 0 \
    --max_length 4096 \
    --truncation_strategy delete \
    --attn_impl flash_attn \
    --packing true \
    --save_strategy epoch \
    --save_steps 1 \
    --save_only_model true \
    --eval_strategy epoch \
    --eval_steps 1 \
    --logging_steps 1 \
    --deepspeed zero3
#!/bin/bash
#SBATCH --job-name=swift
#SBATCH --time=168:00:00
#SBATCH --output=/gpfs/home/int/qiufengwang/xinlong_fu/slurms/logs/%j.out
#SBATCH --error=/gpfs/home/int/qiufengwang/xinlong_fu/slurms/logs/%j.err
#SBATCH --partition=gpua800
#SBATCH --qos=4gpus
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-gpu=8
export PATH=$XLF/anaconda3/bin:$PATH
export PATH=$XLF/cuda/12.4/bin:$PATH
export LD_LIBRARY_PATH=$XLF/cuda/12.4/lib64:$LD_LIBRARY_PATH
export OMP_NUM_THREADS=4
source activate $XLF/conda/env/swift
cd $XLF/downloads/datasets/VisualStar/images
srun $XLF/scripts/swift-pt.sh
Your hardware and system info
Write your system info here, such as CUDA version, OS, GPU model, and torch version
(/gpfs/home/int/qiufengwang/xinlong_fu/conda/env/swift) [qiufengwang@xpxecdtn1 xinlong_fu]$ python -c "import sys; import torch; print('Python Version:', sys.version); print('CUDA Version:', torch.version.cuda); print('PyTorch Version:', torch.__version__); print('CXX11 ABI Enabled:', torch._C._GLIBCXX_USE_CXX11_ABI); print('CUDA Available:', torch.cuda.is_available()); print('GPU Count:', torch.cuda.device_count());"
Python Version: 3.11.11 (main, Dec 11 2024, 16:28:39) [GCC 11.2.0]
CUDA Version: 12.4
PyTorch Version: 2.6.0+cu124
CXX11 ABI Enabled: False
CUDA Available: False
GPU Count: 0
(/gpfs/home/int/qiufengwang/xinlong_fu/conda/env/swift) [qiufengwang@xpxecdtn1 xinlong_fu]$ pip show ms-swift
Name: ms_swift
Version: 3.6.0.dev0
Summary: Swift: Scalable lightWeight Infrastructure for Fine-Tuning
Home-page: https://github.com/modelscope/swift
Author: DAMO ModelScope teams
Author-email: [email protected]
License: Apache License 2.0
Location: /gpfs/home/int/qiufengwang/xinlong_fu/programs/swift
Editable project location: /gpfs/home/int/qiufengwang/xinlong_fu/programs/swift
Requires: accelerate, addict, aiohttp, attrdict, binpacking, charset_normalizer, cpm_kernels, dacite, datasets, einops, fastapi, gradio, importlib_metadata, jieba, matplotlib, modelscope, nltk, numpy, openai, oss2, pandas, peft, pillow, requests, rouge, safetensors, scipy, sentencepiece, simplejson, sortedcontainers, tensorboard, tiktoken, tqdm, transformers, transformers_stream_generator, trl, uvicorn, zstandard
Required-by:
Additional context
Add any other context about the problem here
None