Ovis-2B pre-training error #4818

Open
@XylonFu

Describe the bug
What the bug is and how to reproduce it, preferably with screenshots
[rank0]: Traceback (most recent call last):
[rank0]: File "/gpfs/home/int/qiufengwang/xinlong_fu/programs/swift/swift/cli/pt.py", line 5, in <module>
[rank0]: pt_main()
[rank0]: File "/gpfs/home/int/qiufengwang/xinlong_fu/programs/swift/swift/llm/train/pt.py", line 24, in pt_main
[rank0]: return SwiftPt(args).main()
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/gpfs/home/int/qiufengwang/xinlong_fu/programs/swift/swift/llm/base.py", line 49, in main
[rank0]: result = self.run()
[rank0]: ^^^^^^^^^^
[rank0]: File "/gpfs/home/int/qiufengwang/xinlong_fu/programs/swift/swift/llm/train/sft.py", line 96, in run
[rank0]: train_dataset, val_dataset = self._encode_dataset(train_dataset, val_dataset)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/gpfs/home/int/qiufengwang/xinlong_fu/programs/swift/swift/llm/train/sft.py", line 224, in _encode_dataset
[rank0]: train_dataset = packing_dataset_cls(
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/gpfs/home/int/qiufengwang/xinlong_fu/programs/swift/swift/llm/dataset/utils.py", line 307, in __init__
[rank0]: self.create_packed_dataset()
[rank0]: File "/gpfs/home/int/qiufengwang/xinlong_fu/programs/swift/swift/llm/dataset/utils.py", line 324, in create_packed_dataset
[rank0]: self.packing_dataset()
[rank0]: File "/gpfs/home/int/qiufengwang/xinlong_fu/programs/swift/swift/llm/dataset/utils.py", line 350, in packing_dataset
[rank0]: res, data = self.calculate_matched_group(self.template, data, is_finished=is_finished)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/gpfs/home/int/qiufengwang/xinlong_fu/programs/swift/swift/llm/dataset/utils.py", line 145, in calculate_matched_group
[rank0]: packed = template.packing_row(row)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/gpfs/home/int/qiufengwang/xinlong_fu/programs/swift/swift/llm/template/base.py", line 522, in packing_row
[rank0]: packed.update(self._data_collator_mm_data([r[0] for r in row]))
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/gpfs/home/int/qiufengwang/xinlong_fu/programs/swift/swift/llm/template/base.py", line 1601, in _data_collator_mm_data
[rank0]: res['pixel_values'] = torch.concat(pixel_values)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: TypeError: expected Tensor as element 0 in argument 0, but got list
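
The error message says that element 0 of the sequence passed to `torch.concat` is a Python list, not a tensor, so `pixel_values` appears to be nested at that point. Why it is nested when `--packing true` is used is not known from this traceback. The sketch below is only a debugging aid, not the swift implementation or a confirmed fix: `flatten_nested` and `describe_nesting` are hypothetical helper names, and plain objects stand in for tensors so that it runs without PyTorch.

```python
from typing import Any, Iterable, List


def flatten_nested(values: Iterable[Any]) -> List[Any]:
    """Recursively flatten lists/tuples, leaving every other object untouched."""
    flat: List[Any] = []
    for value in values:
        if isinstance(value, (list, tuple)):
            flat.extend(flatten_nested(value))
        else:
            flat.append(value)
    return flat


def describe_nesting(values: Iterable[Any]) -> List[str]:
    """Return the type name of each top-level element, for quick debugging."""
    return [type(value).__name__ for value in values]


if __name__ == "__main__":
    # Stand-ins for tensors; with PyTorch these would be torch.Tensor objects.
    a, b, c = object(), object(), object()
    nested = [[a, b], [c]]  # assumed shape: one inner list per packed sample
    print(describe_nesting(nested))     # ['list', 'list']
    print(len(flatten_nested(nested)))  # 3
```

If `describe_nesting(pixel_values)` reports `list` entries just before the failing call, flattening first (for example `torch.concat(flatten_nested(pixel_values))`) might avoid the `TypeError`, assuming every leaf is a tensor with compatible shapes. Whether that is the right fix for Ovis2 packing is for the maintainers to confirm.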

#!/bin/bash

NPROC_PER_NODE=4 \
CUDA_VISIBLE_DEVICES=0,1,2,3 \
swift pt \
    --model $XLF/downloads/models/AIDC-AI/Ovis2-2B \
    --dataset $XLF/downloads/datasets/VisualStar/visualwebinstruct/pt-event-0608-01-230M.jsonl \
    --output_dir $XLF/scripts/logs/Ovis2-2B-CPT/C230 \
    --add_version false \
    --train_type full \
    --torch_dtype bfloat16 \
    --learning_rate 1e-5 \
    --warmup_ratio 0.05 \
    --freeze_llm false \
    --freeze_vit false \
    --freeze_aligner false \
    --num_train_epochs 3 \
    --gradient_accumulation_steps 2 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --dataset_num_proc 8 \
    --dataloader_num_workers 4 \
    --split_dataset_ratio 0 \
    --max_length 4096 \
    --truncation_strategy delete \
    --attn_impl flash_attn \
    --packing true \
    --save_strategy epoch \
    --save_steps 1 \
    --save_only_model true \
    --eval_strategy epoch \
    --eval_steps 1 \
    --logging_steps 1 \
    --deepspeed zero3

#!/bin/bash
#SBATCH --job-name=swift
#SBATCH --time=168:00:00
#SBATCH --output=/gpfs/home/int/qiufengwang/xinlong_fu/slurms/logs/%j.out
#SBATCH --error=/gpfs/home/int/qiufengwang/xinlong_fu/slurms/logs/%j.err
#SBATCH --partition=gpua800
#SBATCH --qos=4gpus
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-gpu=8

export PATH=$XLF/anaconda3/bin:$PATH
export PATH=$XLF/cuda/12.4/bin:$PATH
export LD_LIBRARY_PATH=$XLF/cuda/12.4/lib64:$LD_LIBRARY_PATH

export OMP_NUM_THREADS=4

source activate $XLF/conda/env/swift

cd $XLF/downloads/datasets/VisualStar/images

srun $XLF/scripts/swift-pt.sh

Your hardware and system info
Write your system info here, such as CUDA version, operating system, GPU model, and torch version

(/gpfs/home/int/qiufengwang/xinlong_fu/conda/env/swift) [qiufengwang@xpxecdtn1 xinlong_fu]$ python -c "import sys; import torch; print('Python Version:', sys.version); print('CUDA Version:', torch.version.cuda); print('PyTorch Version:', torch.__version__); print('CXX11 ABI Enabled:', torch._C._GLIBCXX_USE_CXX11_ABI); print('CUDA Available:', torch.cuda.is_available()); print('GPU Count:', torch.cuda.device_count());"
Python Version: 3.11.11 (main, Dec 11 2024, 16:28:39) [GCC 11.2.0]
CUDA Version: 12.4
PyTorch Version: 2.6.0+cu124
CXX11 ABI Enabled: False
CUDA Available: False
GPU Count: 0
(/gpfs/home/int/qiufengwang/xinlong_fu/conda/env/swift) [qiufengwang@xpxecdtn1 xinlong_fu]$ pip show ms-swift
Name: ms_swift
Version: 3.6.0.dev0
Summary: Swift: Scalable lightWeight Infrastructure for Fine-Tuning
Home-page: https://github.com/modelscope/swift
Author: DAMO ModelScope teams
Author-email: [email protected]
License: Apache License 2.0
Location: /gpfs/home/int/qiufengwang/xinlong_fu/programs/swift
Editable project location: /gpfs/home/int/qiufengwang/xinlong_fu/programs/swift
Requires: accelerate, addict, aiohttp, attrdict, binpacking, charset_normalizer, cpm_kernels, dacite, datasets, einops, fastapi, gradio, importlib_metadata, jieba, matplotlib, modelscope, nltk, numpy, openai, oss2, pandas, peft, pillow, requests, rouge, safetensors, scipy, sentencepiece, simplejson, sortedcontainers, tensorboard, tiktoken, tqdm, transformers, transformers_stream_generator, trl, uvicorn, zstandard
Required-by:
(/gpfs/home/int/qiufengwang/xinlong_fu/conda/env/swift) [qiufengwang@xpxecdtn1 xinlong_fu]$
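
The output above shows `CUDA Available: False` and `GPU Count: 0`, which is what one would see if the command ran on a node with no visible GPUs. That it ran on a login node is an assumption, not something the output confirms. A minimal sketch for collecting the same information from inside the job follows. `collect_env` is a hypothetical helper name, and the PyTorch import is guarded so the script still runs where torch is not installed.

```python
import platform
import sys
from typing import Dict


def collect_env() -> Dict[str, str]:
    """Gather interpreter and (if installed) PyTorch/CUDA details as strings."""
    info = {
        "Hostname": platform.node(),
        "Python Version": sys.version.split()[0],
    }
    try:
        import torch  # optional: the report still works without PyTorch
    except ImportError:
        info["PyTorch Version"] = "not installed"
        return info
    info["PyTorch Version"] = torch.__version__
    info["CUDA Version"] = str(torch.version.cuda)
    info["CUDA Available"] = str(torch.cuda.is_available())
    info["GPU Count"] = str(torch.cuda.device_count())
    return info


if __name__ == "__main__":
    for key, value in collect_env().items():
        print(f"{key}: {value}")
```

Saved under a name of your choice (say `env_report.py`, a hypothetical file name) and launched with `srun` inside the same allocation as the training script, this should report the GPUs that the training processes actually see.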

Additional context
Add any other context about the problem here

None
