
Qwen2-VL-2B hits gradient explosion late in pretraining; other VLMs do not #4819

Open
@XylonFu

Describe the bug
What the bug is and how to reproduce it, preferably with screenshots
#!/bin/bash

NPROC_PER_NODE=4 \
CUDA_VISIBLE_DEVICES=0,1,2,3 \
swift pt \
    --model $XLF/downloads/models/Qwen/Qwen2-VL-2B \
    --dataset $XLF/downloads/datasets/VisualStar/visualwebinstruct/pt-event-0608-01-230M.jsonl \
    --output_dir $XLF/scripts/logs/Qwen2-VL-2B-CPT/C230 \
    --add_version false \
    --train_type full \
    --torch_dtype bfloat16 \
    --learning_rate 1e-5 \
    --warmup_ratio 0.05 \
    --freeze_llm false \
    --freeze_vit false \
    --freeze_aligner false \
    --num_train_epochs 1 \
    --gradient_accumulation_steps 2 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --dataset_num_proc 8 \
    --dataloader_num_workers 4 \
    --split_dataset_ratio 0 \
    --max_length 4096 \
    --truncation_strategy delete \
    --attn_impl flash_attn \
    --packing true \
    --save_strategy epoch \
    --save_steps 1 \
    --save_only_model true \
    --eval_strategy epoch \
    --eval_steps 1 \
    --logging_steps 1 \
    --deepspeed zero3
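For anyone comparing this run against other setups, the per-step batch implied by the launch script above can be worked out as follows. This is only a rough sketch: it assumes a single node with pure data parallelism across the 4 GPUs, and it treats `--max_length` as an upper bound on each packed sequence, so the token figure is a ceiling rather than a measured value. The function names are made up for illustration.

```python
# Values copied from the launch script in this issue.
nproc_per_node = 4                # NPROC_PER_NODE
per_device_train_batch_size = 4   # --per_device_train_batch_size
gradient_accumulation_steps = 2   # --gradient_accumulation_steps
max_length = 4096                 # --max_length (upper bound per packed sequence)


def sequences_per_step(n_gpus: int, per_device: int, grad_accum: int) -> int:
    """Sequences consumed per optimizer step under plain data parallelism."""
    return n_gpus * per_device * grad_accum


def max_tokens_per_step(n_sequences: int, max_len: int) -> int:
    """Upper bound on tokens per optimizer step if every sequence is full."""
    return n_sequences * max_len


global_batch = sequences_per_step(
    nproc_per_node, per_device_train_batch_size, gradient_accumulation_steps
)
token_budget = max_tokens_per_step(global_batch, max_length)

print(f"sequences per optimizer step: {global_batch}")   # 32
print(f"max tokens per optimizer step: {token_budget}")  # 131072
```

So each optimizer step sees at most 32 packed sequences, or about 131k tokens, at a peak learning rate of 1e-5.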

(Two screenshots attached.)
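To pin down the step at which the gradient norm blows up, something like the sketch below could scan a series of logged grad-norm values and flag outliers. It is not part of ms-swift: `find_grad_norm_spikes`, the window size, and the threshold factor are all hypothetical choices, and the sample values are invented for illustration, not taken from this run. How the values are read out of the training logs is left out on purpose.

```python
import math
from statistics import median


def find_grad_norm_spikes(grad_norms, window=5, factor=10.0):
    """Return indices whose grad norm is non-finite or exceeds
    `factor` times the median of the preceding `window` finite values."""
    spikes = []
    for i, value in enumerate(grad_norms):
        # NaN or inf is always treated as a blow-up.
        if not math.isfinite(value):
            spikes.append(i)
            continue
        # Use only finite values from the trailing window as the baseline.
        history = [v for v in grad_norms[max(0, i - window):i] if math.isfinite(v)]
        if len(history) < window:
            continue  # not enough history yet to judge
        if value > factor * median(history):
            spikes.append(i)
    return spikes


# Made-up example series: a stable norm near 1.0 with one spike.
example = [1.0, 1.2, 0.9, 1.1, 1.0, 1.05, 50.0, 1.1]
print(find_grad_norm_spikes(example))
```

Comparing against a trailing median rather than a fixed threshold keeps the check usable even when the baseline norm drifts over the course of training.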

Your hardware and system info
Write your system info here, e.g. CUDA version, operating system, GPU model, and torch version
(/gpfs/work/int/xinlongfu24/xinlong_fu/conda/env/swift) [xinlongfu24@xpxecdtn1 xinlong_fu]$ python -c "import sys; import torch; print('Python Version:', sys.version); print('CUDA Version:', torch.version.cuda); print('PyTorch Version:', torch.__version__); print('CXX11 ABI Enabled:', torch._C._GLIBCXX_USE_CXX11_ABI); print('CUDA Available:', torch.cuda.is_available()); print('GPU Count:', torch.cuda.device_count());"
Python Version: 3.11.11 (main, Dec 11 2024, 16:28:39) [GCC 11.2.0]
CUDA Version: 12.4
PyTorch Version: 2.6.0+cu124
CXX11 ABI Enabled: False
CUDA Available: False
GPU Count: 0
(/gpfs/work/int/xinlongfu24/xinlong_fu/conda/env/swift) [xinlongfu24@xpxecdtn1 xinlong_fu]$ pip show ms-swift
Name: ms_swift
Version: 3.5.3
Summary: Swift: Scalable lightWeight Infrastructure for Fine-Tuning
Home-page: https://github.com/modelscope/swift
Author: DAMO ModelScope teams
Author-email: [email protected]
License: Apache License 2.0
Location: /gpfs/work/int/xinlongfu24/xinlong_fu/programs/swift
Editable project location: /gpfs/work/int/xinlongfu24/xinlong_fu/programs/swift
Requires: accelerate, addict, aiohttp, attrdict, binpacking, charset_normalizer, cpm_kernels, dacite, datasets, einops, fastapi, gradio, importlib_metadata, jieba, matplotlib, modelscope, nltk, numpy, openai, oss2, pandas, peft, pillow, requests, rouge, safetensors, scipy, sentencepiece, simplejson, sortedcontainers, tensorboard, tiktoken, tqdm, transformers, transformers_stream_generator, trl, uvicorn, zstandard
Required-by:
(/gpfs/work/int/xinlongfu24/xinlong_fu/conda/env/swift) [xinlongfu24@xpxecdtn1 xinlong_fu]$
Additional context
Add any other context about the problem here

Labels: bug