
Too slow sft process #3971

Closed · hienhayho opened this issue Apr 24, 2025 · 2 comments

@hienhayho

Describe the bug
I'm fine-tuning Qwen2.5-3B-Instruct, but the fine-tuning process is very slow.

[Screenshot of the training run attached in the original issue]

Steps to reproduce

  1. Installation
git clone https://github.com/modelscope/ms-swift.git
cd ms-swift/
pip install -v -e .
pip install polars polars-lts-cpu deepspeed wandb datasets
  2. Prepare the data file

Run python gen_data.py

import polars as pl
from datasets import load_dataset
from tqdm import tqdm

# Download the instruction dataset from the Hugging Face Hub.
data = load_dataset(
    "BlossomsAI/reduced_vietnamese_instruction_dataset",
    split="train",
    cache_dir="cache_data",
)

# Keep only the instruction / input / output fields of each sample.
results = []
for d in tqdm(data, total=len(data)):
    results.append(
        {
            "instruction": d["instruction"],
            "input": d["input"],
            "output": d["output"],
        }
    )

# Write the samples as one JSON object per line (NDJSON / JSON Lines).
df = pl.DataFrame(results)
df.write_ndjson("data.jsonl")
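
To sanity-check the generated file before the training run, you can read it back with polars (a minimal sketch; the column names follow the script above):

import polars as pl

# Read the NDJSON back and confirm the expected columns are present.
df = pl.read_ndjson("data.jsonl")
print(df.columns)  # ['instruction', 'input', 'output']
print(df.height)   # number of samples
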
  3. Run the training script

Run bash sft_qwen2_5_3b.sh

#!/bin/bash

NPROC_PER_NODE=4 CUDA_VISIBLE_DEVICES=0,1,2,3 \
    swift sft \
    --model Qwen/Qwen2.5-3B-Instruct \
    --train_type lora \
    --dataset 'data.jsonl' \
    --torch_dtype bfloat16 \
    --report_to wandb \
    --num_train_epochs 1 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-4 \
    --lora_rank 8 \
    --deepspeed zero3 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --gradient_accumulation_steps 16 \
    --eval_steps 500 \
    --save_steps 500 \
    --save_total_limit 2 \
    --logging_steps 50 \
    --max_length 4096 \
    --output_dir output \
    --warmup_ratio 0.05 \
    --dataset_num_proc 1 \
    --dataloader_num_workers 4 \
    --use_hf true

Your hardware and system info

  • OS: Ubuntu 22.04.3 LTS
  • CUDA toolkit: 12.1
  • GPUs: 4× NVIDIA GeForce RTX 2080 Ti
  • torch: 2.7.0

Additional context

  • I can't use FlashAttention 2 since my GPUs (Turing) are not supported; a possible fallback is sketched below.
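
As a point of reference, a common fallback on pre-Ampere GPUs is PyTorch's SDPA attention, which transformers exposes through the attn_implementation argument. The snippet below is only a transformers-level sketch of that fallback, not how swift sft configures attention internally:

from transformers import AutoModelForCausalLM

# Sketch: load the base model with the SDPA attention backend instead of
# FlashAttention 2, which requires Ampere-or-newer GPUs.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-3B-Instruct",
    attn_implementation="sdpa",
)
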
@Jintao-Huang
Collaborator

total_batch_size = 4 (GPUs) × 2 (per-device batch) × 16 (gradient accumulation steps) = 128

ZeRO-3 is relatively slow, so I think this is normal.
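
For context, ZeRO-3 also shards the model parameters across GPUs, so every forward and backward pass needs extra all-gather communication compared with ZeRO-2. In rough DeepSpeed-config terms (an illustration only; the exact contents of ms-swift's built-in zero2/zero3 presets may differ):

# ZeRO-2: shards optimizer states and gradients across GPUs.
zero2_cfg = {"zero_optimization": {"stage": 2}}

# ZeRO-3: additionally shards the parameters themselves, adding
# all-gather traffic on every forward/backward pass.
zero3_cfg = {"zero_optimization": {"stage": 3}}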

@hienhayho
Author

Hi @Jintao-Huang, thanks a lot. I switched to --deepspeed zero2 and it's much faster.
