InternVL3-9B LoRA fine-tuning: dataset preprocessing is very slow (about 7 hours) #4076


Open
jxma20 opened this issue May 4, 2025 · 2 comments
@jxma20

jxma20 commented May 4, 2025

Dataset example

{
  "messages": [
    {
      "role": "user",
      "content": "<image>\nIs there a blue or green color cast in the photo?"
    },
    {
      "role": "assistant",
      "content": "Yes"
    }
  ],
  "images": [
    "/fine_tune/M_Database/1.jpg"
  ]
},
My dataset contains 78,170 samples in this format.
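
Since every sample points at an image on disk, it can save a long wasted preprocessing run to sanity-check the paths first. A minimal sketch, assuming swift_data.json is a top-level JSON array in the format above and that jq is available:

    # Print any image path referenced in the dataset that is missing on disk.
    jq -r '.[].images[]' /fine_tune/InternVL3-9B/swift_data.json |
        sort -u |
        while read -r img; do
            [ -f "$img" ] || echo "missing: $img"
        done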

Environment

RTX 3090 * 4
python 3.10.0
ms-swift 3.4.0

--lazy_tokenize

At first I did not notice this parameter. The official docs say it defaults to True for MLLM fine-tuning, meaning preprocessing is interleaved with training; in that mode my fine-tuning run would have taken about 11 days to finish.
So I set it to False, but the preprocessing stage is still very slow: even with dataset_num_proc=12 it takes roughly 7 hours to complete.

Fine-tuning command

export HF_DATASETS_CACHE="/fine_tune/cachefile/"
swift sft \
    --model /fine_tune/InternVL3-9B/ \
    --train_type lora \
    --dataset '/fine_tune/InternVL3-9B/swift_data.json' \
    --enable_cache True \
    --lazy_tokenize False \
    --dataset_num_proc 12 \
    --torch_dtype bfloat16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --learning_rate 1e-4 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --gradient_accumulation_steps 16 \
    --eval_steps 50 \
    --save_steps 50 \
    --save_total_limit 2 \
    --logging_steps 5 \
    --max_length 2048 \
    --output_dir output \
    --system 'You are a helpful assistant.' \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4
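
The environment lists four RTX 3090s, but the command above launches a single process. If the goal is data-parallel training on all four GPUs, ms-swift's example scripts launch via the NPROC_PER_NODE environment variable; a hedged sketch (the variable names follow the ms-swift examples, and the flag list is abbreviated):

    # Data-parallel LoRA fine-tuning, one process per GPU.
    CUDA_VISIBLE_DEVICES=0,1,2,3 \
    NPROC_PER_NODE=4 \
    swift sft \
        --model /fine_tune/InternVL3-9B/ \
        --train_type lora \
        --dataset '/fine_tune/InternVL3-9B/swift_data.json'
        # ...plus the remaining flags from the command above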

@Jintao-Huang
Collaborator

Fine-tuning and preprocessing overlap in time.

If you want to speed up the fine-tuning process, you can refer to this example: https://github.com/modelscope/ms-swift/blob/main/examples/train/packing/streaming.sh
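
For reference, a condensed sketch of the approach in the linked streaming.sh: stream the dataset and pack samples instead of pre-tokenizing everything up front. The --streaming and --packing flags come from that example; with streaming the dataset length is unknown in advance, so the step count has to be given explicitly (exact flag behavior may differ across ms-swift versions):

    # Stream and pack the dataset rather than running a full map beforehand.
    swift sft \
        --model /fine_tune/InternVL3-9B/ \
        --train_type lora \
        --dataset '/fine_tune/InternVL3-9B/swift_data.json' \
        --streaming true \
        --packing true \
        --max_length 2048 \
        --max_steps 1000
    # max_steps 1000 is a placeholder; pick a value matching the intended epochs.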

@jxma20
Author

jxma20 commented May 4, 2025

> Fine-tuning and preprocessing overlap in time.
>
> If you want to speed up the fine-tuning process, you can refer to this example: https://github.com/modelscope/ms-swift/blob/main/examples/train/packing/streaming.sh

Hi, thanks for the reply!

  1. From what I observed during the run, the program only loaded the model into GPU memory; GPU utilization stayed close to 0 the whole time, so in my case training and preprocessing are not overlapping. That is consistent with lazy_tokenize=False: the dataset is first fully preprocessed with map, and only then does fine-tuning start.
  2. The processing time may be related to InternVL's preprocessing logic being slower. But I did similar preprocessing earlier when fine-tuning Qwen2.5-VL, except that I wrote the preprocessing function myself and called datasets.map directly. On the same dataset, 12 processes finished in ten-odd minutes, whereas this run takes about 7 hours, which is a very large gap.
  3. Even at 7 hours I let the map run to completion, but when it finished it printed: Dataset filtered, origin length: 77389, filtered dataset length: 18755. I have seen a related issue in this project, but it got no reply; I would like to know why this happens and how to avoid it.
  4. Unfortunately, the program then crashed. After the map completed, it printed one sample's input_ids and labels_ids, then appeared to start another map operation, which failed immediately with the error below (a debugging sketch follows after the traceback):
     input_ids: [……]
     labels_ids: [……]
     Map (num_proc=12):   0%|          | 0/18755 [05:16<?, ? examples/s]
     Traceback (most recent call last):
       File "/DataB/mjx/.conda/envs/InternVL/lib/python3.10/site-packages/swift/cli/sft.py", line 7, in <module>
         sft_main()
       File "/DataB/mjx/.conda/envs/InternVL/lib/python3.10/site-packages/swift/llm/train/sft.py", line 281, in sft_main
         return SwiftSft(args).main()
       File "/DataB/mjx/.conda/envs/InternVL/lib/python3.10/site-packages/swift/llm/base.py", line 47, in main
         result = self.run()
       File "/DataB/mjx/.conda/envs/InternVL/lib/python3.10/site-packages/swift/llm/train/sft.py", line 121, in run
         train_dataset, val_dataset = self._encode_dataset(train_dataset, val_dataset)
       File "/DataB/mjx/.conda/envs/InternVL/lib/python3.10/site-packages/swift/llm/train/sft.py", line 273, in _encode_dataset
         self.train_msg['train_dataset'] = self._stat_dataset(train_dataset)
       File "/DataB/mjx/.conda/envs/InternVL/lib/python3.10/site-packages/swift/llm/train/sft.py", line 232, in _stat_dataset
         dataset = GetLengthPreprocessor()(dataset, num_proc=args.dataset_num_proc)
       File "/DataB/mjx/.conda/envs/InternVL/lib/python3.10/site-packages/swift/llm/dataset/preprocessor/core.py", line 305, in __call__
         dataset_mapped = dataset.map(
       File "/DataB/mjx/.conda/envs/InternVL/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 557, in wrapper
         out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
       File "/DataB/mjx/.conda/envs/InternVL/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3171, in map
         for rank, done, content in iflatmap_unordered(
       File "/DataB/mjx/.conda/envs/InternVL/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 721, in iflatmap_unordered
         raise RuntimeError(
     RuntimeError: One of the subprocesses has abruptly died during map operation. To debug the error, disable multiprocessing.
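
Following the error message's own advice, the quickest way to surface the underlying exception is to rerun preprocessing with multiprocessing disabled, so the failing worker's traceback is raised in the main process. A minimal sketch using only flags already present in the command above:

    # Rerun with a single preprocessing worker to get the real traceback.
    swift sft \
        --model /fine_tune/InternVL3-9B/ \
        --train_type lora \
        --dataset '/fine_tune/InternVL3-9B/swift_data.json' \
        --lazy_tokenize False \
        --dataset_num_proc 1
    # ...keep the remaining flags from the original command unchanged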

Looking forward to your reply!
