
swift infer: uneven GPU memory allocation across multiple GPUs on a single machine #3923


Open

wgzhendong opened this issue Apr 17, 2025 · 2 comments

Comments

@wgzhendong

As the title says: I am using swift infer to run inference over a test set with the qwen2.5-vl-72b model, and GPU memory allocation is uneven. GPU 0 is completely full while the other cards use less than half of their memory.

Script:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
MAX_PIXELS=1003520 \
swift infer \
    --model /njfs/train-material/models/Qwen2.5-VL-72B-Instruct \
    --infer_backend pt \
    --temperature 0 \
    --max_new_tokens 4096 \
    --val_dataset data/quality_check_porn/swift.jsonl \
    --result_path infer_result/swift.jsonl \
    --max_batch_size 4

[screenshot attached]
@Jintao-Huang
Collaborator

Please try: --attn_impl flash_attn or --infer_backend vllm --tensor_parallel_size xxx
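
For example, applying the second suggestion to the command above would look roughly like the sketch below. Setting --tensor_parallel_size 8 to match the 8 visible GPUs is an assumption, and --max_batch_size is omitted on the assumption that the vLLM engine manages its own batching:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
MAX_PIXELS=1003520 \
swift infer \
    --model /njfs/train-material/models/Qwen2.5-VL-72B-Instruct \
    --infer_backend vllm \
    --tensor_parallel_size 8 \
    --temperature 0 \
    --max_new_tokens 4096 \
    --val_dataset data/quality_check_porn/swift.jsonl \
    --result_path infer_result/swift.jsonl

With tensor parallelism the model weights are sharded across the cards, which is what evens out the per-GPU memory footprint.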

@wgzhendong
Author

wgzhendong commented Apr 23, 2025

Please try: --attn_impl flash_attn or --infer_backend vllm --tensor_parallel_size xxx

Thanks for the reply. Switching to --infer_backend vllm --tensor_parallel_size xxx solved it. I have now run into another problem: I am doing LoRA fine-tuning on 4 machines with 8 A800 GPUs each, and it reports out-of-memory, but the memory is actually allocated unevenly across the cards; some cards use less than 60 GB. How can I resolve this?

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NNODES=$TORCH_NNODES \
NPROC_PER_NODE=$TORCH_NPROC_PER_NODE \
NODE_RANK=$TORCH_NODE_RANK \
MASTER_ADDR=$TORCH_MASTER_ADDR \
MASTER_PORT=$TORCH_MASTER_PORT \
MAX_PIXELS=1003520 \
swift sft \
    --model /models/Qwen2.5-VL-7B-Instruct \
    --dataset 'data/train_swift.jsonl' \
    --train_type lora \
    --torch_dtype bfloat16 \
    --num_train_epochs 4 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --learning_rate 2e-5 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --freeze_vit true \
    --gradient_accumulation_steps 8 \
    --eval_steps 100 \
    --save_steps 500 \
    --save_total_limit 5 \
    --logging_steps 5 \
    --max_length 10240 \
    --output_dir output/sft \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 32 \
    --attn_impl flash_attn \
    --deepspeed zero2

[screenshot attached]
