Thanks for the great work!
I'm trying to train the model using train_qwen.py.
UPDATE: this only happens when CUDA_VISIBLE_DEVICES lists more than one GPU.
I start training with the following bash script:
#!/bin/bash
export CUDA_VISIBLE_DEVICES=0,1
python ml-fastvlm/llava/train/train_qwen.py \
--model_name_or_path "ml-fastvlm/checkpoints/llava-fastvithd_0.5b_stage2" \
--data_path dummy_data.json \
--image_folder ml-fastvlm/images \
--output_dir ./checkpoints/test-run-fastvla \
--num_train_epochs 1 \
--vision_tower mobileclip
However, I keep getting the same error, even after trying different environments:
(fastvlm) [ml-fastvlm]$ bash train_fastvlm.sh
mobileclip_l_1024 is already loaded, `load_model` called again, skipping.
The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`
/codes/nader/backup/ml-fastvlm/llava/train/train_qwen.py:1219: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `LLaVATrainer.__init__`. Use `processing_class` instead.
trainer = LLaVATrainer(model=model,
0%| | 0/8 [00:00<?, ?it/s]Traceback (most recent call last):
File "/codes/nader/backup/ml-fastvlm/llava/train/train_qwen.py", line 1249, in <module>
train()
File "/codes/nader/backup/ml-fastvlm/llava/train/train_qwen.py", line 1227, in train
trainer.train()
File "/anaconda3/envs/fastvlm/lib/python3.10/site-packages/transformers/trainer.py", line 2171, in train
return inner_training_loop(
File "/anaconda3/envs/fastvlm/lib/python3.10/site-packages/transformers/trainer.py", line 2531, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
File "/anaconda3/envs/fastvlm/lib/python3.10/site-packages/transformers/trainer.py", line 3675, in training_step
loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
File "/anaconda3/envs/fastvlm/lib/python3.10/site-packages/transformers/trainer.py", line 3731, in compute_loss
outputs = model(**inputs)
File "/anaconda3/envs/fastvlm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/anaconda3/envs/fastvlm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/anaconda3/envs/fastvlm/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 193, in forward
outputs = self.parallel_apply(replicas, inputs, module_kwargs)
File "/anaconda3/envs/fastvlm/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 212, in parallel_apply
return parallel_apply(
File "/anaconda3/envs/fastvlm/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 126, in parallel_apply
output.reraise()
File "/anaconda3/envs/fastvlm/lib/python3.10/site-packages/torch/_utils.py", line 733, in reraise
raise exception
StopIteration: Caught StopIteration in replica 0 on device 0.
Original Traceback (most recent call last):
File "/anaconda3/envs/fastvlm/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 96, in _worker
output = module(*input, **kwargs)
File "/anaconda3/envs/fastvlm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/anaconda3/envs/fastvlm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "ml-fastvlm/llava/model/language_model/llava_qwen.py", line 82, in forward
) = self.prepare_inputs_labels_for_multimodal(
File "ml-fastvlm/llava/model/llava_arch.py", line 210, in prepare_inputs_labels_for_multimodal
image_features = self.encode_images(images)
File "ml-fastvlm/llava/model/llava_arch.py", line 142, in encode_images
image_features = self.get_model().get_vision_tower()(images)
File "/anaconda3/envs/fastvlm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/anaconda3/envs/fastvlm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "ml-fastvlm/llava/model/multimodal_encoder/mobileclip_encoder.py", line 72, in forward
return self.forward_images(images)
File "ml-fastvlm/llava/model/multimodal_encoder/mobileclip_encoder.py", line 85, in forward_images
image_forward_outs = self.vision_tower(images.to(device=self.device, dtype=self.dtype), return_image_embeddings=True)
File "ml-fastvlm/llava/model/multimodal_encoder/mobileclip_encoder.py", line 100, in device
return next(self.vision_tower.parameters()).device
StopIteration
0%| | 0/8 [00:01<?, ?it/s]
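It looks like the vision encoder reports no parameters at the point where its device property is queried, since next(self.vision_tower.parameters()) raises StopIteration. However, when I counted the non-zero entries in the first parameter of the vision tower, I got a large positive number, so the weights do appear to be loaded.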
import torch
first_param = next(model.vision_tower.parameters())  # first parameter tensor only
nonzero = torch.sum(torch.abs(first_param.data) > 0).item()  # count of non-zero entries
print("Non-zero parameters in vision_tower:", nonzero)
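One possible explanation, offered as a hedged sketch rather than a confirmed diagnosis: the traceback passes through torch/nn/parallel/data_parallel.py, so the model is being run under nn.DataParallel once more than one GPU is visible. In recent PyTorch releases, the per-GPU replicas that DataParallel creates do not list their weights under parameters(); the weights are attached as plain tensor attributes instead. That would let next(self.vision_tower.parameters()) raise StopIteration inside a replica even though the original module is fully loaded, which matches the non-zero check above. The helper below, first_tensor, is a hypothetical name and not part of ml-fastvlm. It looks for any tensor the module holds, and the ParamFree class only imitates a module whose weights are plain attributes.

```python
import torch
from torch import nn


def first_tensor(module: nn.Module):
    """Return some tensor held by `module`, or None if it holds none.

    Hypothetical helper (an assumption, not ml-fastvlm code).
    """
    # Normal case: registered parameters, then registered buffers.
    for tensor in module.parameters():
        return tensor
    for tensor in module.buffers():
        return tensor
    # Fallback: plain tensor attributes stored directly on any submodule.
    for submodule in module.modules():
        for value in vars(submodule).values():
            if torch.is_tensor(value):
                return value
    return None


class ParamFree(nn.Module):
    """Toy module whose only weight is a plain tensor attribute."""

    def __init__(self):
        super().__init__()
        # Not an nn.Parameter, so parameters() stays empty.
        self.weight = torch.zeros(2, dtype=torch.float16)


toy = ParamFree()
print(next(toy.parameters(), None))  # None: nothing registered as a parameter
found = first_tensor(toy)
print(found.device, found.dtype)     # cpu torch.float16
```

A device or dtype property on the encoder could read from first_tensor(self.vision_tower) and fall back to a default when it returns None. Launching with a distributed launcher such as torchrun, so that each process drives a single GPU and DataParallel is not involved, may be another option worth trying.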