StopIteration error at "return next(self.vision_tower.parameters()).device" #66

Thanks for the great work!

I’m trying to train the model using train_qwen.py. I start training with the following bash script:

UPDATE: this only happens when CUDA_VISIBLE_DEVICES exposes more than one GPU; with a single visible GPU the error does not occur.

#!/bin/bash
export CUDA_VISIBLE_DEVICES=0,1

python ml-fastvlm/llava/train/train_qwen.py \
  --model_name_or_path "ml-fastvlm/checkpoints/llava-fastvithd_0.5b_stage2" \
  --data_path dummy_data.json \
  --image_folder ml-fastvlm/images \
  --output_dir ./checkpoints/test-run-fastvla \
  --num_train_epochs 1 \
  --vision_tower mobileclip
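
As far as I can tell, with two visible GPUs and a plain python launch (no torchrun/deepspeed) the HF Trainer falls back to torch.nn.DataParallel, and that wrapper is exactly what shows up in the traceback below. A quick sanity check of that condition:

import torch
import torch.distributed as dist

# prints 2 with CUDA_VISIBLE_DEVICES=0,1, so the Trainer wraps the model in nn.DataParallel
print(torch.cuda.device_count())
# prints False under a plain python launch, i.e. no distributed backend is set up
print(dist.is_available() and dist.is_initialized())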

However, I keep getting the same error, even after trying different environments:


(fastvlm) [ml-fastvlm]$ bash train_fastvlm.sh 
mobileclip_l_1024 is already loaded, `load_model` called again, skipping.
The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`
/codes/nader/backup/ml-fastvlm/llava/train/train_qwen.py:1219: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `LLaVATrainer.__init__`. Use `processing_class` instead.
  trainer = LLaVATrainer(model=model,
  0%|                                                                                                                  | 0/8 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/codes/nader/backup/ml-fastvlm/llava/train/train_qwen.py", line 1249, in <module>
    train()
  File "/codes/nader/backup/ml-fastvlm/llava/train/train_qwen.py", line 1227, in train
    trainer.train()
  File "/anaconda3/envs/fastvlm/lib/python3.10/site-packages/transformers/trainer.py", line 2171, in train
    return inner_training_loop(
  File "/anaconda3/envs/fastvlm/lib/python3.10/site-packages/transformers/trainer.py", line 2531, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
  File "/anaconda3/envs/fastvlm/lib/python3.10/site-packages/transformers/trainer.py", line 3675, in training_step
    loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
  File "/anaconda3/envs/fastvlm/lib/python3.10/site-packages/transformers/trainer.py", line 3731, in compute_loss
    outputs = model(**inputs)
  File "/anaconda3/envs/fastvlm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/anaconda3/envs/fastvlm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/anaconda3/envs/fastvlm/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 193, in forward
    outputs = self.parallel_apply(replicas, inputs, module_kwargs)
  File "/anaconda3/envs/fastvlm/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 212, in parallel_apply
    return parallel_apply(
  File "/anaconda3/envs/fastvlm/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 126, in parallel_apply
    output.reraise()
  File "/anaconda3/envs/fastvlm/lib/python3.10/site-packages/torch/_utils.py", line 733, in reraise
    raise exception
StopIteration: Caught StopIteration in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/anaconda3/envs/fastvlm/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 96, in _worker
    output = module(*input, **kwargs)
  File "/anaconda3/envs/fastvlm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/anaconda3/envs/fastvlm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "ml-fastvlm/llava/model/language_model/llava_qwen.py", line 82, in forward
    ) = self.prepare_inputs_labels_for_multimodal(
  File "ml-fastvlm/llava/model/llava_arch.py", line 210, in prepare_inputs_labels_for_multimodal
    image_features = self.encode_images(images)
  File "ml-fastvlm/llava/model/llava_arch.py", line 142, in encode_images
    image_features = self.get_model().get_vision_tower()(images)
  File "/anaconda3/envs/fastvlm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/anaconda3/envs/fastvlm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "ml-fastvlm/llava/model/multimodal_encoder/mobileclip_encoder.py", line 72, in forward
    return self.forward_images(images)
  File "ml-fastvlm/llava/model/multimodal_encoder/mobileclip_encoder.py", line 85, in forward_images
    image_forward_outs = self.vision_tower(images.to(device=self.device, dtype=self.dtype), return_image_embeddings=True)
  File "ml-fastvlm/llava/model/multimodal_encoder/mobileclip_encoder.py", line 100, in device
    return next(self.vision_tower.parameters()).device
StopIteration

  0%|          | 0/8 [00:01<?, ?it/s]

It looks like the vision tower's parameters() iterator comes back empty during the forward pass. I checked the non-zero parameter count of the vision tower on the loaded model and it is a large positive number, so the weights themselves look fine:

# count non-zero entries in the first parameter tensor of the vision tower
nonzero = torch.sum(torch.abs(next(model.vision_tower.parameters()).data) > 0).item()
print("Non-zero parameters in vision_tower:", nonzero)
