StopIteration error at "return next(self.vision_tower.parameters()).device" #66

Thanks for the great work!

I’m trying to train the model using train_qwen.py. I start training with the following bash script:

UPDATE: this only happens when CUDA_VISIBLE_DEVICES exposes more than one GPU; with a single visible GPU the error does not occur.

#!/bin/bash
export CUDA_VISIBLE_DEVICES=0,1

python ml-fastvlm/llava/train/train_qwen.py \
  --model_name_or_path "ml-fastvlm/checkpoints/llava-fastvithd_0.5b_stage2" \
  --data_path dummy_data.json \
  --image_folder ml-fastvlm/images \
  --output_dir ./checkpoints/test-run-fastvla \
  --num_train_epochs 1 \
  --vision_tower mobileclip
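
As far as I can tell, with two visible GPUs and a plain python launch (no torchrun/deepspeed) the HF Trainer falls back to torch.nn.DataParallel, and that wrapper is exactly what shows up in the traceback below. A quick sanity check of that condition:

import torch
import torch.distributed as dist

# prints 2 with CUDA_VISIBLE_DEVICES=0,1, so the Trainer wraps the model in nn.DataParallel
print(torch.cuda.device_count())
# prints False under a plain python launch, i.e. no distributed backend is set up
print(dist.is_available() and dist.is_initialized())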

However, I keep getting the same error, even after trying different environments:


(fastvlm) [ml-fastvlm]$ bash train_fastvlm.sh 
mobileclip_l_1024 is already loaded, `load_model` called again, skipping.
The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`
/codes/nader/backup/ml-fastvlm/llava/train/train_qwen.py:1219: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `LLaVATrainer.__init__`. Use `processing_class` instead.
  trainer = LLaVATrainer(model=model,
  0%|                                                                                                                  | 0/8 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/codes/nader/backup/ml-fastvlm/llava/train/train_qwen.py", line 1249, in <module>
    train()
  File "/codes/nader/backup/ml-fastvlm/llava/train/train_qwen.py", line 1227, in train
    trainer.train()
  File "/anaconda3/envs/fastvlm/lib/python3.10/site-packages/transformers/trainer.py", line 2171, in train
    return inner_training_loop(
  File "/anaconda3/envs/fastvlm/lib/python3.10/site-packages/transformers/trainer.py", line 2531, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
  File "/anaconda3/envs/fastvlm/lib/python3.10/site-packages/transformers/trainer.py", line 3675, in training_step
    loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
  File "/anaconda3/envs/fastvlm/lib/python3.10/site-packages/transformers/trainer.py", line 3731, in compute_loss
    outputs = model(**inputs)
  File "/anaconda3/envs/fastvlm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/anaconda3/envs/fastvlm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/anaconda3/envs/fastvlm/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 193, in forward
    outputs = self.parallel_apply(replicas, inputs, module_kwargs)
  File "/anaconda3/envs/fastvlm/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 212, in parallel_apply
    return parallel_apply(
  File "/anaconda3/envs/fastvlm/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 126, in parallel_apply
    output.reraise()
  File "/anaconda3/envs/fastvlm/lib/python3.10/site-packages/torch/_utils.py", line 733, in reraise
    raise exception
StopIteration: Caught StopIteration in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/anaconda3/envs/fastvlm/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 96, in _worker
    output = module(*input, **kwargs)
  File "/anaconda3/envs/fastvlm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/anaconda3/envs/fastvlm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "ml-fastvlm/llava/model/language_model/llava_qwen.py", line 82, in forward
    ) = self.prepare_inputs_labels_for_multimodal(
  File "ml-fastvlm/llava/model/llava_arch.py", line 210, in prepare_inputs_labels_for_multimodal
    image_features = self.encode_images(images)
  File "ml-fastvlm/llava/model/llava_arch.py", line 142, in encode_images
    image_features = self.get_model().get_vision_tower()(images)
  File "/anaconda3/envs/fastvlm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/anaconda3/envs/fastvlm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "ml-fastvlm/llava/model/multimodal_encoder/mobileclip_encoder.py", line 72, in forward
    return self.forward_images(images)
  File "ml-fastvlm/llava/model/multimodal_encoder/mobileclip_encoder.py", line 85, in forward_images
    image_forward_outs = self.vision_tower(images.to(device=self.device, dtype=self.dtype), return_image_embeddings=True)
  File "ml-fastvlm/llava/model/multimodal_encoder/mobileclip_encoder.py", line 100, in device
    return next(self.vision_tower.parameters()).device
StopIteration

  0%|          | 0/8 [00:01<?, ?it/s]

It looks like the vision tower's parameters() iterator comes back empty during the forward pass. I checked the non-zero parameter count of the vision tower on the loaded model and it is a large positive number, so the weights themselves look fine:

# count non-zero entries in the first parameter tensor of the vision tower
nonzero = torch.sum(torch.abs(next(model.vision_tower.parameters()).data) > 0).item()
print("Non-zero parameters in vision_tower:", nonzero)
