RuntimeError: shape '[50, 1, 5, 64]' is invalid for input of size 8000 in Attention module during UNet forward with SVD-XT #447

Open · richelieuli opened this issue May 14, 2025 · 0 comments



Description:

When attempting to run the DiffusionEngine (UNet) forward pass with an input tensor derived from VAE-encoded video frames, we encounter a RuntimeError within the Attention module (sgm/modules/attention.py). This occurs consistently when processing 25-frame video clips, regardless of whether the UNet is called as part of the Sampler's process or directly after noise injection at a specific timestep.

The input size and target shape reported in the error message are significantly inconsistent with the tensor dimensions actually observed during debugging and with the target shape the code's own logic computes at the point of failure.

Context:

  • Model: SVD-XT (using scripts/sampling/configs/svd_xt.yaml configuration)
  • Task: Processing existing video clips (tested with 25 frames per clip) and extracting intermediate features (BFM, bottleneck feature map: the final denoised latent or intermediate UNet features).
  • Error Location: Deep within the UNet's Attention module during its forward pass.

Steps to Reproduce:

  1. Load the SVD-XT model using instantiate_from_config with the svd_xt.yaml config and load the corresponding svd_xt.safetensors checkpoint.
  2. Process a 25-frame video clip by:
    • Encoding the frames using the VAE (model.first_stage_model.encode) to obtain a latent tensor (shape [25, 4, 64, 64]).
    • Preparing conditioner inputs (c, uc) using model.conditioner, conditioning on the first frame and relevant parameters (fps, motion bucket).
    • Ensuring c and uc tensors are correctly repeated and rearranged to match the number of frames (Batch size becomes 25 for individual inputs, or 50 when conditional and unconditional are processed together).
    • Obtaining a tensor to input to the UNet's forward pass. This can be done either by:
      • Calling model.sampler() (which internally calls the UNet).
      • Or, by injecting noise into the VAE output latent (latent_x) at a specific timestep and calling the UNet's forward pass directly (model.model.diffusion_model.forward) with the noisy latent and conditions.
  3. The RuntimeError occurs during the execution of the UNet's forward pass, specifically within an Attention block. A minimal sketch of these steps is shown below.
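For concreteness, here is a minimal sketch of the steps above. It is modeled on scripts/sampling/simple_video_sample.py from this repository; the batch keys, the conditioner call, and the direct UNet keyword arguments are assumptions for illustration, not the exact script that produced the error. 512×512 frames are used so the latent matches the reported [25, 4, 64, 64].

import torch
from omegaconf import OmegaConf
from sgm.util import instantiate_from_config

num_frames = 25

# Step 1: load SVD-XT from its config; the checkpoint path comes from the yaml.
config = OmegaConf.load("scripts/sampling/configs/svd_xt.yaml")
model = instantiate_from_config(config.model).cuda().eval()

# Placeholder 25-frame clip in [-1, 1]; a real clip would be loaded from disk.
frames = torch.randn(num_frames, 3, 512, 512, device="cuda")

with torch.no_grad():
    # Step 2a: VAE-encode the frames -> latent of shape [25, 4, 64, 64].
    latent_x = model.first_stage_model.encode(frames)

    # Steps 2b/2c: conditioner inputs from the first frame plus fps and
    # motion-bucket parameters (batch keys are assumptions modeled on
    # simple_video_sample.py), then repeated to the 25-frame batch.
    batch = {
        "cond_frames_without_noise": frames[:1],
        "cond_frames": frames[:1] + 0.02 * torch.randn_like(frames[:1]),
        "fps_id": torch.tensor([5.0], device="cuda"),
        "motion_bucket_id": torch.tensor([127.0], device="cuda"),
        "cond_aug": torch.tensor([0.02], device="cuda"),
    }
    c, uc = model.conditioner.get_unconditional_conditioning(
        batch,
        batch_uc=batch,
        force_uc_zero_embeddings=["cond_frames", "cond_frames_without_noise"],
    )
    for key in c:
        if c[key].shape[0] == 1:
            c[key] = c[key].repeat_interleave(num_frames, dim=0)
            uc[key] = uc[key].repeat_interleave(num_frames, dim=0)

    # Step 2d (direct path): inject noise at one sigma and call the UNet
    # forward directly; the keyword names mirror
    # sgm/modules/diffusionmodules/video_model.py but are best treated as
    # assumptions.
    sigma = torch.full((num_frames,), 1.0, device="cuda")
    noisy = latent_x + torch.randn_like(latent_x) * sigma[:, None, None, None]
    _ = model.model.diffusion_model(
        torch.cat([noisy, c["concat"]], dim=1),  # SVD concatenates cond frames
        timesteps=sigma,
        context=c["crossattn"],
        y=c.get("vector"),
        num_video_frames=num_frames,
    )
    # Step 3: the RuntimeError fires inside sgm/modules/attention.py here.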

Expected Behavior:

The UNet forward pass should complete without a RuntimeError related to dimension mismatches in the Attention module.

Observed Behavior (Error Traceback):
Traceback (most recent call last):
  File "/home/li/Desktop/generative-models/scripts/BFM_final.py", line 475, in <module>  # or the line calling model.sampler()
    _ = unet(
  File "/home/li/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/li/Desktop/generative-models/sgm/modules/diffusionmodules/video_model.py", line 470, in forward
    h = module(
  File "/home/li/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/li/Desktop/generative-models/sgm/modules/diffusionmodules/openaimodel.py", line 116, in forward
    x = layer(
  File "/home/li/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/li/Desktop/generative-models/sgm/modules/video_attention.py", line 282, in forward
    x = block(
  File "/home/li/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/li/Desktop/generative-models/sgm/modules/attention.py", line 546, in forward
    return checkpoint(self._forward, x, context)
  File "/home/li/.local/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/home/li/.local/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/home/li/.local/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 107, in forward
    outputs = run_function(*args)
  File "/home/li/Desktop/generative-models/sgm/modules/attention.py", line 566, in _forward
    self.attn2(  # or self.attn1
  File "/home/li/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/li/Desktop/generative-models/sgm/modules/attention.py", line 407, in forward  # inner Attention forward
    q, k, v = map(
  File "/home/li/Desktop/generative-models/sgm/modules/attention.py", line 409, in <lambda>  # exact line with the failing reshape
    .reshape(b, t.shape[1], self.heads, self.dim_head)
RuntimeError: shape '[50, 1, 5, 64]' is invalid for input of size 8000

Debugging Information:

By adding print statements inside sgm/modules/attention.py at the failing reshape (line 409, within the lambda; the instrumentation is sketched after this list), we collected the following information:

  • The input tensor to the reshape operation (represented as t within the lambda for q, k, and v) has the shape torch.Size([50, 4096, 320]).
  • The total number of elements in the input tensor is 50 × 4096 × 320 = 65,536,000.
  • The reshape operation is .reshape(b, t.shape[1], self.heads, self.dim_head).
  • At this point, b=50, t.shape[1]=4096, self.heads=5, and self.dim_head=64 (from the model configuration).
  • Based on these values, the intended target shape according to the code's logic is [50, 4096, 5, 64].
  • The total number of elements required for this intended target shape is 50 × 4096 × 5 × 64 = 65,536,000.
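For reference, the instrumentation amounted to wrapping the quoted reshape in a small helper along these lines (a sketch; in practice the prints were added inline inside the lambda around line 409 of sgm/modules/attention.py):

import torch

def debug_reshape(t: torch.Tensor, b: int, heads: int, dim_head: int) -> torch.Tensor:
    # Print the actual input and the target about to be requested, then
    # perform the same reshape quoted in the traceback above.
    print(f"input shape={tuple(t.shape)}  numel={t.numel()}")
    print(f"target shape=({b}, {t.shape[1]}, {heads}, {dim_head})  "
          f"numel={b * t.shape[1] * heads * dim_head}")
    return t.reshape(b, t.shape[1], heads, dim_head)

Since the lambda is applied to q, k, and v in turn, prints placed this way fire up to three times per call; the shapes recorded above may therefore describe q while the reshape that actually raises is the one applied to k or v (see the next section).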

Inconsistency:

There is a major inconsistency between the error message reported by PyTorch and the actual tensor dimensions and calculated target shape:

  • Error Message Reports: Target shape [50, 1, 5, 64] (total elements 16,000), Input size 8,000.
  • Observed Debugging Shows: Input shape [50, 4096, 320] (total elements 65,536,000), Code's Target shape [50, 4096, 5, 64] (total elements 65,536,000).

The dimensions in the error message ([50, 1, 5, 64], input size 8,000) match neither the observed input shape nor the target shape computed by the code ([50, 4096, 5, 64]), so at first glance the error message appears to report incorrect details. One reading that does make the error's numbers self-consistent: if the failing reshape is acting on a context-derived k or v tensor of shape [25, 1, 320] (25 × 1 × 320 = 8,000 elements) while b = 50, the requested target would be exactly [50, 1, 5, 64]. That would indicate a batch mismatch between the latent input (50 after concatenating conditional and unconditional) and the conditioning tensors (25), with the debug prints above having captured only the q tensor. Either way, the core issue appears to be a dimension mismatch in the Attention or Video Attention logic that leads to an invalid reshape.
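The element counts above can be verified with plain arithmetic; the following check involves no assumptions beyond the shapes already quoted in this report:

# Element counts for the shapes quoted in this report.
assert 50 * 4096 * 320 == 65_536_000     # observed q input [50, 4096, 320]
assert 50 * 4096 * 5 * 64 == 65_536_000  # code's intended target [50, 4096, 5, 64]
assert 50 * 1 * 5 * 64 == 16_000         # target shape from the error message
assert 25 * 1 * 320 == 8_000             # a [25, 1, 320] tensor matches the
                                         # reported "input of size 8000"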

Configuration:

Using scripts/sampling/configs/svd_xt.yaml.

Environment:
Python version: 3.10.12 (main, Feb 4 2025, 14:57:36) [GCC 11.4.0]
PyTorch version: 2.0.1+cu117

PyTorch config:
Could not run torch.show_config(). (Note: PyTorch exposes this as torch.__config__.show(), so the call name, rather than the installation, is the likely cause of this failure.)

Checking CUDA availability:
CUDA is available: Yes
CUDA version (from torch): 11.7
cuDNN version (from torch): 8500
Number of GPUs available: 1
GPU 0: NVIDIA GeForce RTX 3090
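
For reproducibility, the environment figures above can be regenerated with a small probe along these lines (note that PyTorch's build-configuration dump is torch.__config__.show(); there is no torch.show_config(), which likely explains the failed call noted above):

import sys
import torch

print(f"Python version: {sys.version}")
print(f"PyTorch version: {torch.__version__}")
print(torch.__config__.show())  # build/config details; torch.show_config() does not exist
print(f"CUDA is available: {'Yes' if torch.cuda.is_available() else 'No'}")
print(f"CUDA version (from torch): {torch.version.cuda}")
print(f"cuDNN version (from torch): {torch.backends.cudnn.version()}")
print(f"Number of GPUs available: {torch.cuda.device_count()}")
for i in range(torch.cuda.device_count()):
    print(f"GPU {i}: {torch.cuda.get_device_name(i)}")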
