### Describe the bug
When applying a LoRA state dict that is already loaded into memory, a `load_lora_weights()` + `unload_lora_weights()` cycle takes ~5.5 seconds, with the majority of that time spent on repeated dtype queries.
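If repeated per-module dtype queries are indeed the hotspot, caching the query result per module would remove most of the cost. Below is a minimal pure-Python sketch of that pattern; the names (`Module`, `_query_dtype`, `cached_dtype`) are hypothetical stand-ins, not the diffusers API:

```python
import functools

class Module:
    """Toy stand-in for a module whose dtype query is expensive."""
    calls = 0  # counts how often the underlying query actually runs

    def __init__(self, dtype):
        self._dtype = dtype

    def _query_dtype(self):
        # Pretend this walks all parameters, which is the expensive part.
        Module.calls += 1
        return self._dtype

@functools.lru_cache(maxsize=None)
def cached_dtype(module):
    # Cached per module instance: the expensive walk runs at most once.
    return module._query_dtype()

m = Module("float16")
for _ in range(1000):
    assert cached_dtype(m) == "float16"
print(Module.calls)  # the underlying query ran only once
```

The same idea applies inside a load/unload cycle: query each module's dtype once up front, then reuse the cached value instead of re-deriving it for every LoRA layer.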
### Reproduction
```python
import time

import torch
import safetensors.torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True,
    local_files_only=True,
)
pipe = pipe.to("cuda")

# !wget https://civitai.com/api/download/models/135931 -O loras/pixel-art-xl.safetensors
lora_weights = safetensors.torch.load_file(
    "loras/pixel-art-xl.safetensors", device="cpu"
)

for _ in range(5):
    t0 = time.perf_counter()
    # .copy() because load_lora_weights() mutates the state dict in place.
    pipe.load_lora_weights(lora_weights.copy())
    pipe.unload_lora_weights()
    print("Load + unload cycle took:", time.perf_counter() - t0)
```
### Logs
```
Load + unload cycle took: 5.548293198924512
Load + unload cycle took: 6.468372649978846
Load + unload cycle took: 6.315054736100137
Load + unload cycle took: 5.443292624084279
Load + unload cycle took: 6.357059679925442
```
### System Info
- `diffusers` version: 0.21.0.dev0
- Platform: Linux-5.15.0-71-generic-x86_64-with-glibc2.35
- Python version: 3.10.12
- PyTorch version (GPU?): 2.0.1+cu117 (True)
- Huggingface_hub version: 0.16.4
- Transformers version: 4.33.1
- Accelerate version: 0.22.0
- xFormers version: 0.0.21
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: no
### Who can help?
@williamberman, @patrickvonplaten, and @sayakpaul