Replies: 1 comment
Hello @jeaho322, have you updated your NVIDIA drivers, and are they compatible with the Torch CUDA version installed in your environment? The error message does not really point towards a memory issue.
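One quick way to compare the two (a minimal sketch) is to print what PyTorch was built against and what the device reports, then compare against the driver version shown by nvidia-smi on the host:

import torch

print("Torch:", torch.__version__)
print("Built against CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
# The installed driver version itself is reported by `nvidia-smi`.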
Hi,
I’m trying to train a PatchCore model using the Folder class to create the datamodule.
Since I’m working in an offline environment, I manually loaded the backbone weights via model.model.feature_extractor.
However, during training, GPU memory usage keeps increasing within a single epoch — it accumulates batch by batch until it eventually causes an out-of-memory error.
I’ve already tried lowering the sampling_ratio, as well as reducing both the batch_size and num_neighbors, but the issue persists.
Any advice or insights would be greatly appreciated. Thanks in advance!
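For reference, a minimal way to confirm the per-batch growth (the logging helper below is illustrative and not part of the original script):

import torch

def log_cuda_memory(tag: str) -> None:
    # Report what the CUDA caching allocator currently holds, in MiB.
    allocated = torch.cuda.memory_allocated() / 2**20
    reserved = torch.cuda.memory_reserved() / 2**20
    print(f"[{tag}] allocated={allocated:.1f} MiB, reserved={reserved:.1f} MiB")

# Calling log_cuda_memory(f"batch {i}") after each training step shows whether
# usage really climbs batch by batch or only spikes later, e.g. at coreset subsampling.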
Custom Train Code
from anomalib.data import Folder, PredictDataset
from anomalib.models import Patchcore
from anomalib.metrics import Evaluator, F1Score
from anomalib.engine import Engine
import torch
import yaml
import pandas as pd


def load_yaml(path):
    with open(path, "r", encoding="utf-8") as f:
        return yaml.safe_load(f)


def make_datamodule(path, mode):
    cfg = load_yaml(path)
    if mode == 'train':
        datamodule = Folder(
            name=cfg['dataset']['name'],
            normal_dir=cfg['dataset']['normal_dir'],
            train_batch_size=cfg['train']['train_batch_size'],
        )
        return datamodule


def load_local_backbone_weights(feature_extractor, weight_path):
    # Load the checkpoint on CPU and unwrap a Lightning-style "state_dict" key if present.
    state = torch.load(weight_path, map_location="cpu")
    if isinstance(state, dict) and "state_dict" in state:
        state = state["state_dict"]
    # Assumed final step: copy the weights into the backbone, ignoring non-matching keys.
    feature_extractor.load_state_dict(state, strict=False)


if __name__ == '__main__':
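    # Hypothetical wiring of the pieces above, shown as a sketch only;
    # "config.yaml" and "backbone.pth" are placeholder paths, not the original values.
    cfg = load_yaml("config.yaml")
    datamodule = make_datamodule("config.yaml", mode="train")
    model = Patchcore(
        backbone=cfg["model"]["backbone"],
        layers=cfg["model"]["layers"],
        pre_trained=False,  # backbone weights are loaded manually below
        coreset_sampling_ratio=cfg["model"]["coreset_sampling_ratio"],
    )
    load_local_backbone_weights(model.model.feature_extractor, "backbone.pth")
    engine = Engine(accelerator="gpu", devices=1, max_epochs=1)
    engine.fit(model=model, datamodule=datamodule)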
Config
dataset:
  name: dataset
  normal_dir: /home/sample
  abnormal_dir: None
train:
  train_batch_size: 32
model:
  name: patchcore
  backbone: wide_resnet50_2
  pre_trained: true
  layers:
    - layer2
    - layer3
  coreset_sampling_ratio: 0.05
project:
  seed: 42
  path: ./results/patchcore_bottle
trainer:
  accelerator: gpu
  devices: 1
  max_epochs: 1
  precision: 16
  enable_progress_bar: true
logging:
  log_graph: false
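One detail worth flagging in the config above: PyYAML parses a bare None as the string "None", not as a real null, so if abnormal_dir were ever passed through to Folder it would arrive as a string; only null, ~, or an empty value map to Python None. A quick check:

import yaml

cfg = yaml.safe_load("abnormal_dir: None\nempty_value:\n")
print(repr(cfg["abnormal_dir"]))  # 'None' -> a string, not a null
print(repr(cfg["empty_value"]))   # None   -> a real null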
Error
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/container.py", line 215, in forward
input = module(input)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/namu/.local/lib/python3.10/site-packages/timm/models/resnet.py", line 257, in forward
x = self.conv3(x)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 460, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 456, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: NVML_SUCCESS == r INTERNAL ASSERT FAILED at "../c10/cuda/CUDACachingAllocator.cpp":1154, please report a bug to PyTorch.
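Since the failed assert is on an NVML call inside the CUDA caching allocator, a quick way to check whether NVML itself is reachable from Python (a sketch assuming the nvidia-ml-py / pynvml package is installed):

import pynvml  # provided by the nvidia-ml-py package

pynvml.nvmlInit()
print("Driver version:", pynvml.nvmlSystemGetDriverVersion())
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
meminfo = pynvml.nvmlDeviceGetMemoryInfo(handle)
print("GPU memory used/total:", meminfo.used, "/", meminfo.total)
pynvml.nvmlShutdown()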