Replies: 1 comment
Hello @jeaho322, have you updated your NVIDIA drivers, and are they compatible with the Torch CUDA version installed in your environment? The error message does not really point towards a memory issue.
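One quick way to compare the two (a minimal sketch) is to print what PyTorch was built against and what the device reports, then compare against the driver version shown by nvidia-smi on the host:

import torch

print("Torch:", torch.__version__)
print("Built against CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
# The installed driver version itself is reported by `nvidia-smi`.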
Hi,
I’m trying to train a PatchCore model using the Folder class to create the datamodule.
Since I’m working in an offline environment, I manually loaded the backbone weights via model.model.feature_extractor.
However, during training, GPU memory usage keeps increasing within a single epoch — it accumulates batch by batch until it eventually causes an out-of-memory error.
I’ve already tried lowering the sampling_ratio, as well as reducing both the batch_size and num_neighbors, but the issue persists.
Any advice or insights would be greatly appreciated. Thanks in advance!
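For reference, a minimal way to confirm the per-batch growth (the logging helper below is illustrative and not part of the original script):

import torch

def log_cuda_memory(tag: str) -> None:
    # Report what the CUDA caching allocator currently holds, in MiB.
    allocated = torch.cuda.memory_allocated() / 2**20
    reserved = torch.cuda.memory_reserved() / 2**20
    print(f"[{tag}] allocated={allocated:.1f} MiB, reserved={reserved:.1f} MiB")

# Calling log_cuda_memory(f"batch {i}") after each training step shows whether
# usage really climbs batch by batch or only spikes later, e.g. at coreset subsampling.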
Custom Train Code
from anomalib.data import Folder, PredictDataset
from anomalib.models import Patchcore
from anomalib.metrics import Evaluator, F1Score
from anomalib.engine import Engine
import torch
import yaml
import pandas as pd


def load_yaml(path):
    with open(path, "r", encoding="utf-8") as f:
        return yaml.safe_load(f)


def make_datamodule(path, mode):
    cfg = load_yaml(path)
    if mode == 'train':
        datamodule = Folder(
            name=cfg['dataset']['name'],
            normal_dir=cfg['dataset']['normal_dir'],
            train_batch_size=cfg['train']['train_batch_size'],
        )
        return datamodule


def load_local_backbone_weights(feature_extractor, weight_path):
    # Load the checkpoint on CPU and unwrap a Lightning-style "state_dict" key if present.
    state = torch.load(weight_path, map_location="cpu")
    if isinstance(state, dict) and "state_dict" in state:
        state = state["state_dict"]
    # Assumed final step: copy the weights into the backbone, ignoring non-matching keys.
    feature_extractor.load_state_dict(state, strict=False)


if __name__ == '__main__':
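    # Hypothetical wiring of the pieces above, shown as a sketch only;
    # "config.yaml" and "backbone.pth" are placeholder paths, not the original values.
    cfg = load_yaml("config.yaml")
    datamodule = make_datamodule("config.yaml", mode="train")
    model = Patchcore(
        backbone=cfg["model"]["backbone"],
        layers=cfg["model"]["layers"],
        pre_trained=False,  # backbone weights are loaded manually below
        coreset_sampling_ratio=cfg["model"]["coreset_sampling_ratio"],
    )
    load_local_backbone_weights(model.model.feature_extractor, "backbone.pth")
    engine = Engine(accelerator="gpu", devices=1, max_epochs=1)
    engine.fit(model=model, datamodule=datamodule)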
Config
dataset:
  name: dataset
  normal_dir: /home/sample
  abnormal_dir: None
train:
  train_batch_size: 32
model:
  name: patchcore
  backbone: wide_resnet50_2
  pre_trained: true
  layers:
    - layer2
    - layer3
  coreset_sampling_ratio: 0.05
project:
  seed: 42
  path: ./results/patchcore_bottle
trainer:
  accelerator: gpu
  devices: 1
  max_epochs: 1
  precision: 16
  enable_progress_bar: true
logging:
  log_graph: false
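One detail worth flagging in the config above: PyYAML parses a bare None as the string "None", not as a real null, so if abnormal_dir were ever passed through to Folder it would arrive as a string; only null, ~, or an empty value map to Python None. A quick check:

import yaml

cfg = yaml.safe_load("abnormal_dir: None\nempty_value:\n")
print(repr(cfg["abnormal_dir"]))  # 'None' -> a string, not a null
print(repr(cfg["empty_value"]))   # None   -> a real null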
Error
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/container.py", line 215, in forward
input = module(input)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/namu/.local/lib/python3.10/site-packages/timm/models/resnet.py", line 257, in forward
x = self.conv3(x)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 460, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 456, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: NVML_SUCCESS == r INTERNAL ASSERT FAILED at "../c10/cuda/CUDACachingAllocator.cpp":1154, please report a bug to PyTorch.
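Since the failed assert is on an NVML call inside the CUDA caching allocator, a quick way to check whether NVML itself is reachable from Python (a sketch assuming the nvidia-ml-py / pynvml package is installed):

import pynvml  # provided by the nvidia-ml-py package

pynvml.nvmlInit()
print("Driver version:", pynvml.nvmlSystemGetDriverVersion())
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
meminfo = pynvml.nvmlDeviceGetMemoryInfo(handle)
print("GPU memory used/total:", meminfo.used, "/", meminfo.total)
pynvml.nvmlShutdown()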