
Error in torch_compile.ipynb #1988

Open · KumoLiu opened this issue May 7, 2025 · 0 comments
Labels
bug Something isn't working

Comments

@KumoLiu
Contributor

KumoLiu commented May 7, 2025
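Summary: the torch_compile.ipynb CI run fails while executing the notebook on PyTorch 2.7.0+cu126. Compiling the model triggers an InductorError (an AssertionError raised from Inductor's make_fallback for aten.upsample_trilinear3d.default). Full CI log below: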

[2025-05-06T16:47:27.159Z] Running ./modules/torch_compile.ipynb
[2025-05-06T16:47:27.159Z] Checking PEP8 compliance...
[2025-05-06T16:47:27.719Z] Running notebook...
[2025-05-06T16:47:34.247Z] Unable to import quantization op. Please install modelopt library (https://github.com/NVIDIA/TensorRT-Model-Optimizer?tab=readme-ov-file#installation) to add support for compiling quantized models
[2025-05-06T16:47:37.507Z] MONAI version: 1.4.1rc1+46.gb58e883c
[2025-05-06T16:47:37.507Z] Numpy version: 1.26.4
[2025-05-06T16:47:37.507Z] Pytorch version: 2.7.0+cu126
[2025-05-06T16:47:37.507Z] MONAI flags: HAS_EXT = False, USE_COMPILED = False, USE_META_DICT = False
[2025-05-06T16:47:37.507Z] MONAI rev id: b58e883c887e0f99d382807550654c44d94f47bd
[2025-05-06T16:47:37.507Z] MONAI __file__: /home/jenkins/agent/workspace/Monai-notebooks/MONAI/monai/__init__.py
[2025-05-06T16:47:37.507Z] 
[2025-05-06T16:47:37.507Z] Optional dependencies:
[2025-05-06T16:47:38.067Z] Pytorch Ignite version: 0.4.11
[2025-05-06T16:47:38.067Z] ITK version: 5.4.3
[2025-05-06T16:47:38.067Z] Nibabel version: 5.3.2
[2025-05-06T16:47:38.067Z] scikit-image version: 0.19.3
[2025-05-06T16:47:38.067Z] scipy version: 1.14.0
[2025-05-06T16:47:38.067Z] Pillow version: 7.0.0
[2025-05-06T16:47:38.067Z] Tensorboard version: 2.16.2
[2025-05-06T16:47:38.067Z] gdown version: 5.2.0
[2025-05-06T16:47:38.067Z] TorchVision version: 0.22.0+cu126
[2025-05-06T16:47:38.067Z] tqdm version: 4.66.5
[2025-05-06T16:47:38.067Z] lmdb version: 1.6.2
[2025-05-06T16:47:38.067Z] psutil version: 6.0.0
[2025-05-06T16:47:38.067Z] pandas version: 2.2.2
[2025-05-06T16:47:38.067Z] einops version: 0.8.0
[2025-05-06T16:47:38.067Z] transformers version: 4.40.2
[2025-05-06T16:47:38.067Z] mlflow version: 2.22.0
[2025-05-06T16:47:38.067Z] pynrrd version: 1.1.3
[2025-05-06T16:47:38.067Z] clearml version: 2.0.0rc0
[2025-05-06T16:47:38.067Z] 
[2025-05-06T16:47:38.067Z] For details about installing the optional dependencies, please visit:
[2025-05-06T16:47:38.067Z]     https://docs.monai.io/en/latest/installation.html#installing-the-recommended-dependencies
[2025-05-06T16:47:38.067Z] 
[2025-05-06T16:47:41.332Z] papermill  --progress-bar --log-output -k python3
[2025-05-06T16:47:41.332Z] /usr/local/lib/python3.10/dist-packages/papermill/iorw.py:149: UserWarning: the file is not specified with any extension : -
[2025-05-06T16:47:41.332Z]   warnings.warn(f"the file is not specified with any extension : {os.path.basename(path)}")
[2025-05-06T16:55:02.641Z] 
Executing:   0%|          | 0/32 [00:00<?, ?cell/s]
Executing:   3%|▎         | 1/32 [00:00<00:29,  1.04cell/s]
Executing:  12%|█▎        | 4/32 [00:12<01:33,  3.33s/cell]
Executing:  19%|█▉        | 6/32 [00:22<01:44,  4.03s/cell]
Executing:  31%|███▏      | 10/32 [00:26<00:52,  2.38s/cell]
Executing:  50%|█████     | 16/32 [00:28<00:19,  1.23s/cell]
Executing:  56%|█████▋    | 18/32 [00:35<00:23,  1.71s/cell]
Executing:  62%|██████▎   | 20/32 [07:07<09:10, 45.85s/cell]
Executing:  69%|██████▉   | 22/32 [07:08<05:46, 34.69s/cell]
Executing:  72%|███████▏  | 23/32 [07:18<04:38, 30.98s/cell]
Executing:  72%|███████▏  | 23/32 [07:20<02:52, 19.17s/cell]
[2025-05-06T16:55:02.641Z] /usr/local/lib/python3.10/dist-packages/papermill/iorw.py:149: UserWarning: the file is not specified with any extension : -
[2025-05-06T16:55:02.641Z]   warnings.warn(f"the file is not specified with any extension : {os.path.basename(path)}")
[2025-05-06T16:55:02.641Z] Traceback (most recent call last):
[2025-05-06T16:55:02.641Z]   File "/usr/local/bin/papermill", line 8, in <module>
[2025-05-06T16:55:02.641Z]     sys.exit(papermill())
[2025-05-06T16:55:02.641Z]   File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
[2025-05-06T16:55:02.641Z]     return self.main(*args, **kwargs)
[2025-05-06T16:55:02.641Z]   File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1078, in main
[2025-05-06T16:55:02.641Z]     rv = self.invoke(ctx)
[2025-05-06T16:55:02.641Z]   File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
[2025-05-06T16:55:02.641Z]     return ctx.invoke(self.callback, **ctx.params)
[2025-05-06T16:55:02.641Z]   File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
[2025-05-06T16:55:02.641Z]     return __callback(*args, **kwargs)
[2025-05-06T16:55:02.641Z]   File "/usr/local/lib/python3.10/dist-packages/click/decorators.py", line 33, in new_func
[2025-05-06T16:55:02.641Z]     return f(get_current_context(), *args, **kwargs)
[2025-05-06T16:55:02.641Z]   File "/usr/local/lib/python3.10/dist-packages/papermill/cli.py", line 235, in papermill
[2025-05-06T16:55:02.641Z]     execute_notebook(
[2025-05-06T16:55:02.641Z]   File "/usr/local/lib/python3.10/dist-packages/papermill/execute.py", line 131, in execute_notebook
[2025-05-06T16:55:02.641Z]     raise_for_execution_errors(nb, output_path)
[2025-05-06T16:55:02.641Z]   File "/usr/local/lib/python3.10/dist-packages/papermill/execute.py", line 251, in raise_for_execution_errors
[2025-05-06T16:55:02.641Z]     raise error
[2025-05-06T16:55:02.641Z] papermill.exceptions.PapermillExecutionError: 
[2025-05-06T16:55:02.641Z] ---------------------------------------------------------------------------
[2025-05-06T16:55:02.641Z] Exception encountered at "In [11]":
[2025-05-06T16:55:02.641Z] ---------------------------------------------------------------------------
[2025-05-06T16:55:02.641Z] InductorError                             Traceback (most recent call last)
[2025-05-06T16:55:02.641Z] Cell In[11], line 14
[2025-05-06T16:55:02.641Z]      12 inputs, labels = batch_data["image"].to(device), batch_data["label"].to(device)
[2025-05-06T16:55:02.641Z]      13 optimizer.zero_grad()
[2025-05-06T16:55:02.641Z] ---> 14 loss, train_time = timed(lambda: train(model_opt, inputs, labels))  # noqa: B023
[2025-05-06T16:55:02.641Z]      15 optimizer.step()
[2025-05-06T16:55:02.641Z]      16 epoch_loss += loss.item()
[2025-05-06T16:55:02.641Z] 
[2025-05-06T16:55:02.641Z] Cell In[6], line 5, in timed(fn)
[2025-05-06T16:55:02.641Z]       3 end = torch.cuda.Event(enable_timing=True)
[2025-05-06T16:55:02.641Z]       4 start.record()
[2025-05-06T16:55:02.641Z] ----> 5 result = fn()
[2025-05-06T16:55:02.641Z]       6 end.record()
[2025-05-06T16:55:02.641Z]       7 torch.cuda.synchronize()
[2025-05-06T16:55:02.641Z] 
[2025-05-06T16:55:02.641Z] Cell In[11], line 14, in <lambda>()
[2025-05-06T16:55:02.641Z]      12 inputs, labels = batch_data["image"].to(device), batch_data["label"].to(device)
[2025-05-06T16:55:02.641Z]      13 optimizer.zero_grad()
[2025-05-06T16:55:02.641Z] ---> 14 loss, train_time = timed(lambda: train(model_opt, inputs, labels))  # noqa: B023
[2025-05-06T16:55:02.641Z]      15 optimizer.step()
[2025-05-06T16:55:02.641Z]      16 epoch_loss += loss.item()
[2025-05-06T16:55:02.641Z] 
[2025-05-06T16:55:02.641Z] Cell In[6], line 12, in train(model, inputs, labels)
[2025-05-06T16:55:02.641Z]      11 def train(model, inputs, labels):
[2025-05-06T16:55:02.641Z] ---> 12     outputs = model(inputs)
[2025-05-06T16:55:02.642Z]      13     loss_function = monai.losses.DiceCELoss(to_onehot_y=True, softmax=True)
[2025-05-06T16:55:02.642Z]      14     loss = loss_function(outputs, labels)
[2025-05-06T16:55:02.642Z] 
[2025-05-06T16:55:02.642Z] File /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1751, in Module._wrapped_call_impl(self, *args, **kwargs)
[2025-05-06T16:55:02.642Z]    1749     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
[2025-05-06T16:55:02.642Z]    1750 else:
[2025-05-06T16:55:02.642Z] -> 1751     return self._call_impl(*args, **kwargs)
[2025-05-06T16:55:02.642Z] 
[2025-05-06T16:55:02.642Z] File /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1762, in Module._call_impl(self, *args, **kwargs)
[2025-05-06T16:55:02.642Z]    1757 # If we don't have any hooks, we want to skip the rest of the logic in
[2025-05-06T16:55:02.642Z]    1758 # this function, and just call forward.
[2025-05-06T16:55:02.642Z]    1759 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
[2025-05-06T16:55:02.642Z]    1760         or _global_backward_pre_hooks or _global_backward_hooks
[2025-05-06T16:55:02.642Z]    1761         or _global_forward_hooks or _global_forward_pre_hooks):
[2025-05-06T16:55:02.642Z] -> 1762     return forward_call(*args, **kwargs)
[2025-05-06T16:55:02.642Z]    1764 result = None
[2025-05-06T16:55:02.642Z]    1765 called_always_called_hooks = set()
[2025-05-06T16:55:02.642Z] 
[2025-05-06T16:55:02.642Z] File /usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py:663, in _TorchDynamoContext.__call__.<locals>._fn(*args, **kwargs)
[2025-05-06T16:55:02.642Z]     659     raise e.with_traceback(None) from None
[2025-05-06T16:55:02.642Z]     660 except ShortenTraceback as e:
[2025-05-06T16:55:02.642Z]     661     # Failures in the backend likely don't have useful
[2025-05-06T16:55:02.642Z]     662     # data in the TorchDynamo frames, so we strip them out.
[2025-05-06T16:55:02.642Z] --> 663     raise e.remove_dynamo_frames() from None  # see TORCHDYNAMO_VERBOSE=1
[2025-05-06T16:55:02.642Z]     664 finally:
[2025-05-06T16:55:02.642Z]     665     # Restore the dynamic layer stack depth if necessary.
[2025-05-06T16:55:02.642Z]     666     set_eval_frame(None)
[2025-05-06T16:55:02.642Z] 
[2025-05-06T16:55:02.642Z] File /usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_fx.py:760, in _compile_fx_inner(gm, example_inputs, **graph_kwargs)
[2025-05-06T16:55:02.642Z]     758     raise
[2025-05-06T16:55:02.642Z]     759 except Exception as e:
[2025-05-06T16:55:02.642Z] --> 760     raise InductorError(e, currentframe()).with_traceback(
[2025-05-06T16:55:02.642Z]     761         e.__traceback__
[2025-05-06T16:55:02.642Z]     762     ) from None
[2025-05-06T16:55:02.642Z]     763 finally:
[2025-05-06T16:55:02.642Z]     764     TritonBundler.end_compile()
[2025-05-06T16:55:02.642Z] 
[2025-05-06T16:55:02.642Z] File /usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_fx.py:745, in _compile_fx_inner(gm, example_inputs, **graph_kwargs)
[2025-05-06T16:55:02.642Z]     743 TritonBundler.begin_compile()
[2025-05-06T16:55:02.642Z]     744 try:
[2025-05-06T16:55:02.642Z] --> 745     mb_compiled_graph = fx_codegen_and_compile(
[2025-05-06T16:55:02.642Z]     746         gm, example_inputs, inputs_to_check, **graph_kwargs
[2025-05-06T16:55:02.642Z]     747     )
[2025-05-06T16:55:02.642Z]     748     assert mb_compiled_graph is not None
[2025-05-06T16:55:02.642Z]     749     mb_compiled_graph._time_taken_ns = time.time_ns() - start_time
[2025-05-06T16:55:02.642Z] 
[2025-05-06T16:55:02.642Z] File /usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_fx.py:1295, in fx_codegen_and_compile(gm, example_inputs, inputs_to_check, **graph_kwargs)
[2025-05-06T16:55:02.642Z]    1291     from .compile_fx_subproc import _SubprocessFxCompile
[2025-05-06T16:55:02.642Z]    1293     scheme = _SubprocessFxCompile()
[2025-05-06T16:55:02.642Z] -> 1295 return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs)
[2025-05-06T16:55:02.642Z] 
[2025-05-06T16:55:02.642Z] File /usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_fx.py:1119, in _InProcessFxCompile.codegen_and_compile(self, gm, example_inputs, inputs_to_check, graph_kwargs)
[2025-05-06T16:55:02.642Z]    1117 metrics_helper = metrics.CachedMetricsHelper()
[2025-05-06T16:55:02.642Z]    1118 with V.set_graph_handler(graph):
[2025-05-06T16:55:02.642Z] -> 1119     graph.run(*example_inputs)
[2025-05-06T16:55:02.642Z]    1120     output_strides: list[Optional[tuple[_StrideExprStr, ...]]] = []
[2025-05-06T16:55:02.642Z]    1121     if graph.graph_outputs is not None:
[2025-05-06T16:55:02.642Z]    1122         # We'll put the output strides in the compiled graph so we
[2025-05-06T16:55:02.642Z]    1123         # can later return them to the caller via TracingContext
[2025-05-06T16:55:02.642Z] 
[2025-05-06T16:55:02.642Z] File /usr/local/lib/python3.10/dist-packages/torch/_inductor/graph.py:877, in GraphLowering.run(self, *args)
[2025-05-06T16:55:02.642Z]     875 def run(self, *args: Any) -> Any:  # type: ignore[override]
[2025-05-06T16:55:02.642Z]     876     with dynamo_timed("GraphLowering.run"):
[2025-05-06T16:55:02.642Z] --> 877         return super().run(*args)
[2025-05-06T16:55:02.642Z] 
[2025-05-06T16:55:02.642Z] File /usr/local/lib/python3.10/dist-packages/torch/fx/interpreter.py:171, in Interpreter.run(self, initial_env, enable_io_processing, *args)
[2025-05-06T16:55:02.642Z]     168     continue
[2025-05-06T16:55:02.642Z]     170 try:
[2025-05-06T16:55:02.642Z] --> 171     self.env[node] = self.run_node(node)
[2025-05-06T16:55:02.642Z]     172 except Exception as e:
[2025-05-06T16:55:02.642Z]     173     if self.extra_traceback:
[2025-05-06T16:55:02.642Z] 
[2025-05-06T16:55:02.642Z] File /usr/local/lib/python3.10/dist-packages/torch/_inductor/graph.py:1527, in GraphLowering.run_node(self, n)
[2025-05-06T16:55:02.642Z]    1525 else:
[2025-05-06T16:55:02.642Z]    1526     debug("")
[2025-05-06T16:55:02.642Z] -> 1527     result = super().run_node(n)
[2025-05-06T16:55:02.642Z]    1529 # require the same stride order for dense outputs,
[2025-05-06T16:55:02.642Z]    1530 # 1. user-land view() will not throw because inductor
[2025-05-06T16:55:02.642Z]    1531 # output different strides than eager
[2025-05-06T16:55:02.642Z]    (...)
[2025-05-06T16:55:02.642Z]    1534 # 2: as_strided ops, we need make sure its input has same size/stride with
[2025-05-06T16:55:02.642Z]    1535 # eager model to align with eager behavior.
[2025-05-06T16:55:02.642Z]    1536 as_strided_ops = [
[2025-05-06T16:55:02.642Z]    1537     torch.ops.aten.as_strided.default,
[2025-05-06T16:55:02.642Z]    1538     torch.ops.aten.as_strided_.default,
[2025-05-06T16:55:02.642Z]    (...)
[2025-05-06T16:55:02.642Z]    1541     torch.ops.aten.resize_as.default,
[2025-05-06T16:55:02.642Z]    1542 ]
[2025-05-06T16:55:02.642Z] 
[2025-05-06T16:55:02.642Z] File /usr/local/lib/python3.10/dist-packages/torch/fx/interpreter.py:240, in Interpreter.run_node(self, n)
[2025-05-06T16:55:02.642Z]     238 assert isinstance(args, tuple)
[2025-05-06T16:55:02.642Z]     239 assert isinstance(kwargs, dict)
[2025-05-06T16:55:02.642Z] --> 240 return getattr(self, n.op)(n.target, args, kwargs)
[2025-05-06T16:55:02.642Z] 
[2025-05-06T16:55:02.642Z] File /usr/local/lib/python3.10/dist-packages/torch/_inductor/graph.py:1169, in GraphLowering.call_function(self, target, args, kwargs)
[2025-05-06T16:55:02.642Z]    1163             decided_constraint = None  # type: ignore[assignment]
[2025-05-06T16:55:02.642Z]    1165     # for implicitly fallback ops, we conservatively requires
[2025-05-06T16:55:02.642Z]    1166     # contiguous input since some eager kernels does not
[2025-05-06T16:55:02.642Z]    1167     # support non-contiguous inputs. They may silently cause
[2025-05-06T16:55:02.642Z]    1168     # accuracy problems. Check https://github.com/pytorch/pytorch/issues/140452
[2025-05-06T16:55:02.642Z] -> 1169     make_fallback(target, layout_constraint=decided_constraint)
[2025-05-06T16:55:02.642Z]    1171 elif get_decompositions([target]):
[2025-05-06T16:55:02.642Z]    1172     # There isn't a good way to dynamically patch this in
[2025-05-06T16:55:02.642Z]    1173     # since AOT Autograd already ran.  The error message tells
[2025-05-06T16:55:02.642Z]    1174     # the user how to fix it.
[2025-05-06T16:55:02.642Z]    1175     raise MissingOperatorWithDecomp(target, args, kwargs)
[2025-05-06T16:55:02.642Z] 
[2025-05-06T16:55:02.642Z] File /usr/local/lib/python3.10/dist-packages/torch/_inductor/lowering.py:2023, in make_fallback(op, layout_constraint, warn, override_decomp)
[2025-05-06T16:55:02.642Z]    2018         torch._dynamo.config.suppress_errors = False
[2025-05-06T16:55:02.642Z]    2019         log.warning(
[2025-05-06T16:55:02.642Z]    2020             "A make_fallback error occurred in suppress_errors config,"
[2025-05-06T16:55:02.642Z]    2021             " and suppress_errors is being disabled to surface it."
[2025-05-06T16:55:02.642Z]    2022         )
[2025-05-06T16:55:02.642Z] -> 2023     raise AssertionError(
[2025-05-06T16:55:02.642Z]    2024         f"make_fallback({op}): a decomposition exists, we should switch to it."
[2025-05-06T16:55:02.642Z]    2025         " To fix this error, either add a decomposition to core_aten_decompositions (preferred)"
[2025-05-06T16:55:02.642Z]    2026         " or inductor_decompositions, and delete the corresponding `make_fallback` line."
[2025-05-06T16:55:02.642Z]    2027         " Get help from the inductor team if unsure, don't pick arbitrarily to unblock yourself.",
[2025-05-06T16:55:02.642Z]    2028     )
[2025-05-06T16:55:02.642Z]    2030 def register_fallback(op_overload):
[2025-05-06T16:55:02.642Z]    2031     add_needs_realized_inputs(op_overload)
[2025-05-06T16:55:02.642Z] 
[2025-05-06T16:55:02.642Z] InductorError: AssertionError: make_fallback(aten.upsample_trilinear3d.default): a decomposition exists, we should switch to it. To fix this error, either add a decomposition to core_aten_decompositions (preferred) or inductor_decompositions, and delete the corresponding `make_fallback` line. Get help from the inductor team if unsure, don't pick arbitrarily to unblock yourself.
[2025-05-06T16:55:02.642Z] 
[2025-05-06T16:55:02.642Z] Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
[2025-05-06T16:55:02.642Z] 
[2025-05-06T16:55:02.642Z] 
[2025-05-06T16:55:02.642Z] 
[2025-05-06T16:55:02.642Z] real	7m21.829s
[2025-05-06T16:55:02.642Z] user	8m18.975s
[2025-05-06T16:55:02.642Z] sys	5m22.797s
[2025-05-06T16:55:02.642Z] Check failed!
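The assertion comes from Inductor's lowering of aten.upsample_trilinear3d.default, so the failure should reproduce without MONAI. A minimal sketch to confirm this, under my assumption that any compiled graph containing a trilinear 3D upsample takes the same path (the function name and shapes below are illustrative, not taken from the notebook):

```python
# Hypothetical standalone repro: on torch 2.7.0, compiling any graph that
# lowers aten.upsample_trilinear3d.default should hit the same assertion.
import torch
import torch.nn.functional as F

def upsample3d(x):
    # mode="trilinear" on a 5D (N, C, D, H, W) tensor dispatches to
    # aten.upsample_trilinear3d
    return F.interpolate(x, scale_factor=2, mode="trilinear", align_corners=False)

compiled = torch.compile(upsample3d)
x = torch.randn(1, 1, 8, 8, 8, device="cuda")
out = compiled(x)  # expected: InductorError: AssertionError: make_fallback(...)
```

Until there is an upstream fix, one possible stopgap for CI (an assumption, not a verified fix) is to compile the notebook's model with a backend that skips Inductor lowering entirely:

```python
# Workaround sketch: "aot_eager" runs Dynamo + AOTAutograd but not Inductor,
# so the make_fallback assertion is never reached. This forfeits Inductor's
# codegen speedups, so it keeps the notebook green rather than fast.
model_opt = torch.compile(model, backend="aot_eager")
```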
KumoLiu added the bug (Something isn't working) label on May 7, 2025