Skip to content

Creating an error when a SLURM variable isn't found, usually because … #869

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

Tristanjmeyers
Copy link

Raising an environment error when the variable SLURM_LAUNCH_NODE_IPADDR is not found. This is usually because the user isn't running SLURM with srun, and instead using something like sbatch. So, instead of raising a TypeError, this will now raise an EnvironmentError and print some help.

PhysicsNeMo Pull Request

Description

Checklist

  • [ x] I am familiar with the Contributing Guidelines.
  • [ x] New or existing tests cover these changes.
  • [ x] The documentation is up to date with these changes.
  • [ x] The CHANGELOG.md is up to date with these changes.
  • [x ] An issue is linked to this pull request.

Dependencies

N/A

Copy link
Collaborator

@coreyjadams coreyjadams left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Tristanjmeyers for sending this in! Looks reasonable.

@coreyjadams
Copy link
Collaborator

/blossom-ci

@coreyjadams
Copy link
Collaborator

FYI this is failing CI due to formatting issues. It's nothing wrong really, black is just picky. I can't edit your branch, can you please apply this diff to the manager.py file?

❯ git diff
diff --git a/physicsnemo/distributed/manager.py b/physicsnemo/distributed/manager.py
index f9cc5be..2eb0c67 100644
--- a/physicsnemo/distributed/manager.py
+++ b/physicsnemo/distributed/manager.py
@@ -359,8 +359,9 @@ class DistributedManager(object):
         try:
             addr = os.environ.get("SLURM_LAUNCH_NODE_IPADDR")
         except TypeError:
-            raise EnvironmentError('SLURM variable "SLURM_LAUNCH_NODE_IPADDR" was not detected in the environment. Maybe you need to run with "srun"?')
-
+            raise EnvironmentError(
+                'SLURM variable "SLURM_LAUNCH_NODE_IPADDR" was not detected in the environment. Maybe you need to run with "srun"?'
+            )

         DistributedManager.setup(
             rank=rank,

(it's just moving the string to a new line...)

@Tristanjmeyers
Copy link
Author

I've done that, sorry for not linting!

@coreyjadams
Copy link
Collaborator

/blossom-ci

Comment on lines +360 to +361
addr = os.environ.get("SLURM_LAUNCH_NODE_IPADDR")
except TypeError:
Copy link
Collaborator

@coreyjadams coreyjadams May 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry @Tristanjmeyers - I looked at this more closely and we need to adjust this. os.environ.get will return None for missing variables, and raise no error. I think you can either:

  • Remove the try/except type error, and change to if addr is None: raise EnvErr...; or
  • change from os.environ.get('SLURM_LAUNCH_NODE_IPADDR') to os.environ['SLURM_LAUNCH_NODE_IPADDR'] and TypeError to KeyError.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants