Description
Bug description
I was transferring some checkpoints from a cluster that doesn't use Slurm to one that does. The checkpoints were trained with multiple GPUs/nodes, and I'm able to load one and resume training when using an interactive job. However, when I submit the same job with sbatch, it hangs and eventually times out.
I've seen this guide: https://lightning.ai/docs/fabric/2.4.0/guide/multi_node/slurm.html and added srun
to my submission script. However, even though all 4 devices appear to be initialized, the model still gets stuck before training starts and times out.
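Specifically, the pattern I followed from that guide is roughly the following (paraphrased from memory, so it may not match the docs word for word; my actual script is further down):

# launch pattern from the Fabric multi-node SLURM guide (roughly)
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4   # should match devices per node
#SBATCH --gres=gpu:4

srun python train.py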
A debug log and my submission script are included below. My sbatch script is a bit different in that it runs another shell script, which does a bunch of setup and then calls litgpt pretrain <...> (a simplified sketch of that launch line is shown after my script below), but I'm not sure this would be an issue.
I also tried making the Fabric initialization explicitly specify the number of nodes, devices, etc., like in the example in pretrain.py, but it didn't make a difference:
import lightning as L  # already imported in litgpt's pretrain.py

# strategy, precision, and logger are defined earlier in pretrain.py
fabric = L.Fabric(
    accelerator="gpu", devices=4, num_nodes=1, strategy=strategy, precision=precision, loggers=[logger]
)
Details:
My script:
#!/bin/bash
#SBATCH --job-name=train_model
#SBATCH --output=slurm_logs/%j.out
#SBATCH --time=2-00:00:00
#SBATCH --nodes=1
#SBATCH --gres=gpu:A6000:4
#SBATCH --ntasks-per-node=4
#SBATCH --mem=50G
#SBATCH --partition=general
#SBATCH --mail-user=<email>
#SBATCH --mail-type=ALL
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export NCCL_DEBUG=INFO
# Check if training type is provided
if [ $# -eq 0 ]; then
    echo "Usage: $0 <sequential|mixed> [training args...]"
    exit 1
fi

# Get the training type and remove it from args
TRAIN_TYPE=$1
shift

case $TRAIN_TYPE in
    sequential)
        srun ./pretrain_then_finetune.sh "$@"
        ;;
    mixed)
        srun ./mixed_pretraining_fixed.sh "$@"
        ;;
    *)
        echo "Invalid training type. Use 'sequential' or 'mixed'"
        exit 1
        ;;
esac
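For context, the relevant launch line inside pretrain_then_finetune.sh is the same command that appears in the error log below, just reformatted here for readability (the variables are set earlier in that script):

litgpt pretrain $model_name \
    --resume "${checkpoint_dir}/step${step}/lit_model.pth" \
    --tokenizer_dir "${checkpoint_dir}/step${step}" \
    --data FineWebDataset \
    --data.data_path $pretraining_data_dir \
    --data.val_data_path /data/datasets/hf_cache/data/fineweb/sample-350BT/val/0 \
    --data.num_workers $SLURM_GPUS_ON_NODE \
    --train.micro_batch_size $micro_batch_size \
    --train.max_seq_len $max_seq_len \
    --train.min_lr 1e-6 \
    --train.max_iters ${max_iters} \
    --train.max_additional_steps $max_additional_steps \
    --train.save_interval 500 \
    --train.log_interval $log_interval \
    --train.lr_warmup_fraction 0.01 \
    --train.lr_scheduler $lr_scheduler \
    --eval.interval 1000 \
    --out_dir $out_dir \
    --logs_dir $out_dir \
    --logger_name tensorboard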
Example error from the debug log:
[...previous stuff]
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/4
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/4
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/4
[rank: 0] Seed set to 42
[rank: 0] Seed set to 42
[rank: 0] Seed set to 42
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 4 processes
----------------------------------------------------------------------------------------------------
[rank: 0] Seed set to 42
babel-0-31:2324237:2324237 [0] NCCL INFO Bootstrap : Using ibs8:172.16.1.17<0>
babel-0-31:2324237:2324237 [0] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
babel-0-31:2324237:2324237 [0] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
babel-0-31:2324237:2324237 [0] NCCL INFO NET/Plugin: Using internal network plugin.
babel-0-31:2324237:2324237 [0] NCCL INFO cudaDriverVersion 12060
NCCL version 2.21.5+cuda12.4
/home/mengyan3/.local/lib/python3.9/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:690: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
warnings.warn(
babel-0-31:2324240:2324240 [1] NCCL INFO cudaDriverVersion 12060
babel-0-31:2324240:2324240 [1] NCCL INFO Bootstrap : Using ibs8:172.16.1.17<0>
babel-0-31:2324240:2324240 [1] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
babel-0-31:2324240:2324240 [1] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
babel-0-31:2324240:2324240 [1] NCCL INFO NET/Plugin: Using internal network plugin.
babel-0-31:2324240:2324524 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibs8:172.16.1.17<0>
babel-0-31:2324240:2324524 [1] NCCL INFO Using non-device net plugin version 0
babel-0-31:2324240:2324524 [1] NCCL INFO Using network IB
babel-0-31:2324240:2324524 [1] NCCL INFO ncclCommInitRank comm 0x555d6dc227b0 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 81000 commId 0xb886029f29f1a815 - Init START
babel-0-31:2324240:2324524 [1] NCCL INFO Setting affinity for GPU 1 to 2b,00000000,00000000,00000000,0000002b,00000000,00000000
babel-0-31:2324240:2324524 [1] NCCL INFO NVLS multicast support is not available on dev 1
babel-0-31:2324240:2324524 [1] NCCL INFO comm 0x555d6dc227b0 rank 1 nRanks 4 nNodes 1 localRanks 4 localRank 1 MNNVL 0
babel-0-31:2324240:2324524 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
babel-0-31:2324240:2324524 [1] NCCL INFO P2P Chunksize set to 524288
babel-0-31:2324240:2324524 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/CUMEM
babel-0-31:2324240:2324524 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/CUMEM
babel-0-31:2324240:2324524 [1] NCCL INFO Connected all rings
babel-0-31:2324240:2324524 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/CUMEM
babel-0-31:2324240:2324524 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/CUMEM
babel-0-31:2324240:2324524 [1] NCCL INFO Connected all trees
babel-0-31:2324240:2324524 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
babel-0-31:2324240:2324524 [1] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
babel-0-31:2324240:2324524 [1] NCCL INFO TUNER/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-tuner.so
babel-0-31:2324240:2324524 [1] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
babel-0-31:2324240:2324524 [1] NCCL INFO ncclCommInitRank comm 0x555d6dc227b0 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 81000 commId 0xb886029f29f1a815 - Init COMPLETE
[rank1]:[E1119 13:21:12.512786305 ProcessGroupNCCL.cpp:616] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800000 milliseconds before timing out.
babel-0-31:2324239:2324239 [3] NCCL INFO cudaDriverVersion 12060
babel-0-31:2324239:2324239 [3] NCCL INFO Bootstrap : Using ibs8:172.16.1.17<0>
babel-0-31:2324239:2324239 [3] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
babel-0-31:2324239:2324239 [3] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
babel-0-31:2324239:2324239 [3] NCCL INFO NET/Plugin: Using internal network plugin.
babel-0-31:2324239:2324522 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibs8:172.16.1.17<0>
babel-0-31:2324239:2324522 [3] NCCL INFO Using non-device net plugin version 0
babel-0-31:2324239:2324522 [3] NCCL INFO Using network IB
babel-0-31:2324239:2324522 [3] NCCL INFO ncclCommInitRank comm 0x5584021f39f0 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId e1000 commId 0xb886029f29f1a815 - Init START
babel-0-31:2324239:2324522 [3] NCCL INFO Setting affinity for GPU 3 to 2b,00000000,00000000,00000000,0000002b,00000000,00000000
babel-0-31:2324239:2324522 [3] NCCL INFO NVLS multicast support is not available on dev 3
babel-0-31:2324239:2324522 [3] NCCL INFO comm 0x5584021f39f0 rank 3 nRanks 4 nNodes 1 localRanks 4 localRank 3 MNNVL 0
babel-0-31:2324239:2324522 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
babel-0-31:2324239:2324522 [3] NCCL INFO P2P Chunksize set to 524288
babel-0-31:2324239:2324522 [3] NCCL INFO Channel 00/0 : 3[3] -> 0[0] via P2P/CUMEM
babel-0-31:2324239:2324522 [3] NCCL INFO Channel 01/0 : 3[3] -> 0[0] via P2P/CUMEM
babel-0-31:2324239:2324522 [3] NCCL INFO Connected all rings
babel-0-31:2324239:2324522 [3] NCCL INFO Channel 00/0 : 3[3] -> 2[2] via P2P/CUMEM
babel-0-31:2324239:2324522 [3] NCCL INFO Channel 01/0 : 3[3] -> 2[2] via P2P/CUMEM
babel-0-31:2324239:2324522 [3] NCCL INFO Connected all trees
babel-0-31:2324239:2324522 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
babel-0-31:2324239:2324522 [3] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
babel-0-31:2324239:2324522 [3] NCCL INFO TUNER/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-tuner.so
babel-0-31:2324239:2324522 [3] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
babel-0-31:2324239:2324522 [3] NCCL INFO ncclCommInitRank comm 0x5584021f39f0 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId e1000 commId 0xb886029f29f1a815 - Init COMPLETE
[rank3]:[E1119 13:21:12.512781555 ProcessGroupNCCL.cpp:616] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800000 milliseconds before timing out.
babel-0-31:2324238:2324238 [2] NCCL INFO cudaDriverVersion 12060
babel-0-31:2324238:2324238 [2] NCCL INFO Bootstrap : Using ibs8:172.16.1.17<0>
babel-0-31:2324238:2324238 [2] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
babel-0-31:2324238:2324238 [2] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
babel-0-31:2324238:2324238 [2] NCCL INFO NET/Plugin: Using internal network plugin.
babel-0-31:2324238:2324523 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibs8:172.16.1.17<0>
babel-0-31:2324238:2324523 [2] NCCL INFO Using non-device net plugin version 0
babel-0-31:2324238:2324523 [2] NCCL INFO Using network IB
babel-0-31:2324238:2324523 [2] NCCL INFO ncclCommInitRank comm 0x55880160e670 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId a1000 commId 0xb886029f29f1a815 - Init START
babel-0-31:2324238:2324523 [2] NCCL INFO Setting affinity for GPU 2 to 2b,00000000,00000000,00000000,0000002b,00000000,00000000
babel-0-31:2324238:2324523 [2] NCCL INFO NVLS multicast support is not available on dev 2
babel-0-31:2324238:2324523 [2] NCCL INFO comm 0x55880160e670 rank 2 nRanks 4 nNodes 1 localRanks 4 localRank 2 MNNVL 0
babel-0-31:2324238:2324523 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
babel-0-31:2324238:2324523 [2] NCCL INFO P2P Chunksize set to 524288
babel-0-31:2324238:2324523 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/CUMEM
babel-0-31:2324238:2324523 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/CUMEM
babel-0-31:2324238:2324523 [2] NCCL INFO Connected all rings
babel-0-31:2324238:2324523 [2] NCCL INFO Channel 00/0 : 2[2] -> 1[1] via P2P/CUMEM
babel-0-31:2324238:2324523 [2] NCCL INFO Channel 01/0 : 2[2] -> 1[1] via P2P/CUMEM
babel-0-31:2324238:2324523 [2] NCCL INFO Connected all trees
babel-0-31:2324238:2324523 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
babel-0-31:2324238:2324523 [2] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
babel-0-31:2324238:2324523 [2] NCCL INFO TUNER/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-tuner.so
babel-0-31:2324238:2324523 [2] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
babel-0-31:2324238:2324523 [2] NCCL INFO ncclCommInitRank comm 0x55880160e670 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId a1000 commId 0xb886029f29f1a815 - Init COMPLETE
[rank2]:[E1119 13:21:12.525244336 ProcessGroupNCCL.cpp:616] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800013 milliseconds before timing out.
[rank1]:[E1119 13:21:13.938073877 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank1]:[E1119 13:21:13.938095107 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 1] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank1]:[E1119 13:21:13.938100947 ProcessGroupNCCL.cpp:630] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E1119 13:21:13.938104817 ProcessGroupNCCL.cpp:636] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E1119 13:21:13.938073737 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 2] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank2]:[E1119 13:21:13.938094977 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 2] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank2]:[E1119 13:21:13.938100577 ProcessGroupNCCL.cpp:630] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E1119 13:21:13.938104557 ProcessGroupNCCL.cpp:636] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
[rank3]:[E1119 13:21:13.938073817 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 3] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank3]:[E1119 13:21:13.938094667 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 3] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank3]:[E1119 13:21:13.938100907 ProcessGroupNCCL.cpp:630] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank3]:[E1119 13:21:13.938104757 ProcessGroupNCCL.cpp:636] [Rank 3] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E1119 13:21:13.092845528 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800013 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fdf89a24446 in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7fdf8ad37772 in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7fdf8ad3ebb3 in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7fdf8ad4061d in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7fdfd36cd5c0 in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x89c02 (0x7fdfe2c89c02 in /lib64/libc.so.6)
frame #6: <unknown function> + 0x10ec40 (0x7fdfe2d0ec40 in /lib64/libc.so.6)
[rank1]:[E1119 13:21:13.092997168 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800000 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f3b4e1d4446 in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f3b4f4e7772 in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f3b4f4eebb3 in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f3b4f4f061d in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f3b97e7d5c0 in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x89c02 (0x7f3ba7489c02 in /lib64/libc.so.6)
frame #6: <unknown function> + 0x10ec40 (0x7f3ba750ec40 in /lib64/libc.so.6)
[rank3]:[E1119 13:21:13.092988978 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800000 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fe0c139c446 in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7fe0c26af772 in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7fe0c26b6bb3 in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7fe0c26b861d in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7fe10b0455c0 in /home/mengyan3/.local/lib/python3.9/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x89c02 (0x7fe11a689c02 in /lib64/libc.so.6)
frame #6: <unknown function> + 0x10ec40 (0x7fe11a70ec40 in /lib64/libc.so.6)
/data/tir/projects/tir3/users/mengyan3/all_in_one_pretraining/./pretrain_then_finetune.sh: line 184: 2324239 Aborted (core dumped) litgpt pretrain $model_name --resume "${checkpoint_dir}/step${step}/lit_model.pth" --tokenizer_dir "${checkpoint_dir}/step${step}" --data FineWebDataset --data.data_path $pretraining_data_dir --data.val_data_path /data/datasets/hf_cache/data/fineweb/sample-350BT/val/0 --data.num_workers $SLURM_GPUS_ON_NODE --train.micro_batch_size $micro_batch_size --train.max_seq_len $max_seq_len --train.min_lr 1e-6 --train.max_iters ${max_iters} --train.max_additional_steps $max_additional_steps --train.save_interval 500 --train.log_interval $log_interval --train.lr_warmup_fraction 0.01 --train.lr_scheduler $lr_scheduler --eval.interval 1000 --out_dir $out_dir --logs_dir $out_dir --logger_name tensorboard
/data/tir/projects/tir3/users/mengyan3/all_in_one_pretraining/./pretrain_then_finetune.sh: line 184: 2324238 Aborted (core dumped) litgpt pretrain $model_name --resume "${checkpoint_dir}/step${step}/lit_model.pth" --tokenizer_dir "${checkpoint_dir}/step${step}" --data FineWebDataset --data.data_path $pretraining_data_dir --data.val_data_path /data/datasets/hf_cache/data/fineweb/sample-350BT/val/0 --data.num_workers $SLURM_GPUS_ON_NODE --train.micro_batch_size $micro_batch_size --train.max_seq_len $max_seq_len --train.min_lr 1e-6 --train.max_iters ${max_iters} --train.max_additional_steps $max_additional_steps --train.save_interval 500 --train.log_interval $log_interval --train.lr_warmup_fraction 0.01 --train.lr_scheduler $lr_scheduler --eval.interval 1000 --out_dir $out_dir --logs_dir $out_dir --logger_name tensorboard
/data/tir/projects/tir3/users/mengyan3/all_in_one_pretraining/./pretrain_then_finetune.sh: line 184: 2324240 Aborted (core dumped) litgpt pretrain $model_name --resume "${checkpoint_dir}/step${step}/lit_model.pth" --tokenizer_dir "${checkpoint_dir}/step${step}" --data FineWebDataset --data.data_path $pretraining_data_dir --data.val_data_path /data/datasets/hf_cache/data/fineweb/sample-350BT/val/0 --data.num_workers $SLURM_GPUS_ON_NODE --train.micro_batch_size $micro_batch_size --train.max_seq_len $max_seq_len --train.min_lr 1e-6 --train.max_iters ${max_iters} --train.max_additional_steps $max_additional_steps --train.save_interval 500 --train.log_interval $log_interval --train.lr_warmup_fraction 0.01 --train.lr_scheduler $lr_scheduler --eval.interval 1000 --out_dir $out_dir --logs_dir $out_dir --logger_name tensorboard
srun: First task exited 60s ago
srun: StepId=3178163.0 task 0: running
srun: StepId=3178163.0 tasks 1-3: exited
srun: Terminating StepId=3178163.0
slurmstepd: error: *** STEP 3178163.0 ON babel-0-31 CANCELLED AT 2024-11-19T13:24:07 ***
srun: Job step aborted: Waiting up to 122 seconds for job step to finish.
slurmstepd: error: --task-epilog failed status=9
Node information:
=== Slurm Environment ===
SLURM_NTASKS: 4
SLURM_PROCID: 0
SLURM_LOCALID: 0
SLURM_JOB_ID: 3178163
=== GPU Information ===
Available GPUs:
GPU 0: NVIDIA RTX A6000 (UUID: GPU-b349d8f4-c2a8-bd4b-2ed8-4678cc3093ad)
GPU 1: NVIDIA RTX A6000 (UUID: GPU-6386b3c4-ba07-b55c-a8d8-1d7e38378b83)
GPU 2: NVIDIA RTX A6000 (UUID: GPU-8a6310cf-0811-4754-64db-8c4117d4be50)
GPU 3: NVIDIA RTX A6000 (UUID: GPU-99486ac3-a03a-16fa-0e46-8bade74f121a)
GPU Topology:
GPU0 GPU1 GPU2 GPU3 NIC0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X SYS SYS SYS SYS 1-2,7-8,129-130 0 N/A
GPU1 SYS X NV4 NODE NODE 64-65,67,69 1 N/A
GPU2 SYS NV4 X NODE NODE 64-65,67,69 1 N/A
GPU3 SYS NODE NODE X NODE 64-65,67,69 1 N/A
NIC0 SYS NODE NODE NODE X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
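(For completeness, the node information above was collected with roughly the following commands at the start of the job; my wrapper script may word the echoes slightly differently:)

echo "=== Slurm Environment ==="
echo "SLURM_NTASKS: $SLURM_NTASKS"
echo "SLURM_PROCID: $SLURM_PROCID"
echo "SLURM_LOCALID: $SLURM_LOCALID"
echo "SLURM_JOB_ID: $SLURM_JOB_ID"

echo "=== GPU Information ==="
echo "Available GPUs:"
nvidia-smi -L

echo "GPU Topology:"
nvidia-smi topo -m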
What operating system are you using?
Linux
LitGPT Version
Version: 0.4.0