Running on version 1.10 of `Distributed`, I get the following error when calling `addprocs()`:
nested task error: could not parse 9275#172.1.1.1
Stacktrace:
[1] error(s::String)
@ Base ./error.jl:35
[2] (::SlurmClusterManager.var"#13#18"{SlurmManager, Vector{WorkerConfig}, Condition})()
@ SlurmClusterManager ./REPL[4]:36
I added a debug statement in `launch()` to figure out the problem and found this:
┌ Debug: connecting to worker 1 out of 25
└ @ SlurmClusterManager REPL[4]:31
┌ Debug: Worker 1 output: julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:9276#172.1.1.1
└ @ SlurmClusterManager REPL[4]:34
┌ Debug: Worker 1 ready on host 172.1.1.1, port 9276
└ @ SlurmClusterManager REPL[4]:42
┌ Debug: connecting to worker 2 out of 25
└ @ SlurmClusterManager REPL[4]:31
┌ Debug: Worker 2 output: julia_worker:julia_worker:julia_worker:julia_worker:9274#172.1.1.1
└ @ SlurmClusterManager REPL[4]:34
┌ Debug: Worker 2 ready on host 172.1.1.1, port 9274
└ @ SlurmClusterManager REPL[4]:42
┌ Debug: connecting to worker 3 out of 25
└ @ SlurmClusterManager REPL[4]:31
┌ Debug: Worker 3 output: 9275#172.1.1.1
└ @ SlurmClusterManager REPL[4]:34
┌ Error: Error launching Slurm job
│ exception =
│ TaskFailedException
│
│ nested task error: could not parse 9275#172.1.1.1
│ Stacktrace:
│ [1] error(s::String)
│ @ Base ./error.jl:35
│ [2] (::SlurmClusterManager.var"#13#18"{SlurmManager, Vector{WorkerConfig}, Condition})()
│ @ SlurmClusterManager ./REPL[4]:36
The error is clear here. The `readline` for Worker 3 returns `9275#172.1.1.1`, which does not contain the `julia_worker` string. Since the pattern requires a `:` before the port, the regex `m = match(r".*:(\d*)#(.*)` on line 206 fails. Moreover, the lines for workers 1 and 2 both contain weird strings with the header repeated many times, like
julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:9276#172.1.1.1
and
julia_worker:julia_worker:julia_worker:julia_worker:9274#172.1.1.1
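The parse failure can be reproduced in isolation. This is only a minimal sketch: `parse_worker_line` is a hypothetical helper name, and the pattern is the one quoted above with the closing quote assumed. It is not the actual `SlurmClusterManager` code.

```julia
# Pattern as quoted above (closing quote assumed). It needs a ':' somewhere
# before the port, so a line without any "julia_worker:" header cannot match.
const WORKER_LINE = r".*:(\d*)#(.*)"

# Hypothetical helper: returns the port and host as strings, or `nothing`
# when the line does not match the pattern.
function parse_worker_line(line::AbstractString)
    m = match(WORKER_LINE, line)
    m === nothing && return nothing
    return (port = String(m[1]), host = String(m[2]))
end

# Two of the lines from the debug output above.
for line in ("julia_worker:julia_worker:julia_worker:julia_worker:9274#172.1.1.1",
             "9275#172.1.1.1")
    parsed = parse_worker_line(line)
    if parsed === nothing
        println("could not parse ", line)
    else
        println("host ", parsed.host, ", port ", parsed.port)
    end
end
```

Because `.*` is greedy, any number of repeated headers is swallowed, which is why workers 1 and 2 still parse while worker 3 does not.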
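So it seems to me that the prints are out of order: the header pieces that belong to Worker 3's line appear to have ended up on the lines for workers 1 and 2. I am not sure why this would happen. The code in `Distributed` that prints this line is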
print(out, "julia_worker:") # print header
print(out, "$(string(LPROC.bind_port))#") # print port
print(out, LPROC.bind_addr)
print(out, '\n')
flush(out)
so I am not sure what is causing the race.
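For comparison, here is a sketch of emitting the same line with a single write instead of four separate `print` calls. `announce_worker` is a hypothetical helper, not part of `Distributed`, and whether a single write would actually avoid the interleaving seen through the Slurm pipe is an assumption I have not verified.

```julia
# Hypothetical helper (not part of Distributed): builds the whole
# announcement line in memory first, then hands it to the stream in one call.
function announce_worker(out::IO, port::Integer, addr::AbstractString)
    line = string("julia_worker:", port, "#", addr, "\n")
    write(out, line)   # one write for header, port, and address together
    flush(out)
    return line
end

# Usage against an in-memory buffer standing in for the worker's stdout.
buf = IOBuffer()
announce_worker(buf, 9275, "172.1.1.1")
print(String(take!(buf)))
```

If the fragments are being mixed because several workers share one output stream, fewer and larger writes should at least narrow the window, but that is a guess rather than a diagnosis.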
Version info:
julia> versioninfo()
Julia Version 1.10.3
Commit 0b4590a5507 (2024-04-30 10:59 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 32 × Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, broadwell)
Threads: 32 default, 0 interactive, 16 GC (on 32 virtual cores)
Environment:
LD_LIBRARY_PATH = /cm/shared/apps/slurm/16.05.8/lib64/slurm:/cm/shared/apps/slurm/16.05.8/lib64:/cm/shared/apps/openmpi/gcc/64/1.10.1/lib64
JULIA_NUM_THREADS = 32
LD_RUN_PATH = /cm/shared/apps/openmpi/gcc/64/1.10.1/lib64
julia> Distributed.VERSION
v"1.10.3"