Error when reading worker config #78

Open
@affans

Description

Running version 1.10 of Distributed, I get the following error when calling addprocs().

nested task error: could not parse 9275#172.1.1.1
        Stacktrace:
         [1] error(s::String)
           @ Base ./error.jl:35
         [2] (::SlurmClusterManager.var"#13#18"{SlurmManager, Vector{WorkerConfig}, Condition})()
           @ SlurmClusterManager ./REPL[4]:36
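
For context, the call that triggers this is roughly the following (a sketch; the exact script is not in this report):

```julia
using Distributed, SlurmClusterManager

# Run from within a Slurm allocation (e.g. an sbatch script);
# SlurmManager launches one Julia worker per Slurm task.
addprocs(SlurmManager())
```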

I added a debug statement in launch() to figure out the problem and found this:

┌ Debug: connecting to worker 1 out of 25
└ @ SlurmClusterManager REPL[4]:31
┌ Debug: Worker 1 output: julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:9276#172.1.1.1
└ @ SlurmClusterManager REPL[4]:34
┌ Debug: Worker 1 ready on host 172.1.1.1, port 9276
└ @ SlurmClusterManager REPL[4]:42
┌ Debug: connecting to worker 2 out of 25
└ @ SlurmClusterManager REPL[4]:31
┌ Debug: Worker 2 output: julia_worker:julia_worker:julia_worker:julia_worker:9274#172.1.1.1
└ @ SlurmClusterManager REPL[4]:34
┌ Debug: Worker 2 ready on host 172.1.1.1, port 9274
└ @ SlurmClusterManager REPL[4]:42
┌ Debug: connecting to worker 3 out of 25
└ @ SlurmClusterManager REPL[4]:31
┌ Debug: Worker 3 output: 9275#172.1.1.1
└ @ SlurmClusterManager REPL[4]:34
┌ Error: Error launching Slurm job
│   exception =
│    TaskFailedException
│
│        nested task error: could not parse 9275#172.1.1.1
│        Stacktrace:
│         [1] error(s::String)
│           @ Base ./error.jl:35
│         [2] (::SlurmClusterManager.var"#13#18"{SlurmManager, Vector{WorkerConfig}, Condition})()
│           @ SlurmClusterManager ./REPL[4]:36

The error is clear here: the readline for worker 3 returns 9275#172.1.1.1, which does not contain the julia_worker header, so the regex m = match(r".*:(\d*)#(.*)", ...) on line 206 fails. Moreover, workers 1 and 2 both produce odd strings like

julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:9276#172.1.1.1

and

julia_worker:julia_worker:julia_worker:julia_worker:9274#172.1.1.1
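
A quick sketch of why workers 1 and 2 still parse despite the garbling, while worker 3 does not (variable names here are illustrative, not the package's):

```julia
re = r".*:(\d*)#(.*)"   # the parsing regex used by SlurmClusterManager

# Worker 1's garbled line still matches: the greedy ".*:" swallows all
# of the repeated headers up to the last colon.
garbled = "julia_worker:julia_worker:julia_worker:9276#172.1.1.1"
m = match(re, garbled)
println(m.captures)       # port "9276", host "172.1.1.1"

# Worker 3's line has no colon at all, so match() returns nothing,
# which is what produces the "could not parse 9275#172.1.1.1" error.
bare = "9275#172.1.1.1"
println(match(re, bare))  # nothing
```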

So it seems to me that the prints are out of order. I am not sure why this would be. The code in Distributed is

    print(out, "julia_worker:")  # print header
    print(out, "$(string(LPROC.bind_port))#") # print port
    print(out, LPROC.bind_addr)
    print(out, '\n')
    flush(out)

so I am not sure what is causing the race.
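
My guess (an assumption, not confirmed) is that several workers share one multiplexed output stream, so the four separate print() calls from different workers can interleave mid-record. A minimal sketch of that effect, simulated with tasks writing to a single IOBuffer, where yield() stands in for whatever scheduling point lets another writer slip in:

```julia
io = IOBuffer()
@sync for (port, addr) in [(9276, "172.1.1.1"), (9274, "172.1.1.1")]
    @async begin
        print(io, "julia_worker:")  # header
        yield()                     # another writer can run here
        print(io, "$port#")         # port
        print(io, addr)             # address
        print(io, '\n')
    end
end
print(String(take!(io)))
# Possible interleaving, resembling the debug output above:
#   julia_worker:julia_worker:9276#172.1.1.1
#   9274#172.1.1.1
```

If that is the cause, emitting the whole record with a single print (or locking the stream around the four calls) would make each header line atomic.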

Version info:

julia> versioninfo()
Julia Version 1.10.3
Commit 0b4590a5507 (2024-04-30 10:59 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 32 × Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, broadwell)
Threads: 32 default, 0 interactive, 16 GC (on 32 virtual cores)
Environment:
  LD_LIBRARY_PATH = /cm/shared/apps/slurm/16.05.8/lib64/slurm:/cm/shared/apps/slurm/16.05.8/lib64:/cm/shared/apps/openmpi/gcc/64/1.10.1/lib64
  JULIA_NUM_THREADS = 32
  LD_RUN_PATH = /cm/shared/apps/openmpi/gcc/64/1.10.1/lib64

julia> Distributed.VERSION
v"1.10.3"
