
initial sync with dispatcher timed out after 10s #275


Closed
Backlonhw2468 opened this issue May 3, 2025 · 11 comments · Fixed by #284 or #285
Labels
reprex needs a minimal reproducible example

Comments

@Backlonhw2468

I was trying to use daemons() inside a self-defined function and got the error message shown in the title. If I use it outside of the function, things work fine.

My platform is Windows, and I am using mirai 2.2.0 and R 4.5.0.

Apologies that the self-defined function is too complicated to attach in this post.

What could be the possible reasons for this error? Thank you!

@shikokuchuo shikokuchuo added the reprex needs a minimal reproducible example label May 6, 2025
@shikokuchuo
Member

shikokuchuo commented May 6, 2025

This may be a sign that not using it within a function is what you should be doing anyway - see the guidance here. Please try to produce a reprex using any function. Also, what does your actual daemons() call look like?
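
For illustration, a minimal sketch of that guidance, assuming a purely local setup; slow_square() is a hypothetical stand-in for the original poster's self-defined function:

```r
library(mirai)

# Set daemons once at the top level of the script, not inside the function
daemons(2)

# The function itself only creates mirai tasks and collects their results
slow_square <- function(x) {
  m <- mirai(x^2, x = x)
  m[]  # wait for and return the result
}

slow_square(3)

daemons(0)  # reset daemons when finished
```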

@sebffischer

sebffischer commented May 16, 2025

FWIW: I also sometimes run into this error when starting mirai daemons on an HPC. It's not clear to me why or when this happens.

@sebffischer

sebffischer commented May 16, 2025

I think the .limit_long in my case is just not long enough, because things on HPCs can be slow.

When I increase it before starting the daemons, e.g. by a factor of two, it works.
I don't think there is a configuration option right now, but it's possible to hack it in like this:

assignInNamespace(".limit_long", 100000L, "mirai")
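
For context, a rough sketch of using this workaround before launching daemons. .limit_long is an unexported constant (the value appears to be in milliseconds, so 100000L would correspond to 100 seconds), and patching it with assignInNamespace() relies on internals that may change in future mirai versions:

```r
library(mirai)

# Patch the unexported timeout constant before launching daemons
# (100000L presumably means 100,000 ms, i.e. 100 seconds)
assignInNamespace(".limit_long", 100000L, ns = "mirai")

# The initial sync with dispatcher now has a much longer window to complete
daemons(4)
```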

@shikokuchuo
Member

FWIW: I also sometimes run into this error when starting mirai daemons on an HPC. It's not clear to me why or when this happens.

@sebffischer thanks for letting me know. I take it you mean you're using something like daemons(url = host_url(), remote = ...), rather than just launching daemons locally.

In such a case, it is only the host process synchronizing with dispatcher, which is a local background process. I would not expect it to take 10 seconds to start up and connect back to the host (especially as we're invoking R with --vanilla and not even loading any packages outside of base). I guess it could be that it still needs to load mirai, and this might be on some really slow-to-initiate file system?

In any case, it should be possible to make this a parameter. Or else, would widening it to, say, 20 seconds work, or is it wildly variable?
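
For reference, a hedged sketch of the kind of call being described here, with placeholder node names:

```r
library(mirai)

# Host listens on its own URL; daemons are launched on remote machines via
# SSH, while dispatcher itself runs as a local background process.
# "node1" and "node2" are placeholder hostnames.
daemons(
  n = 2,
  url = host_url(),
  remote = ssh_config(remotes = c("ssh://node1", "ssh://node2"))
)
```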

@sebffischer

Thanks for the quick response @shikokuchuo. I think I described my setup somewhat poorly. It is the following: I start a Slurm job with multiple CPU cores, and within that Slurm job I spawn multiple mirai workers. These are local cores on the node executing the Slurm job.

In my case, I just set it to 100 seconds and it has worked every time so far, but who knows what the future will bring.
For this reason, I think it would be more future-proof to have control over this parameter, just in case there is some instance where a new limit also does not suffice.
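
For illustration, a sketch of this setup under the assumption that the job script exposes the allocated core count via SLURM_CPUS_PER_TASK:

```r
library(mirai)

# Inside a single Slurm job allocation, launch one local daemon per
# allocated CPU core (SLURM_CPUS_PER_TASK is set by Slurm for the job)
cores <- as.integer(Sys.getenv("SLURM_CPUS_PER_TASK", unset = "1"))
daemons(n = cores)
```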

@shikokuchuo
Member

Thanks, I can see that there could easily be occasions where 10 seconds wouldn't be enough in that case. I'm going to put through a PR that adds this as a parameter to daemons().

@Backlonhw2468 I'm conscious that this may be different to what you're experiencing. Please feel free to post another issue in case this doesn't solve it for you.

@shikokuchuo
Member

Actually, I'm going to solve this a different way in #285, as this is not designed to be a point of failure.

@shikokuchuo shikokuchuo reopened this May 20, 2025
@shikokuchuo
Member

@sebffischer I'm going to merge #285. I'd appreciate it if you could give the latest dev version a whirl on your HPC system requiring the longer timeout, and let me know if you encounter any issues. You should get messages to the console every 10s, but it will eventually succeed.

@sebffischer

I like the solution! I have installed the dev version and am running it now. I will report back on whether it worked!

@sebffischer

No job failed, @shikokuchuo, thanks again!

@shikokuchuo
Member

Great, thanks for letting me know!
