-
Notifications
You must be signed in to change notification settings - Fork 1k
umqtt.robust: Fix check_msg blocking after reconnect #509
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
umqtt.robust: Fix check_msg blocking after reconnect #509
Conversation
I have since read @pfalcon's comment in PR #117:
This fix would contradict that aim. However, |
Hi, that's interesting that the socket timeout is lost after reconnect, I feel that's perhaps an upstream bug that should be addressed. In the mean time, re-setting the timeout certainly seems appropriate. Personally I think adding a class attribute to hold the timeout would be better than parsing the str(), the parsing is less likely to work across multiple micropython versions and fairly expensive in terms of CPU cycles really. |
I can't see how your change here has affected any behaviour for check_msg? You're just adding to wait_msg but even then there's no new delays? |
Many thanks for the feedback @andrewleech. I've had another look at the code with your comments in mind. The setblocking() calls are both executed in the umqtt.simple module which knows nothing about "robustness"; storing additional variables in that class, for this purpose seems... improper! Agreed though, the str() parsing approach isn't great. I looked at a couple of other ways to take care of it and settled on this idea: override check_msg() in umqtt.robust, implement the retry loop in there with My last question is this: Is it best to open a completely new PR for this since the approach is different, or squash this on top of the code you've already seen and force push? In reply to your second comment:
Yes and no. Without my change, check_msg is unintentionally converted to wait_msg after a reconnect - so that's a bug resulting in check_msg not returning to the main loop immediately. My change intends to address that accidental conversion,
|
Ok, I'm not familiar with the imqtt library myself so lets map out how it's supposed to work.
An important point here is that by default, TCP sockets are placed in a blocking mode. So on startup the first call to Moving on to robust.py:
I understand the root problem now :-)
Sorry I didn't get back to you in time, but yes I commonly replace changes with completely new ones and squash / force-push to github. These github PR's do let you still view earlier changesets so it's still possible to review them. |
def check_msg(self): | ||
while 1: | ||
self.sock.setblocking(False) | ||
try: | ||
return super().wait_msg() | ||
except OSError as e: | ||
self.log(False, e) | ||
self.reconnect() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be updated to add a new optional arg to def wait_msg(self, blocking=True)
and if blocking is False, then
except OSError as e:
if not blocking:
raise
which would mean the error would be propogated back up to check_msg()
to handle.
def check_msg(self): | |
while 1: | |
self.sock.setblocking(False) | |
try: | |
return super().wait_msg() | |
except OSError as e: | |
self.log(False, e) | |
self.reconnect() | |
def check_msg(self): | |
self.sock.setblocking(False) | |
try: | |
return super().wait_msg() | |
except OSError as e: | |
self.log(False, e) | |
self.reconnect() | |
return None |
I also don't think check_msg() should be a loop, there's nothing wrong with checking once (in non blocking state), still reconnect on error, but just exit straight away then. This still preserves the intention of the functions I think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think there's a right answer to this, it depends to some degree on the application.
When calling a robust check_msg()
, my assumption would have been that I can rely on this (more so than umqtt.simple) to deliver me a message if one exists - I believe this is the original author's intention too since check_msg()
was not overridden in robust and calls back to the overridden wait_msg()
which does try to be more reliable. I'm convinced check_msg()
should loop (bear in mind, if there's an exception in reconnect()
, it will be raised, so the only occasion where we don't return reasonably quickly to check_msg()
's caller is if reconnect()
consistently works and super().wait_msg()
consistently fails - a highly unusual situation I'd say).
To help us reach consensus, is it worth considering two applications:
check_msg()
is periodically called on a main loop or from within an async co-routine
In this case, I think your suggestion would work reasonably well; in the event of areconnect()
, messages would be delayed only by the length of a loop. My suggested implementation does spend time trying a little harder, but delivers a message ASAP.check_msg()
is called once having solicited a response from other nodes connected to the MQTT broker (possibly after a short wait for responses)
In this case, returning fromcheck_msg()
having knowingly failed seems inappropriate to me and would need to be clearly stated in its usage documentation "check_msg() should be called at least twice (in what hard-to-detect circumstances?)" Again, my implementation does spend time trying a little harder, but delivers a message first time around if available.
Perhaps it is this specific point that is most concerning, "spend time trying a little harder". That's fair critique if returning quickly takes precedence over being robust for check_msg()
. That said, if that's the route we go down, I believe we need to signal the failure in some way, possibly by returning False
instead of None
on your last line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes you make some very good points in that there's no one definitive right answer.
Looking through the suggestions, I now feel like check_msg should retry just once if needed, ie call wait_msg, on exception call reconnect then call wait_msg again.
If it fails again it should still just return as it means the server is probably down.
I definitely believe check_msg should not continue to loop on failure because that will be blocking and prevent the caller from logging, notifying user of failure etc, basically the same as a wait_msg if the server is down.
@andrewleech, I'm resurrecting our month old conversation! :-)
So a trivial code change brings your suggestion to fruition while also making
What do you think? |
@ian-llewellyn This does look quite good to me, nice cleanup. Keeps it all quite simple and I feel true-to-intention. |
After `reconnect()`, MQTTClient.socket is blocking by default. This commit aims to supplement that behaviour at times when `check_msg()` sets the socket to non-blocking. It may fix errors reported in micropython#102 and This fix: * avoids using an additional instance attribute to record the intended state of the socket * only adds two additional lines of code to one file in the codebase * depends on socket's `__str__()` method to retrieve the current timeout value: `<socket state=0 timeout=-1 incoming=0 off=0>` - not ideal
28cdd08
to
bcf6d97
Compare
Thank for the positive discourse @andrewleech. I saw that master has moved on a bit since my initial PR, so I've rebased, updated |
Looks good to me, I'll have to leave it to @jimmo for a final review. |
Thanks for the contribution. There is definitely a bug here, The fix here looks good. Rebased and merged in 4dc2d5e |
After
reconnect()
, MQTTClient.socket is blocking by default. Thiscommit aims to supplement that behaviour at times when
check_msg()
sets the socket to non-blocking. It may fix errors reported in #102 and
#192.
This fix:
state of the socket (Edit: this would involve changes to umqtt.simple which should 'know' nothing about robustness)
only adds two additional lines of code to one file in the codebase(Edit: a new approach was taken based on feedback: it's now 8 additional code lines)depends on socket's__str__()
method to retrieve the current timeoutvalue:
<socket state=0 timeout=-1 incoming=0 off=0>
- not idealcheck_msg()
method in order tosock.setblocking(False)
before all calls towait_msg()
, even afterreconnect()
)Edit: An additional note: This fix calls
super().wait_msg()
as opposed toself.wait_msg()
to avoid nested reconnect() retries which would simply reintroduce the original behaviour.