Improvements to time synchronisation in the face of connectivity problems. #7675
oxidecomputer/helios-omnios-build#49 is the other part of this that updates chrony and adds the new |
I've tested this by deploying to the london environment and everything works as expected. On a boundary NTP zone, there are the correct 5 nameserver entries in
and both the external NTP server and the peer boundary are being used:
As expected, when seeing itself, chrony has marked that source as unusable (the Compare this with an internal DNS zone:
Disabling access to the boundary's upstream results in a rapid switch to using the other boundary as the source:
I also did tests where I removed connectivity from both boundary zones and confirmed that one of them became authoritative, with the other taking time from it. |
This is ready for review, and integration once buildomat images are updated to include the new |
For consistency of time within the rack we must guard against there ever being two authoritative sources of time. We currently have two (admittedly edge) cases where this can occur.

1) When, some time after everything is synchronised, one or both of the boundary NTP servers loses upstream connectivity, but continues to advertise at the same stratum as its clock begins to drift.
2) If both boundary NTP servers lose connectivity and fall back to their local clocks, advertising them with stratum 10, they will both be authoritative but with potentially different times.

This change addresses both of these by updating the NTP server configuration in a number of ways.

1) Adding each boundary server as a source to the other;
2) Configuring the boundary "local" sources with the "orphan" flag that causes selection of just one as authoritative if both are in that mode;
3) Configuring RSS sources with a new "failfast" flag that causes them to be discounted quickly (marked as "unselectable") once they are considered unreachable, instead of waiting for their "distance" to degrade over time;
4) Adjusting the root dispersion decay rate from chrony's default of 1µs/s (versus RFC recommended default of 15µs/s) up to 60µs/s to achieve faster source reselection.
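
To give a rough feel for the effect of the dispersion rate change, here is a minimal arithmetic sketch. It assumes dispersion grows linearly at the configured rate while a source provides no updates. The three rates are the ones quoted above; the 100 ms figure is a purely hypothetical amount of extra dispersion chosen for illustration, not a threshold taken from chrony or from this change.

```python
# Sketch: time for root dispersion to grow by a given amount at different
# rates, assuming simple linear growth while no updates are received.

# Rates quoted in the description, in microseconds of dispersion per second.
RATES_US_PER_S = {
    "chrony default": 1.0,
    "RFC recommended default": 15.0,
    "this change": 60.0,
}

# Hypothetical amount of extra dispersion (100 ms); illustration only.
EXTRA_DISPERSION_US = 100_000.0


def seconds_to_accumulate(extra_us: float, rate_us_per_s: float) -> float:
    """Seconds without an update before dispersion has grown by extra_us."""
    if rate_us_per_s <= 0:
        raise ValueError("rate must be positive")
    return extra_us / rate_us_per_s


def summarize(extra_us: float = EXTRA_DISPERSION_US) -> dict:
    """Map each named rate to the time (in seconds) needed to add extra_us."""
    return {
        name: seconds_to_accumulate(extra_us, rate)
        for name, rate in RATES_US_PER_S.items()
    }


if __name__ == "__main__":
    for name, secs in summarize().items():
        print(f"{name:>24}: {secs:10.1f} s ({secs / 3600:.2f} h)")
```

Under these assumptions the same amount of dispersion accumulates 60 times faster than with chrony's default rate, which is the sense in which a stale source loses out to a fresher one sooner.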
This is partially in response to the problems encountered in #7534