-
Notifications
You must be signed in to change notification settings - Fork 58
add timestamps to db_metadata_nexus records for debugging handoff #9128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
version, | ||
target_version | ||
) VALUES | ||
(TRUE, NOW(), NOW(), '195.0.0', NULL) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you'll need to bump this to 196
async fn database_nexus_access( | ||
&self, | ||
nexus_id: OmicronZoneUuid, | ||
) -> Result<Option<DbMetadataNexus>, Error> { | ||
use nexus_db_schema::schema::db_metadata_nexus::dsl; | ||
|
||
let nexus_access: Option<DbMetadataNexus> = dsl::db_metadata_nexus | ||
.filter( | ||
dsl::nexus_id.eq(nexus_db_model::to_db_typed_uuid(nexus_id)), | ||
) | ||
.first_async(&*self.pool_connection_unauthorized().await?) | ||
.await | ||
.optional() | ||
.map_err(|e| public_error_from_diesel(e, ErrorHandler::Server))?; | ||
|
||
Ok(nexus_access) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have some concerns about this function in the case where an existing database has a schema version of >= 185, but < 196.
I believe a "newer version of Nexus", e.g., one running with an expected schema version of 196, but an on-disk format less than that, will call this function from check_nexus_access
, which is called from check_schema_and_access
. I think it'll result in a diesel query of SELECT (each column, including the time columns) FROM db_metadata_nexus
.
I think this will encounter a "column does not exist" error, and prevent the new Nexus from taking over. This check happens before schema migrations, which is why the db_metadata_...
tables need to be modified with caution.
From a release-to-release point-of-view, I don't think this would be a problem -- R16 has an on-disk schema of < 185 (for reference, see the constant: DB_METADATA_NEXUS_SCHEMA_VERSION
).
However, this will cause problems for e.g. dogfood!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a potential fix:
- If the schema version < 196, only query for the old set of columns
- If the schema version >= 196, we can query for all the columns, including the new ones
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a really good catch! I was hoping the risk here was minimal, but it's clear that his case is tricky and we'll want to do careful testing. I may wind up deprioritizing this for R17.
While working on Nexus handoff, I wondered how we were going to debug it if things went sideways. We may lose the Nexus logs (due to #3860). An easy thing to do would be to record in the database the timestamps when individual instances marked themselves quiesced or active. That would help us construct a timeline of what happened.
This is less urgent than most of the other stuff going on since we've been fortunate to not have observed any handoff bugs yet. I'd say this is not a blocker for R17, but I'm marking it for milestone 17 because if it's as low-risk (and finished) as I hope then it'll be worth getting in.