Skip to content

Conversation

davepacheco
Copy link
Collaborator

While working on Nexus handoff, I wondered how we were going to debug it if things went sideways. We may lose the Nexus logs (due to #3860). An easy thing to do would be to record in the database the timestamps when individual instances marked themselves quiesced or active. That would help us construct a timeline of what happened.

This is less urgent than most of the other stuff going on since we've been fortunate to not have observed any handoff bugs yet. I'd say this is not a blocker for R17, but I'm marking it for milestone 17 because if it's as low-risk (and finished) as I hope then it'll be worth getting in.

@davepacheco davepacheco added this to the 17 milestone Oct 1, 2025
@davepacheco davepacheco requested a review from smklein October 1, 2025 19:27
@davepacheco davepacheco self-assigned this Oct 1, 2025
version,
target_version
) VALUES
(TRUE, NOW(), NOW(), '195.0.0', NULL)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you'll need to bump this to 196

Comment on lines 783 to 799
async fn database_nexus_access(
&self,
nexus_id: OmicronZoneUuid,
) -> Result<Option<DbMetadataNexus>, Error> {
use nexus_db_schema::schema::db_metadata_nexus::dsl;

let nexus_access: Option<DbMetadataNexus> = dsl::db_metadata_nexus
.filter(
dsl::nexus_id.eq(nexus_db_model::to_db_typed_uuid(nexus_id)),
)
.first_async(&*self.pool_connection_unauthorized().await?)
.await
.optional()
.map_err(|e| public_error_from_diesel(e, ErrorHandler::Server))?;

Ok(nexus_access)
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some concerns about this function in the case where an existing database has a schema version of >= 185, but < 196.

I believe a "newer version of Nexus", e.g., one running with an expected schema version of 196, but an on-disk format less than that, will call this function from check_nexus_access, which is called from check_schema_and_access. I think it'll result in a diesel query of SELECT (each column, including the time columns) FROM db_metadata_nexus.

I think this will encounter a "column does not exist" error, and prevent the new Nexus from taking over. This check happens before schema migrations, which is why the db_metadata_... tables need to be modified with caution.

From a release-to-release point-of-view, I don't think this would be a problem -- R16 has an on-disk schema of < 185 (for reference, see the constant: DB_METADATA_NEXUS_SCHEMA_VERSION).

However, this will cause problems for e.g. dogfood!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a potential fix:

  • If the schema version < 196, only query for the old set of columns
  • If the schema version >= 196, we can query for all the columns, including the new ones

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a really good catch! I was hoping the risk here was minimal, but it's clear that his case is tricky and we'll want to do careful testing. I may wind up deprioritizing this for R17.

@davepacheco davepacheco removed this from the 17 milestone Oct 7, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants