
Instance in unusable "running <not on any sled>" state #5798

@faithanalog


I experienced this on dogfood, and the instance is still in this problem state. The instance is bd91f2a8-e74f-485d-9bd4-8449b901b86a

I logged into dogfood today to find an instance in this state:

[screenshot]

It says "running". However, it is not! It is not reachable via ssh or anything like that.

Let's try turning it off?

$ oxide instance stop --instance bd91f2a8-e74f-485d-9bd4-8449b901b86a
error
Error Response: status: 500 Internal Server Error; headers: {"content-type": "application/json", "x-request-id": "11b5febe-dd35-429f-9145-b2b9bef3d1c2", "content-length": "124", "date": "Tue, 21 May 2024 00:45:01 GMT"}; value: Error { error_code: Some("Internal"), message: "Internal Server Error", request_id: "11b5febe-dd35-429f-9145-b2b9bef3d1c2" }

What about rebooting it?

$ oxide instance reboot --instance bd91f2a8-e74f-485d-9bd4-8449b901b86a
error
Error Response: status: 500 Internal Server Error; headers: {"content-type": "application/json", "x-request-id": "2e299968-06ad-4df8-9e1e-e886f2934f95", "content-length": "124", "date": "Tue, 21 May 2024 00:25:06 GMT"}; value: Error { error_code: Some("Internal"), message: "Internal Server Error", request_id: "2e299968-06ad-4df8-9e1e-e886f2934f95" }

Hmm, internal server error either way. What happened internally? From the log in oxz_nexus_65a11c18-7f59-41ac-b9e7-680627f996e7 on BRM44220011, for the reboot request:

00:25:06.074Z INFO 65a11c18-7f59-41ac-b9e7-680627f996e7 (dropshot_external): request completed
    error_message_external = Internal Server Error
    error_message_internal = instance is active but not resident on a sled
    file = /home/build/.cargo/git/checkouts/dropshot-a4a923d29dccc492/283d897/dropshot/src/server.rs:866
    latency_us = 68156
    local_addr = 172.30.2.5:443
    method = POST
    remote_addr = 172.20.16.246:54890
    req_id = 2e299968-06ad-4df8-9e1e-e886f2934f95
    response_code = 500
    uri = //v1/instances/bd91f2a8-e74f-485d-9bd4-8449b901b86a/reboot
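
For anyone retracing this: the req_id in the log entry is the same value as the x-request-id header in the CLI error, which is how the 500 from the reboot attempt can be tied to the internal message. A minimal sketch of that matching, assuming the CLI error text and the log fields have been copied into strings (abridged here); the helper names are hypothetical and are not part of the oxide CLI or omdb:

```python
import re

# CLI error text for the reboot attempt, abridged from the report above.
cli_error = (
    'Error Response: status: 500 Internal Server Error; '
    'headers: {"x-request-id": "2e299968-06ad-4df8-9e1e-e886f2934f95"}'
)

# A few fields copied from the log entry above.
log_entry = """\
    error_message_internal = instance is active but not resident on a sled
    req_id = 2e299968-06ad-4df8-9e1e-e886f2934f95
    response_code = 500
"""


def request_id_from_cli(text):
    """Hypothetical helper: pull the x-request-id value out of the CLI error text."""
    match = re.search(r'"x-request-id":\s*"([0-9a-f-]{36})"', text)
    return match.group(1) if match else None


def log_fields(entry):
    """Hypothetical helper: turn 'key = value' log lines into a dict."""
    fields = {}
    for line in entry.splitlines():
        key, sep, value = line.partition(" = ")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


req_id = request_id_from_cli(cli_error)
fields = log_fields(log_entry)

# Only report the internal message if the log entry belongs to this request.
if fields.get("req_id") == req_id:
    print(fields["error_message_internal"])
```

The same lookup works for the stop attempt by searching the log for its request ID (11b5febe-dd35-429f-9145-b2b9bef3d1c2) instead.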

Very very odd. What is the instance state according to omdb?

root@oxz_switch0:~# omdb db instances | grep bd91f2a8
note: database URL not specified.  Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd00:1122:3344:109::3]:32221,[fd00:1122:3344:105::3]:32221,[fd00:1122:3344:10b::3]:32221,[fd00:1122:3344:107::3]:32221,[fd00:1122:3344:108::3]:32221/omicron?sslmode=disable
note: database schema version matches expected (63.0.0)
bd91f2a8-e74f-485d-9bd4-8449b901b86a orchard                                                        running  3a4bfe51-421a-4fb7-9efc-4c575f3ee3b0 <not on any sled>   

running <not on any sled>.
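
To check whether other instances are in the same state, the omdb output could be filtered for rows that are "running" but have no sled. A rough sketch, assuming each row is whitespace-separated as in the line above (ID, name, state, a second UUID, then the sled column); the column meanings are my reading of that one row, since the header was dropped by grep, and the function names are made up for illustration:

```python
# The row from the omdb output above, with the padding collapsed.
ROW = (
    "bd91f2a8-e74f-485d-9bd4-8449b901b86a orchard running "
    "3a4bfe51-421a-4fb7-9efc-4c575f3ee3b0 <not on any sled>   "
)


def parse_row(row):
    """Split one row into fields; the column names are assumptions."""
    parts = row.strip().split(None, 4)
    if len(parts) < 5:
        return None
    instance_id, name, state, second_uuid, sled = parts
    return {
        "id": instance_id,
        "name": name,
        "state": state,
        "second_uuid": second_uuid,  # assumed meaning unknown; header not shown
        "sled": sled.strip(),
    }


def is_running_without_sled(record):
    """True for the contradictory combination seen in this report."""
    return record["state"] == "running" and record["sled"] == "<not on any sled>"


record = parse_row(ROW)
if record is not None and is_running_without_sled(record):
    print(record["id"], record["name"])
```

In practice the rows would come from the full `omdb db instances` output rather than a hard-coded string.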

So, we cannot turn it off. Can we turn it on?

$ oxide instance start --instance bd91f2a8-e74f-485d-9bd4-8449b901b86a
error
Error Response: status: 409 Conflict; headers: {"content-type": "application/json", "x-request-id": "2437c80d-3705-4fa9-b799-cf32d0034763", "content-length": "152", "date": "Tue, 21 May 2024 00:31:12 GMT"}; value: Error { error_code: Some("Conflict"), message: "instance changed state before it could be started", request_id: "2437c80d-3705-4fa9-b799-cf32d0034763" }

No, we cannot do that either. So it is neither running, nor not running.
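
One way to read the three responses together is as a record whose state says "running" while its sled is unset, so every request fails a different precondition. The toy model below only illustrates that reading. It is not Nexus code: the record type, the function names, and the exact checks are all assumptions, and only the status codes and the two quoted messages come from the output above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InstanceRecord:
    """Hypothetical stand-in for what the database says about an instance."""
    state: str
    sled_id: Optional[str]


def stop_or_reboot(record):
    # Assumed check: acting on a running instance needs a sled to talk to.
    if record.state == "running" and record.sled_id is None:
        return "500: instance is active but not resident on a sled"
    if record.state == "running":
        return "ok"
    return "not running"


def start(record):
    # Assumed check: start only proceeds from a stopped state.
    if record.state != "stopped":
        return "409: instance changed state before it could be started"
    return "ok"


stuck = InstanceRecord(state="running", sled_id=None)
print(stop_or_reboot(stuck))
print(start(stuck))
```

Under this model a healthy running instance (with a sled) can be stopped, and a stopped instance can be started, but the stuck record satisfies neither precondition.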

I last accessed this VM sometime last week. I believe some dogfood updates happened after that, but I do not remember for certain. This instance has survived updates in the past.

At time of writing the instance is still in this strange state on dogfood.

Labels: bug