Instance in unusable "running <not on any sled>" state #5798

I hit this on dogfood, and the instance is still in the problem state. The instance ID is **bd91f2a8-e74f-485d-9bd4-8449b901b86a**.

I logged into dogfood today to find an instance in this state:

![image](https://github.com/oxidecomputer/omicron/assets/1389549/4e81b9de-91f3-400c-921b-e00ea5a0a583)

It says "running". However, it is not! It is not reachable via `ssh` or anything like that.

Let's try turning it off?

```
$ oxide instance stop --instance bd91f2a8-e74f-485d-9bd4-8449b901b86a
error
Error Response: status: 500 Internal Server Error; headers: {"content-type": "application/json", "x-request-id": "11b5febe-dd35-429f-9145-b2b9bef3d1c2", "content-length": "124", "date": "Tue, 21 May 2024 00:45:01 GMT"}; value: Error { error_code: Some("Internal"), message: "Internal Server Error", request_id: "11b5febe-dd35-429f-9145-b2b9bef3d1c2" }
```

What about rebooting it?

```
$ oxide instance reboot --instance bd91f2a8-e74f-485d-9bd4-8449b901b86a
error
Error Response: status: 500 Internal Server Error; headers: {"content-type": "application/json", "x-request-id": "2e299968-06ad-4df8-9e1e-e886f2934f95", "content-length": "124", "date": "Tue, 21 May 2024 00:25:06 GMT"}; value: Error { error_code: Some("Internal"), message: "Internal Server Error", request_id: "2e299968-06ad-4df8-9e1e-e886f2934f95" }
```

Hmm, internal server error either way. What happened internally? Here is the log entry for the reboot request (matching `req_id`), via `oxz_nexus_65a11c18-7f59-41ac-b9e7-680627f996e7` on `BRM44220011`:

```
00:25:06.074Z INFO 65a11c18-7f59-41ac-b9e7-680627f996e7 (dropshot_external): request completed
    error_message_external = Internal Server Error
    error_message_internal = instance is active but not resident on a sled
    file = /home/build/.cargo/git/checkouts/dropshot-a4a923d29dccc492/283d897/dropshot/src/server.rs:866
    latency_us = 68156
    local_addr = 172.30.2.5:443
    method = POST
    remote_addr = 172.20.16.246:54890
    req_id = 2e299968-06ad-4df8-9e1e-e886f2934f95
    response_code = 500
    uri = //v1/instances/bd91f2a8-e74f-485d-9bd4-8449b901b86a/reboot
```

Very very odd. What is the instance state according to `omdb`?

```
root@oxz_switch0:~# omdb db instances | grep bd91f2a8
note: database URL not specified.  Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd00:1122:3344:109::3]:32221,[fd00:1122:3344:105::3]:32221,[fd00:1122:3344:10b::3]:32221,[fd00:1122:3344:107::3]:32221,[fd00:1122:3344:108::3]:32221/omicron?sslmode=disable
note: database schema version matches expected (63.0.0)
bd91f2a8-e74f-485d-9bd4-8449b901b86a orchard                                                        running  3a4bfe51-421a-4fb7-9efc-4c575f3ee3b0 <not on any sled>   
```

**`running <not on any sled>`**.
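
To check whether other instances are stuck the same way, here is a minimal sketch of a filter over the `omdb db instances` output. It assumes the column layout shown above (instance ID first, state third) and that instance names contain no whitespace. `find_stranded_instances` is a hypothetical helper name, not part of `omdb`.

```sh
# Hypothetical helper: reads `omdb db instances` output on stdin and prints
# the IDs of instances whose state is "running" but whose sled column reads
# "<not on any sled>".
# Assumption: column 1 is the instance ID and column 3 is the state, as in
# the output pasted above.
find_stranded_instances() {
    awk 'index($0, "<not on any sled>") > 0 && $3 == "running" { print $1 }'
}

# Usage (from the switch zone):
#   omdb db instances | find_stranded_instances
```

This only matches the symptom in the listing. It does not say anything about how an instance got into that state.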

So, we cannot turn it off. Can we turn it on?

```
$ oxide instance start --instance bd91f2a8-e74f-485d-9bd4-8449b901b86a
error
Error Response: status: 409 Conflict; headers: {"content-type": "application/json", "x-request-id": "2437c80d-3705-4fa9-b799-cf32d0034763", "content-length": "152", "date": "Tue, 21 May 2024 00:31:12 GMT"}; value: Error { error_code: Some("Conflict"), message: "instance changed state before it could be started", request_id: "2437c80d-3705-4fa9-b799-cf32d0034763" }
```

No, we cannot do that either. So it is neither running, nor not running.

I last accessed this VM some time last week. I believe some dogfood updates happened after that, but I do not remember for certain. This instance has survived updates in the past.

At time of writing the instance is still in this strange state on dogfood.
