19 changes: 18 additions & 1 deletion articles/operator-nexus/TOC.yml
Original file line number Diff line number Diff line change
@@ -147,7 +147,7 @@
- name: Cluster Upgrades
href: howto-cluster-runtime-upgrade.md
- name: Cluster Upgrades With PauseRack Strategy
href: howto-cluster-runtime-upgrade-with-pauserack-strategy.md
href: howto-cluster-runtime-upgrade-with-pauserack-strategy.md
- name: Credential Rotation
href: howto-credential-rotation.md
- name: Credential Manager Key Vault
@@ -300,6 +300,23 @@
href: troubleshoot-bare-metal-machine-provisioning.md
- name: Troubleshoot Nexus Kubernetes Cluster pods
href: troubleshoot-nexus-kubernetes-cluster-pods.md
- name: Azure Resource Health
expanded: false
items:
- name: Cluster Managers
expanded: false
- name: Clusters
expanded: false
items:
- name: Troubleshoot Cluster Heartbeat Connection Status Disconnected
href: troubleshoot-cluster-heartbeat-connection-status-disconnected.md
- name: Troubleshoot ETCD Cluster Possible Quorum Lost
href: troubleshoot-etcd-cluster-possible-quorum-lost.md
- name: BareMetal Machines
expanded: false
items:
- name: Troubleshoot BMM Not In Ready State
href: troubleshoot-bmm-not-ready-state.md

- name: FAQ
href: azure-operator-nexus-faq.md
14 changes: 14 additions & 0 deletions articles/operator-nexus/includes/contact-support.md
@@ -0,0 +1,14 @@
---
author: omarrivera
ms.author: omarrivera
ms.date: 10/09/2024
ms.topic: include
ms.service: azure-operator-nexus
---
## Still Having Issues?

If the steps outlined didn't resolve the issue, or if you still have questions, [contact support].
For more information about support plans, see [Azure Support plans].

[contact support]: https://portal.azure.com/?#blade/Microsoft_Azure_Support/HelpAndSupportBlade
[Azure Support plans]: https://azure.microsoft.com/support/plans/response/
17 changes: 17 additions & 0 deletions articles/operator-nexus/includes/prereq-az-cli.md
@@ -0,0 +1,17 @@
---
author: omarrivera
ms.author: omarrivera
ms.date: 10/09/2024
ms.topic: include
ms.service: azure-operator-nexus
---
## Prerequisites

1. Install the latest version of the [appropriate CLI extensions](howto-install-cli-extensions.md).
2. Collect the following information:
   - Subscription ID (SUBSCRIPTION)
   - Cluster name (CLUSTER)
   - Resource group (CLUSTER_RG)
   - Managed resource group (CLUSTER_MRG)
3. Request subscription access to run Azure Operator Nexus network fabric (NF) and network cloud (NC) CLI extension commands.
4. Sign in to Azure CLI and select the subscription where the cluster is deployed.
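
The values collected in step 2 can be kept in shell variables so that later commands can reuse them. The following is a minimal sketch, assuming a Bash shell. The placeholder values and the helper names `require_prereq_vars` and `sign_in` are illustrative and aren't part of the Azure CLI.

```bash
# Placeholder values (assumptions): replace each one with your own.
export SUBSCRIPTION="00000000-0000-0000-0000-000000000000"
export CLUSTER="my-cluster"
export CLUSTER_RG="my-cluster-rg"
export CLUSTER_MRG="my-cluster-mrg"

# Hypothetical helper: report any prerequisite variable that is empty.
# Returns 0 when all four variables are set, 1 otherwise.
require_prereq_vars() {
  missing=0
  for name in SUBSCRIPTION CLUSTER CLUSTER_RG CLUSTER_MRG; do
    # Read the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "Missing required variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Hypothetical helper: sign in and select the subscription (step 4).
sign_in() {
  require_prereq_vars || return 1
  az login
  az account set --subscription "$SUBSCRIPTION"
}
```

Calling `sign_in` once per shell session is usually enough. The check runs first so that a missing value is reported before any Azure CLI command is attempted.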
29 changes: 29 additions & 0 deletions articles/operator-nexus/troubleshoot-bmm-not-ready-state.md
@@ -0,0 +1,29 @@
---
title: Troubleshoot Azure Operator Nexus BareMetal Machines in a Not Ready state
description: Examine common and known issues with BareMetal Machine resources.
ms.service: azure-operator-nexus
ms.custom: troubleshooting
ms.topic: troubleshooting
ms.date: 10/09/2024
ms.author: omarrivera
author: omarrivera
---
# Troubleshoot Azure Operator Nexus BareMetal Machines in a Not Ready state

This guide provides steps to troubleshoot a BareMetal Machine that is reported to be in a `Not Ready` state.

> [!NOTE]
> There can be multiple reasons why a BareMetal Machine is in a `Not Ready` state.
> The best approach is to determine whether any of the common reasons apply.
> Although this guide covers historically known issues, it can't cover all possible error scenarios.

[!include[prereqAzCLI](./includes/prereq-az-cli.md)]
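
As a starting point, the following sketch lists the BareMetal Machines that aren't reporting ready. It assumes that the BareMetal Machine resources are in the managed resource group (`CLUSTER_MRG`) and that the CLI output exposes the `readyState`, `detailedStatus`, and `detailedStatusMessage` fields. Verify these field names against the output of your installed `networkcloud` extension version. The function name `list_not_ready_bmms` is illustrative.

```bash
# Hypothetical helper: list BareMetal Machines whose readyState isn't 'True'.
# Assumes SUBSCRIPTION and CLUSTER_MRG are set as described in the prerequisites.
list_not_ready_bmms() {
  az networkcloud baremetalmachine list \
    --subscription "$SUBSCRIPTION" \
    --resource-group "$CLUSTER_MRG" \
    --query "[?readyState!='True'].{Name:name, ReadyState:readyState, DetailedStatus:detailedStatus, Message:detailedStatusMessage}" \
    --output table
}
```

An empty result suggests that every BareMetal Machine in the resource group reports ready, under the assumptions above.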


TODO - use the article that exists as reference and add only the preconditions and we'll have
articles/operator-nexus/troubleshoot-bare-metal-machine-provisioning.md

>[!NOTE]
> NC 3.14 has OnpremLogs in the LAW - it would need to use that for reference https://teams.microsoft.com/l/message/19:99bdf627-579c-46bb-a2e1-20215be79888_e5ef5aef-6faf-4e93-ae99-d353f173d715@unq.gbl.spaces/1729106818629?context=%7B%22contextType%22%3A%22chat%22%7D

[!include[stillHavingIssues](./includes/contact-support.md)]
Original file line number Diff line number Diff line change
@@ -0,0 +1,63 @@
---
title: Troubleshoot Azure Operator Nexus Cluster Heartbeat Connection Status shows Disconnected
description: Provide steps to investigate and possibly resolve circumstances that are preventing the Cluster from sending heartbeats to the Cluster Manager.
ms.service: azure-operator-nexus
ms.custom: troubleshooting
ms.topic: troubleshooting
ms.date: 10/09/2024
ms.author: omarrivera
author: omarrivera
---
# Troubleshoot Azure Operator Nexus Cluster Heartbeat Connection Status shows Disconnected

This guide provides steps to troubleshoot a Cluster whose `clusterConnectionStatus` has a value of `Disconnected`.

> [!CAUTION]
> The `ClusterConnectionStatus` is likely a symptom or signal, not the root cause, and this guide can't provide answers for all scenarios.
> The purpose of this guide is to describe common issues and signals that you can inspect to determine where the problem might be.

## Understanding the Issue

Cluster Managers monitor Cluster network connectivity through a heartbeat agent that runs within the target Cluster.
The cluster-heartbeat agent sends periodic HTTP messages to the Cluster Manager and expects an acknowledgment in response.
A Cluster has the property `ClusterConnectionStatus`, which is set to `Connected` as long as the heartbeats are continuously received and acknowledged.

The `ClusterConnectionStatus` returns to `Connected` once the Cluster is in a healthy state and network connectivity issues are resolved.
If the Cluster is expected to be healthy but the `ClusterConnectionStatus` remains `Disconnected` after you follow the steps in this guide, [contact support].

> [!IMPORTANT]
> `ClusterConnectionStatus` is **not** the same as the connection status of an Arc Connected Kubernetes Cluster.

Review comment (Contributor): Can we also include a metric to monitor the cluster connection status?
For example: Avg Cluster Connection Status for <> by Cluster Name where Cluster Name = <<'clustername'>>

The following command shows the value of `ClusterConnectionStatus`. The value is also visible in the Azure portal in the Cluster resource's JSON view.

```azurecli
az networkcloud cluster show --subscription "$SUBSCRIPTION" -g "$CLUSTER_RG" -n "$CLUSTER" --output table --query "{ClusterConnectionStatus:clusterConnectionStatus}"

ClusterConnectionStatus
-------------------------
Connected
```
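
For scripting, the same command can return only the status value so that a script can branch on it. The following is a minimal sketch. The function name `check_cluster_connection` and its positional arguments are illustrative, and the sketch assumes an Azure CLI session that is already signed in.

```bash
# Hypothetical helper: print the cluster's connection status.
# Arguments: <subscription-id> <resource-group> <cluster-name>
# Returns 0 when the status is Connected, 1 for any other status,
# and 2 when the Azure CLI call itself fails.
check_cluster_connection() {
  local status
  status="$(az networkcloud cluster show \
    --subscription "$1" -g "$2" -n "$3" \
    --query clusterConnectionStatus --output tsv)" || return 2
  echo "ClusterConnectionStatus: $status"
  [ "$status" = "Connected" ]
}
```

Returning a distinct exit code for a failed CLI call keeps an authentication or permission error from being mistaken for a `Disconnected` cluster.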

The following table shows which status is displayed depending on the state of the undercloud cluster:

| Status         | Definition                                                                                                                              |
|----------------|-----------------------------------------------------------------------------------------------------------------------------------------|
| `Connected`    | Heartbeats are being received, which indicates a healthy Cluster and Cluster Manager connectivity.                                      |
| `Disconnected` | Heartbeats missed for __over 5 minutes__, which indicates a likely connectivity issue between the Cluster Manager and the Cluster.      |
| `Timeout`      | Heartbeats missed for __over 2 minutes but less than 5 minutes__. Cluster connectivity is uncertain and possibly degraded.              |
| `Undefined`    | The Cluster isn't deployed yet or is running a version without the heartbeat feature.                                                   |
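
The thresholds in the table can be read as a simple decision rule on the time since the last heartbeat. The sketch below only illustrates that rule and isn't the Cluster Manager's implementation. The function name `status_for_heartbeat_gap` is hypothetical. Because the table doesn't say which status applies at exactly 2 or 5 minutes, the sketch assumes that exactly 2 minutes still counts as `Connected` and that 5 minutes or more counts as `Disconnected`. It also assumes that an empty argument stands for a Cluster that has never sent a heartbeat.

```bash
# Illustrative only: map seconds since the last heartbeat to a status.
# Pass an empty argument when no heartbeat has ever been received.
status_for_heartbeat_gap() {
  gap="$1"
  if [ -z "$gap" ]; then
    echo "Undefined"
  elif [ "$gap" -le 120 ]; then
    echo "Connected"
  elif [ "$gap" -lt 300 ]; then
    echo "Timeout"
  else
    echo "Disconnected"
  fi
}
```

For example, `status_for_heartbeat_gap 180` prints `Timeout`, because 180 seconds is more than 2 minutes and less than 5 minutes.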

## Basic Investigation Steps

### 1. Ensure Network Connectivity for the Cluster

TODO - what steps could be done here?

Review comment (Contributor): If connectivity from the undercloud to Azure is lost, the cluster connection status moves to the `Disconnected` state.


### Other possible causes to evaluate

- Are there recent changes to the Managed Identity permissions for the Cluster Manager or Cluster?

Review comment (Contributor): Managed identity (MSI) permission changes are internal to Microsoft. We should not add this to the public document.

- The Managed Identities (MI) and their permissions are used for service-to-service authentication. A change in the permissions results in authentication failures for the heartbeat messages. Cluster Managers must both receive and acknowledge heartbeats. Failure to do so also results in a `ClusterConnectionStatus` of `Disconnected`.

[!include[stillHavingIssues](./includes/contact-support.md)]

[contact support]: https://portal.azure.com/?#blade/Microsoft_Azure_Support/HelpAndSupportBlade
Original file line number Diff line number Diff line change
@@ -0,0 +1,22 @@
---
title: Troubleshoot Azure Operator Nexus Cluster has ETCD Quorum Lost
description: Provides steps to follow in the event that an `etcd` quorum is lost for an extended period of time and the KCP did not successfully return to a stable state.
ms.service: azure-operator-nexus
ms.custom: troubleshooting
ms.topic: troubleshooting
ms.date: 10/09/2024
ms.author: omarrivera
author: omarrivera
---
# Troubleshoot Azure Operator Nexus Cluster has ETCD Quorum Lost

This guide provides steps to follow when `etcd` quorum is lost for an extended period of time and the Kubernetes Control Plane (KCP) doesn't return to a stable state.

> [!IMPORTANT]
> At this time, there's no supported approach that can be executed through customer tools.
> A feature enhancement in a future release is expected to help address this scenario.
> Open a support ticket via [contact support].

[!include[stillHavingIssues](./includes/contact-support.md)]

[contact support]: https://portal.azure.com/?#blade/Microsoft_Azure_Support/HelpAndSupportBlade