Add Troubleshooting Articles for Resource Health #124598
base: main
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,14 @@ | ||
| --- | ||
| author: omarrivera | ||
| ms.author: omarrivera | ||
| ms.date: 10/09/2024 | ||
| ms.topic: include | ||
| ms.service: azure-operator-nexus | ||
| --- | ||
| ## Still Having Issues? | ||
|
|
||
| If the steps outlined didn't resolve the issue, or if you still have questions, [contact support]. | ||
| For more information about support plans, see [Azure Support plans]. | ||
|
|
||
| [contact support]: https://portal.azure.com/?#blade/Microsoft_Azure_Support/HelpAndSupportBlade | ||
| [Azure Support plans]: https://azure.microsoft.com/support/plans/response/ |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,17 @@ | ||
| --- | ||
| author: omarrivera | ||
| ms.author: omarrivera | ||
| ms.date: 10/09/2024 | ||
| ms.topic: include | ||
| ms.service: azure-operator-nexus | ||
| --- | ||
| ## Prerequisites | ||
|
|
||
| 1. Install the latest version of the [appropriate CLI extensions](howto-install-cli-extensions.md). | ||
| 2. Collect the following information: | ||
| - Subscription ID (SUBSCRIPTION) | ||
| - Cluster name (CLUSTER) | ||
| - Resource group (CLUSTER_RG) | ||
| - Managed resource group (CLUSTER_MRG) | ||
| 3. Request subscription access to run Azure Operator Nexus network fabric (NF) and network cloud (NC) CLI extension commands. | ||
| 4. Sign in to Azure CLI and select the subscription where the cluster is deployed. |
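The prerequisites above could be captured in a small shell sketch. The variable names follow the list (`SUBSCRIPTION`, `CLUSTER`, `CLUSTER_RG`, `CLUSTER_MRG`). The helper names `require_vars` and `select_subscription` are hypothetical and not part of any CLI extension; `az login` and `az account set` are standard Azure CLI commands.

```shell
# Hypothetical helper: fail early if any named variable is unset or empty.
require_vars() {
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required variable: $name" >&2
      return 1
    fi
  done
}

# Hypothetical helper: sign in and select the subscription where the cluster is deployed.
select_subscription() {
  require_vars SUBSCRIPTION CLUSTER CLUSTER_RG CLUSTER_MRG || return 1
  az login
  az account set --subscription "$SUBSCRIPTION"
}
```

After exporting the four values collected in step 2, calling `select_subscription` covers step 4. Checking the variables first keeps a typo from surfacing later as a confusing CLI error.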
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,29 @@ | ||
| --- | ||
| title: Troubleshoot Azure Operator Nexus BareMetal Machines in a Not Ready state | ||
| description: Examine common and known issues with BareMetal Machine resources. | ||
| ms.service: azure-operator-nexus | ||
| ms.custom: troubleshooting | ||
| ms.topic: troubleshooting | ||
| ms.date: 10/09/2024 | ||
| ms.author: omarrivera | ||
| author: omarrivera | ||
| --- | ||
| # Troubleshoot Azure Operator Nexus BareMetal Machines in a Not Ready state | ||
|
|
||
| This guide provides steps to troubleshoot a BareMetal Machine that is reported in a `Not Ready` state. | ||
|
|
||
| > [!NOTE] | ||
| > There can be multiple reasons that a BareMetal Machine is in a `Not Ready` state. | ||
| > The best approach is to determine whether any of the common reasons apply. | ||
| > Although this guide covers historically known issues, it cannot cover all possible error scenarios. | ||
|
|
||
| [!include[prereqAzCLI](./includes/prereq-az-cli.md)] | ||
|
|
||
|
|
||
| TODO - use the article that exists as reference and add only the preconditions and we'll have | ||
| articles/operator-nexus/troubleshoot-bare-metal-machine-provisioning.md | ||
|
|
||
| >[!NOTE] | ||
| > NC 3.14 has OnpremLogs in the LAW - it would need to use that for reference https://teams.microsoft.com/l/message/19:99bdf627-579c-46bb-a2e1-20215be79888_e5ef5aef-6faf-4e93-ae99-d353f173d715@unq.gbl.spaces/1729106818629?context=%7B%22contextType%22%3A%22chat%22%7D | ||
|
|
||
| [!include[stillHavingIssues](./includes/contact-support.md)] |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,63 @@ | ||
| --- | ||
| title: Troubleshoot Azure Operator Nexus Cluster Heartbeat Connection Status shows Disconnected | ||
| description: Provide steps to investigate and possibly resolve circumstances that are preventing the Cluster from sending heartbeats to the Cluster Manager. | ||
| ms.service: azure-operator-nexus | ||
| ms.custom: troubleshooting | ||
| ms.topic: troubleshooting | ||
| ms.date: 10/09/2024 | ||
| ms.author: omarrivera | ||
| author: omarrivera | ||
| --- | ||
| # Troubleshoot Azure Operator Nexus Cluster Heartbeat Connection Status shows Disconnected | ||
|
|
||
| This guide provides steps to troubleshoot a Cluster whose `clusterConnectionStatus` has a value of `Disconnected`. | ||
|
|
||
| > [!CAUTION] | ||
| > The `ClusterConnectionStatus` is likely a symptom or signal rather than the root cause, and this guide can't provide answers for all scenarios. | ||
| > The purpose of this guide is to describe common issues and signals that can be inspected to determine where the problem might be. | ||
|
|
||
| ## Understanding the Issue | ||
|
|
||
| Cluster Managers ensure continuous Cluster network connectivity through a heartbeat agent running within the target Cluster. | ||
| The cluster-heartbeat agent sends periodic HTTP messages to the Cluster Manager and expects an acknowledgment in response. | ||
| A Cluster has the property `ClusterConnectionStatus`, which is set to `Connected` while the heartbeats are continuously received and acknowledged. | ||
|
|
||
| The `ClusterConnectionStatus` returns to `Connected` once the cluster is in a healthy state and network connectivity issues are resolved. | ||
| If the Cluster is expected to be healthy but the `ClusterConnectionStatus` remains `Disconnected` after you follow the steps in this guide, [contact support]. | ||
|
|
||
| > [!IMPORTANT] | ||
| > `ClusterConnectionStatus` is **not** the same as Arc Connected Kubernetes Clusters. | ||
|
|
||
| Use the following command to view the value of `ClusterConnectionStatus`. The value is also visible in the Azure portal in the Cluster resource's JSON view. | ||
|
|
||
| ```azurecli | ||
| az networkcloud cluster show --subscription "$SUBSCRIPTION_ID" -g "$CLUSTER_RG" -n "$CLUSTER_NAME" --output table --query "{ClusterConnectionStatus:clusterConnectionStatus}" | ||
|
|
||
| ClusterConnectionStatus | ||
| ------------------------- | ||
| Connected | ||
| ``` | ||
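Because the status returns to `Connected` only after heartbeats resume, it can help to poll the value rather than check it once. The following is a minimal sketch, assuming the same `SUBSCRIPTION_ID`, `CLUSTER_RG`, and `CLUSTER_NAME` variables as the command above. The function names, the retry count, and the delay are arbitrary assumptions; `--query` and `--output tsv` are global Azure CLI arguments.

```shell
# Hypothetical helper: print only the status value (for example, Connected).
get_connection_status() {
  az networkcloud cluster show \
    --subscription "$SUBSCRIPTION_ID" -g "$CLUSTER_RG" -n "$CLUSTER_NAME" \
    --query "clusterConnectionStatus" --output tsv
}

# Hypothetical helper: poll until the status is Connected or attempts run out.
# $1 = number of attempts (assumed default 10), $2 = delay in seconds (assumed default 30).
wait_for_connected() {
  attempts="${1:-10}"
  delay="${2:-30}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    status="$(get_connection_status)"
    echo "attempt $i of $attempts: ${status:-unknown}"
    if [ "$status" = "Connected" ]; then
      return 0
    fi
    # Only wait when another attempt will follow.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}
```

For example, `wait_for_connected 20 30` checks roughly every 30 seconds for about 10 minutes and returns a nonzero exit code if the Cluster never reports `Connected`.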
|
|
||
| The following table shows which status is displayed depending on the state of the undercloud cluster: | ||
|
|
||
| | Status | Definition | | ||
| |----------------|-----------------------------------------------------------------------------------------------------------------------| | ||
| | `Connected` | Heartbeats received, indicates healthy cluster and cluster manager connectivity | | ||
| | `Disconnected` | Heartbeats missed for __over 5 minutes__, indicates likely connectivity issue between Cluster Manager and Cluster | | ||
| | `Timeout`      | Heartbeats missed for __over 2 minutes but less than 5 minutes__, connectivity is uncertain, possibly degraded        | | ||
| | `Undefined` | Cluster not yet deployed or running a version without the heartbeats feature | | ||
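
The thresholds in the table can be illustrated with a short sketch. This is not how the service computes the status; it only restates the table. The function name `classify_heartbeat_gap` is hypothetical, and the handling of exactly 2 or exactly 5 minutes is an assumption because the table doesn't specify those boundaries. `Undefined` is not modeled because it doesn't depend on elapsed time.

```shell
# Illustrative only: map whole minutes since the last acknowledged heartbeat
# to the status described in the table above.
# Assumption: exactly 2 minutes is treated as Connected, exactly 5 as Timeout.
classify_heartbeat_gap() {
  minutes="$1"
  if [ "$minutes" -gt 5 ]; then
    echo "Disconnected"
  elif [ "$minutes" -gt 2 ]; then
    echo "Timeout"
  else
    echo "Connected"
  fi
}
```

Read this way, a `Timeout` value is an early warning: if heartbeats don't resume, the same Cluster is expected to move to `Disconnected` a few minutes later.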
|
|
||
| ## Basic Investigation Steps | ||
|
|
||
| ### 1. Ensure Network Connectivity for the Cluster | ||
|
|
||
| TODO - what steps could be done here? | ||
|
Contributor
Review comment: If Undercloud to Azure connectivity is disconnected, the cluster connection status moves to the `Disconnected` state.
||
|
|
||
| ### Other possible causes to evaluate | ||
|
|
||
| - Are there recent changes to the Managed Identity permissions for the Cluster Manager or Cluster? | ||
|
Contributor
Review comment: Changes to MSI permissions are internal to Microsoft. We should not include this in the public document.
||
| - The Managed Identities (MI) and their permissions are used for service-to-service authentication. A change in the permissions results in authentication failures for the heartbeat messages. Cluster Managers must both receive and acknowledge heartbeats; failure to do so also results in a `ClusterConnectionStatus` of `Disconnected`. | ||
|
|
||
| [!include[stillHavingIssues](./includes/contact-support.md)] | ||
|
|
||
| [contact support]: https://portal.azure.com/?#blade/Microsoft_Azure_Support/HelpAndSupportBlade | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,22 @@ | ||
| --- | ||
| title: Troubleshoot Azure Operator Nexus Cluster has ETCD Quorum Lost | ||
| description: Provides steps to follow in the event that an `etcd` quorum is lost for an extended period of time and the KCP did not successfully return to a stable state. | ||
| ms.service: azure-operator-nexus | ||
| ms.custom: troubleshooting | ||
| ms.topic: troubleshooting | ||
| ms.date: 10/09/2024 | ||
| ms.author: omarrivera | ||
| author: omarrivera | ||
| --- | ||
| # Troubleshoot Azure Operator Nexus Cluster has ETCD Quorum Lost | ||
|
|
||
| This guide provides steps to follow when an `etcd` quorum is lost for an extended period of time and the Kubernetes Control Plane (KCP) doesn't return to a stable state. | ||
|
|
||
| > [!IMPORTANT] | ||
| > At this time, there is no supported approach that can be executed through customer tools. | ||
| > A feature enhancement is planned for a future release to help address this scenario. | ||
| > Open a support ticket via [contact support]. | ||
|
|
||
| [!include[stillHavingIssues](./includes/contact-support.md)] | ||
|
|
||
| [contact support]: https://portal.azure.com/?#blade/Microsoft_Azure_Support/HelpAndSupportBlade |
Review comment: Can we also include a metric to monitor cluster connection status? For example:
Avg Cluster Connection Status for <> by Cluster Name where Cluster Name = <<'clustername'>>