Commit 8505a4f

Add Troubleshooting SUSE Edge documentation section (suse-edge#678)

* Add Troubleshooting SUSE Edge documentation section
* First round of review comments
* Update troubleshooting-directed-network-provisioning.adoc: updated "BMH trigger information"
* Removed Nessie content
* Second review comments

1 parent 74741dc commit 8505a4f

11 files changed: +461 −0 lines

asciidoc/edge-book/edge.adoc
Lines changed: 23 additions & 0 deletions

@@ -125,6 +125,7 @@ include::../guides/clusterclass.adoc[leveloffset=+1]
 // Tips and Tricks
 //--------------------------------------------
 
+[#tips-and-tricks]
 = Tips and Tricks
 
 [partintro]
@@ -192,6 +193,28 @@ include::../product/atip-lifecycle.adoc[]
 // Observability / Terminology
 //--------------------------------------------
 
+= Troubleshooting
+
+[partintro]
+This section provides guidance to diagnose and resolve common issues with SUSE Edge deployments and operations. It covers various topics, offering component-specific troubleshooting steps, key tools, and relevant log locations.
+
+include::../troubleshooting/general-troubleshooting-principles.adoc[]
+
+include::../troubleshooting/troubleshooting-kiwi.adoc[]
+
+include::../troubleshooting/troubleshooting-edge-image-builder.adoc[]
+
+include::../troubleshooting/troubleshooting-edge-networking.adoc[]
+
+include::../troubleshooting/troubleshooting-phone-home-scenarios.adoc[]
+
+include::../troubleshooting/troubleshooting-directed-network-provisioning.adoc[]
+
+include::../troubleshooting/troubleshooting-other-components.adoc[]
+
+include::../troubleshooting/collecting-diagnostics-for-support.adoc[]
+
+
 = Appendix
 
 //--------------------------------------------

asciidoc/quickstart/eib.adoc
Lines changed: 1 addition & 0 deletions

@@ -136,6 +136,7 @@ operatingSystem:
 It's also possible to add additional users, create the home directories, set user-id's, add ssh-key authentication, and modify group information. Please refer to the {link-eib-building-images}[upstream building images guide] for further examples.
 ====
 
+[#configuring-os-time]
 === Configuring OS time
 
 The `time` section is optional but it is highly recommended to be configured to avoid potential issues with certificates and clock skew. EIB will configure chronyd and `/etc/localtime` depending on the parameters here.
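As an illustrative sketch of what such a `time` section might look like (the exact keys should be verified against the EIB configuration reference for your release; the timezone and NTP pool below are placeholders):

```yaml
# Illustrative only: verify key names against the EIB documentation.
operatingSystem:
  time:
    timezone: Europe/Berlin        # IANA timezone name
    ntp:
      forceWait: true              # wait for time sync on first boot
      pools:
        - 2.suse.pool.ntp.org      # example NTP pool
```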

asciidoc/quickstart/metal3.adoc
Lines changed: 2 additions & 0 deletions

@@ -267,6 +267,7 @@ Also note that `ignition.platform.id=openstack` is mandatory - without this argu
 The `time` section is optional but it is highly recommended to be configured to avoid potential issues with certificates and clock skew. The values provided in this example are for illustrative purposes only. Please adjust them to fit your specific requirements.
 ====
 
+[#growfs-script]
 ===== Growfs script
 
 Currently, a custom script (`custom/scripts/01-fix-growfs.sh`) is required to grow the file system to match the disk size on first-boot after provisioning. The `01-fix-growfs.sh` script contains the following information:
@@ -812,6 +813,7 @@ metal3-ironic:
 type: NodePort
 ----
 
+[#disabling-tls-for-virtualmedia-iso-attachment]
 === Disabling TLS for virtualmedia ISO attachment
 
 Some server vendors verify the SSL connection when attaching virtual-media ISO images to the BMC, which can cause a problem because the generated
asciidoc/troubleshooting/collecting-diagnostics-for-support.adoc
Lines changed: 66 additions & 0 deletions

@@ -0,0 +1,66 @@
+[#collecting-diagnostics-for-support]
+== Collecting Diagnostics for Support
+:experimental:
+
+ifdef::env-github[]
+:imagesdir: ../images/
+:tip-caption: :bulb:
+:note-caption: :information_source:
+:important-caption: :heavy_exclamation_mark:
+:caution-caption: :fire:
+:warning-caption: :warning:
+endif::[]
+
+When contacting SUSE Support, providing comprehensive diagnostic information is crucial.
+
+.Essential Information to Collect
+
+* *Detailed problem description*: What happened, when did it happen, what were you doing, what is the expected behavior, and what is the actual behavior?
+* *Steps to reproduce*: Can you reliably reproduce the issue? If so, list the exact steps.
+* *Component versions*: SUSE Edge version and component versions (RKE2/K3s, EIB, Metal^3^, Elemental, etc.).
+* *Relevant logs*:
+** `journalctl` output (filtered by service if possible, or full boot logs).
+** Kubernetes pod logs (`kubectl logs`).
+** Metal^3^/Elemental component logs.
+** EIB build logs and other relevant logs.
+* *System information*:
+** `uname -a`
+** `df -h`
+** `ip a`
+** `/etc/os-release`
+* *Configuration files*: Relevant configuration files for Elemental, Metal^3^ and EIB, such as Helm chart values, ConfigMaps, etc.
+* *Kubernetes information*: Nodes, Services, Deployments, etc.
+* *Kubernetes objects affected*: BMH, MachineRegistration, etc.
+
+.How to collect
+
+* *For logs*: Redirect command output to files (for example, `journalctl -u k3s > k3s_logs.txt`).
+* *For Kubernetes resources*: Use `kubectl get <resource> -o yaml > <resource_name>.yaml` to get detailed YAML definitions.
+* *For system information*: Collect the output of the commands listed above.
+* *For SL Micro*: Check the https://documentation.suse.com/sle-micro/5.5/html/SLE-Micro-all/cha-adm-support-slemicro.html[SUSE Linux Micro Troubleshooting Guide] for how to gather system information for support with `supportconfig`.
+* *For RKE2/Rancher*: See the https://www.suse.com/support/kb/doc/?id=000020191[Rancher v2.x Linux log collector script] article for how to run the log collector script.
+
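The collection steps above can be sketched as a single helper (illustrative only, not an official SUSE tool; the function name, output layout and log sources are assumptions to adapt to your environment):

```shell
#!/bin/sh
# collect_diagnostics DIR: gather the system information listed above into
# DIR and pack it for attaching to a support ticket. Illustrative sketch.
collect_diagnostics() {
    outdir="$1"
    mkdir -p "$outdir"
    uname -a > "$outdir/uname.txt"
    df -h > "$outdir/df.txt"
    # These sources may be absent on minimal systems; skip them quietly.
    { command -v ip > /dev/null 2>&1 && ip a > "$outdir/ip-a.txt"; } || true
    { [ -r /etc/os-release ] && cp /etc/os-release "$outdir/"; } || true
    { command -v journalctl > /dev/null 2>&1 \
        && journalctl -b --no-pager > "$outdir/journal.txt"; } || true
    tar czf "$outdir.tar.gz" "$outdir"
}

# Example: collect_diagnostics "edge-diag-$(date +%Y%m%d)"
```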
+////
+
+* *Nessie*: Nessie is a powerful diagnostic tool designed to collect logs and configuration data from SUSE Edge environments. It gathers comprehensive information from both the host system and Kubernetes clusters, making it invaluable for troubleshooting and support. To collect logs from a SUSE Edge cluster, connect to any of the control plane nodes and run:
++
+[,shell]
+----
+podman run --privileged \
+  -v /etc/rancher/k3s/k3s.yaml:/etc/rancher/k3s/k3s.yaml:ro \
+  -v /var/log/journal:/var/log/journal:ro \
+  -v /run/systemd:/run/systemd:ro \
+  -v /etc/machine-id:/etc/machine-id:ro \
+  -v /tmp/nessie-logs:/tmp/cluster-logs \
+  ghcr.io/gagrio/nessie
+----
++
+[NOTE]
+====
+Adjust the paths of the `k3s.yaml`/`rke2.yaml` file if needed. See https://github.com/suse-edge/support-tools/blob/main/nessie/README.md[Nessie] for more information.
+====
+
+////
+
+.Contact Support
+Please check the https://www.suse.com/support/kb/doc/?id=000019452[How-to effectively work with SUSE Technical Support] article and the https://www.suse.com/support/handbook/[SUSE Technical Support Handbook] for more details on how to contact SUSE Support.
asciidoc/troubleshooting/general-troubleshooting-principles.adoc
Lines changed: 26 additions & 0 deletions

@@ -0,0 +1,26 @@
+[#general-troubleshooting-principles]
+== General Troubleshooting Principles
+:experimental:
+
+ifdef::env-github[]
+:imagesdir: ../images/
+:tip-caption: :bulb:
+:note-caption: :information_source:
+:important-caption: :heavy_exclamation_mark:
+:caution-caption: :fire:
+:warning-caption: :warning:
+endif::[]
+
+Before diving into component-specific issues, consider these general principles:
+
+* *Check logs*: Logs are the primary source of information. Most of the time, errors are self-explanatory and contain hints about what failed.
+* *Check clocks*: Clock differences between systems can lead to all kinds of errors. Ensure clocks are in sync. EIB can be instructed to force clock sync at boot time; see <<quickstart-eib,Configuring OS Time>>.
+* *Boot issues*: If the system is stuck during boot, note down the last messages displayed. Access the console (physical or via BMC) to observe boot messages.
+* *Network issues*: Verify the network interface configuration (`ip a`) and routing table (`ip route`), and test connectivity from/to other nodes and external services (`ping`, `nc`). Ensure firewall rules are not blocking necessary ports.
+* *Verify component status*: Use `kubectl get` and `kubectl describe` for Kubernetes resources. Use `kubectl get events --sort-by='.lastTimestamp' -n <namespace>` to see the events in a particular Kubernetes namespace.
+* *Verify service status*: Use `systemctl status <service>` for systemd services.
+* *Check syntax*: Software expects a certain structure and syntax in configuration files. For YAML files, for example, use `yamllint` or similar tools to verify the syntax.
+* *Isolate the problem*: Try to narrow down the issue to a specific component or layer (for example, network, storage, OS, Kubernetes, Metal^3^, Ironic, etc.).
+* *Documentation*: Always refer to the official https://documentation.suse.com/suse-edge/[SUSE Edge documentation] as well as upstream documentation for detailed information.
+* *Versions*: SUSE Edge is an opinionated and thoroughly tested combination of different SUSE components. The versions of each component per SUSE Edge release can be found in the https://documentation.suse.com/suse-edge/support-matrix/html/support-matrix/index.html[SUSE Edge support matrix].
+* *Known issues*: Each SUSE Edge release has a "Known issues" section in the release notes with information about issues that will be fixed in future releases but may affect the current one.
asciidoc/troubleshooting/troubleshooting-directed-network-provisioning.adoc
Lines changed: 129 additions & 0 deletions

@@ -0,0 +1,129 @@
+[#troubleshooting-directed-network-provisioning]
+== Troubleshooting Directed-network provisioning
+:experimental:
+
+ifdef::env-github[]
+:imagesdir: ../images/
+:tip-caption: :bulb:
+:note-caption: :information_source:
+:important-caption: :heavy_exclamation_mark:
+:caution-caption: :fire:
+:warning-caption: :warning:
+endif::[]
+
+Directed-network provisioning scenarios involve using Metal^3^ and CAPI elements to provision the downstream cluster, and EIB to create the OS image. Issues can happen when the host boots for the first time, or during the inspection or provisioning processes.
+
+.Common Issues
+
+* *Old firmware*: Verify that all firmware on the physical hosts is up to date. This includes the BMC firmware, as Metal^3^ sometimes https://book.metal3.io/bmo/supported_hardware#redfish-and-its-variants[requires specific or updated versions].
+* *Provisioning failed with SSL errors*: If the web server serving the images uses HTTPS, Metal^3^ needs to be configured to inject and trust the certificate in the IPA image. See <<mgmt-cluster-kubernetes-folder,Kubernetes folder>> for how to include a `ca-additional.crt` file in the Metal^3^ chart.
+* *Certificate issues when booting the hosts with IPA*: Some server vendors verify the SSL connection when attaching virtual-media ISO images to the BMC, which can cause a problem because the generated certificates for the Metal^3^ deployment are self-signed. The host may boot but then drop to a UEFI shell. See <<disabling-tls-for-virtualmedia-iso-attachment, Disabling TLS for virtualmedia ISO attachment>> for how to fix it.
+* *Wrong name or label reference*: If the cluster references a node by the wrong name or label, the cluster reports as deployed but the BMH remains "Available". Double-check the references to the BMHs in the involved objects.
+* *BMC communication issues*: Ensure the Metal^3^ pods running on the management cluster can reach the BMCs of the hosts being provisioned (the BMC network is usually very restricted).
+* *Incorrect bare metal host state*: The BMH object goes through different states (inspecting, preparing, provisioned, etc.) during its lifetime; see the https://book.metal3.io/bmo/state_machine[BMH state machine]. If an incorrect state is detected, check the `status` field of the BMH object, as it contains more information: `kubectl get bmh <name> -o jsonpath='{.status}' | jq`.
+* *Host not being deprovisioned*: If deprovisioning a host fails, removal can be attempted after adding the "detached" annotation to the BMH object: `kubectl annotate bmh/<BMH> baremetalhost.metal3.io/detached=""`.
+* *Image errors*: Verify that the image built with EIB for the downstream cluster is available, has the proper checksum, and is not too large to decompress or too large for the disk.
+* *Disk size mismatch*: By default, the file system is not expanded to fill the whole disk. As explained in the <<growfs-script, Growfs script>> section, a growfs script needs to be included in the image built with EIB for the downstream cluster hosts.
+* *Cleaning process stuck*: The cleaning process is retried several times. If cleaning is no longer possible due to a problem with the host, disable cleaning first by setting the `automatedCleaningMode` field to `disabled` on the BMH object.
++
+[WARNING]
+====
+It is not recommended to manually remove the finalizer when the cleaning process is taking longer than desired or is failing. Doing so removes the host record from Kubernetes but leaves it in Ironic. The currently running action continues in the background, and an attempt to add the host again may fail because of the conflict.
+====
+* *Metal^3^/Rancher Turtles/CAPI pod issues*: The deployment flow for all the required components is:
++
+** The Rancher Turtles controller deploys the CAPI operator controller.
+** The CAPI operator controller then deploys the provider controllers (CAPI core, CAPM3 and RKE2 control plane/bootstrap).
+
+Verify that all the pods are running correctly, and check the logs otherwise.
+
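As a small sketch complementing the `status` inspection mentioned above, the provisioning state can also be read from a saved copy of the BMH object without needing `jq` (the helper name is illustrative; the `status.provisioning.state` path is the standard BareMetalHost status field):

```shell
#!/bin/sh
# bmh_state FILE: print the provisioning state from a BMH object saved as
# JSON (e.g. kubectl get bmh <name> -n <ns> -o json > FILE).
# Illustrative helper; uses python3 so no extra tools are required.
bmh_state() {
    python3 -c 'import json, sys
obj = json.load(open(sys.argv[1]))
print(obj.get("status", {}).get("provisioning", {}).get("state", "unknown"))' "$1"
}
```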
+.Logs
+* *Metal^3^ logs*: Check the logs of the different pods.
++
+[,shell]
+----
+kubectl logs -n metal3-system -l app.kubernetes.io/component=baremetal-operator
+kubectl logs -n metal3-system -l app.kubernetes.io/component=ironic
+----
++
+[NOTE]
+====
+The metal3-ironic pod contains at least four different containers (`ironic-httpd`, `ironic-log-watch`, `ironic` and the `ironic-ipa-downloader` init container) in the same pod. Use the `-c` flag with `kubectl logs` to see the logs of each container.
+====
++
+[NOTE]
+====
+The `ironic-log-watch` container exposes console logs from the hosts after inspection/provisioning, provided network connectivity allows sending these logs back to the management cluster. This can be useful when there are provisioning errors but you do not have direct access to the BMC console logs.
+====
+
+* *Rancher Turtles logs*: Check the logs of the different pods.
++
+[,shell]
+----
+kubectl logs -n rancher-turtles-system -l control-plane=controller-manager
+kubectl logs -n rancher-turtles-system -l app.kubernetes.io/name=cluster-api-operator
+kubectl logs -n rke2-bootstrap-system -l cluster.x-k8s.io/provider=bootstrap-rke2
+kubectl logs -n rke2-control-plane-system -l cluster.x-k8s.io/provider=control-plane-rke2
+kubectl logs -n capi-system -l cluster.x-k8s.io/provider=cluster-api
+kubectl logs -n capm3-system -l cluster.x-k8s.io/provider=infrastructure-metal3
+----
+
+* *BMC logs*: BMCs usually have a UI where most of the interaction can be done, including a "logs" section that can be checked for potential issues (not being able to reach the image, hardware failures, etc.).
+
+* *Console logs*: Connect to the BMC console (via the BMC web UI, serial, etc.) and check for errors in the logs being written.
+
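The log commands above can be wrapped in a small collection script (a sketch; the selectors mirror the `kubectl logs` commands in this section, while the output directory and file names are arbitrary choices):

```shell
#!/bin/sh
# Collect provisioning-related pod logs into a directory for inspection.
# Sketch only: adjust namespaces/selectors to your deployment.
outdir="${1:-provisioning-logs}"
mkdir -p "$outdir"

collect() {  # collect NAMESPACE SELECTOR OUTFILE
    kubectl logs -n "$1" -l "$2" --all-containers --prefix \
        > "$outdir/$3" 2>&1 \
        || echo "could not fetch logs for '$2' in '$1'" >> "$outdir/$3"
}

collect metal3-system app.kubernetes.io/component=baremetal-operator bmo.log
collect metal3-system app.kubernetes.io/component=ironic ironic.log
collect rancher-turtles-system control-plane=controller-manager turtles.log
collect rancher-turtles-system app.kubernetes.io/name=cluster-api-operator capi-operator.log
collect rke2-bootstrap-system cluster.x-k8s.io/provider=bootstrap-rke2 rke2-bootstrap.log
collect rke2-control-plane-system cluster.x-k8s.io/provider=control-plane-rke2 rke2-cp.log
collect capi-system cluster.x-k8s.io/provider=cluster-api capi.log
collect capm3-system cluster.x-k8s.io/provider=infrastructure-metal3 capm3.log
```

When `kubectl` cannot reach the cluster, each file records the failure instead, so the bundle always documents what was attempted.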
+.Troubleshooting steps
+
+. *Check `BareMetalHost` status*:
+
+* `kubectl get bmh -A` shows the current state. Look for `provisioning`, `ready`, `error`, `registering`.
+* `kubectl describe bmh -n <namespace> <bmh_name>` provides detailed events and conditions explaining why a BMH might be stuck.
+
+. *Test Redfish connectivity*:
+
+* Use `curl` from the Metal^3^ control plane to test connectivity to the BMCs via Redfish.
+* Ensure the correct BMC credentials are provided in the `BareMetalHost` `Secret` definition.
+
+. *Verify Turtles/CAPI/Metal^3^ pod status*: Ensure the containers on the management cluster are up and running: `kubectl get pods -n metal3-system` and `kubectl get pods -n rancher-turtles-system` (also see `capi-system`, `capm3-system`, `rke2-bootstrap-system` and `rke2-control-plane-system`).
+
+. *Verify the Ironic endpoint is reachable from the host being provisioned*: The host being provisioned needs to be able to reach the Ironic endpoint to report back to Metal^3^. Check the IP with `kubectl get svc -n metal3-system metal3-metal3-ironic` and try to reach it via `curl`/`nc`.
+
+. *Verify the IPA image is reachable from the BMC*: IPA is served by the Ironic endpoint and needs to be reachable from the BMC, as it is used as a virtual CD.
+
+. *Verify the OS image is reachable from the host being provisioned*: The image used to provision the host needs to be reachable from the host itself (when running IPA), as it is downloaded temporarily and written to the disk.
+
+. *Examine Metal^3^ component logs*: See above.
+
+. *Retrigger BMH inspection*: If an inspection failed or the hardware of an available host changed, a new inspection process can be triggered by annotating the BMH object with `inspect.metal3.io: ""`. See the https://book.metal3.io/bmo/inspect_annotation[Metal^3^ Controlling inspection] guide for more information.
+
+. *Bare metal IPA console*: To troubleshoot IPA issues, a couple of alternatives exist:
+
+* Enable "autologin". This logs in the root user automatically when connecting to the IPA console.
++
+[WARNING]
+====
+This is only for debug purposes, as it gives full access to the host.
+====
++
+To enable autologin, the Metal^3^ Helm `global.ironicKernelParams` value should look like `console=ttyS0 suse.autologin=ttyS0`, followed by a redeployment of the Metal^3^ chart. Note that `ttyS0` is an example and should match the actual terminal; it may be `tty1` in many cases on bare metal. This can be verified by looking at the console output from the IPA ramdisk on boot, where `/etc/issue` prints the console name.
++
+Another way is to change the `IRONIC_KERNEL_PARAMS` parameter in the `ironic-bmo` ConfigMap in the `metal3-system` namespace. This can be easier, as it can be done via `kubectl edit`, but it will be overwritten when the chart is updated. The Metal^3^ pod then needs to be restarted with `kubectl delete pod -n metal3-system -l app.kubernetes.io/component=ironic`.
+
+* Inject an SSH key for the root user in the IPA image.
++
+[WARNING]
+====
+This is only for debug purposes, as it gives full access to the host.
+====
++
+To inject the SSH key for the root user, the Metal^3^ Helm `debug.ironicRamdiskSshKey` value should be used, followed by a redeployment of the Metal^3^ chart.
++
+Another way is to change the `IRONIC_RAMDISK_SSH_KEY` parameter in the `ironic-bmo` ConfigMap in the `metal3-system` namespace. This can be easier, as it can be done via `kubectl edit`, but it will be overwritten when the chart is updated. The Metal^3^ pod then needs to be restarted with `kubectl delete pod -n metal3-system -l app.kubernetes.io/component=ironic`.
+
+
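For the Redfish connectivity step above, a minimal probe might look like the following sketch (the BMC address and credentials are placeholders; `-k` skips certificate verification, which is common with self-signed BMC certificates, and `/redfish/v1/` is the standard Redfish service root):

```shell
#!/bin/sh
# redfish_url HOST PATH: build a Redfish endpoint URL for the curl probes
# shown in the comments below (helper name is illustrative).
redfish_url() {
    printf 'https://%s/redfish/v1%s' "$1" "$2"
}

# Example probes (placeholders: BMC address, credentials):
#   curl -ks "$(redfish_url 192.168.1.100 /)"
#   curl -ks -u 'admin:password' "$(redfish_url 192.168.1.100 /Systems)"
```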
+[NOTE]
+====
+Check the https://cluster-api.sigs.k8s.io/user/troubleshooting[CAPI troubleshooting] and https://book.metal3.io/troubleshooting[Metal^3^ troubleshooting] guides.
+====
+