|
| 1 | +[#troubleshooting-directed-network-provisioning] |
| 2 | +== Troubleshooting Directed-network provisioning |
| 3 | +:experimental: |
| 4 | + |
| 5 | +ifdef::env-github[] |
| 6 | +:imagesdir: ../images/ |
| 7 | +:tip-caption: :bulb: |
| 8 | +:note-caption: :information_source: |
| 9 | +:important-caption: :heavy_exclamation_mark: |
| 10 | +:caution-caption: :fire: |
| 11 | +:warning-caption: :warning: |
| 12 | +endif::[] |
| 13 | + |
| 14 | +Directed-network provisioning scenarios use Metal^3^ and CAPI elements to provision the downstream cluster, and EIB to create the OS image. Issues can occur when the host boots for the first time, or during the inspection or provisioning processes. |
| 15 | + |
| 16 | +.Common Issues |
| 17 | + |
| 18 | +* *Old firmware*: Verify that all firmware on the physical hosts is up to date. This includes the BMC firmware, because Metal^3^ sometimes https://book.metal3.io/bmo/supported_hardware#redfish-and-its-variants[requires specific or updated versions]. |
| 19 | +* *Provisioning failed with SSL errors*: If the web server serving the images uses HTTPS, Metal^3^ needs to be configured to inject and trust the certificate on the IPA image. See <<mgmt-cluster-kubernetes-folder,Kubernetes folder>> on how to include a `ca-additional.crt` file in the Metal^3^ chart. |
| 20 | +* *Certificate issues when booting the hosts with IPA*: Some server vendors verify the SSL connection when attaching virtual-media ISO images to the BMC, which can cause a problem because the certificates generated for the Metal^3^ deployment are self-signed. A typical symptom is that the host boots but drops to a UEFI shell. See <<disabling-tls-for-virtualmedia-iso-attachment, Disabling TLS for virtualmedia ISO attachment>> on how to fix it. |
| 21 | +* *Wrong name or label reference*: If the cluster references a node by the wrong name or label, the cluster is reported as deployed but the BMH remains in the “Available” state. Double-check the references to the BMHs on the involved objects. |
| 22 | +* *BMC communication issues*: Ensure the Metal^3^ pods running on the management cluster can reach the BMC of the hosts being provisioned (usually the BMC network is very restricted). |
| 23 | +* *Incorrect bare metal host state*: The BMH object goes through different states (inspecting, preparing, provisioned, etc.) during its lifetime, as described in the https://book.metal3.io/bmo/state_machine[BMH state machine] documentation. If the host is in an unexpected state, check the `status` field of the BMH object, which contains more information: `kubectl get bmh <name> -o jsonpath='{.status}' | jq`. |
| 24 | +* *Host not being deprovisioned*: If deprovisioning a host fails, the removal can be attempted again after adding the “detached” annotation to the BMH object: `kubectl annotate bmh/<BMH> baremetalhost.metal3.io/detached=""`. |
| 25 | +* *Image errors*: Verify that the image built with EIB for the downstream cluster is available, has a correct checksum, and is not too large to decompress or to fit on the disk. |
| 26 | +* *Disk size mismatch*: By default, the root file system does not expand to fill the whole disk. As explained in the <<growfs-script, Growfs script>> section, a growfs script needs to be included in the image being built with EIB for the downstream cluster hosts. |
| 27 | +* *Cleaning process stuck*: The cleaning process is retried several times. If cleaning is no longer possible because of a problem with the host, disable cleaning first by setting the `automatedCleanMode` field to `disabled` on the BMH object. |
| 28 | ++ |
| 29 | +[WARNING] |
| 30 | +==== |
| 31 | +Manually removing the finalizer when the cleaning process is taking longer than desired or is failing is not recommended. Doing so removes the host record from Kubernetes but leaves it in Ironic. The currently running action continues in the background, and an attempt to add the host again may fail because of the conflict. |
| 32 | +==== |
| 33 | +* *Metal^3^/Rancher Turtles/CAPI pod issues*: The deployment flow for all the required components is: |
| 34 | ++ |
| 35 | +** The Rancher Turtles controller deploys the CAPI operator controller. |
| 36 | +** The CAPI operator controller then deploys the provider controllers (CAPI core, CAPM3 and RKE2 controlplane/bootstrap). |
| 37 | +
|
| 38 | +Verify that all the pods are running correctly. If any of them is not, check its logs. |
| 39 | + |
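The BMH-related checks from the list above can be sketched as small shell helpers. This is a minimal sketch rather than an official tool: the function names are hypothetical, and the `status.provisioning.state`, `status.errorType` and `status.errorMessage` paths assume the upstream `BareMetalHost` status layout.

```shell
# Hypothetical helpers; usage: <function> <bmh-name> <namespace>

# Print the provisioning state, error type and error message of a BMH.
bmh_state() {
  kubectl get bmh "$1" -n "$2" \
    -o jsonpath='{.status.provisioning.state}{"\t"}{.status.errorType}{"\t"}{.status.errorMessage}{"\n"}'
}

# Disable automated cleaning on a BMH whose cleaning can no longer succeed.
bmh_disable_cleaning() {
  kubectl patch bmh "$1" -n "$2" --type merge \
    -p '{"spec":{"automatedCleanMode":"disabled"}}'
}

# Add the "detached" annotation before retrying the removal of a BMH.
bmh_detach() {
  kubectl annotate bmh "$1" -n "$2" baremetalhost.metal3.io/detached=""
}
```

For example, `bmh_state <name> <namespace>` prints one tab-separated line, which is quicker to scan than the full `status` output when checking several hosts.
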
| 40 | + |
| 41 | +.Logs |
| 42 | +* *Metal^3^ logs*: Check logs for the different pods. |
| 43 | ++ |
| 44 | +[,shell] |
| 45 | +---- |
| 46 | +kubectl logs -n metal3-system -l app.kubernetes.io/component=baremetal-operator |
| 47 | +kubectl logs -n metal3-system -l app.kubernetes.io/component=ironic |
| 48 | +---- |
| 49 | ++ |
| 50 | +[NOTE] |
| 51 | +==== |
| 52 | +The metal3-ironic pod contains at least four containers (`ironic-httpd`, `ironic-log-watch`, `ironic` and the `ironic-ipa-downloader` init container). Use the `-c` flag with `kubectl logs` to check the logs of each container. |
| 53 | +==== |
| 54 | ++ |
| 55 | +[NOTE] |
| 56 | +==== |
| 57 | +The `ironic-log-watch` container exposes console logs from the hosts after inspection/provisioning, provided network connectivity enables sending these logs back to the management cluster. This can be useful in cases where there are provisioning errors but you do not have direct access to the BMC console logs. |
| 58 | +==== |
| 59 | +
|
| 60 | +* *Rancher Turtles logs*: Check logs for the different pods. |
| 61 | ++ |
| 62 | +[,shell] |
| 63 | +---- |
| 64 | +kubectl logs -n rancher-turtles-system -l control-plane=controller-manager |
| 65 | +kubectl logs -n rancher-turtles-system -l app.kubernetes.io/name=cluster-api-operator |
| 66 | +kubectl logs -n rke2-bootstrap-system -l cluster.x-k8s.io/provider=bootstrap-rke2 |
| 67 | +kubectl logs -n rke2-control-plane-system -l cluster.x-k8s.io/provider=control-plane-rke2 |
| 68 | +kubectl logs -n capi-system -l cluster.x-k8s.io/provider=cluster-api |
| 69 | +kubectl logs -n capm3-system -l cluster.x-k8s.io/provider=infrastructure-metal3 |
| 70 | +---- |
| 71 | +
|
| 72 | +* *BMC logs*: BMCs usually have a UI where most of the interaction can be done. It typically includes a “logs” section that can be checked for potential issues (not being able to reach the image, hardware failures, etc.). |
| 73 | +
|
| 74 | +* *Console logs*: Connect to the BMC console (via the BMC web UI, serial, etc.) and check for errors in the logs being written. |
| 75 | +
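Because the Ironic pod runs several containers, it can help to dump the recent logs of each one in a single pass. The following is a minimal sketch: the function name is hypothetical, and the container names are the ones listed in the note above, which may differ between chart versions.

```shell
# Hypothetical helper: print the last lines of every container in the Ironic pod.
# Optional first argument: number of lines per container (default: 50).
ironic_container_logs() {
  tail_lines="${1:-50}"
  for c in ironic-ipa-downloader ironic-httpd ironic-log-watch ironic; do
    echo "=== ${c} ==="
    kubectl logs -n metal3-system -l app.kubernetes.io/component=ironic \
      -c "${c}" --tail="${tail_lines}"
  done
}
```

Calling `ironic_container_logs 200` prints the last 200 lines of each container under its own header.
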
|
| 76 | +.Troubleshooting steps |
| 77 | + |
| 78 | +. *Check `BareMetalHost` status*: |
| 79 | + |
| 80 | +* `kubectl get bmh -A` shows the current state. Look for `provisioning`, `ready`, `error`, `registering`. |
| 81 | +* `kubectl describe bmh -n <namespace> <bmh_name>` provides detailed events and conditions explaining why a BMH might be stuck. |
| 82 | +
|
| 83 | +. *Test Redfish connectivity*: |
| 84 | + |
| 85 | +* Use `curl` from the Metal^3^ control plane to test connectivity to the BMCs via Redfish. |
| 86 | +* Ensure correct BMC credentials are provided in the `BareMetalHost-Secret` definition. |
| 87 | +
|
| 88 | +. *Verify Rancher Turtles/CAPI/Metal^3^ pod status*: Ensure the pods on the management cluster are up and running: `kubectl get pods -n metal3-system` and `kubectl get pods -n rancher-turtles-system` (also see `capi-system`, `capm3-system`, `rke2-bootstrap-system` and `rke2-control-plane-system`). |
| 89 | + |
| 90 | +. *Verify the Ironic endpoint is reachable from the host being provisioned*: The host being provisioned needs to be able to reach the Ironic endpoint to report back to Metal^3^. Check the IP with `kubectl get svc -n metal3-system metal3-metal3-ironic` and try to reach it via `curl` or `nc`. |
| 91 | + |
| 92 | +. *Verify the IPA image is reachable from the BMC*: The IPA image is served by the Ironic endpoint and must be reachable from the BMC, because it is attached as a virtual CD. |
| 93 | + |
| 94 | +. *Verify the OS image is reachable from the host being provisioned*: The image being used to provision the host needs to be reachable from the host itself (when running IPA) as it will be downloaded temporarily and written to the disk. |
| 95 | + |
| 96 | +. *Examine Metal^3^ component logs*: See above. |
| 97 | + |
| 98 | +. *Retrigger BMH inspection*: If an inspection failed or the hardware of an available host changed, a new inspection process can be triggered by annotating the BMH object with `inspect.metal3.io: ""`. See the https://book.metal3.io/bmo/inspect_annotation[Metal^3^ Controlling inspection] guide for more information. |
| 99 | + |
| 100 | +. *Bare metal IPA console*: To troubleshoot IPA issues a couple of alternatives exist: |
| 101 | + |
| 102 | +* Enable “autologin”. This logs the root user in automatically when connecting to the IPA console. |
| 103 | ++ |
| 104 | +[WARNING] |
| 105 | +==== |
| 106 | +This is only for debug purposes as it gives full access to the host. |
| 107 | +==== |
| 108 | ++ |
| 109 | +To enable autologin, set the Metal^3^ Helm `global.ironicKernelParams` value to something like `console=ttyS0 suse.autologin=ttyS0`, then redeploy the Metal^3^ chart. `ttyS0` is only an example and must match the actual terminal; on bare metal it is often `tty1`. To verify it, look at the console output of the IPA ramdisk on boot, where `/etc/issue` prints the console name. |
| 110 | ++ |
| 111 | +Alternatively, change the `IRONIC_KERNEL_PARAMS` parameter in the `ironic-bmo` ConfigMap in the `metal3-system` namespace. This can be easier because it can be done via `kubectl edit`, but the change is overwritten when the chart is updated. Then restart the Metal^3^ pod with `kubectl delete pod -n metal3-system -l app.kubernetes.io/component=ironic`. |
| 112 | +
|
| 113 | +* Inject an SSH key for the root user on the IPA. |
| 114 | ++ |
| 115 | +[WARNING] |
| 116 | +==== |
| 117 | +This is only for debug purposes as it gives full access to the host. |
| 118 | +==== |
| 119 | ++ |
| 120 | +To inject the SSH key for the root user, set the Metal^3^ Helm `debug.ironicRamdiskSshKey` value, then redeploy the Metal^3^ chart. |
| 121 | ++ |
| 122 | +Alternatively, change the `IRONIC_RAMDISK_SSH_KEY` parameter in the `ironic-bmo` ConfigMap in the `metal3-system` namespace. This can be easier because it can be done via `kubectl edit`, but the change is overwritten when the chart is updated. Then restart the Metal^3^ pod with `kubectl delete pod -n metal3-system -l app.kubernetes.io/component=ironic`. |
| 123 | +
|
| 124 | +
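The Redfish connectivity test from the steps above can be sketched with two `curl` requests. This is a minimal sketch under some assumptions: the function name is hypothetical, the BMC is assumed to serve Redfish over HTTPS with a self-signed certificate (hence `-k`), and it should be run from a machine or pod on the same network as the Metal^3^ pods. Passing the password as an argument exposes it in the process list, so use it only for debugging.

```shell
# Hypothetical helper; usage: redfish_check <bmc-address> <user> <password>
redfish_check() {
  bmc="$1"; user="$2"; pass="$3"
  # Service root: readable without credentials on most BMCs.
  curl -sk --max-time 10 -o /dev/null -w 'service root: %{http_code}\n' \
    "https://${bmc}/redfish/v1/"
  # Systems collection: requires valid credentials.
  curl -sk --max-time 10 -u "${user}:${pass}" -o /dev/null \
    -w 'systems: %{http_code}\n' "https://${bmc}/redfish/v1/Systems"
}
```

A `200` status on both requests suggests the BMC is reachable and the credentials are accepted. A `401` on the second request points to wrong credentials, while a timeout points to a network restriction between the management cluster and the BMC.
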
|
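Retriggering an inspection and restarting the Ironic pod, both described in the steps above, can be wrapped in two small helpers. This is a minimal sketch: the function names are hypothetical, while the annotation and the label selector are the ones used earlier in this section.

```shell
# Hypothetical helpers.

# Request a new inspection of a BMH; usage: bmh_reinspect <bmh-name> <namespace>
bmh_reinspect() {
  kubectl annotate bmh "$1" -n "$2" inspect.metal3.io=""
}

# Restart the Ironic pod, for example after editing the ironic-bmo ConfigMap.
ironic_restart() {
  kubectl delete pod -n metal3-system -l app.kubernetes.io/component=ironic
}
```

After calling `bmh_reinspect`, running `kubectl get bmh -n <namespace> -w` shows the host moving through its states again.
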
| 125 | +[NOTE] |
| 126 | +==== |
| 127 | +Check the https://cluster-api.sigs.k8s.io/user/troubleshooting[CAPI troubleshooting] and https://book.metal3.io/troubleshooting[Metal^3^ troubleshooting] guides. |
| 128 | +==== |
| 129 | + |