Pi 3+ buster pxe nfsboot - kernel hangs / "Fatal exception in interrupt" panics #3067
Comments
Confirmed. It's broken. How can I revert to previous version? |
Some more panics, which seem to be different each time.
Issue seems to be present with the 4.19.x kernels. |
If using piserver and "apt-get upgrade" gave you a 4.19.x kernel that bricked your system, use the following to downgrade to 4.14.x:
|
This looks like memory corruption - something is probably writing off the end of a buffer, or continuing to write to it after it has been freed. @maxnet There are some kernel heap validation features of varying performance impact that can be enabled if you are prepared to build your own kernel. Correction: "slub" debugging is a runtime option - try adding "slub_debug" to cmdline.txt. |
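A minimal sketch of that change, for anyone trying it: cmdline.txt is a single line, so the option is appended to the parameters already there. `<existing parameters>` is only a placeholder for whatever your file currently contains, not literal text.

```text
<existing parameters> slub_debug
```

With no value, `slub_debug` enables the full set of SLUB checks. The kernel's SLUB documentation also describes a form that selects individual checks (for example `slub_debug=FZPU`), but that variant is not something suggested in this thread.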
slub_debug did not give anything useful. But did notice that if for testing purposes I boot without modules (rename the /lib/modules folder), it does boot to X correctly. |
Two things I noticed.
If I raise SYNC_MSG_TIMEOUT to, say, 10, it does not time out.
|
I noticed a number of errors with the bcm2835_v4l2 module as well when trying to boot the 4.19.57 kernel, and had to roll back to 4.19.42 to get a working system. I didn't have kernel crashes, but I'm also using nfsboot and the lan78xx driver failed to keep the link working after the bcm2835_v4l2 module failure messages (could be unrelated of course). |
We did have a memory corruption issue with bcm2835_v4l2 and PORT_PARAMETER_GET previously. Life got a little tortuous as upstream applied a different patch to "fix" it, so it's possible it's still partially broken. I'm bemused that we get a timeout at 3 seconds - the VPU should be doing very little at that point. It is probing for the connected sensor, but that should take less time than that. |
That also only happens when the Pi is being booted using a PXE network boot BTW.
|
Just a note: Pis do not support PXE, at all. They can boot off of a network (I have four that do), but they don't use PXE and it's really unfortunate that the community has adopted this terminology. PXE is a specific protocol and configuration, it's not just another name for 'network boot' :-) |
If a device spits out DHCP packets that say "PXEClient" I call it a PXE boot. |
Indeed, a decent part of the blame falls on the RPi team which decided to use PXEClient as an identifier in the DHCP packets for a non-PXE device. In my network I don't test for it, and don't include it in responses back to the devices (and I don't even check for the magic vendor identifier either) and it all works fine. |
I have resolved my booting problems by blacklisting the |
@kpfleming That's kinda what raspberrypi/piserver#75 (comment) also says. |
In my case, the Pis were hanging on boot without displaying anything about a kernel panic, halt or anything else. Blacklisting these modules resolved the issue. Pi 2 & Pi 3, several different hardware versions. Raspbian 4.19.66-v7+. |
I wonder if #2528 has anything to do with this? |
I'm also affected, thought it had something to do with the bootcode at first but apparently it doesn't. |
Huh. Blacklisting those modules fixed it for me, too. Let me throw in some Google keywords to help others: "raise network interfaces", "Network Time Synchronization" (it sometimes froze with that task displayed), Raspberry Pi 3 B nfsboot NFS boot freeze hang broken. |
Thanks google, this thread took me a while to find! Brand new Buster image on RPI3 modelB using "PXE" boot / nfsroot fails, uname -a claims '4.19.57-v7+ #1244'. Blacklisting those two modules works for me as well. Mine got stuck in startup at "creating volatile files and directories" or when systemd starts "udev kernel device manager". I twice had a kernel panic, but this has not happened after I hooked up a serial console to try and capture the output. Thanks @kpfleming for the temporary workaround. |
+1 --- blacklisting bcm2835_codec and bcm2835_v4l2 fixed network boot for me. Huge kudos to @maxnet, @kpfleming, and anyone else who contributed here. This bug cost me a dozen hours or more, stopped dead in my tracks. The problem manifests itself in a variety of ways, most commonly hanging in Network Time Synchronization, but sometimes "raise network interfaces" as @esticle and @FeepingCreature note. I've attached the two files I created to blacklist these modules. Place them in /etc/modprobe.d/ and reboot. |
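For readers who cannot get at the attachments, a blacklist file of this kind would look roughly like the sketch below. The file name is an assumption (the attached files are not reproduced here; modprobe reads any `.conf` file under /etc/modprobe.d/), and the two module names are the ones given in the comment above.

```text
# /etc/modprobe.d/blacklist-bcm2835.conf  (hypothetical file name)
# Stop the camera/codec modules named in this thread from being auto-loaded.
blacklist bcm2835_codec
blacklist bcm2835_v4l2
```

On an NFS-root setup the file goes into the root filesystem that is exported to the Pi. Note that `blacklist` only prevents automatic loading; the modules can still be loaded by hand with modprobe, and camera support is unavailable while they are not loaded.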
I am wondering what it would take to get a contributor or someone from the @raspberrypi group to have a look at this? |
Another +1 on this issue, and the same workaround ( blacklisted bcm2835_v4l2 ). |
This is currently holding up piserver images. Should I apply the blacklist workaround or is a fix in the works? |
I've done a quick test on the old PiServer image I had around but using the latest rpi-update kernel and firmware, and it's booting fine, with or without camera. If you have a more recent image that I can test with, then I'm happy to do so. |
Ah, I do have I'll try to get a debugger working on it and see what is going on. |
Firmware patch created that seems to work for me. |
The depressing thing is that I am getting exactly this problem with 2 x brand new RPi4 netbooting from another RPi4 (following two official guides closely), but another two work fine. On the two which work, I got a kernel panic as described after pressing enter to continue, these two now work no problem and reboot fine. The other two just hang exactly as in the original description... The workarounds in the git error report don't work, not surprisingly, so I am now going to try booting from an SD card with just /boot on it which directs to the original nfs folder and see if that works. I suspect I have configured something wrong which the panic corrected, but I can't see anything obvious. |
@peter-chineham Netbooting on RPi4 is still in beta, so it may be more appropriate to report your problem at https://github.com/raspberrypi/rpi-eeprom/issues ? |
Could be a different problem then. You may also want to check whether it also happens when you PXE boot your Pi through piserver, as that uses the same "known good" rootfs for all Pis. |
While adding Raspberry support to the new LTSP, I too bumped into this issue, and it took me many days to find the appropriate bug reports and workarounds. I believe it's the same as #2335 which resurfaced, so I initially commented there; then I found this one. Fortunately with the workaround everything is fine, LTSP can now netboot rpi2, rpi3b+ and rpi4 with no issues other than camera support due to the blacklisting. |
Describe the bug
PXE/nfs booting Buster on a Pi 3+ does not quite work
It does boot the kernel and starts to execute the first systemd units, but the system then hangs a few seconds later at random moments (text cursor stops blinking, no activity, Num Lock does not respond).
This happens both with the kernel version shipped by default in the June 2019 Buster image and with the 4.19.57 kernel you get after running "apt-get upgrade".
In most cases the kernel hangs without outputting any message, but I have seen the following kernel panic as well.