As a software maintainer, you sometimes receive very peculiar bug reports, ranging from laughable to completely bizarre yet reproducible. Some of them are so strange that they deserve a bit of post-mortem pondering and reflection, like this one. But first, it’s important to remember that the user reports what he sees, a manifestation of the bug, and that can be very different from the real cause of the problem. Ready to go? The next paragraph shows the original bug report, and then you’ll see where the investigation led us.
Linux fails with a kernel panic on systems with a Celeron processor and 256MB of RAM when an Ext3 filesystem is created on a large partition and you set the number of blocks reserved to the superuser to 0 (mke2fs -j -m0). The problem doesn’t happen if you use a different processor, or in systems with 1GB of RAM.
This claim certainly raised some eyebrows, but instead of dismissing it as hallucination we set up a system according to the description and — surprise — it crashed. Lots of speculation ensued on what could be causing this crash. In-kernel memory corruption? DMA problems? Something unsupported on newest-generation Celerons? Certainly not bad memory modules, otherwise it wouldn’t be reproducible. How could the amount of system memory affect a crash in a simple procedure such as mkfs? What about the percentage of reserved blocks? And why a panic, instead of a simple application crash?
From the system’s point of view, mkfs does little more than write to the block device, so logic tells us that high-level details such as reserved blocks and mkfs itself should be irrelevant: just accessing or writing to the disk should give us the same results. A quick test with dd confirmed that plain write access is enough to trigger the crash. With some noise removed from the equation, we started to investigate the rest of the claims.
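The post does not record the exact dd invocation, so the following is only a minimal sketch of the same idea: push plain sequential writes at a target and force them out to the device, with no filesystem tooling involved. The function name `sequential_write`, the block size and the block count are assumptions made for illustration.

```python
import os

def sequential_write(path, block_size=1024 * 1024, count=16):
    """Write `count` zero-filled blocks of `block_size` bytes to `path`.

    Hypothetical stand-in for a plain dd run; returns the number of bytes written.
    """
    block = bytes(block_size)  # a buffer of zeros, similar in spirit to reading /dev/zero
    written = 0
    # O_CREAT lets the sketch run against a scratch file; a device node already exists.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        for _ in range(count):
            view = memoryview(block)
            # os.write may write fewer bytes than requested, so loop until the block is out.
            while view:
                n = os.write(fd, view)
                view = view[n:]
                written += n
        os.fsync(fd)  # ask the kernel to flush the data to the underlying device
    finally:
        os.close(fd)
    return written
```

Run against a scratch file, this is harmless. Pointing it at a real partition overwrites whatever is stored there, which is exactly what a reproduction on a test machine would want and nothing else would.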
To test the influence of the amount of memory, we set up the system with different combinations of memory modules. We reproduced the crash with 512MB on a single DIMM, but not with 2×256MB, nor with 2×512MB and system memory limited to 512MB. The plot thickens. Experimenting with different kernels and assuming the very likely scenario where the Celeron versus Core Duo problem is in fact a UP versus SMP problem, we rephrased the report as:
Linux 2.6.20 fails with a kernel panic on uniprocessor systems with one memory module installed when write I/O is performed to the disk. The problem doesn’t happen if you use kernels 2.6.17, 2.6.20-up or 2.6.22, a dual-core processor, or if the system has two memory modules installed.
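The narrowing step behind this rephrasing can be written down as a small table of observations and a check for which factor actually separates crashing runs from working ones. The sketch below uses only the three memory experiments described above; the field names and the helper `separating_factors` are made up for illustration and are not part of the original investigation.

```python
# The three memory configurations from the experiments above.
# Field names are hypothetical, chosen for this sketch.
observations = [
    {"total_mb": 512, "dimms": 1, "crashed": True},   # single 512MB DIMM
    {"total_mb": 512, "dimms": 2, "crashed": False},  # 2x256MB
    {"total_mb": 512, "dimms": 2, "crashed": False},  # 2x512MB, memory limited to 512MB
]

def separating_factors(observations, outcome="crashed"):
    """Return factors whose values never occur in both crashing and working runs."""
    factors = [key for key in observations[0] if key != outcome]
    result = []
    for factor in factors:
        bad = {o[factor] for o in observations if o[outcome]}
        good = {o[factor] for o in observations if not o[outcome]}
        # A factor separates the runs only if both groups exist and share no value.
        if bad and good and bad.isdisjoint(good):
            result.append(factor)
    return result

print(separating_factors(observations))  # prints ['dimms']
```

With the usable memory held at 512MB in every run, only the number of modules differs between the crash and the working cases, which is why the rephrased report talks about memory modules rather than megabytes.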
With the generalized bug symptom in hand, we went from empirical testing to analysis of kernel 2.6.20, starting with any local patches. Surely such a big problem in the upstream kernel would have been quickly noticed and reported, or would it? Investigation of local patches revealed nothing about SATA, SCSI, the block layer or SMP, but this kernel used one feature that became our main suspect: SMP-alternatives. Its hackish nature would explain strange symptoms in case of failure, it would be more mature in 2.6.22, and it didn’t exist in 2.6.17. The case was heading towards a neat ending until a last-minute twist: the same kernel, built on a different system, worked.
It turned out that the real origin of the bug was in the GNU toolchain, which generated bad code for the SMP-alternatives implementation of that specific kernel: code that works in most cases but fails catastrophically on the user’s setup. Pretty interesting, eh? What we can learn from this:
- The user reports what he actually sees, which can be a small part of the entire bug manifestation.
- The user always bumps into a corner case. If it can’t be reproduced, try with the exact configuration reported by the user.
- Investigate properly, don’t resort to shotgun debugging.
- If the bug is too strange, don’t trust the toolchain.
- And if you use binutils 2.16.91.0.7, don’t build kernel 2.6.20 with SMP-alternatives. It can crash on disk write.
This entire investigation took two days and was conducted mainly by Herton Krzesinski, who should have a blog to tell these things firsthand.