[OLUG] Isolating flaky hardware problems
Matt Arnold
marnold at novia.net
Sat Feb 12 02:38:01 UTC 2000
I had some intermittent problems in one Linux machine. Max uptime was never
more than 7 days. Problems manifested themselves via kernel panics,
segmentation faults, lock ups, and signal 11 errors. I was quite disturbed
because the problems were random, unpredictable, and unreproducible.
I was, however, able to reproduce one of the symptoms reliably. I always got
a signal 11 error whenever I tried compiling a kernel. The really interesting
thing was that I got the error during different parts of the compile -- it
never failed in the same place twice. Computers are supposed to behave
consistently. Hmmmm.
I stumbled upon the following web page: http://www.bitwizard.nl/sig11/ and
began to suspect that I had memory problems.
I moved some RAM around and used gcc to stress the remaining RAM by doing
more kernel compiles. After trying various RAM configurations, I could
authoritatively say that I had identified a faulty SIMM -- with the SIMM in
the machine, I got reproducible signal 11 errors; with the SIMM gone, the
machine was rock-solid.
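
The repeat-and-compare test described above can be sketched as a small
script. This is only an illustration, assuming Python 3.7 or later; it is not
the procedure actually used here, and the names run_repeatedly and classify
are made up for the example. It runs the same command several times and
reports whether the failures land in the same place every time (which points
at software) or wander around (which makes hardware such as RAM a suspect).

```python
import subprocess


def run_repeatedly(cmd, runs=5):
    """Run cmd `runs` times; return a list of (exit code, last output line)."""
    results = []
    for _ in range(runs):
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # fold stderr into stdout
            text=True,
        )
        lines = proc.stdout.strip().splitlines()
        results.append((proc.returncode, lines[-1] if lines else ""))
    return results


def classify(results):
    """Summarize whether failures are absent, identical, or varying."""
    failures = [r for r in results if r[0] != 0]
    if not failures:
        return "no failures"
    # Every run failed, and all with the same exit code and final message.
    if len(failures) == len(results) and len(set(failures)) == 1:
        return "consistent failure"
    # Some runs passed, or the failure point moved between runs.
    return "inconsistent failures"
```

For a kernel build, something like classify(run_repeatedly(["make"], runs=3))
would be run from the source tree, with a "make clean" between runs being a
sensible (assumed) addition so that each run compiles the same files.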
After I removed the bad SIMM, the machine enjoyed an uptime of over 120
days. So what brought the machine down after 120 days? A moron coworker
turned the machine off despite the big "Do Not Turn Me Off" sign taped over
the power button. *sigh* No Linux machine can resist that type of attack.
M@
P.S. When building kernels, you can pass make the -j option to run multiple
jobs in parallel (for example, make -j4). This makes it easy to spawn enough
cc processes to consume all available RAM.
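
The -j option of make sets how many jobs run at once. A minimal sketch of
building such a command line, assuming Python 3; the helper name
parallel_make_cmd and the default of two jobs per CPU are illustrative
assumptions, and the job count needed to fill RAM depends on the machine.

```python
import os


def parallel_make_cmd(jobs=None, target=None):
    """Build a make command line that runs `jobs` jobs at once via -j."""
    if jobs is None:
        # Assumption: default to twice the CPU count; raise it to use more RAM.
        jobs = 2 * (os.cpu_count() or 1)
    cmd = ["make", "-j{}".format(jobs)]
    if target:
        cmd.append(target)
    return cmd
```

The resulting list can be passed to subprocess.run() from inside the kernel
source tree, for example subprocess.run(parallel_make_cmd(8, "bzImage")).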