[OLUG] Isolating flaky hardware problems

Matt Arnold marnold at novia.net
Sat Feb 12 02:38:01 UTC 2000


I had some intermittent problems on one Linux machine.  Max uptime was never
more than 7 days.  The problems showed up as kernel panics,
segmentation faults, lockups, and signal 11 errors.  I was quite disturbed
because the problems were random, unpredictable, and unreproducible.

I was, however, able to reproduce one of the symptoms reliably.  I always got
a signal 11 error whenever I tried compiling a kernel.  The really interesting
thing was that I got the error during different parts of the compile -- it
never failed in the same place twice.  Computers are supposed to behave
consistently.  Hmmmm.

I stumbled upon the following web page: http://www.bitwizard.nl/sig11/  I
began to suspect I had memory problems.

I moved some RAM around and used gcc to stress the remaining RAM by doing
more kernel compiles.  After trying various RAM configurations, I could
authoritatively say that I had identified a faulty SIMM -- with the SIMM in the
machine, I got reproducible signal 11 errors; with the SIMM gone, the
machine was rock-solid.
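
The repeated-compile test above can be sketched as a small shell function.
This is only a sketch under assumptions: `stress_runs` is a hypothetical
helper name (not something from the original setup), and the build command is
whatever you pass in.  It runs the command several times, keeps one log per
run, and counts the failed runs.  Failures that come and go between identical
runs, or that stop at a different point in each log, hint at hardware rather
than software.

```shell
#!/bin/sh
# stress_runs N CMD...: run CMD N times and report how many runs failed.
# Hypothetical helper for illustration; logs go to $TMPDIR (default /tmp).
stress_runs() {
    runs=$1
    shift
    logdir=${TMPDIR:-/tmp}
    failures=0
    i=1
    while [ "$i" -le "$runs" ]; do
        # One log per run, so the point of failure can be compared afterwards.
        if "$@" > "$logdir/stress-run-$i.log" 2>&1; then
            echo "run $i: ok"
        else
            echo "run $i: FAILED (exit status $?)"
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    echo "failures: $failures/$runs"
}
```

From a configured kernel source tree, something like
`stress_runs 5 sh -c 'make clean && make'` would repeat a full build five
times.  Comparing the tails of the per-run logs then shows whether the build
died in the same place each time.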

After I removed the bad SIMM, the machine enjoyed an uptime of over 120
days.  So what brought the machine down after 120 days?  A moron coworker
turned the machine off despite the big "Do Not Turn Me Off" sign taped over
the power button.  *sigh*  No Linux machine can resist that type of attack.

M@

P.S. When building kernels, there's a parameter you can pass to make to spawn
multiple processes.  This will allow you to easily spawn enough cc's to
consume all available RAM.  I think this was the -j option, but I can't
remember exactly.
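
GNU make's `-j` (`--jobs`) option is the one that runs several jobs in
parallel.  As a rough sketch of picking a job count large enough to touch most
of the RAM, the helpers below divide the installed memory by a guessed
per-compiler footprint.  Both function names are hypothetical, and the
per-job figure is an assumption you have to supply yourself; nothing here
measures what cc actually uses.

```shell
#!/bin/sh
# jobs_to_fill_ram TOTAL_MB PER_JOB_MB: number of parallel jobs needed to
# occupy roughly TOTAL_MB of RAM, given a guessed footprint per job.
# Hypothetical helper for illustration.
jobs_to_fill_ram() {
    total_mb=$1
    per_job_mb=$2
    jobs=$((total_mb / per_job_mb))
    # Always run at least one job.
    if [ "$jobs" -lt 1 ]; then
        jobs=1
    fi
    echo "$jobs"
}

# total_ram_mb: installed RAM in MB, read from /proc/meminfo (Linux only).
total_ram_mb() {
    awk '/^MemTotal:/ { print int($2 / 1024) }' /proc/meminfo
}
```

With these defined, `make -j "$(jobs_to_fill_ram "$(total_ram_mb)" 64)"`
would start enough compiler processes to cover the RAM if each one really
used about 64 MB, which is a guess.  Too high a count mostly sends the
machine into swap, so it is worth starting low and working upward.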


-------------------------------------------------------------------------
Sent by OLUG Mailing list Manager, run by ezmlm.  http://olug.bstc.net/ 
To unsubscribe: `echo unsubscribe | mail olug-unsubscribe at bstc.net` 


