[RAS] badblocks

Tue Feb 26 23:02:18 EST 2008

  Christian Zimmermann writes

> I am back and stand ready to run to the server farm if necessary.

  With 11 hours time difference, I was in bed. I have been thinking
  a bit more.

  I remember that when I had a similar problem with raneb, there
  were only 12 or 40 bad blocks, but they caused the disk
  to crash. Now that the offending disk has been replaced, it's
  all quiet on the raneb front. I would therefore suggest that
  the troubles come from the bad blocks.

  The way I understand disks, decay is exponential.
  Most modern disks have some spare blocks that are
  hidden from the O/S. When bad blocks appear, the data is
  moved from the bad blocks to healthy spare blocks, in
  a way that is transparent to the O/S. When there are too
  many bad blocks, the O/S starts seeing them, and that's
  when Linux gets rather merciless: it does not take hardware
  issues lightly.
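
  To see how many blocks the disk has already remapped behind the
  O/S's back, one can look at the SMART attribute table. Below is a
  minimal sketch in Python, assuming smartmontools is installed and
  the disk reports the usual Reallocated_Sector_Ct attribute in the
  output of "smartctl -A". The sample table and its numbers are made
  up for illustration; they are not readings from our disk.

```python
# Illustrative excerpt in the layout printed by "smartctl -A".
# The values are invented, not taken from any real disk.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       3
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4521
"""


def reallocated_sectors(smart_table):
    """Return the raw Reallocated_Sector_Ct value, or None if absent."""
    for line in smart_table.splitlines():
        fields = line.split()
        # Attribute rows have ten columns; the raw value is the last one.
        if len(fields) >= 10 and fields[1] == "Reallocated_Sector_Ct":
            return int(fields[9])
    return None


if __name__ == "__main__":
    print(reallocated_sectors(SAMPLE))
```

  On the real machine the table would come from running
  "smartctl -A /dev/sda" as root rather than from a sample string.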

  So even with 3 bad blocks, we need to get rid of the disk;
  software updates will not help.

  e2fsck has a -c option that runs a read-only scan for bad
  blocks and records any it finds in the file system's bad block
  list, so that the file system no longer uses them. If we run
  this at startup, with the root file system mounted read-only,
  it should mark the bad blocks. If we then immediately (before
  further bad blocks appear) rsync the files from sda to sdb,
  make sdb bootable, and swap disks to boot from sdb, we should
  be fine. I did such an operation locally and can give further
  instructions if you agree with the general course of
  action.
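
  As a rough sketch of that sequence, here is a small Python script
  that only prints the commands in order and executes nothing. The
  partition names /dev/sda1 and /dev/sdb1 and the mount point
  /mnt/sdb are placeholders I am assuming for illustration, not the
  actual layout of the machine. Making sdb bootable depends on the
  boot loader in use and is deliberately left out.

```python
def migration_plan(src_part="/dev/sda1", dst_part="/dev/sdb1",
                   mount_point="/mnt/sdb"):
    """Return the commands, in order, as argument lists.

    Nothing is executed. All device names and the mount point are
    hypothetical defaults and must be adapted to the real machine.
    """
    return [
        # Read-only bad block scan; blocks found bad are added to the
        # file system's bad block list so they are no longer used.
        ["e2fsck", "-c", src_part],
        # Mount the (already formatted) partition on the new disk.
        ["mount", dst_part, mount_point],
        # Copy the root file system: -a archive mode, -H keep hard
        # links, -x stay on one file system (skips /proc, /sys and
        # the target mount itself).
        ["rsync", "-aHx", "/", mount_point + "/"],
    ]


if __name__ == "__main__":
    for cmd in migration_plan():
        print(" ".join(cmd))
```

  Printing the plan first gives us a chance to agree on the exact
  devices before anything touches the disks.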

  Cheers,

  Thomas Krichel                    http://openlib.org/home/krichel
                                RePEc:per:1965-06-05:thomas_krichel
  phone: +7 383 330 6813                       skype: thomaskrichel