[RAS] badblocks

Wed Feb 27 13:08:46 EST 2008

  Bob Parks writes

> Yes, IMHO.  As Christian wrote earlier about nebka, there are limits to 
> directory sizes.  He seemed to indicate that a cron job
> with du might have been the entire problem.  We have had similar problems 
> in the past. 

  my theory: du puts stress on the disk, it hits the bad block, and bang! 

> There are bad blocks on every disk.  Bad blocks, unless a large number, do 
> not show that the 'disk' is failing. And again, this is a mirrored disk, 
> two disks, in RAID 1, with a hardware controller.  Now that I think on it,
> it is not clear what badblocks on what disk are being reported by the 
> Adaptec controller -

  my theory: the controller presents the mirror to the o/s as one disk. 

> Note that nearly identical hardware exists on Bill's RFE machine and never 
> an error.  You have had problems
> on nebka, and snefru (identical hardware) and raneb (very different 
> hardware).  That alone leads me to suspect
> software.

  I don't remember a problem on snefru. The common file sets are
  the adrepec files (common to raneb, sahure, fafner, nebka, 
  mutabor) and the citec files (common to mutabor, raneb,
  snefru, sahure, fafner). (Yes, I back up!) 

  What I think is what's written in section 27.2.4, "badblocks and
  e2fsck", of 

http://eduunix.ccut.edu.cn/index/html/linux/OReilly.LPI.Linux.Certification.in.a.Nutshell.2nd.Edition.Jul.2006/0596005288/lpicertnut2-CHP-27-SECT-2.html

  They say 

When a disk is failing, it will usually get an exponential increase in
bad blocks, and after a short while it will run out of spare blocks,
whereupon you will get into trouble with your filesystems on that
disk.

  The disk has already run out of spare blocks; that's why some
  bad blocks show up to the o/s. 
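
  To make the reasoning concrete, here is a rough sketch of it in
  Python. It is a toy model only, not a description of how any real
  drive or the Adaptec controller works: the function name, the
  spare pool size and the doubling rate are all made-up assumptions.
  It just shows that with a fixed spare pool and a growing number of
  failures, nothing reaches the o/s at first, and then a lot does.

```python
def simulate(spare_blocks, initial_failures=1, growth=2, steps=8):
    """Toy model: block failures grow geometrically per time step.

    All numbers are hypothetical. Failed blocks are remapped to spares
    until the spare pool is used up; any failures beyond that are the
    ones the o/s would see.
    Returns a list of (step, total_failed, visible_to_os) tuples.
    """
    history = []
    total = 0
    new = initial_failures
    for step in range(steps):
        total += new
        # Blocks beyond the spare pool can no longer be hidden.
        visible = max(0, total - spare_blocks)
        history.append((step, total, visible))
        new *= growth
    return history


if __name__ == "__main__":
    for step, total, visible in simulate(spare_blocks=10):
        print(step, total, visible)
```

  With 10 hypothetical spares, the first three steps show no bad
  blocks to the o/s, and the fourth already shows five.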

  Cheers,

  Thomas Krichel                    http://openlib.org/home/krichel
                                RePEc:per:1965-06-05:thomas_krichel
  phone: +7 383 330 6813                       skype: thomaskrichel