[RAS] nebka is now running
Thomas Krichel
krichel at openlib.org
Sat Mar 15 02:15:44 EDT 2008
Christian Zimmermann writes
> Tim finally got it to work. I must have done something in the RAID
> configuration utility that erased the tables on sdb1.
Oh great.
> The current state of the system is: kernel 2.6, ext3 filesystem with
> dir_index feature, empty sdb1, boot and root on sda1.
>
> Note: Tim also added the dir_index feature to sda1. This allows better
> handling of large filesystems, but works only with 2.6.
>
> Tim is convinced, and I agree, that we do not have a hard drive problem.
> The problem is software related and has to do with the fact that there is
> an awful lot of disk I/O going on on this machine. We should assess all
> the rsyncs and such running and see whether they are necessary, and
> whether they are needed at the current frequency.
Yes, it is an I/O-related software issue. Linux kernels don't
handle hardware problems gracefully, but horribly. This also
applies to bad disks. To solve the issue, you either rewrite
the Linux kernel, or you get a new disk.
> We should also use the second drive to distribute the I/O load
> optimally across the two drives. Say, put only /home on sdb1, or only
> /home/aras.
>
> My sense is that this strategy would also be valid on raneb, snefru, etc., which
> seem to have disk emergencies more often than usual...
No. There is no space on them; the close to 1TB of disk
space on raneb and snefru is used up by backups. But there
is no backup of nebka because of your bedevilling of rsync.
Snefru has had no disk problem. Raneb had it, it was bad blocks,
changed disk, all clear. Chichek had it, it was bad blocks, changed
disk, all clear. Fafner had it, it was bad blocks, changed disk,
all clear. In the meantime, I keep backups.
The fact that we were able to do the entire rsync after marking
the bad blocks as bad demonstrates that when the system does
not hit the bad blocks, it works. When the next bad block comes along,
it will go belly up again. In my experience, the more bad blocks
you have, the more bad blocks you get.
> 1) Put the RePEc Author Service back online. We have recently had
> 15-40 new authors a day signing up. We do not want to discourage new
> users.
>
> 2) Think hard about how to optimize disk load.
>
> 3) Only then implement the new strategy.
The first priority should be a complete backup, daily. More
rsync, not less.
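
A daily complete backup along these lines could be sketched roughly as
follows. This is only an illustration, not the setup actually used on
these machines: the function names, the dated-directory layout and the
use of --link-dest (to hard-link files unchanged since the previous day)
are all assumptions.

```python
import datetime
import subprocess


def build_backup_command(source, dest_root, today, previous=None):
    """Return the rsync argument list for one dated snapshot.

    Hypothetical helper: the dated-directory layout is an assumption.
    """
    target = f"{dest_root}/{today.isoformat()}/"
    # -a preserves permissions, times and links; --delete drops files
    # from the snapshot that no longer exist in the source.
    cmd = ["rsync", "-a", "--delete"]
    if previous is not None:
        # Hard-link files unchanged since the previous snapshot, so a
        # complete daily copy costs little extra disk space.
        cmd.append(f"--link-dest={dest_root}/{previous.isoformat()}")
    cmd += [source, target]
    return cmd


def run_daily_backup(source, dest_root):
    """Run today's snapshot, linking against yesterday's (assumed layout)."""
    today = datetime.date.today()
    yesterday = today - datetime.timedelta(days=1)
    subprocess.run(
        build_backup_command(source, dest_root, today, yesterday),
        check=True,  # raise if rsync exits with an error
    )
```

Called once a day from cron, for example as
run_daily_backup("/home/", "/backup/nebka") with made-up paths, this
would keep one complete tree per day on the backup disk.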
Cheers,
Thomas Krichel http://openlib.org/home/krichel
RePEc:per:1965-06-05:thomas_krichel
phone: +7 383 330 6813 skype: thomaskrichel