[RAS] nebka problems

Christian Zimmermann christian.zimmermann at uconn.edu
Sun Jan 20 09:15:08 EST 2008


I have looked a little into what the problem with nebka may be. Here is 
what we know:

- We had ext3 errors that caused some sort of panic during the night from 
Thursday to Friday.
- A reboot fixed it, and the machine looked fine.
- nebka went down again, approximately 24 hours after the first crash.
- Thomas was doing a complete backup of the machine with rsync at the 
time. He did not get as far as the original data of the machine, the aras 
account.
- We had a similar set of crashes in June 2006, which were diagnosed as an 
issue with a directory in CitEc that had too many files. At the time, I 
wrote:

According to http://en.wikipedia.org/wiki/Ext3, the maximum number of 
files a directory can have is V*2^(-13), where V is the size of the volume 
in blocks. On raneb, this would be 56335 (V=461494280). On nebka, this is 
8551 (V=70057172). This would mean we are still in trouble for both (we 
have 12000 NBER WPs). I hope I am misunderstanding.
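
As a rough check of the arithmetic in that estimate, here is a minimal 
sketch that takes the formula V*2^(-13) at face value. The function name 
is made up for illustration, and rounding down is an assumption: it gives 
56334 for raneb, one less than the figure quoted above, and 8551 for nebka.

```python
# Sketch only: evaluates the quoted estimate V * 2**-13, where V is the
# volume size in blocks. Whether this really is the per-directory limit
# on ext3 is exactly what the message above is unsure about.

def estimated_file_limit(volume_blocks: int) -> int:
    """Return V / 2**13, rounded down (the rounding is an assumption)."""
    return volume_blocks >> 13  # integer division by 8192


# Volume sizes in blocks, as given in the message.
volumes = {"raneb": 461494280, "nebka": 70057172}

for host, blocks in volumes.items():
    print(f"{host}: V={blocks}, estimated limit={estimated_file_limit(blocks)}")
```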

So I investigated on raneb to see whether we have any overfull directories 
that may get mirrored to nebka. I found in the adrepec account
~/ftp/CitEc/nbr/nberwo
~/ftp/opt/CitEc/nbr/nberwo
which each have 10630 files. So if my forecast from 18 months ago is 
correct, we have the same problem as before, but in a subdirectory this 
time.
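
A search like this one can be sketched in a few lines of Python. This is 
not the method actually used on raneb, just one way to list directories 
whose entry count exceeds a chosen threshold; the function name and the 
example threshold (the 8551 figure estimated for nebka) are assumptions.

```python
import os


def overfull_directories(root: str, limit: int):
    """Yield (path, entry_count) for every directory under root that
    holds more than `limit` entries (files plus subdirectories)."""
    for dirpath, dirnames, filenames in os.walk(root):
        count = len(dirnames) + len(filenames)
        if count > limit:
            yield dirpath, count


# Hypothetical usage: scan an account's ftp tree against the nebka estimate.
# for path, count in overfull_directories(os.path.expanduser("~/ftp"), 8551):
#     print(count, path)
```

Counting entries this way only reads directory listings, so it should be 
safe to run on the source machine before anything is mirrored.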

If this is correct, the solution, I think, is to have a larger volume. It 
turns out we have one for this machine; Bob sent it two months ago. We had 
to divert it to the machine running IDEAS because of a more serious HD 
problem. We now have a new machine for IDEAS; we just need to configure it 
and transfer content, and then the drive could be reallocated to nebka. I 
would just need Tim to get started on the new machine before I am back in 
Connecticut (January 28).

Does this make sense? In the immediate term, we would need to reboot the 
machine Monday, comment out all crontab jobs, investigate the true origin 
of the problem (we found it last time by looking at problematic inodes 
with fsck), and only then try to back up (only the aras account, in 
particular the userdata directory).

I will be on a train back to Paris again while the machine is probably 
being brought back up (Monday, 10am-3pm EST), but I will check in as soon 
as possible once back in Paris.



Christian Zimmermann                                     FIGUGEGL!
Department of Economics
University of Connecticut
341 Mansfield Road, Unit 1063
Storrs, CT 06269-1063
http://ideas.repec.org/zimm/   christian.zimmermann at uconn.edu
http://ideas.repec.org/e/pzi1.html


