
OpenBSD under heavy load corrupts fs and crashes?



I have this delicate problem that has been following me for the last year.

My only two OpenBSD servers are a 1.7 GHz Celeron with a "ServerWorks
CSB6 IDE" chipset and a 500 MHz PIII with an "Intel 82371AB IDE"
chipset. The problem has been the same all through 3.2 and 3.3 (3.3 is
where DMA on the CSB6 chipset became supported).
Both machines have two IDE disks that are equally heavily loaded with
disk access (postfix, imap, apache, nfs and scp); CPU and memory are
not a problem.
If I enable softdeps, the machines crash after a day or two, always with
errors about ffs not being able to allocate data or some ffs timeout.
With synchronous mounts the computers can run for several months without
showing the same behaviour. After rebooting I usually need to run a
manual fsck, and a file or two has been badly corrupted. Enabling
softdeps on even a single partition increases the chance of a failure.
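
To make the two setups concrete, this is roughly what the difference
looks like in /etc/fstab. The device names and mount points here are
illustrative only, not my exact layout, and I am assuming "synchronous"
means the sync mount option rather than just leaving softdep off:

```
# crashes after a day or two under load
/dev/wd0e /var ffs rw,softdep 1 2

# runs for months
/dev/wd0e /var ffs rw,sync 1 2
```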


The interesting information I got last time was:

Feb  2 02:01:52 meso named[7465]: ---w2k machines trying to update the nameserver all the time...----
Feb  2 02:12:53 meso /bsd: wd0(pciide0:0:0): timeout
Feb  2 02:12:54 meso /bsd:      type: ata
Feb  2 02:12:54 meso /bsd:      c_bcount: 8192
Feb  2 02:12:54 meso /bsd:      c_skip: 0
Feb  2 02:12:54 meso /bsd: pciide0:0:0: bus-master DMA error: missing interrupt, status=0x20


I know some people will suggest that I buy SCSI hardware, but that is
not an option.

These two machines are in production, so debugging is not that easy. I
have memory dumps from the crash, but how do I get the trace and ps info
out of them and into a file without halting the machines? I have not
found this in any FAQ that I know of!