Discussion:
Domino 5.0.12 sporadically cuts off all clients - but reads a whole 4Gig database internally
a***@bankaktiengesellschaft.de
2004-11-02 16:18:12 UTC
Dear audience,


Lately some of our Domino servers have shown some weird behavior.

All clients lost their connection all of a sudden. This is also reflected in
the server log, where the transaction count dropped to zero within a few
minutes:

02.11.2004 02:57:47 PM 352 Transactions/Minute, 39 Users
02.11.2004 02:58:52 PM 315 Transactions/Minute, 38 Users
02.11.2004 02:59:52 PM 83 Transactions/Minute, 39 Users
02.11.2004 03:00:53 PM 6 Transactions/Minute, 38 Users
02.11.2004 03:01:53 PM 0 Transactions/Minute, 38 Users
02.11.2004 03:02:53 PM 0 Transactions/Minute, 38 Users
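
To spot such stalls without reading the log by eye, something like the following could be used. It is only a sketch: it assumes the log lines look exactly like the excerpt above, and the function name `find_stalls` is made up for illustration.

```python
import re

# Matches lines such as:
# 02.11.2004 03:01:53 PM 0 Transactions/Minute, 38 Users
LINE = re.compile(
    r"(\d{2}\.\d{2}\.\d{4}) (\d{2}:\d{2}:\d{2} [AP]M)\s+"
    r"(\d+) Transactions/Minute, (\d+) Users"
)

def find_stalls(log_text):
    """Return (date, time, users) for each minute with 0 transactions
    while users are still connected."""
    stalls = []
    for line in log_text.splitlines():
        m = LINE.search(line)
        if not m:
            continue  # not a transactions/minute line
        transactions = int(m.group(3))
        users = int(m.group(4))
        if transactions == 0 and users > 0:
            stalls.append((m.group(1), m.group(2), users))
    return stalls

# The excerpt from the server log quoted above.
SAMPLE = """\
02.11.2004 02:57:47 PM 352 Transactions/Minute, 39 Users
02.11.2004 02:58:52 PM 315 Transactions/Minute, 38 Users
02.11.2004 02:59:52 PM 83 Transactions/Minute, 39 Users
02.11.2004 03:00:53 PM 6 Transactions/Minute, 38 Users
02.11.2004 03:01:53 PM 0 Transactions/Minute, 38 Users
02.11.2004 03:02:53 PM 0 Transactions/Minute, 38 Users
"""

if __name__ == "__main__":
    for date, time, users in find_stalls(SAMPLE):
        print(f"{date} {time}: no transactions, {users} users connected")
```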


When we monitored the server tasks using "top" we found that one
server task process was unusually busy.

We 'strace'd this server task and learned that it reads chunks of
30 to 32 KB (30720 or 32768 bytes) into memory.

_llseek(80, 1499918336, [1499918336], SEEK_SET) = 0
read(80, "\2B\225>\0\0\0\200\0\0\305\177\1\0\346n%\301\0\0\0\0\0"...,
32768) = 32768
_llseek(80, 1611925504, [1611925504], SEEK_SET) = 0
read(80, "\2B\267?\0\0\0\370\0\0\3;V\0\30o%\301\0\0\0\0\0\0\0\0\0"...,
32768) = 32768
read(80, "\0\0\0\1\223\31\f\0\v\0\0\1\224\31\4\0\2\0\0\1\226\31\f"...,
30720) = 30720
_llseek(80, 273602560, [273602560], SEEK_SET) = 0
read(80, "\2Bi\16\0\0\0\370\0\0\267-7\0%o%\301\0\0\0\0\0\0\0\0\0"...,
32768) = 32768
read(80, " "..., 30720) = 30720
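
The seek offsets and read sizes can be pulled out of such a trace with a few lines of Python. This is a rough sketch that assumes the strace output is formatted as in the excerpt above, including the wrapped `read` lines; `summarize` is a made-up helper name. In the excerpt the offsets jump back and forth through the file rather than increasing steadily.

```python
import re
from collections import Counter

# _llseek(fd, offset, [result], SEEK_SET) = 0
SEEK = re.compile(r"_llseek\((\d+),\s*(\d+),")
# read(fd, "buffer"..., count) = bytes_read   (may wrap onto a second line)
READ = re.compile(r"\bread\((\d+),.*?,\s*(\d+)\)\s*=\s*(\d+)", re.DOTALL)

def summarize(trace, fd):
    """Return (seek offsets, Counter of bytes returned per read) for one descriptor."""
    offsets = [int(m.group(2)) for m in SEEK.finditer(trace)
               if int(m.group(1)) == fd]
    sizes = Counter(int(m.group(3)) for m in READ.finditer(trace)
                    if int(m.group(1)) == fd)
    return offsets, sizes

# First calls of the trace quoted above, copied as-is.
SAMPLE_TRACE = r'''
_llseek(80, 1499918336, [1499918336], SEEK_SET) = 0
read(80, "\2B\225>\0\0\0\200\0\0\305\177\1\0\346n%\301\0\0\0\0\0"...,
32768) = 32768
_llseek(80, 1611925504, [1611925504], SEEK_SET) = 0
read(80, "\2B\267?\0\0\0\370\0\0\3;V\0\30o%\301\0\0\0\0\0\0\0\0\0"...,
32768) = 32768
read(80, "\0\0\0\1\223\31\f\0\v\0\0\1\224\31\4\0\2\0\0\1\226\31\f"...,
30720) = 30720
'''

if __name__ == "__main__":
    offsets, sizes = summarize(SAMPLE_TRACE, 80)
    print("seek offsets:", offsets)
    print("read sizes:", dict(sizes))
    print("total bytes read:", sum(size * n for size, n in sizes.items()))
```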

This continuous reading goes on for about half an hour, during which no
user is able to work with any database on this server.

At the end of the interruption this server task writes a few bytes into the
database, and all clients are able to use this server again.

semctl(557073, 13, SETVAL, 0xa91fedd4) = 0
semctl(557073, 13, SETVAL, 0xa91ff4c0) = 0
semctl(557073, 13, SETVAL, 0xa91ff85c) = 0
semctl(557073, 13, SETVAL, 0xa91ff864) = 0
semctl(557073, 13, SETVAL, 0xa91ff84c) = 0
semctl(557073, 13, SETVAL, 0xa91ff85c) = 0
semctl(557073, 13, SETVAL, 0xa91ff864) = 0
semctl(589842, 3, SETVAL, 0xa91ff894) = 0
semop(589842, 0xa91ff900, 1) = 0
semctl(557073, 11, SETVAL, 0xa91ff8b0) = 0
semctl(294921, 4, SETVAL, 0xa91ff910) = 0
gettimeofday({1099405615, 63840}, {4294967176, 0}) = 0
gettimeofday({1099405615, 64134}, {4294967176, 0}) = 0
semctl(589842, 0, SETVAL, 0xa91ff7fc) = 0
semop(589842, 0xa91ff868, 1) = 0

semctl(589842, 0, SETVAL, 0xa91ff7fc) = 0
semop(589842, 0xa91ff868, 1) = 0
_llseek(80, 776257134, [776257134], SEEK_SET) = 0
write(80, "%^O\***@o%\301D\0J\0CN=SERVER/O=OUName"..., 44) = 44

Now for the *really* weird part: Using "lsof" we tried to
look up the filename for descriptor 80, which this process reads from.

But lsof lists no file descriptor 80u at all:

server 7697 notes 79u REG 8,6 668672 980 /var/lotus/doclbm50.ntf
server 7697 notes 81u REG 8,6 13631488 2369 /var/lotus/THB1/T_WV.NSF
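
As a cross-check independent of lsof, the descriptor could be resolved through /proc. The sketch below assumes a Linux /proc filesystem, enough privileges to read the other process's fd directory, and that the PID passed in is the one that was strace'd; `fd_target` is a hypothetical helper name.

```python
import os

def fd_target(pid, fd):
    """Return the path that /proc/<pid>/fd/<fd> points to,
    or None if the link cannot be read (no such fd, no permission, no /proc)."""
    try:
        return os.readlink(f"/proc/{pid}/fd/{fd}")
    except OSError:
        return None

if __name__ == "__main__":
    # 7697 is the PID from the lsof output above, 80 the descriptor from strace.
    print(fd_target(7697, 80))
```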


We found the database anyway: an "fstat" call in the strace output logs the
file size, so we looked for the database with exactly that size:

fstat64(80, {st_mode=S_IFREG|0600, st_size=4421320704, ...}) = 0
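
The search by size can be scripted. This is a minimal sketch, assuming the databases live somewhere below the /var/lotus directory seen in the lsof output; `find_by_size` is a made-up helper name.

```python
import os
import stat

def find_by_size(root, size):
    """Return all regular files below root whose size in bytes equals size."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # file vanished or is not accessible
            if stat.S_ISREG(st.st_mode) and st.st_size == size:
                matches.append(path)
    return sorted(matches)

if __name__ == "__main__":
    # st_size taken from the fstat64 line above.
    for path in find_by_size("/var/lotus", 4421320704):
        print(path)
```

If several databases happen to have the same size, the list contains all of them and has to be narrowed down by other means.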


Of course we checked whether a "compact", "update" or "updall" is causing
these hiccups, but not even "update" had been able to open the database in
question.

My question now is: Has anybody experienced this kind of behavior
on a Domino server? Does anybody know a remedy for it?


Currently this server runs on SuSE Linux 8.0 (latest patches), and the
uptime and stability show that there is nothing wrong with the hardware.

XXXX:~ # w
4:18pm up 183 days, 11:44, 1 user, load average: 1.24, 1.38, 1.10


Thank you in advance for your expertise and best regards


Kind regards

Thomas Antepoth

- ORGA Department -
BAG Bankaktiengesellschaft

Tel: 02385 942-168
Fax: 02385 942-293
***@bankaktiengesellschaft.de

To become legally binding, this email requires written confirmation or the
sending of the original letters or original documents.
Sascha Siekmann
2004-11-03 15:25:48 UTC
Post by a***@bankaktiengesellschaft.de
Dear audience,
...

Check if you can still create a replica of this database and then just
move the corrupt one out of the directory. It may also be a full text
index that has gone haywire. Also, this might be a file system problem.
IIRC 8.0 had ext2 which is prone to corruption with large files. You may
want to upgrade to Enterprise Server 9.0 and feed some mouths in
Nuremberg ;-)
--
Kind regards,

Sascha Siekmann
http://spam-exitus.de

Your professional messaging service provider. No viruses, no worms, hardly any spam.


siekmann network consulting
Sonnenstraße 25
80331 München

Phone: 0700-spamexit (0700-77263948)