drivers/char/Kconfig · 5dd7d1b6ad9e81c8376b0c95c73dd14a254d81f9 · nexedi / linux

Andrew Morton authored Feb 03, 2003

Patch from: Joel Becker <Joel.Becker@oracle.com>

This kernel module detects long periods during which jiffies has failed to
increment, and reboots the machine in response.
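
The check described above can be illustrated with a minimal userspace sketch.
It is not the module's code. It assumes the detector runs on a periodic
timer and compares readings from a clock that keeps running while jiffies is
stalled. The names hang_detected, TICK_SECS and MARGIN_SECS and their values
are hypothetical, chosen only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical parameters, in seconds; not taken from the patch. */
#define TICK_SECS   180  /* how often the check is scheduled to run */
#define MARGIN_SECS  60  /* extra delay tolerated before declaring a hang */

/*
 * Returns true when an independent clock shows that more time passed
 * between two consecutive checks than the schedule plus margin allows.
 * last_secs and now_secs are readings of a clock that is assumed to keep
 * advancing while the tick counter is frozen.
 */
static bool hang_detected(uint64_t last_secs, uint64_t now_secs)
{
    uint64_t elapsed = now_secs - last_secs;

    return elapsed > (uint64_t)TICK_SECS + MARGIN_SECS;
}
```

With these assumed values, a check that runs 181 seconds after the previous
one is treated as normal, while one that runs 270 seconds later counts as a
hang. A real detector would respond to a hang by rebooting the machine
rather than returning a flag.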

Joel says:

"Here's why Oracle wants such a thing. We run clusters. Imagine a two node
cluster. Node1 pauses completely for some reason. There are multiple
reasons this can happen. A bad driver can udelay() for 90 seconds (qla used
to do this). zVM on S/390 can page Linux out for minutes at a time.
Anything that causes the box to freeze. Jiffies does *not* count during
this, so when Node1 returns it feels that no time has passed.

Node2, however, has been counting time. When Node1 goes away, the Oracle
cluster manager starts looking for it. After a timeout, it gives up. It
then recovers any in-progress transactions from Node1. After that, it
starts new operations, modifying the data in ways that Node1 has no idea
about (it's still out to lunch).

When Node1 finally returns (udelay() ends, zVM pages it in, whatever), any
I/O that it has queued or is about to queue will get sent to the disk.
Oops, you've just corrupted your shared data.

hangcheck-timer should catch this and reboot the box.

This is why Oracle wants this driver. We figure that such functionality
would be beneficial to others as well, so we posted to l-k. We'd all hope
that driver writers don't udelay() for 90s, but S/390 with zVM is still
around. Some folks might want to notice when it happens. I am sure other
things exist that trigger the same symptoms."
