wcfs: tests: Extend faulty protection tests with more kinds of faulty clients

So far we tested only against a faulty client that reads the pin notification
correctly but does not reply to it. But there can be more problems:

1) a client does not read the pin notification at all
2) a client closes the watchlink abruptly after reading the pin notification
3) a client replies to the pin notification, but the reply is not "ack"

The first problem, if not handled, leads to the whole set of clients becoming
stuck on reading the same block as the faulty client. The other problems also
indicate breakage of the isolation protocol on the client side, which means
wcfs can no longer be sure that it provides good, uncorrupted data to the
client.

In the first case, similarly to the "no reply" situation, we need to kill the
client to make progress while also maintaining safety. In cases 2 and 3 we
cannot maintain safety if the faulty client remains in the set of live and
served clients, so it is also logical to send SIGBUS/SIGKILL to it.

Killing a client with SIGBUS is similar to how the OS kernel sends SIGBUS when
a memory-mapped file is accessed and loading the file data results in EIO. It
is also similar to wendelin.core 1, where SIGBUS is raised if loading a file
block results in an error.

Extend tests to cover all explained scenarios.

/reviewed-by @levin.zimmermann
/reviewed-on nexedi/wendelin.core!18
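The faulty-client cases described above can be summarized as a small decision table. The following is only an illustrative sketch, not code from wcfs: `PinOutcome`, `classify` and `must_kill` are hypothetical names, and the sketch assumes the server has already observed what the client did (for example, after a timeout expired). The "closed gently with bye" case is taken from the comment updated in wcfs_test.py, which says that such a close cancels blocked pin handlers without killing the client.

```python
from enum import Enum
from typing import Optional


class PinOutcome(Enum):
    """How a client reacted to one pin notification (hypothetical model)."""
    ACKED = "read the pin and replied ack"
    NOT_READ = "did not read the pin at all"
    NO_REPLY = "read the pin but never replied"
    CLOSED_ABRUPTLY = "read the pin, then closed the watchlink without bye"
    BAD_REPLY = "replied with something other than ack"
    CLOSED_GENTLY = "read the pin, then closed the watchlink with bye"


def classify(read_pin: bool, reply: Optional[str], closed: bool,
             sent_bye: bool = False) -> PinOutcome:
    """Map the observed client behaviour to one of the outcomes above."""
    if not read_pin:
        return PinOutcome.NOT_READ
    if closed:
        return PinOutcome.CLOSED_GENTLY if sent_bye else PinOutcome.CLOSED_ABRUPTLY
    if reply is None:
        return PinOutcome.NO_REPLY
    if reply != "ack":
        return PinOutcome.BAD_REPLY
    return PinOutcome.ACKED


def must_kill(outcome: PinOutcome) -> bool:
    """True when the isolation protocol was broken and the client is killed."""
    return outcome not in (PinOutcome.ACKED, PinOutcome.CLOSED_GENTLY)


if __name__ == "__main__":
    for outcome in PinOutcome:
        action = "kill (SIGBUS/SIGKILL)" if must_kill(outcome) else "keep serving"
        print("%-16s -> %s" % (outcome.name, action))
```

Every outcome except a proper "ack" or a gentle close ends with the client being killed, which matches the reasoning in the message: a stuck client blocks the others, and a client that breaks the protocol can no longer be guaranteed uncorrupted data.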

Commit c91fb14e authored Sep 16, 2024 by Kirill Smelkov Committed by Levin Zimmermann Sep 17, 2024

Showing with 144 additions and 41 deletions

wcfs/wcfs_faultyprot_test.py +141 -31

wcfs/wcfs_test.py +3 -10

--- a/wcfs/wcfs_faultyprot_test.py
+++ b/wcfs/wcfs_faultyprot_test.py
--- a/wcfs/wcfs_test.py
+++ b/wcfs/wcfs_test.py
@@ -1459,7 +1459,8 @@ def test_wcfs_watch_robust():
        "file not yet known to wcfs or is not a ZBigFile"
    wl.close()

-    # closeTX/bye cancels blocked pin handlers
+    # closeTX gently with "bye" cancels blocked pin handlers without killing client
+    # (closing abruptly is verified in wcfs_faultyprot_test.py)
    f = t.open(zf)
    f.assertBlk(2, 'c2')
    f.assertCache([0,0,1])
@@ -1467,23 +1468,15 @@ def test_wcfs_watch_robust():
    wl = t.openwatch()
    wg = sync.WorkGroup(timeout())
    def _(ctx):
-        # TODO clarify what wcfs should do if pin handler closes wlink TX:
-        #   - reply error + close, or
-        #   - just close
-        # t = when reviewing WatchLink.serve in wcfs.go
-        #assert wl.sendReq(ctx, b"watch %s @%s" % (h(zf._p_oid), h(at1))) == \
-        #        "error setup watch f<%s> @%s: " % (h(zf._p_oid), h(at1)) + \
-        #        "pin #%d @%s: context canceled" % (2, h(at1))
-        #with raises(error, match="unexpected EOF"):
        with raises(error, match="recvReply: link is down"):
            wl.sendReq(ctx, b"watch %s @%s" % (h(zf._p_oid), h(at1)))
-
    wg.go(_)
    def _(ctx):
        req = wl.recvReq(ctx)
        assert req is not None
        assert req.msg == b"pin %s #%d @%s" % (h(zf._p_oid), 2, h(at1))
        # don't reply to req - close instead
+        # NOTE this closes watchlink gently with first sending "bye" message
        wl.closeWrite()
    wg.go(_)
    wg.wait()