WIP: Fix protection against faulty/slow clients (!18) · Merge Requests · nexedi / wendelin.core

WIP: Fix protection against faulty/slow clients

Good evening Kirill,

I started to reflect/work on the WCFS bug you talked about in recent mails.

This isn't intented to the be the final solution, but it's a WIP state that I would like to share in order to hear your opinion on my interpretation of it.

First of all I would like to talk about the test test_wcfs_pintimeout_kill which is currently expected to fail and which aims to test the bug. On a wider overview, I think we should add more tests to comprehensively test the given bug:

Generally, we are less interested in the question whether a faulty client get killed, but we care more about, whether the all sane clients get replies on time & can continue working even if one client is faulty. So I thought maybe we should add a test, where we simulate multiple clients, 1 sane and 1 faulty, and we check if the sane one still gets replies in time with multiple pin communication cycles.
Then, on the lower level of checking if a faulty client get killed, currently we only test this in the initial watch request (1 -> 2 -> 3). But this isn't even the main moment where the faulty client gets problematic, which is in readPinWatchers (1 -> 2) - the moment when we run through all watch links and wait for replies from all the clients until we continue. In order to test this moment, our simulated faulty client would need to reply in the initial handleWatch interaction and only stop replying during the first readPinWatchers call. I think it's important to also have a test which covers this case, because otherwise we could find a solution where the first case would pass (e.g. if the watch link is closed outside of pin and only in setupWatch), but the second wouldn't.

Regarding the current test implementation itself, I wonder how exactly it is expected to pass. In the documentation it says

wcfs takes approach to kill a slow client on 30 seconds timeout

but for me it's not so clear what "killing a slow client" means exactly. Should the server kill the client process in terms of sending SIGTERM or SIGKILL to it? I didn't knew how to implement this and I wasn't sure if this is the goal, as the client may want to react to the server closing the connection (for instance by retrying, maybe it was just too slow?). So I thought it may be sufficient if the server stops the connection to the client e.g. closing the WatchLink. The client realizes this, because it catches an error that the connection is closed (in the test, in the code). In the pinner thread it'd only catch this exception, once it'd try to send the reply to the master, so I adjusted it in this way in the pinner thread imitation in the test.

Likewise I tried to adjust the WCFS server: if a client times out, the server closes the connection to this client, but doesn't propagate this error further, so it doesn't bother readBlkData and the other sane clients can continue to work.

The WatchLink to the faulty client is stopped with the close call, because in the endless _serve loop of this WatchLink the ReadString call wakes up and exits the loop. With this exit, all deferred closing functions run, and the Watch of the WatchLink to the faulty client are removed from the wlinkTab, so they won't be bothering in the next readPinWatchers call the other clients anymore.

What do you think about this @kirr, is there a misunderstanding about something, would you choose a different approach?

Best, Levin

Good evening Kirill,

I started to reflect/work on the WCFS bug you talked about in recent mails.

This isn't intented to the be the final solution, but it's a WIP state that I would like to share in order to hear your opinion on my interpretation of it.

First of all I would like to talk about the test `test_wcfs_pintimeout_kill` which is currently expected to fail and which aims to test the bug.
On a wider overview, I think we should add more tests to comprehensively test the given bug:

1. Generally, we are less interested in the question whether a faulty client get killed, but we care more about, whether the all sane clients get replies on time & can continue working even if one client is faulty. So I thought maybe we should add a test, where we simulate multiple clients, 1 sane and 1 faulty, and we check if the sane one still gets replies in time with multiple pin communication cycles.

2. Then, on the lower level of checking if a faulty client get killed, currently we only test this in the initial watch request ([1](https://lab.nexedi.com/nexedi/wendelin.core/blob/38dde766/wcfs/wcfs.go#L1956-1969) -> [2](https://lab.nexedi.com/nexedi/wendelin.core/blob/38dde766/wcfs/wcfs.go#L1977) -> [3](https://lab.nexedi.com/nexedi/wendelin.core/blob/38dde766/wcfs/wcfs.go#L1795)). But this isn't even the main moment where the faulty client gets problematic, which is in `readPinWatchers` ([1](https://lab.nexedi.com/nexedi/wendelin.core/blob/38dde766/wcfs/wcfs.go#L1516) -> [2](https://lab.nexedi.com/nexedi/wendelin.core/blob/38dde766/wcfs/wcfs.go#L1586)) - the moment when we run through all watch links and wait for replies from all the clients until we continue. In order to test this moment, our simulated faulty client would need to reply in the initial `handleWatch` interaction and only stop replying during the first `readPinWatchers` call. I think it's important to also have a test which covers this case, because otherwise we could find a solution where the first case would pass (e.g. if the watch link is closed outside of `pin` and only in `setupWatch`), but the second wouldn't.

Regarding the current test implementation itself, I wonder how exactly it is expected to pass.
In the documentation it says

> [wcfs takes approach to kill a slow client on 30 seconds timeout](https://lab.nexedi.com/nexedi/wendelin.core/blob/38dde766/wcfs/wcfs.go#L207-208)

but for me it's not so clear what "killing a slow client" means exactly.
Should the server kill the client process in terms of sending `SIGTERM` or `SIGKILL` to it?
I didn't knew how to implement this and I wasn't sure if this is the goal, as the client may want to react to the server closing the connection (for instance by retrying, maybe it was just too slow?).
So I thought it may be sufficient if the server stops the connection to the client e.g. closing the `WatchLink`.
The client realizes this, because it catches an error that the connection is closed ([in the test](https://lab.nexedi.com/levin.zimmermann/wendelin.core/blob/b29735a46cfa59726881bec634f8489ea03ece5e/wcfs/wcfs_test.py#L1474-1477), [in the code](https://lab.nexedi.com/levin.zimmermann/wendelin.core/blob/10f7153a/wcfs/client/_wcfs.pyx#L206-207)).
In the pinner thread it'd only catch this exception, once [it'd try to send the reply to the master](https://lab.nexedi.com/levin.zimmermann/wendelin.core/blob/a36cdcc3/wcfs/client/wcfs.cpp#L546), so I adjusted it in this way in [the pinner thread imitation in the test](https://lab.nexedi.com/levin.zimmermann/wendelin.core/blob/b29735a46cfa59726881bec634f8489ea03ece5e/wcfs/wcfs_test.py#L1489-1492).

Likewise I tried to adjust the WCFS server: if [a client times out](https://lab.nexedi.com/levin.zimmermann/wendelin.core/blob/b29735a46cfa59726881bec634f8489ea03ece5e/wcfs/wcfs.go#L2044-2046), the server [closes the connection to this client](https://lab.nexedi.com/levin.zimmermann/wendelin.core/blob/b29735a46cfa59726881bec634f8489ea03ece5e/wcfs/wcfs.go#L1494-1495), but [doesn't propagate this error further](https://lab.nexedi.com/levin.zimmermann/wendelin.core/blob/b29735a46cfa59726881bec634f8489ea03ece5e/wcfs/wcfs.go#L1496), so [it doesn't bother `readBlkData`](https://lab.nexedi.com/levin.zimmermann/wendelin.core/blob/b29735a46cfa59726881bec634f8489ea03ece5e/wcfs/wcfs.go#L1281-1284) and the other sane clients can continue to work.

The `WatchLink` to the faulty client is stopped with the `close` call, because in the endless `_serve` loop of this `WatchLink` [the `ReadString` call wakes up and exits the loop](https://lab.nexedi.com/levin.zimmermann/wendelin.core/blob/b29735a46cfa59726881bec634f8489ea03ece5e/wcfs/wcfs.go#L1924-1926). With this exit, all deferred closing functions run, and the `Watch` of the `WatchLink` to the faulty client [are removed from the `wlinkTab`](https://lab.nexedi.com/levin.zimmermann/wendelin.core/blob/b29735a46cfa59726881bec634f8489ea03ece5e/wcfs/wcfs.go#L1883), so they won't be bothering [in the next `readPinWatchers` call](https://lab.nexedi.com/levin.zimmermann/wendelin.core/blob/b29735a46cfa59726881bec634f8489ea03ece5e/wcfs/wcfs.go#L1549) the other clients anymore.

What do you think about this @kirr, is there a misunderstanding about something, would you choose a different approach?

Best, Levin

WIP: Fix protection against faulty/slow clients

Check out, review, and merge locally

Revert this merge request

Cherry-pick this merge request