WIP: Fix protection against faulty/slow clients
Good evening Kirill,
I started to reflect/work on the WCFS bug you talked about in recent mails.
This isn't intented to the be the final solution, but it's a WIP state that I would like to share in order to hear your opinion on my interpretation of it.
First of all I would like to talk about the test test_wcfs_pintimeout_kill
which is currently expected to fail and which aims to test the bug.
On a wider overview, I think we should add more tests to comprehensively test the given bug:
-
Generally, we are less interested in the question whether a faulty client get killed, but we care more about, whether the all sane clients get replies on time & can continue working even if one client is faulty. So I thought maybe we should add a test, where we simulate multiple clients, 1 sane and 1 faulty, and we check if the sane one still gets replies in time with multiple pin communication cycles.
-
Then, on the lower level of checking if a faulty client get killed, currently we only test this in the initial watch request (1 -> 2 -> 3). But this isn't even the main moment where the faulty client gets problematic, which is in
readPinWatchers
(1 -> 2) - the moment when we run through all watch links and wait for replies from all the clients until we continue. In order to test this moment, our simulated faulty client would need to reply in the initialhandleWatch
interaction and only stop replying during the firstreadPinWatchers
call. I think it's important to also have a test which covers this case, because otherwise we could find a solution where the first case would pass (e.g. if the watch link is closed outside ofpin
and only insetupWatch
), but the second wouldn't.
Regarding the current test implementation itself, I wonder how exactly it is expected to pass. In the documentation it says
wcfs takes approach to kill a slow client on 30 seconds timeout
but for me it's not so clear what "killing a slow client" means exactly.
Should the server kill the client process in terms of sending SIGTERM
or SIGKILL
to it?
I didn't knew how to implement this and I wasn't sure if this is the goal, as the client may want to react to the server closing the connection (for instance by retrying, maybe it was just too slow?).
So I thought it may be sufficient if the server stops the connection to the client e.g. closing the WatchLink
.
The client realizes this, because it catches an error that the connection is closed (in the test, in the code).
In the pinner thread it'd only catch this exception, once it'd try to send the reply to the master, so I adjusted it in this way in the pinner thread imitation in the test.
Likewise I tried to adjust the WCFS server: if a client times out, the server closes the connection to this client, but doesn't propagate this error further, so it doesn't bother readBlkData
and the other sane clients can continue to work.
The WatchLink
to the faulty client is stopped with the close
call, because in the endless _serve
loop of this WatchLink
the ReadString
call wakes up and exits the loop. With this exit, all deferred closing functions run, and the Watch
of the WatchLink
to the faulty client are removed from the wlinkTab
, so they won't be bothering in the next readPinWatchers
call the other clients anymore.
What do you think about this @kirr, is there a misunderstanding about something, would you choose a different approach?
Best, Levin