time: Redo timers properly
Background: in 2019 in 9c260fde (time: New package that mirrors Go's time) and b073f6df (time: Move/Port timers to C++/Pyx nogil) I've added basic timers - with proper API but with very dumb implementation that was spawning one thread per each timer. There were just a few timers in the users and this was working, surprisingly, relatively ok... ... until 2023 where I was working on XLTE that needs to organize 100Hz polling of Amarisoft eNodeB service to retrieve information about flows on Data Radio Bearers: xlte@2a016d48 https://lab.nexedi.com/kirr/xlte/-/blob/8e606c64/amari/drb.py There each request comes with its own deadline - to catch "no reply", and the deadlines are implemented via timers. So there are 100 threads created every second which adds visible overhead, consumes a lot of virtual address space and RSS for threads stacks, and should be all unnecessary. We was tolerating even that for some time, but recently Joanne approached me with reports that xamari program, that does the polling, is leaking memory. With that, and because it was hard to find what is actually leaking, I've started to remove uncertainties and there are a lot of uncertainty in what is going on when lots of threads are being created over and over. In the end the leak turned out to be likely a different thing (see nexedi/pygolang!24, still discovered while working on hereby patch), but all of the above was enough motivation to finally start redoing the timers properly. -------- So when it comes to do the timers properly more or less, there is usually queue of armed timers, and a loop that picks entries from that queue to fire them. I was initially trying to do the simple thing and use std::priority_queue for that, because priority_queue is internally heap, and heaps can provide O(log(n)) insertion and removal of arbitrary element, plus O(1) "pick top element to process". Exactly what would suit. However I quickly found that even in 2024, std::priority_queue does not provide removal operation at all, and there is no such thing as e.g. std::sift_heap, that would help to implement that manually. Which is surprising, because e.g. libevent implements all that just ok via sifting up/down upon removal in logarithmic complexity: https://github.com/libevent/libevent/blob/80e25c02/minheap-internal.h#L96-L115 the lack of efficient removal operation turned out to be a blocker to use std::priority_queue because most of the timers, that are armed for timeouts, are never expired and upon successful completion of covered operation, the timer is stopped. In other words the timer is removed from the timer queue and the removal is one of the most often operations. So, if std::priority_queue cannot work, we would need to either bring in another implementation of a heap, or, if we are to bring something, bring and use something else that is more suitable for implementing timers. That reminded me that in 2005 for my Navy project, I already implemented custom timer wheel to handle timeouts after reading https://lwn.net/Articles/152436/ . Contrary to heaps, such timer wheels provide O(1) insertion and removal of timers and work generally faster. But this time I did not want to delve into implementing all that myself again and tried to look around of what is available out there. There was an update to kernel timer-wheel implementation described at https://lwn.net/Articles/646950/ and from that a project called Timeout.c was also found that provides implementation for such a wheel for user space: https://25thandclement.com/~william/projects/timeout.c.html . However when we are to pick third-party code, we should be ready to understand it and fix bugs there on our own. So the audit of timeout.c did not went very smoothly - there are many platform-depended places, and the issue tracker shows signs that sometimes not everything is ok with the implementation. With that I've looked around a bit more and found more compact and more portable Ratas library with good structure and description and whose audit came more well: https://www.snellman.net/blog/archive/2016-07-27-ratas-hierarchical-timer-wheel https://github.com/jsnell/ratas Here, after going through the code, I feel to be capable to understand issues and fix bugs myself if that would become needed. And the benchmark comparison of Timeout.c and Ratas shows that they should be of the same order regarding performance: https://lab.nexedi.com/kirr/misc/-/blob/4f51fd6/bench/time-wheel/ratas-vs-timeout.pdf ratas@382321d2 timeout@d6f15744 which makes Ratas the winner for me. Having timer-wheel implementation, the rest is just technique to glue it all together. One implementation aspect deserves to be mentioned though: The timer loop uses Semaphore.acquire, recently modernized to also accept timeout, to organize sleep in between pauses with also being able to be simultaneously woken up if new timer is armed with earlier expiration time. Other than that the changes are mostly straightforward. Please see the patch itself for details. Regarding how the new implementation is more efficient for what we had before, there are added benchmarks to measure arming timers that do not fire, and, for symmetry, arming timers that do fire. We are most interested in the first benchmark, because it shows how cheap or expensive it is to use timers to implement timeouts, but the second one is also useful to have to see the overhead of the whole timers machinery. On my machine under py3.11 they go as after this patch: name time/op timer_arm_cancel 805ns ± 0% timer_arm_fire 9.63µs ± 0% and before the patch the benchmarks simply do not run till the end because they run out of memory due to huge number of threads being created. Still with the following test program we can measure the effect new timers implementation has: ---- 8< ---- from golang import time def main(): δt_rate = 1*time.millisecond tprev = time.now() tnext = tprev + δt_rate while 1: timer = time.Timer(5*time.second) _ = timer.stop() assert _ is True t = time.now() δtsleep = tnext - t #print('sleep %.3f ms' % (δtsleep/time.millisecond)) time.sleep(δtsleep) tprev = tnext tnext += δt_rate main() ---- 8< ---- This program creates/arms and cancels a timer 1000 times per second. Before hereby patch this program consumes ~ 30% of CPU, while after hereby patch this program consumes ~ 7-8% of CPU. For the reference just a sleep part of that program, with all code related to timers removed consumes ~5% of CPU, while the consumption of plain sleep(1ms) in C and directly using system calls ---- 8< ---- #include <unistd.h> int main() { while (1) { usleep(1000); } return 0; } ---- 8< ---- is ~ 3-4% of CPU on my machine. /cc @jerome /cc ORS team (@jhuge, @lu.xu, @tomo, @xavier_thompson, @Daetalus) /proposed-for-review-on nexedi/pygolang!26