• George Spelvin's avatar
    lib/list_sort: optimize number of calls to comparison function · b5c56e0c
    George Spelvin authored
    CONFIG_RETPOLINE has severely degraded indirect function call
    performance, so it's worth putting some effort into reducing the number
    of times cmp() is called.
    
    This patch avoids badly unbalanced merges on unlucky input sizes.  It
    slightly increases the code size, but saves an average of 0.2*n calls to
    cmp().
    
    x86-64 code size 739 -> 803 bytes (+64)
    
    Unfortunately, there's not a lot of low-hanging fruit in a merge sort;
    it already performs only n*log2(n) - K*n + O(1) compares.  The leading
    coefficient is already at the theoretical limit (log2(n!) corresponds to
    K=1.4427), so we're fighting over the linear term, and the best
    mergesort can do is K=1.2645, achieved when n is a power of 2.
    
    The differences between mergesort variants appear when n is *not* a
    power of 2; K is a function of the fractional part of log2(n).  Top-down
    mergesort does best of all, achieving a minimum K=1.2408, and an average
    (over all sizes) K=1.248.  However, that requires knowing the number of
    entries to be sorted ahead of time, and making a full pass over the
    input to count it conflicts with a second performance goal, which is
    cache blocking.
    
    Obviously, we have to read the entire list into L1 cache at some point,
    and performance is best if it fits.  But if it doesn't fit, each full
    pass over the input causes a cache miss per element, which is
    undesirable.
    
    While textbooks explain bottom-up mergesort as a succession of merging
    passes, practical implementations do merging in depth-first order: as
    soon as two lists of the same size are available, they are merged.  This
    allows as many merge passes as possible to fit into L1; only the final
    few merges force cache misses.
    
    This cache-friendly depth-first merge order depends on us merging the
    beginning of the input as much as possible before we've even seen the
    end of the input (and thus know its size).
    
    The simple eager merge pattern causes bad performance when n is just
    over a power of 2.  If n=1028, the final merge is between 1024- and
    4-element lists, which is wasteful of comparisons.  (This is actually
    worse on average than n=1025, because a 1204:1 merge will, on average,
    end after 512 compares, while 1024:4 will walk 4/5 of the list.)
    
    Because of this, bottom-up mergesort achieves K < 0.5 for such sizes,
    and has an average (over all sizes) K of around 1.  (My experiments show
    K=1.01, while theory predicts K=0.965.)
    
    There are "worst-case optimal" variants of bottom-up mergesort which
    avoid this bad performance, but the algorithms given in the literature,
    such as queue-mergesort and boustrodephonic mergesort, depend on the
    breadth-first multi-pass structure that we are trying to avoid.
    
    This implementation is as eager as possible while ensuring that all
    merge passes are at worst 1:2 unbalanced.  This achieves the same
    average K=1.207 as queue-mergesort, which is 0.2*n better then
    bottom-up, and only 0.04*n behind top-down mergesort.
    
    Specifically, defers merging two lists of size 2^k until it is known
    that there are 2^k additional inputs following.  This ensures that the
    final uneven merges triggered by reaching the end of the input will be
    at worst 2:1.  This will avoid cache misses as long as 3*2^k elements
    fit into the cache.
    
    (I confess to being more than a little bit proud of how clean this code
    turned out.  It took a lot of thinking, but the resultant inner loop is
    very simple and efficient.)
    
    Refs:
      Bottom-up Mergesort: A Detailed Analysis
      Wolfgang Panny, Helmut Prodinger
      Algorithmica 14(4):340--354, October 1995
      https://doi.org/10.1007/BF01294131
      https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.6.5260
    
      The cost distribution of queue-mergesort, optimal mergesorts, and
      power-of-two rules
      Wei-Mei Chen, Hsien-Kuei Hwang, Gen-Huey Chen
      Journal of Algorithms 30(2); Pages 423--448, February 1999
      https://doi.org/10.1006/jagm.1998.0986
      https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.4.5380
    
      Queue-Mergesort
      Mordecai J. Golin, Robert Sedgewick
      Information Processing Letters, 48(5):253--259, 10 December 1993
      https://doi.org/10.1016/0020-0190(93)90088-q
      https://sci-hub.tw/10.1016/0020-0190(93)90088-Q
    
    Feedback from Rasmus Villemoes <linux@rasmusvillemoes.dk>.
    
    Link: http://lkml.kernel.org/r/fd560853cc4dca0d0f02184ffa888b4c1be89abc.1552704200.git.lkml@sdf.orgSigned-off-by: default avatarGeorge Spelvin <lkml@sdf.org>
    Acked-by: default avatarAndrey Abramov <st5pub@yandex.ru>
    Acked-by: default avatarRasmus Villemoes <linux@rasmusvillemoes.dk>
    Reviewed-by: default avatarAndy Shevchenko <andriy.shevchenko@linux.intel.com>
    Cc: Daniel Wagner <daniel.wagner@siemens.com>
    Cc: Dave Chinner <dchinner@redhat.com>
    Cc: Don Mullis <don.mullis@gmail.com>
    Cc: Geert Uytterhoeven <geert@linux-m68k.org>
    Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
    Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    b5c56e0c
list_sort.c 8.35 KB