    lib/sort: make swap functions more generic · 37d0ec34
    George Spelvin authored
    Patch series "lib/sort & lib/list_sort: faster and smaller", v2.
    
    Because CONFIG_RETPOLINE has made indirect calls much more expensive, I
    thought I'd try to reduce the number made by the library sort functions.
    
    The first three patches apply to lib/sort.c.
    
    Patch #1 is a simple optimization.  The built-in swap has special cases
    for aligned 4- and 8-byte objects.  But those are almost never used;
    most calls to sort() work on larger structures, which fall back to the
    byte-at-a-time loop.  This generalizes them to aligned *multiples* of 4
    and 8 bytes.  (If nothing else, it saves an awful lot of energy by not
    thrashing the store buffers as much.)
    
    Patch #2 grabs a juicy piece of low-hanging fruit.  I agree that nice
    simple solid heapsort is preferable to more complex algorithms (sorry,
    Andrey), but it's possible to implement heapsort with far fewer
    comparisons (50% asymptotically, 25-40% reduction for realistic sizes)
    than the way it's been done up to now.  And with some care, the code
    ends up smaller, as well.  This is the "big win" patch.
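
    A minimal sketch of one well-known way to save comparisons, the
    "bottom-up" sift-down, is shown here on a plain int array.  It is
    an illustration under assumptions, not the code from the patch: the
    names sift_down_bottom_up() and heapsort_ints() are hypothetical,
    and the real sort() works on opaque elements through cmp and swap
    callbacks.  The idea is to descend to a leaf with one compare per
    level (larger child only), then climb back up to where the sifted
    element belongs, which is usually near the bottom.

```c
#include <assert.h>
#include <stddef.h>

static void swap_int(int *x, int *y)
{
	int t = *x;

	*x = *y;
	*y = t;
}

/* Sift heap[root] down within the max-heap heap[0..n), bottom-up style. */
static void sift_down_bottom_up(int *heap, size_t root, size_t n)
{
	size_t b = root, c;

	assert(root < n);

	/* 1. Follow the larger child down to a leaf: one compare per level. */
	while ((c = 2 * b + 1) + 1 < n)
		b = heap[c] >= heap[c + 1] ? c : c + 1;
	if (c + 1 == n)		/* bottom node with a lone left child */
		b = c;

	/* 2. Climb back up to the first slot holding something larger. */
	while (b != root && heap[root] >= heap[b])
		b = (b - 1) / 2;

	/* 3. Rotate the path: root's value lands at b, the rest move up. */
	c = b;
	while (b != root) {
		b = (b - 1) / 2;
		swap_int(&heap[b], &heap[c]);
	}
}

/* Hypothetical driver: sorts v[0..n) in ascending order. */
static void heapsort_ints(int *v, size_t n)
{
	size_t i;

	if (n < 2)
		return;
	for (i = n / 2; i-- > 0;)		/* build the heap */
		sift_down_bottom_up(v, i, n);
	for (i = n - 1; i > 0; i--) {		/* repeatedly extract the max */
		swap_int(&v[0], &v[i]);
		sift_down_bottom_up(v, 0, i);
	}
}
```

    A conventional sift-down spends two compares per level (pick the
    larger child, then compare it with the sifted element); the descent
    above spends one, and the climb back up is typically short.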
    
    Patch #3 adds the same sort of indirect call bypass that has been added
    to the net code of late.  The great majority of the callers use the
    builtin swap functions, so replace the indirect call to sort_func with a
    (highly predictable) series of if() statements.  Rather surprisingly,
    this decreased code size, as the swap functions were inlined and their
    prologue & epilogue code eliminated.
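
    The general shape of such a bypass can be sketched as follows.
    This is an illustration only: do_swap(), swap_ints(), swap_bytes()
    and counting_swap() are hypothetical names for this example, and
    the patch itself may identify the built-in functions differently.
    The point is that a pointer compare plus a direct (inlinable) call
    replaces the indirect call in the common case.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef void (*swap_func_t)(void *a, void *b, size_t size);

/* Built-in swap for int-aligned data whose size is a multiple of int. */
static void swap_ints(void *a, void *b, size_t n)
{
	unsigned char *pa = a, *pb = b;

	assert(n % sizeof(int) == 0);
	while (n) {
		int t, u;

		n -= sizeof(int);
		memcpy(&t, pa + n, sizeof(int));
		memcpy(&u, pb + n, sizeof(int));
		memcpy(pa + n, &u, sizeof(int));
		memcpy(pb + n, &t, sizeof(int));
	}
}

/* Built-in fallback: swap one byte at a time. */
static void swap_bytes(void *a, void *b, size_t n)
{
	unsigned char *pa = a, *pb = b;

	while (n--) {
		unsigned char t = pa[n];

		pa[n] = pb[n];
		pb[n] = t;
	}
}

/* Hypothetical caller-supplied swap, used to exercise the slow path. */
static int custom_calls;
static void counting_swap(void *a, void *b, size_t n)
{
	custom_calls++;
	swap_bytes(a, b, n);
}

/* Direct calls for the known built-ins; indirect call only otherwise. */
static void do_swap(void *a, void *b, size_t size, swap_func_t swap)
{
	if (swap == swap_ints)
		swap_ints(a, b, size);
	else if (swap == swap_bytes)
		swap_bytes(a, b, size);
	else
		swap(a, b, size);	/* the only remaining indirect call */
}
```

    Since the same branch is taken on every swap within one sort, the
    if() chain predicts well, and the compiler is free to inline the
    built-ins into do_swap().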
    
    lib/list_sort.c is a bit trickier, as merge sort is already close to
    optimal, and we don't want to introduce triumphs of theory over
    practicality like the Ford-Johnson merge-insertion sort.
    
    Patch #4, without changing the algorithm, chops 32% off the code size
    and removes the part[MAX_LIST_LENGTH+1] pointer array (and the
    corresponding upper limit on efficiently sortable input size).
    
    Patch #5 improves the algorithm.  The previous code is already optimal
    for power-of-two (or slightly smaller) size inputs, but when the input
    size is just over a power of 2, there's a very unbalanced final merge.
    
    There are, in the literature, several algorithms which solve this, but
    they all depend on the "breadth-first" merge order which was replaced by
    commit 835cc0c8 with a more cache-friendly "depth-first" order.
    Some hard thinking came up with a depth-first algorithm which defers
    merges as little as possible while avoiding bad merges.  This saves
    0.2*n compares, averaged over all sizes.
    
    The code size increase is minimal (64 bytes on x86-64, reducing the net
    savings to 26%), but the comments expanded significantly to document the
    clever algorithm.
    
    TESTING NOTES: I have some ugly user-space benchmarking code which I
    used for testing before moving this code into the kernel.  Shout if you
    want a copy.
    
    I'm running this code right now, with CONFIG_TEST_SORT and
    CONFIG_TEST_LIST_SORT, but I confess I haven't rebooted since the last
    round of minor edits to quell checkpatch.  I figure there will be at
    least one round of comments and final testing.
    
    This patch (of 5):
    
    Rather than having special-case swap functions for 4- and 8-byte
    objects, special-case aligned multiples of 4 or 8 bytes.  This speeds up
    most users of sort() by avoiding fallback to the byte copy loop.
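
    As a rough user-space sketch of the idea (not the patch's code):
    check whether both pointers and the element size are multiples of 8
    or 4, and if so swap a word at a time for the whole element instead
    of requiring the element to be exactly one word.  The helper names
    is_aligned() and swap_any() are assumptions for this example, and
    the alignment test is repeated on every call here only for brevity;
    it could equally be done once per sort.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* True if both pointers and size are multiples of align (a power of 2). */
static int is_aligned(const void *a, const void *b, size_t size, size_t align)
{
	return (((uintptr_t)a | (uintptr_t)b | size) & (align - 1)) == 0;
}

/* Swap n bytes in 8-byte units; n must be a multiple of 8. */
static void swap_words_64(void *a, void *b, size_t n)
{
	unsigned char *pa = a, *pb = b;

	assert(n % 8 == 0);
	while (n) {
		uint64_t t, u;

		n -= 8;
		memcpy(&t, pa + n, 8);	/* memcpy avoids aliasing problems */
		memcpy(&u, pb + n, 8);
		memcpy(pa + n, &u, 8);
		memcpy(pb + n, &t, 8);
	}
}

/* Swap n bytes in 4-byte units; n must be a multiple of 4. */
static void swap_words_32(void *a, void *b, size_t n)
{
	unsigned char *pa = a, *pb = b;

	assert(n % 4 == 0);
	while (n) {
		uint32_t t, u;

		n -= 4;
		memcpy(&t, pa + n, 4);
		memcpy(&u, pb + n, 4);
		memcpy(pa + n, &u, 4);
		memcpy(pb + n, &t, 4);
	}
}

/* Fallback: swap one byte at a time. */
static void swap_bytes(void *a, void *b, size_t n)
{
	unsigned char *pa = a, *pb = b;

	while (n--) {
		unsigned char t = pa[n];

		pa[n] = pb[n];
		pb[n] = t;
	}
}

/* Pick the widest unit that the pointers and the size allow. */
static void swap_any(void *a, void *b, size_t size)
{
	if (is_aligned(a, b, size, 8))
		swap_words_64(a, b, size);
	else if (is_aligned(a, b, size, 4))
		swap_words_32(a, b, size);
	else
		swap_bytes(a, b, size);
}
```

    With this shape, an 8-byte-aligned 40-byte structure is swapped in
    five 8-byte steps rather than forty single-byte steps.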
    
    Despite what ca96ab85 ("lib/sort: Add 64 bit swap function") claims,
    very few users of sort() sort pointers (or pointer-sized objects); most
    sort structures containing at least two words.  (E.g.
    drivers/acpi/fan.c:acpi_fan_get_fps() sorts an array of 40-byte struct
    acpi_fan_fps.)
    
    The functions also got renamed to reflect the fact that they support
    multiple words.  In the great tradition of bikeshedding, the names were
    by far the most contentious issue during review of this patch series.
    
    x86-64 code size 872 -> 886 bytes (+14)
    
    With feedback from Andy Shevchenko, Rasmus Villemoes and Geert
    Uytterhoeven.
    
    Link: http://lkml.kernel.org/r/f24f932df3a7fa1973c1084154f1cea596bcf341.1552704200.git.lkml@sdf.org
    Signed-off-by: George Spelvin <lkml@sdf.org>
    Acked-by: Andrey Abramov <st5pub@yandex.ru>
    Acked-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
    Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
    Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
    Cc: Geert Uytterhoeven <geert@linux-m68k.org>
    Cc: Daniel Wagner <daniel.wagner@siemens.com>
    Cc: Don Mullis <don.mullis@gmail.com>
    Cc: Dave Chinner <dchinner@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>