• Linus Torvalds's avatar
    module: error out early on concurrent load of the same module file · 9828ed3f
    Linus Torvalds authored
    It turns out that udev under certain circumstances will concurrently try
    to load the same modules over-and-over excessively.  This isn't a kernel
    bug, but it ends up affecting the kernel, to the point that under
    certain circumstances we can fail to boot, because the kernel uses a lot
    of memory to read all the module data all at once.
    
    Note that it isn't a memory leak, it's just basically a thundering herd
    problem happening at bootup with a lot of CPUs, with the worst cases
    then being pretty bad.
    
    Admittedly the worst situations are somewhat contrived: lots and lots of
    CPUs, not a lot of memory, and KASAN enabled to make it all slower and
    as such (unintentionally) exacerbate the problem.
    
    Luis explains: [1]
    
     "My best assessment of the situation is that each CPU in udev ends up
      triggering a load of duplicate set of modules, not just one, but *a
      lot*. Not sure what heuristics udev uses to load a set of modules per
      CPU."
    
    Petr Pavlu chimes in: [2]
    
     "My understanding is that udev workers are forked. An initial kmod
      context is created by the main udevd process but no sharing happens
      after the fork. It means that the mentioned memory pool logic doesn't
      really kick in.
    
      Multiple parallel load requests come from multiple udev workers, for
      instance, each handling an udev event for one CPU device and making
      the exactly same requests as all others are doing at the same time.
    
      The optimization idea would be to recognize these duplicate requests
      at the udevd/kmod level and converge them"
    
    Note that module loading has tried to mitigate this issue before, see
    for example commit 064f4536 ("module: avoid allocation if module is
    already present and ready"), which has a few ASCII graphs on memory use
    due to this same issue.
    
    However, while that noticed that the module was already loaded, and
    exited with an error early before spending any more time on setting up
    the module, it didn't handle the case of multiple concurrent module
    loads all being active - but not complete - at the same time.
    
    Yes, one of them will eventually win the race and finalize its copy, and
    the others will then notice that the module already exists and error
    out, but while this all happens, we have tons of unnecessary concurrent
    work being done.
    
    Again, the real fix is for udev to not do that (maybe it should use
    threads instead of fork, and have actual shared data structures and not
    cause duplicate work). That real fix is apparently not trivial.
    
    But it turns out that the kernel already has a pretty good model for
    dealing with concurrent access to the same file: the i_writecount of the
    inode.
    
    In fact, the module loading already indirectly uses 'i_writecount' ,
    because 'kernel_file_read()' will in fact do
    
    	ret = deny_write_access(file);
    	if (ret)
    		return ret;
    	...
    	allow_write_access(file);
    
    around the read of the file data.  We do not allow concurrent writes to
    the file, and return -ETXTBUSY if the file was open for writing at the
    same time as the module data is loaded from it.
    
    And the solution to the reader concurrency problem is to simply extend
    this "no concurrent writers" logic to simply be "exclusive access".
    
    Note that "exclusive" in this context isn't really some absolute thing:
    it's only exclusion from writers and from other "special readers" that
    do this writer denial.  So we simply introduce a variation of that
    "deny_write_access()" logic that not only denies write access, but also
    requires that this is the _only_ such access that denies write access.
    
    Which means that you can't start loading a module that is already being
    loaded as a module by somebody else, or you will get the same -ETXTBSY
    error that you would get if there were writers around.
    
    [ It also means that you can't try to load a currently executing
      executable as a module, for the same reason: executables do that same
      "deny_write_access()" thing, and that's obviously where the whole
      ETXTBSY logic traditionally came from.
    
      This is not a problem for kernel modules, since the set of normal
      executable files and kernel module files is entirely disjoint. ]
    
    This new function is called "exclusive_deny_write_access()", and the
    implementation is trivial, in that it's just an atomic decrement of
    i_writecount if it was 0 before.
    
    To use that new exclusivity check, all we then do is wrap the module
    loading with that exclusive_deny_write_access()() / allow_write_access()
    pair.  The actual patch is a bit bigger than that, because we want to
    surround not just the "load file data" part, but the whole module setup,
    to get maximum exclusion.
    
    So this ends up splitting up "finit_module()" into a few helper
    functions to make it all very clear and legible.
    
    In Luis' test-case (bringing up 255 vcpu's in a virtual machine [3]),
    the "wasted vmalloc" space (ie module data read into a vmalloc'ed area
    in order to be loaded as a module, but then discarded because somebody
    else loaded the same module instead) dropped from 1.8GiB to 474kB.  Yes,
    that's gigabytes to kilobytes.
    
    It doesn't drop completely to zero, because even with this change, you
    can still end up having completely serial pointless module loads, where
    one udev process has loaded a module fully (and thus the kernel has
    released that exclusive lock on the module file), and then another udev
    process tries to load the same module again.
    
    So while we cannot fully get rid of the fundamental bug in user space,
    we _can_ get rid of the excessive concurrent thundering herd effect.
    
    A couple of final side notes on this all:
    
     - This tweak only affects the "finit_module()" system call, which gives
       the kernel a file descriptor with the module data.
    
       You can also just feed the module data as raw data from user space
       with "init_module()" (note the lack of 'f' at the beginning), and
       obviously for that case we do _not_ have any "exclusive read" logic.
    
       So if you absolutely want to do things wrong in user space, and try
       to load the same module multiple times, and error out only later when
       the kernel ends up saying "you can't load the same module name
       twice", you can still do that.
    
       And in fact, some distros will do exactly that, because they will
       uncompress the kernel module data in user space before feeding it to
       the kernel (mainly because they haven't started using the new kernel
       side decompression yet).
    
       So this is not some absolute "you can't do concurrent loads of the
       same module". It's literally just a very simple heuristic that will
       catch it early in case you try to load the exact same module file at
       the same time, and in that case avoid a potentially nasty situation.
    
     - There is another user of "deny_write_access()": the verity code that
       enables fs-verity on a file (the FS_IOC_ENABLE_VERITY ioctl).
    
       If you use fs-verity and you care about verifying the kernel modules
       (which does make sense), you should do it *before* loading said
       kernel module. That may sound obvious, but now the implementation
       basically requires it. Because if you try to do it concurrently, the
       kernel may refuse to load the module file that is being set up by the
       fs-verity code.
    
     - This all will obviously mean that if you insist on loading the same
       module in parallel, only one module load will succeed, and the others
       will return with an error.
    
       That was true before too, but what is different is that the -ETXTBSY
       error can be returned *before* the success case of another process
       fully loading and instantiating the module.
    
       Again, that might sound obvious, and it is indeed the whole point of
       the whole change: we are much quicker to notice the whole "you're
       already in the process of loading this module".
    
       So it's very much intentional, but it does mean that if you just
       spray the kernel with "finit_module()", and expect that the module is
       immediately loaded afterwards without checking the return value, you
       are doing something horribly horribly wrong.
    
       I'd like to say that that would never happen, but the whole _reason_
       for this commit is that udev is currently doing something horribly
       horribly wrong, so ...
    
    Link: https://lore.kernel.org/all/ZEGopJ8VAYnE7LQ2@bombadil.infradead.org/ [1]
    Link: https://lore.kernel.org/all/23bd0ce6-ef78-1cd8-1f21-0e706a00424a@suse.com/ [2]
    Link: https://lore.kernel.org/lkml/ZG%2Fa+nrt4%2FAAUi5z@bombadil.infradead.org/ [3]
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Lucas De Marchi <lucas.demarchi@intel.com>
    Cc: Petr Pavlu <petr.pavlu@suse.com>
    Tested-by: default avatarLuis Chamberlain <mcgrof@kernel.org>
    Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    9828ed3f
main.c 84.2 KB