1. 09 May, 2020 6 commits
    • gcc-10: disable 'restrict' warning for now · adc71920
      Linus Torvalds authored
      gcc-10 now warns about passing aliasing pointers to functions that take
      restricted pointers.
      
      That's actually a great warning, and if we ever start using 'restrict'
      in the kernel, it might be quite useful.  But right now we don't, and it
      turns out that the only thing this warns about is an idiom where we have
      declared a few functions to be "printf-like" (which seems to make gcc
      pick up the restricted pointer thing), and then we print to the same
      buffer that we also use as an input.
      
      And people do that as an odd concatenation pattern, with code like this:
      
          #define sysfs_show_gen_prop(buffer, fmt, ...) \
              snprintf(buffer, PAGE_SIZE, "%s"fmt, buffer, __VA_ARGS__)
      
      where we have 'buffer' as both the destination of the final result, and
      as the initial argument.
      
      Yes, it's a bit questionable.  And outside of the kernel, people do have
      standard declarations like
      
          int snprintf( char *restrict buffer, size_t bufsz,
                        const char *restrict format, ... );
      
      where that output buffer is marked as a restrict pointer that cannot
      alias with any other arguments.
      
      But in the context of the kernel, that 'use snprintf() to concatenate to
      the end result' does work, and the pattern shows up in multiple places.
      And we have not marked our own version of snprintf() as taking restrict
      pointers, so the warning is incorrect for now, and gcc picks it up on
      its own.
      
      If we do start using 'restrict' in the kernel (and it might be a good
      idea if people find places where it matters), we'll need to figure out
      how to avoid this issue for snprintf and friends.  But in the meantime,
      this warning is not useful.
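      
      For comparison, a minimal user-space sketch of a concatenation helper
      that never passes the destination buffer as an input argument.  The
      name append_fmt() is hypothetical and is not a kernel API; it assumes
      the buffer already holds a NUL-terminated string.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not a kernel API: append formatted text to the
 * NUL-terminated string already held in buf. The output pointer handed
 * to vsnprintf() is buf + used, and buf is never passed as an input
 * argument, so source and destination do not overlap.
 *
 * Returns the length of the resulting string (at most size - 1).
 */
static size_t append_fmt(char *buf, size_t size, const char *fmt, ...)
{
	va_list ap;
	size_t used = strlen(buf);
	int n;

	assert(used < size);	/* buf must already be a valid string */

	va_start(ap, fmt);
	n = vsnprintf(buf + used, size - used, fmt, ap);
	va_end(ap);

	if (n < 0)
		return used;		/* formatting error: report the old length */
	if ((size_t)n >= size - used)
		return size - 1;	/* output was truncated to fit */
	return used + (size_t)n;
}
```

      Because the write starts at the current end of the string, the
      existing contents are never read by the formatting code, which is
      the property the 'restrict' qualifier asks for.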
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • gcc-10: disable 'stringop-overflow' warning for now · 5a76021c
      Linus Torvalds authored
      This is the final array bounds warning removal for gcc-10 for now.
      
      Again, the warning is good, and we should re-enable all these warnings
      when we have converted all the legacy array declaration cases to
      flexible arrays. But in the meantime, it's just noise.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • gcc-10: disable 'array-bounds' warning for now · 44720996
      Linus Torvalds authored
      This is another fine warning, related to the 'zero-length-bounds' one,
      but hitting the same historical code in the kernel.
      
      Because C didn't historically support flexible array members, we have
      code that instead uses a one-sized array, the same way we have cases of
      zero-sized arrays.
      
      The one-sized arrays come from either not wanting to use the gcc
      zero-sized array extension, or from a slight convenience-feature, where
      particularly for strings, the size of the structure now includes the
      allocation for the final NUL character.
      
      So with a "char name[1];" at the end of a structure, you can do things
      like
      
             v = my_malloc(sizeof(struct vendor) + strlen(name));
      
      and avoid the "+1" for the terminator.
      
      Yes, the modern way to do that is with a flexible array, and using
      'offsetof()' instead of 'sizeof()', and adding the "+1" by hand.  That
      also technically gets the size "more correct" in that it avoids any
      alignment (and thus padding) issues, but this is another long-term
      cleanup thing that will not happen for 5.7.
      
      So disable the warning for now, even though it's potentially quite
      useful.  Having a slew of warnings that then hide more urgent new issues
      is not an improvement.
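      
      The two sizing conventions can be compared in a small user-space
      sketch.  The struct names, the 'id' field and the helper functions
      are assumptions made for illustration; only the "char name[1];"
      idiom and the offsetof() approach come from the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Legacy layout: the one-element array already pays for the NUL byte. */
struct vendor_legacy {
	int id;
	char name[1];
};

/* Modern layout: a C99 flexible array member adds nothing to sizeof(). */
struct vendor_flex {
	int id;
	char name[];
};

/* Legacy sizing: sizeof() includes name[1] plus any trailing padding. */
static size_t legacy_alloc_size(const char *name)
{
	return sizeof(struct vendor_legacy) + strlen(name);
}

/* Flexible-array sizing: start of the array, plus string, plus NUL. */
static size_t flex_alloc_size(const char *name)
{
	return offsetof(struct vendor_flex, name) + strlen(name) + 1;
}

/* Allocate and fill a flexible-array vendor; caller frees with free(). */
static struct vendor_flex *vendor_new(int id, const char *name)
{
	struct vendor_flex *v;

	assert(name != NULL);
	v = malloc(flex_alloc_size(name));
	if (!v)
		return NULL;
	v->id = id;
	strcpy(v->name, name);
	return v;
}
```

      On common ABIs the legacy structure is padded to a multiple of the
      'int' alignment, so legacy_alloc_size() typically requests a few
      bytes more than flex_alloc_size() for the same string.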
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • gcc-10: disable 'zero-length-bounds' warning for now · 5c45de21
      Linus Torvalds authored
      This is a fine warning, but we still have a number of zero-length arrays
      in the kernel that come from the traditional gcc extension.  Yes, they
      are getting converted to flexible arrays, but in the meantime the gcc-10
      warning about zero-length bounds is very verbose, and is hiding other
      issues.
      
      I missed one actual build failure because it was hidden among hundreds
      of lines of warning.  Thankfully I caught it on the second go before
      pushing things out, but it convinced me that I really need to disable
      the new warnings for now.
      
      We'll hopefully be all done with our conversion to flexible arrays in
      the not too distant future, and we can then re-enable this warning.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Stop the ad-hoc games with -Wno-maybe-initialized · 78a5255f
      Linus Torvalds authored
      We have some rather random rules about when we accept the
      "maybe-initialized" warnings, and when we don't.
      
      For example, we consider it unreliable for gcc versions < 4.9, but also
      if -O3 is enabled, or if optimizing for size.  And then various kernel
      config options disabled it, because they know that they trigger that
      warning by confusing gcc sufficiently (ie PROFILE_ALL_BRANCHES).
      
      And now gcc-10 seems to be introducing a lot of those warnings too, so
      it falls under the same heading as 4.9 did.
      
      At the same time, we have a very straightforward way to _enable_ that
      warning when wanted: use "W=2" to enable more warnings.
      
      So stop playing these ad-hoc games, and just disable that warning by
      default, with the known and straight-forward "if you want to work on the
      extra compiler warnings, use W=123".
      
      Would it be great to have code that is always so obvious that it never
      confuses the compiler whether a variable is used initialized or not?
      Yes, it would.  In a perfect world, the compilers would be smarter, and
      our source code would be simpler.
      
      That's currently not the world we live in, though.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Merge tag 'io_uring-5.7-2020-05-08' of git://git.kernel.dk/linux-block · 1d3962ae
      Linus Torvalds authored
      Pull io_uring fixes from Jens Axboe:
      
       - Fix finish_wait() balancing in file cancelation (Xiaoguang)
      
       - Ensure early cleanup of resources in ring map failure (Xiaoguang)
      
       - Ensure IORING_OP_SPLICE does the right file mode checks (Pavel)
      
       - Remove file opening from openat/openat2/statx, it's not needed and
         messes with O_PATH
      
      * tag 'io_uring-5.7-2020-05-08' of git://git.kernel.dk/linux-block:
        io_uring: don't use 'fd' for openat/openat2/statx
        splice: move f_mode checks to do_{splice,tee}()
        io_uring: handle -EFAULT properly in io_uring_setup()
        io_uring: fix mismatched finish_wait() calls in io_uring_cancel_files()
  2. 08 May, 2020 28 commits
    • Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · d5eeab8d
      Linus Torvalds authored
      Pull SCSI fixes from James Bottomley:
       "Four minor fixes, all in drivers (qla2xxx, ibmvfc, ibmvscsi)"
      
      * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
        scsi: ibmvscsi: Fix WARN_ON during event pool release
        scsi: ibmvfc: Don't send implicit logouts prior to NPIV login
        scsi: qla2xxx: Delete all sessions before unregister local nvme port
        scsi: qla2xxx: Fix hang when issuing nvme disconnect-all in NPIV
    • Merge tag 'ceph-for-5.7-rc5' of git://github.com/ceph/ceph-client · eb24fdd8
      Linus Torvalds authored
      Pull ceph fixes from Ilya Dryomov:
       "Fixes for an endianness handling bug that prevented mounts on
        big-endian arches, a spammy log message and a couple error paths.
      
        Also included a MAINTAINERS update"
      
      * tag 'ceph-for-5.7-rc5' of git://github.com/ceph/ceph-client:
        ceph: demote quotarealm lookup warning to a debug message
        MAINTAINERS: remove myself as ceph co-maintainer
        ceph: fix double unlock in handle_cap_export()
        ceph: fix special error code in ceph_try_get_caps()
        ceph: fix endianness bug when handling MDS session feature bits
    • ceph: demote quotarealm lookup warning to a debug message · 12ae44a4
      Luis Henriques authored
      A misconfigured cephx can easily result in having the kernel client
      flooding the logs with:
      
        ceph: Can't lookup inode 1 (err: -13)
      
      Change this message to debug level.
      
      Cc: stable@vger.kernel.org
      URL: https://tracker.ceph.com/issues/44546
      Signed-off-by: Luis Henriques <lhenriques@suse.com>
      Reviewed-by: Jeff Layton <jlayton@kernel.org>
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    • Merge tag 'char-misc-5.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc · 4334f30e
      Linus Torvalds authored
      Pull char/misc driver fixes from Greg KH:
       "Here are some small driver fixes for 5.7-rc5 that resolve a number of
        minor reported issues:
      
         - mhi bus driver fixes found as people actually use the code
      
         - phy driver fixes and compat string additions
      
         - most driver fix due to link order changing when the core moved out
           of staging
      
         - mei driver fix
      
         - interconnect build warning fix
      
        All of these have been in linux-next for a while with no reported
        issues"
      
      * tag 'char-misc-5.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
        bus: mhi: core: Fix channel device name conflict
        bus: mhi: core: Fix typo in comment
        bus: mhi: core: Offload register accesses to the controller
        bus: mhi: core: Remove link_status() callback
        bus: mhi: core: Make sure to powerdown if mhi_sync_power_up fails
        bus: mhi: Fix parsing of mhi_flags
        mei: me: disable mei interface on LBG servers.
        phy: qualcomm: usb-hs-28nm: Prepare clocks in init
        MAINTAINERS: Add Vinod Koul as Generic PHY co-maintainer
        interconnect: qcom: Move the static keyword to the front of declaration
        most: core: use function subsys_initcall()
        bus: mhi: core: Fix a NULL vs IS_ERR check in mhi_create_devices()
        phy: qcom-qusb2: Re add "qcom,sdm845-qusb2-phy" compat string
        phy: tegra: Select USB_COMMON for usb_get_maximum_speed()
    • Merge tag 'driver-core-5.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core · c61529f6
      Linus Torvalds authored
      Pull driver core fixes from Greg KH:
       "Here are a number of small driver core fixes for 5.7-rc5 to resolve a
        bunch of reported issues with the current tree.
      
        Biggest here are the reverts and patches from John Stultz to resolve a
        bunch of deferred probe regressions we have been seeing in 5.7-rc
        right now.
      
        Along with those are some other smaller fixes:
      
         - coredump crash fix
      
         - devlink fix for when permissive mode was enabled
      
         - amba and platform device dma_parms fixes
      
         - component error silenced for when deferred probe happens
      
        All of these have been in linux-next for a while with no reported
        issues"
      
      * tag 'driver-core-5.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
        regulator: Revert "Use driver_deferred_probe_timeout for regulator_init_complete_work"
        driver core: Ensure wait_for_device_probe() waits until the deferred_probe_timeout fires
        driver core: Use dev_warn() instead of dev_WARN() for deferred_probe_timeout warnings
        driver core: Revert default driver_deferred_probe_timeout value to 0
        component: Silence bind error on -EPROBE_DEFER
        driver core: Fix handling of fw_devlink=permissive
        coredump: fix crash when umh is disabled
        amba: Initialize dma_parms for amba devices
        driver core: platform: Initialize dma_parms for platform devices
    • Merge tag 'staging-5.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging · e7a1c733
      Linus Torvalds authored
      Pull staging driver fixes from Greg KH:
       "Here are three small driver fixes for 5.7-rc5.
      
        Two of these are documentation fixes:
      
         - MAINTAINERS update due to removed driver
      
         - removing Wolfram from the ks7010 driver TODO file
      
        The other patch is a real fix:
      
         - fix gasket driver to properly check the return value of a call
      
        All of these have been in linux-next for a while with no reported
        issues"
      
      * tag 'staging-5.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
        staging: gasket: Check the return value of gasket_get_bar_index()
        staging: ks7010: remove me from CC list
        MAINTAINERS: remove entry after hp100 driver removal
    • Merge tag 'tty-5.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty · cbd0e482
      Linus Torvalds authored
      Pull tty/serial fixes from Greg KH:
       "Here are three small TTY/Serial/VT fixes for 5.7-rc5:
      
         - revert for the bcm63xx driver "fix" that was incorrect
      
         - vt unicode console bugfix
      
         - xilinx_uartps console driver fix
      
        All of these have been in linux next with no reported issues"
      
      * tag 'tty-5.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
        tty: xilinx_uartps: Fix missing id assignment to the console
        vt: fix unicode console freeing with a common interface
        Revert "tty: serial: bcm63xx: fix missing clk_put() in bcm63xx_uart"
    • Merge tag 'usb-5.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb · 0a0b96b2
      Linus Torvalds authored
      Pull USB fixes from Greg KH:
       "Here are some small USB fixes for 5.7-rc5 to resolve some reported
        issues:
      
         - syzbot found problems fixed
      
         - usbfs dma mapping fix
      
         - typec bugfixes
      
         - chipidea bugfix
      
         - usb4/thunderbolt fix
      
         - new device ids/quirks
      
        All of these have been in linux-next for a while with no reported
        issues"
      
      * tag 'usb-5.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
        usb: chipidea: msm: Ensure proper controller reset using role switch API
        usb: typec: mux: intel: Handle alt mode HPD_HIGH
        usb: usbfs: correct kernel->user page attribute mismatch
        usb: typec: intel_pmc_mux: Fix the property names
        USB: core: Fix misleading driver bug report
        USB: serial: qcserial: Add DW5816e support
        USB: uas: add quirk for LaCie 2Big Quadra
        thunderbolt: Check return value of tb_sw_read() in usb4_switch_op()
        USB: serial: garmin_gps: add sanity checking for data length
    • Merge tag 'drm-fixes-2020-05-08' of git://anongit.freedesktop.org/drm/drm · 775a8e03
      Linus Torvalds authored
      Pull drm fixes from Dave Airlie:
       "Another pretty normal week. I didn't get any i915 fixes yet, so next
        week I'd expect double the usual i915, but otherwise a bunch of amdgpu
        and some scattered other fixes.
      
        hdcp:
         - fix HDCP regression
      
        amdgpu:
         - Runtime PM fixes
         - DC fix for PPC
         - Misc DC fixes
      
        virtio:
         - fix context ordering issue
      
        sun4i:
         - old gcc warning fix
      
        ingenic-drm:
         - missing module support"
      
      * tag 'drm-fixes-2020-05-08' of git://anongit.freedesktop.org/drm/drm:
        drm/amd/display: Prevent dpcd reads with passive dongles
        drm/amd/display: fix counter in wait_for_no_pipes_pending
        drm/amd/display: Update DCN2.1 DV Code Revision
        drm: Fix HDCP failures when SRM fw is missing
        sun6i: dsi: fix gcc-4.8
        drm: ingenic-drm: add MODULE_DEVICE_TABLE
        drm/virtio: create context before RESOURCE_CREATE_2D in 3D mode
        drm/amd/display: work around fp code being emitted outside of DC_FP_START/END
        drm/amdgpu/dc: Use WARN_ON_ONCE for ASSERT
        drm/amdgpu: drop redundant cg/pg ungate on runpm enter
        drm/amdgpu: move kfd suspend after ip_suspend_phase1
    • Merge branch 'akpm' (patches from Andrew) · af38553c
      Linus Torvalds authored
      Merge misc fixes from Andrew Morton:
       "14 fixes and one selftest to verify the ipc fixes herein"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        mm: limit boost_watermark on small zones
        ubsan: disable UBSAN_ALIGNMENT under COMPILE_TEST
        mm/vmscan: remove unnecessary argument description of isolate_lru_pages()
        epoll: atomically remove wait entry on wake up
        kselftests: introduce new epoll60 testcase for catching lost wakeups
        percpu: make pcpu_alloc() aware of current gfp context
        mm/slub: fix incorrect interpretation of s->offset
        scripts/gdb: repair rb_first() and rb_last()
        eventpoll: fix missing wakeup for ovflist in ep_poll_callback
        arch/x86/kvm/svm/sev.c: change flag passed to GUP fast in sev_pin_memory()
        scripts/decodecode: fix trapping instruction formatting
        kernel/kcov.c: fix typos in kcov_remote_start documentation
        mm/page_alloc: fix watchdog soft lockups during set_zone_contiguous()
        mm, memcg: fix error return value of mem_cgroup_css_alloc()
        ipc/mqueue.c: change __do_notify() to bypass check_kill_permission()
    • Merge tag 'drm-misc-fixes-2020-05-07' of git://anongit.freedesktop.org/drm/drm-misc into drm-fixes · a9fe6f18
      Dave Airlie authored
      A few minor fixes for an ordering issue in virtio, an (old) gcc warning
      in sun4i, a probe issue in ingenic-drm and a regression in the HDCP
      support.
      Signed-off-by: Dave Airlie <airlied@redhat.com>
      
      From: Maxime Ripard <maxime@cerno.tech>
      Link: https://patchwork.freedesktop.org/patch/msgid/20200507160130.id64niqgf5wsha4u@gilmour.lan
    • Merge tag 'amd-drm-fixes-5.7-2020-05-06' of... · c61b0b97
      Dave Airlie authored
      Merge tag 'amd-drm-fixes-5.7-2020-05-06' of git://people.freedesktop.org/~agd5f/linux into drm-fixes
      
      amd-drm-fixes-5.7-2020-05-06:
      
      amdgpu:
      - Runtime PM fixes
      - DC fix for PPC
      - Misc DC fixes
      Signed-off-by: Dave Airlie <airlied@redhat.com>
      From: Alex Deucher <alexdeucher@gmail.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20200506212257.3893-1-alexander.deucher@amd.com
    • Merge branch 'for-v5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security · 79dede78
      Linus Torvalds authored
      Pull security subsystem fix from James Morris:
       "Fix the default value of fs_context_parse_param hook"
      
      * 'for-v5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
        security: Fix the default value of fs_context_parse_param hook
    • mm: limit boost_watermark on small zones · 14f69140
      Henry Willard authored
      Commit 1c30844d ("mm: reclaim small amounts of memory when an
      external fragmentation event occurs") adds a boost_watermark() function
      which increases the min watermark in a zone by at least
      pageblock_nr_pages or the number of pages in a page block.
      
      On Arm64, with 64K pages and 512M huge pages, this is 8192 pages or
      512M.  It does this regardless of the number of pages managed in
      the zone or the likelihood of success.
      
      This can put the zone immediately under water in terms of allocating
      pages from the zone, and can cause a small machine to fail immediately
      due to OoM.  Unlike set_recommended_min_free_kbytes(), which
      substantially increases min_free_kbytes and is tied to THP,
      boost_watermark() can be called even if THP is not active.
      
      The problem is most likely to appear on architectures such as Arm64
      where pageblock_nr_pages is very large.
      
      It is desirable to run the kdump capture kernel in as small a space as
      possible to avoid wasting memory.  In some architectures, such as Arm64,
      there are restrictions on where the capture kernel can run, and
      therefore, the space available.  A capture kernel running in 768M can
      fail due to OoM immediately after boost_watermark() sets the min in zone
      DMA32, where most of the memory is, to 512M.  It fails even though there
      is over 500M of free memory.  With boost_watermark() suppressed, the
      capture kernel can run successfully in 448M.
      
      This patch limits boost_watermark() to boosting a zone's min watermark
      only when there are enough pages that the boost will produce positive
      results.  In this case that is estimated to be four times as many pages
      as pageblock_nr_pages.
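      
      The size check can be illustrated with a small stand-alone model.
      This is a sketch of the rule as described above, not the kernel's
      code: the function names are hypothetical, and treating a zone of
      exactly four page blocks as large enough is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model of the guard: a zone is only worth boosting when it manages
 * at least four page blocks' worth of pages. The ">=" boundary is an
 * assumption of this sketch.
 */
static bool zone_worth_boosting(unsigned long managed_pages,
				unsigned long pageblock_nr_pages)
{
	return managed_pages >= pageblock_nr_pages * 4;
}

/* Convert a size in MiB into a page count for a page size given in KiB. */
static unsigned long mib_to_pages(unsigned long mib, unsigned long page_kib)
{
	assert(page_kib > 0);
	return mib * 1024 / page_kib;
}
```

      With 64K pages, a 512M page block is mib_to_pages(512, 64) = 8192
      pages, so the threshold is 32768 pages (2G).  A 768M capture kernel
      has at most 12288 pages in any zone and is therefore left alone.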
      
      Mel said:
      
      : There is no harm in marking it stable.  Clearly it does not happen very
      : often but it's not impossible.  32-bit x86 is a lot less common now
      : which would previously have been vulnerable to triggering this easily.
      : ppc64 has a larger base page size but typically only has one zone.
      : arm64 is likely the most vulnerable, particularly when CMA is
      : configured with a small movable zone.
      
      Fixes: 1c30844d ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
      Signed-off-by: Henry Willard <henry.willard@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/1588294148-6586-1-git-send-email-henry.willard@oracle.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ubsan: disable UBSAN_ALIGNMENT under COMPILE_TEST · 8d58f222
      Kees Cook authored
      The documentation for UBSAN_ALIGNMENT already mentions that it should
      not be used on all*config builds (and for efficient-unaligned-access
      architectures), so just refactor the Kconfig to correctly implement this
      so randconfigs will stop creating insane images that freak out objtool
      under CONFIG_UBSAN_TRAP (due to the false positives producing functions
      that never return, etc).
      
      Link: http://lkml.kernel.org/r/202005011433.C42EA3E2D@keescook
      Fixes: 0887a7eb ("ubsan: add trap instrumentation option")
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Reported-by: Randy Dunlap <rdunlap@infradead.org>
      Link: https://lore.kernel.org/linux-next/202004231224.D6B3B650@keescook/
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/vmscan: remove unnecessary argument description of isolate_lru_pages() · 17e34526
      Qiwu Chen authored
      Since commit a9e7c39f ("mm/vmscan.c: remove 7th argument of
      isolate_lru_pages()"), the explanation of 'mode' argument has been
      unnecessary.  Let's remove it.
      Signed-off-by: Qiwu Chen <chenqiwu@xiaomi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20200501090346.2894-1-chenqiwu@xiaomi.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • epoll: atomically remove wait entry on wake up · 412895f0
      Roman Penyaev authored
      This patch does two things:
      
       - fixes a lost wakeup introduced by commit 339ddb53 ("fs/epoll:
         remove unnecessary wakeups of nested epoll")
      
       - improves performance for events delivery.
      
      The description of the problem is the following: if N (>1) threads are
      waiting on ep->wq for new events and M (>1) events come, it is quite
      likely that >1 wakeups hit the same wait queue entry, because there is
      quite a big window between __add_wait_queue_exclusive() and the
      following __remove_wait_queue() calls in ep_poll() function.
      
      This can lead to lost wakeups, because the thread that was woken up
      may not handle all of the events in ->rdllist.  (The problem is
      described in more detail here: https://lkml.org/lkml/2019/10/7/905)
      
      The idea of the current patch is to use init_wait() instead of
      init_waitqueue_entry().
      
      Internally init_wait() sets autoremove_wake_function as a callback,
      which removes the wait entry atomically (under the wq locks) from the
      list, thus the next coming wakeup hits the next wait entry in the wait
      queue, thus preventing lost wakeups.
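      
      The difference between the two behaviours can be seen in a
      deliberately simplified user-space model.  Everything here (the
      toy_* names, the fixed-size queue) is hypothetical; real wait queues
      involve task state and locking that this sketch ignores.  It only
      models the description above: each wakeup targets the first entry
      still on the queue, and several wakeups arrive before any waiter
      runs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_WAITERS 16

/* Toy wait queue: one slot per waiter, in queue order. */
struct toy_waitqueue {
	bool queued[TOY_MAX_WAITERS];	/* is the entry still on the queue? */
	int wakeups[TOY_MAX_WAITERS];	/* wakeups delivered to this entry */
	size_t nr;
};

static void toy_init(struct toy_waitqueue *q, size_t nr)
{
	assert(nr <= TOY_MAX_WAITERS);
	q->nr = nr;
	for (size_t i = 0; i < nr; i++) {
		q->queued[i] = true;
		q->wakeups[i] = 0;
	}
}

/* Deliver one wakeup to the first entry that is still queued. */
static void toy_wake_one(struct toy_waitqueue *q, bool autoremove)
{
	for (size_t i = 0; i < q->nr; i++) {
		if (!q->queued[i])
			continue;
		q->wakeups[i]++;
		if (autoremove)
			q->queued[i] = false;	/* entry leaves the queue at once */
		return;
	}
}

/* Fire 'events' wakeups back to back; count how many waiters were hit. */
static size_t toy_distinct_woken(size_t waiters, size_t events, bool autoremove)
{
	struct toy_waitqueue q;
	size_t distinct = 0;

	toy_init(&q, waiters);
	for (size_t e = 0; e < events; e++)
		toy_wake_one(&q, autoremove);
	for (size_t i = 0; i < q.nr; i++)
		if (q.wakeups[i] > 0)
			distinct++;
	return distinct;
}
```

      In this model, ten wakeups without removal all land on the first
      entry, while removal on wakeup spreads them across ten waiters.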
      
      The problem is reliably reproduced by the epoll60 test case [1].
      
      Removing the wait entry on wakeup also has performance benefits,
      because there is no need to take ep->lock and remove the wait entry
      from the queue after a successful wakeup.  Here is the timing output
      of the epoll60 test case:
      
        With explicit wakeup from ep_scan_ready_list() (the state of the
        code prior to 339ddb53):

          real    0m6.970s
          user    0m49.786s
          sys     0m0.113s

        After this patch:

          real    0m5.220s
          user    0m36.879s
          sys     0m0.019s
      
      The other testcase is the stress-epoll [2], where one thread consumes
      all the events and other threads produce many events:
      
        With explicit wakeup from ep_scan_ready_list() (the state of the
        code prior to 339ddb53):
      
          threads  events/ms  run-time ms
                8       5427         1474
               16       6163         2596
               32       6824         4689
               64       7060         9064
              128       6991        18309
      
        After this patch:
      
          threads  events/ms  run-time ms
                8       5598         1429
               16       7073         2262
               32       7502         4265
               64       7640         8376
              128       7634        16767
      
       (number of "events/ms" represents event bandwidth, thus higher is
        better; number of "run-time ms" represents overall time spent
        doing the benchmark, thus lower is better)
      
      [1] tools/testing/selftests/filesystems/epoll/epoll_wakeup_test.c
      [2] https://github.com/rouming/test-tools/blob/master/stress-epoll.c
      
      Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jason Baron <jbaron@akamai.com>
      Cc: Khazhismel Kumykov <khazhy@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Heiher <r@hev.cc>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200430130326.1368509-2-rpenyaev@suse.de
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kselftests: introduce new epoll60 testcase for catching lost wakeups · 474328c0
      Roman Penyaev authored
      This test case catches lost wake up introduced by commit 339ddb53
      ("fs/epoll: remove unnecessary wakeups of nested epoll")
      
      The test is simple: we have 10 threads and 10 event fds.  Each thread
      can harvest only 1 event.  One producer fires all 10 events at once and
      waits until all 10 events have been observed by the 10 threads.
      
      If a wakeup is lost, epoll_wait() times out and returns 0.
      
      The test case catches two sorts of problems.  The first is a forgotten
      wakeup on an event that hits the ->ovflist list; this problem was fixed
      by:
      
        5a2513239750 ("eventpoll: fix missing wakeup for ovflist in ep_poll_callback")
      
      The other problem occurs when several sequential events hit the same
      waiting thread, so other waiters get no wakeups.  This problem is fixed
      in the following patch.
      Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Khazhismel Kumykov <khazhy@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Heiher <r@hev.cc>
      Cc: Jason Baron <jbaron@akamai.com>
      Link: http://lkml.kernel.org/r/20200430130326.1368509-1-rpenyaev@suse.de
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • percpu: make pcpu_alloc() aware of current gfp context · 28307d93
      Filipe Manana authored
      Since 5.7-rc1, on btrfs we have a percpu counter initialization for
      which we always pass a GFP_KERNEL gfp_t argument (this happens since
      commit 2992df73 ("btrfs: Implement DREW lock")).
      
      That is safe in some contexts but not in others where allowing fs
      reclaim could lead to a deadlock because we are either holding some
      btrfs lock needed for a transaction commit or holding a btrfs
      transaction handle open.  Because of that we surround the call to the
      function that initializes the percpu counter with a NOFS context using
      memalloc_nofs_save() (this is done at btrfs_init_fs_root()).
      
      However it turns out that this is not enough to prevent a possible
      deadlock because pcpu_alloc() determines if it is in an atomic context
      by looking exclusively at the gfp flags passed to it (GFP_KERNEL in this
      case) and it is not aware that a NOFS context is set.
      
      Because pcpu_alloc() thinks it is in a non-atomic context it locks the
      pcpu_alloc_mutex.  This can result in a btrfs deadlock when
      pcpu_balance_workfn() is running, has acquired that mutex and is waiting
      for reclaim, while the btrfs task that called percpu_counter_init() (and
      therefore pcpu_alloc()) is holding either the btrfs commit_root
      semaphore or a transaction handle (done at fs/btrfs/backref.c:
      iterate_extent_inodes()), which prevents reclaim from finishing as an
      attempt to commit the current btrfs transaction will deadlock.
      
      Lockdep reports this issue with the following trace:
      
        ======================================================
        WARNING: possible circular locking dependency detected
        5.6.0-rc7-btrfs-next-77 #1 Not tainted
        ------------------------------------------------------
        kswapd0/91 is trying to acquire lock:
        ffff8938a3b3fdc8 (&delayed_node->mutex){+.+.}, at: __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
      
        but task is already holding lock:
        ffffffffb4f0dbc0 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x5/0x30
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #4 (fs_reclaim){+.+.}:
               fs_reclaim_acquire.part.0+0x25/0x30
               __kmalloc+0x5f/0x3a0
               pcpu_create_chunk+0x19/0x230
               pcpu_balance_workfn+0x56a/0x680
               process_one_work+0x235/0x5f0
               worker_thread+0x50/0x3b0
               kthread+0x120/0x140
               ret_from_fork+0x3a/0x50
      
        -> #3 (pcpu_alloc_mutex){+.+.}:
               __mutex_lock+0xa9/0xaf0
               pcpu_alloc+0x480/0x7c0
               __percpu_counter_init+0x50/0xd0
               btrfs_drew_lock_init+0x22/0x70 [btrfs]
               btrfs_get_fs_root+0x29c/0x5c0 [btrfs]
               resolve_indirect_refs+0x120/0xa30 [btrfs]
               find_parent_nodes+0x50b/0xf30 [btrfs]
               btrfs_find_all_leafs+0x60/0xb0 [btrfs]
               iterate_extent_inodes+0x139/0x2f0 [btrfs]
               iterate_inodes_from_logical+0xa1/0xe0 [btrfs]
               btrfs_ioctl_logical_to_ino+0xb4/0x190 [btrfs]
               btrfs_ioctl+0x165a/0x3130 [btrfs]
               ksys_ioctl+0x87/0xc0
               __x64_sys_ioctl+0x16/0x20
               do_syscall_64+0x5c/0x260
               entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
        -> #2 (&fs_info->commit_root_sem){++++}:
               down_write+0x38/0x70
               btrfs_cache_block_group+0x2ec/0x500 [btrfs]
               find_free_extent+0xc6a/0x1600 [btrfs]
               btrfs_reserve_extent+0x9b/0x180 [btrfs]
               btrfs_alloc_tree_block+0xc1/0x350 [btrfs]
               alloc_tree_block_no_bg_flush+0x4a/0x60 [btrfs]
               __btrfs_cow_block+0x122/0x5a0 [btrfs]
               btrfs_cow_block+0x106/0x240 [btrfs]
               commit_cowonly_roots+0x55/0x310 [btrfs]
               btrfs_commit_transaction+0x509/0xb20 [btrfs]
               sync_filesystem+0x74/0x90
               generic_shutdown_super+0x22/0x100
               kill_anon_super+0x14/0x30
               btrfs_kill_super+0x12/0x20 [btrfs]
               deactivate_locked_super+0x31/0x70
               cleanup_mnt+0x100/0x160
               task_work_run+0x93/0xc0
               exit_to_usermode_loop+0xf9/0x100
               do_syscall_64+0x20d/0x260
               entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
        -> #1 (&space_info->groups_sem){++++}:
               down_read+0x3c/0x140
               find_free_extent+0xef6/0x1600 [btrfs]
               btrfs_reserve_extent+0x9b/0x180 [btrfs]
               btrfs_alloc_tree_block+0xc1/0x350 [btrfs]
               alloc_tree_block_no_bg_flush+0x4a/0x60 [btrfs]
               __btrfs_cow_block+0x122/0x5a0 [btrfs]
               btrfs_cow_block+0x106/0x240 [btrfs]
               btrfs_search_slot+0x50c/0xd60 [btrfs]
               btrfs_lookup_inode+0x3a/0xc0 [btrfs]
               __btrfs_update_delayed_inode+0x90/0x280 [btrfs]
               __btrfs_commit_inode_delayed_items+0x81f/0x870 [btrfs]
               __btrfs_run_delayed_items+0x8e/0x180 [btrfs]
               btrfs_commit_transaction+0x31b/0xb20 [btrfs]
               iterate_supers+0x87/0xf0
               ksys_sync+0x60/0xb0
               __ia32_sys_sync+0xa/0x10
               do_syscall_64+0x5c/0x260
               entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
        -> #0 (&delayed_node->mutex){+.+.}:
               __lock_acquire+0xef0/0x1c80
               lock_acquire+0xa2/0x1d0
               __mutex_lock+0xa9/0xaf0
               __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
               btrfs_evict_inode+0x40d/0x560 [btrfs]
               evict+0xd9/0x1c0
               dispose_list+0x48/0x70
               prune_icache_sb+0x54/0x80
               super_cache_scan+0x124/0x1a0
               do_shrink_slab+0x176/0x440
               shrink_slab+0x23a/0x2c0
               shrink_node+0x188/0x6e0
               balance_pgdat+0x31d/0x7f0
               kswapd+0x238/0x550
               kthread+0x120/0x140
               ret_from_fork+0x3a/0x50
      
        other info that might help us debug this:
      
        Chain exists of:
          &delayed_node->mutex --> pcpu_alloc_mutex --> fs_reclaim
      
         Possible unsafe locking scenario:
      
               CPU0                    CPU1
               ----                    ----
          lock(fs_reclaim);
                                       lock(pcpu_alloc_mutex);
                                       lock(fs_reclaim);
          lock(&delayed_node->mutex);
      
         *** DEADLOCK ***
      
        3 locks held by kswapd0/91:
         #0: (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x5/0x30
         #1: (shrinker_rwsem){++++}, at: shrink_slab+0x12f/0x2c0
         #2: (&type->s_umount_key#43){++++}, at: trylock_super+0x16/0x50
      
        stack backtrace:
        CPU: 1 PID: 91 Comm: kswapd0 Not tainted 5.6.0-rc7-btrfs-next-77 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
        Call Trace:
         dump_stack+0x8f/0xd0
         check_noncircular+0x170/0x190
         __lock_acquire+0xef0/0x1c80
         lock_acquire+0xa2/0x1d0
         __mutex_lock+0xa9/0xaf0
         __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
         btrfs_evict_inode+0x40d/0x560 [btrfs]
         evict+0xd9/0x1c0
         dispose_list+0x48/0x70
         prune_icache_sb+0x54/0x80
         super_cache_scan+0x124/0x1a0
         do_shrink_slab+0x176/0x440
         shrink_slab+0x23a/0x2c0
         shrink_node+0x188/0x6e0
         balance_pgdat+0x31d/0x7f0
         kswapd+0x238/0x550
         kthread+0x120/0x140
         ret_from_fork+0x3a/0x50
      
      This could be fixed by making btrfs pass GFP_NOFS instead of GFP_KERNEL
      to percpu_counter_init() in contexts where it is not reclaim safe.
      However, that type of approach has been discouraged since
      memalloc_[nofs|noio]_save() were introduced.  Therefore this change
      makes pcpu_alloc() check for an existing nofs/noio context before
      deciding whether it is in an atomic context or not.
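      The decision described above can be modelled in user space.  This is a
      minimal sketch, not the kernel code: every MODEL_* constant and model_*
      function is a hypothetical stand-in, and the flag values are made up
      for illustration.  It only shows why looking at the passed flags alone
      misses an active NOFS/NOIO scope.

```c
#include <stdbool.h>

/* Hypothetical flag bits for this model; the kernel's real values differ. */
#define MODEL_GFP_RECLAIM 0x1u
#define MODEL_GFP_IO      0x2u
#define MODEL_GFP_FS      0x4u
#define MODEL_GFP_KERNEL  (MODEL_GFP_RECLAIM | MODEL_GFP_IO | MODEL_GFP_FS)

/* Hypothetical per-task scope markers, standing in for what
 * memalloc_nofs_save() / memalloc_noio_save() record on the task. */
#define MODEL_TASK_NOFS 0x1u
#define MODEL_TASK_NOIO 0x2u

/* Narrow the caller's flags by the scope active on the current task. */
static unsigned int model_gfp_context(unsigned int gfp, unsigned int task_flags)
{
	if (task_flags & MODEL_TASK_NOIO)
		gfp &= ~(MODEL_GFP_IO | MODEL_GFP_FS);
	else if (task_flags & MODEL_TASK_NOFS)
		gfp &= ~MODEL_GFP_FS;
	return gfp;
}

/* Old behaviour: only the flags passed by the caller are considered. */
static bool model_is_atomic_old(unsigned int gfp)
{
	return (gfp & MODEL_GFP_KERNEL) != MODEL_GFP_KERNEL;
}

/* New behaviour: apply the task scope first, then decide. */
static bool model_is_atomic_new(unsigned int gfp, unsigned int task_flags)
{
	gfp = model_gfp_context(gfp, task_flags);
	return (gfp & MODEL_GFP_KERNEL) != MODEL_GFP_KERNEL;
}
```

      With MODEL_GFP_KERNEL passed inside a NOFS scope, the old check reports
      a non-atomic context (and would take the mutex), while the new check
      reports an atomic one.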
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: Dennis Zhou <dennis@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Link: http://lkml.kernel.org/r/20200430164356.15543-1-fdmanana@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/slub: fix incorrect interpretation of s->offset · cbfc35a4
      Waiman Long authored
      In a couple of places in the slub memory allocator, the code uses
      "s->offset" as a check to see if the free pointer is put right after the
      object.  That check is no longer valid since commit 3202fa62 ("slub:
      relocate freelist pointer to middle of object").
      
      As a result, echoing "1" into the validate sysfs file of a cache, e.g.
      dentry, may cause a bunch of "Freepointer corrupt" error reports like
      the following to appear, with the system panicking afterwards.
      
        =============================================================================
        BUG dentry(666:pmcd.service) (Tainted: G    B): Freepointer corrupt
        -----------------------------------------------------------------------------
      
      To fix it, use the check "s->offset == s->inuse" in the new helper
      function freeptr_outside_object() instead.  Also add another helper
      function get_info_end() to return the end of info block (inuse + free
      pointer if not overlapping with object).
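      The two helpers can be sketched as follows.  This is an illustration
      based only on the description above, not the code from the patch:
      struct model_cache is a hypothetical reduction of struct kmem_cache to
      the two fields involved.

```c
#include <stdbool.h>

/* Hypothetical, reduced stand-in for struct kmem_cache. */
struct model_cache {
	unsigned int inuse;   /* offset at which per-object metadata starts */
	unsigned int offset;  /* offset of the free pointer */
};

/* True only when the free pointer sits right after the object. */
static bool freeptr_outside_object(const struct model_cache *s)
{
	return s->offset == s->inuse;
}

/* End of the info block: inuse, plus the free pointer when it does not
 * overlap with the object. */
static unsigned int get_info_end(const struct model_cache *s)
{
	if (freeptr_outside_object(s))
		return s->inuse + sizeof(void *);
	return s->inuse;
}
```

      For a cache with inuse = 64 and the free pointer relocated to offset
      32, a plain "s->offset" test is non-zero and would wrongly treat the
      pointer as lying after the object, whereas the comparison with
      s->inuse does not.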
      
      Fixes: 3202fa62 ("slub: relocate freelist pointer to middle of object")
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Vitaly Nikolenko <vnik@duasynt.com>
      Cc: Silvio Cesare <silvio.cesare@gmail.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Markus Elfring <Markus.Elfring@web.de>
      Cc: Changbin Du <changbin.du@gmail.com>
      Link: http://lkml.kernel.org/r/20200429135328.26976-1-longman@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • scripts/gdb: repair rb_first() and rb_last() · 50e36be1
      Aymeric Agon-Rambosson authored
      The current implementations of the rb_first() and rb_last() gdb
      functions have a variable that references itself in its instantiation,
      which causes the function to throw an error if a specific condition on
      the argument is met.  The original author evidently intended to
      reference the argument and made a typo.  Referring to the argument
      instead makes the function work as intended.
      Signed-off-by: Aymeric Agon-Rambosson <aymeric.agon@yandex.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Stephen Boyd <swboyd@chromium.org>
      Cc: Jan Kiszka <jan.kiszka@siemens.com>
      Cc: Kieran Bingham <kbingham@kernel.org>
      Cc: Douglas Anderson <dianders@chromium.org>
      Cc: Nikolay Borisov <n.borisov.lkml@gmail.com>
      Cc: Jackie Liu <liuyun01@kylinos.cn>
      Cc: Jason Wessel <jason.wessel@windriver.com>
      Link: http://lkml.kernel.org/r/20200427051029.354840-1-aymeric.agon@yandex.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • eventpoll: fix missing wakeup for ovflist in ep_poll_callback · 0c54a6a4
      Khazhismel Kumykov authored
      In the event that we add to ovflist, before commit 339ddb53
      ("fs/epoll: remove unnecessary wakeups of nested epoll") we would be
      woken up by ep_scan_ready_list, and did no wakeup in ep_poll_callback.
      
      With that wakeup removed, if we add to ovflist here, we may never wake
      up.  Rather than adding back the ep_scan_ready_list wakeup, which was
      resulting in unnecessary wakeups, trigger a wake-up in ep_poll_callback.
      
      We noticed that one of our workloads was missing wakeups starting with
      339ddb53 and upon manual inspection, this wakeup seemed missing to me.
      With this patch added, we no longer see missing wakeups.  I haven't yet
      tried to make a small reproducer, but the existing kselftests in
      filesystem/epoll passed for me with this patch.
      
      [khazhy@google.com: use if/elif instead of goto + cleanup suggested by Roman]
        Link: http://lkml.kernel.org/r/20200424190039.192373-1-khazhy@google.com
      Fixes: 339ddb53 ("fs/epoll: remove unnecessary wakeups of nested epoll")
      Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Roman Penyaev <rpenyaev@suse.de>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Roman Penyaev <rpenyaev@suse.de>
      Cc: Heiher <r@hev.cc>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200424025057.118641-1-khazhy@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • arch/x86/kvm/svm/sev.c: change flag passed to GUP fast in sev_pin_memory() · 996ed22c
      Janakarajan Natarajan authored
      When trying to lock read-only pages, sev_pin_memory() fails because
      FOLL_WRITE is used as the flag for get_user_pages_fast().
      
      Commit 73b0140b ("mm/gup: change GUP fast to use flags rather than a
      write 'bool'") updated the get_user_pages_fast() call sites to use
      flags, but incorrectly updated the call in sev_pin_memory().  As the
      original coding of this call was correct, revert the change made by that
      commit.
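      
      One way to express the conversion from a write 'bool' to GUP flags is
      sketched below.  This is an assumption-laden illustration, not the
      patch itself: MODEL_FOLL_WRITE is a placeholder value and
      model_gup_flags() is a hypothetical helper.

```c
/* Placeholder bit for this sketch; stands in for the kernel's FOLL_WRITE. */
#define MODEL_FOLL_WRITE 0x1u

/* Request write access only when the caller actually asked for it, so
 * that read-only pages can still be pinned. */
static unsigned int model_gup_flags(int write)
{
	return write ? MODEL_FOLL_WRITE : 0;
}
```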
      
      Fixes: 73b0140b ("mm/gup: change GUP fast to use flags rather than a write 'bool'")
      Signed-off-by: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Wanpeng Li <wanpengli@tencent.com>
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Mike Marshall <hubcap@omnibond.com>
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Link: http://lkml.kernel.org/r/20200423152419.87202-1-Janakarajan.Natarajan@amd.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • scripts/decodecode: fix trapping instruction formatting · e08df079
      Ivan Delalande authored
      If the trapping instruction contains a ':', for a memory access through
      segment registers for example, the sed substitution will insert the '*'
      marker in the middle of the instruction instead of the line address:
      
      	2b:   65 48 0f c7 0f          cmpxchg16b %gs:*(%rdi)          <-- trapping instruction
      
      I started to think I had forgotten some quirk of the assembly syntax
      before noticing that it was actually coming from the script.  Fix it to
      add the address marker at the right place for these instructions:
      
      	28:   49 8b 06                mov    (%r14),%rax
      	2b:*  65 48 0f c7 0f          cmpxchg16b %gs:(%rdi)           <-- trapping instruction
      	30:   0f 94 c0                sete   %al
      
      Fixes: 18ff44b1 ("scripts/decodecode: make faulting insn ptr more robust")
      Signed-off-by: Ivan Delalande <colona@arista.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Borislav Petkov <bp@suse.de>
      Link: http://lkml.kernel.org/r/20200419223653.GA31248@visor
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_alloc: fix watchdog soft lockups during set_zone_contiguous() · e84fe99b
      David Hildenbrand authored
      Without CONFIG_PREEMPT, it can happen that we get soft lockups detected,
      e.g., while booting up.
      
        watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:1]
        CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.6.0-next-20200331+ #4
        Hardware name: Red Hat KVM, BIOS 1.11.1-4.module+el8.1.0+4066+0f1aadab 04/01/2014
        RIP: __pageblock_pfn_to_page+0x134/0x1c0
        Call Trace:
         set_zone_contiguous+0x56/0x70
         page_alloc_init_late+0x166/0x176
         kernel_init_freeable+0xfa/0x255
         kernel_init+0xa/0x106
         ret_from_fork+0x35/0x40
      
      The issue becomes visible when having a lot of memory (e.g., 4TB)
      assigned to a single NUMA node - a system that can easily be created
      using QEMU.  Inside VMs on a hypervisor with quite some memory
      overcommit, this is fairly easy to trigger.
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
      Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Reviewed-by: Shile Zhang <shile.zhang@linux.alibaba.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Shile Zhang <shile.zhang@linux.alibaba.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Alexander Duyck <alexander.duyck@gmail.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200416073417.5003-1-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memcg: fix error return value of mem_cgroup_css_alloc() · 11d67612
      Yafang Shao authored
      When I ran my memcg testcase, which creates lots of memcgs, I found
      unexpected out-of-memory logs even though there was still enough
      free memory available.  The error log is
      
        mkdir: cannot create directory 'foo.65533': Cannot allocate memory
      
      The reason is that when we try to create more than MEM_CGROUP_ID_MAX
      memcgs, mem_cgroup_css_alloc() sets an -ENOMEM errno, but the right
      errno is -ENOSPC ("No space left on device"), which is an appropriate
      errno for userspace's failed mkdir.
      
      As the errno really misled me, we should make it right.  After this
      patch, the error log will be
      
        mkdir: cannot create directory 'foo.65533': No space left on device
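      
      The distinction can be illustrated with a small user-space model.  This
      is only a sketch: MODEL_ID_MAX, model_ids_in_use and model_alloc_id()
      are hypothetical names, and the tiny limit stands in for
      MEM_CGROUP_ID_MAX.

```c
#include <errno.h>

/* Hypothetical small limit for the model; stands in for MEM_CGROUP_ID_MAX. */
#define MODEL_ID_MAX 4

/* Number of IDs handed out so far. */
static int model_ids_in_use;

/* Return a new ID in 1..MODEL_ID_MAX, or a negative errno on failure. */
static int model_alloc_id(void)
{
	if (model_ids_in_use >= MODEL_ID_MAX)
		return -ENOSPC;   /* the ID space is exhausted, not memory */
	return ++model_ids_in_use;
}
```

      Returning -ENOSPC here is what lets mkdir report "No space left on
      device" instead of "Cannot allocate memory" once the ID space runs out.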
      
      [akpm@linux-foundation.org: s/EBUSY/ENOSPC/, per Michal]
      Fixes: 73f576c0 ("mm: memcontrol: fix cgroup creation failure after many small jobs")
      Suggested-by: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@kernel.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Link: http://lkml.kernel.org/r/20200407063621.GA18914@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/1586192163-20099-1-git-send-email-laoar.shao@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ipc/mqueue.c: change __do_notify() to bypass check_kill_permission() · b5f20061
      Oleg Nesterov authored
      Commit cc731525 ("signal: Remove kernel interal si_code magic")
      changed the value of SI_FROMUSER(SI_MESGQ).  As a result, mq_notify()
      no longer works if the sender doesn't have the right to send a signal.
      
      Change __do_notify() to use do_send_sig_info() instead of kill_pid_info()
      to avoid check_kill_permission().
      
      This needs the additional notify.sigev_signo != 0 check.  Shouldn't we
      change do_mq_notify() to deny sigev_signo == 0?
      
      Test-case:
      
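      	#include <signal.h>
      	#include <fcntl.h>	/* O_RDWR, O_CREAT */
      	#include <mqueue.h>
      	#include <unistd.h>
      	#include <sys/wait.h>
      	#include <assert.h>
      
      	/* Written from the signal handler, hence volatile sig_atomic_t. */
      	static volatile sig_atomic_t notified;
      
      	static void sigh(int sig)
      	{
      		notified = 1;
      	}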
      
      	int main(void)
      	{
      		signal(SIGIO, sigh);
      
      		int fd = mq_open("/mq", O_RDWR|O_CREAT, 0666, NULL);
      		assert(fd >= 0);
      
      		struct sigevent se = {
      			.sigev_notify	= SIGEV_SIGNAL,
      			.sigev_signo	= SIGIO,
      		};
      		assert(mq_notify(fd, &se) == 0);
      
      		if (!fork()) {
      			assert(setuid(1) == 0);
      			mq_send(fd, "",1,0);
      			return 0;
      		}
      
      		wait(NULL);
      		mq_unlink("/mq");
      		assert(notified);
      		return 0;
      	}
      
      [manfred@colorfullife.com: 1) Add self_exec_id evaluation so that the implementation matches do_notify_parent 2) use PIDTYPE_TGID everywhere]
      Fixes: cc731525 ("signal: Remove kernel interal si_code magic")
      Reported-by: Yoji <yoji.fujihar.min@gmail.com>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Markus Elfring <elfring@users.sourceforge.net>
      Cc: <1vier1@web.de>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/e2a782e4-eab9-4f5c-c749-c07a8f7a4e66@colorfullife.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 07 May, 2020 6 commits