1. 17 Jun, 2021 1 commit
    • percpu: optimize locking in pcpu_balance_workfn() · e4d77700
      Roman Gushchin authored
      pcpu_balance_workfn() unconditionally calls pcpu_balance_free(),
      pcpu_reclaim_populated(), pcpu_balance_populated() and
      pcpu_balance_free() again.
      
      Each call to pcpu_balance_free() and pcpu_reclaim_populated() will
      cause at least one acquisition of the pcpu_lock. So even if the
      balancing was scheduled because of a failed atomic allocation,
      pcpu_lock will be acquired at least 4 times. This obviously
      increases the contention on the pcpu_lock.
      
      To optimize the scheme let's grab the pcpu_lock on the upper level
      (in pcpu_balance_workfn()) and keep it generally locked for the whole
      duration of the scheduled work, but release conditionally to perform
      any slow operations like chunk (de)population and creation of new
      chunks.
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
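The locking scheme in the commit above can be illustrated with a small user-space sketch. It is not the kernel code: the names demo_lock, demo_step() and demo_balance() are hypothetical, and a pthread mutex stands in for pcpu_lock. The caller takes the lock once and each step drops it only around slow work, so a run with no slow work costs a single acquisition.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-ins for illustration; not the kernel's pcpu_lock. */
static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;
static int demo_lock_acquisitions;      /* protected by demo_lock */

/*
 * One balancing step. Must be called with demo_lock held and returns
 * with it held. The lock is dropped only when slow work is needed.
 */
static void demo_step(bool has_slow_work)
{
    if (!has_slow_work)
        return;                         /* fast path: lock stays held */

    pthread_mutex_unlock(&demo_lock);
    /* A slow operation (e.g. populating pages) would run here, unlocked. */
    pthread_mutex_lock(&demo_lock);
    demo_lock_acquisitions++;
}

/*
 * Top-level work function: take the lock once, run every step, release.
 * Returns how many times the lock was acquired during the run.
 */
static int demo_balance(const bool *slow, int nsteps)
{
    int acquisitions;

    pthread_mutex_lock(&demo_lock);
    demo_lock_acquisitions = 1;         /* the acquisition just made */
    for (int i = 0; i < nsteps; i++)
        demo_step(slow[i]);
    acquisitions = demo_lock_acquisitions;
    pthread_mutex_unlock(&demo_lock);

    return acquisitions;
}
```

With four steps and no slow work, demo_balance() acquires the lock once, where a lock-per-step scheme would acquire it at least four times. Each step that has slow work adds exactly one re-acquisition.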
  2. 14 Jun, 2021 1 commit
  3. 05 Jun, 2021 3 commits
    • percpu: rework memcg accounting · faf65dde
      Roman Gushchin authored
      The current implementation of the memcg accounting of the percpu
      memory is based on the idea of having two separate sets of chunks for
      accounted and non-accounted memory. This approach has an advantage
      of not wasting any extra memory for memcg data for non-accounted
      chunks, however it complicates the code and leads to a higher number
      of chunks due to lower chunk utilization.
      
      Instead of having two chunk types it's possible to declare all* chunks
      memcg-aware unless the kernel memory accounting is disabled globally
      by a boot option. The size of objcg_array is usually small in
      comparison to chunks themselves (it obviously depends on the number of
      CPUs), so even if some chunk will have no accounted allocations, the
      memory waste isn't significant and will likely be compensated by
      a higher chunk utilization. Also, with time more and more percpu
      allocations will likely become accounted.
      
      * The first chunk is initialized before the memory cgroup subsystem,
        so we don't know for sure whether we need to allocate obj_cgroups.
        Because it's small, let's make it free for use. Then we don't need
        to allocate obj_cgroups for it.
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
    • mm, memcg: introduce mem_cgroup_kmem_disabled() · 4d5c8aed
      Roman Gushchin authored
      Introduce a new mem_cgroup_kmem_disabled() helper, similar to
      mem_cgroup_disabled(), to check whether the kernel memory accounting
      is off. A user could disable it using a boot option to eliminate
      some associated costs.
      
      The helper can be used outside of memcontrol.c to dynamically disable
      the kmem-related code. The returned value is stable after the kernel
      initialization is finished.
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
    • mm, memcg: mark cgroup_memory_nosocket, nokmem and noswap as __ro_after_init · 0f0cace3
      Roman Gushchin authored
      cgroup_memory_nosocket, cgroup_memory_nokmem and cgroup_memory_noswap
      are initialized during the kernel initialization and never change
      their value afterwards.
      
      cgroup_memory_nosocket, cgroup_memory_nokmem are written only from
      cgroup_memory(), which is marked as __init.
      
      cgroup_memory_noswap is written from setup_swap_account() and
      mem_cgroup_swap_init(), both are marked as __init.
      
      Mark all three variables as __ro_after_init.
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
  4. 14 May, 2021 1 commit
  5. 21 Apr, 2021 3 commits
    • percpu: implement partial chunk depopulation · f1833241
      Roman Gushchin authored
      From Roman ("percpu: partial chunk depopulation"):
      In our [Facebook] production experience the percpu memory allocator
      sometimes struggles to return memory to the system. A typical
      example is the creation of several thousand memory cgroups (each has
      several chunks of percpu data used for vmstats, vmevents,
      ref counters etc). Deleting and completely releasing these cgroups
      doesn't always shrink the percpu memory, so
      sometimes several GBs of memory are wasted.
      
      The underlying problem is the fragmentation: to release an underlying
      chunk all percpu allocations should be released first. The percpu
      allocator tends to top up chunks to improve the utilization. It means
      new small-ish allocations (e.g. percpu ref counters) are placed onto
      almost filled old-ish chunks, effectively pinning them in memory.
      
      This patchset solves this problem by implementing a partial depopulation
      of percpu chunks: chunks with many empty pages are being asynchronously
      depopulated and the pages are returned to the system.
      
      To illustrate the problem the following script can be used:
      --
      
      cd /sys/fs/cgroup
      
      mkdir percpu_test
      echo "+memory" > percpu_test/cgroup.subtree_control
      
      cat /proc/meminfo | grep Percpu
      
      for i in `seq 1 1000`; do
          mkdir percpu_test/cg_"${i}"
          for j in `seq 1 10`; do
      	mkdir percpu_test/cg_"${i}"_"${j}"
          done
      done
      
      cat /proc/meminfo | grep Percpu
      
      for i in `seq 1 1000`; do
          for j in `seq 1 10`; do
      	rmdir percpu_test/cg_"${i}"_"${j}"
          done
      done
      
      sleep 10
      
      cat /proc/meminfo | grep Percpu
      
      for i in `seq 1 1000`; do
          rmdir percpu_test/cg_"${i}"
      done
      
      rmdir percpu_test
      --
      
      It creates 11000 memory cgroups and removes 10 out of every 11.
      It prints the initial size of the percpu memory, the size after
      creating all cgroups and the size after deleting most of them.
      
      Results:
        vanilla:
          ./percpu_test.sh
          Percpu:             7488 kB
          Percpu:           481152 kB
          Percpu:           481152 kB
      
        with this patchset applied:
          ./percpu_test.sh
          Percpu:             7488 kB
          Percpu:           481408 kB
          Percpu:           135552 kB
      
      The total size of the percpu memory was reduced by more than 3.5 times.
      
      This patch:
      
      This patch implements partial depopulation of percpu chunks.
      
      As of now, a chunk can be depopulated only as a part of the final
      destruction, if there are no more outstanding allocations. However,
      to minimize memory waste it might be useful to depopulate a
      partially filled chunk, if a small number of outstanding allocations
      prevents the chunk from being fully reclaimed.
      
      This patch implements the following depopulation process: it scans
      over the chunk pages, looks for a range of pages that are both empty
      and populated, and depopulates it. To avoid races with new allocations,
      the chunk is isolated beforehand. After the depopulation the chunk is
      sidelined to a special list or freed. New allocations prefer using
      active chunks to sidelined chunks. If a sidelined chunk is used, it is
      reintegrated into the active lists.
      
      The depopulation is scheduled on the free path if all of the
      following hold:
        1) the chunk has more than 1/4 of its total pages free and populated
        2) the system has enough free percpu pages aside from this chunk
        3) the chunk isn't the reserved chunk
        4) the chunk isn't the first chunk
      If the chunk is already depopulated but has gained free populated
      pages, it's a good target too. The chunk is moved to a special slot,
      pcpu_to_depopulate_slot, chunk->isolated is set, and the balance work
      item is scheduled. On isolation, these pages are removed from
      pcpu_nr_empty_pop_pages. The chunk is placed back in the
      to_depopulate_slot whenever it meets these qualifications.
      
      pcpu_reclaim_populated() iterates over the to_depopulate_slot until it
      becomes empty. The depopulation is performed in the reverse direction to
      keep populated pages close to the beginning. Depopulated chunks are
      sidelined to preferentially avoid them for new allocations. When no
      active chunk can satisfy a new allocation, sidelined chunks are
      checked first before creating a new chunk.
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Co-developed-by: Dennis Zhou <dennis@kernel.org>
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Tested-by: Pratik Sampat <psampat@linux.ibm.com>
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
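The scheduling conditions and the reverse scan described in the commit above can be sketched in plain C. This is a simplified model, not the kernel implementation: struct demo_chunk, the min_reserve threshold and both function names are hypothetical, and the sketch assumes the system-wide counter includes the chunk's own empty populated pages.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a percpu chunk. */
struct demo_chunk {
    size_t nr_pages;            /* total pages in the chunk */
    size_t nr_empty_pop_pages;  /* pages that are populated but unused */
    bool is_reserved;           /* the reserved chunk */
    bool is_first;              /* the first chunk */
};

/*
 * Check the four conditions from the commit message. min_reserve models
 * "enough free percpu pages aside from this chunk"; its value is an
 * assumption of this sketch.
 */
static bool demo_should_depopulate(const struct demo_chunk *c,
                                   size_t system_empty_pop_pages,
                                   size_t min_reserve)
{
    if (c->is_reserved || c->is_first)
        return false;
    /* More than 1/4 of the total pages must be free and populated. */
    if (c->nr_empty_pop_pages * 4 <= c->nr_pages)
        return false;
    if (system_empty_pop_pages < c->nr_empty_pop_pages)
        return false;
    return system_empty_pop_pages - c->nr_empty_pop_pages >= min_reserve;
}

/*
 * Walk the pages from the end towards the beginning, find each run of
 * pages that are populated and not in use, and depopulate the run.
 * Returns the number of pages depopulated.
 */
static size_t demo_depopulate_reverse(bool *populated, const bool *in_use,
                                      size_t nr_pages)
{
    size_t freed = 0;
    size_t end = nr_pages;

    while (end > 0) {
        size_t start;

        if (!populated[end - 1] || in_use[end - 1]) {
            end--;
            continue;
        }
        /* Extend the run downwards while pages stay empty and populated. */
        start = end - 1;
        while (start > 0 && populated[start - 1] && !in_use[start - 1])
            start--;
        for (size_t i = start; i < end; i++)
            populated[i] = false;
        freed += end - start;
        end = start;
    }
    return freed;
}
```

Scanning from the end means the pages that stay populated because they are still in use tend to sit near the start of the chunk, which matches the stated goal of keeping populated pages close to the beginning.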
    • percpu: use pcpu_free_slot instead of pcpu_nr_slots - 1 · 1c29a3ce
      Dennis Zhou authored
      This prepares for adding a to_depopulate list and sidelined list after
      the free slot in the set of lists in pcpu_slot.
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Acked-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
    • percpu: factor out pcpu_check_block_hint() · 8ea2e1e3
      Roman Gushchin authored
      Factor out the pcpu_check_block_hint() helper, which will be useful
      in the future. The new function checks if the allocation can likely
      fit within the contig hint.
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
  6. 16 Apr, 2021 2 commits
  7. 11 Apr, 2021 4 commits
  8. 10 Apr, 2021 10 commits
    • Merge branch 'for-5.12-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu · 52e44129
      Linus Torvalds authored
      Pull percpu fix from Dennis Zhou:
       "This contains a fix for sporadically failing atomic percpu
        allocations.
      
        I only caught it recently while I was reviewing a new series [1] and
        simultaneously saw reports by btrfs in xfstests [2] and [3].
      
        In v5.9, memcg accounting was extended to percpu by adding a
        second type of chunk. I missed an interaction with the free page float
        count used to ensure we can support atomic allocations. If one type of
        chunk has no free pages, but the other has enough to satisfy the free
        page float requirement, we will not repopulate the free pages for the
        former type of chunk. This led to the sporadically failing atomic
        allocations"
      
      Link: https://lore.kernel.org/linux-mm/20210324190626.564297-1-guro@fb.com/ [1]
      Link: https://lore.kernel.org/linux-mm/20210401185158.3275.409509F4@e16-tech.com/ [2]
      Link: https://lore.kernel.org/linux-mm/CAL3q7H5RNBjCi708GH7jnczAOe0BLnacT9C+OBgA-Dx9jhB6SQ@mail.gmail.com/ [3]
      
      * 'for-5.12-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu:
        percpu: make pcpu_nr_empty_pop_pages per chunk type
    • Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · efc2da92
      Linus Torvalds authored
      Pull SCSI fixes from James Bottomley:
       "Seven fixes, all in drivers.
      
        The hpsa three are the most extensive and the most problematic: it's a
        packed structure misalignment that oopses on ia64 but looks like it
        would also oops on quite a few non-x86 architectures.
      
        The pm80xx is a regression and the rest are bug fixes for patches in
        the misc tree"
      
      * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
        scsi: scsi_transport_srp: Don't block target in SRP_PORT_LOST state
        scsi: target: iscsi: Fix zero tag inside a trace event
        scsi: pm80xx: Fix chip initialization failure
        scsi: ufs: core: Fix wrong Task Tag used in task management request UPIUs
        scsi: ufs: core: Fix task management request completion timeout
        scsi: hpsa: Add an assert to prevent __packed reintroduction
        scsi: hpsa: Fix boot on ia64 (atomic_t alignment)
        scsi: hpsa: Use __packed on individual structs, not header-wide
    • Merge tag 'powerpc-5.12-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · 95c7b075
      Linus Torvalds authored
      Pull powerpc fixes from Michael Ellerman:
       "Some more powerpc fixes for 5.12:
      
         - Fix an oops triggered by ptrace when CONFIG_PPC_FPU_REGS=n
      
         - Fix an oops on sigreturn when the VDSO is unmapped on 32-bit
      
         - Fix vdso_wrapper.o not being rebuilt every time vdso.so is rebuilt
      
        Thanks to Christophe Leroy"
      
      * tag 'powerpc-5.12-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
        powerpc/vdso: Make sure vdso_wrapper.o is rebuilt everytime vdso.so is rebuilt
        powerpc/signal32: Fix Oops on sigreturn with unmapped VDSO
        powerpc/ptrace: Don't return error when getting/setting FP regs without CONFIG_PPC_FPU_REGS
    • Merge tag 'driver-core-5.12-rc7' of... · d5fa1dad
      Linus Torvalds authored
      Merge tag 'driver-core-5.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
      
      Pull driver core fix from Greg KH:
       "Here is a single driver core fix for 5.12-rc7 to resolve a reported
        problem that caused some devices to lock up when booting. It has been
        in linux-next with no reported issues"
      
      * tag 'driver-core-5.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
        driver core: Fix locking bug in deferred_probe_timeout_work_func()
    • Merge tag 'usb-5.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb · 445e09e7
      Linus Torvalds authored
      Pull USB/Thunderbolt fixes from Greg KH:
       "Here are a few small USB and Thunderbolt driver fixes for 5.12-rc7 for
        reported issues:
      
         - thunderbolt leaks and off-by-one fix
      
         - cdnsp deque fix
      
         - usbip fixes for syzbot-reported issues
      
        All have been in linux-next with no reported problems"
      
      * tag 'usb-5.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
        usbip: synchronize event handler with sysfs code paths
        usbip: vudc synchronize sysfs code paths
        usbip: stub-dev synchronize sysfs code paths
        usbip: add sysfs_lock to synchronize sysfs code paths
        thunderbolt: Fix off by one in tb_port_find_retimer()
        thunderbolt: Fix a leak in tb_retimer_add()
        usb: cdnsp: Fixes issue with dequeuing requests after disabling endpoint
    • Merge branch 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux · 12a0cf72
      Linus Torvalds authored
      Pull i2c fixes from Wolfram Sang:
       "A mixture of driver and documentation bugfixes for I2C"
      
      * 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
        i2c: imx: mention Oleksij as maintainer of the binding docs
        i2c: exynos5: correct top kerneldoc
        i2c: designware: Adjust bus_freq_hz when refuse high speed mode set
        i2c: hix5hd2: use the correct HiSilicon copyright
        i2c: gpio: update email address in binding docs
        i2c: imx: drop me as maintainer of binding docs
        i2c: stm32f4: Mundane typo fix
        I2C: JZ4780: Fix bug for Ingenic X1000.
        i2c: turn recovery error on init to debug
    • btrfs: zoned: move superblock logging zone location · 53b74fa9
      Naohiro Aota authored
      Moves the location of the superblock logging zones. The new locations of
      the logging zones are now determined based on fixed block addresses
      instead of on fixed zone numbers.
      
      The old placement method based on fixed zone numbers causes problems when
      one needs to inspect a file system image without access to the drive zone
      information. In such case, the super block locations cannot be reliably
      determined as the zone size is unknown. By locating the superblock logging
      zones using fixed addresses, we can scan a dumped file system image without
      the zone information since a super block copy will always be present at or
      after the fixed known locations.
      
      Introduce the following three pairs of zones containing fixed offset
      locations, regardless of the device zone size.
      
        - primary superblock: offset   0B (and the following zone)
        - first copy:         offset 512G (and the following zone)
        - second copy:        offset   4T (4096G, and the following zone)
      
      If a logging zone is outside of the disk capacity, we do not record the
      superblock copy.
      
      The first copy position is much larger than for a non-zoned filesystem,
      which is at 64M.  This is to avoid overlapping with the log zones for
      the primary superblock. This higher location is arbitrary but allows
      supporting devices with very large zone sizes, plus some space around in
      between.
      
      Such a large zone size is unrealistic and very unlikely to ever be seen in
      real devices. Currently, SMR disks have a zone size of 256MB, and we are
      expecting ZNS drives to be in the 1-4GB range, so this limit gives us
      room to breathe. For now, we only allow zone sizes up to 8GB. The
      maximum zone size that would still fit in the space is 256G.
      
      The fixed location addresses are somewhat arbitrary, with the intent of
      maintaining superblock reliability for smaller and larger devices, with
      the preference for the latter. For this reason, there are two superblocks
      under the first 1T. This should cover use cases for physical devices and
      for emulated/device-mapper devices.
      
      The superblock logging zones are reserved for superblock logging and
      never used for data or metadata blocks. Note that we only reserve the
      two zones per primary/copy actually used for superblock logging. We do
      not reserve the ranges of zones possibly containing superblocks with the
      largest supported zone size (0-16GB, 512G-528GB, 4096G-4112G).
      
      The zones containing the fixed location offsets used to store
      superblocks on a non-zoned volume are also reserved to avoid confusion.
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
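The fixed-offset placement in the commit above reduces to simple arithmetic, sketched here in plain C. The helper name demo_sb_log_zones() and its signature are hypothetical and this is not the btrfs code. The sketch also assumes that a copy is skipped unless both of its logging zones fit within the device, which is one reading of "outside of the disk capacity".

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_GIB (UINT64_C(1) << 30)

/* Fixed byte offsets from the commit message: 0B, 512G and 4T (4096G). */
static const uint64_t demo_sb_offsets[3] = {
    0,
    512 * DEMO_GIB,
    4096 * DEMO_GIB,
};

/*
 * Compute the first of the two logging zones for superblock copy
 * `mirror` (0 = primary, 1 = first copy, 2 = second copy).
 * Returns false if the copy would not be recorded on this device.
 */
static bool demo_sb_log_zones(unsigned int mirror, uint64_t zone_size,
                              uint64_t dev_size, uint64_t *first_zone)
{
    uint64_t zone;

    if (mirror >= 3 || zone_size == 0)
        return false;

    /* The zone containing the fixed offset, plus the following zone. */
    zone = demo_sb_offsets[mirror] / zone_size;
    if ((zone + 2) * zone_size > dev_size)
        return false;

    *first_zone = zone;
    return true;
}
```

With a 256G zone size the primary superblock's two logging zones cover 0 to 512G and the first copy starts at zone 2, so the two pairs just avoid overlapping. This is consistent with the commit's statement that 256G is the largest zone size that still fits.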
    • Merge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux · d4961772
      Linus Torvalds authored
      Pull clk fixes from Stephen Boyd:
       "Here's the latest pile of clk driver and clk framework fixes for this
        release:
      
         - Two clk framework fixes for a long standing issue in
           clk_notifier_{register,unregister}() where we used a pointer that
           was for a struct containing a list head when there was no container
           struct
      
         - A compile warning fix for socfpga that's good to have
      
         - A double free problem with devm registered fixed factor clks
      
         - One last fix to the Qualcomm camera clk driver to use the right clk
           ops so clks don't get stuck and stop working because the firmware
           takes them for a ride"
      
      * tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
        clk: fixed: fix double free in resource managed fixed-factor clock
        clk: fix invalid usage of list cursor in unregister
        clk: fix invalid usage of list cursor in register
        clk: qcom: camcc: Update the clock ops for the SC7180
        clk: socfpga: fix iomem pointer cast on 64-bit
    • Merge tag 'perf-tools-fixes-for-v5.12-2020-04-09' of... · 9288e1f7
      Linus Torvalds authored
      Merge tag 'perf-tools-fixes-for-v5.12-2020-04-09' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux
      
      Pull perf tool fixes from Arnaldo Carvalho de Melo:
      
       - Fix wrong LBR block sorting in 'perf report'
      
       - Fix 'perf inject' repipe usage when consuming perf.data files
      
       - Avoid potential buffer overrun when decoding ARM SPE hardware tracing
         packets, bug found using a fuzzer
      
      * tag 'perf-tools-fixes-for-v5.12-2020-04-09' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux:
        perf arm-spe: Avoid potential buffer overrun
        perf report: Fix wrong LBR block sorting
        perf inject: Fix repipe usage
    • Merge branch 'akpm' (patches from Andrew) · adb2c417
      Linus Torvalds authored
      Merge misc fixes from Andrew Morton:
       "14 patches.
      
        Subsystems affected by this patch series: mm (kasan, gup, pagecache,
        and kfence), MAINTAINERS, mailmap, nds32, gcov, ocfs2, ia64, and lib"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        lib: fix kconfig dependency on ARCH_WANT_FRAME_POINTERS
        kfence, x86: fix preemptible warning on KPTI-enabled systems
        lib/test_kasan_module.c: suppress unused var warning
        kasan: fix conflict with page poisoning
        fs: direct-io: fix missing sdio->boundary
        ia64: fix user_stack_pointer() for ptrace()
        ocfs2: fix deadlock between setattr and dio_end_io_write
        gcov: re-fix clang-11+ support
        nds32: flush_dcache_page: use page_mapping_file to avoid races with swapoff
        mm/gup: check page posion status for coredump.
        .mailmap: fix old email addresses
        mailmap: update email address for Jordan Crouse
        treewide: change my e-mail address, fix my name
        MAINTAINERS: update CZ.NIC's Turris information
  9. 09 Apr, 2021 15 commits