Commits · d3132792285859253c466354fd8d54d1fe0ba786 · nexedi / linux

16 Jun, 2020 1 commit

HID: usbhid: do not sleep when opening device · d3132792

Dmitry Torokhov authored Jun 09, 2020

usbhid tries to give the device 50 milliseconds to drain its queues when
opening the device, but does it naively by simply sleeping in open handler,
which slows down device probing (and thus may affect overall boot time).

However we do not need to sleep as we can instead mark a point of time in
the future when we should start processing the events.
Reported-by: Nicolas Boichat <drinkcat@chromium.org>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Reviewed-by: Guenter Roeck <groeck@chromium.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>

d3132792

04 Jun, 2020 39 commits

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid · a789d5f8

Linus Torvalds authored Jun 04, 2020

Pull HID updates from Jiri Kosina:

 - hid-mcp2221 GPIO support, from Rishi Gupta

 - MT_CLS_WIN_8_DUAL obsolete quirk removal from hid-multitouch, from
   Kai-Heng Feng

 - a bunch of new hardware support to hid-asus driver, from Hans de
   Goede

 - other assorted small fixes, cleanups and device-specific quirks

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid:
  HID: multitouch: Remove MT_CLS_WIN_8_DUAL
  HID: multitouch: enable multi-input as a quirk for some devices
  HID: sony: Fix for broken buttons on DS3 USB dongles
  HID: Add quirks for Trust Panora Graphic Tablet
  HID: apple: Swap the Fn and Left Control keys on Apple keyboards
  HID: asus: Add depends on USB_HID to HID_ASUS Kconfig option
  HID: asus: Fix mute and touchpad-toggle keys on Medion Akoya E1239T
  HID: asus: Add support for multi-touch touchpad on Medion Akoya E1239T
  HID: asus: Add report_size to struct asus_touchpad_info
  HID: asus: Add hid_is_using_ll_driver(usb_hid_driver) check
  HID: asus: Simplify skipping of mappings for Asus T100CHI keyboard-dock
  HID: asus: Only set EV_REP if we are adding a mapping
  HID: i2c-hid: add Schneider SCL142ALM to descriptor override
  HID: intel-ish-hid: avoid bogus uninitialized-variable warning
  HID: mcp2221: add GPIO functionality support
  HID: fix typo in Kconfig
  HID: logitech: drop outdated references to unifying receivers

a789d5f8

Merge tag 'sound-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound · 631d6914

Linus Torvalds authored Jun 04, 2020

Pull sound updates from Takashi Iwai:
 "It was another busy development cycle, and the majority of changes are
  found in ASoC side. Below are Some highlights.

  ASoC core:
   - Lots of core cleanups and refactorings, still on-going work by
     Morimoto-san

  ASoC drivers:
   - Continued work on cleaning up and improving the Intel SOF stuff,
     along with new platform support including SoundWire

   - Fixes to make the Marvell SSPA driver work upstream

   - Support for AMD Renoir ACP, Dialog DA7212, Freescale EASRC and
     i.MX8M, Intel Elkhard Lake, Maxim MAX98390, Nuvoton NAU8812 and
     NAU8814 and Realtek RT1016.

  USB-audio:
   - Improvement for sync and implicit feedback streams with the more
     accurate frame size calculation and full-duplex support

   - Support for RME Babyface Pro and Prioneer DJ DJM

  HD-audio:
   - Fixes for Mic mute LED on HP machines

   - Re-enable support of Intel SST driver for SKL/KBL platforms

  FireWire:
   - Lots of refactoring, add support for RME FireFace and MOTU
     UltraLite-mk3"

* tag 'sound-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (428 commits)
  ALSA: es1688: Add the missed snd_card_free()
  ALSA: hda: add sienna_cichlid audio asic id for sienna_cichlid up
  ALSA: usb-audio: Add Pioneer DJ DJM-900NXS2 support
  ASoC: qcom: q6asm-dai: kCFI fix
  ASoC: soc-card: add snd_soc_card_remove_dai_link()
  ASoC: soc-card: add snd_soc_card_add_dai_link()
  ASoC: soc-card: add snd_soc_card_set_bias_level_post()
  ASoC: soc-card: add snd_soc_card_set_bias_level()
  ASoC: soc-card: add snd_soc_card_remove()
  ASoC: soc-card: add snd_soc_card_late_probe()
  ASoC: soc-card: add snd_soc_card_probe()
  ASoC: soc-card: add probed bit field to snd_soc_card
  ASoC: soc-card: add snd_soc_card_resume_post()
  ASoC: soc-card: add snd_soc_card_resume_pre()
  ASoC: soc-card: add snd_soc_card_suspend_post()
  ASoC: soc-card: add snd_soc_card_suspend_pre()
  ASoC: soc-card: move snd_soc_card_subclass to soc-card
  ASoC: soc-card: move snd_soc_card_get_codec_dai() to soc-card
  ASoC: soc-card: move snd_soc_card_set/get_drvdata() to soc-card
  ASoC: soc-card: move snd_soc_card_jack_new() to soc-card
  ...

631d6914

Merge branch 'pcmcia-next' of git://git.kernel.org/pub/scm/linux/kernel/git/brodo/linux · a0a4d17e

Linus Torvalds authored Jun 04, 2020

Pull pcmcia updates from Dominik Brodowski:
 "Two minor PCMCIA odd fixes: one replacing zero-length arrays with a
  flexible-array member, and one making a local function static"

* 'pcmcia-next' of git://git.kernel.org/pub/scm/linux/kernel/git/brodo/linux:
  pcmcia: make pccard_loop_tuple() static
  pcmcia: Replace zero-length array with flexible-array

a0a4d17e

Merge tag 'leds-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/pavel/linux-leds · 86c67ce2

Linus Torvalds authored Jun 04, 2020

Pull LED updates from Pavel Machek:
 "New drivers: aw2013, sgm3140, some fixes

  Nothing much to see here, next release should be more interesting"

* tag 'leds-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/pavel/linux-leds:
  leds: add aw2013 driver
  dt-bindings: leds: Add binding for aw2013
  leds: trigger: remove redundant assignment to variable ret
  leds: netxbig: Convert to use GPIO descriptors
  leds: add sgm3140 driver
  dt-bindings: leds: Add binding for sgm3140
  leds: ariel: Add driver for status LEDs on Dell Wyse 3020
  leds: pwm: check result of led_pwm_set() in led_pwm_add()
  leds: tlc591xxt: hide error on EPROBE_DEFER
  leds: tca6507: Include the right header
  leds: lt3593: Drop surplus include
  leds: lp3952: Include the right header
  leds: lm355x: Drop surplus include

86c67ce2

Merge tag 'tag-chrome-platform-for-v5.8' of... · 9875b201

Linus Torvalds authored Jun 04, 2020

Merge tag 'tag-chrome-platform-for-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/chrome-platform/linux

Pull chrome platform updates from Benson Leung:
 "cros_ec_typec:
   - Add notifier for update, and register port partner

  Sensors/iio:
   - Fixes to cros_ec_sensorhub around allocation of resources, and
     send_sample

  Wilco EC:
   - Fix to output format of h1_gpio

  Misc:
   - Misc fixes to appease kernel-doc and other warnings
   - Set user space log size in chromeos_pstore"

* tag 'tag-chrome-platform-for-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/chrome-platform/linux:
  platform/chrome: cros_usbpd_logger: Add __printf annotation to append_str()
  platform/chrome: cros_ec_i2c: Appease the kernel-doc deity
  platform/chrome: typec: Fix ret value check error
  platform/chrome: cros_ec_typec: Register port partner
  platform/chrome: cros_ec_typec: Add struct for port data
  platform/chrome: cros_ec_typec: Use notifier for updates
  platform/chrome: cros_ec_ishtp: free ishtp buffer before sending event
  platform/chrome: cros_ec_ishtp: skip old cros_ec responses
  platform/chrome: wilco_ec: Provide correct output format to 'h1_gpio' file
  platform/chrome: chromeos_pstore: set user space log size

9875b201

Merge tag 'linux-watchdog-5.8-rc1' of git://www.linux-watchdog.org/linux-watchdog · 0486a39a

Linus Torvalds authored Jun 04, 2020

Pull watchdog updates from Wim Van Sebroeck:

 - add new arm_smc_wdt watchdog driver

 - da9062 and da9063 improvements

 - clarify documentation about stop() that became optional

 - document r8a7742 support

 - some overall fixes and improvements

* tag 'linux-watchdog-5.8-rc1' of git://www.linux-watchdog.org/linux-watchdog:
  watchdog: m54xx: Add missing include
  dt-bindings: watchdog: renesas,wdt: Document r8a7742 support
  watchdog: Fix runtime PM imbalance on error
  watchdog: riowd: remove unneeded semicolon
  watchdog: Add new arm_smc_wdt watchdog driver
  dt-bindings: watchdog: Add ARM smc wdt for mt8173 watchdog
  watchdog: imx2_wdt: update contact email
  watchdog: iTCO: fix link error
  watchdog: da9062: No need to ping manually before setting timeout
  watchdog: da9063: Make use of pre-configured timeout during probe
  watchdog: da9062: Initialize timeout during probe
  watchdog: clarify that stop() is optional
  watchdog: imx_sc_wdt: Fix reboot on crash
  watchdog: ts72xx_wdt: fix build error

0486a39a

Merge tag 'backlight-next-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/backlight · 302d5b33

Linus Torvalds authored Jun 04, 2020

Pull backlight updates from Lee Jones:
 "Core Framework:
   - Add backlight_device_get_by_name() to the API

  New Device Support:
   - Add support for WLED5 to Qualcomm WLED

  Fix-ups:
   - Convert to GPIO descriptors in l4f00242t03
   - Device Tree fix-ups for qcom-wled

  Bug Fixes:
   - Properly disable regulators on .probe() failure"

* tag 'backlight-next-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/backlight:
  backlight: Add backlight_device_get_by_name()
  backlight: qcom-wled: Add support for WLED5 peripheral that is present on PM8150L PMICs
  dt-bindings: backlight: qcom-wled: Add WLED5 bindings
  backlight: qcom-wled: Add callback functions
  dt-bindings: backlight: qcom-wled: Convert the wled bindings to .yaml format
  backlight: l4f00242t03: Convert to GPIO descriptors
  backlight: lp855x: Ensure regulators are disabled on probe failure

302d5b33

Merge tag 'mfd-next-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd · 512b7d37

Linus Torvalds authored Jun 04, 2020

Pull MFD updates from Lee Jones:
 "Core Frameworks:
   - Constify 'properties' attribute in core header file

  New Drivers:
   - Add support for Gateworks System Controller
   - Add support for MediaTek MT6358 PMIC
   - Add support for Mediatek MT6360 PMIC
   - Add support for Monolithic Power Systems MP2629 ADC and Battery charger

  Fix-ups:
   - Use new I2C API in htc-i2cpld
   - Remove superfluous code in sprd-sc27xx-spi
   - Improve error handling in stm32-timers
   - Device Tree additions/fixes in mt6397
   - Defer probe betterment in wm8994-core
   - Improve module handling in wm8994-core
   - Staticify in stpmic1
   - Trivial (spelling, formatting) in tqmx86

  Bug Fixes:
   - Fix incorrect register/PCI IDs in intel-lpss-pci
   - Fix unbalanced Regulator API calls in wm8994-core
   - Fix double free() in wcd934x
   - Remove IRQ domain on failure in stmfx
   - Reset chip on resume in stmfx
   - Disable/enable IRQs on suspend/resume in stmfx
   - Do not use bulk writes on H/W which does not support them in max77620"

* tag 'mfd-next-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd: (29 commits)
  mfd: mt6360: Remove duplicate REGMAP_IRQ_REG_LINE() entry
  mfd: Add support for PMIC MT6360
  mfd: max77620: Use single-byte writes on MAX77620
  mfd: wcd934x: Drop kfree for memory allocated with devm_kzalloc
  mfd: stmfx: Disable IRQ in suspend to avoid spurious interrupt
  mfd: stmfx: Fix stmfx_irq_init error path
  mfd: stmfx: Reset chip on resume as supply was disabled
  mfd: wm8994: Silence warning about supplies during deferred probe
  mfd: wm8994: Fix unbalanced calls to regulator_bulk_disable()
  mfd: wm8994: Fix driver operation if loaded as modules
  dt-bindings: mfd: mediatek: Add MT6397 Pin Controller
  mfd: Constify properties in mfd_cell
  mfd: stm32-timers: Use dma_request_chan() instead dma_request_slave_channel()
  mfd: sprd: Remove unnecessary spi_bus_type setting
  mfd: intel-lpss: Update LPSS UART #2 PCI ID for Jasper Lake
  mfd: tqmx86: Fix a typo in MODULE_DESCRIPTION
  mfd: stpmic1: Make stpmic1_regmap_config static
  mfd: htc-i2cpld: Convert to use i2c_new_client_device()
  MAINTAINERS: Add entry for mp2629 Battery Charger driver
  power: supply: mp2629: Add impedance compensation config
  ...

512b7d37

Merge tag 'Smack-for-5.8' of git://github.com/cschaufler/smack-next · acf25aa6

Linus Torvalds authored Jun 04, 2020

Pull smack updates from Casey Schaufler:
 "Clean out dead code and repair an out-of-bounds warning"

* tag 'Smack-for-5.8' of git://github.com/cschaufler/smack-next:
  Smack: Remove unused inline function smk_ad_setfield_u_fs_path_mnt
  Smack:- Remove redundant inode_smack cache
  Smack:- Remove mutex lock "smk_lock" from inode_smack
  Smack: slab-out-of-bounds in vsscanf
  smack: remove redundant structure variable from header.
  smack: avoid unused 'sip' variable warning

acf25aa6

Merge tag 'keys-next-20200602' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs · a484a497

Linus Torvalds authored Jun 04, 2020

Pull keyring updates from David Howells:

 - Fix a documentation warning.

 - Replace a zero-length array with a flexible one

 - Make the big_key key type use ChaCha20Poly1305 and use the crypto
   algorithm directly rather than going through the crypto layer.

 - Implement the update op for the big_key type.

* tag 'keys-next-20200602' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  keys: Implement update for the big_key type
  security/keys: rewrite big_key crypto to use library interface
  KEYS: Replace zero-length array with flexible-array
  Documentation: security: core.rst: add missing argument

a484a497

Merge tag 'perf-tools-2020-06-02' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux · 38b3a5aa

Linus Torvalds authored Jun 04, 2020

Pull perf tooling updates from Arnaldo Carvalho de Melo:
 "These are additional changes to the perf tools, on top of what Ingo
  already submitted.

   - Further Intel PT call-trace fixes

   - Improve SELinux docs and tool warnings

   - Fix race at exit in 'perf record' using eventfd.

   - Add missing build tests to the default set of 'make -C tools/perf
     build-test'

   - Sync msr-index.h getting new AMD MSRs to decode and filter in 'perf
     trace'.

   - Fix fallback to libaudit in 'perf trace' for arches not using
     per-arch *.tbl files.

   - Fixes for 'perf ftrace'.

   - Fixes and improvements for the 'perf stat' metrics.

   - Use dummy event to get PERF_RECORD_{FORK,MMAP,etc} while
     synthesizing those metadata events for pre-existing threads.

   - Fix leaks detected using clang tooling.

   - Improvements to PMU event metric testing.

   - Report summary for 'perf stat' interval mode at the end, summing up
     all the intervals.

   - Improve pipe mode, i.e. this now works as expected, continuously
     dumping samples:

        # perf record -g -e raw_syscalls:sys_enter | perf --no-pager script

   - Fixes for event grouping, detecting incompatible groups such as:

        # perf stat -e '{cycles,power/energy-cores/}' -v
        WARNING: group events cpu maps do not match, disabling group:
          anon group { power/energy-cores/, cycles }
            power/energy-cores/: 0
            cycles: 0-7

   - Fixes for 'perf probe': blacklist address checking, number of
     kretprobe instances, etc.

   - JIT processing improvements and fixes plus the addition of a 'perf
     test' entry for the java demangler.

   - Add support for synthesizing first/last level cache, TLB and remove
     access events from HW tracing in the auxtrace code, first to use is
     ARM SPE.

   - Vendor events updates and fixes, including for POWER9 and Intel.

   - Allow using ~/.perfconfig for removing the ',' separators in 'perf
     stat' output.

   - Opt-in support for libpfm4"

* tag 'perf-tools-2020-06-02' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux: (120 commits)
  perf tools: Remove some duplicated includes
  perf symbols: Fix kernel maps for kcore and eBPF
  tools arch x86: Sync the msr-index.h copy with the kernel sources
  perf stat: Ensure group is defined on top of the same cpu mask
  perf libdw: Fix off-by 1 relative directory includes
  perf arm-spe: Support synthetic events
  perf auxtrace: Add four itrace options
  perf tools: Move arm-spe-pkt-decoder.h/c to the new dir
  perf test: Initialize memory in dwarf-unwind
  perf tests: Don't tail call optimize in unwind test
  tools compiler.h: Add attribute to disable tail calls
  perf build: Add a LIBPFM4=1 build test entry
  perf tools: Add optional support for libpfm4
  perf tools: Correct license on jsmn JSON parser
  perf jit: Fix inaccurate DWARF line table
  perf jvmti: Remove redundant jitdump line table entries
  perf build: Add NO_SDT=1 to the default set of build tests
  perf build: Add NO_LIBCRYPTO=1 to the default set of build tests
  perf build: Add NO_SYSCALL_TABLE=1 to the build tests
  perf build: Remove libaudit from the default feature checks
  ...

38b3a5aa

atomisp: avoid warning about unused function · 6929f71e

Linus Torvalds authored Jun 03, 2020

The atomisp_mrfld_power() function isn't actually ever called, because
the two call-sites have commented out the use because it breaks on some
platforms.  That results in:

  drivers/staging/media/atomisp/pci/atomisp_v4l2.c:764:12: warning: ‘atomisp_mrfld_power’ defined but not used [-Wunused-function]
    764 | static int atomisp_mrfld_power(struct atomisp_device *isp, bool enable)
        |            ^~~~~~~~~~~~~~~~~~~

during the build.

Rather than commenting out the use entirely, just disable it
semantically instead (using a "0 &&" construct), leaving the call in
place from a syntax standpoint, and avoiding the warning.

I really don't want my builds to have any warnings that can then hide
real issues.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

6929f71e

Merge tag 'media/v5.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media · a98f670e

Linus Torvalds authored Jun 03, 2020

Pull media updates from Mauro Carvalho Chehab:

 - Media documentation is now split into admin-guide, driver-api and
   userspace-api books (a longstanding request from Jon);

 - The media Kconfig was reorganized, in order to make easier to select
   drivers and their dependencies;

 - The testing drivers now has a separate directory;

 - added a new driver for Rockchip Video Decoder IP;

 - The atomisp staging driver was resurrected. It is meant to work with
   4 generations of cameras on Atom-based laptops, tablets and cell
   phones. So, it seems worth investing time to cleanup this driver and
   making it in good shape.

 - Added some V4L2 core ancillary routines to help with h264 codecs;

 - Added an ov2740 image sensor driver;

 - The si2157 gained support for Analog TV, which, in turn, added
   support for some cx231xx and cx23885 boards to also support analog
   standards;

 - Added some V4L2 controls (V4L2_CID_CAMERA_ORIENTATION and
   V4L2_CID_CAMERA_SENSOR_ROTATION) to help identifying where the camera
   is located at the device;

 - VIDIOC_ENUM_FMT was extended to support MC-centric devices;

 - Lots of drivers improvements and cleanups.

* tag 'media/v5.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (503 commits)
  media: Documentation: media: Refer to mbus format documentation from CSI-2 docs
  media: s5k5baf: Replace zero-length array with flexible-array
  media: i2c: imx219: Drop <linux/clk-provider.h> and <linux/clkdev.h>
  media: i2c: Add ov2740 image sensor driver
  media: ov8856: Implement sensor module revision identification
  media: ov8856: Add devicetree support
  media: dt-bindings: ov8856: Document YAML bindings
  media: dvb-usb: Add Cinergy S2 PCIe Dual Port support
  media: dvbdev: Fix tuner->demod media controller link
  media: dt-bindings: phy: phy-rockchip-dphy-rx0: move rockchip dphy rx0 bindings out of staging
  media: staging: dt-bindings: phy-rockchip-dphy-rx0: remove non-used reg property
  media: atomisp: unify the version for isp2401 a0 and b0 versions
  media: atomisp: update TODO with the current data
  media: atomisp: adjust some code at sh_css that could be broken
  media: atomisp: don't produce errs for ignored IRQs
  media: atomisp: print IRQ when debugging
  media: atomisp: isp_mmu: don't use kmem_cache
  media: atomisp: add a notice about possible leak resources
  media: atomisp: disable the dynamic and reserved pools
  media: atomisp: turn on camera before setting it
  ...

a98f670e

Merge branch 'akpm' (patches from Andrew) · ee01c4d7

Linus Torvalds authored Jun 03, 2020

Merge more updates from Andrew Morton:
 "More mm/ work, plenty more to come

  Subsystems affected by this patch series: slub, memcg, gup, kasan,
  pagealloc, hugetlb, vmscan, tools, mempolicy, memblock, hugetlbfs,
  thp, mmap, kconfig"

* akpm: (131 commits)
  arm64: mm: use ARCH_HAS_DEBUG_WX instead of arch defined
  x86: mm: use ARCH_HAS_DEBUG_WX instead of arch defined
  riscv: support DEBUG_WX
  mm: add DEBUG_WX support
  drivers/base/memory.c: cache memory blocks in xarray to accelerate lookup
  mm/thp: rename pmd_mknotpresent() as pmd_mkinvalid()
  powerpc/mm: drop platform defined pmd_mknotpresent()
  mm: thp: don't need to drain lru cache when splitting and mlocking THP
  hugetlbfs: get unmapped area below TASK_UNMAPPED_BASE for hugetlbfs
  sparc32: register memory occupied by kernel as memblock.memory
  include/linux/memblock.h: fix minor typo and unclear comment
  mm, mempolicy: fix up gup usage in lookup_node
  tools/vm/page_owner_sort.c: filter out unneeded line
  mm: swap: memcg: fix memcg stats for huge pages
  mm: swap: fix vmstats for huge pages
  mm: vmscan: limit the range of LRU type balancing
  mm: vmscan: reclaim writepage is IO cost
  mm: vmscan: determine anon/file pressure balance at the reclaim root
  mm: balance LRU lists based on relative thrashing
  mm: only count actual rotations as LRU reclaim cost
  ...

ee01c4d7

arm64: mm: use ARCH_HAS_DEBUG_WX instead of arch defined · 09587a09

Zong Li authored Jun 03, 2020

Extract DEBUG_WX to mm/Kconfig.debug for shared use.  Change to use
ARCH_HAS_DEBUG_WX instead of DEBUG_WX defined by arch port.
Signed-off-by: Zong Li <zong.li@sifive.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Link: http://lkml.kernel.org/r/e19709e7576f65e303245fe520cad5f7bae72763.1587455584.git.zong.li@sifive.comSigned-off-by: Linus Torvalds <torvalds@linux-foundation.org>

09587a09

x86: mm: use ARCH_HAS_DEBUG_WX instead of arch defined · 7e01ccb4

Zong Li authored Jun 03, 2020

Extract DEBUG_WX to mm/Kconfig.debug for shared use.  Change to use
ARCH_HAS_DEBUG_WX instead of DEBUG_WX defined by arch port.
Signed-off-by: Zong Li <zong.li@sifive.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Link: http://lkml.kernel.org/r/430736828d149df3f5b462d291e845ec690e0141.1587455584.git.zong.li@sifive.comSigned-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7e01ccb4

riscv: support DEBUG_WX · b422d28b

Zong Li authored Jun 03, 2020

Support DEBUG_WX to check whether there are mapping with write and execute
permission at the same time.

[akpm@linux-foundation.org: replace macros with C]
Signed-off-by: Zong Li <zong.li@sifive.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Link: http://lkml.kernel.org/r/282e266311bced080bc6f7c255b92f87c1eb65d6.1587455584.git.zong.li@sifive.comSigned-off-by: Linus Torvalds <torvalds@linux-foundation.org>

b422d28b

mm: add DEBUG_WX support · 375d315c

Zong Li authored Jun 03, 2020

Patch series "Extract DEBUG_WX to shared use".

Some architectures support DEBUG_WX function, it's verbatim from each
others, so extract to mm/Kconfig.debug for shared use.

PPC and ARM ports don't support generic page dumper yet, so we only
refine x86 and arm64 port in this patch series.

For RISC-V port, the DEBUG_WX support depends on other patches which
be merged already:
  - RISC-V page table dumper
  - Support strict kernel memory permissions for security

This patch (of 4):

Some architectures support DEBUG_WX function, it's verbatim from each
others.  Extract to mm/Kconfig.debug for shared use.

[akpm@linux-foundation.org: reword text, per Will Deacon & Zong Li]
  Link: http://lkml.kernel.org/r/20200427194245.oxRJKj3fn%25akpm@linux-foundation.org
[zong.li@sifive.com: remove the specific name of arm64]
  Link: http://lkml.kernel.org/r/3a6a92ecedc54e1d0fc941398e63d504c2cd5611.1589178399.git.zong.li@sifive.com
[zong.li@sifive.com: add MMU dependency for DEBUG_WX]
  Link: http://lkml.kernel.org/r/4a674ac7863ff39ca91847b10e51209771f99416.1589178399.git.zong.li@sifive.comSuggested-by: Palmer Dabbelt <palmer@dabbelt.com>
Signed-off-by: Zong Li <zong.li@sifive.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Link: http://lkml.kernel.org/r/cover.1587455584.git.zong.li@sifive.com
Link: http://lkml.kernel.org/r/23980cd0f0e5d79e24a92169116407c75bcc650d.1587455584.git.zong.li@sifive.comSigned-off-by: Linus Torvalds <torvalds@linux-foundation.org>

375d315c

drivers/base/memory.c: cache memory blocks in xarray to accelerate lookup · 4fb6eabf

Scott Cheloha authored Jun 03, 2020

Searching for a particular memory block by id is an O(n) operation because
each memory block's underlying device is kept in an unsorted linked list
on the subsystem bus.

We can cut the lookup cost to O(log n) if we cache each memory block
in an xarray.  This time complexity improvement is significant on
systems with many memory blocks.  For example:

1. A 128GB POWER9 VM with 256MB memblocks has 512 blocks.  With this
   change  memory_dev_init() completes ~12ms faster and walk_memory_blocks()
   completes ~12ms faster.

Before:
[    0.005042] memory_dev_init: adding memory blocks
[    0.021591] memory_dev_init: added memory blocks
[    0.022699] walk_memory_blocks: walking memory blocks
[    0.038730] walk_memory_blocks: walked memory blocks 0-511

After:
[    0.005057] memory_dev_init: adding memory blocks
[    0.009415] memory_dev_init: added memory blocks
[    0.010519] walk_memory_blocks: walking memory blocks
[    0.014135] walk_memory_blocks: walked memory blocks 0-511

2. A 256GB POWER9 LPAR with 256MB memblocks has 1024 blocks.  With
   this change memory_dev_init() completes ~88ms faster and
   walk_memory_blocks() completes ~87ms faster.

Before:
[    0.252246] memory_dev_init: adding memory blocks
[    0.395469] memory_dev_init: added memory blocks
[    0.409413] walk_memory_blocks: walking memory blocks
[    0.433028] walk_memory_blocks: walked memory blocks 0-511
[    0.433094] walk_memory_blocks: walking memory blocks
[    0.500244] walk_memory_blocks: walked memory blocks 131072-131583

After:
[    0.245063] memory_dev_init: adding memory blocks
[    0.299539] memory_dev_init: added memory blocks
[    0.313609] walk_memory_blocks: walking memory blocks
[    0.315287] walk_memory_blocks: walked memory blocks 0-511
[    0.315349] walk_memory_blocks: walking memory blocks
[    0.316988] walk_memory_blocks: walked memory blocks 131072-131583

3. A 32TB POWER9 LPAR with 256MB memblocks has 131072 blocks.  With
   this change we complete memory_dev_init() ~37 minutes faster and
   walk_memory_blocks() at least ~30 minutes faster.  The exact timing
   for walk_memory_blocks() is  missing, though I observed that the
   soft lockups in walk_memory_blocks() disappeared with the change,
   suggesting that lower bound.

Before:
[   13.703907] memory_dev_init: adding blocks
[ 2287.406099] memory_dev_init: added all blocks
[ 2347.494986] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 2527.625378] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 2707.761977] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 2887.899975] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 3068.028318] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 3248.158764] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 3428.287296] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 3608.425357] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 3788.554572] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 3968.695071] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 4148.823970] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160

After:
[   13.696898] memory_dev_init: adding blocks
[   15.660035] memory_dev_init: added all blocks
(the walk_memory_blocks traces disappear)

There should be no significant negative impact for machines with few
memory blocks.  A sparse xarray has a small footprint and an O(log n)
lookup is negligibly slower than an O(n) lookup for only the smallest
number of memory blocks.

1. A 16GB x86 machine with 128MB memblocks has 132 blocks.  With this
   change memory_dev_init() completes ~300us faster and walk_memory_blocks()
   completes no faster or slower.  The improvement is pretty close to noise.

Before:
[    0.224752] memory_dev_init: adding memory blocks
[    0.227116] memory_dev_init: added memory blocks
[    0.227183] walk_memory_blocks: walking memory blocks
[    0.227183] walk_memory_blocks: walked memory blocks 0-131

After:
[    0.224911] memory_dev_init: adding memory blocks
[    0.226935] memory_dev_init: added memory blocks
[    0.227089] walk_memory_blocks: walking memory blocks
[    0.227089] walk_memory_blocks: walked memory blocks 0-131

[david@redhat.com: document the locking]
  Link: http://lkml.kernel.org/r/bc21eec6-7251-4c91-2f57-9a0671f8d414@redhat.comSigned-off-by: Scott Cheloha <cheloha@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Nathan Lynch <nathanl@linux.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Rick Lindsley <ricklind@linux.vnet.ibm.com>
Cc: Scott Cheloha <cheloha@linux.ibm.com>
Link: http://lkml.kernel.org/r/20200121231028.13699-1-cheloha@linux.ibm.comSigned-off-by: Linus Torvalds <torvalds@linux-foundation.org>

4fb6eabf

mm/thp: rename pmd_mknotpresent() as pmd_mkinvalid() · 86ec2da0

Anshuman Khandual authored Jun 03, 2020

pmd_present() is expected to test positive after pmdp_mknotpresent() as
the PMD entry still points to a valid huge page in memory.
pmdp_mknotpresent() implies that given PMD entry is just invalidated from
MMU perspective while still holding on to pmd_page() referred valid huge
page thus also clearing pmd_present() test.  This creates the following
situation which is counter intuitive.

[pmd_present(pmd_mknotpresent(pmd)) = true]

This renames pmd_mknotpresent() as pmd_mkinvalid() reflecting the helper's
functionality more accurately while changing the above mentioned situation
as follows.  This does not create any functional change.

[pmd_present(pmd_mkinvalid(pmd)) = true]

This is not applicable for platforms that define own pmdp_invalidate() via
__HAVE_ARCH_PMDP_INVALIDATE.  Suggestion for renaming came during a
previous discussion here.

https://patchwork.kernel.org/patch/11019637/

[anshuman.khandual@arm.com: change pmd_mknotvalid() to pmd_mkinvalid() per Will]
  Link: http://lkml.kernel.org/r/1587520326-10099-3-git-send-email-anshuman.khandual@arm.comSuggested-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Will Deacon <will@kernel.org>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Link: http://lkml.kernel.org/r/1584680057-13753-3-git-send-email-anshuman.khandual@arm.comSigned-off-by: Linus Torvalds <torvalds@linux-foundation.org>

86ec2da0

powerpc/mm: drop platform defined pmd_mknotpresent() · 124cb3a6

Anshuman Khandual authored Jun 03, 2020

Patch series "mm/thp: Rename pmd_mknotpresent() as pmd_mknotvalid()", v2.

This series renames pmd_mknotpresent() as pmd_mknotvalid().  Before that
it drops an existing pmd_mknotpresent() definition from powerpc platform
which was never required as it defines it's pmdp_invalidate() through
subscribing __HAVE_ARCH_PMDP_INVALIDATE.  This does not create any
functional change.

This rename was suggested by Catalin during a previous discussion while we
were trying to change the THP helpers on arm64 platform for migration.

https://patchwork.kernel.org/patch/11019637/

This patch (of 2):

Platform needs to define pmd_mknotpresent() for generic pmdp_invalidate()
only when __HAVE_ARCH_PMDP_INVALIDATE is not subscribed.  Otherwise
platform specific pmd_mknotpresent() is not required.  Hence just drop it.
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1587520326-10099-1-git-send-email-anshuman.khandual@arm.com
Link: http://lkml.kernel.org/r/1584680057-13753-1-git-send-email-anshuman.khandual@arm.com
Link: http://lkml.kernel.org/r/1584680057-13753-2-git-send-email-anshuman.khandual@arm.comSigned-off-by: Linus Torvalds <torvalds@linux-foundation.org>

124cb3a6

mm: thp: don't need to drain lru cache when splitting and mlocking THP · 67e4eb07

Yang Shi authored Jun 03, 2020

Since commit 8f182270 ("mm/swap.c: flush lru pvecs on compound page
arrival") THP would not stay in pagevec anymore. So the optimization made
by commit d9654322 ("thp: increase split_huge_page() success rate")
doesn't make sense anymore, which tries to unpin munlocked THPs from
pagevec by draining pagevec.

Draining lru cache before isolating THP in mlock path is also unnecessary.
b676b293 ("mm, thp: fix mapped pages avoiding unevictable list on
mlock") added it and 9a73f61b ("thp, mlock: do not mlock PTE-mapped
file huge pages") accidentally carried it over after the above
optimization went in.
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Link: http://lkml.kernel.org/r/1585946493-7531-1-git-send-email-yang.shi@linux.alibaba.comSigned-off-by: Linus Torvalds <torvalds@linux-foundation.org>

67e4eb07

hugetlbfs: get unmapped area below TASK_UNMAPPED_BASE for hugetlbfs · 88590253

Shijie Hu authored Jun 03, 2020

In a 32-bit program, running on arm64 architecture.  When the address
space below mmap base is completely exhausted, shmat() for huge pages will
return ENOMEM, but shmat() for normal pages can still success on no-legacy
mode.  This seems not fair.

For normal pages, the calling trace of get_unmapped_area() is:

	=> mm->get_unmapped_area()
	if on legacy mode,
		=> arch_get_unmapped_area()
			=> vm_unmapped_area()
	if on no-legacy mode,
		=> arch_get_unmapped_area_topdown()
			=> vm_unmapped_area()

For huge pages, the calling trace of get_unmapped_area() is:

	=> file->f_op->get_unmapped_area()
		=> hugetlb_get_unmapped_area()
			=> vm_unmapped_area()

To solve this issue, we only need to make hugetlb_get_unmapped_area() take
the same way as mm->get_unmapped_area().  Add *bottomup() and *topdown()
for hugetlbfs, and check current mm->get_unmapped_area() to decide which
one to use.  If mm->get_unmapped_area is equal to
arch_get_unmapped_area_topdown(), hugetlb_get_unmapped_area() calls
topdown routine, otherwise calls bottomup routine.
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Shijie Hu <hushijie3@huawei.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Will Deacon <will@kernel.org>
Cc: Xiaoming Ni <nixiaoming@huawei.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: yangerkun <yangerkun@huawei.com>
Cc: ChenGang <cg.chen@huawei.com>
Cc: Chen Jie <chenjie6@huawei.com>
Link: http://lkml.kernel.org/r/20200518065338.113664-1-hushijie3@huawei.comSigned-off-by: Linus Torvalds <torvalds@linux-foundation.org>

88590253

sparc32: register memory occupied by kernel as memblock.memory · 4360dfa9

Mike Rapoport authored Jun 03, 2020

sparc32 never registered the memory occupied by the kernel image with
memblock_add() and it only reserved this memory with meblock_reserve().

With openbios as system firmware, the memory occupied by the kernel is
reserved in openbios and removed from mem.available. The prom setup code
in the kernel uses mem.available to set up the memory banks and
essentially there is a hole for the memory occupied by the kernel image.

Later in bootmem_init() this memory is memblock_reserve()d.

Up until recently, memmap initialization would call __init_single_page()
for the pages in that hole, the free_low_memory_core_early() would mark
them as reserved and everything would be Ok.

After the change in memmap initialization introduced by the commit "mm:
memmap_init: iterate over memblock regions rather that check each PFN",
the hole is skipped and the page structs for it are not initialized. And
when they are passed from memblock to page allocator as reserved, the
latter gets confused.

Simply registering the memory occupied by the kernel with memblock_add()
resolves this issue.

Tested on qemu-system-sparc with Debian Etch [1] userspace.

[1] https://people.debian.org/~aurel32/qemu/sparc/debian_etch_sparc_small.qcow2Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: Guenter Roeck <linux@roeck-us.net>
Link: https://lkml.kernel.org/r/20200517000050.GA87467@roeck-us.nlllllet/Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

4360dfa9

include/linux/memblock.h: fix minor typo and unclear comment · 8cbd54f5

chenqiwu authored Jun 03, 2020

Fix a minor typo "usabe->usable" for the current discription of member
variable "memory" in struct memblock.

BTW, I think it's unclear the member variable "base" in struct
memblock_type is currently described as the physical address of memory
region, change it to base address of the region is clearer since the
variable is decorated as phys_addr_t.
Signed-off-by: chenqiwu <chenqiwu@xiaomi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Link: http://lkml.kernel.org/r/1588846952-32166-1-git-send-email-qiwuchen55@gmail.comSigned-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8cbd54f5

mm, mempolicy: fix up gup usage in lookup_node · 2d3a36a4

Michal Hocko authored Jun 03, 2020

ba841078 ("mm/mempolicy: Allow lookup_node() to handle fatal signal")
has added a special casing for 0 return value because that was a possible
gup return value when interrupted by fatal signal. This has been fixed by
ae46d2aa ("mm/gup: Let __get_user_pages_locked() return -EINTR for
fatal signal") in the mean time so ba841078 can be reverted.

This patch however doesn't go all the way to revert it because the check
for 0 is wrong and confusing here. Firstly it is inherently unsafe to
access the page when get_user_pages_locked returns 0 (aka no page
returned).

Fortunatelly this will not happen because get_user_pages_locked will not
return 0 when nr_pages > 0 unless FOLL_NOWAIT is specified which is not
the case here. Document this potential error code in gup code while we
are at it.
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Xu <peterx@redhat.com>
Link: http://lkml.kernel.org/r/20200421071026.18394-1-mhocko@kernel.orgSigned-off-by: Linus Torvalds <torvalds@linux-foundation.org>

2d3a36a4

tools/vm/page_owner_sort.c: filter out unneeded line · 5b94ce2f

Changhee Han authored Jun 03, 2020

To see a sorted result from page_owner, there must be a tiresome
preprocessing step before running page_owner_sort.  This patch simply
filters out lines which start with "PFN" while reading the page owner
report.
Signed-off-by: Changhee Han <ch0.han@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Link: http://lkml.kernel.org/r/20200429052940.16968-1-ch0.han@lge.comSigned-off-by: Linus Torvalds <torvalds@linux-foundation.org>

5b94ce2f

mm: swap: memcg: fix memcg stats for huge pages · 21e330fc

Shakeel Butt authored Jun 03, 2020

The commit 2262185c ("mm: per-cgroup memory reclaim stats") added
PGLAZYFREE, PGACTIVATE & PGDEACTIVATE stats for cgroups but missed
couple of places and PGLAZYFREE missed huge page handling. Fix that.
Also for PGLAZYFREE use the irq-unsafe function to update as the irq is
already disabled.

Fixes: 2262185c ("mm: per-cgroup memory reclaim stats")
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: http://lkml.kernel.org/r/20200527182947.251343-1-shakeelb@google.comSigned-off-by: Linus Torvalds <torvalds@linux-foundation.org>

21e330fc

mm: swap: fix vmstats for huge pages · 5d91f31f

Shakeel Butt authored Jun 03, 2020

Many of the callbacks called by pagevec_lru_move_fn() does not correctly
update the vmstats for huge pages. Fix that. Also __pagevec_lru_add_fn()
use the irq-unsafe alternative to update the stat as the irqs are
already disabled.
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: http://lkml.kernel.org/r/20200527182916.249910-1-shakeelb@google.comSigned-off-by: Linus Torvalds <torvalds@linux-foundation.org>

5d91f31f

mm: vmscan: limit the range of LRU type balancing · d483a5dd

Johannes Weiner authored Jun 03, 2020

When LRU cost only shows up on one list, we abruptly stop scanning that
list altogether.  That's an extreme reaction: by the time the other list
starts thrashing and the pendulum swings back, we may have no recent age
information on the first list anymore, and we could have significant
latencies until the scanner has caught up.

Soften this change in the feedback system by ensuring that no list
receives less than a third of overall pressure, and only distribute the
other 66% according to LRU cost.  This ensures that we maintain a minimum
rate of aging on the entire workingset while it's being pressured, while
still allowing a generous rate of convergence when the relative sizes of
the lists need to adjust.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Link: http://lkml.kernel.org/r/20200520232525.798933-15-hannes@cmpxchg.orgSigned-off-by: Linus Torvalds <torvalds@linux-foundation.org>

d483a5dd

mm: vmscan: reclaim writepage is IO cost · 96f8bf4f

Johannes Weiner authored Jun 03, 2020

The VM tries to balance reclaim pressure between anon and file so as to
reduce the amount of IO incurred due to the memory shortage. It already
counts refaults and swapins, but in addition it should also count
writepage calls during reclaim.

For swap, this is obvious: it's IO that wouldn't have occurred if the
anonymous memory hadn't been under memory pressure. From a relative
balancing point of view this makes sense as well: even if anon is cold and
reclaimable, a cache that isn't thrashing may have equally cold pages that
don't require IO to reclaim.

For file writeback, it's trickier: some of the reclaim writepage IO would
have likely occurred anyway due to dirty expiration. But not all of it -
premature writeback reduces batching and generates additional writes.
Since the flushers are already woken up by the time the VM starts writing
cache pages one by one, let's assume that we'e likely causing writes that
wouldn't have happened without memory pressure. In addition, the per-page
cost of IO would have probably been much cheaper if written in larger
batches from the flusher thread rather than the single-page-writes from
kswapd.

For our purposes - getting the trend right to accelerate convergence on a
stable state that doesn't require paging at all - this is sufficiently
accurate. If we later wanted to optimize for sustained thrashing, we can
still refine the measurements.

Count all writepage calls from kswapd as IO cost toward the LRU that the
page belongs to.

Why do this dynamically? Don't we know in advance that anon pages require
IO to reclaim, and so could build in a static bias?

First, scanning is not the same as reclaiming. If all the anon pages are
referenced, we may not swap for a while just because we're scanning the
anon list. During this time, however, it's important that we age
anonymous memory and the page cache at the same rate so that their
hot-cold gradients are comparable. Everything else being equal, we still
want to reclaim the coldest memory overall.

Second, we keep copies in swap unless the page changes. If there is
swap-backed data that's mostly read (tmpfs file) and has been swapped out
before, we can reclaim it without incurring additional IO.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Link: http://lkml.kernel.org/r/20200520232525.798933-14-hannes@cmpxchg.orgSigned-off-by: Linus Torvalds <torvalds@linux-foundation.org>

96f8bf4f

mm: vmscan: determine anon/file pressure balance at the reclaim root · 7cf111bc

Johannes Weiner authored Jun 03, 2020

We split the LRU lists into anon and file, and we rebalance the scan
pressure between them when one of them begins thrashing: if the file cache
experiences workingset refaults, we increase the pressure on anonymous
pages; if the workload is stalled on swapins, we increase the pressure on
the file cache instead.

With cgroups and their nested LRU lists, we currently don't do this
correctly. While recursive cgroup reclaim establishes a relative LRU
order among the pages of all involved cgroups, LRU pressure balancing is
done on an individual cgroup LRU level. As a result, when one cgroup is
thrashing on the filesystem cache while a sibling may have cold anonymous
pages, pressure doesn't get equalized between them.

This patch moves LRU balancing decision to the root of reclaim - the same
level where the LRU order is established.

It does this by tracking LRU cost recursively, so that every level of the
cgroup tree knows the aggregate LRU cost of all memory within its domain.
When the page scanner calculates the scan balance for any given individual
cgroup's LRU list, it uses the values from the ancestor cgroup that
initiated the reclaim cycle.

If one sibling is then thrashing on the cache, it will tip the pressure
balance inside its ancestors, and the next hierarchical reclaim iteration
will go more after the anon pages in the tree.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Link: http://lkml.kernel.org/r/20200520232525.798933-13-hannes@cmpxchg.orgSigned-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7cf111bc

mm: balance LRU lists based on relative thrashing · 314b57fb

Johannes Weiner authored Jun 03, 2020

Since the LRUs were split into anon and file lists, the VM has been
balancing between page cache and anonymous pages based on per-list ratios
of scanned vs. rotated pages. In most cases that tips page reclaim
towards the list that is easier to reclaim and has the fewest actively
used pages, but there are a few problems with it:

1. Refaults and LRU rotations are weighted the same way, even though
one costs IO and the other costs a bit of CPU.

2. The less we scan an LRU list based on already observed rotations,
the more we increase the sampling interval for new references, and
rotations become even more likely on that list. This can enter a
death spiral in which we stop looking at one list completely until
the other one is all but annihilated by page reclaim.

Since commit a528910e ("mm: thrash detection-based file cache sizing")
we have refault detection for the page cache. Along with swapin events,
they are good indicators of when the file or anon list, respectively, is
too small for its workingset and needs to grow.

For example, if the page cache is thrashing, the cache pages need more
time in memory, while there may be colder pages on the anonymous list.
Likewise, if swapped pages are faulting back in, it indicates that we
reclaim anonymous pages too aggressively and should back off.

Replace LRU rotations with refaults and swapins as the basis for relative
reclaim cost of the two LRUs. This will have the VM target list balances
that incur the least amount of IO on aggregate.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Link: http://lkml.kernel.org/r/20200520232525.798933-12-hannes@cmpxchg.orgSigned-off-by: Linus Torvalds <torvalds@linux-foundation.org>

314b57fb

mm: only count actual rotations as LRU reclaim cost · 264e90cc

Johannes Weiner authored Jun 03, 2020

When shrinking the active file list we rotate referenced pages only when
they're in an executable mapping. The others get deactivated. When it
comes to balancing scan pressure, though, we count all referenced pages as
rotated, even the deactivated ones. Yet they do not carry the same cost
to the system: the deactivated page *might* refault later on, but the
deactivation is tangible progress toward freeing pages; rotations on the
other hand cost time and effort without getting any closer to freeing
memory.

Don't treat both events as equal. The following patch will hook up LRU
balancing to cache and anon refaults, which are a much more concrete cost
signal for reclaiming one list over the other. Thus, remove the maybe-IO
cost bias from page references, and only note the CPU cost for actual
rotations that prevent the pages from getting reclaimed.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Rik van Riel <riel@surriel.com>
Link: http://lkml.kernel.org/r/20200520232525.798933-11-hannes@cmpxchg.orgSigned-off-by: Linus Torvalds <torvalds@linux-foundation.org>

264e90cc

mm: deactivations shouldn't bias the LRU balance · fbbb602e

Johannes Weiner authored Jun 03, 2020

Operations like MADV_FREE, FADV_DONTNEED etc.  currently move any affected
active pages to the inactive list to accelerate their reclaim (good) but
also steer page reclaim toward that LRU type, or away from the other
(bad).

The reason why this is undesirable is that such operations are not part of
the regular page aging cycle, and rather a fluke that doesn't say much
about the remaining pages on that list; they might all be in heavy use,
and once the chunk of easy victims has been purged, the VM continues to
apply elevated pressure on those remaining hot pages.  The other LRU,
meanwhile, might have easily reclaimable pages, and there was never a need
to steer away from it in the first place.

As the previous patch outlined, we should focus on recording actually
observed cost to steer the balance rather than speculating about the
potential value of one LRU list over the other.  In that spirit, leave
explicitely deactivated pages to the LRU algorithm to pick up, and let
rotations decide which list is the easiest to reclaim.

[cai@lca.pw: fix set-but-not-used warning]
  Link: http://lkml.kernel.org/r/20200522133335.GA624@Qians-MacBook-Air.localSigned-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Qian Cai <cai@lca.pw>
Link: http://lkml.kernel.org/r/20200520232525.798933-10-hannes@cmpxchg.orgSigned-off-by: Linus Torvalds <torvalds@linux-foundation.org>

fbbb602e

mm: base LRU balancing on an explicit cost model · 1431d4d1

Johannes Weiner authored Jun 03, 2020

Currently, scan pressure between the anon and file LRU lists is balanced
based on a mixture of reclaim efficiency and a somewhat vague notion of
"value" of having certain pages in memory over others. That concept of
value is problematic, because it has caused us to count any event that
remotely makes one LRU list more or less preferrable for reclaim, even
when these events are not directly comparable and impose very different
costs on the system. One example is referenced file pages that we still
deactivate and referenced anonymous pages that we actually rotate back to
the head of the list.

There is also conceptual overlap with the LRU algorithm itself. By
rotating recently used pages instead of reclaiming them, the algorithm
already biases the applied scan pressure based on page value. Thus, when
rebalancing scan pressure due to rotations, we should think of reclaim
cost, and leave assessing the page value to the LRU algorithm.

Lastly, considering both value-increasing as well as value-decreasing
events can sometimes cause the same type of event to be counted twice,
i.e. how rotating a page increases the LRU value, while reclaiming it
succesfully decreases the value. In itself this will balance out fine,
but it quietly skews the impact of events that are only recorded once.

The abstract metric of "value", the murky relationship with the LRU
algorithm, and accounting both negative and positive events make the
current pressure balancing model hard to reason about and modify.

This patch switches to a balancing model of accounting the concrete,
actually observed cost of reclaiming one LRU over another. For now, that
cost includes pages that are scanned but rotated back to the list head.
Subsequent patches will add consideration for IO caused by refaulting of
recently evicted pages.

Replace struct zone_reclaim_stat with two cost counters in the lruvec, and
make everything that affects cost go through a new lru_note_cost()
function.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Link: http://lkml.kernel.org/r/20200520232525.798933-9-hannes@cmpxchg.orgSigned-off-by: Linus Torvalds <torvalds@linux-foundation.org>

1431d4d1

mm: vmscan: drop unnecessary div0 avoidance rounding in get_scan_count() · a4fe1631

Johannes Weiner authored Jun 03, 2020

When we calculate the relative scan pressure between the anon and file LRU
lists, we have to assume that reclaim_stat can contain zeroes.  To avoid
div0 crashes, we add 1 to all denominators like so:

        anon_prio = swappiness;
        file_prio = 200 - anon_prio;

	[...]

        /*
         * The amount of pressure on anon vs file pages is inversely
         * proportional to the fraction of recently scanned pages on
         * each list that were recently referenced and in active use.
         */
        ap = anon_prio * (reclaim_stat->recent_scanned[0] + 1);
        ap /= reclaim_stat->recent_rotated[0] + 1;

        fp = file_prio * (reclaim_stat->recent_scanned[1] + 1);
        fp /= reclaim_stat->recent_rotated[1] + 1;
        spin_unlock_irq(&pgdat->lru_lock);

        fraction[0] = ap;
        fraction[1] = fp;
        denominator = ap + fp + 1;

While reclaim_stat can contain 0, it's not actually possible for ap + fp
to be 0.  One of anon_prio or file_prio could be zero, but they must still
add up to 200.  And the reclaim_stat fraction, due to the +1 in there, is
always at least 1.  So if one of the two numerators is 0, the other one
can't be.  ap + fp is always at least 1.  Drop the + 1.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Link: http://lkml.kernel.org/r/20200520232525.798933-8-hannes@cmpxchg.orgSigned-off-by: Linus Torvalds <torvalds@linux-foundation.org>

a4fe1631

mm: remove use-once cache bias from LRU balancing · 96824687

Johannes Weiner authored Jun 03, 2020

When the splitlru patches divided page cache and swap-backed pages into
separate LRU lists, the pressure balance between the lists was biased to
account for the fact that streaming IO can cause memory pressure with a
flood of pages that are used only once. New page cache additions would
tip the balance toward the file LRU, and repeat access would neutralize
that bias again. This ensured that page reclaim would always go for
used-once cache first.

Since e9868505 ("mm,vmscan: only evict file pages when we have
plenty"), page reclaim generally skips over swap-backed memory entirely as
long as there is used-once cache present, and will apply the LRU balancing
when only repeatedly accessed cache pages are left - at which point the
previous use-once bias will have been neutralized. This makes the
use-once cache balancing bias unnecessary.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Rik van Riel <riel@surriel.com>
Link: http://lkml.kernel.org/r/20200520232525.798933-7-hannes@cmpxchg.orgSigned-off-by: Linus Torvalds <torvalds@linux-foundation.org>

96824687

mm: workingset: let cache workingset challenge anon · 34e58cac

Johannes Weiner authored Jun 03, 2020

We activate cache refaults with reuse distances in pages smaller than the
size of the total cache. This allows new pages with competitive access
frequencies to establish themselves, as well as challenge and potentially
displace pages on the active list that have gone cold.

However, that assumes that active cache can only replace other active
cache in a competition for the hottest memory. This is not a great
default assumption. The page cache might be thrashing while there are
enough completely cold and unused anonymous pages sitting around that we'd
only have to write to swap once to stop all IO from the cache.

Activate cache refaults when their reuse distance in pages is smaller than
the total userspace workingset, including anonymous pages.

Reclaim can still decide how to balance pressure among the two LRUs
depending on the IO situation. Rotational drives will prefer avoiding
random IO from swap and go harder after cache. But fundamentally, hot
cache should be able to compete with anon pages for a place in RAM.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Link: http://lkml.kernel.org/r/20200520232525.798933-6-hannes@cmpxchg.orgSigned-off-by: Linus Torvalds <torvalds@linux-foundation.org>

34e58cac