1. 11 Dec, 2020 3 commits
    • x86: Print ratio freq_max/freq_base used in frequency invariance calculations · 3149cd55
      Giovanni Gherdovich authored
      The value freq_max/freq_base is a fundamental component of frequency
      invariance calculations. It may come from a variety of sources such as MSRs
      or ACPI data, so tracking it down when troubleshooting a system can be
      non-trivial. It is worth saving in the kernel logs.
      
       # dmesg | grep 'Estimated ratio of average max'
       [   14.024036] smpboot: Estimated ratio of average max frequency by base frequency (times 1024): 1289
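
The logged number is a fixed-point value: the ratio is multiplied by 1024 so that it fits in an integer. A minimal sketch of that encoding follows, in Python for illustration only. The function names are hypothetical and the truncating division is an assumption, not something stated in the commit.

```python
SCALE = 1024  # the "times 1024" factor named in the log message


def encode_ratio(freq_max_khz: int, freq_base_khz: int) -> int:
    """Return freq_max/freq_base as an integer scaled by 1024 (truncating)."""
    if freq_base_khz <= 0:
        raise ValueError("freq_base must be positive")
    return freq_max_khz * SCALE // freq_base_khz


def decode_ratio(scaled: int) -> float:
    """Convert a logged, scaled value back to a plain ratio."""
    return scaled / SCALE


# Decode the value from the dmesg line above.
print(round(decode_ratio(1289), 3))
```

Read this way, the 1289 in the example log line corresponds to a ratio of roughly 1.26.
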
      Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Link: https://lkml.kernel.org/r/20201112182614.10700-4-ggherdovich@suse.cz
    • x86, sched: Use midpoint of max_boost and max_P for frequency invariance on AMD EPYC · 976df7e5
      Giovanni Gherdovich authored
      Frequency invariant accounting calculations need the ratio
      freq_curr/freq_max, but freq_max is unknown as it depends on dynamic power
      allocation between cores: AMD EPYC CPUs implement "Core Performance Boost".
      Three candidates are considered to estimate this value:
      
      - maximum non-boost frequency
      - maximum boost frequency
      - the mid point between the above two
      
      Experimental data on an AMD EPYC Zen2 machine slightly favors the third
      option, which is applied with this patch.
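
The three candidates can be written down in a few lines. This is only an illustrative sketch: the function name, the kHz unit and the sample frequencies are assumptions and do not come from the patch.

```python
def freq_max_candidates(max_p_khz: int, max_boost_khz: int) -> dict:
    """Return the three candidate freq_max estimates listed above."""
    if max_boost_khz < max_p_khz:
        raise ValueError("boost frequency should not be below max_P")
    return {
        "max_P": max_p_khz,                             # maximum non-boost frequency
        "max_boost": max_boost_khz,                     # maximum boost frequency
        "midpoint": (max_p_khz + max_boost_khz) // 2,   # option chosen by the patch
    }


# Illustrative frequencies only; they are not taken from the commit.
print(freq_max_candidates(2_000_000, 3_000_000)["midpoint"])
```
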
      
      The analysis uses the ondemand cpufreq governor as baseline, and compares
      it with schedutil in a number of configurations. Using the freq_max value
      described above offers a moderate advantage in performance and efficiency:
      
      sugov-max (freq_max=max_boost) performs the worst on tbench: lower
      throughput and lower efficiency than the other invariant-schedutil
      options (see "Data Overview" below). Note that tbench is generally a
      problematic case, as no schedutil version currently beats ondemand.
      
      sugov-P0 (freq_max=max_P) is the worst on dbench, while the other sugov
      variants can surpass ondemand with less filesystem latency and slightly
      increased efficiency.
      
      1. DATA OVERVIEW
      2. DETAILED PERFORMANCE TABLES
      3. POWER CONSUMPTION TABLE
      
      1. DATA OVERVIEW
      ================
      
      sugov-noinv : non-invariant schedutil governor
      sugov-max   : invariant schedutil, freq_max=max_boost
      sugov-mid   : invariant schedutil, freq_max=midpoint
      sugov-P0    : invariant schedutil, freq_max=max_P
      perfgov     : performance governor
      
      driver      : acpi_cpufreq
      machine     : AMD EPYC 7742 (Zen2, aka "Rome"), dual socket,
                    128 cores / 256 threads, SATA SSD storage, 250G of memory,
                    XFS filesystem
      
      Benchmarks are described in the next section.
      Tilde (~) means the value is the same as baseline.
      
      - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
                  ondemand  perfgov  sugov-noinv  sugov-max  sugov-mid  sugov-P0  better if
      - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
                                              PERFORMANCE RATIOS
      tbench        1.00       1.44       0.90       0.87       0.93       0.93      higher
      dbench        1.00       0.91       0.95       0.94       0.94       1.06      lower
      kernbench     1.00       0.93       ~          ~          ~          0.97      lower
      gitsource     1.00       0.66       0.97       0.96       ~          0.95      lower
      - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
                                          PERFORMANCE-PER-WATT RATIOS
      tbench        1.00       1.16       0.84       0.84       0.88       0.85      higher
      dbench        1.00       1.03       1.02       1.02       1.02       0.93      higher
      kernbench     1.00       1.05       ~          ~          ~          ~         higher
      gitsource     1.00       1.46       1.04       1.04       ~          1.05      higher
      
      2. DETAILED PERFORMANCE TABLES
      ==============================
      
      Benchmark          : tbench4 (i.e. dbench4 over the network, actually loopback)
      Varying parameter  : number of clients
      Unit               : MB/sec (higher is better)
      
                        5.9.0-ondemand (BASELINE)                   5.9.0-perfgov               5.9.0-sugov-noinv
      - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      Hmean  1        427.19  +- 0.16% (        )     778.35  +- 0.10% (  82.20%)     346.92  +- 0.14% ( -18.79%)
      Hmean  2        853.82  +- 0.09% (        )    1536.23  +- 0.03% (  79.93%)     694.36  +- 0.05% ( -18.68%)
      Hmean  4       1657.54  +- 0.12% (        )    2938.18  +- 0.12% (  77.26%)    1362.81  +- 0.11% ( -17.78%)
      Hmean  8       3301.87  +- 0.06% (        )    5679.10  +- 0.04% (  72.00%)    2693.35  +- 0.04% ( -18.43%)
      Hmean  16      6139.65  +- 0.05% (        )    9498.81  +- 0.04% (  54.71%)    4889.97  +- 0.17% ( -20.35%)
      Hmean  32     11170.28  +- 0.09% (        )   17393.25  +- 0.08% (  55.71%)    9104.55  +- 0.09% ( -18.49%)
      Hmean  64     19322.97  +- 0.17% (        )   31573.91  +- 0.08% (  63.40%)   18552.52  +- 0.40% (  -3.99%)
      Hmean  128    30383.71  +- 0.11% (        )   37416.91  +- 0.15% (  23.15%)   25938.70  +- 0.41% ( -14.63%)
      Hmean  256    31143.96  +- 0.41% (        )   30908.76  +- 0.88% (  -0.76%)   29754.32  +- 0.24% (  -4.46%)
      Hmean  512    30858.49  +- 0.26% (        )   38524.60  +- 1.19% (  24.84%)   42080.39  +- 0.56% (  36.37%)
      Hmean  1024   39187.37  +- 0.19% (        )   36213.86  +- 0.26% (  -7.59%)   39555.98  +- 0.12% (   0.94%)
      
                                  5.9.0-sugov-max                 5.9.0-sugov-mid                  5.9.0-sugov-P0
      - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      Hmean  1        352.59  +- 1.03% ( -17.46%)     352.08  +- 0.75% ( -17.58%)     352.31  +- 1.48% ( -17.53%)
      Hmean  2        697.32  +- 0.08% ( -18.33%)     700.16  +- 0.20% ( -18.00%)     696.79  +- 0.06% ( -18.39%)
      Hmean  4       1369.88  +- 0.04% ( -17.35%)    1369.72  +- 0.07% ( -17.36%)    1365.91  +- 0.05% ( -17.59%)
      Hmean  8       2696.79  +- 0.04% ( -18.33%)    2711.06  +- 0.04% ( -17.89%)    2715.10  +- 0.61% ( -17.77%)
      Hmean  16      4725.03  +- 0.03% ( -23.04%)    4875.65  +- 0.02% ( -20.59%)    4953.05  +- 0.28% ( -19.33%)
      Hmean  32      9231.65  +- 0.10% ( -17.36%)    8704.89  +- 0.27% ( -22.07%)   10562.02  +- 0.36% (  -5.45%)
      Hmean  64     15364.27  +- 0.19% ( -20.49%)   17786.64  +- 0.15% (  -7.95%)   19665.40  +- 0.22% (   1.77%)
      Hmean  128    42100.58  +- 0.13% (  38.56%)   34946.28  +- 0.13% (  15.02%)   38635.79  +- 0.06% (  27.16%)
      Hmean  256    30660.23  +- 1.08% (  -1.55%)   32307.67  +- 0.54% (   3.74%)   31153.27  +- 0.12% (   0.03%)
      Hmean  512    24604.32  +- 0.14% ( -20.27%)   40408.50  +- 1.10% (  30.95%)   38800.29  +- 1.23% (  25.74%)
      Hmean  1024   35535.47  +- 0.28% (  -9.32%)   41070.38  +- 2.56% (   4.81%)   31308.29  +- 2.52% ( -20.11%)
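
The percentages in parentheses appear to be the relative difference from the ondemand baseline, signed so that a positive number is an improvement whether the metric is higher-is-better (as here) or lower-is-better (as in the time-based tables). The sketch below illustrates that convention. It is inferred from the numbers, the function name is hypothetical, and "Hmean" is assumed to denote a harmonic mean.

```python
import statistics


def gain_percent(baseline: float, value: float, higher_is_better: bool) -> float:
    """Relative difference from the baseline, positive when `value` is better."""
    delta = value - baseline if higher_is_better else baseline - value
    return 100.0 * delta / baseline


# tbench, 1 client: ondemand 427.19 MB/sec vs perfgov 778.35 MB/sec.
print(round(gain_percent(427.19, 778.35, higher_is_better=True), 2))

# Harmonic mean of made-up throughput samples (not data from the commit).
print(statistics.harmonic_mean([40.0, 60.0]))
```

The first call reproduces the 82.20% reported for perfgov with one client.
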
      
      Benchmark          : dbench (filesystem stressor)
      Varying parameter  : number of clients
      Unit               : seconds (lower is better)
      
      NOTE-1: This dbench version measures the average latency of a set of filesystem
              operations, as we found the traditional dbench metric (throughput) to be
              misleading.
      NOTE-2: Due to high variability, we partition the original dataset and apply
              statistical bootstrapping (a resampling method). Accuracy is reported in the
              form of 95% confidence intervals.
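
For readers unfamiliar with bootstrapping, here is a generic percentile-bootstrap sketch for a 95% confidence interval of a mean. It is not the exact procedure used for these tables (the partitioning step is not modelled), and the sample latencies are made up.

```python
import random
import statistics


def bootstrap_ci(samples, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `samples`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(samples)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(samples, k=n)) for _ in range(n_resamples)
    )
    tail = (1.0 - confidence) / 2.0
    lo = means[int(tail * n_resamples)]
    hi = means[min(n_resamples - 1, int((1.0 - tail) * n_resamples))]
    return lo, hi


# Made-up latencies for illustration; not data from the commit.
latencies = [98.1, 99.4, 97.8, 100.2, 98.9, 99.0, 98.5, 99.7]
low, high = bootstrap_ci(latencies)
print(f"95% CI for the mean: [{low:.2f}, {high:.2f}]")
```
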
      
                        5.9.0-ondemand (BASELINE)                   5.9.0-perfgov               5.9.0-sugov-noinv
      - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      SubAmean  1         98.79  +- 0.92 (        )      83.36  +- 0.82 (  15.62%)      84.82  +- 0.92 (  14.14%)
      SubAmean  2        116.00  +- 0.89 (        )     102.12  +- 0.77 (  11.96%)     109.63  +- 0.89 (   5.49%)
      SubAmean  4        149.90  +- 1.03 (        )     132.12  +- 0.91 (  11.86%)     143.90  +- 1.15 (   4.00%)
      SubAmean  8        182.41  +- 1.13 (        )     159.86  +- 0.93 (  12.36%)     165.82  +- 1.03 (   9.10%)
      SubAmean  16       237.83  +- 1.23 (        )     219.46  +- 1.14 (   7.72%)     229.28  +- 1.19 (   3.59%)
      SubAmean  32       334.34  +- 1.49 (        )     309.94  +- 1.42 (   7.30%)     321.19  +- 1.36 (   3.93%)
      SubAmean  64       576.61  +- 2.16 (        )     540.75  +- 2.00 (   6.22%)     551.27  +- 1.99 (   4.39%)
      SubAmean  128     1350.07  +- 4.14 (        )    1205.47  +- 3.20 (  10.71%)    1280.26  +- 3.75 (   5.17%)
      SubAmean  256     3444.42  +- 7.97 (        )    3698.00 +- 27.43 (  -7.36%)    3494.14  +- 7.81 (  -1.44%)
      SubAmean  2048   39457.89 +- 29.01 (        )   34105.33 +- 41.85 (  13.57%)   39688.52 +- 36.26 (  -0.58%)
      
                                  5.9.0-sugov-max                 5.9.0-sugov-mid                  5.9.0-sugov-P0
      - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      SubAmean  1         85.68  +- 1.04 (  13.27%)      84.16  +- 0.84 (  14.81%)      83.99  +- 0.90 (  14.99%)
      SubAmean  2        108.42  +- 0.95 (   6.54%)     109.91  +- 1.39 (   5.24%)     112.06  +- 0.91 (   3.39%)
      SubAmean  4        136.90  +- 1.04 (   8.67%)     137.59  +- 0.93 (   8.21%)     136.55  +- 0.95 (   8.91%)
      SubAmean  8        163.15  +- 0.96 (  10.56%)     166.07  +- 1.02 (   8.96%)     165.81  +- 0.99 (   9.10%)
      SubAmean  16       224.86  +- 1.12 (   5.45%)     223.83  +- 1.06 (   5.89%)     230.66  +- 1.19 (   3.01%)
      SubAmean  32       320.51  +- 1.38 (   4.13%)     322.85  +- 1.49 (   3.44%)     321.96  +- 1.46 (   3.70%)
      SubAmean  64       553.25  +- 1.93 (   4.05%)     554.19  +- 2.08 (   3.89%)     562.26  +- 2.22 (   2.49%)
      SubAmean  128     1264.35  +- 3.72 (   6.35%)    1256.99  +- 3.46 (   6.89%)    2018.97 +- 18.79 ( -49.55%)
      SubAmean  256     3466.25  +- 8.25 (  -0.63%)    3450.58  +- 8.44 (  -0.18%)    5032.12 +- 38.74 ( -46.09%)
      SubAmean  2048   39133.10 +- 45.71 (   0.82%)   39905.95 +- 34.33 (  -1.14%)   53811.86 +-193.04 ( -36.38%)
      
      Benchmark          : kernbench (kernel compilation)
      Varying parameter  : number of jobs
      Unit               : seconds (lower is better)
      
                        5.9.0-ondemand (BASELINE)                   5.9.0-perfgov               5.9.0-sugov-noinv
      - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      Amean  2        471.71 +- 26.61% (        )     409.88 +- 16.99% (  13.11%)     430.63  +- 0.18% (   8.71%)
      Amean  4        211.87  +- 0.58% (        )     194.03  +- 0.74% (   8.42%)     215.33  +- 0.64% (  -1.63%)
      Amean  8        109.79  +- 1.27% (        )     101.43  +- 1.53% (   7.61%)     111.05  +- 1.95% (  -1.15%)
      Amean  16        59.50  +- 1.28% (        )      55.61  +- 1.35% (   6.55%)      59.65  +- 1.78% (  -0.24%)
      Amean  32        34.94  +- 1.22% (        )      32.36  +- 1.95% (   7.41%)      35.44  +- 0.63% (  -1.43%)
      Amean  64        22.58  +- 0.38% (        )      20.97  +- 1.28% (   7.11%)      22.41  +- 1.73% (   0.74%)
      Amean  128       17.72  +- 0.44% (        )      16.68  +- 0.32% (   5.88%)      17.65  +- 0.96% (   0.37%)
      Amean  256       16.44  +- 0.53% (        )      15.76  +- 0.32% (   4.18%)      16.76  +- 0.60% (  -1.93%)
      Amean  512       16.54  +- 0.21% (        )      15.62  +- 0.41% (   5.53%)      16.84  +- 0.85% (  -1.83%)
      
                                  5.9.0-sugov-max                 5.9.0-sugov-mid                  5.9.0-sugov-P0
      - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      Amean  2        421.30  +- 0.24% (  10.69%)     419.26  +- 0.15% (  11.12%)     414.38  +- 0.33% (  12.15%)
      Amean  4        217.81  +- 5.53% (  -2.80%)     211.63  +- 0.99% (   0.12%)     208.43  +- 0.47% (   1.63%)
      Amean  8        108.80  +- 0.43% (   0.90%)     108.48  +- 1.44% (   1.19%)     108.59  +- 3.08% (   1.09%)
      Amean  16        58.84  +- 0.74% (   1.12%)      58.37  +- 0.94% (   1.91%)      57.78  +- 0.78% (   2.90%)
      Amean  32        34.04  +- 2.00% (   2.59%)      34.28  +- 1.18% (   1.91%)      33.98  +- 2.21% (   2.75%)
      Amean  64        22.22  +- 1.69% (   1.60%)      22.27  +- 1.60% (   1.38%)      22.25  +- 1.41% (   1.47%)
      Amean  128       17.55  +- 0.24% (   0.97%)      17.53  +- 0.94% (   1.04%)      17.49  +- 0.43% (   1.30%)
      Amean  256       16.51  +- 0.46% (  -0.40%)      16.48  +- 0.48% (  -0.19%)      16.44  +- 1.21% (   0.00%)
      Amean  512       16.50  +- 0.35% (   0.19%)      16.35  +- 0.42% (   1.14%)      16.37  +- 0.33% (   0.99%)
      
      Benchmark          : gitsource (time to run the git unit test suite)
      Varying parameter  : none
      Unit               : seconds (lower is better)
      
                        5.9.0-ondemand (BASELINE)                   5.9.0-perfgov               5.9.0-sugov-noinv
      - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      Amean          1035.76  +- 0.30% (        )     688.21  +- 0.04% (  33.56%)    1003.85  +- 0.14% (   3.08%)
      
                                  5.9.0-sugov-max                 5.9.0-sugov-mid                  5.9.0-sugov-P0
      - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      Amean           995.82  +- 0.08% (   3.86%)    1011.98  +- 0.03% (   2.30%)     986.87  +- 0.19% (   4.72%)
      
      3. POWER CONSUMPTION TABLE
      ==========================
      
      Average power consumption (watts).
      
      - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
                  ondemand  perfgov  sugov-noinv  sugov-max  sugov-mid  sugov-P0
      - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      tbench4     227.25     281.83     244.17     236.76     241.50     247.99
      dbench4     151.97     161.87     157.08     158.10     158.06     153.73
      kernbench   162.78     167.22     162.90     164.19     164.65     164.72
      gitsource   133.65     139.00     133.04     134.43     134.18     134.32
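
The performance-per-watt ratios in the overview appear to follow from the performance ratios and the average power figures above. The sketch below shows that arithmetic. It is inferred from the tables rather than stated in the commit, and the function name is hypothetical.

```python
def perf_per_watt_ratio(perf_ratio, power_w, baseline_power_w, higher_is_better):
    """Performance-per-watt relative to the baseline governor."""
    # For time-like metrics (lower is better), performance is the inverse ratio.
    perf = perf_ratio if higher_is_better else 1.0 / perf_ratio
    return perf / (power_w / baseline_power_w)


# perfgov vs ondemand, using ratios and wattages from the tables.
print(round(perf_per_watt_ratio(1.44, 281.83, 227.25, True), 2))    # tbench
print(round(perf_per_watt_ratio(0.66, 139.00, 133.65, False), 2))   # gitsource
```

These two calls give 1.16 and 1.46, matching the perfgov column of the performance-per-watt overview.
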
      Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Link: https://lkml.kernel.org/r/20201112182614.10700-3-ggherdovich@suse.cz
    • x86, sched: Calculate frequency invariance for AMD systems · 41ea6672
      Nathan Fontenot authored
      This is the first pass in creating the ability to calculate the
      frequency invariance on AMD systems. This approach uses the CPPC
      highest performance and nominal performance values that range from
      0 - 255 instead of a high and base frequency. This is because we do
      not have the ability on AMD to get a highest frequency value.
      
      On AMD systems the highest performance and nominal performance
      values do correspond to the highest and base frequencies for the system
      so using them should produce an appropriate ratio but some tweaking
      is likely necessary.
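
As a rough illustration of the idea, the ratio can be formed from the two CPPC values in the same fixed-point style used elsewhere in this series. The function name, the 1024 scale and the sample values are assumptions made for this sketch and are not read from the patch.

```python
SCALE = 1024  # fixed-point scale, assumed here for illustration


def cppc_perf_ratio(highest_perf: int, nominal_perf: int) -> int:
    """Scaled ratio highest_perf/nominal_perf for CPPC values in 0..255."""
    if not 0 < nominal_perf <= 255:
        raise ValueError("nominal_perf must be in 1..255")
    if not 0 <= highest_perf <= 255:
        raise ValueError("highest_perf must be in 0..255")
    return highest_perf * SCALE // nominal_perf


# Hypothetical CPPC values; real systems report their own.
print(cppc_perf_ratio(200, 160))
```
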
      
      Due to CPPC being initialized later in boot than when the frequency
      invariant calculation is currently made, I had to create a callback
      from the CPPC init code to do the calculation after we have CPPC
      data.
      
      Special thanks to "kernel test robot <lkp@intel.com>" for reporting that
      compilation of drivers/acpi/cppc_acpi.c is conditional on
      CONFIG_ACPI_CPPC_LIB, not just CONFIG_ACPI.
      
      [ ggherdovich@suse.cz: made safe under CPU hotplug, edited changelog. ]
      Signed-off-by: Nathan Fontenot <nathan.fontenot@amd.com>
      Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Link: https://lkml.kernel.org/r/20201112182614.10700-2-ggherdovich@suse.cz
  2. 27 Nov, 2020 1 commit
  3. 26 Nov, 2020 2 commits
  4. 25 Nov, 2020 1 commit
    • Merge tag 'media/v5.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media · fa02fcd9
      Linus Torvalds authored
      Pull media fixes from Mauro Carvalho Chehab:
      
       - a rand Kconfig fixup for mtk-vcodec
      
       - a fix at h264 handling at cedrus codec driver
      
       - some warning fixes when config PM is not enabled at marvell-ccic
      
       - two fixes at venus codec driver: one related to codec profile and the
         other one related to a bad error path which causes an OOPS on module
         re-bind
      
      * tag 'media/v5.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
        media: venus: pm_helpers: Fix kernel module reload
        media: venus: venc: Fix setting of profile and level
        media: cedrus: h264: Fix check for presence of scaling matrix
        media: media/platform/marvell-ccic: fix warnings when CONFIG_PM is not enabled
        media: mtk-vcodec: fix build breakage when one of VPU or SCP is enabled
        media: mtk-vcodec: move firmware implementations into their own files
  5. 24 Nov, 2020 12 commits
    • Merge tag '5.10-rc5-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6 · 127c501a
      Linus Torvalds authored
      Pull cifs fixes from Steve French:
       "Four smb3 fixes for stable: one fixes a memleak, the other three
        address a problem found with decryption offload that can cause a use
        after free"
      
      * tag '5.10-rc5-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
        smb3: Handle error case during offload read path
        smb3: Avoid Mid pending list corruption
        smb3: Call cifs reconnect from demultiplex thread
        cifs: fix a memleak with modefromsid
    • mm: fix VM_BUG_ON(PageTail) and BUG_ON(PageWriteback) · 073861ed
      Hugh Dickins authored
      Twice now, when exercising ext4 looped on shmem huge pages, I have crashed
      on the PF_ONLY_HEAD check inside PageWaiters(): ext4_finish_bio() calling
      end_page_writeback() calling wake_up_page() on tail of a shmem huge page,
      no longer an ext4 page at all.
      
      The problem is that PageWriteback is not accompanied by a page reference
      (as the NOTE at the end of test_clear_page_writeback() acknowledges): as
      soon as TestClearPageWriteback has been done, that page could be removed
      from page cache, freed, and reused for something else by the time that
      wake_up_page() is reached.
      
      https://lore.kernel.org/linux-mm/20200827122019.GC14765@casper.infradead.org/
      Matthew Wilcox suggested avoiding or weakening the PageWaiters() tail
      check; but I'm paranoid about even looking at an unreferenced struct page,
      lest its memory might itself have already been reused or hotremoved (and
      wake_up_page_bit() may modify that memory with its ClearPageWaiters()).
      
      Then on crashing a second time, realized there's a stronger reason against
      that approach.  If my testing just occasionally crashes on that check,
      when the page is reused for part of a compound page, wouldn't it be much
      more common for the page to get reused as an order-0 page before reaching
      wake_up_page()?  And on rare occasions, might that reused page already be
      marked PageWriteback by its new user, and already be waited upon?  What
      would that look like?
      
      It would look like BUG_ON(PageWriteback) after wait_on_page_writeback()
      in write_cache_pages() (though I have never seen that crash myself).
      
      Matthew Wilcox explaining this to himself:
       "page is allocated, added to page cache, dirtied, writeback starts,
      
        --- thread A ---
        filesystem calls end_page_writeback()
              test_clear_page_writeback()
        --- context switch to thread B ---
        truncate_inode_pages_range() finds the page, it doesn't have writeback set,
        we delete it from the page cache.  Page gets reallocated, dirtied, writeback
        starts again.  Then we call write_cache_pages(), see
        PageWriteback() set, call wait_on_page_writeback()
        --- context switch back to thread A ---
        wake_up_page(page, PG_writeback);
        ... thread B is woken, but because the wakeup was for the old use of
        the page, PageWriteback is still set.
      
        Devious"
      
      And prior to 2a9127fc ("mm: rewrite wait_on_page_bit_common() logic")
      this would have been much less likely: before that, wake_page_function()'s
      non-exclusive case would stop walking and not wake if it found Writeback
      already set again; whereas now the non-exclusive case proceeds to wake.
      
      I have not thought of a fix that does not add a little overhead: the
      simplest fix is for end_page_writeback() to get_page() before calling
      test_clear_page_writeback(), then put_page() after wake_up_page().
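
The effect of taking a reference around the clear-and-wake sequence can be illustrated with a toy, single-threaded model. It replays the problematic interleaving step by step and is in no way kernel code: the `Page` class, its `generation` counter and the helper names are hypothetical, and only get_page()/put_page() correspond to names in the text above.

```python
class Page:
    """Toy stand-in for struct page: refcount, writeback flag, reuse counter."""

    def __init__(self):
        self.refcount = 1       # reference held by the page cache
        self.writeback = True   # writeback in progress for the first user
        self.generation = 0     # bumped each time the page is freed and reused

    def get(self):
        self.refcount += 1

    def put(self):
        self.refcount -= 1
        if self.refcount == 0:
            # Freed, then immediately reused by someone who starts writeback.
            self.generation += 1
            self.refcount = 1
            self.writeback = True


def truncate(page):
    """Racing thread: removes the page from the page cache."""
    page.put()


def end_writeback(page, hold_reference):
    """Return True if the wakeup still targets the page's original use."""
    gen = page.generation
    if hold_reference:
        page.get()                    # the fix: pin the page first
    page.writeback = False            # models TestClearPageWriteback
    truncate(page)                    # other thread runs in the window
    wake_hits_original = page.generation == gen   # models wake_up_page()
    if hold_reference:
        page.put()                    # drop the pin after the wakeup
    return wake_hits_original


print(end_writeback(Page(), hold_reference=False))  # False: wakeup hits a reused page
print(end_writeback(Page(), hold_reference=True))   # True: page pinned until after wakeup
```
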
      
      Was there a chance of missed wakeups before, since a page freed before
      reaching wake_up_page() would have PageWaiters cleared?  I think not,
      because each waiter does hold a reference on the page.  This bug comes
      when the old use of the page, the one we do TestClearPageWriteback on,
      had *no* waiters, so no additional page reference beyond the page cache
      (and whoever racily freed it).  The reuse of the page has a waiter
      holding a reference, and its own PageWriteback set; but the belated
      wake_up_page() has woken the reuse to hit that BUG_ON(PageWriteback).
      
      Reported-by: syzbot+3622cea378100f45d59f@syzkaller.appspotmail.com
      Reported-by: Qian Cai <cai@lca.pw>
      Fixes: 2a9127fc ("mm: rewrite wait_on_page_bit_common() logic")
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: stable@vger.kernel.org # v5.8+
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Merge tag 's390-5.10-5' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux · 80145ac2
      Linus Torvalds authored
      Pull s390 fix from Heiko Carstens:
       "Disable interrupts when restoring fpu and vector registers, otherwise
        KVM guests might see corrupted register contents"
      
      * tag 's390-5.10-5' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
        s390: fix fpu restore in entry.S
    • Merge tag 'arc-5.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc · b1489422
      Linus Torvalds authored
      Pull ARC fixes from Vineet Gupta:
       "A couple more stack unwinder related fixes:
      
         - More stack unwinding updates
      
         - Misc minor fixes"
      
      * tag 'arc-5.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc:
        ARC: stack unwinding: reorganize how initial register state setup
        ARC: stack unwinding: don't assume non-current task is sleeping
        ARC: mm: fix spelling mistakes
        ARC: bitops: Remove unecessary operation and value
    • irq_work: Optimize irq_work_single() · 2914b0ba
      Peter Zijlstra authored
      Trade one atomic op for a full memory barrier.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
    • smp: Cleanup smp_call_function*() · 545b8c8d
      Peter Zijlstra authored
      Get rid of the __call_single_node union and cleanup the API a little
      to avoid external code relying on the structure layout as much.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
    • irq_work: Cleanup · 7a9f50a0
      Peter Zijlstra authored
      Get rid of the __call_single_node union and clean up the API a little
      to avoid external code relying on the structure layout as much.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
    • sched: Limit the amount of NUMA imbalance that can exist at fork time · 23e6082a
      Mel Gorman authored
      At fork time currently, a local node can be allowed to fill completely,
      leaving the periodic load balancer to fix the problem. This can be
      problematic in cases where a task creates lots of threads that idle until
      woken as part of a worker pool, causing a memory bandwidth problem.
      
      However, a "real" workload suffers badly from this behaviour. The workload
      in question is mostly NUMA aware but spawns large numbers of threads
      that act as a worker pool that can be called from anywhere. These need
      to spread early to get reasonable behaviour.
      
      This patch limits how much a local node can fill before spilling over
      to another node and it will not be a universal win. Specifically,
      very short-lived workloads that fit within a NUMA node would prefer
      the memory bandwidth.
      
      As I cannot describe the "real" workload, the best proxy measure I found
      for illustration was a page fault microbenchmark. It's not representative
      of the workload but demonstrates the hazard of the current behaviour.
      
      pft timings
                                       5.10.0-rc2             5.10.0-rc2
                                imbalancefloat-v2          forkspread-v2
      Amean     elapsed-1        46.37 (   0.00%)       46.05 *   0.69%*
      Amean     elapsed-4        12.43 (   0.00%)       12.49 *  -0.47%*
      Amean     elapsed-7         7.61 (   0.00%)        7.55 *   0.81%*
      Amean     elapsed-12        4.79 (   0.00%)        4.80 (  -0.17%)
      Amean     elapsed-21        3.13 (   0.00%)        2.89 *   7.74%*
      Amean     elapsed-30        3.65 (   0.00%)        2.27 *  37.62%*
      Amean     elapsed-48        3.08 (   0.00%)        2.13 *  30.69%*
      Amean     elapsed-79        2.00 (   0.00%)        1.90 *   4.95%*
      Amean     elapsed-80        2.00 (   0.00%)        1.90 *   4.70%*
      
      This is showing the time to fault regions belonging to threads. The target
      machine has 80 logical CPUs and two nodes. Note the ~30% gain when the
      machine is approximately at the point where one node becomes fully utilised.
      The slower results are borderline noise.
      
      Kernel building shows similar benefits around the same balance point.
      Generally performance was either neutral or better in the tests conducted.
      The main consideration with this patch is the point where fork stops
      spreading a task so some workloads may benefit from different balance
      points but it would be a risky tuning parameter.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
      Link: https://lkml.kernel.org/r/20201120090630.3286-5-mgorman@techsingularity.net
    • sched/numa: Allow a floating imbalance between NUMA nodes · 7d2b5dd0
      Mel Gorman authored
      Currently, an imbalance is only allowed when a destination node
      is almost completely idle. This solved one basic class of problems
      and was the cautious approach.
      
      This patch revisits the possibility that NUMA nodes can be imbalanced
      until 25% of the CPUs are occupied. The reasoning behind 25% is somewhat
      superficial -- it's half the cores when HT is enabled.  At higher
      utilisations, balancing should continue as normal and keep things even
      until scheduler domains are fully busy or over utilised.
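
The 25% rule can be sketched as a simple predicate. This is an illustration of the threshold described above, not the scheduler's code: the function and parameter names are hypothetical, and what exactly counts as "the CPUs" is left abstract.

```python
def allow_numa_imbalance(nr_running: int, nr_cpus: int) -> bool:
    """True while fewer than 25% of the CPUs under consideration are occupied."""
    return nr_running < nr_cpus // 4


# Hypothetical 40-CPU node (20 cores with HT): the threshold is 10 tasks,
# i.e. half the cores.
print(allow_numa_imbalance(9, 40), allow_numa_imbalance(10, 40))
```

With these inputs the imbalance is tolerated at 9 running tasks and no longer at 10.
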
      
      Note that this is not expected to be a universal win. Any benchmark
      that prefers spreading as wide as possible with limited communication
      will favour the old behaviour as there is more memory bandwidth.
      Workloads that communicate heavily in pairs such as netperf or tbench
      benefit. For the tests I ran, the vast majority of workloads saw
      a benefit so it seems to be a worthwhile trade-off.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
      Link: https://lkml.kernel.org/r/20201120090630.3286-4-mgorman@techsingularity.net
    • sched: Avoid unnecessary calculation of load imbalance at clone time · 5c339005
      Mel Gorman authored
      In find_idlest_group(), the load imbalance is only relevant when the group
      is either overloaded or fully busy but it is calculated unconditionally.
      This patch moves the imbalance calculation to the context where it is required.
      Technically, it is a micro-optimisation but really the benefit is avoiding
      confusing one type of imbalance with another depending on the group_type
      in the next patch.
      
      No functional change.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
      Link: https://lkml.kernel.org/r/20201120090630.3286-3-mgorman@techsingularity.net
    • sched/numa: Rename nr_running and break out the magic number · abeae76a
      Mel Gorman authored
      This is simply a preparation patch to make the following patches easier
      to read. No functional change.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
      Link: https://lkml.kernel.org/r/20201120090630.3286-2-mgorman@techsingularity.net
    • sched: Make migrate_disable/enable() independent of RT · 74d862b6
      Thomas Gleixner authored
      Now that the scheduler can deal with migrate disable properly, there is no
      real compelling reason to make it only available for RT.
      
      There are quite a few code paths which needlessly disable preemption in
      order to prevent migration and some constructs like kmap_atomic() enforce
      it implicitly.
      
      Making it available independent of RT allows providing a preemptible
      variant of kmap_atomic() and makes the code more consistent in general.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Grudgingly-Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lore.kernel.org/r/20201118204007.269943012@linutronix.de
      74d862b6
  6. 23 Nov, 2020 4 commits
  7. 22 Nov, 2020 17 commits
    • Linus Torvalds's avatar
      Linux 5.10-rc5 · 418baf2c
      Linus Torvalds authored
      418baf2c
    • Linus Torvalds's avatar
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid · d5530d82
      Linus Torvalds authored
      Pull HID fixes from Jiri Kosina:
      
       - Various functionality / regression fixes for Logitech devices from
         Hans de Goede
      
       - Fix for (recently added) GPIO support in mcp2221 driver from Lars
         Povlsen
      
       - Power management handling fix/quirk in i2c-hid driver for certain
         BIOSes that have a strange approach to power-cycle from Hans de Goede
      
       - a few device ID additions and device-specific quirks
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid:
        HID: logitech-dj: Fix Dinovo Mini when paired with a MX5x00 receiver
        HID: logitech-dj: Fix an error in mse_bluetooth_descriptor
        HID: Add Logitech Dinovo Edge battery quirk
        HID: logitech-hidpp: Add HIDPP_CONSUMER_VENDOR_KEYS quirk for the Dinovo Edge
        HID: logitech-dj: Handle quad/bluetooth keyboards with a builtin trackpad
        HID: add HID_QUIRK_INCREMENT_USAGE_ON_DUPLICATE for Gamevice devices
        HID: mcp2221: Fix GPIO output handling
        HID: hid-sensor-hub: Fix issue with devices with no report ID
        HID: i2c-hid: Put ACPI enumerated devices in D3 on shutdown
        HID: add support for Sega Saturn
        HID: cypress: Support Varmilo Keyboards' media hotkeys
        HID: ite: Replace ABS_MISC 120/121 events with touchpad on/off keypresses
        HID: logitech-hidpp: Add PID for MX Anywhere 2
        HID: uclogic: Add ID for Trust Flex Design Tablet
      d5530d82
    • Linus Torvalds's avatar
      Merge tag 'sched-urgent-2020-11-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · f4b936f5
      Linus Torvalds authored
      Pull scheduler fixes from Thomas Gleixner:
       "A couple of scheduler fixes:
      
         - Make the conditional update of the overutilized state work
           correctly by caching the relevant flags state before overwriting
           them and checking them afterwards.
      
         - Fix a data race in the wakeup path which caused loadavg on ARM64
           platforms to become a random number generator.
      
         - Fix the ordering of the iowaiter accounting operations so it can't
           be decremented before it is incremented.
      
         - Fix a bug in the deadline scheduler vs. priority inheritance when a
           non-deadline task A has inherited the parameters of a deadline task
           B and then blocks on a non-deadline task C.
      
           The second inheritance step used the static deadline parameters of
           task A, which are usually 0, instead of further propagating task
           B's parameters. The zero initialized parameters trigger a bug in
           the deadline scheduler"
      
      * tag 'sched-urgent-2020-11-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        sched/deadline: Fix priority inheritance with multiple scheduling classes
        sched: Fix rq->nr_iowait ordering
        sched: Fix data-race in wakeup
        sched/fair: Fix overutilized update in enqueue_task_fair()
      f4b936f5
    • Linus Torvalds's avatar
      Merge tag 'perf-urgent-2020-11-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 48da3305
      Linus Torvalds authored
      Pull perf fix from Thomas Gleixner:
       "A single fix for the x86 perf sysfs interfaces which used kobject
        attributes instead of device attributes and therefore made clang's
        control flow integrity checker upset"
      
      * tag 'perf-urgent-2020-11-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        perf/x86: fix sysfs type mismatches
      48da3305
    • Linus Torvalds's avatar
      Merge tag 'locking-urgent-2020-11-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 855cf1ee
      Linus Torvalds authored
      Pull locking fix from Thomas Gleixner:
       "A single fix for lockdep which makes the recursion protection cover
        graph lock/unlock"
      
      * tag 'locking-urgent-2020-11-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        lockdep: Put graph lock/unlock under lock_recursion protection
      855cf1ee
    • Linus Torvalds's avatar
      Merge tag 'efi-urgent-for-v5.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 68d3fa23
      Linus Torvalds authored
      Pull EFI fixes from Borislav Petkov:
       "Forwarded EFI fixes from Ard Biesheuvel:
      
         - fix memory leak in efivarfs driver
      
         - fix HYP mode issue in 32-bit ARM version of the EFI stub when built
           in Thumb2 mode
      
         - avoid leaking EFI pgd pages on allocation failure"
      
      * tag 'efi-urgent-for-v5.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        efi/x86: Free efi_pgd with free_pages()
        efivarfs: fix memory leak in efivarfs_create()
        efi/arm: set HSCTLR Thumb2 bit correctly for HVC calls from HYP
      68d3fa23
    • Linus Torvalds's avatar
      Merge tag 'x86_urgent_for_v5.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 7d53be55
      Linus Torvalds authored
      Pull x86 fixes from Borislav Petkov:
      
       - An IOMMU VT-d build fix when CONFIG_PCI_ATS=n along with a revert of
         same because the proper one is going through the IOMMU tree (Thomas
         Gleixner)
      
       - An Intel microcode loader fix to save the correct microcode patch to
         apply during resume (Chen Yu)
      
       - A fix to not access user memory of other processes when dumping
         opcode bytes (Thomas Gleixner)
      
      * tag 'x86_urgent_for_v5.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        Revert "iommu/vt-d: Take CONFIG_PCI_ATS into account"
        x86/dumpstack: Do not try to access user space code of other tasks
        x86/microcode/intel: Check patch signature before saving microcode for early loading
        iommu/vt-d: Take CONFIG_PCI_ATS into account
      7d53be55
    • Linus Torvalds's avatar
      Merge branch 'akpm' (patches from Andrew) · 4a51c60a
      Linus Torvalds authored
      Merge misc fixes from Andrew Morton:
       "8 patches.
      
        Subsystems affected by this patch series: mm (madvise, pagemap,
        readahead, memcg, userfaultfd), kbuild, and vfs"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        mm: fix madvise WILLNEED performance problem
        libfs: fix error cast of negative value in simple_attr_write()
        mm/userfaultfd: do not access vma->vm_mm after calling handle_userfault()
        mm: memcg/slab: fix root memcg vmstats
        mm: fix readahead_page_batch for retry entries
        mm: fix phys_to_target_node() and memory_add_physaddr_to_nid() exports
        compiler-clang: remove version check for BPF Tracing
        mm/madvise: fix memory leak from process_madvise
      4a51c60a
    • Linus Torvalds's avatar
      Merge tag 'staging-5.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging · d27637ec
      Linus Torvalds authored
      Pull staging and IIO fixes from Greg KH:
       "Here are some small Staging and IIO driver fixes for 5.10-rc5. They
        include:
      
         - IIO fixes for reported regressions and problems
      
         - new device ids for IIO drivers
      
         - new device id for rtl8723bs driver
      
         - staging ralink driver Kconfig dependency fix
      
         - staging mt7621-pci bus resource fix
      
        All of these have been in linux-next all week with no reported issues"
      
      * tag 'staging-5.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
        iio: accel: kxcjk1013: Add support for KIOX010A ACPI DSM for setting tablet-mode
        iio: accel: kxcjk1013: Replace is_smo8500_device with an acpi_type enum
        docs: ABI: testing: iio: stm32: remove re-introduced unsupported ABI
        iio: light: fix kconfig dependency bug for VCNL4035
        iio/adc: ingenic: Fix AUX/VBAT readings when touchscreen is used
        iio/adc: ingenic: Fix battery VREF for JZ4770 SoC
        staging: rtl8723bs: Add 024c:0627 to the list of SDIO device-ids
        staging: ralink-gdma: fix kconfig dependency bug for DMA_RALINK
        staging: mt7621-pci: avoid to request pci bus resources
        iio: imu: st_lsm6dsx: set 10ms as min shub slave timeout
        counter/ti-eqep: Fix regmap max_register
        iio: adc: stm32-adc: fix a regression when using dma and irq
        iio: adc: mediatek: fix unset field
        iio: cros_ec: Use default frequencies when EC returns invalid information
      d27637ec
    • Linus Torvalds's avatar
      Merge tag 'tty-5.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty · de758035
      Linus Torvalds authored
      Pull tty fixes from Greg KH:
       "Here are some small tty/serial fixes for 5.10-rc5 that resolve some
        reported issues:
      
         - speakup crash when telling the kernel to use a device that isn't
           really there
      
         - imx serial driver fixes for reported problems
      
         - ar933x_uart driver fix for probe error handling path
      
        All have been in linux-next for a while with no reported issues"
      
      * tag 'tty-5.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
        serial: ar933x_uart: disable clk on error handling path in probe
        tty: serial: imx: keep console clocks always on
        speakup: Do not let the line discipline be used several times
        tty: serial: imx: fix potential deadlock
      de758035
    • Linus Torvalds's avatar
      Merge tag 'ext4_for_linus_fixes2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 · a7f07fc1
      Linus Torvalds authored
      Pull ext4 fixes from Ted Ts'o:
       "A final set of miscellaneous bug fixes for ext4"
      
      * tag 'ext4_for_linus_fixes2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
        ext4: fix bogus warning in ext4_update_dx_flag()
        jbd2: fix kernel-doc markups
        ext4: drop fast_commit from /proc/mounts
      a7f07fc1
    • David Howells's avatar
      afs: Fix speculative status fetch going out of order wrt to modifications · a9e5c87c
      David Howells authored
      When doing a lookup in a directory, the afs filesystem uses a bulk
      status fetch to speculatively retrieve the statuses of up to 48 other
      vnodes found in the same directory and it will then either update extant
      inodes or create new ones - effectively doing 'lookup ahead'.
      
      To avoid the possibility of deadlocking itself, however, the filesystem
      doesn't lock all of those inodes; rather just the directory inode is
      locked (by the VFS).
      
      When the operation completes, afs_inode_init_from_status() or
      afs_apply_status() is called, depending on whether the inode already
      exists, to commit the new status.
      
      A case exists, however, where the speculative status fetch operation may
      straddle a modification operation on one of those vnodes.  What can then
      happen is that the speculative bulk status RPC retrieves the old status,
      and whilst that is happening, the modification happens - which returns
      an updated status, then the modification status is committed, then we
      attempt to commit the speculative status.
      
      This results in something like the following being seen in dmesg:
      
      	kAFS: vnode modified {100058:861} 8->9 YFS.InlineBulkStatus
      
      showing that for vnode 861 on volume 100058, we saw YFS.InlineBulkStatus
      say that the vnode had data version 8 when we'd already recorded version
      9 due to a local modification.  This was causing the cache to be
      invalidated for that vnode when it shouldn't have been.  If it happens
      on a data file, this might lead to local changes being lost.
      
      Fix this by ignoring speculative status updates if the data version
      doesn't match the expected value.
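
      The check described above can be sketched in user-space C. This is a
      simplified model, not the afs code: the struct and function names are
      hypothetical, and "the expected value" is assumed here to be the local
      data version sampled when the speculative fetch was issued.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for an afs vnode. */
struct vnode_sketch {
	uint64_t data_version;	/* DV currently recorded locally */
};

/* Hypothetical, simplified stand-in for a fetched status record. */
struct status_sketch {
	uint64_t data_version;	/* DV reported by the server */
	bool speculative;	/* true if it came from a bulk "lookup ahead" */
	uint64_t dv_before;	/* assumption: local DV when the fetch was issued */
};

/*
 * Commit a fetched status to the vnode.
 * Returns true if the status was applied, false if it was ignored.
 */
static bool apply_status_sketch(struct vnode_sketch *v,
				const struct status_sketch *s)
{
	/* A modification raced with the speculative fetch: drop the stale status. */
	if (s->speculative && v->data_version != s->dv_before)
		return false;

	v->data_version = s->data_version;
	return true;
}
```

      With the vnode at version 8, a modification that commits version 9
      first causes the later speculative status (still reporting 8) to be
      ignored rather than triggering an invalidation.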
      
      Note that it is possible to get a DV regression if a volume gets
      restored from a backup - but we should get a callback break in such a
      case that should trigger a recheck anyway.  It might be worth checking
      the volume creation time in the volsync info and, if a change is
      observed in that (as would happen on a restore), invalidate all caches
      associated with the volume.
      
      Fixes: 5cf9dd55 ("afs: Prospectively look up extra files when doing a single lookup")
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a9e5c87c
    • Matthew Wilcox (Oracle)'s avatar
      mm: fix madvise WILLNEED performance problem · 66383800
      Matthew Wilcox (Oracle) authored
      The calculation of the end page index was incorrect, leading to a
      regression of 70% when running stress-ng.
      
      With this fix, we instead see a performance improvement of 3%.
      
      Fixes: e6e88712 ("mm: optimise madvise WILLNEED")
      Reported-by: kernel test robot <rong.a.chen@intel.com>
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Xing Zhengjun <zhengjun.xing@linux.intel.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: "Chen, Rong A" <rong.a.chen@intel.com>
      Link: https://lkml.kernel.org/r/20201109134851.29692-1-willy@infradead.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      66383800
    • Yicong Yang's avatar
      libfs: fix error cast of negative value in simple_attr_write() · 488dac0c
      Yicong Yang authored
      attr->set() takes a u64 value, but simple_strtoll() is used for the
      conversion.  This leads to an incorrect cast if the user inputs a
      negative value.
      
      Use kstrtoull() instead of simple_strtoll() to convert a string from
      the user to an unsigned value.  The former returns '-EINVAL' if it
      gets a negative value, but the latter can't handle the situation
      correctly.  Make 'val' unsigned long long, which is what kstrtoull()
      takes; this eliminates the compile warning on non-64-bit architectures.
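
      As a rough user-space illustration of the behaviour being relied on,
      the sketch below rejects negative input with -EINVAL instead of
      converting it. parse_u64_sketch() is a hypothetical helper built on
      the C library's strtoull(); it is not the kernel's kstrtoull() and
      only approximates its strictness.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical user-space helper, not the kernel's kstrtoull().
 * Parses an unsigned 64-bit value and refuses negative input, which plain
 * strtoull() would accept and wrap to a huge unsigned number.
 * Returns 0 on success, -EINVAL or -ERANGE on failure.
 */
static int parse_u64_sketch(const char *s, unsigned long long *res)
{
	char *end;
	unsigned long long val;

	while (isspace((unsigned char)*s))
		s++;
	if (*s == '-')
		return -EINVAL;

	errno = 0;
	val = strtoull(s, &end, 0);
	if (end == s)
		return -EINVAL;		/* no digits at all */
	if (errno == ERANGE)
		return -ERANGE;		/* value does not fit */
	if (*end == '\n')
		end++;			/* tolerate a single trailing newline */
	if (*end != '\0')
		return -EINVAL;		/* trailing garbage */

	*res = val;
	return 0;
}
```

      A caller would propagate the negative return value to the writer
      rather than passing a wrapped value on to the setter.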
      
      Fixes: f7b88631 ("fs/libfs.c: fix simple_attr_write() on 32bit machines")
      Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Link: https://lkml.kernel.org/r/1605341356-11872-1-git-send-email-yangyicong@hisilicon.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      488dac0c
    • Gerald Schaefer's avatar
      mm/userfaultfd: do not access vma->vm_mm after calling handle_userfault() · bfe8cc1d
      Gerald Schaefer authored
      Alexander reported a syzkaller / KASAN finding on s390, see below for
      complete output.
      
      In do_huge_pmd_anonymous_page(), the pre-allocated pagetable will be
      freed in some cases.  In the case of userfaultfd_missing(), this will
      happen after calling handle_userfault(), which might have released the
      mmap_lock.  Therefore, the following pte_free(vma->vm_mm, pgtable) will
      access an unstable vma->vm_mm, which could have been freed or re-used
      already.
      
      For all architectures other than s390 this will go w/o any negative
      impact, because pte_free() simply frees the page and ignores the
      passed-in mm.  The implementation for SPARC32 would also access
      mm->page_table_lock for pte_free(), but there is no THP support in
      SPARC32, so the buggy code path will not be used there.
      
      For s390, the mm->context.pgtable_list is being used to maintain the 2K
      pagetable fragments, and operating on an already freed or even re-used
      mm could result in various more or less subtle bugs due to list /
      pagetable corruption.
      
      Fix this by calling pte_free() before handle_userfault(), similar to how
      it is already done in __do_huge_pmd_anonymous_page() for the WRITE /
      non-huge_zero_page case.
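
      The ordering issue can be modelled with a toy user-space sketch. All
      types and functions below are hypothetical stand-ins, not the kernel's
      definitions; the model only assumes that the vma must not be touched
      once the handle_userfault() stand-in has run.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins; none of these are the kernel's real definitions. */
struct mm_sketch {
	int pgtables_freed;
};

struct vma_sketch {
	struct mm_sketch *vm_mm;
	bool valid;
};

/* Models pte_free(): on s390 it needs a valid mm to update its lists. */
static void pte_free_sketch(struct mm_sketch *mm)
{
	mm->pgtables_freed++;
}

/*
 * Models handle_userfault(): it may drop the lock, after which the vma
 * can be freed or reused. The toy marks the vma unusable to show that.
 */
static void handle_userfault_sketch(struct vma_sketch *vma)
{
	vma->valid = false;
	vma->vm_mm = NULL;
}

/* Fixed ordering: free the page table while the vma is still stable. */
static void fault_path_fixed_sketch(struct vma_sketch *vma)
{
	pte_free_sketch(vma->vm_mm);
	handle_userfault_sketch(vma);
	/* vma and vma->vm_mm must not be dereferenced from here on. */
}
```

      Swapping the two calls in fault_path_fixed_sketch() would hand a NULL
      mm to pte_free_sketch() in this toy, which stands in for the
      use-after-free reported by KASAN.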
      
      Commit 6b251fc9 ("userfaultfd: call handle_userfault() for
      userfaultfd_missing() faults") actually introduced both the
      do_huge_pmd_anonymous_page() and the __do_huge_pmd_anonymous_page()
      changes wrt calling handle_userfault(), but only in the latter case
      did it put the pte_free() before calling handle_userfault().
      
        BUG: KASAN: use-after-free in do_huge_pmd_anonymous_page+0xcda/0xd90 mm/huge_memory.c:744
        Read of size 8 at addr 00000000962d6988 by task syz-executor.0/9334
      
        CPU: 1 PID: 9334 Comm: syz-executor.0 Not tainted 5.10.0-rc1-syzkaller-07083-g4c9720875573 #0
        Hardware name: IBM 3906 M04 701 (KVM/Linux)
        Call Trace:
          do_huge_pmd_anonymous_page+0xcda/0xd90 mm/huge_memory.c:744
          create_huge_pmd mm/memory.c:4256 [inline]
          __handle_mm_fault+0xe6e/0x1068 mm/memory.c:4480
          handle_mm_fault+0x288/0x748 mm/memory.c:4607
          do_exception+0x394/0xae0 arch/s390/mm/fault.c:479
          do_dat_exception+0x34/0x80 arch/s390/mm/fault.c:567
          pgm_check_handler+0x1da/0x22c arch/s390/kernel/entry.S:706
          copy_from_user_mvcos arch/s390/lib/uaccess.c:111 [inline]
          raw_copy_from_user+0x3a/0x88 arch/s390/lib/uaccess.c:174
          _copy_from_user+0x48/0xa8 lib/usercopy.c:16
          copy_from_user include/linux/uaccess.h:192 [inline]
          __do_sys_sigaltstack kernel/signal.c:4064 [inline]
          __s390x_sys_sigaltstack+0xc8/0x240 kernel/signal.c:4060
          system_call+0xe0/0x28c arch/s390/kernel/entry.S:415
      
        Allocated by task 9334:
          slab_alloc_node mm/slub.c:2891 [inline]
          slab_alloc mm/slub.c:2899 [inline]
          kmem_cache_alloc+0x118/0x348 mm/slub.c:2904
          vm_area_dup+0x9c/0x2b8 kernel/fork.c:356
          __split_vma+0xba/0x560 mm/mmap.c:2742
          split_vma+0xca/0x108 mm/mmap.c:2800
          mlock_fixup+0x4ae/0x600 mm/mlock.c:550
          apply_vma_lock_flags+0x2c6/0x398 mm/mlock.c:619
          do_mlock+0x1aa/0x718 mm/mlock.c:711
          __do_sys_mlock2 mm/mlock.c:738 [inline]
          __s390x_sys_mlock2+0x86/0xa8 mm/mlock.c:728
          system_call+0xe0/0x28c arch/s390/kernel/entry.S:415
      
        Freed by task 9333:
          slab_free mm/slub.c:3142 [inline]
          kmem_cache_free+0x7c/0x4b8 mm/slub.c:3158
          __vma_adjust+0x7b2/0x2508 mm/mmap.c:960
          vma_merge+0x87e/0xce0 mm/mmap.c:1209
          userfaultfd_release+0x412/0x6b8 fs/userfaultfd.c:868
          __fput+0x22c/0x7a8 fs/file_table.c:281
          task_work_run+0x200/0x320 kernel/task_work.c:151
          tracehook_notify_resume include/linux/tracehook.h:188 [inline]
          do_notify_resume+0x100/0x148 arch/s390/kernel/signal.c:538
          system_call+0xe6/0x28c arch/s390/kernel/entry.S:416
      
        The buggy address belongs to the object at 00000000962d6948 which belongs to the cache vm_area_struct of size 200
        The buggy address is located 64 bytes inside of 200-byte region [00000000962d6948, 00000000962d6a10)
        The buggy address belongs to the page:
        page:00000000313a09fe refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x962d6
        flags: 0x3ffff00000000200(slab)
        raw: 3ffff00000000200 000040000257e080 0000000c0000000c 000000008020ba00
        raw: 0000000000000000 000f001e00000000 ffffffff00000001 0000000096959501
        page dumped because: kasan: bad access detected
        page->mem_cgroup:0000000096959501
      
        Memory state around the buggy address:
         00000000962d6880: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
         00000000962d6900: 00 fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb
        >00000000962d6980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                              ^
         00000000962d6a00: fb fb fc fc fc fc fc fc fc fc 00 00 00 00 00 00
         00000000962d6a80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        ==================================================================
      
      Fixes: 6b251fc9 ("userfaultfd: call handle_userfault() for userfaultfd_missing() faults")
      Reported-by: Alexander Egorenkov <egorenar@linux.ibm.com>
      Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: <stable@vger.kernel.org>	[4.3+]
      Link: https://lkml.kernel.org/r/20201110190329.11920-1-gerald.schaefer@linux.ibm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bfe8cc1d
    • Muchun Song's avatar
      mm: memcg/slab: fix root memcg vmstats · 8faeb1ff
      Muchun Song authored
      If we reparent the slab objects to the root memcg, when we free the slab
      object, we need to update the per-memcg vmstats to keep it correct for
      the root memcg.  Now this at least affects the vmstat of
      NR_KERNEL_STACK_KB for !CONFIG_VMAP_STACK when the thread stack size is
      smaller than the PAGE_SIZE.
      
      David said:
       "I assume that without this fix that the root memcg's vmstat would
        always be inflated if we reparented"
      
      Fixes: ec9f0238 ("mm: workingset: fix vmstat counters for shadow nodes")
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yafang Shao <laoar.shao@gmail.com>
      Cc: Chris Down <chris@chrisdown.name>
      Cc: <stable@vger.kernel.org>	[5.3+]
      Link: https://lkml.kernel.org/r/20201110031015.15715-1-songmuchun@bytedance.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8faeb1ff
    • Matthew Wilcox (Oracle)'s avatar
      mm: fix readahead_page_batch for retry entries · 4349a83a
      Matthew Wilcox (Oracle) authored
      Both btrfs and fuse have reported faults caused by seeing a retry entry
      instead of the page they were looking for.  This was caused by a missing
      check in the iterator.
      
      As can be seen in the panic log below, accessing 0x402 causes a
      panic.  In xarray.h, 0x402 means RETRY_ENTRY.
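
      The value 0x402 follows from how xarray.h encodes internal entries:
      the number is shifted left by two and the low bits are set to 10b, and
      the retry entry uses the number 256. The sketch below mirrors that
      encoding in user space and shows the kind of check an iterator needs.
      The names carry a _sketch suffix because they are illustrative
      re-implementations, and collect_pages_sketch() is a hypothetical
      stand-in for the batching iterator, not the kernel function.

```c
#include <stddef.h>
#include <stdint.h>

/* Internal entries: value shifted left by 2 with low bits 10b. */
static void *mk_internal_sketch(uintptr_t v)
{
	return (void *)((v << 2) | 2);
}

/* (256 << 2) | 2 == 0x402, the address seen in the panic log. */
#define RETRY_ENTRY_SKETCH mk_internal_sketch(256)

static int is_retry_sketch(const void *entry)
{
	return entry == RETRY_ENTRY_SKETCH;
}

/*
 * Hypothetical batching iterator: copies page pointers out of slots[] and
 * skips retry entries, so callers never dereference 0x402 as a page.
 * Returns the number of pointers written to out[].
 */
static size_t collect_pages_sketch(void *const *slots, size_t n,
				   void **out, size_t max)
{
	size_t i, got = 0;

	for (i = 0; i < n && got < max; i++) {
		if (is_retry_sketch(slots[i]))
			continue;
		out[got++] = slots[i];
	}
	return got;
}
```

      Without the is_retry_sketch() test, the second slot in a sequence such
      as { page, retry, page } would be returned to the filesystem as if it
      were a page pointer.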
      
        BUG: kernel NULL pointer dereference, address: 0000000000000402
        CPU: 14 PID: 306003 Comm: as Not tainted 5.9.0-1-amd64 #1 Debian 5.9.1-1
        Hardware name: Lenovo ThinkSystem SR665/7D2VCTO1WW, BIOS D8E106Q-1.01 05/30/2020
        RIP: 0010:fuse_readahead+0x152/0x470 [fuse]
        Code: 41 8b 57 18 4c 8d 54 10 ff 4c 89 d6 48 8d 7c 24 10 e8 d2 e3 28 f9 48 85 c0 0f 84 fe 00 00 00 44 89 f2 49 89 04 d4 44 8d 72 01 <48> 8b 10 41 8b 4f 1c 48 c1 ea 10 83 e2 01 80 fa 01 19 d2 81 e2 01
        RSP: 0018:ffffad99ceaebc50 EFLAGS: 00010246
        RAX: 0000000000000402 RBX: 0000000000000001 RCX: 0000000000000002
        RDX: 0000000000000000 RSI: ffff94c5af90bd98 RDI: ffffad99ceaebc60
        RBP: ffff94ddc1749a00 R08: 0000000000000402 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000000100 R12: ffff94de6c429ce0
        R13: ffff94de6c4d3700 R14: 0000000000000001 R15: ffffad99ceaebd68
        FS:  00007f228c5c7040(0000) GS:ffff94de8ed80000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000402 CR3: 0000001dbd9b4000 CR4: 0000000000350ee0
        Call Trace:
          read_pages+0x83/0x270
          page_cache_readahead_unbounded+0x197/0x230
          generic_file_buffered_read+0x57a/0xa20
          new_sync_read+0x112/0x1a0
          vfs_read+0xf8/0x180
          ksys_read+0x5f/0xe0
          do_syscall_64+0x33/0x80
          entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Fixes: 042124cc ("mm: add new readahead_control API")
      Reported-by: David Sterba <dsterba@suse.com>
      Reported-by: Wonhyuk Yang <vvghjk1234@gmail.com>
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20201103142852.8543-1-willy@infradead.org
      Link: https://lkml.kernel.org/r/20201103124349.16722-1-vvghjk1234@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4349a83a