1. 28 Oct, 2019 1 commit
    • virtio_ring: fix stalls for packed rings · 40ce7919
      Marvin Liu authored
      When VIRTIO_F_RING_EVENT_IDX is negotiated, virtio devices can
      use virtqueue_enable_cb_delayed_packed to reduce the number of device
      interrupts.  At the moment, this is the case for virtio-net when the
      napi_tx module parameter is set to false.
      
      In this case, the virtio driver selects an event offset and expects that
      the device will send a notification when rolling over the event offset
      in the ring.  However, if this roll-over happens before the event
      suppression structure update, the notification won't be sent. To address
      this race condition the driver needs to check whether the device rolled
      over the offset after updating the event suppression structure.
      
      With VIRTIO_F_RING_PACKED, the virtio driver did this by reading the
      flags field of the descriptor at the specified offset.
      
      Unfortunately, checking at the event offset isn't reliable: if
      descriptors are chained (e.g. when INDIRECT is off) not all descriptors
      are overwritten by the device, so it's possible that the device skipped
      the specific descriptor the driver is checking when writing out used
      descriptors. If this happens, the driver won't detect the race condition
      and will incorrectly expect the device to send a notification.
      
      For virtio-net, the result will be a TX queue stall, with the
      transmission getting blocked forever.
      
      With the packed ring, it isn't easy to find a location which is
      guaranteed to change upon the roll-over, except the next device
      descriptor, as described in the spec:
      
              Writes of device and driver descriptors can generally be
              reordered, but each side (driver and device) are only required to
              poll (or test) a single location in memory: the next device descriptor after
              the one they processed previously, in circular order.
      
      while this might be sub-optimal, let's do exactly this for now.
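
      The check described above can be sketched in userspace C. This is a
      minimal sketch and not the kernel code: struct packed_desc,
      desc_is_used() and more_used_after_enable() are hypothetical names,
      and only the AVAIL/USED flag bits of the packed ring layout are
      modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* AVAIL and USED flag bits of a packed-ring descriptor. */
#define DESC_F_AVAIL (1u << 7)
#define DESC_F_USED  (1u << 15)

/* Hypothetical, simplified descriptor: only the flags field matters here. */
struct packed_desc {
    uint16_t flags;
};

/*
 * A descriptor counts as used when its AVAIL and USED bits are equal and
 * both match the wrap counter the driver expects for used entries.
 */
static bool desc_is_used(const struct packed_desc *d, bool used_wrap_counter)
{
    bool avail = !!(d->flags & DESC_F_AVAIL);
    bool used = !!(d->flags & DESC_F_USED);

    return avail == used && used == used_wrap_counter;
}

/*
 * After updating the event suppression structure, poll the next descriptor
 * the device will write (last_used_idx) rather than the descriptor at the
 * event offset. A true result means the device already made progress, so
 * the notification may never arrive and the driver must process the ring.
 */
static bool more_used_after_enable(const struct packed_desc *ring,
                                   uint16_t ring_size,
                                   uint16_t last_used_idx,
                                   bool used_wrap_counter)
{
    assert(ring_size > 0 && last_used_idx < ring_size);
    return desc_is_used(&ring[last_used_idx], used_wrap_counter);
}
```

      The point of the design is that last_used_idx is the one location the
      device must write next, so it cannot be skipped the way a descriptor in
      the middle of a chain can.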
      
      Cc: stable@vger.kernel.org
      Cc: Jason Wang <jasowang@redhat.com>
      Fixes: f51f9826 ("virtio_ring: leverage event idx in packed ring")
      Signed-off-by: Marvin Liu <yong.liu@intel.com>
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
  2. 20 Oct, 2019 6 commits
  3. 19 Oct, 2019 33 commits
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 531e93d1
      Linus Torvalds authored
      Pull networking fixes from David Miller:
       "I was battling a cold after some recent trips, so quite a bit piled up
        meanwhile, sorry about that.
      
        Highlights:
      
         1) Fix fd leak in various bpf selftests, from Brian Vazquez.
      
         2) Fix crash in xsk when device doesn't support some methods, from
            Magnus Karlsson.
      
         3) Fix various leaks and use-after-free in rxrpc, from David Howells.
      
         4) Fix several SKB leaks due to confusion of who owns an SKB and who
            should release it in the llc code. From Eric Biggers.
      
         5) Kill a bunch of KCSAN warnings in TCP, from Eric Dumazet.
      
         6) Jumbo packets don't work after resume on r8169, as the BIOS resets
            the chip into non-jumbo mode during suspend. From Heiner Kallweit.
      
         7) Corrupt L2 header during MPLS push, from Davide Caratti.
      
         8) Prevent possible infinite loop in tc_ctl_action, from Eric
            Dumazet.
      
         9) Get register bits right in bcmgenet driver, based upon chip
            version. From Florian Fainelli.
      
        10) Fix mutex problems in microchip DSA driver, from Marek Vasut.
      
        11) Cure race between route lookup and invalidation in ipv4, from Wei
            Wang.
      
        12) Fix performance regression due to false sharing in 'net'
            structure, from Eric Dumazet"
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (145 commits)
        net: reorder 'struct net' fields to avoid false sharing
        net: dsa: fix switch tree list
        net: ethernet: dwmac-sun8i: show message only when switching to promisc
        net: aquantia: add an error handling in aq_nic_set_multicast_list
        net: netem: correct the parent's backlog when corrupted packet was dropped
        net: netem: fix error path for corrupted GSO frames
        macb: propagate errors when getting optional clocks
        xen/netback: fix error path of xenvif_connect_data()
        net: hns3: fix mis-counting IRQ vector numbers issue
        net: usb: lan78xx: Connect PHY before registering MAC
        vsock/virtio: discard packets if credit is not respected
        vsock/virtio: send a credit update when buffer size is changed
        mlxsw: spectrum_trap: Push Ethernet header before reporting trap
        net: ensure correct skb->tstamp in various fragmenters
        net: bcmgenet: reset 40nm EPHY on energy detect
        net: bcmgenet: soft reset 40nm EPHYs before MAC init
        net: phy: bcm7xxx: define soft_reset for 40nm EPHY
        net: bcmgenet: don't set phydev->link from MAC
        net: Update address for MediaTek ethernet driver in MAINTAINERS
        ipv4: fix race condition between route lookup and invalidation
        ...
    • net: reorder 'struct net' fields to avoid false sharing · 2a06b898
      Eric Dumazet authored
      Intel test robot reported a ~7% regression on TCP_CRR tests
      that they bisected to the cited commit.
      
      Indeed, every time a new TCP socket is created or deleted,
      the atomic counter net->count is touched (via get_net(net)
      and put_net(net) calls).
      
      So cpus might have to reload a contended cache line in
      net_hash_mix(net) calls.
      
      We need to reorder 'struct net' fields to move @hash_mix
      into a read-mostly cache line.
      
      We move fields that can be dirtied often into the
      first cache line.
      
      We probably will have to address in a followup patch
      the __randomize_layout that was added in linux-4.13,
      since this might break our placement choices.
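
      The placement idea can be checked with a toy structure. This is a
      minimal sketch under stated assumptions: struct toy_net is a
      hypothetical stand-in with only a few fields, the counter is a plain
      int instead of the kernel's atomic type, and a 64-byte cache line is
      assumed.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64 /* assumed cache line size in bytes */

/* Hypothetical stand-in for 'struct net'; the real structure is much larger. */
struct toy_net {
    /* First cache line: fields that are dirtied often. */
    int passive;
    int count; /* touched on every get/put of the namespace */

    /* Second cache line: read-mostly fields. */
    alignas(CACHE_LINE) uint32_t hash_mix;
};

/* Fail the build if hash_mix does not start a cache line of its own. */
static_assert(offsetof(struct toy_net, hash_mix) % CACHE_LINE == 0,
              "hash_mix must start on a cache line boundary");

static size_t cache_line_of(size_t offset)
{
    return offset / CACHE_LINE;
}

/* Returns nonzero if reads of hash_mix would contend with writes to count. */
static int hash_mix_shares_line_with_count(void)
{
    return cache_line_of(offsetof(struct toy_net, hash_mix)) ==
           cache_line_of(offsetof(struct toy_net, count));
}
```

      With a randomized layout the offsets are no longer under the author's
      control, which is why the message flags __randomize_layout as a
      follow-up concern.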
      
      Fixes: 355b9855 ("netns: provide pure entropy for net_hash_mix()")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: fix switch tree list · 50c7d2ba
      Vivien Didelot authored
      If there are multiple switch trees on the device, only the last one
      will be listed, because the arguments of list_add_tail are swapped.
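
      The effect of the swapped arguments can be reproduced with a small
      userspace re-implementation of a kernel-style circular list. This is a
      minimal sketch: the helpers below mimic, but are not, the kernel's
      <linux/list.h>, and register_two_trees() is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal userspace copy of a kernel-style circular doubly linked list. */
struct list_head {
    struct list_head *next, *prev;
};

static void init_list_head(struct list_head *h)
{
    h->next = h;
    h->prev = h;
}

/* Insert 'entry' just before 'head', i.e. at the tail of head's list. */
static void list_add_tail(struct list_head *entry, struct list_head *head)
{
    struct list_head *prev;

    assert(entry != NULL && head != NULL);
    prev = head->prev;
    entry->next = head;
    entry->prev = prev;
    prev->next = entry;
    head->prev = entry;
}

static size_t list_count(const struct list_head *head)
{
    size_t n = 0;

    for (const struct list_head *p = head->next; p != head; p = p->next)
        n++;
    return n;
}

/* Register two trees; 'swapped' reproduces the buggy argument order. */
static size_t register_two_trees(int swapped)
{
    struct list_head tree_list, a, b;
    struct list_head *trees[2] = { &a, &b };

    init_list_head(&tree_list);
    for (int i = 0; i < 2; i++) {
        init_list_head(trees[i]);
        if (swapped)
            list_add_tail(&tree_list, trees[i]); /* bug: head moved into the node's list */
        else
            list_add_tail(trees[i], &tree_list); /* fix: node appended to the global list */
    }
    return list_count(&tree_list);
}
```

      With the swapped order each call re-links the global head around the
      newest node only, so earlier trees are orphaned and the walk from the
      head finds just the last one.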
      
      Fixes: 83c0afae ("net: dsa: Add new binding implementation")
      Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ethernet: dwmac-sun8i: show message only when switching to promisc · 05908d72
      Mans Rullgard authored
      Printing the info message every time more than the max number of mac
      addresses are requested generates unnecessary log spam.  Showing it only
      when the hw is not already in promiscuous mode is equally informative
      without being annoying.
      Signed-off-by: Mans Rullgard <mans@mansr.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: aquantia: add an error handling in aq_nic_set_multicast_list · 3d00cf2f
      Chenwandun authored
      Add error handling to aq_nic_set_multicast_list(): the function may
      not work when hw_multicast_list_set fails. At the same time, this
      removes a gcc -Wunused-but-set-variable warning.
      Signed-off-by: Chenwandun <chenwandun@huawei.com>
      Reviewed-by: Igor Russkikh <igor.russkikh@aquantia.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'netem-fix-further-issues-with-packet-corruption' · 70873837
      David S. Miller authored
      Jakub Kicinski says:
      
      ====================
      net: netem: fix further issues with packet corruption
      
      This set is fixing two more issues with the netem packet corruption.
      
      First patch (which was previously posted) avoids NULL pointer dereference
      if the first frame gets freed due to allocation or checksum failure.
      v2 improves the clarity of the code a little as requested by Cong.
      
      Second patch ensures we don't return SUCCESS if the frame was in fact
      dropped. Thanks to this, the commit message for patch 1 no longer needs the
      "this will still break with a single-frame failure" disclaimer.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: netem: correct the parent's backlog when corrupted packet was dropped · e0ad032e
      Jakub Kicinski authored
      If packet corruption failed we jump to finish_segs and return
      NET_XMIT_SUCCESS. Seeing success will make the parent qdisc
      increment its backlog, which is incorrect: we need to return
      NET_XMIT_DROP.
      
      Fixes: 6071bd1a ("netem: Segment GSO packets on enqueue")
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Reviewed-by: Simon Horman <simon.horman@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: netem: fix error path for corrupted GSO frames · a7fa12d1
      Jakub Kicinski authored
      To corrupt a GSO frame we first perform segmentation.  We then
      proceed using the first segment instead of the full GSO skb and
      requeue the rest of the segments as separate packets.
      
      If there are any issues with processing the first segment we
      still want to process the rest, therefore we jump to the
      finish_segs label.
      
      Commit 177b8007 ("net: netem: fix backlog accounting for
      corrupted GSO frames") started using the pointer to the first
      segment in the "rest of segments processing", but as mentioned
      above the first segment may have already been freed at this point.
      
      Backlog corrections for parent qdiscs have to be adjusted.
      
      Fixes: 177b8007 ("net: netem: fix backlog accounting for corrupted GSO frames")
      Reported-by: kbuild test robot <lkp@intel.com>
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Reported-by: Ben Hutchings <ben@decadent.org.uk>
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Reviewed-by: Simon Horman <simon.horman@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • macb: propagate errors when getting optional clocks · bd310aca
      Michael Tretter authored
      The tx_clk, rx_clk, and tsu_clk are optional. Currently the macb driver
      marks a clock as not available if it receives an error when trying to
      get that clock. This is wrong, because a clock controller might return
      -EPROBE_DEFER if a clock is not yet available but will eventually
      become available.
      
      In these cases, the driver would probe successfully but will never be
      able to adjust the clocks, because the clocks were not available during
      probe, but became available later.
      
      For example, the clock controller for the ZynqMP is implemented in the
      PMU firmware and the clocks are only available after the firmware driver
      has been probed.
      
      Use devm_clk_get_optional() instead of devm_clk_get() to get the
      optional clock and propagate all errors to the calling function.
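
      The difference between "clock is absent" and "clock lookup failed" can
      be modelled in userspace. This is a toy sketch and not the clk API:
      struct toy_clk, struct toy_lookup and toy_clk_get_optional() are
      hypothetical, and the error values are local copies of the kernel's
      ENOENT (2) and EPROBE_DEFER (517).

```c
#include <assert.h>
#include <stddef.h>

#define TOY_ENOENT       2   /* clock is not described at all */
#define TOY_EPROBE_DEFER 517 /* clock provider has not been probed yet */

struct toy_clk {
    const char *name;
};

/* Hypothetical lookup result: a clock, or a negative error code. */
struct toy_lookup {
    struct toy_clk *clk;
    int err;
};

/*
 * "Optional" semantics: a missing clock yields NULL and success, while
 * every other error (including probe deferral) is passed to the caller.
 */
static int toy_clk_get_optional(struct toy_lookup res, struct toy_clk **out)
{
    assert(out != NULL);
    *out = NULL;
    if (res.err == -TOY_ENOENT)
        return 0;       /* absent: continue without this clock */
    if (res.err < 0)
        return res.err; /* real failure: propagate, e.g. to retry probe */
    *out = res.clk;
    return 0;
}
```

      A probe function built on this pattern returns the error unchanged, so
      a deferral leads to a later retry instead of a driver that probed
      successfully without its clocks.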
      Signed-off-by: Michael Tretter <m.tretter@pengutronix.de>
      Acked-by: Nicolas Ferre <nicolas.ferre@microchip.com>
      Tested-by: Nicolas Ferre <nicolas.ferre@microchip.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • xen/netback: fix error path of xenvif_connect_data() · 3d5c1a03
      Juergen Gross authored
      xenvif_connect_data() calls module_put() in case of error. This is
      wrong as there is no related module_get().
      
      Remove the superfluous module_put().
      
      Fixes: 279f438e ("xen-netback: Don't destroy the netdev until the vif is shut down")
      Cc: <stable@vger.kernel.org> # 3.12
      Signed-off-by: Juergen Gross <jgross@suse.com>
      Reviewed-by: Paul Durrant <paul@xen.org>
      Reviewed-by: Wei Liu <wei.liu@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: hns3: fix mis-counting IRQ vector numbers issue · 580a05f9
      Yonglong Liu authored
      Currently, num_msi_left is meant to hold the number of vectors of
      the NIC, but if the PF supports RoCE, it unexpectedly contains the
      vectors of both NIC and RoCE.
      
      This may cause lost interrupts in some cases, because the NIC
      module uses vector resources which belong to RoCE.
      
      This patch adds a new variable num_nic_msi to store the number of
      NIC vectors, and adjusts the default TQP numbers and rss_size
      according to the value of num_nic_msi.
      
      Fixes: 46a3df9f ("net: hns3: Add HNS3 Acceleration Engine & Compatibility Layer Support")
      Signed-off-by: Yonglong Liu <liuyonglong@huawei.com>
      Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'akpm' (patches from Andrew) · 998d7551
      Linus Torvalds authored
      Merge misc fixes from Andrew Morton:
       "Rather a lot of fixes, almost all affecting mm/"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (26 commits)
        scripts/gdb: fix debugging modules on s390
        kernel/events/uprobes.c: only do FOLL_SPLIT_PMD for uprobe register
        mm/thp: allow dropping THP from page cache
        mm/vmscan.c: support removing arbitrary sized pages from mapping
        mm/thp: fix node page state in split_huge_page_to_list()
        proc/meminfo: fix output alignment
        mm/init-mm.c: include <linux/mman.h> for vm_committed_as_batch
        mm/filemap.c: include <linux/ramfs.h> for generic_file_vm_ops definition
        mm: include <linux/huge_mm.h> for is_vma_temporary_stack
        zram: fix race between backing_dev_show and backing_dev_store
        mm/memcontrol: update lruvec counters in mem_cgroup_move_account
        ocfs2: fix panic due to ocfs2_wq is null
        hugetlbfs: don't access uninitialized memmaps in pfn_range_valid_gigantic()
        mm: memblock: do not enforce current limit for memblock_phys* family
        mm: memcg: get number of pages on the LRU list in memcgroup base on lru_zone_size
        mm/gup: fix a misnamed "write" argument, and a related bug
        mm/gup_benchmark: add a missing "w" to getopt string
        ocfs2: fix error handling in ocfs2_setattr()
        mm: memcg/slab: fix panic in __free_slab() caused by premature memcg pointer release
        mm/memunmap: don't access uninitialized memmap in memunmap_pages()
        ...
    • scripts/gdb: fix debugging modules on s390 · 585d730d
      Ilya Leoshkevich authored
      Currently lx-symbols assumes that module text is always located at
      module->core_layout->base, but s390 uses the following layout:
      
        +------+  <- module->core_layout->base
        | GOT  |
        +------+  <- module->core_layout->base + module->arch->plt_offset
        | PLT  |
        +------+  <- module->core_layout->base + module->arch->plt_offset +
        | TEXT |     module->arch->plt_size
        +------+
      
      Therefore, when trying to debug modules on s390, all the symbol
      addresses are skewed by plt_offset + plt_size.
      
      Fix by adding plt_offset + plt_size to module_addr in
      load_module_symbols().
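
      The address arithmetic from the diagram can be written out as follows.
      This is a minimal sketch in C for illustration only (the fix itself
      lives in the gdb helper scripts); struct toy_module_layout and
      s390_module_text_addr() are hypothetical names that mirror the fields
      named above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of the module layout fields named in the text. */
struct toy_module_layout {
    uint64_t base;       /* module->core_layout->base */
    uint64_t plt_offset; /* module->arch->plt_offset, i.e. size of the GOT */
    uint64_t plt_size;   /* module->arch->plt_size */
};

/* On s390 the text section starts after the GOT and the PLT. */
static uint64_t s390_module_text_addr(const struct toy_module_layout *m)
{
    assert(m != NULL);
    return m->base + m->plt_offset + m->plt_size;
}
```

      For a layout with base 0x1000, plt_offset 0x40 and plt_size 0x80, the
      text section starts at 0x10c0, so symbols loaded at base alone would be
      off by 0xc0.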
      
      Link: http://lkml.kernel.org/r/20191017085917.81791-1-iii@linux.ibm.com
      Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
      Reviewed-by: Jan Kiszka <jan.kiszka@siemens.com>
      Cc: Kieran Bingham <kbingham@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kernel/events/uprobes.c: only do FOLL_SPLIT_PMD for uprobe register · aa5de305
      Song Liu authored
      Attaching uprobe to text section in THP splits the PMD mapped page table
      into PTE mapped entries.  On uprobe detach, we would like to regroup PMD
      mapped page table entry to regain performance benefit of THP.
      
      However, the regroup is broken for perf_event based trace_uprobe.  This
      is because perf_event based trace_uprobe calls uprobe_unregister twice
      on close: first in TRACE_REG_PERF_CLOSE, then in
      TRACE_REG_PERF_UNREGISTER.  The second call will split the PMD mapped
      page table entry, which is not the desired behavior.
      
      Fix this by only using FOLL_SPLIT_PMD for the uprobe register case.
      
      Add a WARN() to confirm uprobe unregister never works on huge pages, and
      abort the operation when this WARN() triggers.
      
      Link: http://lkml.kernel.org/r/20191017164223.2762148-6-songliubraving@fb.com
      Fixes: 5a52c9df ("uprobe: use FOLL_SPLIT_PMD instead of FOLL_SPLIT")
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/thp: allow dropping THP from page cache · ef18a1ca
      Kirill A. Shutemov authored
      Once a THP is added to the page cache, it cannot be dropped via
      /proc/sys/vm/drop_caches.  Fix this issue with proper handling in
      invalidate_mapping_pages().
      
      Link: http://lkml.kernel.org/r/20191017164223.2762148-5-songliubraving@fb.com
      Fixes: 99cb0dbd ("mm,thp: add read-only THP support for (non-shmem) FS")
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Tested-by: Song Liu <songliubraving@fb.com>
      Acked-by: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/vmscan.c: support removing arbitrary sized pages from mapping · 906d278d
      William Kucharski authored
      __remove_mapping() assumes that pages can only be either base pages or
      HPAGE_PMD_SIZE.  Ask the page what size it is.
      
      Link: http://lkml.kernel.org/r/20191017164223.2762148-4-songliubraving@fb.com
      Fixes: 99cb0dbd ("mm,thp: add read-only THP support for (non-shmem) FS")
      Signed-off-by: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Acked-by: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/thp: fix node page state in split_huge_page_to_list() · 06d3eff6
      Kirill A. Shutemov authored
      Make sure split_huge_page_to_list() handles the state of shmem THP and
      file THP properly.
      
      Link: http://lkml.kernel.org/r/20191017164223.2762148-3-songliubraving@fb.com
      Fixes: 60fbf0ab ("mm,thp: stats for file backed THP")
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Tested-by: Song Liu <songliubraving@fb.com>
      Acked-by: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • proc/meminfo: fix output alignment · 2be5fbf9
      Kirill A. Shutemov authored
      Patch series "Fixes for THP in page cache", v2.
      
      This patch (of 5):
      
      Add extra space for FileHugePages and FilePmdMapped, so the output is
      aligned with other rows.
      
      Link: http://lkml.kernel.org/r/20191017164223.2762148-2-songliubraving@fb.com
      Fixes: 60fbf0ab ("mm,thp: stats for file backed THP")
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Tested-by: Song Liu <songliubraving@fb.com>
      Acked-by: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/init-mm.c: include <linux/mman.h> for vm_committed_as_batch · a2ae8c05
      Ben Dooks (Codethink) authored
      mm_init.c needs to include <linux/mman.h> for the definition of
      vm_committed_as_batch.  Fixes the following sparse warning:
      
        mm/mm_init.c:141:5: warning: symbol 'vm_committed_as_batch' was not declared. Should it be static?
      
      Link: http://lkml.kernel.org/r/20191016091509.26708-1-ben.dooks@codethink.co.uk
      Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/filemap.c: include <linux/ramfs.h> for generic_file_vm_ops definition · d0e6a582
      Ben Dooks authored
      The generic_file_vm_ops is defined in <linux/ramfs.h> so include it to
      fix the following warning:
      
        mm/filemap.c:2717:35: warning: symbol 'generic_file_vm_ops' was not declared. Should it be static?
      
      Link: http://lkml.kernel.org/r/20191008102311.25432-1-ben.dooks@codethink.co.uk
      Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: include <linux/huge_mm.h> for is_vma_temporary_stack · 444f84fd
      Ben Dooks authored
      Include <linux/huge_mm.h> for the definition of is_vma_temporary_stack
      to fix the following sparse warning:
      
        mm/rmap.c:1673:6: warning: symbol 'is_vma_temporary_stack' was not declared. Should it be static?
      
      Link: http://lkml.kernel.org/r/20191009151155.27763-1-ben.dooks@codethink.co.uk
      Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
      Reviewed-by: Qian Cai <cai@lca.pw>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zram: fix race between backing_dev_show and backing_dev_store · f7daefe4
      Chenwandun authored
      CPU0:                               CPU1:
      backing_dev_show                    backing_dev_store
          ......                              ......
          file = zram->backing_dev;
          down_read(&zram->init_lock);        down_read(&zram->init_lock);
          file_path(file, ...);               zram->backing_dev = backing_dev;
          up_read(&zram->init_lock);          up_read(&zram->init_lock);
      
      backing_dev_show reads zram->backing_dev too early, before taking the
      lock, which results in the value being NULL at the beginning and not
      NULL later.
      
      backtrace:
        d_path+0xcc/0x174
        file_path+0x10/0x18
        backing_dev_show+0x40/0xb4
        dev_attr_show+0x20/0x54
        sysfs_kf_seq_show+0x9c/0x10c
        kernfs_seq_show+0x28/0x30
        seq_read+0x184/0x488
        kernfs_fop_read+0x5c/0x1a4
        __vfs_read+0x44/0x128
        vfs_read+0xa0/0x138
        SyS_read+0x54/0xb4
      
      Link: http://lkml.kernel.org/r/1571046839-16814-1-git-send-email-chenwandun@huawei.com
      Signed-off-by: Chenwandun <chenwandun@huawei.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: <stable@vger.kernel.org> [4.14+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memcontrol: update lruvec counters in mem_cgroup_move_account · ae8af438
      Konstantin Khlebnikov authored
      Mapped, dirty and writeback pages are also counted in per-lruvec stats.
      These counters need updating when a page is moved between cgroups.
      
      Currently nobody is *consuming* the lruvec versions of these counters,
      so there is no user-visible effect.
      
      Link: http://lkml.kernel.org/r/157112699975.7360.1062614888388489788.stgit@buzz
      Fixes: 00f3ca2c ("mm: memcontrol: per-lruvec stats infrastructure")
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: fix panic due to ocfs2_wq is null · b918c430
      Yi Li authored
      mount.ocfs2 fails when reading the ocfs2 filesystem superblock
      encounters an error.  ocfs2_initialize_super() returns before allocating
      ocfs2_wq, and ocfs2_dismount_volume() then triggers the following panic.
      
        Oct 15 16:09:27 cnwarekv-205120 kernel: On-disk corruption discovered.Please run fsck.ocfs2 once the filesystem is unmounted.
        Oct 15 16:09:27 cnwarekv-205120 kernel: (mount.ocfs2,22804,44): ocfs2_read_locked_inode:537 ERROR: status = -30
        Oct 15 16:09:27 cnwarekv-205120 kernel: (mount.ocfs2,22804,44): ocfs2_init_global_system_inodes:458 ERROR: status = -30
        Oct 15 16:09:27 cnwarekv-205120 kernel: (mount.ocfs2,22804,44): ocfs2_init_global_system_inodes:491 ERROR: status = -30
        Oct 15 16:09:27 cnwarekv-205120 kernel: (mount.ocfs2,22804,44): ocfs2_initialize_super:2313 ERROR: status = -30
        Oct 15 16:09:27 cnwarekv-205120 kernel: (mount.ocfs2,22804,44): ocfs2_fill_super:1033 ERROR: status = -30
        ------------[ cut here ]------------
        Oops: 0002 [#1] SMP NOPTI
        CPU: 1 PID: 11753 Comm: mount.ocfs2 Tainted: G  E
              4.14.148-200.ckv.x86_64 #1
        Hardware name: Sugon H320-G30/35N16-US, BIOS 0SSDX017 12/21/2018
        task: ffff967af0520000 task.stack: ffffa5f05484000
        RIP: 0010:mutex_lock+0x19/0x20
        Call Trace:
          flush_workqueue+0x81/0x460
          ocfs2_shutdown_local_alloc+0x47/0x440 [ocfs2]
          ocfs2_dismount_volume+0x84/0x400 [ocfs2]
          ocfs2_fill_super+0xa4/0x1270 [ocfs2]
          ? ocfs2_initialize_super.isa.211+0xf20/0xf20 [ocfs2]
          mount_bdev+0x17f/0x1c0
          mount_fs+0x3a/0x160
      
      Link: http://lkml.kernel.org/r/1571139611-24107-1-git-send-email-yili@winhong.com
      Signed-off-by: Yi Li <yilikernel@gmail.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Changwei Ge <gechangwei@live.cn>
      Cc: Gang He <ghe@suse.com>
      Cc: Jun Piao <piaojun@huawei.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hugetlbfs: don't access uninitialized memmaps in pfn_range_valid_gigantic() · f231fe42
      David Hildenbrand authored
      Uninitialized memmaps contain garbage and in the worst case trigger
      kernel BUGs, especially with CONFIG_PAGE_POISONING.  They should not get
      touched.
      
      Let's make sure that we only consider online memory (managed by the
      buddy) that has initialized memmaps.  ZONE_DEVICE is not applicable.
      
      page_zone() will call page_to_nid(), which will trigger
      VM_BUG_ON_PGFLAGS(PagePoisoned(page), page) with CONFIG_PAGE_POISONING
      and CONFIG_DEBUG_VM_PGFLAGS when called on uninitialized memmaps.  This
      can be the case when an offline memory block (e.g., never onlined) is
      spanned by a zone.
      
      Note: As explained by Michal in [1], alloc_contig_range() will verify
      the range.  So it boils down to the wrong access in this function.
      
      [1] http://lkml.kernel.org/r/20180423000943.GO17484@dhcp22.suse.cz
      
      Link: http://lkml.kernel.org/r/20191015120717.4858-1-david@redhat.com
      Fixes: f1dd2cd1 ("mm, memory_hotplug: do not associate hotadded memory to zones until online")	[visible after d0dc12e8]
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reported-by: Michal Hocko <mhocko@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: <stable@vger.kernel.org>	[4.13+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f231fe42
    • Mike Rapoport's avatar
      mm: memblock: do not enforce current limit for memblock_phys* family · f3057ad7
      Mike Rapoport authored
      Until commit 92d12f95 ("memblock: refactor internal allocation
      functions"), the maximal address for memblock allocations was forced
      to memblock.current_limit only for the allocation functions returning
      a virtual address.  The changes introduced by that commit moved the
      limit enforcement into the allocation core, and as a result the
      allocation functions returning a physical address also started to
      limit allocations to memblock.current_limit.
      
      This broke the etnaviv GPU driver:
      
        etnaviv etnaviv: bound 130000.gpu (ops gpu_ops)
        etnaviv etnaviv: bound 134000.gpu (ops gpu_ops)
        etnaviv etnaviv: bound 2204000.gpu (ops gpu_ops)
        etnaviv-gpu 130000.gpu: model: GC2000, revision: 5108
        etnaviv-gpu 130000.gpu: command buffer outside valid memory window
        etnaviv-gpu 134000.gpu: model: GC320, revision: 5007
        etnaviv-gpu 134000.gpu: command buffer outside valid memory window
        etnaviv-gpu 2204000.gpu: model: GC355, revision: 1215
        etnaviv-gpu 2204000.gpu: Ignoring GPU with VG and FE2.0
      
      Restore the behaviour of memblock_phys* family so that these functions
      will not enforce memblock.current_limit.
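
      The intended split can be sketched as a toy model (an illustration
      under stated assumptions, not the memblock code): only the path that
      returns a virtual address clamps the upper bound to the current
      limit.  struct toy_memblock and both helper names are hypothetical.

```c
#include <stdint.h>

struct toy_memblock {
	uint64_t current_limit;	/* highest address the kernel can map directly */
};

/* Physical-address allocations: the caller's upper bound is used as given. */
static uint64_t toy_phys_alloc_end(const struct toy_memblock *mb,
				   uint64_t max_addr)
{
	(void)mb;	/* deliberately ignores current_limit */
	return max_addr;
}

/* Virtual-address allocations: the result must be mappable, so clamp. */
static uint64_t toy_virt_alloc_end(const struct toy_memblock *mb,
				   uint64_t max_addr)
{
	return max_addr > mb->current_limit ? mb->current_limit : max_addr;
}
```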
      
      Link: http://lkml.kernel.org/r/1570915861-17633-1-git-send-email-rppt@kernel.org
      Fixes: 92d12f95 ("memblock: refactor internal allocation functions")
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Reported-by: Adam Ford <aford173@gmail.com>
      Tested-by: Adam Ford <aford173@gmail.com>	[imx6q-logicpd]
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Fabio Estevam <festevam@gmail.com>
      Cc: Lucas Stach <l.stach@pengutronix.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f3057ad7
    • Honglei Wang's avatar
      mm: memcg: get number of pages on the LRU list in memcgroup base on lru_zone_size · b11edebb
      Honglei Wang authored
      Commit 1a61ab80 ("mm: memcontrol: replace zone summing with
      lruvec_page_state()") made lruvec_page_state use per-cpu counters
      instead of calculating the value directly from lru_zone_size, on the
      assumption that this would be more efficient.
      
      Tim has reported that this is not the case for their database
      benchmark, which shows the opposite result: lruvec_page_state takes up
      a huge chunk of CPU cycles (about 25% of the system time, which is
      roughly 7% of total CPU cycles) on 5.3 kernels.  The workload runs on
      a large machine (96 CPUs), has many cgroups (500), and is heavily
      direct reclaim bound.
      
      Tim Chen said:
      
      : The problem can also be reproduced by running simple multi-threaded
      : pmbench benchmark with a fast Optane SSD swap (see profile below).
      :
      :
      : 6.15%     3.08%  pmbench          [kernel.vmlinux]            [k] lruvec_lru_size
      :             |
      :             |--3.07%--lruvec_lru_size
      :             |          |
      :             |          |--2.11%--cpumask_next
      :             |          |          |
      :             |          |           --1.66%--find_next_bit
      :             |          |
      :             |           --0.57%--call_function_interrupt
      :             |                     |
      :             |                      --0.55%--smp_call_function_interrupt
      :             |
      :             |--1.59%--0x441f0fc3d009
      :             |          _ops_rdtsc_init_base_freq
      :             |          access_histogram
      :             |          page_fault
      :             |          __do_page_fault
      :             |          handle_mm_fault
      :             |          __handle_mm_fault
      :             |          |
      :             |           --1.54%--do_swap_page
      :             |                     swapin_readahead
      :             |                     swap_cluster_readahead
      :             |                     |
      :             |                      --1.53%--read_swap_cache_async
      :             |                                __read_swap_cache_async
      :             |                                alloc_pages_vma
      :             |                                __alloc_pages_nodemask
      :             |                                __alloc_pages_slowpath
      :             |                                try_to_free_pages
      :             |                                do_try_to_free_pages
      :             |                                shrink_node
      :             |                                shrink_node_memcg
      :             |                                |
      :             |                                |--0.77%--lruvec_lru_size
      :             |                                |
      :             |                                 --0.76%--inactive_list_is_low
      :             |                                           |
      :             |                                            --0.76%--lruvec_lru_size
      :             |
      :              --1.50%--measure_read
      :                        page_fault
      :                        __do_page_fault
      :                        handle_mm_fault
      :                        __handle_mm_fault
      :                        do_swap_page
      :                        swapin_readahead
      :                        swap_cluster_readahead
      :                        |
      :                         --1.48%--read_swap_cache_async
      :                                   __read_swap_cache_async
      :                                   alloc_pages_vma
      :                                   __alloc_pages_nodemask
      :                                   __alloc_pages_slowpath
      :                                   try_to_free_pages
      :                                   do_try_to_free_pages
      :                                   shrink_node
      :                                   shrink_node_memcg
      :                                   |
      :                                   |--0.75%--inactive_list_is_low
      :                                   |          |
      :                                   |           --0.75%--lruvec_lru_size
      :                                   |
      :                                    --0.73%--lruvec_lru_size
      
      The likely culprit is the cache traffic that lruvec_page_state_local
      generates.  Dave Hansen says:
      
      : I was thinking purely of the cache footprint.  If it's reading
      : pn->lruvec_stat_local->count[idx] is three separate cachelines, so 192
      : bytes of cache *96 CPUs = 18k of data, mostly read-only.  1 cgroup would
      : be 18k of data for the whole system and the caching would be pretty
      : efficient and all 18k would probably survive a tight page fault loop in
      : the L1.  500 cgroups would be ~90k of data per CPU thread which doesn't
      : fit in the L1 and probably wouldn't survive a tight page fault loop if
      : both logical threads were banging on different cgroups.
      :
      : It's just a theory, but it's why I noted the number of cgroups when I
      : initially saw this show up in profiles
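
      The arithmetic in the quote above can be reproduced with a small
      sketch.  It assumes 64-byte cachelines (implied by "three separate
      cachelines, so 192 bytes"); the function names are hypothetical.

```c
#include <stddef.h>

/* Assumption: 64-byte cachelines, three of them touched per counter read. */
enum { TOY_CACHELINE_BYTES = 64, TOY_CACHELINES_PER_READ = 3 };

/* Bytes of per-cpu counter data touched for one cgroup on one CPU. */
static size_t toy_bytes_per_cpu_per_cgroup(void)
{
	return (size_t)TOY_CACHELINE_BYTES * TOY_CACHELINES_PER_READ;
}

/* Data read system-wide for one cgroup: one chunk per CPU. */
static size_t toy_bytes_one_cgroup(size_t nr_cpus)
{
	return toy_bytes_per_cpu_per_cgroup() * nr_cpus;
}

/* Data belonging to a single CPU across all cgroups. */
static size_t toy_bytes_per_cpu(size_t nr_cgroups)
{
	return toy_bytes_per_cpu_per_cgroup() * nr_cgroups;
}
```

      With 96 CPUs, toy_bytes_one_cgroup(96) is 18432 bytes (the "18k" in
      the quote); with 500 cgroups, toy_bytes_per_cpu(500) is 96000 bytes,
      which the quote rounds to "~90k".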
      
      Fix the regression by partially reverting the said commit and
      calculating the lru size explicitly.
      
      Link: http://lkml.kernel.org/r/20190905071034.16822-1-honglei.wang@oracle.com
      Fixes: 1a61ab80 ("mm: memcontrol: replace zone summing with lruvec_page_state()")
      Signed-off-by: Honglei Wang <honglei.wang@oracle.com>
      Reported-by: Tim Chen <tim.c.chen@linux.intel.com>
      Acked-by: Tim Chen <tim.c.chen@linux.intel.com>
      Tested-by: Tim Chen <tim.c.chen@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: <stable@vger.kernel.org>	[5.2+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b11edebb
    • John Hubbard's avatar
      mm/gup: fix a misnamed "write" argument, and a related bug · 0cd22afd
      John Hubbard authored
      In several routines, the "flags" argument is incorrectly named "write".
      Change it to "flags".
      
      Also, in one place, the misnaming led to an actual bug:
      "flags & FOLL_WRITE" is required, rather than just "flags".
      (That problem was flagged by krobot, in v1 of this patch.)
      
      Also, change the type of the flags argument from int to unsigned int.
      
      You can see that this was a simple oversight, because the
      calling code passes "flags" as the fifth argument:
      
      gup_pgd_range():
          ...
          if (!gup_huge_pd(__hugepd(pgd_val(pgd)), addr,
      		    PGDIR_SHIFT, next, flags, pages, nr))
      
      ...which, until this patch, the callees referred to as "write".
      
      Also, change two lines to avoid checkpatch line length
      complaints, and another line to fix another oversight
      that checkpatch called out: missing "int" on pdshift.
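
      Why the misnaming matters can be shown with a small sketch.  The
      TOY_FOLL_* values and the helpers below are illustrative assumptions,
      not the kernel's definitions: a flags word that is treated as a
      boolean "write" is true whenever any flag is set, not only when the
      write flag is set.

```c
#include <stdbool.h>

/* Illustrative flag bits; the names and values are assumptions. */
#define TOY_FOLL_WRITE 0x01u
#define TOY_FOLL_GET   0x04u

/* Hypothetical permission check taking a boolean "write" request. */
static bool toy_access_permitted(bool pte_writable, bool write)
{
	return !write || pte_writable;
}

/* Buggy: any non-zero flags value is interpreted as a write request. */
static bool toy_check_buggy(bool pte_writable, unsigned int flags)
{
	return toy_access_permitted(pte_writable, flags);
}

/* Fixed: only the write bit decides whether write access is required. */
static bool toy_check_fixed(bool pte_writable, unsigned int flags)
{
	return toy_access_permitted(pte_writable, flags & TOY_FOLL_WRITE);
}
```

      For a read-only mapping and flags == TOY_FOLL_GET, the buggy variant
      refuses access although no write was requested; the fixed variant
      allows it.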
      
      Link: http://lkml.kernel.org/r/20191014184639.1512873-3-jhubbard@nvidia.com
      Fixes: b798bec4 ("mm/gup: change write parameter to flags in fast walk")
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Reported-by: kbuild test robot <lkp@intel.com>
      Suggested-by: Kirill A. Shutemov <kirill@shutemov.name>
      Suggested-by: Ira Weiny <ira.weiny@intel.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0cd22afd
    • John Hubbard's avatar
      mm/gup_benchmark: add a missing "w" to getopt string · 6f24c8d3
      John Hubbard authored
      Even though gup_benchmark.c has code to handle the -w command-line option,
      the "w" is not part of the getopt string.  It looks as if it has been
      missing the whole time.
      
      On my machine, this leads naturally to the following predictable result:
      
        $ sudo ./gup_benchmark -w
        ./gup_benchmark: invalid option -- 'w'
      
      ...which is fixed with this commit.
      
      Link: http://lkml.kernel.org/r/20191014184639.1512873-2-jhubbard@nvidia.com
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: kbuild test robot <lkp@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6f24c8d3
    • Chengguang Xu's avatar
      ocfs2: fix error handling in ocfs2_setattr() · ce750f43
      Chengguang Xu authored
      On the error path, transfer_to[USRQUOTA/GRPQUOTA] should be set to
      NULL before jumping to the dqput() cleanup.
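
      The pattern can be sketched in user space as follows.  This is an
      illustration under assumptions, not the ocfs2 code: the lookup is
      assumed to report failure through an error-encoded pointer, and the
      put helper is assumed to ignore NULL.  All toy_* names are
      hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

struct toy_dquot {
	int refs;
};

/* Error-pointer helpers in the style of ERR_PTR()/IS_ERR(). */
static void *toy_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

static int toy_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-4095;
}

static int toy_puts;	/* number of real references dropped */

/* Ignores NULL, but would dereference an error pointer. */
static void toy_dqput(struct toy_dquot *dq)
{
	if (!dq)
		return;
	dq->refs--;
	toy_puts++;
}

static int toy_setattr(struct toy_dquot *lookup_result)
{
	struct toy_dquot *transfer_to = lookup_result;
	int status = 0;

	if (toy_is_err(transfer_to)) {
		status = (int)(intptr_t)transfer_to;
		transfer_to = NULL;	/* reset before jumping to cleanup */
		goto out;
	}
	/* ... the actual quota transfer would happen here ... */
out:
	toy_dqput(transfer_to);	/* safe: either a real dquot or NULL */
	return status;
}
```

      Without the reset, the cleanup would hand the error-encoded pointer
      to toy_dqput(), which would treat it as a real object.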
      
      Link: http://lkml.kernel.org/r/20191010082349.1134-1-cgxu519@mykernel.net
      Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Changwei Ge <gechangwei@live.cn>
      Cc: Gang He <ghe@suse.com>
      Cc: Jun Piao <piaojun@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ce750f43
    • Roman Gushchin's avatar
      mm: memcg/slab: fix panic in __free_slab() caused by premature memcg pointer release · b749ecfa
      Roman Gushchin authored
      Karsten reported the following panic in __free_slab() happening on a s390x
      machine:
      
        Unable to handle kernel pointer dereference in virtual kernel address space
        Failing address: 0000000000000000 TEID: 0000000000000483
        Fault in home space mode while using kernel ASCE.
        AS:00000000017d4007 R3:000000007fbd0007 S:000000007fbff000 P:000000000000003d
        Oops: 0004 ilc:3 [#1] PREEMPT SMP
        Modules linked in: tcp_diag inet_diag xt_tcpudp ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_at nf_nat
        CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.3.0-05872-g6133e3e4bada-dirty #14
        Hardware name: IBM 2964 NC9 702 (z/VM 6.4.0)
        Krnl PSW : 0704d00180000000 00000000003cadb6 (__free_slab+0x686/0x6b0)
                   R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
        Krnl GPRS: 00000000f3a32928 0000000000000000 000000007fbf5d00 000000000117c4b8
                   0000000000000000 000000009e3291c1 0000000000000000 0000000000000000
                   0000000000000003 0000000000000008 000000002b478b00 000003d080a97600
                   000000000117ba00 000003e000057db0 00000000003cabcc 000003e000057c78
        Krnl Code: 00000000003cada6: e310a1400004        lg      %r1,320(%r10)
                   00000000003cadac: c0e50046c286        brasl   %r14,ca32b8
                  #00000000003cadb2: a7f4fe36            brc     15,3caa1e
                  >00000000003cadb6: e32060800024        stg     %r2,128(%r6)
                   00000000003cadbc: a7f4fd9e            brc     15,3ca8f8
                   00000000003cadc0: c0e50046790c        brasl   %r14,c99fd8
                   00000000003cadc6: a7f4fe2c            brc     15,3caa1e
                   00000000003cadca: ecb1ffff00d9        aghik   %r11,%r1,-1
        Call Trace:
        (<00000000003cabcc> __free_slab+0x49c/0x6b0)
         <00000000001f5886> rcu_core+0x5a6/0x7e0
         <0000000000ca2dea> __do_softirq+0xf2/0x5c0
         <0000000000152644> irq_exit+0x104/0x130
         <000000000010d222> do_IRQ+0x9a/0xf0
         <0000000000ca2344> ext_int_handler+0x130/0x134
         <0000000000103648> enabled_wait+0x58/0x128
        (<0000000000103634> enabled_wait+0x44/0x128)
         <0000000000103b00> arch_cpu_idle+0x40/0x58
         <0000000000ca0544> default_idle_call+0x3c/0x68
         <000000000018eaa4> do_idle+0xec/0x1c0
         <000000000018ee0e> cpu_startup_entry+0x36/0x40
         <000000000122df34> arch_call_rest_init+0x5c/0x88
         <0000000000000000> 0x0
        INFO: lockdep is turned off.
        Last Breaking-Event-Address:
         <00000000003ca8f4> __free_slab+0x1c4/0x6b0
        Kernel panic - not syncing: Fatal exception in interrupt
      
      The kernel panics on an attempt to dereference the NULL memcg pointer.
      When shutdown_cache() is called from the kmem_cache_destroy() context, a
      memcg kmem_cache might have empty slab pages in a partial list, which are
      still charged to the memory cgroup.
      
      These pages are released by free_partial() at the beginning of
      shutdown_cache(): either directly or by scheduling an RCU-delayed work
      (if the kmem_cache has the SLAB_TYPESAFE_BY_RCU flag).  The latter case
      is when the reported panic can happen: memcg_unlink_cache() is called
      immediately after shrinking partial lists, without waiting for scheduled
      RCU works.  It sets the kmem_cache->memcg_params.memcg pointer to NULL,
      and the following attempt to dereference it by __free_slab() from the
      RCU work context causes the panic.
      
      To fix the issue, let's postpone the release of the memcg pointer to
      destroy_memcg_params().  It's called from a separate work context by
      slab_caches_to_rcu_destroy_workfn(), which contains a full RCU barrier.
      This guarantees that all scheduled page release RCU works complete
      before the memcg pointer is zeroed.
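
      The ordering problem can be modelled with a toy simulation.  It is a
      sketch, not kernel code: deferred work is represented by a single
      pending flag, the "barrier" simply runs that work, and a NULL
      dereference is recorded in a flag instead of crashing.  All toy_*
      names are hypothetical.

```c
#include <stddef.h>

struct toy_cache {
	const char *memcg;	/* stands in for memcg_params.memcg */
	int pending;		/* a deferred page release is queued */
	int crashed;		/* set where the real code would oops */
};

/* Deferred page release: needs the memcg pointer to uncharge the page. */
static void toy_deferred_free_slab(struct toy_cache *c)
{
	if (!c->memcg)
		c->crashed = 1;	/* would be a NULL pointer dereference */
	c->pending = 0;
}

/* Model of a full barrier: run all work that is still queued. */
static void toy_barrier(struct toy_cache *c)
{
	if (c->pending)
		toy_deferred_free_slab(c);
}

/* Old order: the pointer is cleared before the deferred work has run. */
static int toy_shutdown_old(struct toy_cache *c)
{
	c->pending = 1;		/* release of partial slabs is deferred */
	c->memcg = NULL;	/* cleared immediately after */
	toy_barrier(c);		/* deferred work runs too late */
	return c->crashed;
}

/* New order: wait for the deferred work, then clear the pointer. */
static int toy_shutdown_new(struct toy_cache *c)
{
	c->pending = 1;
	toy_barrier(c);		/* all deferred releases complete first */
	c->memcg = NULL;
	return c->crashed;
}
```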
      
      Big thanks to Karsten for the perfect report containing all the
      necessary information, and for his help with analyzing the problem and
      testing the fix.
      
      Link: http://lkml.kernel.org/r/20191010160549.1584316-1-guro@fb.com
      Fixes: fb2f2b0a ("mm: memcg/slab: reparent memcg kmem_caches on cgroup removal")
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Reported-by: Karsten Graul <kgraul@linux.ibm.com>
      Tested-by: Karsten Graul <kgraul@linux.ibm.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Karsten Graul <kgraul@linux.ibm.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b749ecfa
    • Aneesh Kumar K.V's avatar
      mm/memunmap: don't access uninitialized memmap in memunmap_pages() · 77e080e7
      Aneesh Kumar K.V authored
      Patch series "mm/memory_hotplug: Shrink zones before removing memory",
      v6.
      
      This series fixes the access of uninitialized memmaps when shrinking
      zones/nodes and when removing memory.  It also contains all fixes for
      crashes, reported by Aneesh, that can be triggered when removing
      certain namespaces using memunmap_pages() (ZONE_DEVICE).
      
      We stop trying to shrink ZONE_DEVICE: it's buggy, fixing it would be
      more involved (we don't have SECTION_IS_ONLINE as an indicator), and
      shrinking is only of limited use (set_zone_contiguous() cannot detect
      the ZONE_DEVICE as contiguous).
      
      We continue shrinking !ZONE_DEVICE zones; however, I reduced the amount
      of code to a minimum.  Shrinking is necessary to keep zone->contiguous
      set where possible, especially on memory unplug of DIMMs at zone
      boundaries.
      
      --------------------------------------------------------------------------
      
      Zones are now properly shrunk when offlining memory blocks or when
      onlining failed.  This allows zones to be properly shrunk on memory
      unplug even if the separate memory blocks of a DIMM were onlined to
      different zones or re-onlined to a different zone after offlining.
      
      Example:
      
        :/# cat /proc/zoneinfo
        Node 1, zone  Movable
                spanned  0
                present  0
                managed  0
        :/# echo "online_movable" > /sys/devices/system/memory/memory41/state
        :/# echo "online_movable" > /sys/devices/system/memory/memory43/state
        :/# cat /proc/zoneinfo
        Node 1, zone  Movable
                spanned  98304
                present  65536
                managed  65536
        :/# echo 0 > /sys/devices/system/memory/memory43/online
        :/# cat /proc/zoneinfo
        Node 1, zone  Movable
                spanned  32768
                present  32768
                managed  32768
        :/# echo 0 > /sys/devices/system/memory/memory41/online
        :/# cat /proc/zoneinfo
        Node 1, zone  Movable
                spanned  0
                present  0
                managed  0
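
      The numbers in this example follow from simple arithmetic, sketched
      below.  The sketch assumes that every memory block covers 32768 pages
      (inferred from the output above, where one online block yields
      "present 32768") and that all online blocks belong to the same zone;
      the names are hypothetical.

```c
#include <stddef.h>

/* Assumption: pages per memory block, inferred from the zoneinfo output. */
#define TOY_PAGES_PER_BLOCK 32768UL

struct toy_zone_stats {
	unsigned long spanned;	/* lowest to highest online block, with holes */
	unsigned long present;	/* pages of online blocks only */
};

/* Compute the zone counters from the ids of the online memory blocks. */
static struct toy_zone_stats toy_zone_from_blocks(const unsigned long *blocks,
						  size_t nr_blocks)
{
	struct toy_zone_stats z = { 0, 0 };
	unsigned long lo, hi;

	if (nr_blocks == 0)
		return z;	/* empty zone: everything stays 0 */

	lo = hi = blocks[0];
	for (size_t i = 1; i < nr_blocks; i++) {
		if (blocks[i] < lo)
			lo = blocks[i];
		if (blocks[i] > hi)
			hi = blocks[i];
	}
	z.spanned = (hi - lo + 1) * TOY_PAGES_PER_BLOCK;
	z.present = nr_blocks * TOY_PAGES_PER_BLOCK;
	return z;
}
```

      With blocks 41 and 43 online, the span covers three blocks (98304
      pages) while only two are present (65536 pages).  After offlining
      block 43, both values shrink to 32768, and with no block online they
      return to 0.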
      
      This patch (of 10):
      
      With an altmap, the memmaps falling into the reserved altmap space are
      not initialized and, therefore, contain a garbage NID and a garbage
      zone.  Make sure to read the NID/zone from a memmap that was initialized.
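
      A simplified sketch of that rule, not the actual patch: skip the
      reserved pages at the start of the range and read the NID from the
      first memmap entry behind them.  struct toy_altmap models only a
      reserve count, which is a simplification, and all toy_* names are
      hypothetical.

```c
#include <stddef.h>

struct toy_page {
	int nid;
};

/* Simplified altmap: only the number of reserved pages at the start. */
struct toy_altmap {
	unsigned long reserve;
};

/* First pfn whose memmap entry was initialized. */
static unsigned long toy_first_initialized_pfn(unsigned long start_pfn,
					       const struct toy_altmap *altmap)
{
	return start_pfn + (altmap ? altmap->reserve : 0);
}

/* Read the NID from an initialized entry; memmap[0] describes start_pfn. */
static int toy_range_nid(const struct toy_page *memmap,
			 unsigned long start_pfn,
			 const struct toy_altmap *altmap)
{
	unsigned long first = toy_first_initialized_pfn(start_pfn, altmap);

	return memmap[first - start_pfn].nid;
}
```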
      
      This fixes a kernel crash that is observed when destroying a namespace:
      
        kernel BUG at include/linux/mm.h:1107!
        cpu 0x1: Vector: 700 (Program Check) at [c000000274087890]
            pc: c0000000004b9728: memunmap_pages+0x238/0x340
            lr: c0000000004b9724: memunmap_pages+0x234/0x340
        ...
            pid   = 3669, comm = ndctl
        kernel BUG at include/linux/mm.h:1107!
          devm_action_release+0x30/0x50
          release_nodes+0x268/0x2d0
          device_release_driver_internal+0x174/0x240
          unbind_store+0x13c/0x190
          drv_attr_store+0x44/0x60
          sysfs_kf_write+0x70/0xa0
          kernfs_fop_write+0x1ac/0x290
          __vfs_write+0x3c/0x70
          vfs_write+0xe4/0x200
          ksys_write+0x7c/0x140
          system_call+0x5c/0x68
      
      The "page_zone(pfn_to_page(pfn))" call was introduced by 69324b8f ("mm,
      devm_memremap_pages: add MEMORY_DEVICE_PRIVATE support"); however, I
      think we will never have driver reserved memory with
      MEMORY_DEVICE_PRIVATE (no altmap, as far as I can see).
      
      [david@redhat.com: minimize code changes, rephrase description]
      Link: http://lkml.kernel.org/r/20191006085646.5768-2-david@redhat.com
      Fixes: 2c2a5af6 ("mm, memory_hotplug: add nid parameter to arch_remove_memory")
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Damian Tometzki <damian.tometzki@gmail.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Halil Pasic <pasic@linux.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jun Yao <yaojun8558363@gmail.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pankaj Gupta <pagupta@redhat.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Steve Capper <steve.capper@arm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Wei Yang <richardw.yang@linux.intel.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: <stable@vger.kernel.org>	[5.0+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      77e080e7
    • David Hildenbrand's avatar
      mm/memory_hotplug: don't access uninitialized memmaps in shrink_pgdat_span() · 00d6c019
      David Hildenbrand authored
      We might use the nid of memmaps that were never initialized.  For
      example, if the memmap was poisoned, we will crash the kernel in
      pfn_to_nid() right now.  Let's use the calculated boundaries of the
      separate zones instead.  This now also avoids having to iterate over a
      whole bunch of subsections again, after shrinking one zone.
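
      Deriving the node span from the zone boundaries can be sketched like
      this.  It is an illustration with hypothetical types and names, not
      the patch itself: the node span is the union of all non-empty zone
      spans, so no memmap has to be read at all.

```c
#include <stdbool.h>
#include <stddef.h>

struct toy_zone {
	unsigned long start_pfn;
	unsigned long spanned_pages;
};

struct toy_node_span {
	unsigned long start_pfn;
	unsigned long spanned_pages;
};

/* Node span = lowest zone start to highest zone end, ignoring empty zones. */
static struct toy_node_span toy_node_span_from_zones(const struct toy_zone *zones,
						     size_t nr_zones)
{
	struct toy_node_span span = { 0, 0 };
	unsigned long start = 0, end = 0;
	bool found = false;

	for (size_t i = 0; i < nr_zones; i++) {
		unsigned long zone_end;

		if (!zones[i].spanned_pages)
			continue;	/* an empty zone contributes nothing */

		zone_end = zones[i].start_pfn + zones[i].spanned_pages;
		if (!found || zones[i].start_pfn < start)
			start = zones[i].start_pfn;
		if (!found || zone_end > end)
			end = zone_end;
		found = true;
	}

	if (found) {
		span.start_pfn = start;
		span.spanned_pages = end - start;
	}
	return span;
}
```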
      
      Before commit d0dc12e8 ("mm/memory_hotplug: optimize memory
      hotplug"), the memmap was initialized to 0 and the node was set to the
      right value.  After that commit, the node might be garbage.
      
      We'll have to fix shrink_zone_span() next.
      
      Link: http://lkml.kernel.org/r/20191006085646.5768-4-david@redhat.com
      Fixes: f1dd2cd1 ("mm, memory_hotplug: do not associate hotadded memory to zones until online")	[d0dc12e8]
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Wei Yang <richardw.yang@linux.intel.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Damian Tometzki <damian.tometzki@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Halil Pasic <pasic@linux.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jun Yao <yaojun8558363@gmail.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Pankaj Gupta <pagupta@redhat.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Steve Capper <steve.capper@arm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: <stable@vger.kernel.org>	[4.13+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      00d6c019