1. 05 Dec, 2018 18 commits
    • Bernd Eckstein's avatar
      usbnet: ipheth: fix potential recvmsg bug and recvmsg bug 2 · 535b494a
      Bernd Eckstein authored
      [ Upstream commit 45611c61 ]
      
      The bug is not easily reproducable, as it may occur very infrequently
      (we had machines with 20minutes heavy downloading before it occurred)
      However, on a virual machine (VMWare on Windows 10 host) it occurred
      pretty frequently (1-2 seconds after a speedtest was started)
      
      dev->tx_skb mab be freed via dev_kfree_skb_irq on a callback
      before it is set.
      
      This causes the following problems:
      - double free of the skb or potential memory leak
      - in dmesg: 'recvmsg bug' and 'recvmsg bug 2' and eventually
        general protection fault
      
      Example dmesg output:
      [  134.841986] ------------[ cut here ]------------
      [  134.841987] recvmsg bug: copied 9C24A555 seq 9C24B557 rcvnxt 9C25A6B3 fl 0
      [  134.841993] WARNING: CPU: 7 PID: 2629 at /build/linux-hwe-On9fm7/linux-hwe-4.15.0/net/ipv4/tcp.c:1865 tcp_recvmsg+0x44d/0xab0
      [  134.841994] Modules linked in: ipheth(OE) kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd glue_helper cryptd vmw_balloon intel_rapl_perf joydev input_leds serio_raw vmw_vsock_vmci_transport vsock shpchp i2c_piix4 mac_hid binfmt_misc vmw_vmci parport_pc ppdev lp parport autofs4 vmw_pvscsi vmxnet3 hid_generic usbhid hid vmwgfx ttm drm_kms_helper syscopyarea sysfillrect mptspi mptscsih sysimgblt ahci psmouse fb_sys_fops pata_acpi mptbase libahci e1000 drm scsi_transport_spi
      [  134.842046] CPU: 7 PID: 2629 Comm: python Tainted: G        W  OE    4.15.0-34-generic #37~16.04.1-Ubuntu
      [  134.842046] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/19/2017
      [  134.842048] RIP: 0010:tcp_recvmsg+0x44d/0xab0
      [  134.842048] RSP: 0018:ffffa6630422bcc8 EFLAGS: 00010286
      [  134.842049] RAX: 0000000000000000 RBX: ffff997616f4f200 RCX: 0000000000000006
      [  134.842049] RDX: 0000000000000007 RSI: 0000000000000082 RDI: ffff9976257d6490
      [  134.842050] RBP: ffffa6630422bd98 R08: 0000000000000001 R09: 000000000004bba4
      [  134.842050] R10: 0000000001e00c6f R11: 000000000004bba4 R12: ffff99760dee3000
      [  134.842051] R13: 0000000000000000 R14: ffff99760dee3514 R15: 0000000000000000
      [  134.842051] FS:  00007fe332347700(0000) GS:ffff9976257c0000(0000) knlGS:0000000000000000
      [  134.842052] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  134.842053] CR2: 0000000001e41000 CR3: 000000020e9b4006 CR4: 00000000003606e0
      [  134.842055] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  134.842055] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  134.842057] Call Trace:
      [  134.842060]  ? aa_sk_perm+0x53/0x1a0
      [  134.842064]  inet_recvmsg+0x51/0xc0
      [  134.842066]  sock_recvmsg+0x43/0x50
      [  134.842070]  SYSC_recvfrom+0xe4/0x160
      [  134.842072]  ? __schedule+0x3de/0x8b0
      [  134.842075]  ? ktime_get_ts64+0x4c/0xf0
      [  134.842079]  SyS_recvfrom+0xe/0x10
      [  134.842082]  do_syscall_64+0x73/0x130
      [  134.842086]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      [  134.842086] RIP: 0033:0x7fe331f5a81d
      [  134.842088] RSP: 002b:00007ffe8da98398 EFLAGS: 00000246 ORIG_RAX: 000000000000002d
      [  134.842090] RAX: ffffffffffffffda RBX: ffffffffffffffff RCX: 00007fe331f5a81d
      [  134.842094] RDX: 00000000000003fb RSI: 0000000001e00874 RDI: 0000000000000003
      [  134.842095] RBP: 00007fe32f642c70 R08: 0000000000000000 R09: 0000000000000000
      [  134.842097] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fe332347698
      [  134.842099] R13: 0000000001b7e0a0 R14: 0000000001e00874 R15: 0000000000000000
      [  134.842103] Code: 24 fd ff ff e9 cc fe ff ff 48 89 d8 41 8b 8c 24 10 05 00 00 44 8b 45 80 48 c7 c7 08 bd 59 8b 48 89 85 68 ff ff ff e8 b3 c4 7d ff <0f> 0b 48 8b 85 68 ff ff ff e9 e9 fe ff ff 41 8b 8c 24 10 05 00
      [  134.842126] ---[ end trace b7138fc08c83147f ]---
      [  134.842144] general protection fault: 0000 [#1] SMP PTI
      [  134.842145] Modules linked in: ipheth(OE) kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd glue_helper cryptd vmw_balloon intel_rapl_perf joydev input_leds serio_raw vmw_vsock_vmci_transport vsock shpchp i2c_piix4 mac_hid binfmt_misc vmw_vmci parport_pc ppdev lp parport autofs4 vmw_pvscsi vmxnet3 hid_generic usbhid hid vmwgfx ttm drm_kms_helper syscopyarea sysfillrect mptspi mptscsih sysimgblt ahci psmouse fb_sys_fops pata_acpi mptbase libahci e1000 drm scsi_transport_spi
      [  134.842161] CPU: 7 PID: 2629 Comm: python Tainted: G        W  OE    4.15.0-34-generic #37~16.04.1-Ubuntu
      [  134.842162] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/19/2017
      [  134.842164] RIP: 0010:tcp_close+0x2c6/0x440
      [  134.842165] RSP: 0018:ffffa6630422bde8 EFLAGS: 00010202
      [  134.842167] RAX: 0000000000000000 RBX: ffff99760dee3000 RCX: 0000000180400034
      [  134.842168] RDX: 5c4afd407207a6c4 RSI: ffffe868495bd300 RDI: ffff997616f4f200
      [  134.842169] RBP: ffffa6630422be08 R08: 0000000016f4d401 R09: 0000000180400034
      [  134.842169] R10: ffffa6630422bd98 R11: 0000000000000000 R12: 000000000000600c
      [  134.842170] R13: 0000000000000000 R14: ffff99760dee30c8 R15: ffff9975bd44fe00
      [  134.842171] FS:  00007fe332347700(0000) GS:ffff9976257c0000(0000) knlGS:0000000000000000
      [  134.842173] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  134.842174] CR2: 0000000001e41000 CR3: 000000020e9b4006 CR4: 00000000003606e0
      [  134.842177] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  134.842178] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  134.842179] Call Trace:
      [  134.842181]  inet_release+0x42/0x70
      [  134.842183]  __sock_release+0x42/0xb0
      [  134.842184]  sock_close+0x15/0x20
      [  134.842187]  __fput+0xea/0x220
      [  134.842189]  ____fput+0xe/0x10
      [  134.842191]  task_work_run+0x8a/0xb0
      [  134.842193]  exit_to_usermode_loop+0xc4/0xd0
      [  134.842195]  do_syscall_64+0xf4/0x130
      [  134.842197]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      [  134.842197] RIP: 0033:0x7fe331f5a560
      [  134.842198] RSP: 002b:00007ffe8da982e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
      [  134.842200] RAX: 0000000000000000 RBX: 00007fe32f642c70 RCX: 00007fe331f5a560
      [  134.842201] RDX: 00000000008f5320 RSI: 0000000001cd4b50 RDI: 0000000000000003
      [  134.842202] RBP: 00007fe32f6500f8 R08: 000000000000003c R09: 00000000009343c0
      [  134.842203] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fe32f6500d0
      [  134.842204] R13: 00000000008f5320 R14: 00000000008f5320 R15: 0000000001cd4770
      [  134.842205] Code: c8 00 00 00 45 31 e4 49 39 fe 75 4d eb 50 83 ab d8 00 00 00 01 48 8b 17 48 8b 47 08 48 c7 07 00 00 00 00 48 c7 47 08 00 00 00 00 <48> 89 42 08 48 89 10 0f b6 57 34 8b 47 2c 2b 47 28 83 e2 01 80
      [  134.842226] RIP: tcp_close+0x2c6/0x440 RSP: ffffa6630422bde8
      [  134.842227] ---[ end trace b7138fc08c831480 ]---
      
      The proposed patch eliminates a potential racing condition.
      Before, usb_submit_urb was called and _after_ that, the skb was attached
      (dev->tx_skb). So, on a callback it was possible, however unlikely that the
      skb was freed before it was set. That way (because dev->tx_skb was not set
      to NULL after it was freed), it could happen that a skb from a earlier
      transmission was freed a second time (and the skb we should have freed did
      not get freed at all)
      
      Now we free the skb directly in ipheth_tx(). It is not passed to the
      callback anymore, eliminating the posibility of a double free of the same
      skb. Depending on the retval of usb_submit_urb() we use dev_kfree_skb_any()
      respectively dev_consume_skb_any() to free the skb.
      Signed-off-by: default avatarOliver Zweigle <Oliver.Zweigle@faro.com>
      Signed-off-by: default avatarBernd Eckstein <3ernd.Eckstein@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      535b494a
    • Julian Wiedmann's avatar
      s390/qeth: fix length check in SNMP processing · 711e3d37
      Julian Wiedmann authored
      [ Upstream commit 9a764c1e ]
      
      The response for a SNMP request can consist of multiple parts, which
      the cmd callback stages into a kernel buffer until all parts have been
      received. If the callback detects that the staging buffer provides
      insufficient space, it bails out with error.
      This processing is buggy for the first part of the response - while it
      initially checks for a length of 'data_len', it later copies an
      additional amount of 'offsetof(struct qeth_snmp_cmd, data)' bytes.
      
      Fix the calculation of 'data_len' for the first part of the response.
      This also nicely cleans up the memcpy code.
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: default avatarJulian Wiedmann <jwi@linux.ibm.com>
      Reviewed-by: default avatarUrsula Braun <ubraun@linux.ibm.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      711e3d37
    • Pan Bian's avatar
      rapidio/rionet: do not free skb before reading its length · 720e0d05
      Pan Bian authored
      [ Upstream commit cfc43519 ]
      
      skb is freed via dev_kfree_skb_any, however, skb->len is read then. This
      may result in a use-after-free bug.
      
      Fixes: e6161d64 ("rapidio/rionet: rework driver initialization and removal")
      Signed-off-by: default avatarPan Bian <bianpan2016@163.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      720e0d05
    • Willem de Bruijn's avatar
      packet: copy user buffers before orphan or clone · f2a67e68
      Willem de Bruijn authored
      [ Upstream commit 5cd8d46e ]
      
      tpacket_snd sends packets with user pages linked into skb frags. It
      notifies that pages can be reused when the skb is released by setting
      skb->destructor to tpacket_destruct_skb.
      
      This can cause data corruption if the skb is orphaned (e.g., on
      transmit through veth) or cloned (e.g., on mirror to another psock).
      
      Create a kernel-private copy of data in these cases, same as tun/tap
      zerocopy transmission. Reuse that infrastructure: mark the skb as
      SKBTX_ZEROCOPY_FRAG, which will trigger copy in skb_orphan_frags(_rx).
      
      Unlike other zerocopy packets, do not set shinfo destructor_arg to
      struct ubuf_info. tpacket_destruct_skb already uses that ptr to notify
      when the original skb is released and a timestamp is recorded. Do not
      change this timestamp behavior. The ubuf_info->callback is not needed
      anyway, as no zerocopy notification is expected.
      
      Mark destructor_arg as not-a-uarg by setting the lower bit to 1. The
      resulting value is not a valid ubuf_info pointer, nor a valid
      tpacket_snd frame address. Add skb_zcopy_.._nouarg helpers for this.
      
      The fix relies on features introduced in commit 52267790 ("sock:
      add MSG_ZEROCOPY"), so can be backported as is only to 4.14.
      
      Tested with from `./in_netns.sh ./txring_overwrite` from
      http://github.com/wdebruij/kerneltools/tests
      
      Fixes: 69e3c75f ("net: TX_RING and packet mmap")
      Reported-by: default avatarAnand H. Krishnan <anandhkrishnan@gmail.com>
      Signed-off-by: default avatarWillem de Bruijn <willemb@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f2a67e68
    • Lorenzo Bianconi's avatar
      net: thunderx: set tso_hdrs pointer to NULL in nicvf_free_snd_queue · abc963e4
      Lorenzo Bianconi authored
      [ Upstream commit ef2a7cf1 ]
      
      Reset snd_queue tso_hdrs pointer to NULL in nicvf_free_snd_queue routine
      since it is used to check if tso dma descriptor queue has been previously
      allocated. The issue can be triggered with the following reproducer:
      
      $ip link set dev enP2p1s0v0 xdpdrv obj xdp_dummy.o
      $ip link set dev enP2p1s0v0 xdpdrv off
      
      [  341.467649] WARNING: CPU: 74 PID: 2158 at mm/vmalloc.c:1511 __vunmap+0x98/0xe0
      [  341.515010] Hardware name: GIGABYTE H270-T70/MT70-HD0, BIOS T49 02/02/2018
      [  341.521874] pstate: 60400005 (nZCv daif +PAN -UAO)
      [  341.526654] pc : __vunmap+0x98/0xe0
      [  341.530132] lr : __vunmap+0x98/0xe0
      [  341.533609] sp : ffff00001c5db860
      [  341.536913] x29: ffff00001c5db860 x28: 0000000000020000
      [  341.542214] x27: ffff810feb5090b0 x26: ffff000017e57000
      [  341.547515] x25: 0000000000000000 x24: 00000000fbd00000
      [  341.552816] x23: 0000000000000000 x22: ffff810feb5090b0
      [  341.558117] x21: 0000000000000000 x20: 0000000000000000
      [  341.563418] x19: ffff000017e57000 x18: 0000000000000000
      [  341.568719] x17: 0000000000000000 x16: 0000000000000000
      [  341.574020] x15: 0000000000000010 x14: ffffffffffffffff
      [  341.579321] x13: ffff00008985eb27 x12: ffff00000985eb2f
      [  341.584622] x11: ffff0000096b3000 x10: ffff00001c5db510
      [  341.589923] x9 : 00000000ffffffd0 x8 : ffff0000086868e8
      [  341.595224] x7 : 3430303030303030 x6 : 00000000000006ef
      [  341.600525] x5 : 00000000003fffff x4 : 0000000000000000
      [  341.605825] x3 : 0000000000000000 x2 : ffffffffffffffff
      [  341.611126] x1 : ffff0000096b3728 x0 : 0000000000000038
      [  341.616428] Call trace:
      [  341.618866]  __vunmap+0x98/0xe0
      [  341.621997]  vunmap+0x3c/0x50
      [  341.624961]  arch_dma_free+0x68/0xa0
      [  341.628534]  dma_direct_free+0x50/0x80
      [  341.632285]  nicvf_free_resources+0x160/0x2d8 [nicvf]
      [  341.637327]  nicvf_config_data_transfer+0x174/0x5e8 [nicvf]
      [  341.642890]  nicvf_stop+0x298/0x340 [nicvf]
      [  341.647066]  __dev_close_many+0x9c/0x108
      [  341.650977]  dev_close_many+0xa4/0x158
      [  341.654720]  rollback_registered_many+0x140/0x530
      [  341.659414]  rollback_registered+0x54/0x80
      [  341.663499]  unregister_netdevice_queue+0x9c/0xe8
      [  341.668192]  unregister_netdev+0x28/0x38
      [  341.672106]  nicvf_remove+0xa4/0xa8 [nicvf]
      [  341.676280]  nicvf_shutdown+0x20/0x30 [nicvf]
      [  341.680630]  pci_device_shutdown+0x44/0x88
      [  341.684720]  device_shutdown+0x144/0x250
      [  341.688640]  kernel_restart_prepare+0x44/0x50
      [  341.692986]  kernel_restart+0x20/0x68
      [  341.696638]  __se_sys_reboot+0x210/0x238
      [  341.700550]  __arm64_sys_reboot+0x24/0x30
      [  341.704555]  el0_svc_handler+0x94/0x110
      [  341.708382]  el0_svc+0x8/0xc
      [  341.711252] ---[ end trace 3f4019c8439959c9 ]---
      [  341.715874] page:ffff7e0003ef4000 count:0 mapcount:0 mapping:0000000000000000 index:0x4
      [  341.723872] flags: 0x1fffe000000000()
      [  341.727527] raw: 001fffe000000000 ffff7e0003f1a008 ffff7e0003ef4048 0000000000000000
      [  341.735263] raw: 0000000000000004 0000000000000000 00000000ffffffff 0000000000000000
      [  341.742994] page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
      
      where xdp_dummy.c is a simple bpf program that forwards the incoming
      frames to the network stack (available here:
      https://github.com/altoor/xdp_walkthrough_examples/blob/master/sample_1/xdp_dummy.c)
      
      Fixes: 05c773f5 ("net: thunderx: Add basic XDP support")
      Fixes: 4863dea3 ("net: Adding support for Cavium ThunderX network controller")
      Signed-off-by: default avatarLorenzo Bianconi <lorenzo.bianconi@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      abc963e4
    • Andreas Fiedler's avatar
      net: gemini: Fix copy/paste error · cfbee9e9
      Andreas Fiedler authored
      [ Upstream commit 07093b76 ]
      
      The TX stats should be started with the tx_stats_syncp,
      there seems to be a copy/paste error in the driver.
      Signed-off-by: default avatarAndreas Fiedler <andreas.fiedler@gmx.net>
      Signed-off-by: default avatarLinus Walleij <linus.walleij@linaro.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      cfbee9e9
    • Paolo Abeni's avatar
      net: don't keep lonely packets forever in the gro hash · b24a813e
      Paolo Abeni authored
      [ Upstream commit 605108ac ]
      
      Eric noted that with UDP GRO and NAPI timeout, we could keep a single
      UDP packet inside the GRO hash forever, if the related NAPI instance
      calls napi_gro_complete() at an higher frequency than the NAPI timeout.
      Willem noted that even TCP packets could be trapped there, till the
      next retransmission.
      This patch tries to address the issue, flushing the old packets -
      those with a NAPI_GRO_CB age before the current jiffy - before scheduling
      the NAPI timeout. The rationale is that such a timeout should be
      well below a jiffy and we are not flushing packets eligible for sane GRO.
      
      v1  -> v2:
       - clarified the commit message and comment
      
      RFC -> v1:
       - added 'Fixes tags', cleaned-up the wording.
      Reported-by: default avatarEric Dumazet <eric.dumazet@gmail.com>
      Fixes: 3b47d303 ("net: gro: add a per device gro flush timer")
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      Acked-by: default avatarWillem de Bruijn <willemb@google.com>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b24a813e
    • Bryan Whitehead's avatar
      lan743x: fix return value for lan743x_tx_napi_poll · 18dd9bf5
      Bryan Whitehead authored
      [ Upstream commit cc592205 ]
      
      The lan743x driver, when under heavy traffic load, has been noticed
      to sometimes hang, or cause a kernel panic.
      
      Debugging reveals that the TX napi poll routine was returning
      the wrong value, 'weight'. Most other drivers return 0.
      And call napi_complete, instead of napi_complete_done.
      
      Additionally when creating the tx napi poll routine.
      Changed netif_napi_add, to netif_tx_napi_add.
      
      Updates for v3:
          changed 'fixes' tag to match defined format
      
      Updates for v2:
      use napi_complete, instead of napi_complete_done in
          lan743x_tx_napi_poll
      use netif_tx_napi_add, instead of netif_napi_add for
          registration of tx napi poll routine
      
      fixes: 23f0703c ("lan743x: Add main source files for new lan743x driver")
      Signed-off-by: default avatarBryan Whitehead <Bryan.Whitehead@microchip.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      18dd9bf5
    • Bryan Whitehead's avatar
      lan743x: Enable driver to work with LAN7431 · 767d8903
      Bryan Whitehead authored
      [ Upstream commit 4df5ce9b ]
      
      This driver was designed to work with both LAN7430 and LAN7431.
      The only difference between the two is the LAN7431 has support
      for external phy.
      
      This change adds LAN7431 to the list of recognized devices
      supported by this driver.
      
      Updates for v2:
          changed 'fixes' tag to match defined format
      
      fixes: 23f0703c ("lan743x: Add main source files for new lan743x driver")
      Signed-off-by: default avatarBryan Whitehead <Bryan.Whitehead@microchip.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      767d8903
    • Hugh Dickins's avatar
      mm/khugepaged: collapse_shmem() do not crash on Compound · 8b37c405
      Hugh Dickins authored
      commit 06a5e126 upstream.
      
      collapse_shmem()'s VM_BUG_ON_PAGE(PageTransCompound) was unsafe: before
      it holds page lock of the first page, racing truncation then extension
      might conceivably have inserted a hugepage there already.  Fail with the
      SCAN_PAGE_COMPOUND result, instead of crashing (CONFIG_DEBUG_VM=y) or
      otherwise mishandling the unexpected hugepage - though later we might
      code up a more constructive way of handling it, with SCAN_SUCCESS.
      
      Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261529310.2275@eggly.anvils
      Fixes: f3f0e1d2 ("khugepaged: add support of collapse for tmpfs/shmem pages")
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: <stable@vger.kernel.org>	[4.8+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      8b37c405
    • Hugh Dickins's avatar
      mm/khugepaged: collapse_shmem() without freezing new_page · af24c018
      Hugh Dickins authored
      commit 87c460a0 upstream.
      
      khugepaged's collapse_shmem() does almost all of its work, to assemble
      the huge new_page from 512 scattered old pages, with the new_page's
      refcount frozen to 0 (and refcounts of all old pages so far also frozen
      to 0).  Including shmem_getpage() to read in any which were out on swap,
      memory reclaim if necessary to allocate their intermediate pages, and
      copying over all the data from old to new.
      
      Imagine the frozen refcount as a spinlock held, but without any lock
      debugging to highlight the abuse: it's not good, and under serious load
      heads into lockups - speculative getters of the page are not expecting
      to spin while khugepaged is rescheduled.
      
      One can get a little further under load by hacking around elsewhere; but
      fortunately, freezing the new_page turns out to have been entirely
      unnecessary, with no hacks needed elsewhere.
      
      The huge new_page lock is already held throughout, and guards all its
      subpages as they are brought one by one into the page cache tree; and
      anything reading the data in that page, without the lock, before it has
      been marked PageUptodate, would already be in the wrong.  So simply
      eliminate the freezing of the new_page.
      
      Each of the old pages remains frozen with refcount 0 after it has been
      replaced by a new_page subpage in the page cache tree, until they are
      all unfrozen on success or failure: just as before.  They could be
      unfrozen sooner, but cause no problem once no longer visible to
      find_get_entry(), filemap_map_pages() and other speculative lookups.
      
      Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261527570.2275@eggly.anvils
      Fixes: f3f0e1d2 ("khugepaged: add support of collapse for tmpfs/shmem pages")
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: <stable@vger.kernel.org>	[4.8+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      af24c018
    • Hugh Dickins's avatar
      mm/khugepaged: minor reorderings in collapse_shmem() · 3e9646c7
      Hugh Dickins authored
      commit 042a3082 upstream.
      
      Several cleanups in collapse_shmem(): most of which probably do not
      really matter, beyond doing things in a more familiar and reassuring
      order.  Simplify the failure gotos in the main loop, and on success
      update stats while interrupts still disabled from the last iteration.
      
      Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261526400.2275@eggly.anvils
      Fixes: f3f0e1d2 ("khugepaged: add support of collapse for tmpfs/shmem pages")
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: <stable@vger.kernel.org>	[4.8+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      3e9646c7
    • Hugh Dickins's avatar
      mm/khugepaged: collapse_shmem() remember to clear holes · ee13d69b
      Hugh Dickins authored
      commit 2af8ff29 upstream.
      
      Huge tmpfs testing reminds us that there is no __GFP_ZERO in the gfp
      flags khugepaged uses to allocate a huge page - in all common cases it
      would just be a waste of effort - so collapse_shmem() must remember to
      clear out any holes that it instantiates.
      
      The obvious place to do so, where they are put into the page cache tree,
      is not a good choice: because interrupts are disabled there.  Leave it
      until further down, once success is assured, where the other pages are
      copied (before setting PageUptodate).
      
      Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261525080.2275@eggly.anvils
      Fixes: f3f0e1d2 ("khugepaged: add support of collapse for tmpfs/shmem pages")
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: <stable@vger.kernel.org>	[4.8+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      ee13d69b
    • Hugh Dickins's avatar
      mm/khugepaged: fix crashes due to misaccounted holes · 78141aab
      Hugh Dickins authored
      commit aaa52e34 upstream.
      
      Huge tmpfs testing on a shortish file mapped into a pmd-rounded extent
      hit shmem_evict_inode()'s WARN_ON(inode->i_blocks) followed by
      clear_inode()'s BUG_ON(inode->i_data.nrpages) when the file was later
      closed and unlinked.
      
      khugepaged's collapse_shmem() was forgetting to update mapping->nrpages
      on the rollback path, after it had added but then needs to undo some
      holes.
      
      There is indeed an irritating asymmetry between shmem_charge(), whose
      callers want it to increment nrpages after successfully accounting
      blocks, and shmem_uncharge(), when __delete_from_page_cache() already
      decremented nrpages itself: oh well, just add a comment on that to them
      both.
      
      And shmem_recalc_inode() is supposed to be called when the accounting is
      expected to be in balance (so it can deduce from imbalance that reclaim
      discarded some pages): so change shmem_charge() to update nrpages
      earlier (though it's rare for the difference to matter at all).
      
      Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261523450.2275@eggly.anvils
      Fixes: 800d8c63 ("shmem: add huge pages support")
      Fixes: f3f0e1d2 ("khugepaged: add support of collapse for tmpfs/shmem pages")
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: <stable@vger.kernel.org>	[4.8+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      78141aab
    • Hugh Dickins's avatar
      mm/khugepaged: collapse_shmem() stop if punched or truncated · 8797f2f4
      Hugh Dickins authored
      commit 701270fa upstream.
      
      Huge tmpfs testing showed that although collapse_shmem() recognizes a
      concurrently truncated or hole-punched page correctly, its handling of
      holes was liable to refill an emptied extent.  Add check to stop that.
      
      Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261522040.2275@eggly.anvils
      Fixes: f3f0e1d2 ("khugepaged: add support of collapse for tmpfs/shmem pages")
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Reviewed-by: default avatarMatthew Wilcox <willy@infradead.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: <stable@vger.kernel.org>	[4.8+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      8797f2f4
    • Hugh Dickins's avatar
      mm/huge_memory: fix lockdep complaint on 32-bit i_size_read() · d31ff472
      Hugh Dickins authored
      commit 006d3ff2 upstream.
      
      Huge tmpfs testing, on 32-bit kernel with lockdep enabled, showed that
      __split_huge_page() was using i_size_read() while holding the irq-safe
      lru_lock and page tree lock, but the 32-bit i_size_read() uses an
      irq-unsafe seqlock which should not be nested inside them.
      
      Instead, read the i_size earlier in split_huge_page_to_list(), and pass
      the end offset down to __split_huge_page(): all while holding head page
      lock, which is enough to prevent truncation of that extent before the
      page tree lock has been taken.
      
      Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261520070.2275@eggly.anvils
      Fixes: baa355fd ("thp: file pages support for split_huge_page()")
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: <stable@vger.kernel.org>	[4.8+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      d31ff472
    • Hugh Dickins's avatar
      mm/huge_memory: splitting set mapping+index before unfreeze · 7e18656c
      Hugh Dickins authored
      commit 173d9d9f upstream.
      
      Huge tmpfs stress testing has occasionally hit shmem_undo_range()'s
      VM_BUG_ON_PAGE(page_to_pgoff(page) != index, page).
      
      Move the setting of mapping and index up before the page_ref_unfreeze()
      in __split_huge_page_tail() to fix this: so that a page cache lookup
      cannot get a reference while the tail's mapping and index are unstable.
      
      In fact, might as well move them up before the smp_wmb(): I don't see an
      actual need for that, but if I'm missing something, this way round is
      safer than the other, and no less efficient.
      
      You might argue that VM_BUG_ON_PAGE(page_to_pgoff(page) != index, page) is
      misplaced, and should be left until after the trylock_page(); but left as
      is has not crashed since, and gives more stringent assurance.
      
      Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261516380.2275@eggly.anvils
      Fixes: e9b61f19 ("thp: reintroduce split_huge_page()")
      Requires: 605ca5ed ("mm/huge_memory.c: reorder operations in __split_huge_page_tail()")
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: <stable@vger.kernel.org>	[4.8+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      7e18656c
    • Hugh Dickins's avatar
      mm/huge_memory: rename freeze_page() to unmap_page() · 69697e6a
      Hugh Dickins authored
      commit 906f9cdf upstream.
      
      The term "freeze" is used in several ways in the kernel, and in mm it
      has the particular meaning of forcing page refcount temporarily to 0.
      freeze_page() is just too confusing a name for a function that unmaps a
      page: rename it unmap_page(), and rename unfreeze_page() remap_page().
      
      Went to change the mention of freeze_page() added later in mm/rmap.c,
      but found it to be incorrect: ordinary page reclaim reaches there too;
      but the substance of the comment still seems correct, so edit it down.
      
      Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261514080.2275@eggly.anvils
      Fixes: e9b61f19 ("thp: reintroduce split_huge_page()")
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: <stable@vger.kernel.org>	[4.8+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      69697e6a
  2. 01 Dec, 2018 22 commits