1. 15 Oct, 2014 14 commits
    • ipv6: fix rtnl locking in setsockopt for anycast and multicast · 4c163b4a
      Sabrina Dubroca authored
      [ Upstream commit a9ed4a29 ]
      
      Calling setsockopt with IPV6_JOIN_ANYCAST or IPV6_LEAVE_ANYCAST
      triggers the assertion in addrconf_join_solict()/addrconf_leave_solict().
      
      ipv6_sock_ac_join(), ipv6_sock_ac_drop(), ipv6_sock_ac_close() need to
      take RTNL before calling ipv6_dev_ac_inc/dec. Same thing with
      ipv6_sock_mc_join(), ipv6_sock_mc_drop(), ipv6_sock_mc_close() before
      calling ipv6_dev_mc_inc/dec.
      
      This patch moves ASSERT_RTNL() up a level in the call stack.
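      
      A minimal sketch of the resulting locking pattern (illustrative, not the
      literal diff; __ipv6_sock_ac_join() is a hypothetical stand-in for the
      existing function body):
      
        int ipv6_sock_ac_join(struct sock *sk, int ifindex,
                              const struct in6_addr *addr)
        {
                int err;
      
                rtnl_lock();    /* satisfies ASSERT_RTNL() in the
                                 * ipv6_dev_ac_inc/dec path below */
                err = __ipv6_sock_ac_join(sk, ifindex, addr);
                rtnl_unlock();
                return err;
        }
      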
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
      Reported-by: Tommi Rantala <tt.rantala@gmail.com>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • l2tp: fix race while getting PMTU on PPP pseudo-wire · 5f3a420d
      Guillaume Nault authored
      [ Upstream commit eed4d839 ]
      
      Use the dst_entry held by sk_dst_get() to retrieve the tunnel's PMTU.
      
      The dst_mtu(__sk_dst_get(tunnel->sock)) call was racy. __sk_dst_get()
      could return NULL if tunnel->sock->sk_dst_cache was reset just before the
      call, thus making dst_mtu() dereference a NULL pointer:
      
      [ 1937.661598] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
      [ 1937.664005] IP: [<ffffffffa049db88>] pppol2tp_connect+0x33d/0x41e [l2tp_ppp]
      [ 1937.664005] PGD daf0c067 PUD d9f93067 PMD 0
      [ 1937.664005] Oops: 0000 [#1] SMP
      [ 1937.664005] Modules linked in: l2tp_ppp l2tp_netlink l2tp_core ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables x_tables udp_tunnel pppoe pppox ppp_generic slhc deflate ctr twofish_generic twofish_x86_64_3way xts lrw gf128mul glue_helper twofish_x86_64 twofish_common blowfish_generic blowfish_x86_64 blowfish_common des_generic cbc xcbc rmd160 sha512_generic hmac crypto_null af_key xfrm_algo 8021q garp bridge stp llc tun atmtcp clip atm ext3 mbcache jbd iTCO_wdt coretemp kvm_intel iTCO_vendor_support kvm pcspkr evdev ehci_pci lpc_ich mfd_core i5400_edac edac_core i5k_amb shpchp button processor thermal_sys xfs crc32c_generic libcrc32c dm_mod usbhid sg hid sr_mod sd_mod cdrom crc_t10dif crct10dif_common ata_generic ahci ata_piix tg3 libahci libata uhci_hcd ptp ehci_hcd pps_core usbcore scsi_mod libphy usb_common [last unloaded: l2tp_core]
      [ 1937.664005] CPU: 0 PID: 10022 Comm: l2tpstress Tainted: G           O   3.17.0-rc1 #1
      [ 1937.664005] Hardware name: HP ProLiant DL160 G5, BIOS O12 08/22/2008
      [ 1937.664005] task: ffff8800d8fda790 ti: ffff8800c43c4000 task.ti: ffff8800c43c4000
      [ 1937.664005] RIP: 0010:[<ffffffffa049db88>]  [<ffffffffa049db88>] pppol2tp_connect+0x33d/0x41e [l2tp_ppp]
      [ 1937.664005] RSP: 0018:ffff8800c43c7de8  EFLAGS: 00010282
      [ 1937.664005] RAX: ffff8800da8a7240 RBX: ffff8800d8c64600 RCX: 000001c325a137b5
      [ 1937.664005] RDX: 8c6318c6318c6320 RSI: 000000000000010c RDI: 0000000000000000
      [ 1937.664005] RBP: ffff8800c43c7ea8 R08: 0000000000000000 R09: 0000000000000000
      [ 1937.664005] R10: ffffffffa048e2c0 R11: ffff8800d8c64600 R12: ffff8800ca7a5000
      [ 1937.664005] R13: ffff8800c439bf40 R14: 000000000000000c R15: 0000000000000009
      [ 1937.664005] FS:  00007fd7f610f700(0000) GS:ffff88011a600000(0000) knlGS:0000000000000000
      [ 1937.664005] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [ 1937.664005] CR2: 0000000000000020 CR3: 00000000d9d75000 CR4: 00000000000027e0
      [ 1937.664005] Stack:
      [ 1937.664005]  ffffffffa049da80 ffff8800d8fda790 000000000000005b ffff880000000009
      [ 1937.664005]  ffff8800daf3f200 0000000000000003 ffff8800c43c7e48 ffffffff81109b57
      [ 1937.664005]  ffffffff81109b0e ffffffff8114c566 0000000000000000 0000000000000000
      [ 1937.664005] Call Trace:
      [ 1937.664005]  [<ffffffffa049da80>] ? pppol2tp_connect+0x235/0x41e [l2tp_ppp]
      [ 1937.664005]  [<ffffffff81109b57>] ? might_fault+0x9e/0xa5
      [ 1937.664005]  [<ffffffff81109b0e>] ? might_fault+0x55/0xa5
      [ 1937.664005]  [<ffffffff8114c566>] ? rcu_read_unlock+0x1c/0x26
      [ 1937.664005]  [<ffffffff81309196>] SYSC_connect+0x87/0xb1
      [ 1937.664005]  [<ffffffff813e56f7>] ? sysret_check+0x1b/0x56
      [ 1937.664005]  [<ffffffff8107590d>] ? trace_hardirqs_on_caller+0x145/0x1a1
      [ 1937.664005]  [<ffffffff81213dee>] ? trace_hardirqs_on_thunk+0x3a/0x3f
      [ 1937.664005]  [<ffffffff8114c262>] ? spin_lock+0x9/0xb
      [ 1937.664005]  [<ffffffff813092b4>] SyS_connect+0x9/0xb
      [ 1937.664005]  [<ffffffff813e56d2>] system_call_fastpath+0x16/0x1b
      [ 1937.664005] Code: 10 2a 84 81 e8 65 76 bd e0 65 ff 0c 25 10 bb 00 00 4d 85 ed 74 37 48 8b 85 60 ff ff ff 48 8b 80 88 01 00 00 48 8b b8 10 02 00 00 <48> 8b 47 20 ff 50 20 85 c0 74 0f 83 e8 28 89 83 10 01 00 00 89
      [ 1937.664005] RIP  [<ffffffffa049db88>] pppol2tp_connect+0x33d/0x41e [l2tp_ppp]
      [ 1937.664005]  RSP <ffff8800c43c7de8>
      [ 1937.664005] CR2: 0000000000000020
      [ 1939.559375] ---[ end trace 82d44500f28f8708 ]---
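      
      A minimal sketch of the race-free pattern described above (hedged; field
      names such as session->mtu follow the l2tp sources but details may differ):
      
        struct dst_entry *dst = sk_dst_get(tunnel->sock);  /* takes a reference */
      
        if (dst) {
                u32 pmtu = dst_mtu(dst);        /* dst cannot vanish under us */
      
                if (pmtu != 0)
                        session->mtu = session->mru =
                                pmtu - PPPOL2TP_HEADER_OVERHEAD;
                dst_release(dst);
        }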
      
      Fixes: f34c4a35 ("l2tp: take PMTU from tunnel UDP socket")
      Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • vxlan: fix incorrect initializer in union vxlan_addr · ed6d9919
      Gerhard Stenzel authored
      [ Upstream commit a45e92a5 ]
      
      The first initializer in the following
      
              union vxlan_addr ipa = {
                  .sin.sin_addr.s_addr = tip,
                  .sa.sa_family = AF_INET,
              };
      
      is optimised away by the compiler, due to the second initializer, so
      .sin.sin_addr.s_addr is always initialised to 0.
      As a result, netlink messages indicating an L3 miss never contain the
      missed IP address. This was observed with GCC 4.8 and 4.9; I do not
      know about previous versions. The problem affects user-space programs
      that rely on an IP address being sent as part of a netlink message
      indicating an L3 miss.
      
      Changing
                  .sa.sa_family = AF_INET,
      to
                  .sin.sin_family = AF_INET,
      fixes the problem.
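      
      The effect is easy to reproduce in a stand-alone program (a hedged demo
      of the C initializer semantics involved, not kernel code; on the
      affected compilers the "broken" value prints as 0):
      
        #include <stdio.h>
        #include <sys/socket.h>
        #include <netinet/in.h>
      
        union addr {
                struct sockaddr    sa;
                struct sockaddr_in sin;
        };
      
        int main(void)
        {
                /* Two designators naming different union members: the second
                 * re-initializes the union, discarding the s_addr value. */
                union addr broken = {
                        .sin.sin_addr.s_addr = 0x0a00640a,  /* arbitrary value */
                        .sa.sa_family        = AF_INET,
                };
                /* The fix: stay within a single union member. */
                union addr fixed = {
                        .sin.sin_addr.s_addr = 0x0a00640a,
                        .sin.sin_family      = AF_INET,
                };
      
                printf("broken s_addr=%#x  fixed s_addr=%#x\n",
                       broken.sin.sin_addr.s_addr, fixed.sin.sin_addr.s_addr);
                return 0;
        }
      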
      Signed-off-by: Gerhard Stenzel <gerhard.stenzel@de.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • openvswitch: fix panic with multiple vlan headers · 5bfbaf50
      Jiri Benc authored
      [ Upstream commit 2ba5af42 ]
      
      When there are multiple vlan headers present in a received frame, the first
      one is put into vlan_tci and protocol is set to ETH_P_8021Q. Anything in the
      skb beyond the VLAN TPID may still be non-linear, including the inner TCI
      and ethertype. While ovs_flow_extract takes care of IP and IPv6 headers, it
      does nothing with ETH_P_8021Q. Later, if OVS_ACTION_ATTR_POP_VLAN is
      executed, __pop_vlan_tci pulls the next vlan header into vlan_tci.
      
      This leads to two things:
      
      1. Part of the resulting ethernet header is in the non-linear part of the
         skb. When eth_type_trans is called later as the result of
         OVS_ACTION_ATTR_OUTPUT, kernel BUGs in __skb_pull. Also, __pop_vlan_tci
         is in fact accessing random data when it reads past the TPID.
      
      2. network_header points into the ethernet header instead of behind it.
         mac_len is set to a wrong value (10), too.
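      
      A hedged illustration of the rule violated in (1), not the literal
      upstream diff: any header beyond the linear area must be pulled before
      it is read:
      
        /* Make sure the inner VLAN TCI and ethertype are in the linear part
         * of the skb before __pop_vlan_tci dereferences them; the error
         * handling here is illustrative. */
        if (unlikely(!pskb_may_pull(skb, VLAN_ETH_HLEN)))
                return -ENOMEM;
      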
      Reported-by: Yulong Pei <ypei@redhat.com>
      Signed-off-by: Jiri Benc <jbenc@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • packet: handle too big packets for PACKET_V3 · 454116bc
      Eric Dumazet authored
      [ Upstream commit dc808110 ]
      
      af_packet can currently overwrite kernel memory by out-of-bound
      accesses, because it assumed a [new] block can always hold one frame.
      
      This is not generally the case, even if most existing tools do it right.
      
      This patch clamps too-long frames, as the API permits, and issues a
      one-time error to syslog.
      
      [  394.357639] tpacket_rcv: packet too big, clamped from 5042 to 3966. macoff=82
      
      In this example, packet header tp_snaplen was set to 3966,
      and tp_len was set to 5042 (skb->len).
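      
      A hedged sketch of the clamping (max_frame_len stands in for the
      per-ring limit; identifiers are illustrative, not necessarily the
      upstream ones):
      
        if (po->tp_version == TPACKET_V3 &&
            unlikely(macoff + snaplen > max_frame_len)) {
                pr_err_once("tpacket_rcv: packet too big, clamped from %u to %u. macoff=%u\n",
                            snaplen, max_frame_len - macoff, macoff);
                snaplen = max_frame_len - macoff;  /* truncate, don't overwrite */
        }
      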
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Fixes: f6fb8f10 ("af-packet: TPACKET_V3 flexible buffer implementation.")
      Acked-by: Daniel Borkmann <dborkman@redhat.com>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • tcp: fix ssthresh and undo for consecutive short FRTO episodes · 867f806f
      Neal Cardwell authored
      [ Upstream commit 0c9ab092 ]
      
      Fix TCP FRTO logic so that it always notices when snd_una advances,
      indicating that any RTO after that point will be a new and distinct
      loss episode.
      
      Previously there was a very specific sequence that could cause FRTO to
      fail to notice a new loss episode had started:
      
      (1) RTO timer fires, enter FRTO and retransmit packet 1 in write queue
      (2) receiver ACKs packet 1
      (3) FRTO sends 2 more packets
      (4) RTO timer fires again (should start a new loss episode)
      
      The problem was in step (3) above, where tcp_process_loss() returned
      early (in the spot marked "Step 2.b"), so that it never got to the
      logic to clear icsk_retransmits. Thus icsk_retransmits stayed
      non-zero. Thus in step (4) tcp_enter_loss() would see the non-zero
      icsk_retransmits, decide that this RTO is not a new episode, and
      decide not to cut ssthresh and remember the current cwnd and ssthresh
      for undo.
      
      There were two main consequences to the bug that we have
      observed. First, ssthresh was not decreased in step (4). Second, when
      there was a series of such FRTO (1-4) sequences that happened to be
      followed by an FRTO undo, we would restore the cwnd and ssthresh from
      before the entire series started (instead of the cwnd and ssthresh
      from before the most recent RTO). This could result in cwnd and
      ssthresh being restored to values much bigger than the proper values.
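      
      A hedged sketch of the idea (FLAG_SND_UNA_ADVANCED exists in
      tcp_input.c; the exact placement is illustrative): treat any ACK that
      advances snd_una as the end of the current loss episode, even on the
      early-return path in tcp_process_loss():
      
        /* snd_una moved forward: whatever RTO fires next starts a new and
         * distinct loss episode, so forget the old retransmit count. */
        if (flag & FLAG_SND_UNA_ADVANCED)
                inet_csk(sk)->icsk_retransmits = 0;
      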
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Fixes: e33099f9 ("tcp: implement RFC5682 F-RTO")
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • tcp: fix tcp_release_cb() to dispatch via address family for mtu_reduced() · 97b477f8
      Neal Cardwell authored
      [ Upstream commit 4fab9071 ]
      
      Make sure we use the correct address-family-specific function for
      handling MTU reductions from within tcp_release_cb().
      
      Previously AF_INET6 sockets were incorrectly always using the IPv6
      code path when sometimes they were handling IPv4 traffic and thus had
      an IPv4 dst.
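      
      A hedged sketch of the resulting dispatch in tcp_release_cb() (the
      mtu_reduced member is what the patch adds to the per-family ops;
      surrounding details are illustrative):
      
        if (flags & (1UL << TCP_MTU_REDUCED_DEFERRED)) {
                /* tcp_v4_mtu_reduced() or tcp_v6_mtu_reduced(), chosen by
                 * the socket's icsk_af_ops rather than hard-coded. */
                inet_csk(sk)->icsk_af_ops->mtu_reduced(sk);
                __sock_put(sk);
        }
      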
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Diagnosed-by: Willem de Bruijn <willemb@google.com>
      Fixes: 563d34d0 ("tcp: dont drop MTU reduction indications")
      Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • sit: Fix ipip6_tunnel_lookup device matching criteria · 8e27e5cd
      Shmulik Ladkani authored
      [ Upstream commit bc8fc7b8 ]
      
      As of 4fddbf5d ("sit: strictly restrict incoming traffic to tunnel link device"),
      when looking up a tunnel, tunnel's underlying interface (t->parms.link)
      is verified to match incoming traffic's ingress device.
      
      However the comparison was incorrectly based on skb->dev->iflink.
      
      Instead, dev->ifindex should be used, which correctly represents the
      interface from which the IP stack hands the ipip6 packets.
      
      This allows setting up sit tunnels bound to vlan interfaces (otherwise
      incoming ipip6 traffic on the vlan interface was dropped due to
      ipip6_tunnel_lookup match failure).
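      
      A hedged sketch of the corrected match (an illustrative fragment of the
      lookup, not the literal diff):
      
        /* Bind the tunnel to the ingress device itself; dev->iflink would
         * name the lower device (the real NIC under a vlan interface). */
        if (t->parms.link == dev->ifindex)      /* was: dev->iflink */
                return t;
      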
      Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
      Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • tcp: don't use timestamp from repaired skb-s to calculate RTT (v2) · d7bdb8e9
      Andrey Vagin authored
      [ Upstream commit 9d186cac ]
      
      We don't know the right timestamp for repaired skbs. Wrong RTT
      estimations aren't good, because some congestion modules depend
      heavily on them.
      
      This patch adds the TCPCB_REPAIRED flag, which is included in
      TCPCB_RETRANS.
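      
      A hedged sketch of the flag layout (values as in the tcp_skb_cb flag
      definitions of that era):
      
        #define TCPCB_SACKED_ACKED      0x01    /* SKB ACK'd by a SACK block  */
        #define TCPCB_SACKED_RETRANS    0x02    /* SKB retransmitted          */
        #define TCPCB_LOST              0x04    /* SKB is lost                */
        #define TCPCB_TAGBITS           0x07    /* All tag bits               */
        #define TCPCB_REPAIRED          0x10    /* SKB repaired (no skb->when) */
        #define TCPCB_EVER_RETRANS      0x80    /* Ever retransmitted frame   */
        #define TCPCB_RETRANS           (TCPCB_SACKED_RETRANS | \
                                         TCPCB_EVER_RETRANS | TCPCB_REPAIRED)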
      
      Thanks to Eric for the advice on how to fix this issue.
      
      This patch fixes the warning:
      [  879.562947] WARNING: CPU: 0 PID: 2825 at net/ipv4/tcp_input.c:3078 tcp_ack+0x11f5/0x1380()
      [  879.567253] CPU: 0 PID: 2825 Comm: socket-tcpbuf-l Not tainted 3.16.0-next-20140811 #1
      [  879.567829] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
      [  879.568177]  0000000000000000 00000000c532680c ffff880039643d00 ffffffff817aa2d2
      [  879.568776]  0000000000000000 ffff880039643d38 ffffffff8109afbd ffff880039d6ba80
      [  879.569386]  ffff88003a449800 000000002983d6bd 0000000000000000 000000002983d6bc
      [  879.569982] Call Trace:
      [  879.570264]  [<ffffffff817aa2d2>] dump_stack+0x4d/0x66
      [  879.570599]  [<ffffffff8109afbd>] warn_slowpath_common+0x7d/0xa0
      [  879.570935]  [<ffffffff8109b0ea>] warn_slowpath_null+0x1a/0x20
      [  879.571292]  [<ffffffff816d0a05>] tcp_ack+0x11f5/0x1380
      [  879.571614]  [<ffffffff816d10bd>] tcp_rcv_established+0x1ed/0x710
      [  879.571958]  [<ffffffff816dc9da>] tcp_v4_do_rcv+0x10a/0x370
      [  879.572315]  [<ffffffff81657459>] release_sock+0x89/0x1d0
      [  879.572642]  [<ffffffff816c81a0>] do_tcp_setsockopt.isra.36+0x120/0x860
      [  879.573000]  [<ffffffff8110a52e>] ? rcu_read_lock_held+0x6e/0x80
      [  879.573352]  [<ffffffff816c8912>] tcp_setsockopt+0x32/0x40
      [  879.573678]  [<ffffffff81654ac4>] sock_common_setsockopt+0x14/0x20
      [  879.574031]  [<ffffffff816537b0>] SyS_setsockopt+0x80/0xf0
      [  879.574393]  [<ffffffff817b40a9>] system_call_fastpath+0x16/0x1b
      [  879.574730] ---[ end trace a17cbc38eb8c5c00 ]---
      
      v2: move the setting of skb->when for repaired skbs into tcp_write_xmit,
          where it's set for other skbs.
      
      Fixes: 431a9124 ("tcp: timestamp SYN+DATA messages")
      Fixes: 740b0f18 ("tcp: switch rtt estimations to usec resolution")
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: Andrey Vagin <avagin@openvz.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • i40e: Don't stop driver probe when querying DCB config fails · 46b6abb9
      Neerav Parikh authored
      Commit id: 014269ff in Linus's tree
      should be queued up for stable 3.14 & 3.15 since the i40e driver will
      not load when DCB is enabled, unless this patch is applied.
      
      If any AQ command to query the port's DCB configuration fails during
      driver probe, the probe fails and returns an error.
      
      This patch prevents this issue by continuing the driver probe even
      when an error is returned.
      
      Also, add an error message dumping the AQ error status to show what
      caused the failure to get the DCB configuration from firmware.
      
      Change-ID: Ifd5663512588bca684069bb7d4fb586dd72221af
      Signed-off-by: Neerav Parikh <neerav.parikh@intel.com>
      Signed-off-by: Catherine Sullivan <catherine.sullivan@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • myri10ge: check for DMA mapping errors · b3804066
      Stanislaw Gruszka authored
      [ Upstream commit 10545937 ]
      
      On IOMMU systems DMA mapping can fail, so we need to check for
      that possibility.
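      
      A hedged sketch of the generic DMA API pattern involved (not the
      literal myri10ge diff):
      
        bus = dma_map_single(&pdev->dev, skb->data, len, DMA_TO_DEVICE);
        if (unlikely(dma_mapping_error(&pdev->dev, bus))) {
                /* Mapping failed (e.g. IOMMU exhaustion): drop the packet
                 * instead of handing a bad address to the NIC. */
                dev_kfree_skb_any(skb);
                return NETDEV_TX_OK;
        }
      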
      Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net: Always untag vlan-tagged traffic on input. · ff81e63f
      Vlad Yasevich authored
      [ Upstream commit 0d5501c1 ]
      
      Currently the functionality to untag traffic on input resides
      as part of the vlan module and is built only when VLAN support
      is enabled in the kernel.  When VLAN is disabled, the function
      vlan_untag() turns into a stub and doesn't really untag the
      packets.  This seems to create an interesting interaction
      between VMs supporting checksum offloading and some network drivers.
      
      There are some drivers that do not allow the user to change
      tx-vlan-offload feature of the driver.  These drivers also seem
      to assume that any VLAN-tagged traffic they transmit will
      have the vlan information in the vlan_tci and not in the vlan
      header already in the skb.  When transmitting skbs that already
      have tagged data with partial checksum set, the checksum doesn't
      appear to be updated correctly by the card thus resulting in a
      failure to establish TCP connections.
      
      The following is a packet trace taken on the receiver, where the
      sender is a VM with a VLAN configured.  The host the VM runs on does
      not have VLAN support, and the outgoing interface on the host is tg3:
      10:12:43.503055 52:54:00:ae:42:3f > 28:d2:44:7d:c2:de, ethertype 802.1Q
      (0x8100), length 78: vlan 100, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 27243,
      offset 0, flags [DF], proto TCP (6), length 60)
          10.0.100.1.58545 > 10.0.100.10.ircu-2: Flags [S], cksum 0xdc39 (incorrect
      -> 0x48d9), seq 1069378582, win 29200, options [mss 1460,sackOK,TS val
      4294837885 ecr 0,nop,wscale 7], length 0
      10:12:44.505556 52:54:00:ae:42:3f > 28:d2:44:7d:c2:de, ethertype 802.1Q
      (0x8100), length 78: vlan 100, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 27244,
      offset 0, flags [DF], proto TCP (6), length 60)
          10.0.100.1.58545 > 10.0.100.10.ircu-2: Flags [S], cksum 0xdc39 (incorrect
      -> 0x44ee), seq 1069378582, win 29200, options [mss 1460,sackOK,TS val
      4294838888 ecr 0,nop,wscale 7], length 0
      
      This connection finally times out.
      
      I only have access to TG3 hardware in this configuration and thus have
      only tested this with the tg3 driver.  There are a lot of other drivers
      that do not permit user changes to vlan acceleration features, and
      I don't know if they all suffer from a similar issue.
      
      The patch attempts to fix this another way.  It moves the vlan header
      stripping code out of the vlan module and always builds it into the
      kernel network core.  This way, even if vlan is not supported on
      a virtualization host, the virtual machines running on top of such
      a host will still work with VLANs enabled.
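      
      A hedged sketch of the core-side hook (the helper moved into net/core
      as skb_vlan_untag(); the check mirrors __netif_receive_skb_core()):
      
        /* Strip the 802.1Q/802.1ad header into skb->vlan_tci even when the
         * vlan module is not built in. */
        if (skb->protocol == cpu_to_be16(ETH_P_8021Q) ||
            skb->protocol == cpu_to_be16(ETH_P_8021AD)) {
                skb = skb_vlan_untag(skb);
                if (unlikely(!skb))
                        goto out;
        }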
      
      CC: Patrick McHardy <kaber@trash.net>
      CC: Nithin Nayak Sujir <nsujir@broadcom.com>
      CC: Michael Chan <mchan@broadcom.com>
      CC: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
      Acked-by: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • rtnetlink: fix VF info size · 7f97ac52
      Jiri Benc authored
      [ Upstream commit 945a3676 ]
      
      Commit 1d8faf48 ("net/core: Add VF link state control") added a new
      attribute to the IFLA_VF_INFO group in rtnl_fill_ifinfo but did not
      adjust the size of the allocated memory in if_nlmsg_size/rtnl_vfinfo_size.
      As a result, we may trigger warnings in rtnl_getlink and similar
      functions when many VF links are enabled, as the information does not
      fit into the allocated skb.
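      
      A hedged sketch of the accounting side (the added term mirrors the
      attribute introduced by 1d8faf48; the other terms are abbreviated):
      
        /* Every per-VF attribute emitted by rtnl_fill_ifinfo() must be
         * counted here, or the skb sized from this estimate is too small. */
        size += num_vfs *
                (nla_total_size(sizeof(struct ifla_vf_mac)) +
                 nla_total_size(sizeof(struct ifla_vf_vlan)) +
                 nla_total_size(sizeof(struct ifla_vf_link_state)));  /* was missing */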
      
      Fixes: 1d8faf48 ("net/core: Add VF link state control")
      Reported-by: Yulong Pei <ypei@redhat.com>
      Signed-off-by: Jiri Benc <jbenc@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • netlink: reset network header before passing to taps · c5c2f898
      Daniel Borkmann authored
      [ Upstream commit 4e48ed88 ]
      
      netlink doesn't set any network header offset, so when the skb is
      passed to tap devices via dev_queue_xmit_nit(), it emits klog false
      positives such as:
      
        ...
        [  124.990397] protocol 0000 is buggy, dev nlmon0
        [  124.990411] protocol 0000 is buggy, dev nlmon0
        ...
      
      So just reset the network header before passing to the device; for
      packet sockets that just means nothing will change - mac and net
      offset hold the same value just as before.
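      
      A hedged sketch of the tap-delivery path (as in
      __netlink_deliver_tap_skb(); surrounding lines are illustrative):
      
        nskb->protocol = htons((u16) sk->sk_protocol);
        nskb->pkt_type = netlink_is_kernel(sk) ? PACKET_KERNEL : PACKET_USER;
        skb_reset_network_header(nskb);  /* net offset == mac offset, as for
                                          * packet sockets; silences the splat */
        ret = dev_queue_xmit(nskb);
      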
      Reported-by: Marcel Holtmann <marcel@holtmann.org>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  2. 09 Oct, 2014 26 commits
    • Linux 3.14.21 · 89161fe9
      Greg Kroah-Hartman authored
    • mm: don't pointlessly use BUG_ON() for sanity check · c56af023
      Linus Torvalds authored
      commit 50f5aa8a upstream.
      
      BUG_ON() is a big hammer, and should be used _only_ if there is some
      major corruption that you cannot possibly recover from, making it
      imperative that the current process (and possibly the whole machine) be
      terminated with extreme prejudice.
      
      The trivial sanity check in the vmacache code is *not* such a fatal
      error.  Recovering from it is absolutely trivial, and using BUG_ON()
      just makes it harder to debug for no actual advantage.
      
      To make matters worse, the placement of the BUG_ON() (only if the range
      check matched) actually makes it harder to hit the sanity check to begin
      with, so _if_ there is a bug (and we just got a report from Srivatsa
      Bhat that this can indeed trigger), it is harder to debug not just
      because the machine is possibly dead, but because we don't have better
      coverage.
      
      BUG_ON() must *die*.  Maybe we should add a checkpatch warning for it,
      because it is simply just about the worst thing you can ever do if you
      hit some "this cannot happen" situation.
      Reported-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm: per-thread vma caching · efb5fea2
      Davidlohr Bueso authored
      commit 615d6e87 upstream.
      
      This patch is a continuation of efforts trying to optimize find_vma(),
      avoiding potentially expensive rbtree walks to locate a vma upon faults.
      The original approach (https://lkml.org/lkml/2013/11/1/410), where the
      largest vma was also cached, ended up being too specific and random,
      so further comparison with other approaches was needed.  There are
      two things to consider when dealing with this: the cache hit rate and
      the latency of find_vma().  Improving the hit rate does not necessarily
      translate into finding the vma any faster, as the overhead of any fancy
      caching schemes can be too high to consider.
      
      We currently cache the last used vma for the whole address space, which
      provides a nice optimization, reducing the total cycles in find_vma() by
      up to 250%, for workloads with good locality.  On the other hand, this
      simple scheme is pretty much useless for workloads with poor locality.
      Analyzing ebizzy runs shows that, no matter how many threads are
      running, the mmap_cache hit rate is less than 2%, and in many situations
      below 1%.
      
      The proposed approach is to replace this scheme with a small per-thread
      cache, maximizing hit rates at a very low maintenance cost.
      Invalidations are performed by simply bumping up a 32-bit sequence
      number.  The only expensive operation is in the rare case of a seq
      number overflow, where all caches that share the same address space are
      flushed.  Upon a miss, the proposed replacement policy is based on the
      page number that contains the virtual address in question.  Concretely,
      the following results are seen on an 80 core, 8 socket x86-64 box:
      
      1) System bootup: Most programs are single threaded, so the per-thread
         scheme does improve ~50% hit rate by just adding a few more slots to
         the cache.
      
      +----------------+----------+------------------+
      | caching scheme | hit-rate | cycles (billion) |
      +----------------+----------+------------------+
      | baseline       | 50.61%   | 19.90            |
      | patched        | 73.45%   | 13.58            |
      +----------------+----------+------------------+
      
      2) Kernel build: This one is already pretty good with the current
         approach as we're dealing with good locality.
      
      +----------------+----------+------------------+
      | caching scheme | hit-rate | cycles (billion) |
      +----------------+----------+------------------+
      | baseline       | 75.28%   | 11.03            |
      | patched        | 88.09%   | 9.31             |
      +----------------+----------+------------------+
      
      3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.
      
      +----------------+----------+------------------+
      | caching scheme | hit-rate | cycles (billion) |
      +----------------+----------+------------------+
      | baseline       | 70.66%   | 17.14            |
      | patched        | 91.15%   | 12.57            |
      +----------------+----------+------------------+
      
      4) Ebizzy: There's a fair amount of variation from run to run, but this
         approach always shows nearly perfect hit rates, while baseline is just
         about non-existent.  The amounts of cycles can fluctuate between
         anywhere from ~60 to ~116 for the baseline scheme, but this approach
         reduces it considerably.  For instance, with 80 threads:
      
      +----------------+----------+------------------+
      | caching scheme | hit-rate | cycles (billion) |
      +----------------+----------+------------------+
      | baseline       | 1.06%    | 91.54            |
      | patched        | 99.97%   | 14.18            |
      +----------------+----------+------------------+
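      
      A hedged sketch of the lookup side of the scheme (constants and fields
      follow the mm/vmacache.c of that era; details abbreviated):
      
        #define VMACACHE_BITS        2
        #define VMACACHE_SIZE        (1U << VMACACHE_BITS)
        /* Replacement on update picks a slot by page number: */
        #define VMACACHE_HASH(addr)  ((addr >> PAGE_SHIFT) & (VMACACHE_SIZE - 1))
      
        struct vm_area_struct *vmacache_find(struct mm_struct *mm,
                                             unsigned long addr)
        {
                int i;
      
                /* A bump of the per-mm sequence number invalidates every
                 * thread's cache at once. */
                if (mm->vmacache_seqnum != current->vmacache_seqnum)
                        return NULL;
      
                for (i = 0; i < VMACACHE_SIZE; i++) {
                        struct vm_area_struct *vma = current->vmacache[i];
      
                        if (vma && vma->vm_mm == mm &&
                            vma->vm_start <= addr && vma->vm_end > addr)
                                return vma;
                }
                return NULL;
        }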
      
      [akpm@linux-foundation.org: fix nommu build, per Davidlohr]
      [akpm@linux-foundation.org: document vmacache_valid() logic]
      [akpm@linux-foundation.org: attempt to untangle header files]
      [akpm@linux-foundation.org: add vmacache_find() BUG_ON]
      [hughd@google.com: add vmacache_valid_mm() (from Oleg)]
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: adjust and enhance comments]
      Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: Michel Lespinasse <walken@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Tested-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • vmscan: reclaim_clean_pages_from_list() must use mod_zone_page_state() · 264a8ae7
      Christoph Lameter authored
      commit 83da7510 upstream.
      
      reclaim_clean_pages_from_list() seems to be called with preemption
      enabled, therefore it must use mod_zone_page_state() instead.
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Reported-by: Grygorii Strashko <grygorii.strashko@ti.com>
      Tested-by: Grygorii Strashko <grygorii.strashko@ti.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Santosh Shilimkar <santosh.shilimkar@ti.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm: vmscan: shrink_slab: rename max_pass -> freeable · 8e524793
      Vladimir Davydov authored
      commit d5bc5fd3 upstream.
      
      The name `max_pass' is misleading, because this variable actually keeps
      the estimated number of freeable objects, not the maximal number of
      objects we can scan in this pass, which can be twice that.  Rename it to
      reflect its actual meaning.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm: vmscan: respect NUMA policy mask when shrinking slab on direct reclaim · 12f2f0ba
      Vladimir Davydov authored
      commit 99120b77 upstream.
      
      When direct reclaim is executed by a process bound to a set of NUMA
      nodes, we should scan only those nodes when possible, but currently we
      will scan kmem from all online nodes even if the kmem shrinker is NUMA
      aware.  That said, binding a process to a particular NUMA node won't
      prevent it from shrinking inode/dentry caches from other nodes, which is
      not good.  Fix this.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm/filemap.c: avoid always dirtying mapping->flags on O_DIRECT · 6591981a
      Jens Axboe authored
      commit 7fcbbaf1 upstream.
      
      In some testing I ran today (some fio jobs that spread over two nodes),
      we end up spending 40% of the time in filemap_check_errors().  That
      smells fishy.  Looking further, this is basically what happens:
      
      blkdev_aio_read()
          generic_file_aio_read()
              filemap_write_and_wait_range()
                  if (!mapping->nr_pages)
                      filemap_check_errors()
      
      and filemap_check_errors() always attempts two test_and_clear_bit() on
      the mapping flags, thus dirtying it for every single invocation.  The
      patch below tests each of these bits before clearing them, avoiding this
      issue.  In my test case (4-socket box), performance went from 1.7M IOPS
      to 4.0M IOPS.
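      
      A hedged sketch of the described change to filemap_check_errors():
      
        static int filemap_check_errors(struct address_space *mapping)
        {
                int ret = 0;
      
                /* Check the bit first: only do the cacheline-dirtying atomic
                 * RMW when an error was actually recorded. */
                if (test_bit(AS_ENOSPC, &mapping->flags) &&
                    test_and_clear_bit(AS_ENOSPC, &mapping->flags))
                        ret = -ENOSPC;
                if (test_bit(AS_EIO, &mapping->flags) &&
                    test_and_clear_bit(AS_EIO, &mapping->flags))
                        ret = -EIO;
                return ret;
        }
      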
      Signed-off-by: Jens Axboe <axboe@fb.com>
      Acked-by: Jeff Moyer <jmoyer@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm: optimize put_mems_allowed() usage · 29c2a881
      Mel Gorman authored
      commit d26914d1 upstream.
      
      Since put_mems_allowed() is strictly optional (it's a seqcount retry),
      we don't need to evaluate the function if the allocation was in fact
      successful, saving an smp_rmb and some loads and comparisons on some
      relatively fast paths.
      
      Since the naming of get/put_mems_allowed() does suggest a mandatory
      pairing, rename the interface, as suggested by Mel, to resemble the
      seqcount interface.
      
      This gives us: read_mems_allowed_begin() and read_mems_allowed_retry(),
      where it is important to note that the return value of the latter call
      is inverted from its previous incarnation.
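      
      A hedged usage sketch of the renamed interface (the allocation call is
      an arbitrary example):
      
        unsigned int cpuset_mems_cookie;
        struct page *page;
      
        do {
                cpuset_mems_cookie = read_mems_allowed_begin();
                page = alloc_pages(gfp, order);
                /* Retry only if the allocation failed AND the allowed mask
                 * changed meanwhile; note the inverted return value compared
                 * with the old put_mems_allowed(). */
        } while (!page && read_mems_allowed_retry(cpuset_mems_cookie));
      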
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm/readahead.c: fix readahead failure for memoryless NUMA nodes and limit readahead pages · d4995db1
      Raghavendra K T authored
      commit 6d2be915 upstream.
      
      Currently max_sane_readahead() returns zero on a cpu whose NUMA node
      has no local memory, which leads to readahead failure.  Fix this
      readahead failure by returning the minimum of (requested pages, 512).
      Users running applications that need readahead, such as streaming
      applications, on a memoryless cpu see a considerable boost in
      performance.
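      
      A hedged sketch of the resulting helper (the upstream constant is
      written PAGE_SIZE-independently; shown here via PAGE_CACHE_SIZE):
      
        #define MAX_READAHEAD  ((512*4096)/PAGE_CACHE_SIZE)  /* 2MB worth */
      
        /*
         * Previously this returned half of the node-local free pages,
         * i.e. 0 on a memoryless node, so readahead never happened there.
         */
        unsigned long max_sane_readahead(unsigned long nr)
        {
                return min_t(unsigned long, nr, MAX_READAHEAD);
        }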
      
      Result:
      
      fadvise experiment with FADV_WILLNEED on a PPC machine having memoryless
      CPU with 1GB testfile (12 iterations) yielded around 46.66% improvement.
      
      fadvise experiment with FADV_WILLNEED on an x240 machine with 1GB
      testfile (32GB * 4G RAM NUMA machine, 12 iterations) showed no impact
      on the normal NUMA cases w/ the patch.
      
        Kernel       Avg  Stddev
        base      7.4975   3.92%
        patched   7.4174   3.26%
      
      [Andrew: making return value PAGE_SIZE independent]
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Acked-by: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm, compaction: ignore pageblock skip when manually invoking compaction · 67c58ce6
      David Rientjes authored
      commit 91ca9186 upstream.
      
      The cached pageblock hint should be ignored when triggering compaction
      through /proc/sys/vm/compact_memory so all eligible memory is isolated.
      Manually invoking compaction is known to be expensive, so there's no need
      to skip pageblocks based on heuristics (mainly for debugging).
      Signed-off-by: David Rientjes <rientjes@google.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm, compaction: determine isolation mode only once · e8cd5b56
      David Rientjes authored
      commit da1c67a7 upstream.
      
      The conditions that control the isolation mode in
      isolate_migratepages_range() do not change during the iteration, so
      extract them out and only define the value once.
      
      This actually does have an effect: gcc doesn't optimize it by itself
      because of cc->sync.
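      
      A hedged sketch of the hoist (flag names from the compaction code of
      that era; the exact condition is abbreviated):
      
        /* Invariant across the whole pfn scan, so compute it once up front
         * instead of re-deriving it from cc->sync on every page. */
        const isolate_mode_t mode = (!cc->sync ? ISOLATE_ASYNC_MIGRATE : 0) |
                                    (unevictable ? ISOLATE_UNEVICTABLE : 0);
      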
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm/compaction: clean-up code on success of balloon isolation · ca82ea2e
      Joonsoo Kim authored
      commit b6c75016 upstream.
      
      It is just for clean-up to reduce code size and improve readability.
      There is no functional change.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm/compaction: check pageblock suitability once per pageblock · 6128cc05
      Joonsoo Kim authored
      commit c122b208 upstream.
      
      isolation_suitable() and migrate_async_suitable() are used to make sure
      that a pageblock range is fine to be migrated.  They don't need to be
      called on every page.  The current code does well when the pageblock is
      not suitable, but not when it is:
      
      1) It re-checks isolation_suitable() on each page of a pageblock that was
   already established as suitable.
      2) It re-checks migrate_async_suitable() on each page of a pageblock that
         was not entered through the next_pageblock: label, because
         last_pageblock_nr is not otherwise updated.
      
      This patch fixes the situation by 1) calling isolation_suitable() only once
      per pageblock and 2) always updating last_pageblock_nr to the pageblock
      that was just checked.
      
      Additionally, move the PageBuddy() check after the pageblock unit check,
      since the pageblock check is the first thing we should do and it makes
      things simpler.
      
      [vbabka@suse.cz: rephrase commit description]
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm/compaction: change the timing to check to drop the spinlock · 4fff5ca7
      Joonsoo Kim authored
      commit be1aa03b upstream.
      
      It is odd to drop the spinlock when we scan the (SWAP_CLUSTER_MAX - 1)th
      pfn page.  This may result in the situation below while isolating
      migratepages.
      
      1. Try to isolate 0x0 ~ 0x200 pfn pages.
      2. When low_pfn is 0x1ff, ((low_pfn+1) % SWAP_CLUSTER_MAX) == 0, so drop
         the spinlock.
      3. Then, to complete isolating, retry to acquire the lock.
      
      I think that it is better to use the SWAP_CLUSTER_MAXth pfn when checking
      the criteria for dropping the lock.  This does no harm to the 0x0 pfn,
      because at that point the locked variable would be false.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • drbd: fix regression 'out of mem, failed to invoke fence-peer helper' · cc85d67f
      Lars Ellenberg authored
      commit bbc1c5e8 upstream.
      
      Since linux kernel 3.13, kthread_run() internally uses
      wait_for_completion_killable().  We sometimes may use kthread_run()
      while we still have a signal pending, which we used to kick our threads
      out of potentially blocking network functions, causing kthread_run() to
      mistake that as a new fatal signal and fail.
      
      Fix: flush_signals() before kthread_run().
      Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm/compaction: do not call suitable_migration_target() on every page · e292d9ad
      Joonsoo Kim authored
      commit 01ead534 upstream.
      
      suitable_migration_target() checks whether a pageblock is a suitable
      migration target.  In isolate_freepages_block() it is called on every
      page, which is inefficient.  So make it called once per pageblock.
      
      suitable_migration_target() also checks whether the page is high-order
      or not, but its criterion for high-order is pageblock order.  So calling
      it once within a pageblock range is no problem.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm/compaction: disallow high-order page for migration target · 96b3fde4
      Joonsoo Kim authored
      commit 7d348b9e upstream.
      
      The purpose of compaction is to get a high-order page.  Currently, if we
      find a high-order page while searching for a migration target page, we
      break it into order-0 pages and use them as migration targets.  This is
      contrary to the purpose of compaction, so disallow high-order pages from
      being used as migration targets.
      
      Additionally, clean up the logic in suitable_migration_target() to
      simplify the code.  There are no functional changes from this clean-up.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm, compaction: avoid isolating pinned pages · 1190e5c6
      David Rientjes authored
      commit 119d6d59 upstream.
      
      Page migration will fail for memory that is pinned in memory with, for
      example, get_user_pages().  In this case, it is unnecessary to take
      zone->lru_lock or isolate the page and pass it to page migration, which
      will ultimately fail.
      
      This is a racy check, the page can still change from under us, but in
      that case we'll just fail later when attempting to move the page.
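      
      A hedged sketch of that racy skip as applied in the migrate scanner:
      
        /*
         * An anonymous page with more references than mappings is likely
         * pinned (e.g. by get_user_pages), so don't bother isolating it.
         */
        if (!page_mapping(page) &&
            page_count(page) > page_mapcount(page))
                continue;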
      
      This avoids very expensive memory compaction when faulting transparent
      hugepages after pinning a lot of memory with a Mellanox driver.
      
      On a 128GB machine and pinning ~120GB of memory, before this patch we
      see the enormous disparity in the number of page migration failures
      because of the pinning (from /proc/vmstat):
      
      	compact_pages_moved 8450
      	compact_pagemigrate_failed 15614415
      
      0.05% of pages isolated are successfully migrated and explicitly
      triggering memory compaction takes 102 seconds.  After the patch:
      
      	compact_pages_moved 9197
      	compact_pagemigrate_failed 7
      
      99.9% of pages isolated are now successfully migrated in this
      configuration and memory compaction takes less than one second.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • swap: change swap_list_head to plist, add swap_avail_head · d540b168
      Dan Streetman authored
      commit 18ab4d4c upstream.
      
      Originally get_swap_page() started iterating through the singly-linked
      list of swap_info_structs using swap_list.next or highest_priority_index,
      which both were intended to point to the highest priority active swap
      target that was not full.  The first patch in this series changed the
      singly-linked list to a doubly-linked list, and removed the logic to start
      at the highest priority non-full entry; it starts scanning at the highest
      priority entry each time, even if the entry is full.
      
      Replace the manually ordered swap_list_head with a plist, swap_active_head.
      Add a new plist, swap_avail_head.  The original swap_active_head plist
      contains all active swap_info_structs, as before, while the new
      swap_avail_head plist contains only swap_info_structs that are active and
      available, i.e. not full.  Add a new spinlock, swap_avail_lock, to protect
      the swap_avail_head list.
      
      Mel Gorman suggested using plists since they internally handle ordering
      the list entries based on priority, which is exactly what swap was doing
      manually.  All the ordering code is now removed, and swap_info_struct
      entries are simply added to their corresponding plist and automatically
      ordered correctly.
      
      Using a new plist for available swap_info_structs simplifies and
      optimizes get_swap_page(), which no longer has to iterate over full
      swap_info_structs.  Using a new spinlock for swap_avail_head plist
      allows each swap_info_struct to add or remove themselves from the
      plist when they become full or not-full; previously they could not
      do so because the swap_info_struct->lock is held when they change
      from full<->not-full, and the swap_lock protecting the main
      swap_active_head must be ordered before any swap_info_struct->lock.
      Signed-off-by: Dan Streetman <ddstreet@ieee.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Shaohua Li <shli@fusionio.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
      Cc: Weijie Yang <weijieut@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • lib/plist: add plist_requeue · ae604916
      Dan Streetman authored
      commit a75f232c upstream.
      
      Add plist_requeue(), which moves the specified plist_node after all other
      same-priority plist_nodes in the list.  This is essentially an optimized
      plist_del() followed by plist_add().
      
      This is needed by swap, which (with the next patch in this set) uses a
      plist of available swap devices.  When a swap device (either a swap
      partition or a swap file) is added to the system with swapon(), the device
      is added to a plist, ordered by the swap device's priority.  When swap
      needs to allocate a page from one of the swap devices, it takes the page
      from the first swap device on the plist, which is the highest priority
      swap device.  The swap device is left in the plist until all its pages are
      used, and then removed from the plist when it becomes full.
      
      However, as described in man 2 swapon, swap must allocate pages from swap
      devices with the same priority in round-robin order; to do this, on each
      swap page allocation, swap uses a page from the first swap device in the
      plist, and then calls plist_requeue() to move that swap device entry to
      after any other same-priority swap devices.  The next swap page allocation
      will again use a page from the first swap device in the plist and requeue
      it, and so on, resulting in round-robin usage of equal-priority swap
      devices.
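      
      A hedged usage sketch of the round-robin step (the avail_list field
      name anticipates the swap patch later in this series and is
      illustrative):
      
        /* Take a page from the highest-priority device, then rotate that
         * device behind any others of the same priority. */
        si = plist_first_entry(&swap_avail_head, struct swap_info_struct,
                               avail_list);
        offset = scan_swap_map(si, SWAP_HAS_CACHE);
        plist_requeue(&si->avail_list, &swap_avail_head);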
      
      Also add plist_test_requeue() test function, for use by plist_test() to
      test plist_requeue() function.
      Signed-off-by: Dan Streetman <ddstreet@ieee.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Shaohua Li <shli@fusionio.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
      Cc: Weijie Yang <weijieut@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Bob Liu <bob.liu@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • lib/plist: add helper functions · fd6d61cc
      Dan Streetman authored
      commit fd16618e upstream.
      
      Add PLIST_HEAD() to plist.h, equivalent to LIST_HEAD() from list.h, to
      define and initialize a struct plist_head.
      
      Add plist_for_each_continue() and plist_for_each_entry_continue(),
      equivalent to list_for_each_continue() and list_for_each_entry_continue(),
      to iterate over a plist continuing after the current position.
      
      Add plist_prev() and plist_next(), equivalent to (struct list_head*)->prev
      and ->next, implemented by list_prev_entry() and list_next_entry(), to
      access the prev/next struct plist_node entry.  These are needed because
      unlike struct list_head, direct access of the prev/next struct plist_node
      isn't possible; the list must be navigated via the contained struct
      list_head.  e.g.  instead of accessing the prev by list_prev_entry(node,
      node_list) it can be accessed by plist_prev(node).
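      
      A hedged sketch of the two accessors, implemented on the embedded
      node_list exactly as described:
      
        /* plist_node has no direct prev/next; navigate through the
         * contained struct list_head instead. */
        #define plist_next(pos)  list_next_entry(pos, node_list)
        #define plist_prev(pos)  list_prev_entry(pos, node_list)
      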
      Signed-off-by: Dan Streetman <ddstreet@ieee.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Shaohua Li <shli@fusionio.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
      Cc: Weijie Yang <weijieut@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      fd6d61cc
    • swap: change swap_info singly-linked list to list_head · bcbfe6fd
      Dan Streetman authored
      commit adfab836 upstream.
      
      The logic controlling the singly-linked list of swap_info_struct entries
      for all active, i.e.  swapon'ed, swap targets is rather complex, because:
      
       - it stores the entries in priority order
       - there is a pointer to the highest priority entry
       - there is a pointer to the highest priority not-full entry
       - there is a highest_priority_index variable set outside the swap_lock
       - swap entries of equal priority should be used equally
      
      This complexity leads to bugs such as https://lkml.org/lkml/2014/2/13/181,
      where different priority swap targets are incorrectly used equally.
      
      That bug could probably be solved with the existing singly-linked
      lists, but I think it would only add more complexity to the already
      difficult-to-understand get_swap_page() swap_list iteration logic.
      
      The first patch changes from a singly-linked list to a doubly-linked list
      using list_heads; the highest_priority_index and related code are removed
      and get_swap_page() starts each iteration at the highest priority
      swap_info entry, even if it's full.  While this does introduce unnecessary
      list iteration (i.e.  Schlemiel the painter's algorithm) in the case where
      one or more of the highest priority entries are full, the iteration and
      manipulation code is much simpler and behaves correctly re: the above bug;
      and the fourth patch removes the unnecessary iteration.
      
      The second patch adds some minor plist helper functions; nothing new
      really, just functions to match existing regular list functions.  These
      are used by the next two patches.
      
      The third patch adds plist_requeue(), which is used by get_swap_page() in
      the next patch - it performs the requeueing of same-priority entries
      (which moves the entry to the end of its priority in the plist), so that
      all equal-priority swap_info_structs get used equally.
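
      A condensed sketch of what plist_requeue() does (simplified from the
      upstream implementation, with debug checks omitted):

      	void plist_requeue(struct plist_node *node, struct plist_head *head)
      	{
      		struct plist_node *iter;
      		struct list_head *node_next = &head->node_list;

      		/* Nothing to do unless a same-priority entry follows. */
      		if (node == plist_last(head))
      			return;
      		iter = plist_next(node);
      		if (node->prio != iter->prio)
      			return;

      		plist_del(node, head);

      		/* Re-insert before the first following entry of lower
      		 * priority, or at the tail if there is none. */
      		plist_for_each_continue(iter, head) {
      			if (node->prio != iter->prio) {
      				node_next = &iter->node_list;
      				break;
      			}
      		}
      		list_add_tail(&node->node_list, node_next);
      	}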
      
      The fourth patch converts the main list into a plist, and adds a new plist
      that contains only swap_info entries that are both active and not full.
      As Mel suggested, using plists allows removing all the ordering code
      from swap; plists handle ordering automatically.  The list naming is also
      clarified now that there are two lists, with the original list changed
      from swap_list_head to swap_active_head and the new list named
      swap_avail_head.  A new spinlock is also added for the new list, so
      swap_info entries can be added or removed from the new list immediately as
      they become full or not full.
      
      This patch (of 4):
      
      Replace the singly-linked list tracking active, i.e.  swapon'ed,
      swap_info_struct entries with a doubly-linked list using struct
      list_heads.  Simplify the logic iterating and manipulating the list of
      entries, especially get_swap_page(), by using standard list_head
      functions, and removing the highest priority iteration logic.
      
      The change fixes the bug:
      https://lkml.org/lkml/2014/2/13/181
      in which different priority swap entries after the highest priority entry
      are incorrectly used equally in pairs.  The swap behavior is now as
      advertised, i.e. different priority swap entries are used in order, and
      equal priority swap targets are used concurrently.
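
      A sketch of the simplified allocation loop this enables (locking
      elided; the 'list' member name is an assumption here, and
      scan_swap_map() is the allocator in mm/swapfile.c):

      	swp_entry_t get_swap_page_sketch(void)
      	{
      		struct swap_info_struct *si;
      		unsigned long offset;

      		list_for_each_entry(si, &swap_list_head, list) {
      			/* The list stays sorted by priority, so the first
      			 * writable entry is the best candidate. */
      			if (!(si->flags & SWP_WRITEOK))
      				continue;
      			offset = scan_swap_map(si, SWAP_HAS_CACHE);
      			if (offset)
      				return swp_entry(si->type, offset);
      		}
      		return (swp_entry_t) { 0 };
      	}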
      Signed-off-by: Dan Streetman <ddstreet@ieee.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Shaohua Li <shli@fusionio.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
      Cc: Weijie Yang <weijieut@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      bcbfe6fd
    • mm: exclude memoryless nodes from zone_reclaim · a4c51bde
      Michal Hocko authored
      commit 70ef57e6 upstream.
      
      We had a report about strange OOM killer strikes on a PPC machine
      although there was a lot of swap free and tons of anonymous memory
      which could have been swapped out.  In the end it turned out that the
      OOM was a side effect of zone reclaim, which wasn't unmapping and
      swapping out, and so the system was pushed to OOM.  Although this
      sounds like a bug somewhere in the kswapd vs. zone reclaim vs. direct
      reclaim interaction, numactl on the hardware in question suggests that
      zone reclaim should not have been enabled in the first place:
      
        node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
        node 0 size: 0 MB
        node 0 free: 0 MB
        node 2 cpus:
        node 2 size: 7168 MB
        node 2 free: 6019 MB
        node distances:
        node   0   2
        0:  10  40
        2:  40  10
      
      So all the CPUs are associated with Node0 which doesn't have any memory
      while Node2 contains all the available memory.  Node distances cause an
      automatic zone_reclaim_mode enabling.
      
      Zone reclaim is intended to keep allocations local, but this doesn't
      make any sense on memoryless nodes.  So let's exclude such nodes in
      init_zone_allows_reclaim(), which evaluates the zone reclaim behavior
      and sets up the suitable reclaim_nodes.
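
      The shape of the fix, as a sketch (the exact upstream hunk in
      mm/page_alloc.c may differ):

      	/* In init_zone_allows_reclaim(): iterate only nodes that actually
      	 * have memory, so a distant memoryless node can no longer flip
      	 * zone_reclaim_mode on. */
      	for_each_node_state(i, N_MEMORY)	/* was: for_each_online_node(i) */
      		if (node_distance(nid, i) <= RECLAIM_DISTANCE)
      			node_set(i, NODE_DATA(nid)->reclaim_nodes);
      		else
      			zone_reclaim_mode = 1;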
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Tested-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      a4c51bde
    • jiffies: Fix timeval conversion to jiffies · 5c0f0c01
      Andrew Hunter authored
      commit d78c9300 upstream.
      
      timeval_to_jiffies tried to round a timeval up to an integral number
      of jiffies, but the logic for doing so was incorrect: intervals
      corresponding to exactly N jiffies would become N+1.  This manifested
      itself particularly when repeatedly stopping/starting an itimer:
      
      setitimer(ITIMER_PROF, &val, NULL);
      setitimer(ITIMER_PROF, NULL, &val);
      
      would add a full tick to val, _even if it was exactly representable in
      terms of jiffies_ (say, the result of a previous rounding.)  Doing
      this repeatedly would cause unbounded growth in val.  So fix the math.
      
      Here's what was wrong with the conversion: we essentially computed
      (eliding seconds)
      
      jiffies = usec * (NSEC_PER_USEC/TICK_NSEC)
      
      by using scaling arithmetic, which took the best approximation of
      NSEC_PER_USEC/TICK_NSEC with denominator 2^USEC_JIFFIE_SC, i.e. the x
      such that NSEC_PER_USEC/TICK_NSEC ~= x/(2^USEC_JIFFIE_SC), and computed:
      
      jiffies = (usec * x) >> USEC_JIFFIE_SC
      
      and rounded this calculation up in the intermediate form (since we
      can't necessarily exactly represent TICK_NSEC in usec.) But the
      scaling arithmetic is a (very slight) *over*approximation of the true
      value; that is, instead of dividing by (1 usec/ 1 jiffie), we
      effectively divided by (1 usec/1 jiffie)-epsilon (rounding
      down). This would normally be fine, but we want to round timeouts up,
      and we did so by adding 2^USEC_JIFFIE_SC - 1 before the shift; this
      would be fine if our division was exact, but dividing this by the
      slightly smaller factor was equivalent to adding just _over_ 1 to the
      final result (instead of just _under_ 1, as desired.)
      
      In particular, with HZ=1000, we consistently computed that 10000 usec
      was 11 jiffies; the same was true for any exact multiple of
      TICK_NSEC.
      
      We could possibly still round in the intermediate form, adding
      something less than 2^USEC_JIFFIE_SC - 1, but easier still is to
      convert usec->nsec, round in nanoseconds, and then convert using
      time*spec*_to_jiffies.  This adds one constant multiplication, and is
      not observably slower in microbenchmarks on recent x86 hardware.
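
      A minimal sketch of that approach, eliding seconds as above (an
      illustrative helper, not the kernel's actual code; div_u64() is from
      linux/math64.h):

      	static inline unsigned long usec_to_jiffies_sketch(unsigned long usec)
      	{
      		u64 nsec = (u64)usec * NSEC_PER_USEC;	/* exact conversion */

      		/* Round up in nanoseconds, where a whole tick (TICK_NSEC)
      		 * is representable exactly, so an interval of exactly N
      		 * jiffies stays N jiffies. */
      		return div_u64(nsec + TICK_NSEC - 1, TICK_NSEC);
      	}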
      
      Tested: the following program:
      
      #include <stddef.h>       /* size_t */
      #include <stdio.h>        /* printf */
      #include <sys/time.h>     /* setitimer, struct itimerval */
      
      int main() {
        struct itimerval zero = {{0, 0}, {0, 0}};
        /* Initially set to 10 ms. */
        struct itimerval initial = zero;
        initial.it_interval.tv_usec = 10000;
        setitimer(ITIMER_PROF, &initial, NULL);
        /* Save and restore several times. */
        for (size_t i = 0; i < 10; ++i) {
          struct itimerval prev;
          setitimer(ITIMER_PROF, &zero, &prev);
          /* on old kernels, this goes up by TICK_USEC every iteration */
          printf("previous value: %ld %ld %ld %ld\n",
                 prev.it_interval.tv_sec, prev.it_interval.tv_usec,
                 prev.it_value.tv_sec, prev.it_value.tv_usec);
          setitimer(ITIMER_PROF, &prev, NULL);
        }
        return 0;
      }
      
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Reviewed-by: Paul Turner <pjt@google.com>
      Reported-by: Aaron Jacobs <jacobsa@google.com>
      Signed-off-by: Andrew Hunter <ahh@google.com>
      [jstultz: Tweaked to apply to 3.17-rc]
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      [bwh: Backported to 3.16: adjust filename]
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      5c0f0c01
    • media: vb2: fix VBI/poll regression · cbb87efb
      Hans Verkuil authored
      commit 58d75f4b upstream.
      
      The recent conversion of saa7134 to vb2 uncovered a poll() bug that
      broke the teletext applications alevt and mtt. These applications
      expect that calling poll() without having called VIDIOC_STREAMON will
      cause poll() to return POLLERR. That did not happen in vb2.
      
      This patch fixes that behavior. It also fixes what happens when
      poll() is called after STREAMON but before any buffers have been
      queued. In that case poll() will also return POLLERR, but only for
      capture queues, since output queues will always return POLLOUT
      anyway in that situation.
      
      This brings the vb2 behavior in line with the old videobuf behavior.
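
      Roughly, as a guard at the top of a vb2-style poll handler (a sketch
      of the rules above, not the exact upstream checks):

      	/* STREAMON never called: teletext apps expect POLLERR here. */
      	if (!vb2_is_streaming(q))
      		return POLLERR;
      	/* Streaming but nothing ever queued: an error for capture
      	 * queues only; output queues report POLLOUT regardless. */
      	if (!V4L2_TYPE_IS_OUTPUT(q->type) && list_empty(&q->queued_list))
      		return POLLERR;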
      Signed-off-by: Hans Verkuil <hans.verkuil@cisco.com>
      Acked-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
      Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      cbb87efb
    • mm: numa: Do not mark PTEs pte_numa when splitting huge pages · 5f50c44d
      Mel Gorman authored
      commit abc40bd2 upstream.
      
      This patch reverts 1ba6e0b5 ("mm: numa: split_huge_page: transfer the
      NUMA type from the pmd to the pte"). If a huge page is being split due
      to a protection change and the tail will be in a PROT_NONE vma, then
      NUMA hinting PTEs are temporarily created in the protected VMA.
      
       VM_RW|VM_PROTNONE
      |-----------------|
            ^
            split here
      
      In the specific case above, it should get fixed up by change_pte_range(),
      but there is a window of opportunity for weirdness to happen. Similarly,
      if a huge page is shrunk and split during a protection update but before
      pmd_numa is cleared, then a pte_numa can be left behind.
      
      Instead of adding complexity to deal with the case, this patch simply
      does not mark PTEs NUMA when splitting a huge page.  NUMA hinting
      faults will not be triggered, which is a marginal cost in comparison
      to the complexity of dealing with the corner cases during THP split.
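
      As a sketch, in the tail-page remapping loop of the split path (based
      on the description above, not the exact mm/huge_memory.c hunk):

      	entry = mk_pte(page + i, vma->vm_page_prot);
      	entry = maybe_mkwrite(pte_mkdirty(entry), vma);
      	/*
      	 * The reverted logic did, roughly:
      	 *	if (pmd_numa(*pmd))
      	 *		entry = pte_mknuma(entry);
      	 * That transfer is now dropped, so freshly split PTEs are never
      	 * marked NUMA and no hinting faults are triggered for them.
      	 */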
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      5f50c44d