1. 30 Aug, 2012 9 commits
    • netfilter: ipv6: add IPv6 NAT support · 58a317f1
      Patrick McHardy authored
      Signed-off-by: Patrick McHardy <kaber@trash.net>
    • net: core: add function for incremental IPv6 pseudo header checksum updates · 2cf545e8
      Patrick McHardy authored
      Add inet_proto_csum_replace16 for incrementally updating IPv6 pseudo header
      checksums for IPv6 NAT.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Acked-by: David S. Miller <davem@davemloft.net>
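The incremental update described in this commit can be illustrated in user space. The sketch below is not the kernel's inet_proto_csum_replace16(); it applies the RFC 1624 identity HC' = ~(~HC + ~m + m') to the eight 16-bit words of an IPv6 address. The names csum_full() and csum_update16() are hypothetical and exist only for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit accumulator into a 16-bit one's-complement sum. */
static uint16_t csum_fold32(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* Reference: checksum computed from scratch over n 16-bit words. */
uint16_t csum_full(const uint16_t *words, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += words[i];
    return (uint16_t)~csum_fold32(sum);
}

/*
 * Incremental update when a 128-bit address (eight 16-bit words)
 * changes from 'from' to 'to': HC' = ~(~HC + ~m + m').
 */
uint16_t csum_update16(uint16_t check, const uint16_t from[8],
                       const uint16_t to[8])
{
    uint32_t sum = (uint16_t)~check;
    for (int i = 0; i < 8; i++) {
        sum += (uint16_t)~from[i];   /* subtract the old word */
        sum += to[i];                /* add the new word */
    }
    return (uint16_t)~csum_fold32(sum);
}
```

The point of the incremental form is that a NAT rewrite only touches the address words, so the payload never has to be re-read to keep the checksum valid.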
    • netfilter: ipv6: expand skb head in ip6_route_me_harder after oif change · 0ad352cb
      Patrick McHardy authored
      Expand the skb headroom if the oif changed due to rerouting similar to
      how IPv4 packets are handled.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
    • netfilter: add protocol independent NAT core · c7232c99
      Patrick McHardy authored
      Convert the IPv4 NAT implementation to a protocol independent core and
      address family specific modules.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
    • netfilter: nf_nat: add protoff argument to packet mangling functions · 051966c0
      Patrick McHardy authored
      For mangling IPv6 packets the protocol header offset needs to be known
      by the NAT packet mangling functions. Add a so far unused protoff argument
      and convert the conntrack and NAT helpers to use it in preparation of
      IPv6 NAT.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
    • netfilter: nf_conntrack: restrict NAT helper invocation to IPv4 · 811927cc
      Patrick McHardy authored
      The NAT helpers currently only handle IPv4 packets correctly. Restrict
      invocation of the helpers to IPv4 in preparation of IPv6 NAT.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
    • netfilter: nf_conntrack_ipv6: fix tracking of ICMPv6 error messages containing fragments · 2b60af01
      Patrick McHardy authored
      ICMPv6 error messages are tracked by extracting the conntrack tuple of
      the inner packet and looking up the corresponding conntrack entry. Tuple
      extraction uses the ->get_l4proto() callback, which in case of fragments
      returns NEXTHDR_FRAGMENT instead of the upper protocol, even for the
      first fragment when the entire next header is present, resulting in a
      failure to find the correct connection tracking entry.
      
      This patch changes ipv6_get_l4proto() to use ipv6_skip_exthdr() instead
      of nf_ct_ipv6_skip_exthdr() in order to skip fragment headers when the
      fragment offset is zero.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
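The behaviour this commit relies on, skipping a fragment header only when its fragment offset is zero, can be sketched in user space as follows. This is a simplified walk, not the kernel's ipv6_skip_exthdr(): the function name skip_exthdrs() and the NH_* constants are defined locally for the example, and the "no next header" value is not treated specially here.

```c
#include <stddef.h>
#include <stdint.h>

/* IPv6 Next Header values for the extension headers handled here. */
enum { NH_HOP = 0, NH_ROUTING = 43, NH_FRAGMENT = 44, NH_AUTH = 51, NH_DEST = 60 };

static int is_ext_hdr(uint8_t nh)
{
    return nh == NH_HOP || nh == NH_ROUTING || nh == NH_FRAGMENT ||
           nh == NH_AUTH || nh == NH_DEST;
}

/*
 * Walk extension headers starting at 'off'. On return, *nexthdr holds
 * the protocol found at the returned offset. Returns -1 if a header is
 * truncated. The caller must still check the returned offset against len.
 */
int skip_exthdrs(const uint8_t *buf, size_t len, size_t off, uint8_t *nexthdr)
{
    uint8_t nh = *nexthdr;

    while (is_ext_hdr(nh)) {
        size_t hdrlen;

        if (off + 8 > len)
            return -1;
        if (nh == NH_FRAGMENT) {
            /* Upper 13 bits of this field are the fragment offset. */
            uint16_t frag_off = (uint16_t)((buf[off + 2] << 8) | buf[off + 3]);
            if (frag_off & 0xfff8)
                break;          /* not the first fragment: stop here */
            hdrlen = 8;         /* fragment header has a fixed size */
        } else if (nh == NH_AUTH) {
            hdrlen = ((size_t)buf[off + 1] + 2) * 4;
        } else {
            hdrlen = ((size_t)buf[off + 1] + 1) * 8;
        }
        nh = buf[off];          /* first byte of each header: next header */
        off += hdrlen;
    }
    *nexthdr = nh;
    return (int)off;
}
```

For a first fragment the walk reports the upper-layer protocol, which is what the tuple extraction for ICMPv6 error tracking needs; for later fragments it still reports the fragment header.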
    • netfilter: nf_conntrack_ipv6: improve fragmentation handling · 4cdd3408
      Patrick McHardy authored
      The IPv6 conntrack fragmentation currently has a couple of shortcomings.
      Fragments are collected in PREROUTING/OUTPUT and defragmented, the
      defragmented packet is then passed to conntrack, the resulting conntrack
      information is attached to each original fragment and the fragments then
      continue their way through the stack.
      
      Helper invocation occurs in the POSTROUTING hook, at which point only
      the original fragments are available. The result of this is that
      fragmented packets are never passed to helpers.
      
      This patch improves the situation in the following way:
      
      - If a reassembled packet belongs to a connection that has a helper
        assigned, the reassembled packet is passed through the stack instead
        of the original fragments.
      
      - During defragmentation, the largest received fragment size is stored.
        On output, the packet is refragmented if required. If the largest
        received fragment size exceeds the outgoing MTU, a "packet too big"
        message is generated, thus behaving as if the original fragments
        were passed through the stack from an outside point of view.
      
      - The ipv6_helper() hook function can't receive fragments anymore for
        connections using a helper, so it is switched to use ipv6_skip_exthdr()
        instead of the netfilter specific nf_ct_ipv6_skip_exthdr() and the
        reassembled packets are passed to connection tracking helpers.
      
      The result of this is that we can properly track fragmented packets, but
      still generate ICMPv6 Packet too big messages if we would have before.
      
      This patch is also required as a precondition for IPv6 NAT, where NAT
      helpers might enlarge packets up to a point that they require
      fragmentation. In that case we can't generate Packet too big messages
      since the proper MTU can't be calculated in all cases (e.g. when
      changing the textual representation of a variable number of addresses),
      so the packet is transparently fragmented iff the original packet or
      fragments would have fit the outgoing MTU.
      
      IPVS parts by Jesper Dangaard Brouer <brouer@redhat.com>.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
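The output-side rule for reassembled packets described in this commit can be summarised as a small decision function. This is only a sketch of the stated policy, assuming the packet was reassembled and its largest received fragment size was recorded; the enum and the function output_decision() are hypothetical names, not kernel symbols.

```c
#include <stddef.h>

enum out_action { OUT_SEND, OUT_REFRAGMENT, OUT_PKT_TOO_BIG };

/*
 * pkt_len:       length of the reassembled (possibly enlarged) packet
 * frag_max_size: largest fragment size seen during defragmentation
 * mtu:           MTU of the outgoing path
 */
enum out_action output_decision(size_t pkt_len, size_t frag_max_size,
                                size_t mtu)
{
    /* The original fragments would not have fit: report it to the sender. */
    if (frag_max_size > mtu)
        return OUT_PKT_TOO_BIG;
    /* The fragments would have fit, but the reassembled packet does not. */
    if (pkt_len > mtu)
        return OUT_REFRAGMENT;
    return OUT_SEND;
}
```

From the outside this behaves as if the original fragments had travelled through the stack untouched, which is the property the commit message asks for.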
    • ipvs: IPv6 MTU checking cleanup and bugfix · 590e3f79
      Jesper Dangaard Brouer authored
      Clean up the IPv6 MTU checking in the IPVS xmit code by using
      a common helper function, __mtu_check_toobig_v6().

      The MTU check for tunnel mode can also use this helper, as
      ntohs(old_iph->payload_len) + sizeof(struct ipv6hdr) is equal to
      skb->len, and the 'mtu' variable has been adjusted before
      calling the helper.

      Note that this also fixes a bug: the MTU check in ip_vs_dr_xmit_v6()
      was missing a check for skb_is_gso().

      This bug caused issues for, e.g., KVM IPVS setups, where different
      segmentation offloading techniques are used between guests
      via the virtio driver.  This resulted in very bad performance,
      because the ICMPv6 "too big" messages did not affect the sender.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
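The missing GSO check mentioned in this commit comes down to one condition: a packet larger than the MTU is only "too big" if it will not be segmented later. The sketch below models that with a hypothetical struct pkt_sketch in place of the kernel's sk_buff; it is not the body of __mtu_check_toobig_v6().

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the fields of interest in a socket buffer (hypothetical). */
struct pkt_sketch {
    uint32_t len;     /* total packet length */
    bool     is_gso;  /* true if segmentation is offloaded and happens later */
};

/* A GSO packet may exceed the MTU because it is split before transmission. */
bool mtu_check_toobig(const struct pkt_sketch *p, uint32_t mtu)
{
    return p->len > mtu && !p->is_gso;
}
```

Without the is_gso test, large offloaded packets between guests would each trigger an ICMPv6 "too big" reply that the sender cannot act on.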
  2. 26 Aug, 2012 1 commit
  3. 23 Aug, 2012 9 commits
    • packet: Protect packet sk list with mutex (v2) · 0fa7fa98
      Pavel Emelyanov authored
      Change since v1:
      
      * Fixed inuse counters access spotted by Eric
      
      In patch eea68e2f (packet: Report socket mclist info via diag module) I
      introduced a "scheduling in atomic" problem in the packet diag module: the
      socket list is traversed under rcu_read_lock(), while the sk mclist access
      performed under it requires the rtnl lock (i.e. a mutex) to be taken.
      
      [152363.820563] BUG: scheduling while atomic: crtools/12517/0x10000002
      [152363.820573] 4 locks held by crtools/12517:
      [152363.820581]  #0:  (sock_diag_mutex){+.+.+.}, at: [<ffffffff81a2dcb5>] sock_diag_rcv+0x1f/0x3e
      [152363.820613]  #1:  (sock_diag_table_mutex){+.+.+.}, at: [<ffffffff81a2de70>] sock_diag_rcv_msg+0xdb/0x11a
      [152363.820644]  #2:  (nlk->cb_mutex){+.+.+.}, at: [<ffffffff81a67d01>] netlink_dump+0x23/0x1ab
      [152363.820693]  #3:  (rcu_read_lock){.+.+..}, at: [<ffffffff81b6a049>] packet_diag_dump+0x0/0x1af
      
      A similar thing was then re-introduced by further packet diag patches (fanout
      mutex and pgvec mutex for rings) :(
      
      Apart from being terribly sorry for the above, I propose to change the packet
      sk list protection from spinlock to mutex. This lock currently protects two
      modifications:
      
      * sklist
      * prot inuse counters
      
      The sklist modifications can be just reprotected with mutex since they already
      occur in a sleeping context. The inuse counters modifications are trickier -- the
      __this_cpu_-s are used inside, thus requiring the caller to handle the potential
      issues with contexts himself. Since packet sockets' counters are modified in two
      places only (packet_create and packet_release) we only need to protect the context
      from being preempted. BH disabling is not required in this case.
      Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mdio: translation of MMD EEE registers to/from ethtool settings · b32607dd
      Allan, Bruce W authored
      The helper functions which translate IEEE MDIO Manageable Device (MMD)
      Energy-Efficient Ethernet (EEE) registers 3.20, 7.60 and 7.61 to and from
      the comparable ethtool supported/advertised settings will be needed by
      drivers other than those in PHYLIB (e.g. e1000e in a follow-on patch).
      
      In the same fashion as similar translation functions in linux/mii.h, move
      these functions from the PHYLIB core to the linux/mdio.h header file so the
      code will not have to be duplicated in each driver needing MMD-to-ethtool
      (and vice-versa) translations.  The function and some variable names have
      been renamed to be more descriptive.
      
      Not tested on the only hardware that currently calls the related functions,
      stmmac, because I don't have access to any.  Has been compile tested and
      the translations have been tested on a locally modified version of e1000e.
      Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
      Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
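A register-to-ethtool translation of the kind this commit moves into linux/mdio.h is essentially a bit-by-bit table lookup in both directions. The sketch below shows that shape only. The SK_* constants are assumptions written for this example and should be checked against linux/mdio.h and linux/ethtool.h; the function names are hypothetical as well.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed EEE register bits (verify against linux/mdio.h). */
#define SK_MDIO_EEE_100TX  0x0002u
#define SK_MDIO_EEE_1000T  0x0004u
#define SK_MDIO_EEE_10GT   0x0008u

/* Assumed ethtool advertising bits (verify against linux/ethtool.h). */
#define SK_ADV_100baseT_Full    (1u << 3)
#define SK_ADV_1000baseT_Full   (1u << 5)
#define SK_ADV_10000baseT_Full  (1u << 12)

static const struct {
    uint16_t mmd;   /* bit in the MMD EEE register */
    uint32_t adv;   /* matching ethtool advertising bit */
} eee_map[] = {
    { SK_MDIO_EEE_100TX, SK_ADV_100baseT_Full },
    { SK_MDIO_EEE_1000T, SK_ADV_1000baseT_Full },
    { SK_MDIO_EEE_10GT,  SK_ADV_10000baseT_Full },
};

#define EEE_MAP_LEN (sizeof(eee_map) / sizeof(eee_map[0]))

/* Register value to ethtool advertising mask. */
uint32_t eee_mmd_to_ethtool(uint16_t reg)
{
    uint32_t adv = 0;
    for (size_t i = 0; i < EEE_MAP_LEN; i++)
        if (reg & eee_map[i].mmd)
            adv |= eee_map[i].adv;
    return adv;
}

/* Ethtool advertising mask to register value; unknown bits are dropped. */
uint16_t ethtool_to_eee_mmd(uint32_t adv)
{
    uint16_t reg = 0;
    for (size_t i = 0; i < EEE_MAP_LEN; i++)
        if (adv & eee_map[i].adv)
            reg |= eee_map[i].mmd;
    return reg;
}
```

Keeping both directions driven by one table is one way to avoid the two translations drifting apart when a new speed is added.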
    • af_packet: use define instead of constant · 9e67030a
      danborkmann@iogearbox.net authored
      Instead of using a hard-coded value for the status variable, it would make
      the code more readable to use its destined define from linux/if_packet.h.
      
      Signed-off-by: daniel.borkmann@tik.ee.ethz.ch
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • rds: Don't disable BH on BH context · bfdc587c
      Ying Xue authored
      Since we are already in BH context when *_write_space(),
      *_data_ready() and *_state_change() are called, it's
      unnecessary to disable BH.
      Signed-off-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bonding: support for IPv6 transmit hashing · 6b923cb7
      John Eaglesham authored
      Currently the "bonding" driver does not support load balancing outgoing
      traffic in LACP mode for IPv6 traffic. IPv4 (and TCP or UDP over IPv4)
      are currently supported; this patch adds transmit hashing for IPv6 (and
      TCP or UDP over IPv6), bringing IPv6 up to par with IPv4 support in the
      bonding driver. In addition, bounds checking has been added to all
      transmit hashing functions.
      
      The algorithm chosen (xor'ing the bottom three quads of the source and
      destination addresses together, then xor'ing each byte of that result into
      the bottom byte, finally xor'ing with the last bytes of the MAC addresses)
      was selected after testing almost 400,000 unique IPv6 addresses harvested
      from server logs. This algorithm had the most even distribution for both
      big- and little-endian architectures while still using few instructions. Its
      behavior also attempts to closely match that of the IPv4 algorithm.
      
      The IPv6 flow label was intentionally not included in the hash as it appears
      to be unset in the vast majority of IPv6 traffic sampled, and the current
      algorithm not using the flow label already offers a very even distribution.
      
      Fragmented IPv6 packets are handled the same way as fragmented IPv4 packets,
      i.e., they are not balanced based on layer 4 information. Additionally,
      IPv6 packets with intermediate headers are not balanced based on layer
      4 information. In practice these intermediate headers are not common and
      this should not cause any problems, and the alternative (a packet-parsing
      loop and look-up table) seemed slow and complicated for little gain.
      Tested-by: John Eaglesham <linux@8192.net>
      Signed-off-by: John Eaglesham <linux@8192.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
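The hashing steps listed in this commit message (XOR the last three 32-bit words of both addresses, fold the bytes down, then mix in the last MAC bytes) can be written out as below. This is a sketch based on the prose description, not a copy of the bonding driver; the function name, the argument layout and the final modulo over the number of slaves are assumptions for the example.

```c
#include <stdint.h>

/*
 * saddr/daddr: IPv6 addresses as four 32-bit words each.
 * src_mac_last/dst_mac_last: last byte of each MAC address.
 * slave_count: number of links to balance over (must be non-zero).
 */
uint32_t ipv6_xmit_hash(const uint32_t saddr[4], const uint32_t daddr[4],
                        uint8_t src_mac_last, uint8_t dst_mac_last,
                        uint32_t slave_count)
{
    /* XOR the bottom three quads of source and destination together. */
    uint32_t h = (saddr[1] ^ daddr[1]) ^
                 (saddr[2] ^ daddr[2]) ^
                 (saddr[3] ^ daddr[3]);

    /* XOR every byte of the result into the bottom byte. */
    h ^= (h >> 24) ^ (h >> 16) ^ (h >> 8);

    /* Mix in the last bytes of the MAC addresses and pick a link. */
    return (h ^ dst_mac_last ^ src_mac_last) % slave_count;
}
```

Because every step is an XOR of source and destination values, the hash is the same for both directions of a flow when the addresses are swapped.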
    • ipv6: gre: fix ip6gre_err() · b87fb39e
      Eric Dumazet authored
      ip6gre_err() miscomputes grehlen (sizeof(ipv6h) is 4 or 8,
      not 40 as expected), and should take into account 'offset' parameter.
      
      Also use pskb_may_pull() to cope with fragmented skbs.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Dmitry Kozlov <xeb@mail.ru>
      Signed-off-by: David S. Miller <davem@davemloft.net>
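The sizeof mistake named in this commit is easy to reproduce outside the kernel: applied to a pointer, sizeof yields the pointer size (4 or 8 bytes), not the 40-byte header it points to. The struct below is a hypothetical stand-in with the same 40-byte layout as an IPv6 header, and the two functions exist only to contrast the expressions.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the fixed IPv6 header: 4 + 2 + 1 + 1 + 16 + 16 = 40 bytes. */
struct ipv6_hdr_sketch {
    uint32_t ver_tc_flow;
    uint16_t payload_len;
    uint8_t  nexthdr;
    uint8_t  hop_limit;
    uint8_t  saddr[16];
    uint8_t  daddr[16];
};

/* Buggy form: measures the pointer, not the header. */
size_t hdr_len_wrong(const struct ipv6_hdr_sketch *ipv6h)
{
    return sizeof(ipv6h);
}

/* Intended form: measures the object the pointer refers to. */
size_t hdr_len_right(const struct ipv6_hdr_sketch *ipv6h)
{
    return sizeof(*ipv6h);
}
```

Any offset computed from the buggy form lands inside the IPv6 header instead of just past it, which is why the GRE header length came out wrong.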
    • xfrm: fix RCU bugs · ef8531b6
      Eric Dumazet authored
      This patch reverts commit 56892261 (xfrm: Use rcu_dereference_bh to
      deference pointer protected by rcu_read_lock_bh), and fixes bugs
      introduced in commit 418a99ac ( Replace rwlock on xfrm_policy_afinfo
      with rcu )
      
      1) We properly use RCU variant in this file, not a mix of RCU/RCU_BH
      
      2) We must defer some writes after the synchronize_rcu() call or a reader
       can crash dereferencing NULL pointer.
      
      3) Now we use the xfrm_policy_afinfo_lock spinlock only from process
      context, we no longer need to block BH in xfrm_policy_register_afinfo()
      and xfrm_policy_unregister_afinfo()
      
      4) Can use RCU_INIT_POINTER() instead of rcu_assign_pointer() in
      xfrm_policy_unregister_afinfo()
      
      5) Remove a forward inline declaration (xfrm_policy_put_afinfo()),
        and also move xfrm_policy_get_afinfo() declaration.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Fan Du <fan.du@windriver.com>
      Cc: Priyanka Jain <Priyanka.Jain@freescale.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: remove delay at device dismantle · 0115e8e3
      Eric Dumazet authored
      I noticed an extra one-second delay in device dismantle, tracked down to
      a call to dst_dev_event() while some call_rcu() are still in RCU queues.
      
      These call_rcu() were posted by rt_free(struct rtable *rt) calls.
      
      We then wait for one second in netdev_wait_allrefs() before
      kicking NETDEV_UNREGISTER again.
      
      As the call_rcu() are now completed, dst_dev_event() can do the needed
      device swap on busy dst.
      
      To solve this problem, add a new NETDEV_UNREGISTER_FINAL, called
      after a rcu_barrier(), but outside of RTNL lock.
      
      Use NETDEV_UNREGISTER_FINAL with care !
      
      Change dst_dev_event() handler to react to NETDEV_UNREGISTER_FINAL
      
      Also remove NETDEV_UNREGISTER_BATCH, as it's not used anymore after
      IP cache removal.
      
      With help from Gao feng
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Mahesh Bandewar <maheshb@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Gao feng <gaofeng@cn.fujitsu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge git://1984.lsi.us.es/nf-next · bf277b0c
      David S. Miller authored
      Pablo Neira Ayuso says:
      
      ====================
      This is the first batch of Netfilter and IPVS updates for your
      net-next tree. Mostly cleanups for the Netfilter side. They are:
      
      * Remove unnecessary RTNL locking now that we have support
        for namespace in nf_conntrack, from Patrick McHardy.
      
      * Cleanup to eliminate unnecessary goto in the initialization
        path of several Netfilter tables, from Jean Sacren.
      
      * Another cleanup from Wu Fengguang, this time to PTR_RET instead
        of if IS_ERR then return PTR_ERR.
      
      * Use list_for_each_entry_continue_rcu in nf_iterate, from
        Michael Wang.
      
      * Add pmtu_disc sysctl option to IPVS to disable PMTU discovery in
        its tunneling transmitter, from Julian Anastasov.
      
      * Generalize application protocol registration in IPVS and modify
        IPVS FTP helper to use it, from Julian Anastasov.
      
      * update Kconfig. The IPVS FTP helper depends on the Netfilter FTP
        helper for NAT support, from Julian Anastasov.
      
      * Add logic to update PMTU for IPIP packets in IPVS, again
        from Julian Anastasov.
      
      * A couple of sparse warning fixes for IPVS and Netfilter from
        Claudiu Ghioc and Patrick McHardy respectively.
      
      Patrick's IPv6 NAT changes will follow after this batch, I need
      to flush this batch first before refreshing my tree.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 22 Aug, 2012 5 commits
  5. 21 Aug, 2012 16 commits
    • Merge branch 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media · a484147a
      Linus Torvalds authored
      Pull media fixes from Mauro Carvalho Chehab:
       "For bug fixes, at soc_camera, si470x, uvcvideo, iguanaworks IR driver,
        radio_shark Kbuild fixes, and at the V4L2 core (radio fixes)."
      
      * 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
        [media] media: soc_camera: don't clear pix->sizeimage in JPEG mode
        [media] media: mx2_camera: Fix clock handling for i.MX27
        [media] video: mx2_camera: Use clk_prepare_enable/clk_disable_unprepare
        [media] video: mx1_camera: Use clk_prepare_enable/clk_disable_unprepare
        [media] media: mx3_camera: buf_init() add buffer state check
        [media] radio-shark2: Only compile led support when CONFIG_LED_CLASS is set
        [media] radio-shark: Only compile led support when CONFIG_LED_CLASS is set
        [media] radio-shark*: Call cancel_work_sync from disconnect rather then release
        [media] radio-shark*: Remove work-around for dangling pointer in usb intfdata
        [media] Add USB dependency for IguanaWorks USB IR Transceiver
        [media] Add missing logging for rangelow/high of hwseek
        [media] VIDIOC_ENUM_FREQ_BANDS fix
        [media] mem2mem_testdev: fix querycap regression
        [media] si470x: v4l2-compliance fixes
        [media] DocBook: Remove a spurious character
        [media] uvcvideo: Reset the bytesused field when recycling an erroneous buffer
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 8f8ba75e
      Linus Torvalds authored
      Pull networking update from David Miller:
       "A couple weeks of bug fixing in there.  The largest chunk is all the
        broken crap Amerigo Wang found in the netpoll layer."
      
       1) netpoll and its users have several serious bugs:
          a) uses GFP_KERNEL with locks held
          b) interfaces requiring interrupts disabled are called with them
             enabled
          c) and vice versa
          d) VLAN tag demuxing, as per all other RX packet input paths, is not
             applied
      
          All from Amerigo Wang.
      
       2) Hopefully cure the ipv4 mapped ipv6 address TCP early demux bugs for
          good, from Neal Cardwell.
      
       3) Unlike AF_UNIX, AF_PACKET sockets don't set a default credentials
          when the user doesn't specify one explicitly during sendmsg().
          Instead we attach an empty (zero) SCM credential block which is
          definitely not what we want.  Fix from Eric Dumazet.
      
       4) IPv6 illegally invokes netdevice notifiers with RCU lock held, fix
          from Ben Hutchings.
      
       5) inet_csk_route_child_sock() checks wrong inet options pointer, fix
          from Christoph Paasch.
      
       6) When AF_PACKET is used for transmit, packet loopback doesn't behave
          properly when a socket fanout is enabled, from Eric Leblond.
      
       7) On bluetooth l2cap channel create failure, we leak the socket, from
          Jaganath Kanakkassery.
      
       8) Fix all the netprio file handling bugs found by Al Viro, from John
          Fastabend.
      
       9) Several error return and NULL deref bug fixes in networking drivers
          from Julia Lawall.
      
      10) A large smattering of struct padding et al. kernel memory leaks to
          userspace, found by Mathias Krause.
      
      11) Conntrack expectations in netfilter can access an uninitialized timer,
          fix from Pablo Neira Ayuso.
      
      12) Several netfilter SIP tracker bug fixes from Patrick McHardy.
      
      13) IPSEC ipv6 routes are not initialized correctly all the time,
          resulting in an OOPS in inet_putpeer().  Also from Patrick McHardy.
      
      14) Bridging does rcu_dereference() outside of RCU protected area, from
          Stephen Hemminger.
      
      15) Fix routing cache removal performance regression when looking up
          output routes that have a local destination.  From Zheng Yan.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (87 commits)
        af_netlink: force credentials passing [CVE-2012-3520]
        ipv4: fix ip header ident selection in __ip_make_skb()
        ipv4: Use newinet->inet_opt in inet_csk_route_child_sock()
        tcp: fix possible socket refcount problem
        net: tcp: move sk_rx_dst_set call after tcp_create_openreq_child()
        net/core/dev.c: fix kernel-doc warning
        netconsole: remove a redundant netconsole_target_put()
        net: ipv6: fix oops in inet_putpeer()
        net/stmmac: fix issue of clk_get for Loongson1B.
        caif: Do not dereference NULL in chnl_recv_cb()
        af_packet: don't emit packet on orig fanout group
        drivers/net/irda: fix error return code
        drivers/net/wan/dscc4.c: fix error return code
        drivers/net/wimax/i2400m/fw.c: fix error return code
        smsc75xx: add missing entry to MAINTAINERS
        net: qmi_wwan: new devices: UML290 and K5006-Z
        net: sh_eth: Add eth support for R8A7779 device
        netdev/phy: skip disabled mdio-mux nodes
        dt: introduce for_each_available_child_of_node, of_get_next_available_child
        net: netprio: fix cgrp create and write priomap race
        ...
    • mm: compaction: Abort async compaction if locks are contended or taking too long · c67fe375
      Mel Gorman authored
      Jim Schutt reported a problem that pointed at compaction contending
      heavily on locks.  The workload is straightforward and, in his own words:
      
      	The systems in question have 24 SAS drives spread across 3 HBAs,
      	running 24 Ceph OSD instances, one per drive.  FWIW these servers
      	are dual-socket Intel 5675 Xeons w/48 GB memory.  I've got ~160
      	Ceph Linux clients doing dd simultaneously to a Ceph file system
      	backed by 12 of these servers.
      
      Early in the test everything looks fine
      
        procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
         r  b       swpd       free       buff      cache   si   so    bi    bo   in   cs  us sy  id wa st
        31 15          0     287216        576   38606628    0    0     2  1158    2   14   1  3  95  0  0
        27 15          0     225288        576   38583384    0    0    18 2222016 203357 134876  11 56  17 15  0
        28 17          0     219256        576   38544736    0    0    11 2305932 203141 146296  11 49  23 17  0
         6 18          0     215596        576   38552872    0    0     7 2363207 215264 166502  12 45  22 20  0
        22 18          0     226984        576   38596404    0    0     3 2445741 223114 179527  12 43  23 22  0
      
      and then it goes to pot
      
        procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
         r  b       swpd       free       buff      cache   si   so    bi    bo   in   cs  us sy  id wa st
        163  8          0     464308        576   36791368    0    0    11 22210  866  536   3 13  79  4  0
        207 14          0     917752        576   36181928    0    0   712 1345376 134598 47367   7 90   1  2  0
        123 12          0     685516        576   36296148    0    0   429 1386615 158494 60077   8 84   5  3  0
        123 12          0     598572        576   36333728    0    0  1107 1233281 147542 62351   7 84   5  4  0
        622  7          0     660768        576   36118264    0    0   557 1345548 151394 59353   7 85   4  3  0
        223 11          0     283960        576   36463868    0    0    46 1107160 121846 33006   6 93   1  1  0
      
      Note that system CPU usage is very high and blocks being written out have
      dropped by 42%. He analysed this with perf and found
      
        perf record -g -a sleep 10
        perf report --sort symbol --call-graph fractal,5
          34.63%  [k] _raw_spin_lock_irqsave
                  |
                  |--97.30%-- isolate_freepages
                  |          compaction_alloc
                  |          unmap_and_move
                  |          migrate_pages
                  |          compact_zone
                  |          compact_zone_order
                  |          try_to_compact_pages
                  |          __alloc_pages_direct_compact
                  |          __alloc_pages_slowpath
                  |          __alloc_pages_nodemask
                  |          alloc_pages_vma
                  |          do_huge_pmd_anonymous_page
                  |          handle_mm_fault
                  |          do_page_fault
                  |          page_fault
                  |          |
                  |          |--87.39%-- skb_copy_datagram_iovec
                  |          |          tcp_recvmsg
                  |          |          inet_recvmsg
                  |          |          sock_recvmsg
                  |          |          sys_recvfrom
                  |          |          system_call
                  |          |          __recv
                  |          |          |
                  |          |           --100.00%-- (nil)
                  |          |
                  |           --12.61%-- memcpy
                   --2.70%-- [...]
      
      There was other data but primarily it is all showing that compaction is
      contended heavily on the zone->lock and zone->lru_lock.
      
      commit [b2eef8c0: mm: compaction: minimise the time IRQs are disabled
      while isolating pages for migration] noted that it was possible for
      migration to hold the lru_lock for an excessive amount of time. Very
      broadly speaking this patch expands the concept.
      
      This patch introduces compact_checklock_irqsave() to check if a lock
      is contended or the process needs to be scheduled. If either condition
      is true then async compaction is aborted and the caller is informed.
      The page allocator will fail a THP allocation if compaction failed due
      to contention. This patch also introduces compact_trylock_irqsave()
      which will acquire the lock only if it is not contended and the process
      does not need to schedule.
      Reported-by: Jim Schutt <jaschut@sandia.gov>
      Tested-by: Jim Schutt <jaschut@sandia.gov>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
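The "abort instead of waiting" idea behind compact_checklock_irqsave() can be sketched in user space with a C11 atomic flag standing in for the spinlock. This is an illustration only: the struct, the function names and the use of atomic_flag are assumptions for the example, and the kernel's additional check for a pending reschedule is left out.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical control block mirroring the two facts the caller needs. */
struct compact_ctl_sketch {
    bool sync;       /* synchronous callers are allowed to wait */
    bool contended;  /* set when an async attempt gave up on the lock */
};

/*
 * Try to take the lock. Returns true with the lock held. An async
 * caller that finds the lock busy records the contention and returns
 * false immediately; a sync caller keeps spinning until it succeeds.
 */
bool checklock_sketch(atomic_flag *lock, struct compact_ctl_sketch *cc)
{
    while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire)) {
        if (!cc->sync) {
            cc->contended = true;
            return false;
        }
    }
    return true;
}

void unlock_sketch(atomic_flag *lock)
{
    atomic_flag_clear_explicit(lock, memory_order_release);
}
```

The caller can then treat a false return as "compaction failed due to contention" and fall back to a smaller allocation rather than piling more waiters onto the same lock.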
    • mm: have order > 0 compaction start near a pageblock with free pages · de74f1cc
      Mel Gorman authored
      Commit 7db8889a ("mm: have order > 0 compaction start off where it
      left") introduced a caching mechanism to reduce the amount of work the free
      page scanner does in compaction.  However, it has a problem.  Consider
      two processes simultaneously scanning free pages
      
      					    			C
      	Process A		M     S     			F
      			|---------------------------------------|
      	Process B		M 	FS
      
      	C is zone->compact_cached_free_pfn
      	S is cc->start_free_pfn
      	M is cc->migrate_pfn
      	F is cc->free_pfn
      
      In this diagram, Process A has just reached its migrate scanner, wrapped
      around and updated compact_cached_free_pfn accordingly.
      
      Simultaneously, Process B finishes isolating in a block and updates
      compact_cached_free_pfn again to the location of its free scanner.
      
      Process A moves to "end_of_zone - one_pageblock" and runs this check
      
                      if (cc->order > 0 && (!cc->wrapped ||
                                            zone->compact_cached_free_pfn >
                                            cc->start_free_pfn))
                              pfn = min(pfn, zone->compact_cached_free_pfn);
      
      compact_cached_free_pfn is above where it started so the free scanner
      skips almost the entire space it should have scanned.  When there are
      multiple processes compacting it can end in a situation where the entire
      zone is not being scanned at all.  Further, it is possible for two
      processes to ping-pong updates to compact_cached_free_pfn, which leaves
      its value effectively random.
      
      Overall, the end result wrecks allocation success rates.
      
      There is not an obvious way around this problem without introducing new
      locking and state so this patch takes a different approach.
      
      First, it gets rid of the skip logic because it's not clear that it
      matters if two free scanners happen to be in the same block but with
      racing updates it's too easy for it to skip over blocks it should not.
      
      Second, it updates compact_cached_free_pfn in a more limited set of
      circumstances.
      
      If a scanner has wrapped, it updates compact_cached_free_pfn to the end
      of the zone. When a wrapped scanner isolates a page, it updates
      compact_cached_free_pfn to point to the highest pageblock it
      can isolate pages from.

      If a scanner has not wrapped when it has finished isolating pages, it
      checks if compact_cached_free_pfn is pointing to the end of the
      zone. If so, the value is updated to point to the highest
      pageblock that pages were isolated from. This value will not
      be updated again until a free page scanner wraps and resets
      compact_cached_free_pfn.
      
      This is not optimal and it can still race but the compact_cached_free_pfn
      will be pointing to or very near a pageblock with free pages.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Alexandre Bounine's avatar
      rapidio/tsi721: fix unused variable compiler warning · 9a9a9a7a
      Alexandre Bounine authored
      Fix unused variable compiler warning when built with CONFIG_RAPIDIO_DEBUG
      option off.
      
      This patch is applicable to kernel versions starting from v3.2.
      Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9a9a9a7a
    • Alexandre Bounine's avatar
      rapidio/tsi721: fix inbound doorbell interrupt handling · 3670e7e1
      Alexandre Bounine authored
      Make sure that no doorbell messages are left behind due to disabled
      interrupts during inbound doorbell processing.
      
      The most common case for this bug is loss of rionet JOIN messages in
      systems with three or more rionet participants and MSI or MSI-X enabled.
      As a result, requests for packet transfers may finish with a
      "destination unreachable" error message.
      
      This patch is applicable to kernel versions starting from v3.2.
      Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3670e7e1
    • Atsushi Nemoto's avatar
      drivers/rtc/rtc-rs5c348.c: fix hour decoding in 12-hour mode · 7dbfb315
      Atsushi Nemoto authored
      Correct the offset by subtracting 20 from tm_hour before taking the
      modulo 12.
      
      [ "Why 20?" I hear you ask. Or at least I did.
      
        Here's the reason why: RS5C348_BIT_PM is 32, and is - stupidly -
        included in the RS5C348_HOURS_MASK define.  So it's really subtracting
        out that bit to get "hour+12".  But then because it does things modulo
        12, it needs to add the 12 in again afterwards anyway.
      
        This code is confused.  It would be much clearer if RS5C348_HOURS_MASK
        just didn't include the RS5C348_BIT_PM bit at all, then it wouldn't
        need to do the silly subtract either.
      
        Whatever. It's all just math, the end result is the same.   - Linus ]
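      
      The arithmetic in the note above can be checked with a small userspace
      sketch.  The decode_hour() and bcd_to_bin() helpers are hypothetical
      names written for this illustration, and the 0x3f value of
      RS5C348_HOURS_MASK is an assumption; the text above only states that
      RS5C348_BIT_PM is 32 and that the mask includes it.  Because the hours
      register is BCD, the PM bit decodes as two extra tens, which is where
      the 20 comes from.

```c
#include <assert.h>

#define RS5C348_BIT_PM     0x20 /* 32: PM flag stored in the hours register */
#define RS5C348_HOURS_MASK 0x3f /* assumed value; note it includes the PM bit */

/* Convert one packed BCD byte to binary (tens in the high nibble). */
static int bcd_to_bin(unsigned char val)
{
	return (val & 0x0f) + (val >> 4) * 10;
}

/* Decode the raw hours register into an hour in the range 0..23. */
static int decode_hour(unsigned char reg, int rtc_24h)
{
	int hour = bcd_to_bin(reg & RS5C348_HOURS_MASK);

	if (!rtc_24h) {
		if (reg & RS5C348_BIT_PM) {
			hour -= 20; /* the PM bit was decoded as two BCD tens */
			hour %= 12; /* 12 PM wraps to 0 */
			hour += 12; /* 0 becomes noon, 1 PM becomes 13 */
		} else {
			hour %= 12; /* 12 AM is midnight */
		}
	}
	assert(hour >= 0 && hour <= 23); /* holds for valid register values */
	return hour;
}
```

      For example, 12 PM is stored as 0x32 in this model: it decodes to 32,
      minus 20 gives 12, modulo 12 gives 0, and adding 12 yields 12.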
      Reported-by: James Nute <newten82@gmail.com>
      Tested-by: James Nute <newten82@gmail.com>
      Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7dbfb315
    • Alex Shi's avatar
      mm: correct page->pfmemalloc to fix deactivate_slab regression · b121186a
      Alex Shi authored
      Commit cfd19c5a ("mm: only set page->pfmemalloc when
      ALLOC_NO_WATERMARKS was used") tried to narrow down where
      page->pfmemalloc is set, but it missed some places where pfmemalloc
      should be set.
      
      So, in __slab_alloc, the mismatch between pfmemalloc and
      ALLOC_NO_WATERMARKS causes incorrect deactivate_slab() calls on our
      core2 server:
      
          64.73%           fio  [kernel.kallsyms]     [k] _raw_spin_lock
                           |
                           --- _raw_spin_lock
                              |
                              |---0.34%-- deactivate_slab
                              |          __slab_alloc
                              |          kmem_cache_alloc
                              |          |
      
      This causes a 40% regression in our fio sync write performance.
      
      Move the check into get_page_from_freelist(), which resolves this issue.
      Signed-off-by: Alex Shi <alex.shi@intel.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: David Miller <davem@davemloft.net>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Tested-by: Eric Dumazet <eric.dumazet@gmail.com>
      Tested-by: Sage Weil <sage@inktank.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b121186a
    • Ilya Shchepetkov's avatar
      drivers/rtc/rtc-pcf2123.c: initialize dynamic sysfs attributes · 5ed12f12
      Ilya Shchepetkov authored
      Dynamically allocated sysfs attributes must be initialized using
      sysfs_attr_init(), otherwise lockdep complains: BUG: key <address> not in
      .data!
      
      Found by Linux Driver Verification project (linuxtesting.org).
      Signed-off-by: Ilya Shchepetkov <shchepetkov@ispras.ru>
      Cc: Chris Verges <chrisv@cyberswitching.com>
      Cc: Christian Pellegrin <chripell@fsfe.org>
      Cc: Alessandro Zummo <a.zummo@towertech.it>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5ed12f12
    • Minchan Kim's avatar
      mm/compaction.c: fix deferring compaction mistake · c81758fb
      Minchan Kim authored
      Commit aff62249 ("vmscan: only defer compaction for failed order and
      higher") fixed a bad deferral policy but made a mistake when checking
      compact_order_failed in __compact_pgdat(), so it cannot update
      compact_order_failed with the new order.  This prevents the deferral
      policy from operating correctly.  This patch fixes it.
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c81758fb
    • Robin Holt's avatar
      drivers/misc/sgi-xp/xpc_uv.c: SGI XPC fails to load when cpu 0 is out of IRQ resources · 7838f994
      Robin Holt authored
      On many of our larger systems, CPU 0 has had all of its IRQ resources
      consumed before XPC loads.  Worst cases on machines with multiple 10
      GigE cards and multiple IB cards have depleted the entire first socket
      of IRQs.
      
      This patch makes the node on which IRQs (as well as all the other GRU
      Message Queue structures) are allocated selectable via a module load
      parameter, and by default searches all nodes/cpus for available
      resources.
      
      [akpm@linux-foundation.org: fix build: include cpu.h and module.h]
      Signed-off-by: Robin Holt <holt@sgi.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7838f994
    • WANG Cong's avatar
      string: do not export memweight() to userspace · c3a5ce04
      WANG Cong authored
      Fix the following warning:
      
        usr/include/linux/string.h:8: userspace cannot reference function or variable defined in the kernel
      Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com>
      Acked-by: Akinobu Mita <akinobu.mita@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c3a5ce04
    • Zhouping Liu's avatar
      hugetlb: update hugetlbpage.txt · d46f3d86
      Zhouping Liu authored
      Commit f0f57b2b ("mm: move hugepage test examples to
      tools/testing/selftests/vm") moved map_hugetlb.c, hugepage-shm.c and
      hugepage-mmap.c tests into tools/testing/selftests/vm/ directory, but it
      didn't update hugetlbpage.txt.
      Signed-off-by: Zhouping Liu <sanweidaying@gmail.com>
      Acked-by: Dave Young <dyoung@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d46f3d86
    • Joe Perches's avatar
      checkpatch: add control statement test to SINGLE_STATEMENT_DO_WHILE_MACRO · ac8e97f8
      Joe Perches authored
      Commit b13edf7f ("checkpatch: add checks for do {} while (0) macro
      misuses") added a test that is overly simplistic for single statement
      macros.
      
      Macros that start with control tests should be enclosed in a do {} while
      (0) loop.
      
      Add the necessary control tests to the check.
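      
      As a minimal sketch of why such macros need the wrapper, the example
      below compares a bare macro that starts with an if against the same
      macro enclosed in do {} while (0).  COUNT_IF_BARE, COUNT_IF_SAFE,
      use_bare() and use_safe() are hypothetical names made up for this
      illustration; they are not taken from checkpatch or from kernel code.
      A compiler may warn about the dangling else in use_bare(), which is
      the point of the demonstration.

```c
#include <assert.h>

/* Hypothetical macros, for illustration only. */
#define COUNT_IF_BARE(cond, n)	if (cond) (*(n))++
#define COUNT_IF_SAFE(cond, n)	do { if (cond) (*(n))++; } while (0)

/* Returns 1 if the caller's else branch ran, 0 otherwise. */
static int use_bare(int enabled, int cond, int *hits)
{
	int took_else = 0;

	assert(hits);
	if (enabled)
		COUNT_IF_BARE(cond, hits);
	else	/* binds to the if hidden inside the macro */
		took_else = 1;
	return took_else;
}

/* Same caller, but the macro is a single statement. */
static int use_safe(int enabled, int cond, int *hits)
{
	int took_else = 0;

	assert(hits);
	if (enabled)
		COUNT_IF_SAFE(cond, hits);
	else	/* binds to if (enabled), as written */
		took_else = 1;
	return took_else;
}
```

      With the bare macro the else branch runs when enabled is set and cond
      is false, and never runs when enabled is clear, which is the opposite
      of what the indentation suggests.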
      Signed-off-by: Joe Perches <joe@perches.com>
      Acked-by: Andy Whitcroft <apw@canonical.com>
      Tested-by: Franz Schrober <franzschrober@yahoo.de>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ac8e97f8
    • Michal Hocko's avatar
      mm: hugetlbfs: correctly populate shared pmd · eb48c071
      Michal Hocko authored
      Each page mapped in a process's address space must be correctly
      accounted for in _mapcount.  Normally the rules for this are
      straightforward but hugetlbfs page table sharing is different.  The page
      table pages at the PMD level are reference counted while the mapcount
      remains the same.
      
      If this accounting is wrong, it causes bugs like this one reported by
      Larry Woodman:
      
        kernel BUG at mm/filemap.c:135!
        invalid opcode: 0000 [#1] SMP
        CPU 22
        Modules linked in: bridge stp llc sunrpc binfmt_misc dcdbas microcode pcspkr acpi_pad acpi]
        Pid: 18001, comm: mpitest Tainted: G        W    3.3.0+ #4 Dell Inc. PowerEdge R620/07NDJ2
        RIP: 0010:[<ffffffff8112cfed>]  [<ffffffff8112cfed>] __delete_from_page_cache+0x15d/0x170
        Process mpitest (pid: 18001, threadinfo ffff880428972000, task ffff880428b5cc20)
        Call Trace:
          delete_from_page_cache+0x40/0x80
          truncate_hugepages+0x115/0x1f0
          hugetlbfs_evict_inode+0x18/0x30
          evict+0x9f/0x1b0
          iput_final+0xe3/0x1e0
          iput+0x3e/0x50
          d_kill+0xf8/0x110
          dput+0xe2/0x1b0
          __fput+0x162/0x240
      
      During fork(), copy_hugetlb_page_range() detects if huge_pte_alloc()
      shared page tables with the check dst_pte == src_pte.  The logic is if
      the PMD page is the same, they must be shared.  This assumes that the
      sharing is between the parent and child.  However, if the sharing is
      with a different process entirely then this check fails as in this
      diagram:
      
        parent
          |
          ------------>pmd
                       src_pte----------> data page
                                              ^
        other--------->pmd--------------------|
                        ^
        child-----------|
                       dst_pte
      
      For this situation to occur, it must be possible for Parent and Other to
      have faulted and failed to share page tables with each other.  This is
      possible due to the following style of race.
      
        PROC A                                          PROC B
        copy_hugetlb_page_range                         copy_hugetlb_page_range
          src_pte == huge_pte_offset                      src_pte == huge_pte_offset
          !src_pte so no sharing                          !src_pte so no sharing
      
        (time passes)
      
        hugetlb_fault                                   hugetlb_fault
          huge_pte_alloc                                  huge_pte_alloc
            huge_pmd_share                                 huge_pmd_share
              LOCK(i_mmap_mutex)
              find nothing, no sharing
              UNLOCK(i_mmap_mutex)
                                                            LOCK(i_mmap_mutex)
                                                            find nothing, no sharing
                                                            UNLOCK(i_mmap_mutex)
            pmd_alloc                                       pmd_alloc
            LOCK(instantiation_mutex)
            fault
            UNLOCK(instantiation_mutex)
                                                        LOCK(instantiation_mutex)
                                                        fault
                                                        UNLOCK(instantiation_mutex)
      
      These two processes are now pointing to the same data page but are not
      sharing page tables because the opportunity was missed.  When either
      process later forks, the src_pte == dst_pte check is potentially
      insufficient.  As the check falls through, the wrong PTE information is
      copied in (harmless but wrong) and the mapcount is bumped for a page
      mapped by a shared page table, leading to the BUG_ON.
      
      This patch addresses the issue by moving pmd_alloc into huge_pmd_share
      which guarantees that the shared pud is populated in the same critical
      section as pmd.  This also means that huge_pte_offset test in
      huge_pmd_share is serialized correctly now which in turn means that the
      success of the sharing will be higher as the racing tasks see the pud
      and pmd populated together.
      
      Race identified and changelog written mostly by Mel Gorman.
      
      [akpm@linux-foundation.org: attempt to make the huge_pmd_share() comment comprehensible, clean up coding style]
      Reported-by: Larry Woodman <lwoodman@redhat.com>
      Tested-by: Larry Woodman <lwoodman@redhat.com>
      Reviewed-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Ken Chen <kenchen@google.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eb48c071
    • Stephen M. Cameron's avatar
      cciss: fix incorrect scsi status reporting · b0cf0b11
      Stephen M. Cameron authored
      Delete code which sets SCSI status incorrectly as it's already been set
      correctly above this incorrect code.  The bug was introduced in 2009 by
      commit b0e15f6d ("cciss: fix typo that causes scsi status to be
      lost.")
      Signed-off-by: Stephen M. Cameron <scameron@beardog.cce.hp.com>
      Reported-by: Roel van Meer <roel.vanmeer@bokxing.nl>
      Tested-by: Roel van Meer <roel.vanmeer@bokxing.nl>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b0cf0b11