1. 22 Mar, 2014 10 commits
    • Emmanuel Grumbach's avatar
      mac80211: fix AP powersave TX vs. wakeup race · 58d43105
      Emmanuel Grumbach authored
      commit 1d147bfa upstream.
      
      There is a race between the TX path and the STA wakeup: while
      a station is sleeping, mac80211 buffers frames until it wakes
      up, then the frames are transmitted. However, the RX and TX
      path are concurrent, so the packet indicating wakeup can be
      processed while a packet is being transmitted.
      
      This can lead to a situation where the buffered frames list
      is emptied on the one side, while a frame is being added on
      the other side, as the station is still seen as sleeping in
      the TX path.
      
      As a result, the newly added frame will not be send anytime
      soon. It might be sent much later (and out of order) when the
      station goes to sleep and wakes up the next time.
      
      Additionally, it can lead to the crash below.
      
      Fix all this by synchronising both paths with a new lock.
      Both path are not fastpath since they handle PS situations.
      
      In a later patch we'll remove the extra skb queue locks to
      reduce locking overhead.
      
      BUG: unable to handle kernel
      NULL pointer dereference at 000000b0
      IP: [<ff6f1791>] ieee80211_report_used_skb+0x11/0x3e0 [mac80211]
      *pde = 00000000
      Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
      EIP: 0060:[<ff6f1791>] EFLAGS: 00210282 CPU: 1
      EIP is at ieee80211_report_used_skb+0x11/0x3e0 [mac80211]
      EAX: e5900da0 EBX: 00000000 ECX: 00000001 EDX: 00000000
      ESI: e41d00c0 EDI: e5900da0 EBP: ebe458e4 ESP: ebe458b0
       DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
      CR0: 8005003b CR2: 000000b0 CR3: 25a78000 CR4: 000407d0
      DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
      DR6: ffff0ff0 DR7: 00000400
      Process iperf (pid: 3934, ti=ebe44000 task=e757c0b0 task.ti=ebe44000)
      iwlwifi 0000:02:00.0: I iwl_pcie_enqueue_hcmd Sending command LQ_CMD (#4e), seq: 0x0903, 92 bytes at 3[3]:9
      Stack:
       e403b32c ebe458c4 00200002 00200286 e403b338 ebe458cc c10960bb e5900da0
       ff76a6ec ebe458d8 00000000 e41d00c0 e5900da0 ebe458f0 ff6f1b75 e403b210
       ebe4598c ff723dc1 00000000 ff76a6ec e597c978 e403b758 00000002 00000002
      Call Trace:
       [<ff6f1b75>] ieee80211_free_txskb+0x15/0x20 [mac80211]
       [<ff723dc1>] invoke_tx_handlers+0x1661/0x1780 [mac80211]
       [<ff7248a5>] ieee80211_tx+0x75/0x100 [mac80211]
       [<ff7249bf>] ieee80211_xmit+0x8f/0xc0 [mac80211]
       [<ff72550e>] ieee80211_subif_start_xmit+0x4fe/0xe20 [mac80211]
       [<c149ef70>] dev_hard_start_xmit+0x450/0x950
       [<c14b9aa9>] sch_direct_xmit+0xa9/0x250
       [<c14b9c9b>] __qdisc_run+0x4b/0x150
       [<c149f732>] dev_queue_xmit+0x2c2/0xca0
      Reported-by: default avatarYaara Rozenblum <yaara.rozenblum@intel.com>
      Signed-off-by: default avatarEmmanuel Grumbach <emmanuel.grumbach@intel.com>
      Reviewed-by: default avatarStanislaw Gruszka <sgruszka@redhat.com>
      [reword commit log, use a separate lock]
      Signed-off-by: default avatarJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      58d43105
    • Felix Fietkau's avatar
      mac80211: send control port protocol frames to the VO queue · b6f6dfa1
      Felix Fietkau authored
      commit 1bf4bbb4 upstream.
      
      Improves reliability of wifi connections with WPA, since authentication
      frames are prioritized over normal traffic and also typically exempt
      from aggregation.
      Signed-off-by: default avatarFelix Fietkau <nbd@openwrt.org>
      Signed-off-by: default avatarJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      b6f6dfa1
    • Alexandre Bounine's avatar
      rapidio/tsi721: fix tasklet termination in dma channel release · 97992a40
      Alexandre Bounine authored
      commit 04379dff upstream.
      
      This patch is a modification of the patch originally proposed by
      Xiaotian Feng <xtfeng@gmail.com>: https://lkml.org/lkml/2012/11/5/413
      This new version disables DMA channel interrupts and ensures that the
      tasklet wil not be scheduled again before calling tasklet_kill().
      
      Unfortunately the updated patch was not released at that time due to
      planned rework of Tsi721 mport driver to use threaded interrupts (which
      has yet to happen).  Recently the issue was reported again:
      https://lkml.org/lkml/2014/2/19/762.
      
      Description from the original Xiaotian's patch:
      
       "Some drivers use tasklet_disable in device remove/release process,
        tasklet_disable will inc tasklet->count and return.  If the tasklet is
        not handled yet under some softirq pressure, the tasklet will be
        placed on the tasklet_vec, never have a chance to be excuted.  This
        might lead to a heavy loaded ksoftirqd, wakeup with pending_softirq,
        but tasklet is disabled.  tasklet_kill should be used in this case."
      
      This patch is applicable to kernel versions starting from v3.5.
      Signed-off-by: default avatarAlexandre Bounine <alexandre.bounine@idt.com>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Cc: Xiaotian Feng <xtfeng@gmail.com>
      Reviewed-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: Mike Galbraith <bitbucket@online.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      97992a40
    • George McCollister's avatar
      sched: Fix double normalization of vruntime · 560877ed
      George McCollister authored
      commit 791c9e02 upstream.
      
      dequeue_entity() is called when p->on_rq and sets se->on_rq = 0
      which appears to guarentee that the !se->on_rq condition is met.
      If the task has done set_current_state(TASK_INTERRUPTIBLE) without
      schedule() the second condition will be met and vruntime will be
      incorrectly adjusted twice.
      
      In certain cases this can result in the task's vruntime never increasing
      past the vruntime of other tasks on the CFS' run queue, starving them of
      CPU time.
      
      This patch changes switched_from_fair() to use !p->on_rq instead of
      !se->on_rq.
      
      I'm able to cause a task with a priority of 120 to starve all other
      tasks with the same priority on an ARM platform running 3.2.51-rt72
      PREEMPT RT by writing one character at time to a serial tty (16550 UART)
      in a tight loop. I'm also able to verify making this change corrects the
      problem on that platform and kernel version.
      Signed-off-by: default avatarGeorge McCollister <george.mccollister@gmail.com>
      Signed-off-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1392767811-28916-1-git-send-email-george.mccollister@gmail.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      560877ed
    • Hugh Dickins's avatar
      memcg: fix endless loop in __mem_cgroup_iter_next() · 9d865d44
      Hugh Dickins authored
      commit ce48225f upstream.
      
      Commit 0eef6156 ("memcg: fix css reference leak and endless loop in
      mem_cgroup_iter") got the interaction with the commit a few before it
      d8ad3055 ("mm/memcg: iteration skip memcgs not yet fully
      initialized") slightly wrong, and we didn't notice at the time.
      
      It's elusive, and harder to get than the original, but for a couple of
      days before rc1, I several times saw a endless loop similar to that
      supposedly being fixed.
      
      This time it was a tighter loop in __mem_cgroup_iter_next(): because we
      can get here when our root has already been offlined, and the ordering
      of conditions was such that we then just cycled around forever.
      
      Fixes: 0eef6156 ("memcg: fix css reference leak and endless loop in mem_cgroup_iter").
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      9d865d44
    • Al Viro's avatar
      ocfs2 syncs the wrong range... · 313bb242
      Al Viro authored
      commit 1b56e989 upstream.
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      313bb242
    • Jan Kara's avatar
      ocfs2: fix quota file corruption · 5a131c1d
      Jan Kara authored
      commit 15c34a76 upstream.
      
      Global quota files are accessed from different nodes.  Thus we cannot
      cache offset of quota structure in the quota file after we drop our node
      reference count to it because after that moment quota structure may be
      freed and reallocated elsewhere by a different node resulting in
      corruption of quota file.
      
      Fix the problem by clearing dq_off when we are releasing dquot structure.
      We also remove the DB_READ_B handling because it is useless -
      DQ_ACTIVE_B is set iff DQ_READ_B is set.
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Reviewed-by: default avatarMark Fasheh <mfasheh@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      5a131c1d
    • Vlastimil Babka's avatar
      mm: include VM_MIXEDMAP flag in the VM_SPECIAL list to avoid m(un)locking · ae865ad6
      Vlastimil Babka authored
      commit 9050d7eb upstream.
      
      Daniel Borkmann reported a VM_BUG_ON assertion failing:
      
        ------------[ cut here ]------------
        kernel BUG at mm/mlock.c:528!
        invalid opcode: 0000 [#1] SMP
        Modules linked in: ccm arc4 iwldvm [...]
         video
        CPU: 3 PID: 2266 Comm: netsniff-ng Not tainted 3.14.0-rc2+ #8
        Hardware name: LENOVO 2429BP3/2429BP3, BIOS G4ET37WW (1.12 ) 05/29/2012
        task: ffff8801f87f9820 ti: ffff88002cb44000 task.ti: ffff88002cb44000
        RIP: 0010:[<ffffffff81171ad0>]  [<ffffffff81171ad0>] munlock_vma_pages_range+0x2e0/0x2f0
        Call Trace:
          do_munmap+0x18f/0x3b0
          vm_munmap+0x41/0x60
          SyS_munmap+0x22/0x30
          system_call_fastpath+0x1a/0x1f
        RIP   munlock_vma_pages_range+0x2e0/0x2f0
        ---[ end trace a0088dcf07ae10f2 ]---
      
      because munlock_vma_pages_range() thinks it's unexpectedly in the middle
      of a THP page.  This can be reproduced with default config since 3.11
      kernels.  A reproducer can be found in the kernel's selftest directory
      for networking by running ./psock_tpacket.
      
      The problem is that an order=2 compound page (allocated by
      alloc_one_pg_vec_page() is part of the munlocked VM_MIXEDMAP vma (mapped
      by packet_mmap()) and mistaken for a THP page and assumed to be order=9.
      
      The checks for THP in munlock came with commit ff6a6da6 ("mm:
      accelerate munlock() treatment of THP pages"), i.e.  since 3.9, but did
      not trigger a bug.  It just makes munlock_vma_pages_range() skip such
      compound pages until the next 512-pages-aligned page, when it encounters
      a head page.  This is however not a problem for vma's where mlocking has
      no effect anyway, but it can distort the accounting.
      
      Since commit 7225522b ("mm: munlock: batch non-THP page isolation
      and munlock+putback using pagevec") this can trigger a VM_BUG_ON in
      PageTransHuge() check.
      
      This patch fixes the issue by adding VM_MIXEDMAP flag to VM_SPECIAL, a
      list of flags that make vma's non-mlockable and non-mergeable.  The
      reasoning is that VM_MIXEDMAP vma's are similar to VM_PFNMAP, which is
      already on the VM_SPECIAL list, and both are intended for non-LRU pages
      where mlocking makes no sense anyway.  Related Lkml discussion can be
      found in [2].
      
       [1] tools/testing/selftests/net/psock_tpacket
       [2] https://lkml.org/lkml/2014/1/10/427Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarDaniel Borkmann <dborkman@redhat.com>
      Reported-by: default avatarDaniel Borkmann <dborkman@redhat.com>
      Tested-by: default avatarDaniel Borkmann <dborkman@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: John David Anglin <dave.anglin@bell.net>
      Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Jared Hulbert <jaredeh@gmail.com>
      Tested-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      ae865ad6
    • Johannes Weiner's avatar
      mm: page_alloc: exempt GFP_THISNODE allocations from zone fairness · edbe8157
      Johannes Weiner authored
      commit 27329369 upstream.
      
      Jan Stancek reports manual page migration encountering allocation
      failures after some pages when there is still plenty of memory free, and
      bisected the problem down to commit 81c0a2bb ("mm: page_alloc: fair
      zone allocator policy").
      
      The problem is that GFP_THISNODE obeys the zone fairness allocation
      batches on one hand, but doesn't reset them and wake kswapd on the other
      hand.  After a few of those allocations, the batches are exhausted and
      the allocations fail.
      
      Fixing this means either having GFP_THISNODE wake up kswapd, or
      GFP_THISNODE not participating in zone fairness at all.  The latter
      seems safer as an acute bugfix, we can clean up later.
      Reported-by: default avatarJan Stancek <jstancek@redhat.com>
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      edbe8157
    • Minchan Kim's avatar
      zram: avoid null access when fail to alloc meta · faf65d4d
      Minchan Kim authored
      commit db5d711e upstream.
      
      zram_meta_alloc could fail so caller should check it.  Otherwise, your
      system will hang.
      Signed-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Acked-by: default avatarJerome Marchand <jmarchan@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      faf65d4d
  2. 21 Mar, 2014 2 commits
    • Rob Clark's avatar
      drm/ttm: don't oops if no invalidate_caches() · e00feda0
      Rob Clark authored
      commit 9ef7506f upstream.
      
      A few of the simpler TTM drivers (cirrus, ast, mgag200) do not implement
      this function.  Yet can end up somehow with an evicted bo:
      
        BUG: unable to handle kernel NULL pointer dereference at           (null)
        IP: [<          (null)>]           (null)
        PGD 16e761067 PUD 16e6cf067 PMD 0
        Oops: 0010 [#1] SMP
        Modules linked in: bnep bluetooth rfkill fuse ip6t_rpfilter ip6t_REJECT ipt_REJECT xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw iptable_filter ip_tables sg btrfs zlib_deflate raid6_pq xor dm_queue_length iTCO_wdt iTCO_vendor_support coretemp kvm dcdbas dm_service_time microcode serio_raw pcspkr lpc_ich mfd_core i7core_edac edac_core ses enclosure ipmi_si ipmi_msghandler shpchp acpi_power_meter mperf nfsd auth_rpcgss nfs_acl lockd uinput sunrpc dm_multipath xfs libcrc32c ata_generic pata_acpi sr_mod cdrom
         sd_mod usb_storage mgag200 syscopyarea sysfillrect sysimgblt i2c_algo_bit lpfc drm_kms_helper ttm crc32c_intel ata_piix bfa drm ixgbe libata i2c_core mdio crc_t10dif ptp crct10dif_common pps_core scsi_transport_fc dca scsi_tgt megaraid_sas bnx2 dm_mirror dm_region_hash dm_log dm_mod
        CPU: 16 PID: 2572 Comm: X Not tainted 3.10.0-86.el7.x86_64 #1
        Hardware name: Dell Inc. PowerEdge R810/0H235N, BIOS 0.3.0 11/14/2009
        task: ffff8801799dabc0 ti: ffff88016c884000 task.ti: ffff88016c884000
        RIP: 0010:[<0000000000000000>]  [<          (null)>]           (null)
        RSP: 0018:ffff88016c885ad8  EFLAGS: 00010202
        RAX: ffffffffa04e94c0 RBX: ffff880178937a20 RCX: 0000000000000000
        RDX: 0000000000000000 RSI: 0000000000240004 RDI: ffff880178937a00
        RBP: ffff88016c885b60 R08: 00000000000171a0 R09: ffff88007cf171a0
        R10: ffffea0005842540 R11: ffffffff810487b9 R12: ffff880178937b30
        R13: ffff880178937a00 R14: ffff88016c885b78 R15: ffff880179929400
        FS:  00007f81ba2ef980(0000) GS:ffff88007cf00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000000 CR3: 000000016e763000 CR4: 00000000000007e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
        Stack:
         ffffffffa0306fae ffff8801799295c0 0000000000260004 0000000000000001
         ffff88016c885b60 ffffffffa0307669 00ff88007cf17738 ffff88017cf17700
         ffff880178937a00 ffff880100000000 ffff880100000000 0000000079929400
        Call Trace:
         [<ffffffffa0306fae>] ? ttm_bo_handle_move_mem+0x54e/0x5b0 [ttm]
         [<ffffffffa0307669>] ? ttm_bo_mem_space+0x169/0x340 [ttm]
         [<ffffffffa0307bd7>] ttm_bo_move_buffer+0x117/0x130 [ttm]
         [<ffffffff81130001>] ? perf_event_init_context+0x141/0x220
         [<ffffffffa0307cb1>] ttm_bo_validate+0xc1/0x130 [ttm]
         [<ffffffffa04e7377>] mgag200_bo_pin+0x87/0xc0 [mgag200]
         [<ffffffffa04e56c4>] mga_crtc_cursor_set+0x474/0xbb0 [mgag200]
         [<ffffffff811971d2>] ? __mem_cgroup_commit_charge+0x152/0x3b0
         [<ffffffff815c4182>] ? mutex_lock+0x12/0x2f
         [<ffffffffa0201433>] drm_mode_cursor_common+0x123/0x170 [drm]
         [<ffffffffa0205231>] drm_mode_cursor_ioctl+0x41/0x50 [drm]
         [<ffffffffa01f5ca2>] drm_ioctl+0x502/0x630 [drm]
         [<ffffffff815cbab4>] ? __do_page_fault+0x1f4/0x510
         [<ffffffff8101cb68>] ? __restore_xstate_sig+0x218/0x4f0
         [<ffffffff811b4445>] do_vfs_ioctl+0x2e5/0x4d0
         [<ffffffff8124488e>] ? file_has_perm+0x8e/0xa0
         [<ffffffff811b46b1>] SyS_ioctl+0x81/0xa0
         [<ffffffff815d05d9>] system_call_fastpath+0x16/0x1b
        Code:  Bad RIP value.
        RIP  [<          (null)>]           (null)
         RSP <ffff88016c885ad8>
        CR2: 0000000000000000
      Signed-off-by: default avatarRob Clark <rclark@redhat.com>
      Reviewed-by: default avatarJérôme Glisse <jglisse@redhat.com>
      Reviewed-by: default avatarThomas Hellstrom <thellstrom@vmware.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      e00feda0
    • Filipe Brandenburger's avatar
      memcg: reparent charges of children before processing parent · fda958d5
      Filipe Brandenburger authored
      commit 4fb1a86f upstream.
      
      Sometimes the cleanup after memcg hierarchy testing gets stuck in
      mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0.
      
      There may turn out to be several causes, but a major cause is this: the
      workitem to offline parent can get run before workitem to offline child;
      parent's mem_cgroup_reparent_charges() circles around waiting for the
      child's pages to be reparented to its lrus, but it's holding
      cgroup_mutex which prevents the child from reaching its
      mem_cgroup_reparent_charges().
      
      Further testing showed that an ordered workqueue for cgroup_destroy_wq
      is not always good enough: percpu_ref_kill_and_confirm's call_rcu_sched
      stage on the way can mess up the order before reaching the workqueue.
      
      Instead, when offlining a memcg, call mem_cgroup_reparent_charges() on
      all its children (and grandchildren, in the correct order) to have their
      charges reparented first.
      
      Fixes: e5fca243 ("cgroup: use a dedicated workqueue for cgroup destruction")
      Signed-off-by: default avatarFilipe Brandenburger <filbranden@google.com>
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Reviewed-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>	[v3.10+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      fda958d5
  3. 14 Mar, 2014 4 commits
  4. 13 Mar, 2014 11 commits
    • Daniel Borkmann's avatar
      net: sctp: fix sctp_sf_do_5_1D_ce to verify if we/peer is AUTH capable · 00c53b02
      Daniel Borkmann authored
      [ Upstream commit ec0223ec ]
      
      RFC4895 introduced AUTH chunks for SCTP; during the SCTP
      handshake RANDOM; CHUNKS; HMAC-ALGO are negotiated (CHUNKS
      being optional though):
      
        ---------- INIT[RANDOM; CHUNKS; HMAC-ALGO] ---------->
        <------- INIT-ACK[RANDOM; CHUNKS; HMAC-ALGO] ---------
        -------------------- COOKIE-ECHO -------------------->
        <-------------------- COOKIE-ACK ---------------------
      
      A special case is when an endpoint requires COOKIE-ECHO
      chunks to be authenticated:
      
        ---------- INIT[RANDOM; CHUNKS; HMAC-ALGO] ---------->
        <------- INIT-ACK[RANDOM; CHUNKS; HMAC-ALGO] ---------
        ------------------ AUTH; COOKIE-ECHO ---------------->
        <-------------------- COOKIE-ACK ---------------------
      
      RFC4895, section 6.3. Receiving Authenticated Chunks says:
      
        The receiver MUST use the HMAC algorithm indicated in
        the HMAC Identifier field. If this algorithm was not
        specified by the receiver in the HMAC-ALGO parameter in
        the INIT or INIT-ACK chunk during association setup, the
        AUTH chunk and all the chunks after it MUST be discarded
        and an ERROR chunk SHOULD be sent with the error cause
        defined in Section 4.1. [...] If no endpoint pair shared
        key has been configured for that Shared Key Identifier,
        all authenticated chunks MUST be silently discarded. [...]
      
        When an endpoint requires COOKIE-ECHO chunks to be
        authenticated, some special procedures have to be followed
        because the reception of a COOKIE-ECHO chunk might result
        in the creation of an SCTP association. If a packet arrives
        containing an AUTH chunk as a first chunk, a COOKIE-ECHO
        chunk as the second chunk, and possibly more chunks after
        them, and the receiver does not have an STCB for that
        packet, then authentication is based on the contents of
        the COOKIE-ECHO chunk. In this situation, the receiver MUST
        authenticate the chunks in the packet by using the RANDOM
        parameters, CHUNKS parameters and HMAC_ALGO parameters
        obtained from the COOKIE-ECHO chunk, and possibly a local
        shared secret as inputs to the authentication procedure
        specified in Section 6.3. If authentication fails, then
        the packet is discarded. If the authentication is successful,
        the COOKIE-ECHO and all the chunks after the COOKIE-ECHO
        MUST be processed. If the receiver has an STCB, it MUST
        process the AUTH chunk as described above using the STCB
        from the existing association to authenticate the
        COOKIE-ECHO chunk and all the chunks after it. [...]
      
      Commit bbd0d598 introduced the possibility to receive
      and verification of AUTH chunk, including the edge case for
      authenticated COOKIE-ECHO. On reception of COOKIE-ECHO,
      the function sctp_sf_do_5_1D_ce() handles processing,
      unpacks and creates a new association if it passed sanity
      checks and also tests for authentication chunks being
      present. After a new association has been processed, it
      invokes sctp_process_init() on the new association and
      walks through the parameter list it received from the INIT
      chunk. It checks SCTP_PARAM_RANDOM, SCTP_PARAM_HMAC_ALGO
      and SCTP_PARAM_CHUNKS, and copies them into asoc->peer
      meta data (peer_random, peer_hmacs, peer_chunks) in case
      sysctl -w net.sctp.auth_enable=1 is set. If in INIT's
      SCTP_PARAM_SUPPORTED_EXT parameter SCTP_CID_AUTH is set,
      peer_random != NULL and peer_hmacs != NULL the peer is to be
      assumed asoc->peer.auth_capable=1, in any other case
      asoc->peer.auth_capable=0.
      
      Now, if in sctp_sf_do_5_1D_ce() chunk->auth_chunk is
      available, we set up a fake auth chunk and pass that on to
      sctp_sf_authenticate(), which at latest in
      sctp_auth_calculate_hmac() reliably dereferences a NULL pointer
      at position 0..0008 when setting up the crypto key in
      crypto_hash_setkey() by using asoc->asoc_shared_key that is
      NULL as condition key_id == asoc->active_key_id is true if
      the AUTH chunk was injected correctly from remote. This
      happens no matter what net.sctp.auth_enable sysctl says.
      
      The fix is to check for net->sctp.auth_enable and for
      asoc->peer.auth_capable before doing any operations like
      sctp_sf_authenticate() as no key is activated in
      sctp_auth_asoc_init_active_key() for each case.
      
      Now as RFC4895 section 6.3 states that if the used HMAC-ALGO
      passed from the INIT chunk was not used in the AUTH chunk, we
      SHOULD send an error; however in this case it would be better
      to just silently discard such a maliciously prepared handshake
      as we didn't even receive a parameter at all. Also, as our
      endpoint has no shared key configured, section 6.3 says that
      MUST silently discard, which we are doing from now onwards.
      
      Before calling sctp_sf_pdiscard(), we need not only to free
      the association, but also the chunk->auth_chunk skb, as
      commit bbd0d598 created a skb clone in that case.
      
      I have tested this locally by using netfilter's nfqueue and
      re-injecting packets into the local stack after maliciously
      modifying the INIT chunk (removing RANDOM; HMAC-ALGO param)
      and the SCTP packet containing the COOKIE_ECHO (injecting
      AUTH chunk before COOKIE_ECHO). Fixed with this patch applied.
      
      Fixes: bbd0d598 ("[SCTP]: Implement the receive and verification of AUTH chunk")
      Signed-off-by: default avatarDaniel Borkmann <dborkman@redhat.com>
      Cc: Vlad Yasevich <yasevich@gmail.com>
      Cc: Neil Horman <nhorman@tuxdriver.com>
      Acked-by: default avatarVlad Yasevich <vyasevich@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      00c53b02
    • Xin Long's avatar
      ip_tunnel:multicast process cause panic due to skb->_skb_refdst NULL pointer · 2a01e9e4
      Xin Long authored
      [ Upstream commit 10ddceb2 ]
      
      when ip_tunnel process multicast packets, it may check if the packet is looped
      back packet though 'rt_is_output_route(skb_rtable(skb))' in ip_tunnel_rcv(),
      but before that , skb->_skb_refdst has been dropped in iptunnel_pull_header(),
      so which leads to a panic.
      
      fix the bug: https://bugzilla.kernel.org/show_bug.cgi?id=70681Signed-off-by: default avatarXin Long <lucien.xin@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      2a01e9e4
    • Michael Chan's avatar
      tg3: Don't check undefined error bits in RXBD · baef79a5
      Michael Chan authored
      [ Upstream commit d7b95315 ]
      
      Redefine the RXD_ERR_MASK to include only relevant error bits. This fixes
      a customer reported issue of randomly dropping packets on the 5719.
      Signed-off-by: default avatarMichael Chan <mchan@broadcom.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      baef79a5
    • Hans Schillstrom's avatar
      ipv6: ipv6_find_hdr restore prev functionality · 537fc749
      Hans Schillstrom authored
      [ Upstream commit accfe0e3 ]
      
      The commit 9195bb8e ("ipv6: improve
      ipv6_find_hdr() to skip empty routing headers") broke ipv6_find_hdr().
      
      When a target is specified like IPPROTO_ICMPV6 ipv6_find_hdr()
      returns -ENOENT when it's found, not the header as expected.
      
      A part of IPVS is broken and possible also nft_exthdr_eval().
      When target is -1 which it is most cases, it works.
      
      This patch exits the do while loop if the specific header is found
      so the nexthdr could be returned as expected.
      Reported-by: default avatarArt -kwaak- van Breemen <ard@telegraafnet.nl>
      Signed-off-by: default avatarHans Schillstrom <hans@schillstrom.com>
      CC:Ansis Atteka <aatteka@nicira.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      537fc749
    • Edward Cree's avatar
      sfc: check for NULL efx->ptp_data in efx_ptp_event · 1c026764
      Edward Cree authored
      [ Upstream commit 8f355e5c ]
      
      If we receive a PTP event from the NIC when we haven't set up PTP state
      in the driver, we attempt to read through a NULL pointer efx->ptp_data,
      triggering a panic.
      Signed-off-by: default avatarEdward Cree <ecree@solarflare.com>
      Acked-by: default avatarShradha Shah <sshah@solarflare.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      1c026764
    • Hannes Frederic Sowa's avatar
      ipv6: reuse ip6_frag_id from ip6_ufo_append_data · 3bbb02a1
      Hannes Frederic Sowa authored
      [ Upstream commit 916e4cf4 ]
      
      Currently we generate a new fragmentation id on UFO segmentation. It
      is pretty hairy to identify the correct net namespace and dst there.
      Especially tunnels use IFF_XMIT_DST_RELEASE and thus have no skb_dst
      available at all.
      
      This causes unreliable or very predictable ipv6 fragmentation id
      generation while segmentation.
      
      Luckily we already have pregenerated the ip6_frag_id in
      ip6_ufo_append_data and can use it here.
      Signed-off-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      3bbb02a1
    • Jason Wang's avatar
      virtio-net: alloc big buffers also when guest can receive UFO · 251ed2ca
      Jason Wang authored
      [ Upstream commit 0e7ede80 ]
      
      We should alloc big buffers also when guest can receive UFO
      packets to let the big packets fit into guest rx buffer.
      
      Fixes 5c516751
      (virtio-net: Allow UFO feature to be set and advertised.)
      
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Sridhar Samudrala <sri@us.ibm.com>
      Signed-off-by: default avatarJason Wang <jasowang@redhat.com>
      Acked-by: default avatarMichael S. Tsirkin <mst@redhat.com>
      Acked-by: default avatarRusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      251ed2ca
    • Duan Jiong's avatar
      neigh: recompute reachabletime before returning from neigh_periodic_work() · df3839f9
      Duan Jiong authored
      [ Upstream commit feff9ab2 ]
      
      If the neigh table's entries is less than gc_thresh1, the function
      will return directly, and the reachabletime will not be recompute,
      so the reachabletime can be guessed.
      Signed-off-by: default avatarDuan Jiong <duanj.fnst@cn.fujitsu.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      df3839f9
    • Eric Dumazet's avatar
      net-tcp: fastopen: fix high order allocations · 10bbbd58
      Eric Dumazet authored
      [ Upstream commit f5ddcbbb ]
      
      This patch fixes two bugs in fastopen :
      
      1) The tcp_sendmsg(...,  @size) argument was ignored.
      
         Code was relying on user not fooling the kernel with iovec mismatches
      
      2) When MTU is about 64KB, tcp_send_syn_data() attempts order-5
      allocations, which are likely to fail when memory gets fragmented.
      
      Fixes: 783237e8 ("net-tcp: Fast Open client - sending SYN-data")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: default avatarYuchung Cheng <ycheng@google.com>
      Tested-by: default avatarYuchung Cheng <ycheng@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      10bbbd58
    • Fernando Luis Vazquez Cao's avatar
      tun: remove bogus hardware vlan acceleration flags from vlan_features · cbf5a3f6
      Fernando Luis Vazquez Cao authored
      [ Upstream commit 6671b224 ]
      
      Even though only the outer vlan tag can be HW accelerated in the transmission
      path, in the TUN/TAP driver vlan_features mirrors hw_features, which happens
      to have the NETIF_F_HW_VLAN_?TAG_TX flags set. Because of this, during packet
      tranmisssion through a stacked vlan device dev_hard_start_xmit, (incorrectly)
      assuming that the vlan device supports hardware vlan acceleration, does not
      add the vlan header to the skb payload and the inner vlan tags are lost
      (vlan_tci contains the outer vlan tag when userspace reads the packet from
      the tap device).
      Signed-off-by: default avatarFernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
      Signed-off-by: default avatarToshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      cbf5a3f6
    • Toshiaki Makita's avatar
      veth: Fix vlan_features so as to be able to use stacked vlan interfaces · cc9f71b2
      Toshiaki Makita authored
      [ Upstream commit 8d0d21f4 ]
      
      Even if we create a stacked vlan interface such as veth0.10.20, it sends
      single tagged frames (tagged with only vid 10).
      Because vlan_features of a veth interface has the
      NETIF_F_HW_VLAN_[CTAG/STAG]_TX bits, veth0.10 also has that feature, so
      dev_hard_start_xmit(veth0.10) doesn't call __vlan_put_tag() and
      vlan_dev_hard_start_xmit(veth0.10) overwrites vlan_tci.
      This prevents us from using a combination of 802.1ad and 802.1Q
      in containers, etc.
      Signed-off-by: default avatarToshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Acked-by: default avatarFlavio Leitner <fbl@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      cc9f71b2
  5. 12 Mar, 2014 13 commits
    • Waiman Long's avatar
      SELinux: Increase ebitmap_node size for 64-bit configuration · 17d633ca
      Waiman Long authored
      commit a767f680 upstream.
      
      Currently, the ebitmap_node structure has a fixed size of 32 bytes. On
      a 32-bit system, the overhead is 8 bytes, leaving 24 bytes for being
      used as bitmaps. The overhead ratio is 1/4.
      
      On a 64-bit system, the overhead is 16 bytes. Therefore, only 16 bytes
      are left for bitmap purpose and the overhead ratio is 1/2. With a
      3.8.2 kernel, a boot-up operation will cause the ebitmap_get_bit()
      function to be called about 9 million times. The average number of
      ebitmap_node traversal is about 3.7.
      
      This patch increases the size of the ebitmap_node structure to 64
      bytes for 64-bit system to keep the overhead ratio at 1/4. This may
      also improve performance a little bit by making node to node traversal
      less frequent (< 2) as more bits are available in each node.
      Signed-off-by: default avatarWaiman Long <Waiman.Long@hp.com>
      Acked-by: default avatarStephen Smalley <sds@tycho.nsa.gov>
      Signed-off-by: default avatarPaul Moore <pmoore@redhat.com>
      Signed-off-by: default avatarEric Paris <eparis@redhat.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      17d633ca
    • Waiman Long's avatar
      SELinux: Reduce overhead of mls_level_isvalid() function call · 3b32517b
      Waiman Long authored
      commit fee71142 upstream.
      
      While running the high_systime workload of the AIM7 benchmark on
      a 2-socket 12-core Westmere x86-64 machine running 3.10-rc4 kernel
      (with HT on), it was found that a pretty sizable amount of time was
      spent in the SELinux code. Below was the perf trace of the "perf
      record -a -s" of a test run at 1500 users:
      
        5.04%            ls  [kernel.kallsyms]     [k] ebitmap_get_bit
        1.96%            ls  [kernel.kallsyms]     [k] mls_level_isvalid
        1.95%            ls  [kernel.kallsyms]     [k] find_next_bit
      
      The ebitmap_get_bit() was the hottest function in the perf-report
      output.  Both the ebitmap_get_bit() and find_next_bit() functions
      were, in fact, called by mls_level_isvalid(). As a result, the
      mls_level_isvalid() call consumed 8.95% of the total CPU time of
      all the 24 virtual CPUs which is quite a lot. The majority of the
      mls_level_isvalid() function invocations come from the socket creation
      system call.
      
      Looking at the mls_level_isvalid() function, it is checking to see
      if all the bits set in one of the ebitmap structure are also set in
      another one as well as the highest set bit is no bigger than the one
      specified by the given policydb data structure. It is doing it in
      a bit-by-bit manner. So if the ebitmap structure has many bits set,
      the iteration loop will be done many times.
      
      The current code can be rewritten to use a similar algorithm as the
      ebitmap_contains() function with an additional check for the
      highest set bit. The ebitmap_contains() function was extended to
      cover an optional additional check for the highest set bit, and the
      mls_level_isvalid() function was modified to call ebitmap_contains().
      
      With that change, the perf trace showed that the used CPU time drop
      down to just 0.08% (ebitmap_contains + mls_level_isvalid) of the
      total which is about 100X less than before.
      
        0.07%            ls  [kernel.kallsyms]     [k] ebitmap_contains
        0.05%            ls  [kernel.kallsyms]     [k] ebitmap_get_bit
        0.01%            ls  [kernel.kallsyms]     [k] mls_level_isvalid
        0.01%            ls  [kernel.kallsyms]     [k] find_next_bit
      
      The remaining ebitmap_get_bit() and find_next_bit() functions calls
      are made by other kernel routines as the new mls_level_isvalid()
      function will not call them anymore.
      
      This patch also improves the high_systime AIM7 benchmark result,
      though the improvement is not as impressive as is suggested by the
      reduction in CPU time spent in the ebitmap functions. The table below
      shows the performance change on the 2-socket x86-64 system (with HT
      on) mentioned above.
      
      +--------------+---------------+----------------+-----------------+
      |   Workload   | mean % change | mean % change  | mean % change   |
      |              | 10-100 users  | 200-1000 users | 1100-2000 users |
      +--------------+---------------+----------------+-----------------+
      | high_systime |     +0.1%     |     +0.9%      |     +2.6%       |
      +--------------+---------------+----------------+-----------------+
      Signed-off-by: default avatarWaiman Long <Waiman.Long@hp.com>
      Acked-by: default avatarStephen Smalley <sds@tycho.nsa.gov>
      Signed-off-by: default avatarPaul Moore <pmoore@redhat.com>
      Signed-off-by: default avatarEric Paris <eparis@redhat.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      3b32517b
    • Mel Gorman's avatar
      mm: do not walk all of system memory during show_mem · 67e204a0
      Mel Gorman authored
      commit c78e9363 upstream.
      
      It has been reported on very large machines that show_mem is taking almost
      5 minutes to display information.  This is a serious problem if there is
      an OOM storm.  The bulk of the cost is in show_mem doing a very expensive
      PFN walk to give us the following information
      
        Total RAM:       Also available as totalram_pages
        Highmem pages:   Also available as totalhigh_pages
        Reserved pages:  Can be inferred from the zone structure
        Shared pages:    PFN walk required
        Unshared pages:  PFN walk required
        Quick pages:     Per-cpu walk required
      
      Only the shared/unshared pages requires a full PFN walk but that
      information is useless.  It is also inaccurate as page pins of unshared
      pages would be accounted for as shared.  Even if the information was
      accurate, I'm struggling to think how the shared/unshared information
      could be useful for debugging OOM conditions.  Maybe it was useful before
      rmap existed when reclaiming shared pages was costly but it is less
      relevant today.
      
      The PFN walk could be optimised a bit but why bother as the information is
      useless.  This patch deletes the PFN walker and infers the total RAM,
      highmem and reserved pages count from struct zone.  It omits the
      shared/unshared page usage on the grounds that it is useless.  It also
      corrects the reporting of HighMem as HighMem/MovableOnly as ZONE_MOVABLE
      has similar problems to HighMem with respect to lowmem/highmem exhaustion.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Acked-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      67e204a0
    • Jason Baron's avatar
      epoll: do not take the nested ep->mtx on EPOLL_CTL_DEL · 230a558c
      Jason Baron authored
      commit 4ff36ee9 upstream.
      
      The EPOLL_CTL_DEL path of epoll contains a classic, ab-ba deadlock.
      That is, epoll_ctl(a, EPOLL_CTL_DEL, b, x), will deadlock with
      epoll_ctl(b, EPOLL_CTL_DEL, a, x).  The deadlock was introduced with
      commmit 67347fe4 ("epoll: do not take global 'epmutex' for simple
      topologies").
      
      The acquistion of the ep->mtx for the destination 'ep' was added such
      that a concurrent EPOLL_CTL_ADD operation would see the correct state of
      the ep (Specifically, the check for '!list_empty(&f.file->f_ep_links')
      
      However, by simply not acquiring the lock, we do not serialize behind
      the ep->mtx from the add path, and thus may perform a full path check
      when if we had waited a little longer it may not have been necessary.
      However, this is a transient state, and performing the full loop
      checking in this case is not harmful.
      
      The important point is that we wouldn't miss doing the full loop
      checking when required, since EPOLL_CTL_ADD always locks any 'ep's that
      its operating upon.  The reason we don't need to do lock ordering in the
      add path, is that we are already are holding the global 'epmutex'
      whenever we do the double lock.  Further, the original posting of this
      patch, which was tested for the intended performance gains, did not
      perform this additional locking.
      Signed-off-by: default avatarJason Baron <jbaron@akamai.com>
      Cc: Nathan Zimmer <nzimmer@sgi.com>
      Cc: Eric Wong <normalperson@yhbt.net>
      Cc: Nelson Elhage <nelhage@nelhage.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      230a558c
    • Jason Baron's avatar
      epoll: do not take global 'epmutex' for simple topologies · 107c1943
      Jason Baron authored
      commit 67347fe4 upstream.
      
      When calling EPOLL_CTL_ADD for an epoll file descriptor that is attached
      directly to a wakeup source, we do not need to take the global 'epmutex',
      unless the epoll file descriptor is nested.  The purpose of taking the
      'epmutex' on add is to prevent complex topologies such as loops and deep
      wakeup paths from forming in parallel through multiple EPOLL_CTL_ADD
      operations.  However, for the simple case of an epoll file descriptor
      attached directly to a wakeup source (with no nesting), we do not need to
      hold the 'epmutex'.
      
      This patch along with 'epoll: optimize EPOLL_CTL_DEL using rcu' improves
      scalability on larger systems.  Quoting Nathan Zimmer's mail on SPECjbb
      performance:
      
      "On the 16 socket run the performance went from 35k jOPS to 125k jOPS.  In
      addition the benchmark when from scaling well on 10 sockets to scaling
      well on just over 40 sockets.
      
      ...
      
      Currently the benchmark stops scaling at around 40-44 sockets but it seems like
      I found a second unrelated bottleneck."
      
      [akpm@linux-foundation.org: use `bool' for boolean variables, remove unneeded/undesirable cast of void*, add missed ep_scan_ready_list() kerneldoc]
      Signed-off-by: default avatarJason Baron <jbaron@akamai.com>
      Tested-by: default avatarNathan Zimmer <nzimmer@sgi.com>
      Cc: Eric Wong <normalperson@yhbt.net>
      Cc: Nelson Elhage <nelhage@nelhage.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      107c1943
    • Jason Baron's avatar
      epoll: optimize EPOLL_CTL_DEL using rcu · 81ff0d3b
      Jason Baron authored
      commit ae10b2b4 upstream.
      
      Nathan Zimmer found that once we get over 10+ cpus, the scalability of
      SPECjbb falls over due to the contention on the global 'epmutex', which is
      taken in on EPOLL_CTL_ADD and EPOLL_CTL_DEL operations.
      
      Patch #1 removes the 'epmutex' lock completely from the EPOLL_CTL_DEL path
      by using rcu to guard against any concurrent traversals.
      
      Patch #2 remove the 'epmutex' lock from EPOLL_CTL_ADD operations for
      simple topologies.  IE when adding a link from an epoll file descriptor to
      a wakeup source, where the epoll file descriptor is not nested.
      
      This patch (of 2):
      
      Optimize EPOLL_CTL_DEL such that it does not require the 'epmutex' by
      converting the file->f_ep_links list into an rcu one.  In this way, we can
      traverse the epoll network on the add path in parallel with deletes.
      Since deletes can't create loops or worse wakeup paths, this is safe.
      
      This patch in combination with the patch "epoll: Do not take global 'epmutex'
      for simple topologies", shows a dramatic performance improvement in
      scalability for SPECjbb.
      Signed-off-by: default avatarJason Baron <jbaron@akamai.com>
      Tested-by: default avatarNathan Zimmer <nzimmer@sgi.com>
      Cc: Eric Wong <normalperson@yhbt.net>
      Cc: Nelson Elhage <nelhage@nelhage.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      CC: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      81ff0d3b
    • Jiri Slaby's avatar
      x86/dumpstack: Fix printk_address for direct addresses · dd7ab812
      Jiri Slaby authored
      commit 5f01c988 upstream.
      
      Consider a kernel crash in a module, simulated the following way:
      
       static int my_init(void)
       {
               char *map = (void *)0x5;
               *map = 3;
               return 0;
       }
       module_init(my_init);
      
      When we turn off FRAME_POINTERs, the very first instruction in
      that function causes a BUG. The problem is that we print IP in
      the BUG report using %pB (from printk_address). And %pB
      decrements the pointer by one to fix printing addresses of
      functions with tail calls.
      
      This was added in commit 71f9e598 ("x86, dumpstack: Use
      %pB format specifier for stack trace") to fix the call stack
      printouts.
      
      So instead of correct output:
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000005
        IP: [<ffffffffa01ac000>] my_init+0x0/0x10 [pb173]
      
      We get:
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000005
        IP: [<ffffffffa0152000>] 0xffffffffa0151fff
      
      To fix that, we use %pS only for stack addresses printouts (via
      newly added printk_stack_address) and %pB for regs->ip (via
      printk_address). I.e. we revert to the old behaviour for all
      except call stacks. And since from all those reliable is 1, we
      remove that parameter from printk_address.
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      Cc: Namhyung Kim <namhyung@gmail.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: joe@perches.com
      Cc: jirislaby@gmail.com
      Link: http://lkml.kernel.org/r/1382706418-8435-1-git-send-email-jslaby@suse.czSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      dd7ab812
    • NeilBrown's avatar
      SUNRPC: close a rare race in xs_tcp_setup_socket. · 6456bd97
      NeilBrown authored
      commit 93dc41bd upstream.
      
      We have one report of a crash in xs_tcp_setup_socket.
      The call path to the crash is:
      
        xs_tcp_setup_socket -> inet_stream_connect -> lock_sock_nested.
      
      The 'sock' passed to that last function is NULL.
      
      The only way I can see this happening is a concurrent call to
      xs_close:
      
        xs_close -> xs_reset_transport -> sock_release -> inet_release
      
      inet_release sets:
         sock->sk = NULL;
      inet_stream_connect calls
         lock_sock(sock->sk);
      which gets NULL.
      
      All calls to xs_close are protected by XPRT_LOCKED as are most
      activations of the workqueue which runs xs_tcp_setup_socket.
      The exception is xs_tcp_schedule_linger_timeout.
      
      So presumably the timeout queued by the later fires exactly when some
      other code runs xs_close().
      
      To protect against this we can move the cancel_delayed_work_sync()
      call from xs_destory() to xs_close().
      
      As xs_close is never called from the worker scheduled on
      ->connect_worker, this can never deadlock.
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      [Trond: Make it safe to call cancel_delayed_work_sync() on AF_LOCAL sockets]
      Signed-off-by: default avatarTrond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      6456bd97
    • Shawn Bohrer's avatar
      sched/rt: Remove redundant nr_cpus_allowed test · 4aef0b11
      Shawn Bohrer authored
      commit 6bfa687c upstream.
      
      In 76854c7e ("sched: Use
      rt.nr_cpus_allowed to recover select_task_rq() cycles") an
      optimization was added to select_task_rq_rt() that immediately
      returns when p->nr_cpus_allowed == 1 at the beginning of the
      function.
      
      This makes the latter p->nr_cpus_allowed > 1 check redundant,
      which can now be removed.
      Signed-off-by: default avatarShawn Bohrer <sbohrer@rgmadvisors.com>
      Reviewed-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Cc: Mike Galbraith <mgalbraith@suse.de>
      Cc: tomk@rgmadvisors.com
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1380914693-24634-1-git-send-email-shawn.bohrer@gmail.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      4aef0b11
    • Peter Zijlstra's avatar
      sched/rt: Add missing rmb() · d171cbfc
      Peter Zijlstra authored
      commit 7c3f2ab7 upstream.
      
      While discussing the proposed SCHED_DEADLINE patches which in parts
      mimic the existing FIFO code it was noticed that the wmb in
      rt_set_overloaded() didn't have a matching barrier.
      
      The only site using rt_overloaded() to test the rto_count is
      pull_rt_task() and we should issue a matching rmb before then assuming
      there's an rto_mask bit set.
      
      Without that smp_rmb() in there we could actually miss seeing the
      rto_mask bit.
      
      Also, change to using smp_[wr]mb(), even though this is SMP only code;
      memory barriers without smp_ always make me think they're against
      hardware of some sort.
      Signed-off-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Cc: vincent.guittot@linaro.org
      Cc: luca.abeni@unitn.it
      Cc: bruce.ashfield@windriver.com
      Cc: dhaval.giani@gmail.com
      Cc: rostedt@goodmis.org
      Cc: hgu1972@gmail.com
      Cc: oleg@redhat.com
      Cc: fweisbec@gmail.com
      Cc: darren@dvhart.com
      Cc: johan.eker@ericsson.com
      Cc: p.faure@akatech.ch
      Cc: paulmck@linux.vnet.ibm.com
      Cc: raistlin@linux.it
      Cc: claudio@evidence.eu.com
      Cc: insop.song@gmail.com
      Cc: michael@amarulasolutions.com
      Cc: liming.wang@windriver.com
      Cc: fchecconi@gmail.com
      Cc: jkacur@redhat.com
      Cc: tommaso.cucinotta@sssup.it
      Cc: Juri Lelli <juri.lelli@gmail.com>
      Cc: harald.gustafsson@ericsson.com
      Cc: nicola.manica@disi.unitn.it
      Cc: tglx@linutronix.de
      Link: http://lkml.kernel.org/r/20131015103507.GF10651@twins.programming.kicks-ass.netSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      d171cbfc
    • Mel Gorman's avatar
      sched: Assign correct scheduling domain to 'sd_llc' · 4edad9c1
      Mel Gorman authored
      commit 5d4cf996 upstream.
      
      Commit 42eb088e (sched: Avoid NULL dereference on sd_busy) corrected a NULL
      dereference on sd_busy but the fix also altered what scheduling domain it
      used for the 'sd_llc' percpu variable.
      
      One impact of this is that a task selecting a runqueue may consider
      idle CPUs that are not cache siblings as candidates for running.
      Tasks are then running on CPUs that are not cache hot.
      
      This was found through bisection where ebizzy threads were not seeing equal
      performance and it looked like a scheduling fairness issue. This patch
      mitigates but does not completely fix the problem on all machines tested
      implying there may be an additional bug or a common root cause. Here are
      the average range of performance seen by individual ebizzy threads. It
      was tested on top of candidate patches related to x86 TLB range flushing.
      
      	4-core machine
      			    3.13.0-rc3            3.13.0-rc3
      			       vanilla            fixsd-v3r3
      	Mean   1        0.00 (  0.00%)        0.00 (  0.00%)
      	Mean   2        0.34 (  0.00%)        0.10 ( 70.59%)
      	Mean   3        1.29 (  0.00%)        0.93 ( 27.91%)
      	Mean   4        7.08 (  0.00%)        0.77 ( 89.12%)
      	Mean   5      193.54 (  0.00%)        2.14 ( 98.89%)
      	Mean   6      151.12 (  0.00%)        2.06 ( 98.64%)
      	Mean   7      115.38 (  0.00%)        2.04 ( 98.23%)
      	Mean   8      108.65 (  0.00%)        1.92 ( 98.23%)
      
      	8-core machine
      	Mean   1         0.00 (  0.00%)        0.00 (  0.00%)
      	Mean   2         0.40 (  0.00%)        0.21 ( 47.50%)
      	Mean   3        23.73 (  0.00%)        0.89 ( 96.25%)
      	Mean   4        12.79 (  0.00%)        1.04 ( 91.87%)
      	Mean   5        13.08 (  0.00%)        2.42 ( 81.50%)
      	Mean   6        23.21 (  0.00%)       69.46 (-199.27%)
      	Mean   7        15.85 (  0.00%)      101.72 (-541.77%)
      	Mean   8       109.37 (  0.00%)       19.13 ( 82.51%)
      	Mean   12      124.84 (  0.00%)       28.62 ( 77.07%)
      	Mean   16      113.50 (  0.00%)       24.16 ( 78.71%)
      
      It's eliminated for one machine and reduced for another.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Cc: Alex Shi <alex.shi@linaro.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: H Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20131217092124.GV11295@suse.deSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      4edad9c1
    • Peter Zijlstra's avatar
      sched: Initialize power_orig for overlapping groups · 8dc051a7
      Peter Zijlstra authored
      commit 8e8339a3 upstream.
      
      Yinghai reported that he saw a /0 in sg_capacity on his EX parts.
      Make sure to always initialize power_orig now that we actually use it.
      
      Ideally build_sched_domains() -> init_sched_groups_power() would also
      initialize this; but for some yet unexplained reason some setups seem
      to miss updates there.
      Reported-by: default avatarYinghai Lu <yinghai@kernel.org>
      Tested-by: default avatarYinghai Lu <yinghai@kernel.org>
      Signed-off-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/n/tip-l8ng2m9uml6fhibln8wqpom7@git.kernel.orgSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      8dc051a7
    • Peter Zijlstra's avatar
      sched: Avoid NULL dereference on sd_busy · a2198407
      Peter Zijlstra authored
      commit 42eb088e upstream.
      
      Commit 37dc6b50 ("sched: Remove unnecessary iteration over sched
      domains to update nr_busy_cpus") forgot to clear 'sd_busy' under some
      conditions leading to a possible NULL deref in set_cpu_sd_state_idle().
      Reported-by: default avatarAnton Blanchard <anton@samba.org>
      Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
      Signed-off-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/20131118113701.GF3866@twins.programming.kicks-ass.netSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      a2198407