1. 06 May, 2015 18 commits
    • Filipe Manana's avatar
      Btrfs: fix inode eviction infinite loop after cloning into it · 449b4627
      Filipe Manana authored
      commit ccccf3d6 upstream.
      
      If we attempt to clone a 0 length region into a file we can end up
      inserting a range in the inode's extent_io tree with a start offset
      that is greater then the end offset, which triggers immediately the
      following warning:
      
      [ 3914.619057] WARNING: CPU: 17 PID: 4199 at fs/btrfs/extent_io.c:435 insert_state+0x4b/0x10b [btrfs]()
      [ 3914.620886] BTRFS: end < start 4095 4096
      (...)
      [ 3914.638093] Call Trace:
      [ 3914.638636]  [<ffffffff81425fd9>] dump_stack+0x4c/0x65
      [ 3914.639620]  [<ffffffff81045390>] warn_slowpath_common+0xa1/0xbb
      [ 3914.640789]  [<ffffffffa03ca44f>] ? insert_state+0x4b/0x10b [btrfs]
      [ 3914.642041]  [<ffffffff810453f0>] warn_slowpath_fmt+0x46/0x48
      [ 3914.643236]  [<ffffffffa03ca44f>] insert_state+0x4b/0x10b [btrfs]
      [ 3914.644441]  [<ffffffffa03ca729>] __set_extent_bit+0x107/0x3f4 [btrfs]
      [ 3914.645711]  [<ffffffffa03cb256>] lock_extent_bits+0x65/0x1bf [btrfs]
      [ 3914.646914]  [<ffffffff8142b2fb>] ? _raw_spin_unlock+0x28/0x33
      [ 3914.648058]  [<ffffffffa03cbac4>] ? test_range_bit+0xcc/0xde [btrfs]
      [ 3914.650105]  [<ffffffffa03cb3c3>] lock_extent+0x13/0x15 [btrfs]
      [ 3914.651361]  [<ffffffffa03db39e>] lock_extent_range+0x3d/0xcd [btrfs]
      [ 3914.652761]  [<ffffffffa03de1fe>] btrfs_ioctl_clone+0x278/0x388 [btrfs]
      [ 3914.654128]  [<ffffffff811226dd>] ? might_fault+0x58/0xb5
      [ 3914.655320]  [<ffffffffa03e0909>] btrfs_ioctl+0xb51/0x2195 [btrfs]
      (...)
      [ 3914.669271] ---[ end trace 14843d3e2e622fc1 ]---
      
      This later makes the inode eviction handler enter an infinite loop that
      keeps dumping the following warning over and over:
      
      [ 3915.117629] WARNING: CPU: 22 PID: 4228 at fs/btrfs/extent_io.c:435 insert_state+0x4b/0x10b [btrfs]()
      [ 3915.119913] BTRFS: end < start 4095 4096
      (...)
      [ 3915.137394] Call Trace:
      [ 3915.137913]  [<ffffffff81425fd9>] dump_stack+0x4c/0x65
      [ 3915.139154]  [<ffffffff81045390>] warn_slowpath_common+0xa1/0xbb
      [ 3915.140316]  [<ffffffffa03ca44f>] ? insert_state+0x4b/0x10b [btrfs]
      [ 3915.141505]  [<ffffffff810453f0>] warn_slowpath_fmt+0x46/0x48
      [ 3915.142709]  [<ffffffffa03ca44f>] insert_state+0x4b/0x10b [btrfs]
      [ 3915.143849]  [<ffffffffa03ca729>] __set_extent_bit+0x107/0x3f4 [btrfs]
      [ 3915.145120]  [<ffffffffa038c1e3>] ? btrfs_kill_super+0x17/0x23 [btrfs]
      [ 3915.146352]  [<ffffffff811548f6>] ? deactivate_locked_super+0x3b/0x50
      [ 3915.147565]  [<ffffffffa03cb256>] lock_extent_bits+0x65/0x1bf [btrfs]
      [ 3915.148785]  [<ffffffff8142b7e2>] ? _raw_write_unlock+0x28/0x33
      [ 3915.149931]  [<ffffffffa03bc325>] btrfs_evict_inode+0x196/0x482 [btrfs]
      [ 3915.151154]  [<ffffffff81168904>] evict+0xa0/0x148
      [ 3915.152094]  [<ffffffff811689e5>] dispose_list+0x39/0x43
      [ 3915.153081]  [<ffffffff81169564>] evict_inodes+0xdc/0xeb
      [ 3915.154062]  [<ffffffff81154418>] generic_shutdown_super+0x49/0xef
      [ 3915.155193]  [<ffffffff811546d1>] kill_anon_super+0x13/0x1e
      [ 3915.156274]  [<ffffffffa038c1e3>] btrfs_kill_super+0x17/0x23 [btrfs]
      (...)
      [ 3915.167404] ---[ end trace 14843d3e2e622fc2 ]---
      
      So just bail out of the clone ioctl if the length of the region to clone
      is zero, without locking any extent range, in order to prevent this issue
      (same behaviour as a pwrite with a 0 length for example).
      
      This is trivial to reproduce. For example, the steps for the test I just
      made for fstests:
      
        mkfs.btrfs -f SCRATCH_DEV
        mount SCRATCH_DEV $SCRATCH_MNT
      
        touch $SCRATCH_MNT/foo
        touch $SCRATCH_MNT/bar
      
        $CLONER_PROG -s 0 -d 4096 -l 0 $SCRATCH_MNT/foo $SCRATCH_MNT/bar
        umount $SCRATCH_MNT
      
      A test case for fstests follows soon.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarOmar Sandoval <osandov@osandov.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      449b4627
    • David Sterba's avatar
      btrfs: don't accept bare namespace as a valid xattr · 1bb2835e
      David Sterba authored
      commit 3c3b04d1 upstream.
      
      Due to insufficient check in btrfs_is_valid_xattr, this unexpectedly
      works:
      
       $ touch file
       $ setfattr -n user. -v 1 file
       $ getfattr -d file
      user.="1"
      
      ie. the missing attribute name after the namespace.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=94291Reported-by: default avatarWilliam Douglas <william.douglas@intel.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1bb2835e
    • Filipe Manana's avatar
      Btrfs: fix log tree corruption when fs mounted with -o discard · 3909e5a9
      Filipe Manana authored
      commit dcc82f47 upstream.
      
      While committing a transaction we free the log roots before we write the
      new super block. Freeing the log roots implies marking the disk location
      of every node/leaf (metadata extent) as pinned before the new super block
      is written. This is to prevent the disk location of log metadata extents
      from being reused before the new super block is written, otherwise we
      would have a corrupted log tree if before the new super block is written
      a crash/reboot happens and the location of any log tree metadata extent
      ended up being reused and rewritten.
      
      Even though we pinned the log tree's metadata extents, we were issuing a
      discard against them if the fs was mounted with the -o discard option,
      resulting in corruption of the log tree if a crash/reboot happened before
      writing the new super block - the next time the fs was mounted, during
      the log replay process we would find nodes/leafs of the log btree with
      a content full of zeroes, causing the process to fail and require the
      use of the tool btrfs-zero-log to wipeout the log tree (and all data
      previously fsynced becoming lost forever).
      
      Fix this by not doing a discard when pinning an extent. The discard will
      be done later when it's safe (after the new super block is committed) at
      extent-tree.c:btrfs_finish_extent_commit().
      
      Fixes: e688b725 (Btrfs: fix extent pinning bugs in the tree log)
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3909e5a9
    • Nadav Amit's avatar
      KVM: x86: Fix MSR_IA32_BNDCFGS in msrs_to_save · 702a71cf
      Nadav Amit authored
      commit 9e9c3fe4 upstream.
      
      kvm_init_msr_list is currently called before hardware_setup. As a result,
      vmx_mpx_supported always returns false when kvm_init_msr_list checks whether to
      save MSR_IA32_BNDCFGS.
      
      Move kvm_init_msr_list after vmx_hardware_setup is called to fix this issue.
      Signed-off-by: default avatarNadav Amit <namit@cs.technion.ac.il>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      Message-Id: <1428864435-4732-1-git-send-email-namit@cs.technion.ac.il>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      702a71cf
    • Peter Zijlstra's avatar
      perf/x86/intel: Fix Core2,Atom,NHM,WSM cycles:pp events · 0d2b51f3
      Peter Zijlstra authored
      commit 517e6341 upstream.
      
      Ingo reported that cycles:pp didn't work for him on some machines.
      
      It turns out that in this commit:
      
        af4bdcf6 perf/x86/intel: Disallow flags for most Core2/Atom/Nehalem/Westmere events
      
      Andi forgot to explicitly allow that event when he
      disabled event flags for PEBS on those uarchs.
      Reported-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Fixes: af4bdcf6 ("perf/x86/intel: Disallow flags for most Core2/Atom/Nehalem/Westmere events")
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0d2b51f3
    • Mike Galbraith's avatar
      sched/idle/x86: Optimize unnecessary mwait_idle() resched IPIs · a166e4b1
      Mike Galbraith authored
      commit f8e617f4 upstream.
      
      To fully take advantage of MWAIT, apparently the CLFLUSH instruction needs
      another quirk on certain CPUs: proper barriers around it on certain machines.
      
      On a Q6600 SMP system, pipe-test scheduling performance, cross core,
      improves significantly:
      
        3.8.13                   487.2 KHz    1.000
        3.13.0-master            415.5 KHz     .852
        3.13.0-master+           415.2 KHz     .852     + restore mwait_idle
        3.13.0-master++          488.5 KHz    1.002     + restore mwait_idle + IPI fix
      
      Since X86_BUG_CLFLUSH_MONITOR is already a quirk, don't create a separate
      quirk for the extra smp_mb()s.
      Signed-off-by: default avatarMike Galbraith <bitbucket@online.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ian Malone <ibmalone@gmail.com>
      Cc: Josh Boyer <jwboyer@redhat.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1390061684.5566.4.camel@marge.simpson.net
      [ Ported to recent kernel, added comments about the quirk. ]
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a166e4b1
    • Len Brown's avatar
      sched/idle/x86: Restore mwait_idle() to fix boot hangs, to improve power... · 0e126cfa
      Len Brown authored
      sched/idle/x86: Restore mwait_idle() to fix boot hangs, to improve power savings and to improve performance
      
      commit b253149b upstream.
      
      In Linux-3.9 we removed the mwait_idle() loop:
      
        69fb3676 ("x86 idle: remove mwait_idle() and "idle=mwait" cmdline param")
      
      The reasoning was that modern machines should be sufficiently
      happy during the boot process using the default_idle() HALT
      loop, until cpuidle loads and either acpi_idle or intel_idle
      invoke the newer MWAIT-with-hints idle loop.
      
      But two machines reported problems:
      
       1. Certain Core2-era machines support MWAIT-C1 and HALT only.
          MWAIT-C1 is preferred for optimal power and performance.
          But if they support just C1, cpuidle never loads and
          so they use the boot-time default idle loop forever.
      
       2. Some laptops will boot-hang if HALT is used,
          but will boot successfully if MWAIT is used.
          This appears to be a hidden assumption in BIOS SMI,
          that is presumably valid on the proprietary OS
          where the BIOS was validated.
      
             https://bugzilla.kernel.org/show_bug.cgi?id=60770
      
      So here we effectively revert the patch above, restoring
      the mwait_idle() loop.  However, we don't bother restoring
      the idle=mwait cmdline parameter, since it appears to add
      no value.
      
      Maintainer notes:
      
        For 3.9, simply revert 69fb3676
        for 3.10, patch -F3 applies, fuzz needed due to __cpuinit use in
        context For 3.11, 3.12, 3.13, this patch applies cleanly
      Tested-by: default avatarMike Galbraith <bitbucket@online.de>
      Signed-off-by: default avatarLen Brown <len.brown@intel.com>
      Acked-by: default avatarMike Galbraith <bitbucket@online.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ian Malone <ibmalone@gmail.com>
      Cc: Josh Boyer <jwboyer@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/345254a551eb5a6a866e048d7ab570fd2193aca4.1389763084.git.len.brown@intel.com
      [ Ported to recent kernels. ]
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      0e126cfa
    • Andy Lutomirski's avatar
      x86/asm/decoder: Fix and enforce max instruction size in the insn decoder · 30d7277d
      Andy Lutomirski authored
      commit 91e5ed49 upstream.
      
      x86 instructions cannot exceed 15 bytes, and the instruction
      decoder should enforce that.  Prior to 6ba48ff4, the
      instruction length limit was implicitly set to 16, which was an
      approximation of 15, but there is currently no limit at all.
      
      Fix MAX_INSN_SIZE (it should be 15, not 16), and fix the decoder
      to reject instructions that exceed MAX_INSN_SIZE.
      
      Other than potentially confusing some of the decoder sanity
      checks, I'm not aware of any actual problems that omitting this
      check would cause, nor am I aware of any practical problems
      caused by the MAX_INSN_SIZE error.
      Signed-off-by: default avatarAndy Lutomirski <luto@amacapital.net>
      Acked-by: default avatarMasami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Fixes: 6ba48ff4 ("x86: Remove arbitrary instruction size limit ...
      Link: http://lkml.kernel.org/r/f8f0bc9b8c58cfd6830f7d88400bf1396cbdcd0f.1422403511.git.luto@amacapital.netSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      30d7277d
    • Gu Zheng's avatar
      md: fix md io stats accounting broken · e637b3ec
      Gu Zheng authored
      commit 74672d06 upstream.
      
      Simon reported the md io stats accounting issue:
      "
      I'm seeing "iostat -x -k 1" print this after a RAID1 rebuild on 4.0-rc5.
      It's not abnormal other than it's 3-disk, with one being SSD (sdc) and
      the other two being write-mostly:
      
      Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
      sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
      sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
      sdc               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
      md0               0.00     0.00    0.00    0.00     0.00     0.00     0.00   345.00    0.00    0.00    0.00   0.00 100.00
      md2               0.00     0.00    0.00    0.00     0.00     0.00     0.00 58779.00    0.00    0.00    0.00   0.00 100.00
      md1               0.00     0.00    0.00    0.00     0.00     0.00     0.00    12.00    0.00    0.00    0.00   0.00 100.00
      "
      The cause is commit "18c0b223" uses the
      generic_start_io_acct to account the disk stats rather than the open code,
      but it also introduced the increase to .in_flight[rw] which is needless to
      md. So we re-use the open code here to fix it.
      Reported-by: default avatarSimon Kirby <sim@hostway.ca>
      Signed-off-by: default avatarGu Zheng <guz.fnst@cn.fujitsu.com>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e637b3ec
    • Amir Vadai's avatar
      net/mlx4_en: Prevent setting invalid RSS hash function · 8272d13a
      Amir Vadai authored
      [ Upstream commit b3706909 ]
      
      mlx4_en_check_rxfh_func() was checking for hardware support before
      setting a known RSS hash function, but didn't do any check before
      setting unknown RSS hash function. Need to make it fail on such values.
      In this occasion, moved the actual setting of the new value from the
      check function into mlx4_en_set_rxfh().
      
      Fixes: 947cbb0a ("net/mlx4_en: Support for configurable RSS hash function")
      Signed-off-by: default avatarAmir Vadai <amirv@mellanox.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8272d13a
    • Alexey Khoroshilov's avatar
      pxa168: fix double deallocation of managed resources · b3b8ae82
      Alexey Khoroshilov authored
      [ Upstream commit 0e03fd3e ]
      
      Commit 43d3ddf8 ("net: pxa168_eth: add device tree support") starts
      to use managed resources by adding devm_clk_get() and
      devm_ioremap_resource(), but it leaves explicit iounmap() and clock_put()
      in pxa168_eth_remove() and in failure handling code of pxa168_eth_probe().
      As a result double free can happen.
      
      The patch removes explicit resource deallocation. Also it converts
      clk_disable() to clk_disable_unprepare() to make it symmetrical with
      clk_prepare_enable().
      
      Found by Linux Driver Verification project (linuxtesting.org).
      Signed-off-by: default avatarAlexey Khoroshilov <khoroshilov@ispras.ru>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b3b8ae82
    • Eric Dumazet's avatar
      net: fix crash in build_skb() · 5e2b1498
      Eric Dumazet authored
      [ Upstream commit 2ea2f62c ]
      
      When I added pfmemalloc support in build_skb(), I forgot netlink
      was using build_skb() with a vmalloc() area.
      
      In this patch I introduce __build_skb() for netlink use,
      and build_skb() is a wrapper handling both skb->head_frag and
      skb->pfmemalloc
      
      This means netlink no longer has to hack skb->head_frag
      
      [ 1567.700067] kernel BUG at arch/x86/mm/physaddr.c:26!
      [ 1567.700067] invalid opcode: 0000 [#1] PREEMPT SMP KASAN
      [ 1567.700067] Dumping ftrace buffer:
      [ 1567.700067]    (ftrace buffer empty)
      [ 1567.700067] Modules linked in:
      [ 1567.700067] CPU: 9 PID: 16186 Comm: trinity-c182 Not tainted 4.0.0-next-20150424-sasha-00037-g4796e21 #2167
      [ 1567.700067] task: ffff880127efb000 ti: ffff880246770000 task.ti: ffff880246770000
      [ 1567.700067] RIP: __phys_addr (arch/x86/mm/physaddr.c:26 (discriminator 3))
      [ 1567.700067] RSP: 0018:ffff8802467779d8  EFLAGS: 00010202
      [ 1567.700067] RAX: 000041000ed8e000 RBX: ffffc9008ed8e000 RCX: 000000000000002c
      [ 1567.700067] RDX: 0000000000000004 RSI: 0000000000000000 RDI: ffffffffb3fd6049
      [ 1567.700067] RBP: ffff8802467779f8 R08: 0000000000000019 R09: ffff8801d0168000
      [ 1567.700067] R10: ffff8801d01680c7 R11: ffffed003a02d019 R12: ffffc9000ed8e000
      [ 1567.700067] R13: 0000000000000f40 R14: 0000000000001180 R15: ffffc9000ed8e000
      [ 1567.700067] FS:  00007f2a7da3f700(0000) GS:ffff8801d1000000(0000) knlGS:0000000000000000
      [ 1567.700067] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 1567.700067] CR2: 0000000000738308 CR3: 000000022e329000 CR4: 00000000000007e0
      [ 1567.700067] Stack:
      [ 1567.700067]  ffffc9000ed8e000 ffff8801d0168000 ffffc9000ed8e000 ffff8801d0168000
      [ 1567.700067]  ffff880246777a28 ffffffffad7c0a21 0000000000001080 ffff880246777c08
      [ 1567.700067]  ffff88060d302e68 ffff880246777b58 ffff880246777b88 ffffffffad9a6821
      [ 1567.700067] Call Trace:
      [ 1567.700067] build_skb (include/linux/mm.h:508 net/core/skbuff.c:316)
      [ 1567.700067] netlink_sendmsg (net/netlink/af_netlink.c:1633 net/netlink/af_netlink.c:2329)
      [ 1567.774369] ? sched_clock_cpu (kernel/sched/clock.c:311)
      [ 1567.774369] ? netlink_unicast (net/netlink/af_netlink.c:2273)
      [ 1567.774369] ? netlink_unicast (net/netlink/af_netlink.c:2273)
      [ 1567.774369] sock_sendmsg (net/socket.c:614 net/socket.c:623)
      [ 1567.774369] sock_write_iter (net/socket.c:823)
      [ 1567.774369] ? sock_sendmsg (net/socket.c:806)
      [ 1567.774369] __vfs_write (fs/read_write.c:479 fs/read_write.c:491)
      [ 1567.774369] ? get_lock_stats (kernel/locking/lockdep.c:249)
      [ 1567.774369] ? default_llseek (fs/read_write.c:487)
      [ 1567.774369] ? vtime_account_user (kernel/sched/cputime.c:701)
      [ 1567.774369] ? rw_verify_area (fs/read_write.c:406 (discriminator 4))
      [ 1567.774369] vfs_write (fs/read_write.c:539)
      [ 1567.774369] SyS_write (fs/read_write.c:586 fs/read_write.c:577)
      [ 1567.774369] ? SyS_read (fs/read_write.c:577)
      [ 1567.774369] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
      [ 1567.774369] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2594 kernel/locking/lockdep.c:2636)
      [ 1567.774369] ? trace_hardirqs_on_thunk (arch/x86/lib/thunk_64.S:42)
      [ 1567.774369] system_call_fastpath (arch/x86/kernel/entry_64.S:261)
      
      Fixes: 79930f58 ("net: do not deplete pfmemalloc reserve")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5e2b1498
    • Eric Dumazet's avatar
      net: do not deplete pfmemalloc reserve · ac375adc
      Eric Dumazet authored
      [ Upstream commit 79930f58 ]
      
      build_skb() should look at the page pfmemalloc status.
      If set, this means page allocator allocated this page in the
      expectation it would help to free other pages. Networking
      stack can do that only if skb->pfmemalloc is also set.
      
      Also, we must refrain using high order pages from the pfmemalloc
      reserve, so __page_frag_refill() must also use __GFP_NOMEMALLOC for
      them. Under memory pressure, using order-0 pages is probably the best
      strategy.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ac375adc
    • Eric Dumazet's avatar
      tcp: avoid looping in tcp_send_fin() · ecff0913
      Eric Dumazet authored
      [ Upstream commit 845704a5 ]
      
      Presence of an unbound loop in tcp_send_fin() had always been hard
      to explain when analyzing crash dumps involving gigantic dying processes
      with millions of sockets.
      
      Lets try a different strategy :
      
      In case of memory pressure, try to add the FIN flag to last packet
      in write queue, even if packet was already sent. TCP stack will
      be able to deliver this FIN after a timeout event. Note that this
      FIN being delivered by a retransmit, it also carries a Push flag
      given our current implementation.
      
      By checking sk_under_memory_pressure(), we anticipate that cooking
      many FIN packets might deplete tcp memory.
      
      In the case we could not allocate a packet, even with __GFP_WAIT
      allocation, then not sending a FIN seems quite reasonable if it allows
      to get rid of this socket, free memory, and not block the process from
      eventually doing other useful work.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ecff0913
    • Eric Dumazet's avatar
      tcp: fix possible deadlock in tcp_send_fin() · 275b7c40
      Eric Dumazet authored
      [ Upstream commit d83769a5 ]
      
      Using sk_stream_alloc_skb() in tcp_send_fin() is dangerous in
      case a huge process is killed by OOM, and tcp_mem[2] is hit.
      
      To be able to free memory we need to make progress, so this
      patch allows FIN packets to not care about tcp_mem[2], if
      skb allocation succeeded.
      
      In a follow-up patch, we might abort tcp_send_fin() infinite loop
      in case TIF_MEMDIE is set on this thread, as memory allocator
      did its best getting extra memory already.
      
      This patch reverts d22e1537 ("tcp: fix tcp fin memory accounting")
      
      Fixes: d22e1537 ("tcp: fix tcp fin memory accounting")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      275b7c40
    • Tom Herbert's avatar
      ppp: call skb_checksum_complete_unset in ppp_receive_frame · c0a79051
      Tom Herbert authored
      [ Upstream commit 3dfb0534 ]
      
      Call checksum_complete_unset in PPP receive to discard checksum-complete
      value. PPP does not pull checksum for headers and also modifies packet
      as in VJ compression.
      Signed-off-by: default avatarTom Herbert <tom@herbertland.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c0a79051
    • Tom Herbert's avatar
      net: add skb_checksum_complete_unset · 38fd84e6
      Tom Herbert authored
      [ Upstream commit 4e18b9ad ]
      
      This function changes ip_summed to CHECKSUM_NONE if CHECKSUM_COMPLETE
      is set. This is called to discard checksum-complete when packet
      is being modified and checksum is not pulled for headers in a layer.
      Signed-off-by: default avatarTom Herbert <tom@herbertland.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      38fd84e6
    • Sebastian Pöhn's avatar
      ip_forward: Drop frames with attached skb->sk · cf5bab3a
      Sebastian Pöhn authored
      [ Upstream commit 2ab95749 ]
      
      Initial discussion was:
      [FYI] xfrm: Don't lookup sk_policy for timewait sockets
      
      Forwarded frames should not have a socket attached. Especially
      tw sockets will lead to panics later-on in the stack.
      
      This was observed with TPROXY assigning a tw socket and broken
      policy routing (misconfigured). As a result frame enters
      forwarding path instead of input. We cannot solve this in
      TPROXY as it cannot know that policy routing is broken.
      
      v2:
      Remove useless comment
      Signed-off-by: default avatarSebastian Poehn <sebastian.poehn@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      cf5bab3a
  2. 29 Apr, 2015 22 commits
    • Greg Kroah-Hartman's avatar
      Linux 3.19.6 · 3c464c73
      Greg Kroah-Hartman authored
      3c464c73
    • Jann Horn's avatar
      fs: take i_mutex during prepare_binprm for set[ug]id executables · b23d104f
      Jann Horn authored
      commit 8b01fc86 upstream.
      
      This prevents a race between chown() and execve(), where chowning a
      setuid-user binary to root would momentarily make the binary setuid
      root.
      
      This patch was mostly written by Linus Torvalds.
      Signed-off-by: default avatarJann Horn <jann@thejh.net>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b23d104f
    • Troy Tan's avatar
      rtlwifi: rtl8192ee: Fix handling of new style descriptors · 9fe9c37c
      Troy Tan authored
      commit d0311314 upstream.
      
      The hardware and firmware for the RTL8192EE utilize a FIFO list of
      descriptors. There were some problems with the initial implementation.
      The worst of these failed to detect that the FIFO was becoming full,
      which led to the device needing to be power cycled. As this condition
      is not relevant to most of the devices supported by rtlwifi, a callback
      routine was added to detect this situation. This patch implements the
      necessary changes in the pci handler, and the linkage into the appropriate
      rtl8192ee routine.
      Signed-off-by: default avatarTroy Tan <troy_tan@realsil.com.cn>
      Signed-off-by: default avatarLarry Finger <Larry.Finger@lwfinger.net>
      Cc: Stable <stable@vger.kernel.org> [V3.18]
      Signed-off-by: default avatarKalle Valo <kvalo@codeaurora.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9fe9c37c
    • Naoya Horiguchi's avatar
      mm/hugetlb: take page table lock in follow_huge_pmd() · 014275c8
      Naoya Horiguchi authored
      commit e66f17ff upstream.
      
      We have a race condition between move_pages() and freeing hugepages, where
      move_pages() calls follow_page(FOLL_GET) for hugepages internally and
      tries to get its refcount without preventing concurrent freeing.  This
      race crashes the kernel, so this patch fixes it by moving FOLL_GET code
      for hugepages into follow_huge_pmd() with taking the page table lock.
      
      This patch intentionally removes page==NULL check after pte_page.
      This is justified because pte_page() never returns NULL for any
      architectures or configurations.
      
      This patch changes the behavior of follow_huge_pmd() for tail pages and
      then tail pages can be pinned/returned.  So the caller must be changed to
      properly handle the returned tail pages.
      
      We could have a choice to add the similar locking to
      follow_huge_(addr|pud) for consistency, but it's not necessary because
      currently these functions don't support FOLL_GET flag, so let's leave it
      for future development.
      
      Here is the reproducer:
      
        $ cat movepages.c
        #include <stdio.h>
        #include <stdlib.h>
        #include <numaif.h>
      
        #define ADDR_INPUT      0x700000000000UL
        #define HPS             0x200000
        #define PS              0x1000
      
        int main(int argc, char *argv[]) {
                int i;
                int nr_hp = strtol(argv[1], NULL, 0);
                int nr_p  = nr_hp * HPS / PS;
                int ret;
                void **addrs;
                int *status;
                int *nodes;
                pid_t pid;
      
                pid = strtol(argv[2], NULL, 0);
                addrs  = malloc(sizeof(char *) * nr_p + 1);
                status = malloc(sizeof(char *) * nr_p + 1);
                nodes  = malloc(sizeof(char *) * nr_p + 1);
      
                while (1) {
                        for (i = 0; i < nr_p; i++) {
                                addrs[i] = (void *)ADDR_INPUT + i * PS;
                                nodes[i] = 1;
                                status[i] = 0;
                        }
                        ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
                                              MPOL_MF_MOVE_ALL);
                        if (ret == -1)
                                err("move_pages");
      
                        for (i = 0; i < nr_p; i++) {
                                addrs[i] = (void *)ADDR_INPUT + i * PS;
                                nodes[i] = 0;
                                status[i] = 0;
                        }
                        ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
                                              MPOL_MF_MOVE_ALL);
                        if (ret == -1)
                                err("move_pages");
                }
                return 0;
        }
      
        $ cat hugepage.c
        #include <stdio.h>
        #include <sys/mman.h>
        #include <string.h>
      
        #define ADDR_INPUT      0x700000000000UL
        #define HPS             0x200000
      
        int main(int argc, char *argv[]) {
                int nr_hp = strtol(argv[1], NULL, 0);
                char *p;
      
                while (1) {
                        p = mmap((void *)ADDR_INPUT, nr_hp * HPS, PROT_READ | PROT_WRITE,
                                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
                        if (p != (void *)ADDR_INPUT) {
                                perror("mmap");
                                break;
                        }
                        memset(p, 0, nr_hp * HPS);
                        munmap(p, nr_hp * HPS);
                }
        }
      
        $ sysctl vm.nr_hugepages=40
        $ ./hugepage 10 &
        $ ./movepages 10 $(pgrep -f hugepage)
      
      
      [n-horiguchi@ah.jp.nec.com: resolve conflict to apply to v3.19.1]
      Fixes: e632a938 ("mm: migrate: add hugepage migration code to move_pages()")
      Signed-off-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reported-by: default avatarHugh Dickins <hughd@google.com>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      014275c8
    • Naoya Horiguchi's avatar
      mm/hugetlb: reduce arch dependent code around follow_huge_* · a15d5146
      Naoya Horiguchi authored
      commit 61f77eda upstream.
      
      Currently we have many duplicates in definitions around
      follow_huge_addr(), follow_huge_pmd(), and follow_huge_pud(), so this
      patch tries to remove the m.  The basic idea is to put the default
      implementation for these functions in mm/hugetlb.c as weak symbols
      (regardless of CONFIG_ARCH_WANT_GENERAL_HUGETL B), and to implement
      arch-specific code only when the arch needs it.
      
      For follow_huge_addr(), only powerpc and ia64 have their own
      implementation, and in all other architectures this function just returns
      ERR_PTR(-EINVAL).  So this patch sets returning ERR_PTR(-EINVAL) as
      default.
      
      As for follow_huge_(pmd|pud)(), if (pmd|pud)_huge() is implemented to
      always return 0 in your architecture (like in ia64 or sparc,) it's never
      called (the callsite is optimized away) no matter how implemented it is.
      So in such architectures, we don't need arch-specific implementation.
      
      In some architecture (like mips, s390 and tile,) their current
      arch-specific follow_huge_(pmd|pud)() are effectively identical with the
      common code, so this patch lets these architecture use the common code.
      
      One exception is metag, where pmd_huge() could return non-zero but it
      expects follow_huge_pmd() to always return NULL.  This means that we need
      arch-specific implementation which returns NULL.  This behavior looks
      strange to me (because non-zero pmd_huge() implies that the architecture
      supports PMD-based hugepage, so follow_huge_pmd() can/should return some
      relevant value,) but that's beyond this cleanup patch, so let's keep it.
      
      Justification of non-trivial changes:
      - in s390, follow_huge_pmd() checks !MACHINE_HAS_HPAGE at first, and this
        patch removes the check. This is OK because we can assume MACHINE_HAS_HPAGE
        is true when follow_huge_pmd() can be called (note that pmd_huge() has
        the same check and always returns 0 for !MACHINE_HAS_HPAGE.)
      - in s390 and mips, we use HPAGE_MASK instead of PMD_MASK as done in common
        code. This patch forces these archs use PMD_MASK, but it's OK because
        they are identical in both archs.
        In s390, both of HPAGE_SHIFT and PMD_SHIFT are 20.
        In mips, HPAGE_SHIFT is defined as (PAGE_SHIFT + PAGE_SHIFT - 3) and
        PMD_SHIFT is define as (PAGE_SHIFT + PAGE_SHIFT + PTE_ORDER - 3), but
        PTE_ORDER is always 0, so these are identical.
      
      [n-horiguchi@ah.jp.nec.com: resolve conflict to apply to v3.19.1]
      Signed-off-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: default avatarHugh Dickins <hughd@google.com>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a15d5146
    • Ian Abbott's avatar
      staging: comedi: adv_pci1710: fix AI INSN_READ for non-zero channel · b52d8696
      Ian Abbott authored
      commit abe46b89 upstream.
      
      Reading of analog input channels by the `INSN_READ` comedi instruction
      is broken for all except channel 0.  `pci171x_ai_insn_read()` calls
      `pci171x_ai_read_sample()` with the wrong value for the third parameter.
      It is supposed to be the current index in a channel list (which is
      always of length 1 in this case, so the index should be 0), but instead
      it is passing the actual channel number.  `pci171x_ai_read_sample()`
      checks the channel number encoded in the raw sample value read from the
      hardware matches the channel number stored in the specified index of the
      previously set up channel list and returns `-ENODATA` if it doesn't
      match.  Since the index should always be 0 in this case, the match will
      fail unless the channel number is also 0.  Fix it by passing 0 as the
      channel index.
      
      Note that when the bug first appeared, it was `pci171x_ai_dropout()`
      that was called with the wrong parameter value.  `pci171x_ai_dropout()`
      got replaced with `pci171x_ai_read_sample()` in commit 7fd2dae2
      ("staging: comedi: adv_pci1710: introduce pci171x_ai_read_sample()").
      
      Fixes: 16c7eb60 ("staging: comedi: adv_pci1710: always enable PCI171x_PARANOIDCHECK code")
      Signed-off-by: default avatarIan Abbott <abbotti@mev.co.uk>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      b52d8696
    • Radim Krčmář's avatar
      KVM: nVMX: mask unrestricted_guest if disabled on L0 · 1b413722
      Radim Krčmář authored
      commit 0790ec17 upstream.
      
      If EPT was enabled, unrestricted_guest was allowed in L1 regardless of
      L0.  L1 triple faulted when running L2 guest that required emulation.
      
      Another side effect was 'WARN_ON_ONCE(vmx->nested.nested_run_pending)'
      in L0's dmesg:
        WARNING: CPU: 0 PID: 0 at arch/x86/kvm/vmx.c:9190 nested_vmx_vmexit+0x96e/0xb00 [kvm_intel] ()
      
      Prevent this scenario by masking SECONDARY_EXEC_UNRESTRICTED_GUEST when
      the host doesn't have it enabled.
      
      Fixes: 78051e3b ("KVM: nVMX: Disable unrestricted mode if ept=0")
      Cc: stable@vger.kernel.org
      Tested-By: default avatarKashyap Chamarthy <kchamart@redhat.com>
      Signed-off-by: default avatarRadim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: default avatarMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      1b413722
    • Jun'ichi Nomura \\\\(NEC\\\\)'s avatar
      tg3: Hold tp->lock before calling tg3_halt() from tg3_init_one() · edc7cc6c
      Jun'ichi Nomura \\\\(NEC\\\\) authored
      [ Upstream commit d0af71a3 ]
      
      tg3_init_one() calls tg3_halt() without tp->lock despite its assumption
      and causes deadlock.
      If lockdep is enabled, a warning like this shows up before the stall:
      
        [ BUG: bad unlock balance detected! ]
        3.19.0test #3 Tainted: G            E
        -------------------------------------
        insmod/369 is trying to release lock (&(&tp->lock)->rlock) at:
        [<ffffffffa02d5a1d>] tg3_chip_reset+0x14d/0x780 [tg3]
        but there are no more locks to release!
      
      tg3_init_one() doesn't call tg3_halt() under normal situation but
      during kexec kdump I hit this problem.
      
      Fixes: 932f19de ("tg3: Release tp->lock before invoking synchronize_irq()")
      Signed-off-by: default avatarJun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      edc7cc6c
    • Ben Hutchings's avatar
      usbnet: Fix tx_bytes statistic running backward in cdc_ncm · f1cb2a0f
      Ben Hutchings authored
      [ Upstream commit 7a1e890e ]
      
      cdc_ncm disagrees with usbnet about how much framing overhead should
      be counted in the tx_bytes statistics, and tries 'fix' this by
      decrementing tx_bytes on the transmit path.  But statistics must never
      be decremented except due to roll-over; this will thoroughly confuse
      user-space.  Also, tx_bytes is only incremented by usbnet in the
      completion path.
      
      Fix this by requiring drivers that set FLAG_MULTI_FRAME to set a
      tx_bytes delta along with the tx_packets count.
      
      Fixes: beeecd42 ("net: cdc_ncm/cdc_mbim: adding NCM protocol statistics")
      Signed-off-by: default avatarBen Hutchings <ben.hutchings@codethink.co.uk>
      Signed-off-by: default avatarBjørn Mork <bjorn@mork.no>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f1cb2a0f
    • Ben Hutchings's avatar
      usbnet: Fix tx_packets stat for FLAG_MULTI_FRAME drivers · 3d206780
      Ben Hutchings authored
      [ Upstream commit 1e9e39f4 ]
      
      Currently the usbnet core does not update the tx_packets statistic for
      drivers with FLAG_MULTI_PACKET and there is no hook in the TX
      completion path where they could do this.
      
      cdc_ncm and dependent drivers are bumping tx_packets stat on the
      transmit path while asix and sr9800 aren't updating it at all.
      
      Add a packet count in struct skb_data so these drivers can fill it
      in, initialise it to 1 for other drivers, and add the packet count
      to the tx_packets statistic on completion.
      Signed-off-by: default avatarBen Hutchings <ben.hutchings@codethink.co.uk>
      Tested-by: default avatarBjørn Mork <bjorn@mork.no>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3d206780
    • Jesse Gross's avatar
      udptunnels: Call handle_offloads after inserting vlan tag. · 174fbb30
      Jesse Gross authored
      [ Upstream commit b736a623 ]
      
      handle_offloads() calls skb_reset_inner_headers() to store
      the layer pointers to the encapsulated packet. However, we
      currently push the vlag tag (if there is one) onto the packet
      afterwards. This changes the MAC header for the encapsulated
      packet but it is not reflected in skb->inner_mac_header, which
      breaks GSO and drivers which attempt to use this for encapsulation
      offloads.
      
      Fixes: 1eaa8178 ("vxlan: Add tx-vlan offload support.")
      Signed-off-by: default avatarJesse Gross <jesse@nicira.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      174fbb30
    • Herbert Xu's avatar
      skbuff: Do not scrub skb mark within the same name space · d385d003
      Herbert Xu authored
      [ Upstream commit 213dd74a ]
      
      On Wed, Apr 15, 2015 at 05:41:26PM +0200, Nicolas Dichtel wrote:
      > Le 15/04/2015 15:57, Herbert Xu a écrit :
      > >On Wed, Apr 15, 2015 at 06:22:29PM +0800, Herbert Xu wrote:
      > [snip]
      > >Subject: skbuff: Do not scrub skb mark within the same name space
      > >
      > >The commit ea23192e ("tunnels:
      > Maybe add a Fixes tag?
      > Fixes: ea23192e ("tunnels: harmonize cleanup done on skb on rx path")
      >
      > >harmonize cleanup done on skb on rx path") broke anyone trying to
      > >use netfilter marking across IPv4 tunnels.  While most of the
      > >fields that are cleared by skb_scrub_packet don't matter, the
      > >netfilter mark must be preserved.
      > >
      > >This patch rearranges skb_scurb_packet to preserve the mark field.
      > nit: s/scurb/scrub
      >
      > Else it's fine for me.
      
      Sure.
      
      PS I used the wrong email for James the first time around.  So
      let me repeat the question here.  Should secmark be preserved
      or cleared across tunnels within the same name space? In fact,
      do our security models even support name spaces?
      
      ---8<---
      The commit ea23192e ("tunnels:
      harmonize cleanup done on skb on rx path") broke anyone trying to
      use netfilter marking across IPv4 tunnels.  While most of the
      fields that are cleared by skb_scrub_packet don't matter, the
      netfilter mark must be preserved.
      
      This patch rearranges skb_scrub_packet to preserve the mark field.
      
      Fixes: ea23192e ("tunnels: harmonize cleanup done on skb on rx path")
      Signed-off-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      Acked-by: default avatarThomas Graf <tgraf@suug.ch>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d385d003
    • Herbert Xu's avatar
      Revert "net: Reset secmark when scrubbing packet" · 5fe5245d
      Herbert Xu authored
      [ Upstream commit 4c0ee414 ]
      
      This patch reverts commit b8fb4e06
      because the secmark must be preserved even when a packet crosses
      namespace boundaries.  The reason is that security labels apply to
      the system as a whole and is not per-namespace.
      Signed-off-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5fe5245d
    • Alexei Starovoitov's avatar
      bpf: fix verifier memory corruption · e3666002
      Alexei Starovoitov authored
      [ Upstream commit c3de6317 ]
      
      Due to missing bounds check the DAG pass of the BPF verifier can corrupt
      the memory which can cause random crashes during program loading:
      
      [8.449451] BUG: unable to handle kernel paging request at ffffffffffffffff
      [8.451293] IP: [<ffffffff811de33d>] kmem_cache_alloc_trace+0x8d/0x2f0
      [8.452329] Oops: 0000 [#1] SMP
      [8.452329] Call Trace:
      [8.452329]  [<ffffffff8116cc82>] bpf_check+0x852/0x2000
      [8.452329]  [<ffffffff8116b7e4>] bpf_prog_load+0x1e4/0x310
      [8.452329]  [<ffffffff811b190f>] ? might_fault+0x5f/0xb0
      [8.452329]  [<ffffffff8116c206>] SyS_bpf+0x806/0xa30
      
      Fixes: f1bca824 ("bpf: add search pruning optimization to verifier")
      Signed-off-by: default avatarAlexei Starovoitov <ast@plumgrid.com>
      Acked-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Acked-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e3666002
    • Eric Dumazet's avatar
      bnx2x: Fix busy_poll vs netpoll · 63f49b7f
      Eric Dumazet authored
      [ Upstream commit 074975d0 ]
      
      Commit 9a2620c8 ("bnx2x: prevent WARN during driver unload")
      switched the napi/busy_lock locking mechanism from spin_lock() into
      spin_lock_bh(), breaking inter-operability with netconsole, as netpoll
      disables interrupts prior to calling our napi mechanism.
      
      This switches the driver into using atomic assignments instead of the
      spinlock mechanisms previously employed.
      
      Based on initial patch from Yuval Mintz & Ariel Elior
      
      I basically added softirq starvation avoidance, and mixture
      of atomic operations, plain writes and barriers.
      
      Note this slightly reduces the overhead for this driver when no
      busy_poll sockets are in use.
      
      Fixes: 9a2620c8 ("bnx2x: prevent WARN during driver unload")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      63f49b7f
    • Eric Dumazet's avatar
      tcp: tcp_make_synack() should clear skb->tstamp · 1b6c8d50
      Eric Dumazet authored
      [ Upstream commit b50edd78 ]
      
      I noticed tcpdump was giving funky timestamps for locally
      generated SYNACK messages on loopback interface.
      
      11:42:46.938990 IP 127.0.0.1.48245 > 127.0.0.2.23850: S
      945476042:945476042(0) win 43690 <mss 65495,nop,nop,sackOK,nop,wscale 7>
      
      20:28:58.502209 IP 127.0.0.2.23850 > 127.0.0.1.48245: S
      3160535375:3160535375(0) ack 945476043 win 43690 <mss
      65495,nop,nop,sackOK,nop,wscale 7>
      
      This is because we need to clear skb->tstamp before
      entering lower stack, otherwise net_timestamp_check()
      does not set skb->tstamp.
      
      Fixes: 7faee5c0 ("tcp: remove TCP_SKB_CB(skb)->when")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1b6c8d50
    • Jack Morgenstein's avatar
      net/mlx4_core: Fix error message deprecation for ConnectX-2 cards · 3a0cf55b
      Jack Morgenstein authored
      [ Upstream commit fde913e2 ]
      
      Commit 1daa4303 ("net/mlx4_core: Deprecate error message at
      ConnectX-2 cards startup to debug") did the deprecation only for port 1
      of the card. Need to deprecate for port 2 as well.
      
      Fixes: 1daa4303 ("net/mlx4_core: Deprecate error message at ConnectX-2 cards startup to debug")
      Signed-off-by: default avatarJack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: default avatarAmir Vadai <amirv@mellanox.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3a0cf55b
    • hannes@stressinduktion.org's avatar
      ipv6: protect skb->sk accesses from recursive dereference inside the stack · 3fe207e4
      hannes@stressinduktion.org authored
      [ Upstream commit f60e5990 ]
      
      We should not consult skb->sk for output decisions in xmit recursion
      levels > 0 in the stack. Otherwise local socket settings could influence
      the result of e.g. tunnel encapsulation process.
      
      ipv6 does not conform with this in three places:
      
      1) ip6_fragment: we do consult ipv6_npinfo for frag_size
      
      2) sk_mc_loop in ipv6 uses skb->sk and checks if we should
         loop the packet back to the local socket
      
      3) ip6_skb_dst_mtu could query the settings from the user socket and
         force a wrong MTU
      
      Furthermore:
      In sk_mc_loop we could potentially land in WARN_ON(1) if we use a
      PF_PACKET socket ontop of an IPv6-backed vxlan device.
      
      Reuse xmit_recursion as we are currently only interested in protecting
      tunnel devices.
      
      Cc: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3fe207e4
    • Neal Cardwell's avatar
      tcp: fix FRTO undo on cumulative ACK of SACKed range · 84212e36
      Neal Cardwell authored
      [ Upstream commit 666b8051 ]
      
      On processing cumulative ACKs, the FRTO code was not checking the
      SACKed bit, meaning that there could be a spurious FRTO undo on a
      cumulative ACK of a previously SACKed skb.
      
      The FRTO code should only consider a cumulative ACK to indicate that
      an original/unretransmitted skb is newly ACKed if the skb was not yet
      SACKed.
      
      The effect of the spurious FRTO undo would typically be to make the
      connection think that all previously-sent packets were in flight when
      they really weren't, leading to a stall and an RTO.
      Signed-off-by: default avatarNeal Cardwell <ncardwell@google.com>
      Signed-off-by: default avatarYuchung Cheng <ycheng@google.com>
      Fixes: e33099f9 ("tcp: implement RFC5682 F-RTO")
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      84212e36
    • Jonathan Davies's avatar
      xen-netfront: transmit fully GSO-sized packets · 5ce52016
      Jonathan Davies authored
      [ Upstream commit 0c36820e ]
      
      xen-netfront limits transmitted skbs to be at most 44 segments in size. However,
      GSO permits up to 65536 bytes, which means a maximum of 45 segments of 1448
      bytes each. This slight reduction in the size of packets means a slight loss in
      efficiency.
      
      Since c/s 9ecd1a75, xen-netfront sets gso_max_size to
          XEN_NETIF_MAX_TX_SIZE - MAX_TCP_HEADER,
      where XEN_NETIF_MAX_TX_SIZE is 65535 bytes.
      
      The calculation used by tcp_tso_autosize (and also tcp_xmit_size_goal since c/s
      6c09fa09) in determining when to split an skb into two is
          sk->sk_gso_max_size - 1 - MAX_TCP_HEADER.
      
      So the maximum permitted size of an skb is calculated to be
          (XEN_NETIF_MAX_TX_SIZE - MAX_TCP_HEADER) - 1 - MAX_TCP_HEADER.
      
      Intuitively, this looks like the wrong formula -- we don't need two TCP headers.
      Instead, there is no need to deviate from the default gso_max_size of 65536 as
      this already accommodates the size of the header.
      
      Currently, the largest skb transmitted by netfront is 63712 bytes (44 segments
      of 1448 bytes each), as observed via tcpdump. This patch makes netfront send
      skbs of up to 65160 bytes (45 segments of 1448 bytes each).
      
      Similarly, the maximum allowable mtu does not need to subtract MAX_TCP_HEADER as
      it relates to the size of the whole packet, including the header.
      
      Fixes: 9ecd1a75 ("xen-netfront: reduce gso_max_size to account for max TCP header")
      Signed-off-by: default avatarJonathan Davies <jonathan.davies@citrix.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5ce52016
    • Thomas Graf's avatar
      openvswitch: Return vport module ref before destruction · 69ed0224
      Thomas Graf authored
      [ Upstream commit fa2d8ff4 ]
      
      Return module reference before invoking the respective vport
      ->destroy() function. This is needed as ovs_vport_del() is not
      invoked inside an RCU read side critical section so the kfree
      can occur immediately before returning to ovs_vport_del().
      
      Returning the module reference before ->destroy() is safe because
      the module unregistration is blocked on ovs_lock which we hold
      while destroying the datapath.
      
      Fixes: 62b9c8d0 ("ovs: Turn vports with dependencies into separate modules")
      Reported-by: default avatarPravin Shelar <pshelar@nicira.com>
      Signed-off-by: default avatarThomas Graf <tgraf@suug.ch>
      Acked-by: default avatarPravin B Shelar <pshelar@nicira.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      69ed0224
    • Anton Nayshtut's avatar
      bonding: Bonding Overriding Configuration logic restored. · 28cc484c
      Anton Nayshtut authored
      [ Upstream commit f5e2dc5d ]
      
      Before commit 3900f290 ("bonding: slight
      optimizztion for bond_slave_override()") the override logic was to send packets
      with non-zero queue_id through the slave with corresponding queue_id, under two
      conditions only - if the slave can transmit and it's up.
      
      The above mentioned commit changed this logic by introducing an additional
      condition - whether the bond is active (indirectly, using the slave_can_tx and
      later - bond_is_active_slave), that prevents the user from implementing more
      complex policies according to the Documentation/networking/bonding.txt.
      Signed-off-by: default avatarAnton Nayshtut <anton@swortex.com>
      Signed-off-by: default avatarAlexey Bogoslavsky <alexey@swortex.com>
      Signed-off-by: default avatarAndy Gospodarek <gospo@cumulusnetworks.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      28cc484c