1. 13 Oct, 2015 40 commits
    • Michal Kubeček's avatar
      ipv6: update ip6_rt_last_gc every time GC is run · 24277c76
      Michal Kubeček authored
      commit 49a18d86 upstream.
      
      As pointed out by Eric Dumazet, net->ipv6.ip6_rt_last_gc should
      hold the last time garbage collector was run so that we should
      update it whenever fib6_run_gc() calls fib6_clean_all(), not only
      if we got there from ip6_dst_gc().
      Signed-off-by: default avatarMichal Kubecek <mkubecek@suse.cz>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      24277c76
    • Michal Kubeček's avatar
      ipv6: prevent fib6_run_gc() contention · 2491f018
      Michal Kubeček authored
      commit 2ac3ac8f upstream.
      
      On a high-traffic router with many processors and many IPv6 dst
      entries, soft lockup in fib6_run_gc() can occur when number of
      entries reaches gc_thresh.
      
      This happens because fib6_run_gc() uses fib6_gc_lock to allow
      only one thread to run the garbage collector but ip6_dst_gc()
      doesn't update net->ipv6.ip6_rt_last_gc until fib6_run_gc()
      returns. On a system with many entries, this can take some time
      so that in the meantime, other threads pass the tests in
      ip6_dst_gc() (ip6_rt_last_gc is still not updated) and wait for
      the lock. They then have to run the garbage collector one after
      another which blocks them for quite long.
      
      Resolve this by replacing special value ~0UL of expire parameter
      to fib6_run_gc() by explicit "force" parameter to choose between
      spin_lock_bh() and spin_trylock_bh() and call fib6_run_gc() with
      force=false if gc_thresh is reached but not max_size.
      Signed-off-by: default avatarMichal Kubecek <mkubecek@suse.cz>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      [bwh: Backported to 3.2: adjust context]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      2491f018
    • Kirill A. Shutemov's avatar
      perf tools: Fix build with perl 5.18 · 4e959244
      Kirill A. Shutemov authored
      commit 575bf1d0 upstream.
      
      perl.h from new Perl release doesn't like -Wundef and -Wswitch-default:
      
      /usr/lib/perl5/core_perl/CORE/perl.h:548:5: error: "SILENT_NO_TAINT_SUPPORT" is not defined [-Werror=undef]
       #if SILENT_NO_TAINT_SUPPORT && !defined(NO_TAINT_SUPPORT)
           ^
      /usr/lib/perl5/core_perl/CORE/perl.h:556:5: error: "NO_TAINT_SUPPORT" is not defined [-Werror=undef]
       #if NO_TAINT_SUPPORT
           ^
      In file included from /usr/lib/perl5/core_perl/CORE/perl.h:3471:0,
                       from util/scripting-engines/trace-event-perl.c:30:
      /usr/lib/perl5/core_perl/CORE/sv.h:1455:5: error: "NO_TAINT_SUPPORT" is not defined [-Werror=undef]
       #if NO_TAINT_SUPPORT
           ^
      In file included from /usr/lib/perl5/core_perl/CORE/perl.h:3472:0,
                       from util/scripting-engines/trace-event-perl.c:30:
      /usr/lib/perl5/core_perl/CORE/regexp.h:436:5: error: "NO_TAINT_SUPPORT" is not defined [-Werror=undef]
       #if NO_TAINT_SUPPORT
           ^
      In file included from /usr/lib/perl5/core_perl/CORE/hv.h:592:0,
                       from /usr/lib/perl5/core_perl/CORE/perl.h:3480,
                       from util/scripting-engines/trace-event-perl.c:30:
      /usr/lib/perl5/core_perl/CORE/hv_func.h: In function ‘S_perl_hash_siphash_2_4’:
      /usr/lib/perl5/core_perl/CORE/hv_func.h:222:3: error: switch missing default case [-Werror=switch-default]
         switch( left )
         ^
      /usr/lib/perl5/core_perl/CORE/hv_func.h: In function ‘S_perl_hash_superfast’:
      /usr/lib/perl5/core_perl/CORE/hv_func.h:274:5: error: switch missing default case [-Werror=switch-default]
           switch (rem) { \
           ^
      /usr/lib/perl5/core_perl/CORE/hv_func.h: In function ‘S_perl_hash_murmur3’:
      /usr/lib/perl5/core_perl/CORE/hv_func.h:398:5: error: switch missing default case [-Werror=switch-default]
           switch(bytes_in_carry) { /* how many bytes in carry */
           ^
      
      Let's disable the warnings for code which uses perl.h.
      Signed-off-by: default avatarKirill A. Shutemov <kirill@shutemov.name>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1372063394-20126-1-git-send-email-kirill@shutemov.nameSigned-off-by: default avatarArnaldo Carvalho de Melo <acme@redhat.com>
      [bwh: Backported to 3.2: adjust context]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      Cc: Vinson Lee <vlee@twopensource.com>
      4e959244
    • Wilson Kok's avatar
      fib_rules: fix fib rule dumps across multiple skbs · c7e9f97d
      Wilson Kok authored
      [ Upstream commit 41fc0143 ]
      
      dump_rules returns skb length and not error.
      But when family == AF_UNSPEC, the caller of dump_rules
      assumes that it returns an error. Hence, when family == AF_UNSPEC,
      we continue trying to dump on -EMSGSIZE errors resulting in
      incorrect dump idx carried between skbs belonging to the same dump.
      This results in fib rule dump always only dumping rules that fit
      into the first skb.
      
      This patch fixes dump_rules to return error so that we exit correctly
      and idx is correctly maintained between skbs that are part of the
      same dump.
      Signed-off-by: default avatarWilson Kok <wkok@cumulusnetworks.com>
      Signed-off-by: default avatarRoopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      [bwh: Backported to 3.2:
       - s/portid/pid/
       - Check whether fib_nl_fill_rule() returns < 0, as it may return > 0 on
         success (thanks to Roland Dreier)]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      Cc: Roland Dreier <roland@purestorage.com>
      c7e9f97d
    • Richard Laing's avatar
      net/ipv6: Correct PIM6 mrt_lock handling · ea43243c
      Richard Laing authored
      [ Upstream commit 25b4a44c ]
      
      In the IPv6 multicast routing code the mrt_lock was not being released
      correctly in the MFC iterator, as a result adding or deleting a MIF would
      cause a hang because the mrt_lock could not be acquired.
      
      This fix is a copy of the code for the IPv4 case and ensures that the lock
      is released correctly.
      Signed-off-by: default avatarRichard Laing <richard.laing@alliedtelesis.co.nz>
      Acked-by: default avatarCong Wang <cwang@twopensource.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      ea43243c
    • dingtianhong's avatar
      bonding: correct the MAC address for "follow" fail_over_mac policy · d2e03097
      dingtianhong authored
      [ Upstream commit a951bc1e ]
      
      The "follow" fail_over_mac policy is useful for multiport devices that
      either become confused or incur a performance penalty when multiple
      ports are programmed with the same MAC address, but the same MAC
      address still may happened by this steps for this policy:
      
      1) echo +eth0 > /sys/class/net/bond0/bonding/slaves
         bond0 has the same mac address with eth0, it is MAC1.
      
      2) echo +eth1 > /sys/class/net/bond0/bonding/slaves
         eth1 is backup, eth1 has MAC2.
      
      3) ifconfig eth0 down
         eth1 became active slave, bond will swap MAC for eth0 and eth1,
         so eth1 has MAC1, and eth0 has MAC2.
      
      4) ifconfig eth1 down
         there is no active slave, and eth1 still has MAC1, eth2 has MAC2.
      
      5) ifconfig eth0 up
         the eth0 became active slave again, the bond set eth0 to MAC1.
      
      Something wrong here, then if you set eth1 up, the eth0 and eth1 will have the same
      MAC address, it will break this policy for ACTIVE_BACKUP mode.
      
      This patch will fix this problem by finding the old active slave and
      swap them MAC address before change active slave.
      Signed-off-by: default avatarDing Tianhong <dingtianhong@huawei.com>
      Tested-by: default avatarNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      [bwh: Backported to 3.2:
       - bond_for_each_slave() takes an extra int paramter
       - Use compare_ether_addr() instead of ether_addr_equal()]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      d2e03097
    • Eric Dumazet's avatar
      ipv6: lock socket in ip6_datagram_connect() · c1a7dedb
      Eric Dumazet authored
      [ Upstream commit 03645a11 ]
      
      ip6_datagram_connect() is doing a lot of socket changes without
      socket being locked.
      
      This looks wrong, at least for udp_lib_rehash() which could corrupt
      lists because of concurrent udp_sk(sk)->udp_portaddr_hash accesses.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Acked-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      [bwh: Backported to 3.2: adjust context]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      c1a7dedb
    • Herbert Xu's avatar
      net: Fix skb csum races when peeking · 58a5897a
      Herbert Xu authored
      [ Upstream commit 89c22d8c ]
      
      When we calculate the checksum on the recv path, we store the
      result in the skb as an optimisation in case we need the checksum
      again down the line.
      
      This is in fact bogus for the MSG_PEEK case as this is done without
      any locking.  So multiple threads can peek and then store the result
      to the same skb, potentially resulting in bogus skb states.
      
      This patch fixes this by only storing the result if the skb is not
      shared.  This preserves the optimisations for the few cases where
      it can be done safely due to locking or other reasons, e.g., SIOCINQ.
      Signed-off-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      58a5897a
    • Oleg Nesterov's avatar
      net: pktgen: fix race between pktgen_thread_worker() and kthread_stop() · 7d41d849
      Oleg Nesterov authored
      [ Upstream commit fecdf8be ]
      
      pktgen_thread_worker() is obviously racy, kthread_stop() can come
      between the kthread_should_stop() check and set_current_state().
      Signed-off-by: default avatarOleg Nesterov <oleg@redhat.com>
      Reported-by: default avatarJan Stancek <jstancek@redhat.com>
      Reported-by: default avatarMarcelo Leitner <mleitner@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      7d41d849
    • Stephen Smalley's avatar
      net/tipc: initialize security state for new connection socket · 79bff4bc
      Stephen Smalley authored
      [ Upstream commit fdd75ea8 ]
      
      Calling connect() with an AF_TIPC socket would trigger a series
      of error messages from SELinux along the lines of:
      SELinux: Invalid class 0
      type=AVC msg=audit(1434126658.487:34500): avc:  denied  { <unprintable> }
        for pid=292 comm="kworker/u16:5" scontext=system_u:system_r:kernel_t:s0
        tcontext=system_u:object_r:unlabeled_t:s0 tclass=<unprintable>
        permissive=0
      
      This was due to a failure to initialize the security state of the new
      connection sock by the tipc code, leaving it with junk in the security
      class field and an unlabeled secid.  Add a call to security_sk_clone()
      to inherit the security state from the parent socket.
      Reported-by: default avatarTim Shearer <tim.shearer@overturenetworks.com>
      Signed-off-by: default avatarStephen Smalley <sds@tycho.nsa.gov>
      Acked-by: default avatarPaul Moore <paul@paul-moore.com>
      Acked-by: default avatarYing Xue <ying.xue@windriver.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      [bwh: Backported to 3.2: adjust context, indentation]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      79bff4bc
    • Linus Torvalds's avatar
      Initialize msg/shm IPC objects before doing ipc_addid() · 2ef259c0
      Linus Torvalds authored
      commit b9a53227 upstream.
      
      As reported by Dmitry Vyukov, we really shouldn't do ipc_addid() before
      having initialized the IPC object state.  Yes, we initialize the IPC
      object in a locked state, but with all the lockless RCU lookup work,
      that IPC object lock no longer means that the state cannot be seen.
      
      We already did this for the IPC semaphore code (see commit e8577d1f:
      "ipc/sem.c: fully initialize sem_array before making it visible") but we
      clearly forgot about msg and shm.
      Reported-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Davidlohr Bueso <dbueso@suse.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      [bwh: Backported to 3.2:
       - Adjust context
       - The error path being moved looks a little different]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      2ef259c0
    • Manfred Spraul's avatar
      ipc/sem.c: fully initialize sem_array before making it visible · 0bdf1e82
      Manfred Spraul authored
      commit e8577d1f upstream.
      
      ipc_addid() makes a new ipc identifier visible to everyone.  New objects
      start as locked, so that the caller can complete the initialization
      after the call.  Within struct sem_array, at least sma->sem_base and
      sma->sem_nsems are accessed without any locks, therefore this approach
      doesn't work.
      
      Thus: Move the ipc_addid() to the end of the initialization.
      Signed-off-by: default avatarManfred Spraul <manfred@colorfullife.com>
      Reported-by: default avatarRik van Riel <riel@redhat.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Acked-by: default avatarDavidlohr Bueso <dave@stgolabs.net>
      Acked-by: default avatarRafael Aquini <aquini@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      [bwh: Backported to 3.2:
       - Adjust context
       - The error path being moved looks a little different]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      0bdf1e82
    • Sasha Levin's avatar
      RDS: verify the underlying transport exists before creating a connection · 987ad6ee
      Sasha Levin authored
      commit 74e98eb0 upstream.
      
      There was no verification that an underlying transport exists when creating
      a connection, this would cause dereferencing a NULL ptr.
      
      It might happen on sockets that weren't properly bound before attempting to
      send a message, which will cause a NULL ptr deref:
      
      [135546.047719] kasan: GPF could be caused by NULL-ptr deref or user memory accessgeneral protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
      [135546.051270] Modules linked in:
      [135546.051781] CPU: 4 PID: 15650 Comm: trinity-c4 Not tainted 4.2.0-next-20150902-sasha-00041-gbaa1222-dirty #2527
      [135546.053217] task: ffff8800835bc000 ti: ffff8800bc708000 task.ti: ffff8800bc708000
      [135546.054291] RIP: __rds_conn_create (net/rds/connection.c:194)
      [135546.055666] RSP: 0018:ffff8800bc70fab0  EFLAGS: 00010202
      [135546.056457] RAX: dffffc0000000000 RBX: 0000000000000f2c RCX: ffff8800835bc000
      [135546.057494] RDX: 0000000000000007 RSI: ffff8800835bccd8 RDI: 0000000000000038
      [135546.058530] RBP: ffff8800bc70fb18 R08: 0000000000000001 R09: 0000000000000000
      [135546.059556] R10: ffffed014d7a3a23 R11: ffffed014d7a3a21 R12: 0000000000000000
      [135546.060614] R13: 0000000000000001 R14: ffff8801ec3d0000 R15: 0000000000000000
      [135546.061668] FS:  00007faad4ffb700(0000) GS:ffff880252000000(0000) knlGS:0000000000000000
      [135546.062836] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [135546.063682] CR2: 000000000000846a CR3: 000000009d137000 CR4: 00000000000006a0
      [135546.064723] Stack:
      [135546.065048]  ffffffffafe2055c ffffffffafe23fc1 ffffed00493097bf ffff8801ec3d0008
      [135546.066247]  0000000000000000 00000000000000d0 0000000000000000 ac194a24c0586342
      [135546.067438]  1ffff100178e1f78 ffff880320581b00 ffff8800bc70fdd0 ffff880320581b00
      [135546.068629] Call Trace:
      [135546.069028] ? __rds_conn_create (include/linux/rcupdate.h:856 net/rds/connection.c:134)
      [135546.069989] ? rds_message_copy_from_user (net/rds/message.c:298)
      [135546.071021] rds_conn_create_outgoing (net/rds/connection.c:278)
      [135546.071981] rds_sendmsg (net/rds/send.c:1058)
      [135546.072858] ? perf_trace_lock (include/trace/events/lock.h:38)
      [135546.073744] ? lockdep_init (kernel/locking/lockdep.c:3298)
      [135546.074577] ? rds_send_drop_to (net/rds/send.c:976)
      [135546.075508] ? __might_fault (./arch/x86/include/asm/current.h:14 mm/memory.c:3795)
      [135546.076349] ? __might_fault (mm/memory.c:3795)
      [135546.077179] ? rds_send_drop_to (net/rds/send.c:976)
      [135546.078114] sock_sendmsg (net/socket.c:611 net/socket.c:620)
      [135546.078856] SYSC_sendto (net/socket.c:1657)
      [135546.079596] ? SYSC_connect (net/socket.c:1628)
      [135546.080510] ? trace_dump_stack (kernel/trace/trace.c:1926)
      [135546.081397] ? ring_buffer_unlock_commit (kernel/trace/ring_buffer.c:2479 kernel/trace/ring_buffer.c:2558 kernel/trace/ring_buffer.c:2674)
      [135546.082390] ? trace_buffer_unlock_commit (kernel/trace/trace.c:1749)
      [135546.083410] ? trace_event_raw_event_sys_enter (include/trace/events/syscalls.h:16)
      [135546.084481] ? do_audit_syscall_entry (include/trace/events/syscalls.h:16)
      [135546.085438] ? trace_buffer_unlock_commit (kernel/trace/trace.c:1749)
      [135546.085515] rds_ib_laddr_check(): addr 36.74.25.172 ret -99 node type -1
      Acked-by: default avatarSantosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      987ad6ee
    • Jason Wang's avatar
      virtio-net: drop NETIF_F_FRAGLIST · e4afe1f1
      Jason Wang authored
      commit 48900cb6 upstream.
      
      virtio declares support for NETIF_F_FRAGLIST, but assumes
      that there are at most MAX_SKB_FRAGS + 2 fragments which isn't
      always true with a fraglist.
      
      A longer fraglist in the skb will make the call to skb_to_sgvec overflow
      the sg array, leading to memory corruption.
      
      Drop NETIF_F_FRAGLIST so we only get what we can handle.
      
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: default avatarJason Wang <jasowang@redhat.com>
      Acked-by: default avatarMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      e4afe1f1
    • Marcelo Leitner's avatar
      ipv6: addrconf: validate new MTU before applying it · 1c825dac
      Marcelo Leitner authored
      commit 77751427 upstream.
      
      Currently we don't check if the new MTU is valid or not and this allows
      one to configure a smaller than minimum allowed by RFCs or even bigger
      than interface own MTU, which is a problem as it may lead to packet
      drops.
      
      If you have a daemon like NetworkManager running, this may be exploited
      by remote attackers by forging RA packets with an invalid MTU, possibly
      leading to a DoS. (NetworkManager currently only validates for values
      too small, but not for too big ones.)
      
      The fix is just to make sure the new value is valid. That is, between
      IPV6_MIN_MTU and interface's MTU.
      
      Note that similar check is already performed at
      ndisc_router_discovery(), for when kernel itself parses the RA.
      Signed-off-by: default avatarMarcelo Ricardo Leitner <mleitner@redhat.com>
      Signed-off-by: default avatarSabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      1c825dac
    • Benjamin Randazzo's avatar
      md: use kzalloc() when bitmap is disabled · 06f0f9d8
      Benjamin Randazzo authored
      commit b6878d9e upstream.
      
      In drivers/md/md.c get_bitmap_file() uses kmalloc() for creating a
      mdu_bitmap_file_t called "file".
      
      5769         file = kmalloc(sizeof(*file), GFP_NOIO);
      5770         if (!file)
      5771                 return -ENOMEM;
      
      This structure is copied to user space at the end of the function.
      
      5786         if (err == 0 &&
      5787             copy_to_user(arg, file, sizeof(*file)))
      5788                 err = -EFAULT
      
      But if bitmap is disabled only the first byte of "file" is initialized
      with zero, so it's possible to read some bytes (up to 4095) of kernel
      space memory from user space. This is an information leak.
      
      5775         /* bitmap disabled, zero the first byte and copy out */
      5776         if (!mddev->bitmap_info.file)
      5777                 file->pathname[0] = '\0';
      Signed-off-by: default avatarBenjamin Randazzo <benjamin@randazzo.fr>
      Signed-off-by: default avatarNeilBrown <neilb@suse.com>
      [bwh: Backported to 3.2: patch both possible allocation calls]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      06f0f9d8
    • Johan Hovold's avatar
      USB: whiteheat: fix potential null-deref at probe · cbea5711
      Johan Hovold authored
      commit cbb4be65 upstream.
      
      Fix potential null-pointer dereference at probe by making sure that the
      required endpoints are present.
      
      The whiteheat driver assumes there are at least five pairs of bulk
      endpoints, of which the final pair is used for the "command port". An
      attempt to bind to an interface with fewer bulk endpoints would
      currently lead to an oops.
      
      Fixes CVE-2015-5257.
      Reported-by: default avatarMoein Ghasemzadeh <moein@istuary.com>
      Signed-off-by: default avatarJohan Hovold <johan@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      cbea5711
    • Joseph Qi's avatar
      ocfs2/dlm: fix deadlock when dispatch assert master · 307c6c27
      Joseph Qi authored
      commit 012572d4 upstream.
      
      The order of the following three spinlocks should be:
      dlm_domain_lock < dlm_ctxt->spinlock < dlm_lock_resource->spinlock
      
      But dlm_dispatch_assert_master() is called while holding
      dlm_ctxt->spinlock and dlm_lock_resource->spinlock, and then it calls
      dlm_grab() which will take dlm_domain_lock.
      
      Once another thread (for example, dlm_query_join_handler) has already
      taken dlm_domain_lock, and tries to take dlm_ctxt->spinlock deadlock
      happens.
      Signed-off-by: default avatarJoseph Qi <joseph.qi@huawei.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: "Junxiao Bi" <junxiao.bi@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      [bwh: Backported to 3.2: adjust context]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      307c6c27
    • Andy Lutomirski's avatar
      x86/paravirt: Replace the paravirt nop with a bona fide empty function · 81fbc9a5
      Andy Lutomirski authored
      commit fc57a7c6 upstream.
      
      PARAVIRT_ADJUST_EXCEPTION_FRAME generates this code (using nmi as an
      example, trimmed for readability):
      
          ff 15 00 00 00 00       callq  *0x0(%rip)        # 2796 <nmi+0x6>
                    2792: R_X86_64_PC32     pv_irq_ops+0x2c
      
      That's a call through a function pointer to regular C function that
      does nothing on native boots, but that function isn't protected
      against kprobes, isn't marked notrace, and is certainly not
      guaranteed to preserve any registers if the compiler is feeling
      perverse.  This is bad news for a CLBR_NONE operation.
      
      Of course, if everything works correctly, once paravirt ops are
      patched, it gets nopped out, but what if we hit this code before
      paravirt ops are patched in?  This can potentially cause breakage
      that is very difficult to debug.
      
      A more subtle failure is possible here, too: if _paravirt_nop uses
      the stack at all (even just to push RBP), it will overwrite the "NMI
      executing" variable if it's called in the NMI prologue.
      
      The Xen case, perhaps surprisingly, is fine, because it's already
      written in asm.
      
      Fix all of the cases that default to paravirt_nop (including
      adjust_exception_frame) with a big hammer: replace paravirt_nop with
      an asm function that is just a ret instruction.
      
      The Xen case may have other problems, so document them.
      
      This is part of a fix for some random crashes that Sasha saw.
      Reported-and-tested-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Signed-off-by: default avatarAndy Lutomirski <luto@kernel.org>
      Link: http://lkml.kernel.org/r/8f5d2ba295f9d73751c33d97fda03e0495d9ade0.1442791737.git.luto@kernel.orgSigned-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      [bwh: Backported to 3.2: adjust filename, context]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      81fbc9a5
    • Peter Seiderer's avatar
      cifs: use server timestamp for ntlmv2 authentication · 90bba09c
      Peter Seiderer authored
      commit 98ce94c8 upstream.
      
      Linux cifs mount with ntlmssp against an Mac OS X (Yosemite
      10.10.5) share fails in case the clocks differ more than +/-2h:
      
      digest-service: digest-request: od failed with 2 proto=ntlmv2
      digest-service: digest-request: kdc failed with -1561745592 proto=ntlmv2
      
      Fix this by (re-)using the given server timestamp for the
      ntlmv2 authentication (as Windows 7 does).
      
      A related problem was also reported earlier by Namjae Jaen (see below):
      
      Windows machine has extended security feature which refuse to allow
      authentication when there is time difference between server time and
      client time when ntlmv2 negotiation is used. This problem is prevalent
      in embedded enviornment where system time is set to default 1970.
      
      Modern servers send the server timestamp in the TargetInfo Av_Pair
      structure in the challenge message [see MS-NLMP 2.2.2.1]
      In [MS-NLMP 3.1.5.1.2] it is explicitly mentioned that the client must
      use the server provided timestamp if present OR current time if it is
      not
      Reported-by: default avatarNamjae Jeon <namjae.jeon@samsung.com>
      Signed-off-by: default avatarPeter Seiderer <ps.report@gmx.net>
      Signed-off-by: default avatarSteve French <smfrench@gmail.com>
      [bwh: Backported to 3.2: adjust context]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      90bba09c
    • Mathias Nyman's avatar
      xhci: change xhci 1.0 only restrictions to support xhci 1.1 · e35c94fa
      Mathias Nyman authored
      commit dca77945 upstream.
      
      Some changes between xhci 0.96 and xhci 1.0 specifications forced us to
      check the hci version in code, some of these checks were implemented as
      hci_version == 1.0, which will not work with new xhci 1.1 controllers.
      
      xhci 1.1 behaves similar to xhci 1.0 in these cases, so change these
      checks to hci_version >= 1.0
      Signed-off-by: default avatarMathias Nyman <mathias.nyman@linux.intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      e35c94fa
    • Roger Quadros's avatar
      usb: xhci: Clear XHCI_STATE_DYING on start · 88069fda
      Roger Quadros authored
      commit e5bfeab0 upstream.
      
      For whatever reason if XHCI died in the previous instant
      then it will never recover on the next xhci_start unless we
      clear the DYING flag.
      Signed-off-by: default avatarRoger Quadros <rogerq@ti.com>
      Signed-off-by: default avatarMathias Nyman <mathias.nyman@linux.intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      88069fda
    • Mathias Nyman's avatar
      xhci: give command abortion one more chance before killing xhci · cce88b82
      Mathias Nyman authored
      commit a6809ffd upstream.
      
      We want to give the command abortion an additional try to stop
      the command ring before we completely hose xhci.
      Tested-by: default avatarVincent Pelletier <plr.vincent@gmail.com>
      Signed-off-by: default avatarMathias Nyman <mathias.nyman@linux.intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      [bwh: Backported to 3.2: call handshake() rather than xhci_handshake()]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      cce88b82
    • Mathias Nyman's avatar
      usb: Use the USB_SS_MULT() macro to get the burst multiplier. · 519e5443
      Mathias Nyman authored
      commit ff30cbc8 upstream.
      
      Bits 1:0 of the bmAttributes are used for the burst multiplier.
      The rest of the bits used to be reserved (zero), but USB3.1 takes bit 7
      into use.
      
      Use the existing USB_SS_MULT() macro instead to make sure the mult value
      and hence max packet calculations are correct for USB3.1 devices.
      
      Note that burst multiplier in bmAttributes is zero based and that
      the USB_SS_MULT() macro adds one.
      Signed-off-by: default avatarMathias Nyman <mathias.nyman@linux.intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      519e5443
    • Paolo Bonzini's avatar
      KVM: x86: trap AMD MSRs for the TSeg base and mask · 1ddf94af
      Paolo Bonzini authored
      commit 3afb1121 upstream.
      
      These have roughly the same purpose as the SMRR, which we do not need
      to implement in KVM.  However, Linux accesses MSR_K8_TSEG_ADDR at
      boot, which causes problems when running a Xen dom0 under KVM.
      Just return 0, meaning that processor protection of SMRAM is not
      in effect.
      Reported-by: default avatarM A Young <m.a.young@durham.ac.uk>
      Acked-by: default avatarBorislav Petkov <bp@suse.de>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      1ddf94af
    • Martin Schwidefsky's avatar
      s390/compat: correct uc_sigmask of the compat signal frame · 9bf6bf61
      Martin Schwidefsky authored
      commit 8d4bd0ed upstream.
      
      The uc_sigmask in the ucontext structure is an array of words to keep
      the 64 signal bits (or 1024 if you ask glibc but the kernel sigset_t
      only has 64 bits).
      
      For 64 bit the sigset_t contains a single 8 byte word, but for 31 bit
      there are two 4 byte words. The compat signal handler code uses a
      simple copy of the 64 bit sigset_t to the 31 bit compat_sigset_t.
      As s390 is a big-endian architecture this is incorrect, the two words
      in the 31 bit sigset_t array need to be swapped.
      Reported-by: default avatarStefan Liebler <stli@linux.vnet.ibm.com>
      Signed-off-by: default avatarMartin Schwidefsky <schwidefsky@de.ibm.com>
      [bwh: Backported to 3.2:
       - Introduce local compat_sigset_t in setup_frame32()
       - Adjust context]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      9bf6bf61
    • Robert Jarzmik's avatar
      ASoC: fix broken pxa SoC support · 1329f22d
      Robert Jarzmik authored
      commit 3c8f7710 upstream.
      
      The previous fix of pxa library support, which was introduced to fix the
      library dependency, broke the previous SoC behavior, where a machine
      code binding pxa2xx-ac97 with a coded relied on :
       - sound/soc/pxa/pxa2xx-ac97.c
       - sound/soc/codecs/XXX.c
      
      For example, the mioa701_wm9713.c machine code is currently broken. The
      "select ARM" statement wrongly selects the soc/arm/pxa2xx-ac97 for
      compilation, as per an unfortunate fate SND_PXA2XX_AC97 is both declared
      in sound/arm/Kconfig and sound/soc/pxa/Kconfig.
      
      Fix this by ensuring that SND_PXA2XX_SOC correctly triggers the correct
      pxa2xx-ac97 compilation.
      
      Fixes: 846172df ("ASoC: fix SND_PXA2XX_LIB Kconfig warning")
      Signed-off-by: default avatarRobert Jarzmik <robert.jarzmik@free.fr>
      Signed-off-by: default avatarMark Brown <broonie@kernel.org>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      1329f22d
    • David Woodhouse's avatar
      x86/platform: Fix Geode LX timekeeping in the generic x86 build · d8e332d4
      David Woodhouse authored
      commit 03da3ff1 upstream.
      
      In 2007, commit 07190a08 ("Mark TSC on GeodeLX reliable")
      bypassed verification of the TSC on Geode LX. However, this code
      (now in the check_system_tsc_reliable() function in
      arch/x86/kernel/tsc.c) was only present if CONFIG_MGEODE_LX was
      set.
      
      OpenWRT has recently started building its generic Geode target
      for Geode GX, not LX, to include support for additional
      platforms. This broke the timekeeping on LX-based devices,
      because the TSC wasn't marked as reliable:
      https://dev.openwrt.org/ticket/20531
      
      By adding a runtime check on is_geode_lx(), we can also include
      the fix if CONFIG_MGEODEGX1 or CONFIG_X86_GENERIC are set, thus
      fixing the problem.
      Signed-off-by: default avatarDavid Woodhouse <David.Woodhouse@intel.com>
      Cc: Andres Salomon <dilinger@queued.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Marcelo Tosatti <marcelo@kvack.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1442409003.131189.87.camel@infradead.orgSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      d8e332d4
    • Russell King's avatar
      ARM: fix Thumb2 signal handling when ARMv6 is enabled · 7aa36cdf
      Russell King authored
      commit 9b55613f upstream.
      
      When a kernel is built covering ARMv6 to ARMv7, we omit to clear the
      IT state when entering a signal handler.  This can cause the first
      few instructions to be conditionally executed depending on the parent
      context.
      
      In any case, the original test for >= ARMv7 is broken - ARMv6 can have
      Thumb-2 support as well, and an ARMv6T2 specific build would omit this
      code too.
      
      Relax the test back to ARMv6 or greater.  This results in us always
      clearing the IT state bits in the PSR, even on CPUs where these bits
      are reserved.  However, they're reserved for the IT state, so this
      should cause no harm.
      
      Fixes: d71e1352 ("Clear the IT state when invoking a Thumb-2 signal handler")
      Acked-by: default avatarTony Lindgren <tony@atomide.com>
      Tested-by: default avatarH. Nikolaus Schaller <hns@goldelico.com>
      Tested-by: default avatarGrazvydas Ignotas <notasas@gmail.com>
      Signed-off-by: default avatarRussell King <rmk+kernel@arm.linux.org.uk>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      7aa36cdf
    • T.J. Purtell's avatar
      ARM: 7880/1: Clear the IT state independent of the Thumb-2 mode · cf5fdb4a
      T.J. Purtell authored
      commit 6ecf830e upstream.
      
      The ARM architecture reference specifies that the IT state bits in the
      PSR must be all zeros in ARM mode or behavior is unspecified.  On the
      Qualcomm Snapdragon S4/Krait architecture CPUs the processor continues
      to consider the IT state bits while in ARM mode.  This makes it so
      that some instructions are skipped by the CPU.
      Signed-off-by: default avatarT.J. Purtell <tj@mobisocial.us>
      [rmk+kernel@arm.linux.org.uk: fixed whitespace formatting in patch]
      Signed-off-by: default avatarRussell King <rmk+kernel@arm.linux.org.uk>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      cf5fdb4a
    • Jeff Mahoney's avatar
      btrfs: skip waiting on ordered range for special files · 6910b173
      Jeff Mahoney authored
      commit a30e577c upstream.
      
      In btrfs_evict_inode, we properly truncate the page cache for evicted
      inodes but then we call btrfs_wait_ordered_range for every inode as well.
      It's the right thing to do for regular files but results in incorrect
      behavior for device inodes for block devices.
      
      filemap_fdatawrite_range gets called with inode->i_mapping which gets
      resolved to the block device inode before getting passed to
      wbc_attach_fdatawrite_inode and ultimately to inode_to_bdi.  What happens
      next depends on whether there's an open file handle associated with the
      inode.  If there is, we write to the block device, which is unexpected
      behavior.  If there isn't, we through normally and inode->i_data is used.
      We can also end up racing against open/close which can result in crashes
      when i_mapping points to a block device inode that has been closed.
      
      Since there can't be any page cache associated with special file inodes,
      it's safe to skip the btrfs_wait_ordered_range call entirely and avoid
      the problem.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=100911Tested-by: default avatarChristoph Biedl <linux-kernel.bfrz@manchmal.in-ulm.de>
      Signed-off-by: default avatarJeff Mahoney <jeffm@suse.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      6910b173
    • Filipe Manana's avatar
      Btrfs: fix read corruption of compressed and shared extents · e52ea4cc
      Filipe Manana authored
      commit 005efedf upstream.
      
      If a file has a range pointing to a compressed extent, followed by
      another range that points to the same compressed extent and a read
      operation attempts to read both ranges (either completely or part of
      them), the pages that correspond to the second range are incorrectly
      filled with zeroes.
      
      Consider the following example:
      
        File layout
        [0 - 8K]                      [8K - 24K]
            |                             |
            |                             |
         points to extent X,         points to extent X,
         offset 4K, length of 8K     offset 0, length 16K
      
        [extent X, compressed length = 4K uncompressed length = 16K]
      
      If a readpages() call spans the 2 ranges, a single bio to read the extent
      is submitted - extent_io.c:submit_extent_page() would only create a new
      bio to cover the second range pointing to the extent if the extent it
      points to had a different logical address than the extent associated with
      the first range. This has a consequence of the compressed read end io
      handler (compression.c:end_compressed_bio_read()) finish once the extent
      is decompressed into the pages covering the first range, leaving the
      remaining pages (belonging to the second range) filled with zeroes (done
      by compression.c:btrfs_clear_biovec_end()).
      
      So fix this by submitting the current bio whenever we find a range
      pointing to a compressed extent that was preceded by a range with a
      different extent map. This is the simplest solution for this corner
      case. Making the end io callback populate both ranges (or more, if we
      have multiple pointing to the same extent) is a much more complex
      solution since each bio is tightly coupled with a single extent map and
      the extent maps associated to the ranges pointing to the shared extent
      can have different offsets and lengths.
      
      The following test case for fstests triggers the issue:
      
        seq=`basename $0`
        seqres=$RESULT_DIR/$seq
        echo "QA output created by $seq"
        tmp=/tmp/$$
        status=1	# failure is the default!
        trap "_cleanup; exit \$status" 0 1 2 3 15
      
        _cleanup()
        {
            rm -f $tmp.*
        }
      
        # get standard environment, filters and checks
        . ./common/rc
        . ./common/filter
      
        # real QA test starts here
        _need_to_be_root
        _supported_fs btrfs
        _supported_os Linux
        _require_scratch
        _require_cloner
      
        rm -f $seqres.full
      
        test_clone_and_read_compressed_extent()
        {
            local mount_opts=$1
      
            _scratch_mkfs >>$seqres.full 2>&1
            _scratch_mount $mount_opts
      
            # Create a test file with a single extent that is compressed (the
            # data we write into it is highly compressible no matter which
            # compression algorithm is used, zlib or lzo).
            $XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 4K"        \
                            -c "pwrite -S 0xbb 4K 8K"        \
                            -c "pwrite -S 0xcc 12K 4K"       \
                            $SCRATCH_MNT/foo | _filter_xfs_io
      
            # Now clone our extent into an adjacent offset.
            $CLONER_PROG -s $((4 * 1024)) -d $((16 * 1024)) -l $((8 * 1024)) \
                $SCRATCH_MNT/foo $SCRATCH_MNT/foo
      
            # Same as before but for this file we clone the extent into a lower
            # file offset.
            $XFS_IO_PROG -f -c "pwrite -S 0xaa 8K 4K"         \
                            -c "pwrite -S 0xbb 12K 8K"        \
                            -c "pwrite -S 0xcc 20K 4K"        \
                            $SCRATCH_MNT/bar | _filter_xfs_io
      
            $CLONER_PROG -s $((12 * 1024)) -d 0 -l $((8 * 1024)) \
                $SCRATCH_MNT/bar $SCRATCH_MNT/bar
      
            echo "File digests before unmounting filesystem:"
            md5sum $SCRATCH_MNT/foo | _filter_scratch
            md5sum $SCRATCH_MNT/bar | _filter_scratch
      
            # Evicting the inode or clearing the page cache before reading
            # again the file would also trigger the bug - reads were returning
            # all bytes in the range corresponding to the second reference to
            # the extent with a value of 0, but the correct data was persisted
            # (it was a bug exclusively in the read path). The issue happened
            # only if the same readpages() call targeted pages belonging to the
            # first and second ranges that point to the same compressed extent.
            _scratch_remount
      
            echo "File digests after mounting filesystem again:"
            # Must match the same digests we got before.
            md5sum $SCRATCH_MNT/foo | _filter_scratch
            md5sum $SCRATCH_MNT/bar | _filter_scratch
        }
      
        echo -e "\nTesting with zlib compression..."
        test_clone_and_read_compressed_extent "-o compress=zlib"
      
        _scratch_unmount
      
        echo -e "\nTesting with lzo compression..."
        test_clone_and_read_compressed_extent "-o compress=lzo"
      
        status=0
        exit
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: Qu Wenruo<quwenruo@cn.fujitsu.com>
      Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      [bwh: Backported to 3.2:
       - Maintain prev_em_start in both functions calling __extent_read_full_page()
         in a loop
       - Adjust context and order]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      e52ea4cc
    • Liu.Zhao's avatar
      USB: option: add ZTE PIDs · 254a47ce
      Liu.Zhao authored
      commit 19ab6bc5 upstream.
      
      This is intended to add ZTE device PIDs on kernel.
      Signed-off-by: default avatarLiu.Zhao <lzsos369@163.com>
      [johan: sort the new entries ]
      Signed-off-by: default avatarJohan Hovold <johan@kernel.org>
      [bwh: Backported to 3.2: adjust context]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      254a47ce
    • Arnaldo Carvalho de Melo's avatar
      perf header: Fixup reading of HEADER_NRCPUS feature · 749b5bac
      Arnaldo Carvalho de Melo authored
      commit caa47047 upstream.
      
      The original patch introducing this header wrote the number of CPUs available
      and online in one order and then swapped those values when reading, fix it.
      
      Before:
      
        # perf record usleep 1
        # perf report --header-only | grep 'nrcpus \(online\|avail\)'
        # nrcpus online : 4
        # nrcpus avail : 4
        # echo 0 > /sys/devices/system/cpu/cpu2/online
        # perf record usleep 1
        # perf report --header-only | grep 'nrcpus \(online\|avail\)'
        # nrcpus online : 4
        # nrcpus avail : 3
        # echo 0 > /sys/devices/system/cpu/cpu1/online
        # perf record usleep 1
        # perf report --header-only | grep 'nrcpus \(online\|avail\)'
        # nrcpus online : 4
        # nrcpus avail : 2
      
      After the fix, bringing back the CPUs online:
      
        # perf report --header-only | grep 'nrcpus \(online\|avail\)'
        # nrcpus online : 2
        # nrcpus avail : 4
        # echo 1 > /sys/devices/system/cpu/cpu2/online
        # perf record usleep 1
        # perf report --header-only | grep 'nrcpus \(online\|avail\)'
        # nrcpus online : 3
        # nrcpus avail : 4
        # echo 1 > /sys/devices/system/cpu/cpu1/online
        # perf record usleep 1
        # perf report --header-only | grep 'nrcpus \(online\|avail\)'
        # nrcpus online : 4
        # nrcpus avail : 4
      Acked-by: default avatarNamhyung Kim <namhyung@kernel.org>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Wang Nan <wangnan0@huawei.com>
      Fixes: fbe96f29 ("perf tools: Make perf.data more self-descriptive (v8)")
      Link: http://lkml.kernel.org/r/20150911153323.GP23511@kernel.orgSigned-off-by: default avatarArnaldo Carvalho de Melo <acme@redhat.com>
      [bwh: Backported to 3.2: print_nrcpus() reads and prints these fields
       immediately, so read both of them into an array before printing them in
       reverse order.]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      749b5bac
    • Hin-Tak Leung's avatar
      hfs: fix B-tree corruption after insertion at position 0 · d46a3490
      Hin-Tak Leung authored
      commit b4cc0efe upstream.
      
      Fix B-tree corruption when a new record is inserted at position 0 in the
      node in hfs_brec_insert().
      
      This is an identical change to the corresponding hfs b-tree code to Sergei
      Antonov's "hfsplus: fix B-tree corruption after insertion at position 0",
      to keep similar code paths in the hfs and hfsplus drivers in sync, where
      appropriate.
      Signed-off-by: default avatarHin-Tak Leung <htl10@users.sourceforge.net>
      Cc: Sergei Antonov <saproj@gmail.com>
      Cc: Joe Perches <joe@perches.com>
      Reviewed-by: default avatarVyacheslav Dubeyko <slava@dubeyko.com>
      Cc: Anton Altaparmakov <anton@tuxera.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      d46a3490
    • Hin-Tak Leung's avatar
      hfs,hfsplus: cache pages correctly between bnode_create and bnode_free · dd04e674
      Hin-Tak Leung authored
      commit 7cb74be6 upstream.
      
      Pages looked up by __hfs_bnode_create() (called by hfs_bnode_create() and
      hfs_bnode_find() for finding or creating pages corresponding to an inode)
      are immediately kmap()'ed and used (both read and write) and kunmap()'ed,
      and should not be page_cache_release()'ed until hfs_bnode_free().
      
      This patch fixes a problem I first saw in July 2012: merely running "du"
      on a large hfsplus-mounted directory a few times on a reasonably loaded
      system would get the hfsplus driver all confused and complaining about
      B-tree inconsistencies, and generates a "BUG: Bad page state".  Most
      recently, I can generate this problem on up-to-date Fedora 22 with shipped
      kernel 4.0.5, by running "du /" (="/" + "/home" + "/mnt" + other smaller
      mounts) and "du /mnt" simultaneously on two windows, where /mnt is a
      lightly-used QEMU VM image of the full Mac OS X 10.9:
      
      $ df -i / /home /mnt
      Filesystem                  Inodes   IUsed      IFree IUse% Mounted on
      /dev/mapper/fedora-root    3276800  551665    2725135   17% /
      /dev/mapper/fedora-home   52879360  716221   52163139    2% /home
      /dev/nbd0p2             4294967295 1387818 4293579477    1% /mnt
      
      After applying the patch, I was able to run "du /" (60+ times) and "du
      /mnt" (150+ times) continuously and simultaneously for 6+ hours.
      
      There are many reports of the hfsplus driver getting confused under load
      and generating "BUG: Bad page state" or other similar issues over the
      years.  [1]
      
      The unpatched code [2] has always been wrong since it entered the kernel
      tree.  The only reason why it gets away with it is that the
      kmap/memcpy/kunmap follow very quickly after the page_cache_release() so
      the kernel has not had a chance to reuse the memory for something else,
      most of the time.
      
      The current RW driver appears to have followed the design and development
      of the earlier read-only hfsplus driver [3], where-by version 0.1 (Dec
      2001) had a B-tree node-centric approach to
      read_cache_page()/page_cache_release() per bnode_get()/bnode_put(),
      migrating towards version 0.2 (June 2002) of caching and releasing pages
      per inode extents.  When the current RW code first entered the kernel [2]
      in 2005, there was an REF_PAGES conditional (and "//" commented out code)
      to switch between B-node centric paging to inode-centric paging.  There
      was a mistake with the direction of one of the REF_PAGES conditionals in
      __hfs_bnode_create().  In a subsequent "remove debug code" commit [4], the
      read_cache_page()/page_cache_release() per bnode_get()/bnode_put() were
      removed, but a page_cache_release() was mistakenly left in (propagating
      the "REF_PAGES <-> !REF_PAGE" mistake), and the commented-out
      page_cache_release() in bnode_release() (which should be spanned by
      !REF_PAGES) was never enabled.
      
      References:
      [1]:
      Michael Fox, Apr 2013
      http://www.spinics.net/lists/linux-fsdevel/msg63807.html
      ("hfsplus volume suddenly inaccessable after 'hfs: recoff %d too large'")
      
      Sasha Levin, Feb 2015
      http://lkml.org/lkml/2015/2/20/85 ("use after free")
      
      https://bugs.launchpad.net/ubuntu/+source/linux/+bug/740814
      https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1027887
      https://bugzilla.kernel.org/show_bug.cgi?id=42342
      https://bugzilla.kernel.org/show_bug.cgi?id=63841
      https://bugzilla.kernel.org/show_bug.cgi?id=78761
      
      [2]:
      http://git.kernel.org/cgit/linux/kernel/git/tglx/history.git/commit/\
      fs/hfs/bnode.c?id=d1081202
      commit d1081202
      Author: Andrew Morton <akpm@osdl.org>
      Date:   Wed Feb 25 16:17:36 2004 -0800
      
          [PATCH] HFS rewrite
      
      http://git.kernel.org/cgit/linux/kernel/git/tglx/history.git/commit/\
      fs/hfsplus/bnode.c?id=91556682
      
      commit 91556682
      Author: Andrew Morton <akpm@osdl.org>
      Date:   Wed Feb 25 16:17:48 2004 -0800
      
          [PATCH] HFS+ support
      
      [3]:
      http://sourceforge.net/projects/linux-hfsplus/
      
      http://sourceforge.net/projects/linux-hfsplus/files/Linux%202.4.x%20patch/hfsplus%200.1/
      http://sourceforge.net/projects/linux-hfsplus/files/Linux%202.4.x%20patch/hfsplus%200.2/
      
      http://linux-hfsplus.cvs.sourceforge.net/viewvc/linux-hfsplus/linux/\
      fs/hfsplus/bnode.c?r1=1.4&r2=1.5
      
      Date:   Thu Jun 6 09:45:14 2002 +0000
      Use buffer cache instead of page cache in bnode.c. Cache inode extents.
      
      [4]:
      http://git.kernel.org/cgit/linux/kernel/git/\
      stable/linux-stable.git/commit/?id=a5e3985f
      
      commit a5e3985f
      Author: Roman Zippel <zippel@linux-m68k.org>
      Date:   Tue Sep 6 15:18:47 2005 -0700
      
      [PATCH] hfs: remove debug code
      Signed-off-by: default avatarHin-Tak Leung <htl10@users.sourceforge.net>
      Signed-off-by: default avatarSergei Antonov <saproj@gmail.com>
      Reviewed-by: default avatarAnton Altaparmakov <anton@tuxera.com>
      Reported-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Vyacheslav Dubeyko <slava@dubeyko.com>
      Cc: Sougata Santra <sougata@tuxera.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      dd04e674
    • Paul Mackerras's avatar
      powerpc/MSI: Fix race condition in tearing down MSI interrupts · 96ad262f
      Paul Mackerras authored
      commit e297c939 upstream.
      
      This fixes a race which can result in the same virtual IRQ number
      being assigned to two different MSI interrupts.  The most visible
      consequence of that is usually a warning and stack trace from the
      sysfs code about an attempt to create a duplicate entry in sysfs.
      
      The race happens when one CPU (say CPU 0) is disposing of an MSI
      while another CPU (say CPU 1) is setting up an MSI.  CPU 0 calls
      (for example) pnv_teardown_msi_irqs(), which calls
      msi_bitmap_free_hwirqs() to indicate that the MSI (i.e. its
      hardware IRQ number) is no longer in use.  Then, before CPU 0 gets
      to calling irq_dispose_mapping() to free up the virtal IRQ number,
      CPU 1 comes in and calls msi_bitmap_alloc_hwirqs() to allocate an
      MSI, and gets the same hardware IRQ number that CPU 0 just freed.
      CPU 1 then calls irq_create_mapping() to get a virtual IRQ number,
      which sees that there is currently a mapping for that hardware IRQ
      number and returns the corresponding virtual IRQ number (which is
      the same virtual IRQ number that CPU 0 was using).  CPU 0 then
      calls irq_dispose_mapping() and frees that virtual IRQ number.
      Now, if another CPU comes along and calls irq_create_mapping(), it
      is likely to get the virtual IRQ number that was just freed,
      resulting in the same virtual IRQ number apparently being used for
      two different hardware interrupts.
      
      To fix this race, we just move the call to msi_bitmap_free_hwirqs()
      to after the call to irq_dispose_mapping().  Since virq_to_hw()
      doesn't work for the virtual IRQ number after irq_dispose_mapping()
      has been called, we need to call it before irq_dispose_mapping() and
      remember the result for the msi_bitmap_free_hwirqs() call.
      
      The pattern of calling msi_bitmap_free_hwirqs() before
      irq_dispose_mapping() appears in 5 places under arch/powerpc, and
      appears to have originated in commit 05af7bd2 ("[POWERPC] MPIC
      U3/U4 MSI backend") from 2007.
      
      Fixes: 05af7bd2 ("[POWERPC] MPIC U3/U4 MSI backend")
      Reported-by: default avatarAlexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: default avatarPaul Mackerras <paulus@samba.org>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      [bwh: Backported to 3.2:
       - powernv uses a private functions instead of msi_bitmap_free_hwirqs()
       - Adjust filename, context]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      96ad262f
    • Konstantin Khlebnikov's avatar
      pagemap: hide physical addresses from non-privileged users · b1fb185f
      Konstantin Khlebnikov authored
      commit 1c90308e upstream.
      
      This patch makes pagemap readable for normal users and hides physical
      addresses from them.  For some use-cases PFN isn't required at all.
      
      See http://lkml.kernel.org/r/1425935472-17949-1-git-send-email-kirill@shutemov.name
      
      Fixes: ab676b7d ("pagemap: do not leak physical addresses to non-privileged userspace")
      Signed-off-by: default avatarKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: default avatarMark Williamson <mwilliamson@undo-software.com>
      Tested-by: default avatarMark Williamson <mwilliamson@undo-software.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      [bwh: Backported to 3.2:
       - Add the same check in the places where we look up a PFN
       - Add struct pagemapread * parameters where necessary
       - Open-code file_ns_capable()
       - Delete pagemap_open() entirely, as it would always return 0]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      b1fb185f
    • Ard Biesheuvel's avatar
      ARM: 8429/1: disable GCC SRA optimization · 9d3eb706
      Ard Biesheuvel authored
      commit a077224f upstream.
      
      While working on the 32-bit ARM port of UEFI, I noticed a strange
      corruption in the kernel log. The following snprintf() statement
      (in drivers/firmware/efi/efi.c:efi_md_typeattr_format())
      
      	snprintf(pos, size, "|%3s|%2s|%2s|%2s|%3s|%2s|%2s|%2s|%2s]",
      
      was producing the following output in the log:
      
      	|    |   |   |   |    |WB|WT|WC|UC]
      	|    |   |   |   |    |WB|WT|WC|UC]
      	|    |   |   |   |    |WB|WT|WC|UC]
      	|RUN|   |   |   |    |WB|WT|WC|UC]*
      	|RUN|   |   |   |    |WB|WT|WC|UC]*
      	|    |   |   |   |    |WB|WT|WC|UC]
      	|RUN|   |   |   |    |WB|WT|WC|UC]*
      	|    |   |   |   |    |WB|WT|WC|UC]
      	|RUN|   |   |   |    |   |   |   |UC]
      	|RUN|   |   |   |    |   |   |   |UC]
      
      As it turns out, this is caused by incorrect code being emitted for
      the string() function in lib/vsprintf.c. The following code
      
      	if (!(spec.flags & LEFT)) {
      		while (len < spec.field_width--) {
      			if (buf < end)
      				*buf = ' ';
      			++buf;
      		}
      	}
      	for (i = 0; i < len; ++i) {
      		if (buf < end)
      			*buf = *s;
      		++buf; ++s;
      	}
      	while (len < spec.field_width--) {
      		if (buf < end)
      			*buf = ' ';
      		++buf;
      	}
      
      when called with len == 0, triggers an issue in the GCC SRA optimization
      pass (Scalar Replacement of Aggregates), which handles promotion of signed
      struct members incorrectly. This is a known but as yet unresolved issue.
      (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65932). In this particular
      case, it is causing the second while loop to be executed erroneously a
      single time, causing the additional space characters to be printed.
      
      So disable the optimization by passing -fno-ipa-sra.
      Acked-by: default avatarNicolas Pitre <nico@linaro.org>
      Signed-off-by: default avatarArd Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: default avatarRussell King <rmk+kernel@arm.linux.org.uk>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      9d3eb706
    • Kees Cook's avatar
      fs: create and use seq_show_option for escaping · f4a08180
      Kees Cook authored
      commit a068acf2 upstream.
      
      Many file systems that implement the show_options hook fail to correctly
      escape their output which could lead to unescaped characters (e.g.  new
      lines) leaking into /proc/mounts and /proc/[pid]/mountinfo files.  This
      could lead to confusion, spoofed entries (resulting in things like
      systemd issuing false d-bus "mount" notifications), and who knows what
      else.  This looks like it would only be the root user stepping on
      themselves, but it's possible weird things could happen in containers or
      in other situations with delegated mount privileges.
      
      Here's an example using overlay with setuid fusermount trusting the
      contents of /proc/mounts (via the /etc/mtab symlink).  Imagine the use
      of "sudo" is something more sneaky:
      
        $ BASE="ovl"
        $ MNT="$BASE/mnt"
        $ LOW="$BASE/lower"
        $ UP="$BASE/upper"
        $ WORK="$BASE/work/ 0 0
        none /proc fuse.pwn user_id=1000"
        $ mkdir -p "$LOW" "$UP" "$WORK"
        $ sudo mount -t overlay -o "lowerdir=$LOW,upperdir=$UP,workdir=$WORK" none /mnt
        $ cat /proc/mounts
        none /root/ovl/mnt overlay rw,relatime,lowerdir=ovl/lower,upperdir=ovl/upper,workdir=ovl/work/ 0 0
        none /proc fuse.pwn user_id=1000 0 0
        $ fusermount -u /proc
        $ cat /proc/mounts
        cat: /proc/mounts: No such file or directory
      
      This fixes the problem by adding new seq_show_option and
      seq_show_option_n helpers, and updating the vulnerable show_option
      handlers to use them as needed.  Some, like SELinux, need to be open
      coded due to unusual existing escape mechanisms.
      
      [akpm@linux-foundation.org: add lost chunk, per Kees]
      [keescook@chromium.org: seq_show_option should be using const parameters]
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Acked-by: default avatarSerge Hallyn <serge.hallyn@canonical.com>
      Acked-by: default avatarJan Kara <jack@suse.com>
      Acked-by: default avatarPaul Moore <paul@paul-moore.com>
      Cc: J. R. Okajima <hooanon05g@gmail.com>
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      [bwh: Backported to 3.2:
       - Drop changes to overlayfs, reiserfs
       - Drop vers option from cifs
       - ceph changes are all in one file
       - Adjust context]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      f4a08180