1. 10 Sep, 2017 6 commits
  2. 01 Sep, 2017 8 commits
  3. 05 Aug, 2017 2 commits
  4. 31 Jul, 2017 24 commits
    • ipvs: SNAT packet replies only for NATed connections · 28d8e1bc
      Sasha Levin authored
      [ Upstream commit 3c5ab3f3 ]
      
      We do not check whether a packet from the real server belongs
      to a NAT connection before performing SNAT. This causes problems
      for setups that use DR/TUN and allow local clients to
      access the real server directly, for example:
      
      - a local client in the director creates an IPVS-DR/TUN connection
      CIP->VIP and the request packets are routed to RIP.
      The exchange finishes but the IPVS connection has not expired yet.
      
      - a second local client creates a non-IPVS connection CIP->RIP
      with the same reply tuple RIP->CIP, and when replies are received
      on LOCAL_IN we wrongly assign them to the first client's
      connection because RIP->CIP matches the reply direction.
      As a result, IPVS SNATs replies for non-IPVS connections.
      
      The problem is more visible to local UDP clients, but in rare
      cases it can also happen for TCP or remote clients when the
      real server sends the reply traffic via the director.
      
      So, better to be more precise with the reply traffic:
      as replies are not expected for DR/TUN connections, it is
      better not to touch them.
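      
      A minimal sketch of the resulting check (hedged: IP_VS_FWD_METHOD()
      and IP_VS_CONN_F_MASQ are the existing IPVS helpers/flags, but the
      helper below is illustrative, not the verbatim upstream diff):
      
      	/* Only NAT (masquerading) connections expect reply traffic on
      	 * LOCAL_IN; replies for DR/TUN connections go directly to the
      	 * client, so a matching connection using any other forwarding
      	 * method must not have its "replies" SNATed. */
      	static bool ip_vs_reply_needs_snat(const struct ip_vs_conn *cp)
      	{
      		return cp && IP_VS_FWD_METHOD(cp) == IP_VS_CONN_F_MASQ;
      	}
      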
      Reported-by: Nick Moriarty <nick.moriarty@york.ac.uk>
      Tested-by: Nick Moriarty <nick.moriarty@york.ac.uk>
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Simon Horman <horms@verge.net.au>
      Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
      28d8e1bc
    • 4e8a4d30
      Sasha Levin authored
    • staging: comedi: ni_mio_common: fix E series ni_ai_insn_read() data · 947a97e2
      Ian Abbott authored
      [ Upstream commit 857a6610 ]
      
      Commit 0557344e ("staging: comedi: ni_mio_common: fix local var for
      32-bit read") changed the type of local variable `d` from `unsigned
      short` to `unsigned int` to fix a bug introduced in
      commit 9c340ac9 ("staging: comedi: ni_stc.h: add read/write
      callbacks to struct ni_private") when reading AI data for NI PCI-6110
      and PCI-6111 cards.  Unfortunately, other parts of the function rely on
      the variable being `unsigned short` when an offset value in local
      variable `signbits` is added to `d` before writing the value to the
      `data` array:
      
      		d += signbits;
      		data[n] = d;
      
      The `signbits` variable will be non-zero in bipolar mode, and is used to
      convert the hardware's 2's complement, 16-bit numbers to Comedi's
      straight binary sample format (with 0 representing the most negative
      voltage).  This breaks because `d` is now 32 bits wide instead of 16
      bits wide, so after the addition of `signbits`, `data[n]` ends up being
      set to values above 65535 for negative voltages.  This affects all
      supported "E series" cards except PCI-6143 (and PXI-6143).  Fix it by
      ANDing the value written to `data[n]` with the mask 0xffff.
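      
      The wraparound is easy to demonstrate in plain C (a self-contained
      illustration of the arithmetic, not driver code):
      
      	#include <stdio.h>
      
      	int main(void)
      	{
      		unsigned int d = 0xffff;        /* raw sample: -1 in 2's complement */
      		unsigned int signbits = 0x8000; /* bipolar offset */
      
      		/* 32-bit addition no longer wraps at 16 bits: */
      		printf("broken: %u\n", d + signbits);            /* 98303 */
      		/* masking restores the intended 16-bit result: */
      		printf("fixed:  %u\n", (d + signbits) & 0xffff); /* 32767 */
      		return 0;
      	}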
      
      Fixes: 0557344e ("staging: comedi: ni_mio_common: fix local var for 32-bit read")
      Signed-off-by: Ian Abbott <abbotti@mev.co.uk>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
      947a97e2
    • kvm: vmx: Do not disable intercepts for BNDCFGS · 2717d19c
      Jim Mattson authored
      [ Upstream commit a8b6fda3 ]
      
      The MSR permission bitmaps are shared by all VMs. However, some VMs
      may not be configured to support MPX, even when the host does. If the
      host supports MPX and the guest does not, we should intercept accesses
      to the BNDCFGS MSR, so that we can synthesize a #GP
      fault. Furthermore, if the host does not support MPX and the
      "ignore_msrs" kvm kernel parameter is set, then we should intercept
      accesses to the BNDCFGS MSR, so that we can skip over the rdmsr/wrmsr
      without raising a #GP fault.
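      
      A sketch of the resulting policy (hedged: the helper name is ours;
      guest_cpuid_has_mpx() and kvm_mpx_supported() are our reading of the
      vmx code of that era, not the verbatim patch):
      
      	/* Keep intercepting BNDCFGS unless both guest and host support
      	 * MPX; interception is what lets KVM synthesize a #GP for
      	 * MPX-less guests and honor ignore_msrs on MPX-less hosts. */
      	static bool bndcfgs_intercept_needed(struct kvm_vcpu *vcpu)
      	{
      		return !(guest_cpuid_has_mpx(vcpu) && kvm_mpx_supported());
      	}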
      
      Fixes: da8999d3 ("KVM: x86: Intel MPX vmx and msr handle")
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
      2717d19c
    • tracing: Use SOFTIRQ_OFFSET for softirq detection for more accurate results · e532444e
      Pavankumar Kondeti authored
      [ Upstream commit c59f29cb ]
      
      The 's' flag is supposed to indicate that a softirq is running. This
      can be detected by testing the preempt_count with SOFTIRQ_OFFSET.
      
      The current code tests the preempt_count with SOFTIRQ_MASK, which
      would be true even when softirqs are disabled but not serving a
      softirq.
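      
      Schematically, in tracing_generic_entry_update() terms (hedged
      excerpt):
      
      	/* before: set whenever softirqs are merely disabled */
      	((pc & SOFTIRQ_MASK)   ? TRACE_FLAG_SOFTIRQ : 0)
      	/* after: set only while actually serving a softirq */
      	((pc & SOFTIRQ_OFFSET) ? TRACE_FLAG_SOFTIRQ : 0)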
      
      Link: http://lkml.kernel.org/r/1481300417-3564-1-git-send-email-pkondeti@codeaurora.org
      Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
      e532444e
    • PM / QoS: return -EINVAL for bogus strings · cf86f4f6
      Dan Carpenter authored
      [ Upstream commit 2ca30331 ]
      
      In the current code, if the user accidentally writes a bogus command to
      this sysfs file, then we set the latency tolerance to an uninitialized
      variable.
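      
      A hedged sketch of the corrected store path (the keyword constants
      exist in pm_qos, but this is a simplified rendering, not the exact
      patch):
      
      	s32 value;
      
      	if (kstrtos32(buf, 0, &value) == 0) {
      		/* numeric latency tolerance, use it as-is */
      	} else if (sysfs_streq(buf, "auto")) {
      		value = PM_QOS_LATENCY_TOLERANCE_NO_CONSTRAINT;
      	} else if (sysfs_streq(buf, "any")) {
      		value = PM_QOS_LATENCY_ANY;
      	} else {
      		/* bogus string: fail instead of using an
      		 * uninitialized value */
      		return -EINVAL;
      	}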
      
      Fixes: 2d984ad1 (PM / QoS: Introcuce latency tolerance device PM QoS type)
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Acked-by: Pavel Machek <pavel@ucw.cz>
      Cc: 3.15+ <stable@vger.kernel.org> # 3.15+
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
      cf86f4f6
    • sched/topology: Optimize build_group_mask() · 94987721
      Lauro Ramos Venancio authored
      [ Upstream commit f32d782e ]
      
      The group mask is always used in intersection with the group CPUs. So,
      when building the group mask, we don't have to care about CPUs that are
      not part of the group.
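      
      In build_group_mask() terms the change amounts to iterating the
      group's own CPUs rather than the whole domain span (hedged sketch,
      not the verbatim diff):
      
      	const struct cpumask *sg_span = sched_group_cpus(sg);
      
      	for_each_cpu(i, sg_span) {	/* was: for_each_cpu(i, span) */
      		sibling = *per_cpu_ptr(sdd->sd, i);
      		if (!cpumask_test_cpu(i, sched_domain_span(sibling)))
      			continue;
      		cpumask_set_cpu(i, sched_group_mask(sg));
      	}
      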
      Signed-off-by: Lauro Ramos Venancio <lvenanci@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: lwang@redhat.com
      Cc: riel@redhat.com
      Link: http://lkml.kernel.org/r/1492717903-5195-2-git-send-email-lvenanci@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
      94987721
    • sched/topology: Fix overlapping sched_group_mask · 87de53ef
      Peter Zijlstra authored
      [ Upstream commit 73bb059f ]
      
      The point of sched_group_mask is to select those CPUs from
      sched_group_cpus that can actually arrive at this balance domain.
      
      The current code gets it wrong, as can be readily demonstrated with a
      topology like:
      
        node   0   1   2   3
          0:  10  20  30  20
          1:  20  10  20  30
          2:  30  20  10  20
          3:  20  30  20  10
      
      Where (for example) domain 1 on CPU1 ends up with a mask that includes
      CPU0:
      
        [] CPU1 attaching sched-domain:
        []  domain 0: span 0-2 level NUMA
        []   groups: 1 (mask: 1), 2, 0
        []   domain 1: span 0-3 level NUMA
        []    groups: 0-2 (mask: 0-2) (cpu_capacity: 3072), 0,2-3 (cpu_capacity: 3072)
      
      This causes sched_balance_cpu() to compute the wrong CPU and
      consequently should_we_balance() will terminate early resulting in
      missed load-balance opportunities.
      
      The fixed topology looks like:
      
        [] CPU1 attaching sched-domain:
        []  domain 0: span 0-2 level NUMA
        []   groups: 1 (mask: 1), 2, 0
        []   domain 1: span 0-3 level NUMA
        []    groups: 0-2 (mask: 1) (cpu_capacity: 3072), 0,2-3 (cpu_capacity: 3072)
      
      (note: this relies on OVERLAP domains always having children; this is
       true because the regular topology domains are still here -- this is
       before degenerate trimming)
      Debugged-by: Lauro Ramos Venancio <lvenanci@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Cc: stable@vger.kernel.org
      Fixes: e3589f6c ("sched: Allow for overlapping sched_domain spans")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
      87de53ef
    • crypto: caam - fix signals handling · 36feca8c
      Horia Geantă authored
      [ Upstream commit 7459e1d2 ]
      
      The driver does not properly handle the case where signals interrupt
      wait_for_completion_interruptible():
      - it does not check the return value;
      - the completion structure is allocated on the stack, so if a signal
      interrupts the sleep it goes out of scope, causing the worker thread
      (caam_jr_dequeue) to fail when it accesses it.
      
      wait_for_completion_interruptible() is replaced with the
      uninterruptible wait_for_completion().
      We choose to block all signals while waiting for I/O (device executing
      the split key generation job descriptor) since the alternative - in
      order to have a deterministic device state - would be to flush the job
      ring (aborting *all* in-progress jobs).
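      
      The resulting pattern, condensed (a sketch of the key-generation
      path, not the full driver change):
      
      	struct completion done;	/* on the stack */
      
      	init_completion(&done);
      	/* ... enqueue the job descriptor; the job ring interrupt
      	 * handler completes 'done' from caam_jr_dequeue() ... */
      
      	/* Uninterruptible: 'done' must stay in scope until the
      	 * callback runs, so a signal must not abort the wait. */
      	wait_for_completion(&done);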
      
      Cc: <stable@vger.kernel.org>
      Fixes: 045e3678 ("crypto: caam - ahash hmac support")
      Fixes: 4c1ec1f9 ("crypto: caam - refactor key_gen, sg")
      Signed-off-by: Horia Geantă <horia.geanta@nxp.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
      36feca8c
    • crypto: atmel - only treat EBUSY as transient if backlog · f04dc46b
      Gilad Ben-Yossef authored
      [ Upstream commit 1606043f ]
      
      The Atmel SHA driver was treating -EBUSY as an indication of queueing
      to the backlog without checking that backlog is enabled for the request.
      
      Fix it by checking request flags.
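      
      The shape of the fix (hedged sketch; CRYPTO_TFM_REQ_MAY_BACKLOG is
      the real crypto API flag, the surrounding code is condensed):
      
      	if (err == -EBUSY) {
      		/* -EBUSY means "queued to backlog" only when the
      		 * caller allowed backlogging; otherwise it is a
      		 * plain failure and must be returned. */
      		if (!(req->base.flags & CRYPTO_TFM_REQ_MAY_BACKLOG))
      			return err;
      		/* backlogged: wait for completion as before */
      	}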
      
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
      f04dc46b
    • crypto: talitos - Extend max key length for SHA384/512-HMAC and AEAD · 05398f6d
      Martin Hicks authored
      [ Upstream commit 03d2c511 ]
      
      An updated patch that also handles the additional key length requirements
      for the AEAD algorithms.
      
      The max keysize is not 96.  For SHA384/512 it's 128, and for the AEAD
      algorithms it's longer still.  Extend the max keysize to cover the
      largest AEAD case, AES256 + HMAC(SHA512).
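      
      The new bound, roughly (hedged; the constants are the standard
      crypto header values):
      
      	/* big enough for an AES-256 key plus an HMAC(SHA512) key
      	 * padded to the SHA-512 block size */
      	#define TALITOS_MAX_KEY_SIZE	(AES_MAX_KEY_SIZE + SHA512_BLOCK_SIZE)
      	/* = 32 + 128 = 160 bytes, up from the old 96 */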
      
      Cc: <stable@vger.kernel.org> # 3.6+
      Fixes: 357fb605 ("crypto: talitos - add sha224, sha384 and sha512 to existing AEAD algorithms")
      Signed-off-by: Martin Hicks <mort@bork.org>
      Acked-by: Horia Geantă <horia.geanta@nxp.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
      05398f6d
    • Add "shutdown" to "struct class". · 570bc1b5
      Josh Zimmerman authored
      [ Upstream commit f77af151 ]
      
      The TPM class has some common shutdown code that must be executed for
      all drivers. This adds some needed functionality for that.
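      
      In outline (a hedged sketch of the hook and of how device_shutdown()
      might consult it, not the exact diff):
      
      	/* new member on struct class: */
      	void (*shutdown)(struct device *dev);
      
      	/* device_shutdown(), alongside the existing hooks: */
      	if (dev->class && dev->class->shutdown)
      		dev->class->shutdown(dev);
      	else if (dev->bus && dev->bus->shutdown)
      		dev->bus->shutdown(dev);
      	else if (dev->driver && dev->driver->shutdown)
      		dev->driver->shutdown(dev);
      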
      Signed-off-by: Josh Zimmerman <joshz@google.com>
      Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: stable@vger.kernel.org
      Fixes: 74d6b3ce ("tpm: fix suspend/resume paths for TPM 2.0")
      Reviewed-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
      Tested-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
      Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
      Signed-off-by: James Morris <james.l.morris@oracle.com>
      Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
      570bc1b5
    • mnt: Make propagate_umount less slow for overlapping mount propagation trees · 295f2370
      Eric W. Biederman authored
      [ Upstream commit 296990de ]
      
      Andrei Vagin pointed out that the time to execute propagate_umount can
      go non-linear (and take a ludicrous amount of time) when the mount
      propagation trees of the mounts to be unmounted by a lazy unmount
      overlap.
      
      Make the walk of the mount propagation trees nearly linear by
      remembering which mounts have already been visited, allowing
      subsequent walks to detect when walking a mount propagation tree or a
      subtree of a mount propagation tree would be duplicate work and to
      skip them entirely.
      
      Walk the list of mounts whose propagation trees need to be traversed
      from the mount highest in the mount tree to mounts lower in the mount
      tree so that the odds are higher that the code will walk the largest
      trees first, allowing later tree walks to be skipped entirely.
      
      Add cleanup_umount_visitation to remove the code's memory of which
      mounts have been visited.
      
      Add the functions last_slave and skip_propagation_subtree to allow
      skipping appropriate parts of the mount propagation tree without
      needing to change the logic of the rest of the code.
      
      A script to generate overlapping mount propagation trees:
      
      $ cat run.sh
      set -e
      mount -t tmpfs zdtm /mnt
      mkdir -p /mnt/1 /mnt/2
      mount -t tmpfs zdtm /mnt/1
      mount --make-shared /mnt/1
      mkdir /mnt/1/1
      
      iteration=10
      if [ -n "$1" ] ; then
      	iteration=$1
      fi
      
      for i in $(seq $iteration); do
      	mount --bind /mnt/1/1 /mnt/1/1
      done
      
      mount --rbind /mnt/1 /mnt/2
      
      TIMEFORMAT='%Rs'
      nr=$(( ( 2 ** ( $iteration + 1 ) ) + 1 ))
      echo -n "umount -l /mnt/1 -> $nr        "
      time umount -l /mnt/1
      
      nr=$(cat /proc/self/mountinfo | grep zdtm | wc -l )
      time umount -l /mnt/2
      
      $ for i in $(seq 9 19); do echo $i; unshare -Urm bash ./run.sh $i; done
      
      Here are the performance numbers with and without the patch:
      
           mhash |  8192   |  8192  | 1048576 | 1048576
          mounts | before  | after  |  before | after
          ------------------------------------------------
            1025 |  0.040s | 0.016s |  0.038s | 0.019s
            2049 |  0.094s | 0.017s |  0.080s | 0.018s
            4097 |  0.243s | 0.019s |  0.206s | 0.023s
            8193 |  1.202s | 0.028s |  1.562s | 0.032s
           16385 |  9.635s | 0.036s |  9.952s | 0.041s
           32769 | 60.928s | 0.063s | 44.321s | 0.064s
           65537 |         | 0.097s |         | 0.097s
          131073 |         | 0.233s |         | 0.176s
          262145 |         | 0.653s |         | 0.344s
          524289 |         | 2.305s |         | 0.735s
         1048577 |         | 7.107s |         | 2.603s
      
      Andrei Vagin reports that fixing the performance problem is part of
      the work to fix CVE-2016-6213.
      
      Cc: stable@vger.kernel.org
      Fixes: a05964f3 ("[PATCH] shared mounts handling: umount")
      Reported-by: Andrei Vagin <avagin@openvz.org>
      Reviewed-by: Andrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
      295f2370
    • mnt: In propagate_umount handle visiting mounts in any order · dc5ce95b
      Eric W. Biederman authored
      [ Upstream commit 99b19d16 ]
      
      While investigating some poor umount performance I realized that in
      the case of overlapping mount trees, where some of the mounts are
      locked, the code has been failing to unmount all of the mounts it
      should have been unmounting.
      
      This failure to unmount all of the necessary mounts can be
      reproduced with:
      
      $ cat locked_mounts_test.sh
      
      mount -t tmpfs test-base /mnt
      mount --make-shared /mnt
      mkdir -p /mnt/b
      
      mount -t tmpfs test1 /mnt/b
      mount --make-shared /mnt/b
      mkdir -p /mnt/b/10
      
      mount -t tmpfs test2 /mnt/b/10
      mount --make-shared /mnt/b/10
      mkdir -p /mnt/b/10/20
      
      mount --rbind /mnt/b /mnt/b/10/20
      
      unshare -Urm --propagation unchanged /bin/sh -c 'sleep 5; if [ $(grep test /proc/self/mountinfo | wc -l) -eq 1 ] ; then echo SUCCESS ; else echo FAILURE ; fi' &
      sleep 1
      umount -l /mnt/b
      wait %%
      
      $ unshare -Urm ./locked_mounts_test.sh
      
      This failure is corrected by removing the prepass that marks mounts
      that may be umounted.
      
      A first pass is added that umounts mounts if possible and, if not,
      sets the mount mark if they could have been unmounted were they not
      locked, and adds them to a list of umount possibilities.  This first
      pass reconsiders a mount's parent if the parent is on the list of
      umount possibilities, ensuring that information about unmountability
      will pass from child to parent.
      
      A second pass then walks through all mounts that are umounted and
      processes their children, unmounting them or marking them for
      reparenting.
      
      A last pass cleans up the state on the mounts that could not be umounted
      and if applicable reparents them to their first parent that remained
      mounted.
      
      While a bit longer than the old code, this code is much more robust,
      as it allows information to flow up from the leaves and down
      from the trunk, making the order in which mounts are encountered
      in the umount propagation tree irrelevant.
      
      Cc: stable@vger.kernel.org
      Fixes: 0c56fe31 ("mnt: Don't propagate unmounts to locked mounts")
      Reviewed-by: Andrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
      dc5ce95b
    • mnt: In umount propagation reparent in a separate pass · 8a012b92
      Eric W. Biederman authored
      [ Upstream commit 570487d3 ]
      
      It was observed that in some pathological cases the current code
      does not unmount everything it should.  After investigation it
      was determined that the issue is that mnt_change_mountpoint can
      change which mounts are available to be unmounted during mount
      propagation, which is wrong.
      
      The trivial reproducer is:
      $ cat ./pathological.sh
      
      mount -t tmpfs test-base /mnt
      cd /mnt
      mkdir 1 2 1/1
      mount --bind 1 1
      mount --make-shared 1
      mount --bind 1 2
      mount --bind 1/1 1/1
      mount --bind 1/1 1/1
      echo
      grep test-base /proc/self/mountinfo
      umount 1/1
      echo
      grep test-base /proc/self/mountinfo
      
      $ unshare -Urm ./pathological.sh
      
      The expected output looks like:
      46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000
      47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      49 54 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      50 53 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      51 49 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      54 47 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      53 48 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      52 50 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      
      46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000
      47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      
      The output without the fix looks like:
      46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000
      47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      49 54 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      50 53 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      51 49 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      54 47 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      53 48 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      52 50 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      
      46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000
      47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      52 48 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      
      That last mount in the output was in the propagation tree to be
      unmounted, but was missed because mnt_change_mountpoint changed its
      parent before the walk through the mount propagation tree observed it.
      
      Cc: stable@vger.kernel.org
      Fixes: 1064f874 ("mnt: Tuck mounts under others instead of creating shadow/side mounts.")
      Acked-by: Andrei Vagin <avagin@virtuozzo.com>
      Reviewed-by: Ram Pai <linuxram@us.ibm.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
      8a012b92
    • vt: fix unchecked __put_user() in tioclinux ioctls · e3400a3c
      Adam Borowski authored
      [ Upstream commit 6987dc8a ]
      
      Only read access is checked before this call.
      
      Actually, at the moment this is not an issue, as every in-tree arch does
      the same manual checks for VERIFY_READ vs VERIFY_WRITE, relying on the MMU
      to tell them apart, but this wasn't the case in the past and may happen
      again on some odd arch in the future.
      
      If anyone cares about 3.7 and earlier, this is a security hole (untested)
      on real 80386 CPUs.
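      
      The fix pattern (hedged sketch): use the checking variant so the
      write-side access_ok() happens even where the ioctl only validated
      the buffer for read:
      
      	/* __put_user() skips access_ok(); put_user() performs its
      	 * own check before writing to userspace. */
      	if (put_user(data, p))
      		return -EFAULT;
      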
      Signed-off-by: Adam Borowski <kilobyte@angband.pl>
      CC: stable@vger.kernel.org # v3.7-
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
      e3400a3c
    • exec: Limit arg stack to at most 75% of _STK_LIM · d5e990d9
      Kees Cook authored
      [ Upstream commit da029c11 ]
      
      To avoid pathological stack usage or the need to special-case setuid
      execs, just limit all arg stack usage to at most 75% of _STK_LIM (6MB).
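      
      The bound described above, sketched (hedged; close in spirit to the
      get_arg_page() change, not the verbatim diff):
      
      	/* Whatever RLIMIT_STACK says, argv+envp may consume at most
      	 * 75% of _STK_LIM (8 MiB), i.e. 6 MiB. */
      	unsigned long limit = _STK_LIM / 4 * 3;
      
      	limit = min(limit, rlimit(RLIMIT_STACK) / 4);
      	if (size > limit)
      		goto fail;
      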
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
      d5e990d9
    • s390: reduce ELF_ET_DYN_BASE · 8c14bc2e
      Kees Cook authored
      [ Upstream commit a73dc537 ]
      
      Now that explicitly executed loaders are loaded in the mmap region, we
      have more freedom to decide where we position PIE binaries in the
      address space to avoid possible collisions with mmap or stack regions.
      
      For 64-bit, align to 4GB to allow runtimes to use the entire 32-bit
      address space for 32-bit pointers.  On 32-bit use 4MB, which is the
      traditional x86 minimum load location, likely to avoid historically
      requiring a 4MB page table entry when only a portion of the first 4MB
      would be used (since the NULL address is avoided).  For s390 the
      position could be 0x10000, but that is needlessly close to the NULL
      address.
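      
      The resulting definition, approximately (hedged sketch of the s390
      arch header):
      
      	#define ELF_ET_DYN_BASE	(is_compat_task() ? 0x000400000UL : \
      					            0x100000000UL)
      	/* 4 MiB for compat (31-bit) tasks, 4 GiB for 64-bit */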
      
      Link: http://lkml.kernel.org/r/1498154792-49952-5-git-send-email-keescook@chromium.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Pratyush Anand <panand@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
      8c14bc2e
    • powerpc: move ELF_ET_DYN_BASE to 4GB / 4MB · cd5425da
      Kees Cook authored
      [ Upstream commit 47ebb09d ]
      
      Now that explicitly executed loaders are loaded in the mmap region, we
      have more freedom to decide where we position PIE binaries in the
      address space to avoid possible collisions with mmap or stack regions.
      
      For 64-bit, align to 4GB to allow runtimes to use the entire 32-bit
      address space for 32-bit pointers.  On 32-bit use 4MB, which is the
      traditional x86 minimum load location, likely to avoid historically
      requiring a 4MB page table entry when only a portion of the first 4MB
      would be used (since the NULL address is avoided).
      
      Link: http://lkml.kernel.org/r/1498154792-49952-4-git-send-email-keescook@chromium.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Tested-by: Michael Ellerman <mpe@ellerman.id.au>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Pratyush Anand <panand@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
      cd5425da
    • arm: move ELF_ET_DYN_BASE to 4MB · af58b2cc
      Kees Cook authored
      [ Upstream commit 6a9af90a ]
      
      Now that explicitly executed loaders are loaded in the mmap region, we
      have more freedom to decide where we position PIE binaries in the
      address space to avoid possible collisions with mmap or stack regions.
      
      4MB is chosen here mainly to have parity with x86, where this is the
      traditional minimum load location, likely to avoid historically
      requiring a 4MB page table entry when only a portion of the first 4MB
      would be used (since the NULL address is avoided).
      
      For ARM the position could be 0x8000, the standard ET_EXEC load address,
      but that is needlessly close to the NULL address, and anyone running PIE
      on 32-bit ARM will have an MMU, so the tight mapping is not needed.
      
      Link: http://lkml.kernel.org/r/1498154792-49952-2-git-send-email-keescook@chromium.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Pratyush Anand <panand@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Daniel Micay <danielmicay@gmail.com>
      Cc: Dmitry Safonov <dsafonov@virtuozzo.com>
      Cc: Grzegorz Andrejczuk <grzegorz.andrejczuk@intel.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: Qualys Security Advisory <qsa@qualys.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
      af58b2cc
    • binfmt_elf: use ELF_ET_DYN_BASE only for PIE · 5bb3ce64
      Kees Cook authored
      [ Upstream commit eab09532 ]
      
      The ELF_ET_DYN_BASE position was originally intended to keep loaders
      away from ET_EXEC binaries.  (For example, running "/lib/ld-linux.so.2
      /bin/cat" might cause the subsequent load of /bin/cat into where the
      loader had been loaded.)
      
      With the advent of PIE (ET_DYN binaries with an INTERP Program Header),
      ELF_ET_DYN_BASE continued to be used since the kernel was only looking
      at ET_DYN.  However, since ELF_ET_DYN_BASE is traditionally set at the
      top 1/3rd of the TASK_SIZE, a substantial portion of the address space
      is unused.
      
      For 32-bit tasks when RLIMIT_STACK is set to RLIM_INFINITY, programs are
      loaded above the mmap region.  This means they can be made to collide
      (CVE-2017-1000370) or nearly collide (CVE-2017-1000371) with
      pathological stack regions.
      
      Lowering ELF_ET_DYN_BASE solves both by moving programs below the mmap
      region in all cases, and will now additionally avoid programs falling
      back to the mmap region by enforcing MAP_FIXED for program loads (i.e.
      if it would have collided with the stack, now it will fail to load
      instead of falling back to the mmap region).
      
      To allow for a lower ELF_ET_DYN_BASE, loaders (ET_DYN without INTERP)
      are loaded into the mmap region, leaving space available for either an
      ET_EXEC binary with a fixed location or PIE being loaded into mmap by
      the loader.  Only PIE programs are loaded offset from ELF_ET_DYN_BASE,
      which means architectures can now safely lower their values without risk
      of loaders colliding with their subsequently loaded programs.
      
      For 64-bit, ELF_ET_DYN_BASE is best set to 4GB to allow runtimes to use
      the entire 32-bit address space for 32-bit pointers.
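      
      A hedged sketch of the resulting placement logic in load_elf_binary()
      (simplified; the real code also folds in ASLR):
      
      	if (elf_ex->e_type == ET_DYN) {
      		if (interpreter) {
      			/* PIE (ET_DYN with PT_INTERP): place at
      			 * ELF_ET_DYN_BASE and insist on it, instead
      			 * of falling back into the mmap region. */
      			load_bias = ELF_ET_DYN_BASE;
      			elf_flags |= MAP_FIXED;
      		} else {
      			/* bare loader (ET_DYN without PT_INTERP):
      			 * let mmap place it in the mmap region */
      			load_bias = 0;
      		}
      	}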
      
      Thanks to PaX Team, Daniel Micay, and Rik van Riel for inspiration and
      suggestions on how to implement this solution.
      
      Fixes: d1fd836d ("mm: split ET_DYN ASLR from mmap ASLR")
      Link: http://lkml.kernel.org/r/20170621173201.GA114489@beast
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Daniel Micay <danielmicay@gmail.com>
      Cc: Qualys Security Advisory <qsa@qualys.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dmitry Safonov <dsafonov@virtuozzo.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Grzegorz Andrejczuk <grzegorz.andrejczuk@intel.com>
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Pratyush Anand <panand@redhat.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
      5bb3ce64
    • checkpatch: silence perl 5.26.0 unescaped left brace warnings · 31d1fc95
      Cyril Bur authored
      [ Upstream commit 8d81ae05 ]
      
      As of perl 5, version 26, subversion 0 (v5.26.0), some new warnings
      appear when running checkpatch.
      
      Unescaped left brace in regex is deprecated here (and will be fatal in
      Perl 5.30), passed through in regex; marked by <-- HERE in m/^(.\s*){
      <-- HERE \s*/ at scripts/checkpatch.pl line 3544.
      
      Unescaped left brace in regex is deprecated here (and will be fatal in
      Perl 5.30), passed through in regex; marked by <-- HERE in m/^(.\s*){
      <-- HERE \s*/ at scripts/checkpatch.pl line 3885.
      
      Unescaped left brace in regex is deprecated here (and will be fatal in
      Perl 5.30), passed through in regex; marked by <-- HERE in
      m/^(\+.*(?:do|\))){ <-- HERE / at scripts/checkpatch.pl line 4374.
      
      It seems perfectly reasonable to do as the warning suggests and simply
      escape the left brace in these three locations.
      
      Link: http://lkml.kernel.org/r/20170607060135.17384-1-cyrilbur@gmail.com
      Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
      Acked-by: Joe Perches <joe@perches.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
      31d1fc95
    • fs/dcache.c: fix spin lockup issue on nlru->lock · aebd8abe
      Sahitya Tummala authored
      [ Upstream commit b17c070f ]
      
      __list_lru_walk_one() acquires the nlru spin lock (nlru->lock) for a
      long duration if there are many items in the lru list.  As per the
      current code, it can hold the spin lock for up to UINT_MAX entries at
      a time.  So if there are many items in the lru list, "BUG: spinlock
      lockup suspected" is observed in the below path:
      
        spin_bug+0x90
        do_raw_spin_lock+0xfc
        _raw_spin_lock+0x28
        list_lru_add+0x28
        dput+0x1c8
        path_put+0x20
        terminate_walk+0x3c
        path_lookupat+0x100
        filename_lookup+0x6c
        user_path_at_empty+0x54
        SyS_faccessat+0xd0
        el0_svc_naked+0x24
      
      This nlru->lock is acquired by another CPU in this path -
      
        d_lru_shrink_move+0x34
        dentry_lru_isolate_shrink+0x48
        __list_lru_walk_one.isra.10+0x94
        list_lru_walk_node+0x40
        shrink_dcache_sb+0x60
        do_remount_sb+0xbc
        do_emergency_remount+0xb0
        process_one_work+0x228
        worker_thread+0x2e0
        kthread+0xf4
        ret_from_fork+0x10
      
      Fix this lockup by reducing the number of entries to be shrunk from
      the lru list to 1024 at once.  Also, add cond_resched() before
      processing the lru list again.
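      
      Close in spirit to the shrink_dcache_sb() change (hedged sketch):
      
      	long freed;
      
      	do {
      		LIST_HEAD(dispose);
      
      		/* take nlru->lock for at most 1024 entries ... */
      		freed = list_lru_walk(&sb->s_dentry_lru,
      				      dentry_lru_isolate_shrink,
      				      &dispose, 1024);
      		shrink_dentry_list(&dispose);
      		/* ... and breathe before the next batch */
      		cond_resched();
      	} while (freed > 0);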
      
      Link: http://marc.info/?t=149722864900001&r=1&w=2
      Link: http://lkml.kernel.org/r/1498707575-2472-1-git-send-email-stummala@codeaurora.org
      Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
      Suggested-by: Jan Kara <jack@suse.cz>
      Suggested-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Alexander Polakov <apolyakov@beget.ru>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
      aebd8abe
    • mm/list_lru.c: fix list_lru_count_node() to be race free · eb8e478e
      Sahitya Tummala authored
      [ Upstream commit 2c80cd57 ]
      
      list_lru_count_node() iterates over all memcgs to get the total
      number of entries on the node, but it can race with
      memcg_drain_all_list_lrus(), which migrates the entries from a dead
      cgroup to another.  This can make list_lru_count_node() return an
      incorrect number of entries.
      
      Fix this by keeping track of entries per node and simply return it in
      list_lru_count_node().
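      
      The per-node counter approach, sketched (hedged; field and locking
      details simplified):
      
      	/* list_lru_add()/list_lru_del() adjust nlru->nr_items under
      	 * nlru->lock, so the count no longer iterates per-memcg lists
      	 * that memcg_drain_all_list_lrus() may be migrating. */
      	unsigned long list_lru_count_node(struct list_lru *lru, int nid)
      	{
      		return lru->node[nid].nr_items;
      	}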
      
      Link: http://lkml.kernel.org/r/1498707555-30525-1-git-send-email-stummala@codeaurora.org
      Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Alexander Polakov <apolyakov@beget.ru>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
      eb8e478e