1. 11 Nov, 2020 12 commits
    • fuse: setattr should set FATTR_KILL_SUIDGID · 31792161
      Vivek Goyal authored
      If fc->handle_killpriv_v2 is enabled, we expect the file server to clear
      suid/sgid/security.capability upon chown/truncate/write as appropriate.
      
      Upon truncate (ATTR_SIZE), suid/sgid are cleared only if caller does not
      have CAP_FSETID.  File server does not know whether caller has CAP_FSETID
      or not.  Hence set FATTR_KILL_SUIDGID upon truncate to let file server know
      that caller does not have CAP_FSETID and it should kill suid/sgid as
      appropriate.
      
      On chown (ATTR_UID/ATTR_GID) suid/sgid need to be cleared irrespective of
      capabilities of calling process, so set FATTR_KILL_SUIDGID unconditionally
      in that case.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
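The rule described in the setattr commit above can be sketched in userspace C. This is a minimal sketch, not the kernel code: the `SK_ATTR_*` bit values and the function name `setattr_wants_kill_suidgid` are hypothetical stand-ins, and the caller's CAP_FSETID status is passed in as a plain boolean instead of being queried from the kernel.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit values only (assumption); the real ATTR_* values
 * are defined in the kernel headers. */
#define SK_ATTR_UID  (1u << 0)
#define SK_ATTR_GID  (1u << 1)
#define SK_ATTR_SIZE (1u << 2)

/*
 * Decide whether a setattr request should carry FATTR_KILL_SUIDGID,
 * following the commit message:
 *  - only relevant when handle_killpriv_v2 is negotiated,
 *  - chown (UID/GID change): always,
 *  - truncate (SIZE change): only if the caller lacks CAP_FSETID.
 */
bool setattr_wants_kill_suidgid(bool handle_killpriv_v2,
                                uint32_t ia_valid,
                                bool has_cap_fsetid)
{
    if (!handle_killpriv_v2)
        return false;
    if (ia_valid & (SK_ATTR_UID | SK_ATTR_GID))
        return true;
    if ((ia_valid & SK_ATTR_SIZE) && !has_cap_fsetid)
        return true;
    return false;
}
```

For example, a truncate by a caller that holds CAP_FSETID yields `false`, so the server would leave suid/sgid alone in that case.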
    • fuse: set FUSE_WRITE_KILL_SUIDGID in cached write path · b8667395
      Vivek Goyal authored
      With HANDLE_KILLPRIV_V2, server will need to kill suid/sgid if caller does
      not have CAP_FSETID.  We already have a flag FUSE_WRITE_KILL_SUIDGID in
      WRITE request and we already set it in direct I/O path.
      
      To make it work in cached write path also, start setting
      FUSE_WRITE_KILL_SUIDGID in this path too.
      
      Set it only if fc->handle_killpriv_v2 is set.  Otherwise the client is
      responsible for killing suid/sgid.
      
      In case of direct I/O we set FUSE_WRITE_KILL_SUIDGID unconditionally
      because we don't call file_remove_privs() in that path (with cache=none
      option).
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • fuse: rename FUSE_WRITE_KILL_PRIV to FUSE_WRITE_KILL_SUIDGID · 10c52c84
      Miklos Szeredi authored
      Kernel has:
      ATTR_KILL_PRIV -> clear "security.capability"
      ATTR_KILL_SUID -> clear S_ISUID
      ATTR_KILL_SGID -> clear S_ISGID if executable
      
      Fuse has:
      FUSE_WRITE_KILL_PRIV -> clear S_ISUID and S_ISGID if executable
      
      So FUSE_WRITE_KILL_PRIV implies the complement of ATTR_KILL_PRIV, which is
      somewhat confusing.  Also PRIV implies all privileges, including
      "security.capability".
      
      Change the name to FUSE_WRITE_KILL_SUIDGID and make FUSE_WRITE_KILL_PRIV an
      alias to preserve API compatibility.
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
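The rename-with-alias approach from the commit above can be illustrated with a small preprocessor sketch. The bit value shown is an assumption for illustration, and `legacy_server_flags` and `write_kills_suidgid` are hypothetical helpers, not functions from the kernel or libfuse.

```c
#include <stdint.h>

/* Illustrative bit value (assumption); the real definition lives in
 * the FUSE uapi header. */
#define FUSE_WRITE_KILL_SUIDGID (1u << 2)

/* Old name kept as an alias so code written against it still compiles
 * and still means exactly the same bit. */
#define FUSE_WRITE_KILL_PRIV FUSE_WRITE_KILL_SUIDGID

/* Hypothetical older code that still uses the previous name. */
uint32_t legacy_server_flags(void)
{
    return FUSE_WRITE_KILL_PRIV;
}

/* Hypothetical newer code that tests the renamed flag. */
int write_kills_suidgid(uint32_t write_flags)
{
    return (write_flags & FUSE_WRITE_KILL_SUIDGID) != 0;
}
```

Because the alias expands to the new macro, old and new users agree on the bit without any runtime translation.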
    • fuse: introduce the notion of FUSE_HANDLE_KILLPRIV_V2 · 63f9909f
      Vivek Goyal authored
      We already have the FUSE_HANDLE_KILLPRIV flag, which says that the file
      server will remove suid/sgid/caps on truncate/chown/write. But that is a
      little different from what the Linux VFS implements.
      
      To be consistent with Linux VFS behavior, what we want is:
      
      - caps are always cleared on chown/write/truncate
      - suid is always cleared on chown, while for truncate/write it is cleared
        only if caller does not have CAP_FSETID.
      - sgid is always cleared on chown, while for truncate/write it is cleared
        only if caller does not have CAP_FSETID as well as file has group execute
        permission.
      
      As the previous flag did not provide the above semantics, implement a V2 of
      the protocol with the above constraints.
      
      The server does not know whether the caller has CAP_FSETID. So for
      write()/truncate(), the client will send information in a special flag to
      indicate whether to kill privileges or not. These changes are in subsequent
      patches.
      
      FUSE_HANDLE_KILLPRIV_V2 relies on WRITE being sent to the server to clear
      suid/sgid/security.capability. But with ->writeback_cache, WRITEs are
      cached in the guest. So it is not recommended to use FUSE_HANDLE_KILLPRIV_V2
      and writeback_cache together, though it is probably good enough
      for a lot of use cases.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
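The suid/sgid rules listed in the FUSE_HANDLE_KILLPRIV_V2 commit above can be written as a small mode-bit function. This is a sketch under stated assumptions: `killpriv_v2_mode` and `enum sketch_op` are hypothetical names, CAP_FSETID is modeled as a boolean argument, and clearing of security.capability (an xattr, not a mode bit) is not modeled.

```c
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical operation tag for this sketch. */
enum sketch_op { SKETCH_CHOWN, SKETCH_TRUNCATE, SKETCH_WRITE };

/*
 * Return the file mode after applying the rules from the commit text:
 *  - chown: suid and sgid are always cleared,
 *  - truncate/write: suid is cleared only if the caller lacks
 *    CAP_FSETID; sgid is cleared only if the caller lacks CAP_FSETID
 *    and the file has group execute permission.
 */
mode_t killpriv_v2_mode(mode_t mode, enum sketch_op op, bool has_cap_fsetid)
{
    if (op == SKETCH_CHOWN)
        return mode & ~(mode_t)(S_ISUID | S_ISGID);

    if (has_cap_fsetid)
        return mode;

    mode &= ~(mode_t)S_ISUID;
    if (mode & S_IXGRP)
        mode &= ~(mode_t)S_ISGID;
    return mode;
}
```

Note that a file without group execute permission keeps its sgid bit on write or truncate, since in that case the bit does not confer set-group-ID execution.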
    • fuse: always revalidate if exclusive create · df8629af
      Miklos Szeredi authored
      Failure to do so may result in EEXIST even if the file only exists in the
      cache and not in the filesystem.
      
      The atomic nature of O_EXCL mandates that the cached state should be
      ignored and existence verified anew.
      Reported-by: Ken Schalk <kschalk@nvidia.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
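The idea behind the exclusive-create fix above can be sketched as a simple decision function. This is only an illustration: `must_revalidate` is a hypothetical helper, and it inspects userspace open(2) flags, whereas the kernel works with its own internal lookup flags.

```c
#include <fcntl.h>
#include <stdbool.h>

/*
 * Decide whether a cached directory entry may be trusted.
 * For an exclusive create (O_CREAT together with O_EXCL) the cache is
 * never trusted, because the existence check must be made against the
 * filesystem itself; otherwise a still-valid cache entry is good enough.
 */
bool must_revalidate(int open_flags, bool cache_entry_valid)
{
    bool exclusive_create =
        (open_flags & (O_CREAT | O_EXCL)) == (O_CREAT | O_EXCL);

    if (exclusive_create)
        return true;
    return !cache_entry_valid;
}
```

With this rule, a stale positive cache entry can no longer produce a spurious EEXIST for a file that is gone from the filesystem.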
    • virtiofs: clean up error handling in virtio_fs_get_tree() · 833c5a42
      Miklos Szeredi authored
      Avoid duplicating error cleanup.
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • fuse: add fuse_sb_destroy() helper · 6a68d1e1
      Miklos Szeredi authored
      This is to avoid minor code duplication between fuse_kill_sb_anon() and
      fuse_kill_sb_blk().
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • fuse: simplify get_fuse_conn*() · bd3bf1e8
      Miklos Szeredi authored
      All callers dereference the result, so there is no point in checking for a
      NULL pointer here.
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • fuse: get rid of fuse_mount refcount · 514b5e3f
      Miklos Szeredi authored
      Fuse mount now only ever has a refcount of one (before being freed) so the
      count field is unnecessary.
      
      Remove the refcounting and fold fuse_mount_put() into callers.  The only
      caller of fuse_mount_put() where fm->fc was NULL is fuse_dentry_automount()
      and here the fuse_conn_put() can simply be omitted.
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • virtiofs: simplify sb setup · b19d3d00
      Miklos Szeredi authored
      Currently, when acquiring an sb for virtiofs, fuse_mount_get() is
      called from virtio_fs_set_super() if a new sb is being filled, and
      fuse_mount_put() is called unconditionally after sget_fc() returns.
      
      The exact same result can be obtained by checking whether
      fs_context->s_fs_info was set to NULL (ref transferred to sb->s_fs_info) and
      only calling fuse_mount_put() if the ref wasn't transferred (error or
      matching sb found).
      
      This allows getting rid of virtio_fs_set_super() and fuse_mount_get().
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
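The ownership-transfer check described in the sb setup commit above follows a common pattern: the callee clears the caller's pointer when it takes over the reference, and the caller releases only if the pointer is still set. The sketch below models that pattern with hypothetical types and functions (`sketch_fs_context`, `sketch_sget`, `sketch_get_tree`); none of these are kernel APIs.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the filesystem context. */
struct sketch_fs_context {
    void *s_fs_info;
};

/*
 * Stand-in for the superblock lookup: when a new superblock is
 * created, it takes over s_fs_info and clears the context's pointer.
 * On failure, or when an existing superblock matches, the pointer is
 * left untouched.
 */
bool sketch_sget(struct sketch_fs_context *fsc, bool creates_new_sb, bool fails)
{
    if (fails)
        return false;
    if (creates_new_sb)
        fsc->s_fs_info = NULL; /* ownership moved to the superblock */
    return true;
}

/* Returns how many times the caller had to drop its own reference. */
int sketch_get_tree(bool creates_new_sb, bool fails)
{
    int dummy_mount; /* only its address is used, as an opaque handle */
    struct sketch_fs_context fsc = { .s_fs_info = &dummy_mount };
    int puts = 0;

    sketch_sget(&fsc, creates_new_sb, fails);

    /* Ref not transferred: error, or a matching sb was found. */
    if (fsc.s_fs_info != NULL)
        puts++;
    return puts;
}
```

The single NULL check replaces a get/put pair, which is the simplification the commit message describes.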
    • virtiofs: fix leak in setup · 66ab33bf
      Miklos Szeredi authored
      This can be triggered for example by adding the "-omand" mount option,
      which will be rejected and virtio_fs_fill_super() will return an error.
      
      In such a case the allocations for fuse_conn and fuse_mount will leak due
      to s_root not yet being set and so ->put_super() not being called.
      
      Fixes: a62a8ef9 ("virtio-fs: add virtiofs filesystem")
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • fuse: launder page should wait for page writeback · 3993382b
      Miklos Szeredi authored
      Qian Cai reports that the WARNING in tree_insert() can be triggered by a
      fuzzer with the following call chain:
      
      invalidate_inode_pages2_range()
         fuse_launder_page()
            fuse_writepage_locked()
               tree_insert()
      
      The reason is that another write for the same page is already queued.
      
      The simplest fix is to wait until the pending write is completed and only
      after that queue the new write.
      
      Since this case is very rare, the additional wait should not be a problem.
      Reported-by: Qian Cai <cai@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
  2. 25 Oct, 2020 17 commits
  3. 24 Oct, 2020 11 commits
    • Merge tag 'block-5.10-2020-10-24' of git://git.kernel.dk/linux-block · d7691390
      Linus Torvalds authored
      Pull block fixes from Jens Axboe:
      
       - NVMe pull request from Christoph
           - rdma error handling fixes (Chao Leng)
           - fc error handling and reconnect fixes (James Smart)
           - fix the qid display when tracing ioctl command (Keith Busch)
           - don't use BLK_MQ_REQ_NOWAIT for passthru (Chaitanya Kulkarni)
           - fix MDTS for passthru (Logan Gunthorpe)
           - blacklist Write Same on more devices (Kai-Heng Feng)
           - fix an uninitialized work struct (zhenwei pi)
      
       - lightnvm out-of-bounds fix (Colin)
      
       - SG allocation leak fix (Doug)
      
       - rnbd fixes (Gioh, Guoqing, Jack)
      
       - zone error translation fixes (Keith)
      
       - kerneldoc markup fix (Mauro)
      
       - zram lockdep fix (Peter)
      
       - Kill unused io_context members (Yufen)
      
       - NUMA memory allocation cleanup (Xianting)
      
       - NBD config wakeup fix (Xiubo)
      
      * tag 'block-5.10-2020-10-24' of git://git.kernel.dk/linux-block: (27 commits)
        block: blk-mq: fix a kernel-doc markup
        nvme-fc: shorten reconnect delay if possible for FC
        nvme-fc: wait for queues to freeze before calling update_hr_hw_queues
        nvme-fc: fix error loop in create_hw_io_queues
        nvme-fc: fix io timeout to abort I/O
        null_blk: use zone status for max active/open
        nvmet: don't use BLK_MQ_REQ_NOWAIT for passthru
        nvmet: cleanup nvmet_passthru_map_sg()
        nvmet: limit passthru MTDS by BIO_MAX_PAGES
        nvmet: fix uninitialized work for zero kato
        nvme-pci: disable Write Zeroes on Sandisk Skyhawk
        nvme: use queuedata for nvme_req_qid
        nvme-rdma: fix crash due to incorrect cqe
        nvme-rdma: fix crash when connect rejected
        block: remove unused members for io_context
        blk-mq: remove the calling of local_memory_node()
        zram: Fix __zram_bvec_{read,write}() locking order
        skd_main: remove unused including <linux/version.h>
        sgl_alloc_order: fix memory leak
        lightnvm: fix out-of-bounds write to array devices->info[]
        ...
    • Merge tag 'io_uring-5.10-2020-10-24' of git://git.kernel.dk/linux-block · af004187
      Linus Torvalds authored
      Pull io_uring fixes from Jens Axboe:
      
       - fsize was missed in previous unification of work flags
      
       - Few fixes cleaning up the flags unification creds cases (Pavel)
      
       - Fix NUMA affinities for completely unplugged/replugged node for io-wq
      
       - Two fallout fixes from the set_fs changes. One local to io_uring, one
         for the splice entry point that io_uring uses.
      
       - Linked timeout fixes (Pavel)
      
       - Removal of ->flush() ->files work-around that we don't need anymore
         with referenced files (Pavel)
      
       - Various cleanups (Pavel)
      
      * tag 'io_uring-5.10-2020-10-24' of git://git.kernel.dk/linux-block:
        splice: change exported internal do_splice() helper to take kernel offset
        io_uring: make loop_rw_iter() use original user supplied pointers
        io_uring: remove req cancel in ->flush()
        io-wq: re-set NUMA node affinities if CPUs come online
        io_uring: don't reuse linked_timeout
        io_uring: unify fsize with def->work_flags
        io_uring: fix racy REQ_F_LINK_TIMEOUT clearing
        io_uring: do poll's hash_node init in common code
        io_uring: inline io_poll_task_handler()
        io_uring: remove extra ->file check in poll prep
        io_uring: make cached_cq_overflow non atomic_t
        io_uring: inline io_fail_links()
        io_uring: kill ref get/drop in personality init
        io_uring: flags-based creds init in queue
    • Merge tag 'libata-5.10-2020-10-24' of git://git.kernel.dk/linux-block · cb6b2897
      Linus Torvalds authored
      Pull libata fixes from Jens Axboe:
       "Two minor libata fixes:
      
         - Fix a DMA boundary mask regression for sata_rcar (Geert)
      
         - kerneldoc markup fix (Mauro)"
      
      * tag 'libata-5.10-2020-10-24' of git://git.kernel.dk/linux-block:
        ata: fix some kernel-doc markups
        ata: sata_rcar: Fix DMA boundary mask
    • Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs · 0eac1102
      Linus Torvalds authored
      Pull misc vfs updates from Al Viro:
       "Assorted stuff all over the place (the largest group here is
        Christoph's stat cleanups)"
      
      * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
        fs: remove KSTAT_QUERY_FLAGS
        fs: remove vfs_stat_set_lookup_flags
        fs: move vfs_fstatat out of line
        fs: implement vfs_stat and vfs_lstat in terms of vfs_fstatat
        fs: remove vfs_statx_fd
        fs: omfs: use kmemdup() rather than kmalloc+memcpy
        [PATCH] reduce boilerplate in fsid handling
        fs: Remove duplicated flag O_NDELAY occurring twice in VALID_OPEN_FLAGS
        selftests: mount: add nosymfollow tests
        Add a "nosymfollow" mount option.
    • Merge tag 'dma-mapping-5.10-1' of git://git.infradead.org/users/hch/dma-mapping · 1b307ac8
      Linus Torvalds authored
      Pull dma-mapping fixes from Christoph Hellwig:
      
       - document the new dma_{alloc,free}_pages() API
      
       - two fixups for the dma-mapping.h split
      
      * tag 'dma-mapping-5.10-1' of git://git.infradead.org/users/hch/dma-mapping:
        dma-mapping: document dma_{alloc,free}_pages
        dma-mapping: move more functions to dma-map-ops.h
        ARM/sa1111: add a missing include of dma-map-ops.h
    • Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm · 9bf8d8bc
      Linus Torvalds authored
      Pull KVM fixes from Paolo Bonzini:
       "Two fixes for this merge window, and an unrelated bugfix for a host
        hang"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
        KVM: ioapic: break infinite recursion on lazy EOI
        KVM: vmx: rename pi_init to avoid conflict with paride
        KVM: x86/mmu: Avoid modulo operator on 64-bit value to fix i386 build
    • Merge tag 'x86_seves_fixes_for_v5.10_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · c51ae124
      Linus Torvalds authored
      Pull x86 SEV-ES fixes from Borislav Petkov:
       "Three fixes to SEV-ES to correct setting up the new early pagetable on
        5-level paging machines, to always map boot_params and the kernel
        cmdline, and disable stack protector for ../compressed/head{32,64}.c.
        (Arvind Sankar)"
      
      * tag 'x86_seves_fixes_for_v5.10_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/boot/64: Explicitly map boot_params and command line
        x86/head/64: Disable stack protection for head$(BITS).o
        x86/boot/64: Initialize 5-level paging variables earlier
    • random32: add a selftest for the prandom32 code · c6e169bc
      Willy Tarreau authored
      Given that this code is new, let's add a selftest for it as well.
      It doesn't rely on fixed sets, instead it picks 1024 numbers and
      verifies that they're not more correlated than desired.
      
      Link: https://lore.kernel.org/netdev/20200808152628.GA27941@SDF.ORG/
      Cc: George Spelvin <lkml@sdf.org>
      Cc: Amit Klein <aksecurity@gmail.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: tytso@mit.edu
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Marc Plumb <lkml.mplumb@gmail.com>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
    • random32: add noise from network and scheduling activity · 3744741a
      Willy Tarreau authored
      With the removal of the interrupt perturbations in previous random32
      change (random32: make prandom_u32() output unpredictable), the PRNG
      has become 100% deterministic again. While SipHash is expected to be
      way more robust against brute force than the previous Tausworthe LFSR,
      there's still the risk that whoever has even one temporary access to
      the PRNG's internal state is able to predict all subsequent draws till
      the next reseed (roughly every minute). This may happen through a side
      channel attack or any data leak.
      
      This patch restores the spirit of commit f227e3ec ("random32: update
      the net random state on interrupt and activity") in that it will perturb
      the internal PRNG's state using externally collected noise, except that
      it will not pick that noise from the random pool's bits nor upon
      interrupt, but will rather combine a few elements along the Tx path
      that are collectively hard to predict, such as dev, skb and txq
      pointers, packet length and jiffies values. These ones are combined
      using a single round of SipHash into a single long variable that is
      mixed with the net_rand_state upon each invocation.
      
      The operation was inlined because it produces very small and efficient
      code, typically 3 xor, 2 add and 2 rol. The performance was measured
      to be the same (even very slightly better) than before the switch to
      SipHash; on a 6-core 12-thread Core i7-8700k equipped with a 40G NIC
      (i40e), the connection rate dropped from 556k/s to 555k/s while the
      SYN cookie rate grew from 5.38 Mpps to 5.45 Mpps.
      
      Link: https://lore.kernel.org/netdev/20200808152628.GA27941@SDF.ORG/
      Cc: George Spelvin <lkml@sdf.org>
      Cc: Amit Klein <aksecurity@gmail.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: tytso@mit.edu
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Marc Plumb <lkml.mplumb@gmail.com>
      Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
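The noise-folding step described in the commit above (several hard-to-predict values combined through a single SipHash round into one word) can be sketched in portable C. The round below is the standard SipHash round; the function name `mix_noise`, the use of `uint64_t` instead of the kernel's `unsigned long`, and the way the previous noise word is XORed in are assumptions made for this sketch, and how the kernel feeds the result into net_rand_state is not shown.

```c
#include <stdint.h>

/* Rotate left; r must be in the range 1..63. */
static inline uint64_t rotl64(uint64_t x, unsigned int r)
{
    return (x << r) | (x >> (64 - r));
}

/*
 * Fold four input words (for example pointer values, a packet length
 * and a jiffies value) into a single noise word using one SipHash
 * round. 'state' is the previously accumulated noise word; the return
 * value is the new one.
 */
uint64_t mix_noise(uint64_t state, uint64_t a, uint64_t b,
                   uint64_t c, uint64_t d)
{
    uint64_t v0 = a ^ state, v1 = b, v2 = c, v3 = d;

    /* One standard SipHash round. */
    v0 += v1; v1 = rotl64(v1, 13); v1 ^= v0; v0 = rotl64(v0, 32);
    v2 += v3; v3 = rotl64(v3, 16); v3 ^= v2;
    v0 += v3; v3 = rotl64(v3, 21); v3 ^= v0;
    v2 += v1; v1 = rotl64(v1, 17); v1 ^= v2; v2 = rotl64(v2, 32);

    return v3;
}
```

Only one of the four output words is kept here, which is why an optimizing compiler can drop part of the round and end up with the handful of xor, add and rotate instructions the commit message mentions.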
    • random32: make prandom_u32() output unpredictable · c51f8f88
      George Spelvin authored
      Non-cryptographic PRNGs may have great statistical properties, but
      are usually trivially predictable to someone who knows the algorithm,
      given a small sample of their output.  An LFSR like prandom_u32() is
      particularly simple, even if the sample is widely scattered bits.
      
      It turns out the network stack uses prandom_u32() for some things like
      random port numbers which it would prefer are *not* trivially predictable.
      Predictability led to a practical DNS spoofing attack.  Oops.
      
      This patch replaces the LFSR with a homebrew cryptographic PRNG based
      on the SipHash round function, which is in turn seeded with 128 bits
      of strong random key.  (The authors of SipHash have *not* been consulted
      about this abuse of their algorithm.)  Speed is prioritized over security;
      attacks are rare, while performance is always wanted.
      
      Replacing all callers of prandom_u32() is the quick fix.
      Whether to reinstate a weaker PRNG for uses which can tolerate it
      is an open question.
      
      Commit f227e3ec ("random32: update the net random state on interrupt
      and activity") was an earlier attempt at a solution.  This patch replaces
      it.
      Reported-by: Amit Klein <aksecurity@gmail.com>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: tytso@mit.edu
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Marc Plumb <lkml.mplumb@gmail.com>
      Fixes: f227e3ec ("random32: update the net random state on interrupt and activity")
      Signed-off-by: George Spelvin <lkml@sdf.org>
      Link: https://lore.kernel.org/netdev/20200808152628.GA27941@SDF.ORG/
      [ willy: partial reversal of f227e3ec; moved SIPROUND definitions
        to prandom.h for later use; merged George's prandom_seed() proposal;
        inlined siprand_u32(); replaced the net_rand_state[] array with 4
        members to fix a build issue; cosmetic cleanups to make checkpatch
        happy; fixed RANDOM32_SELFTEST build ]
      Signed-off-by: Willy Tarreau <w@1wt.eu>
    • Merge tag 'powerpc-5.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · b6f96e75
      Linus Torvalds authored
      Pull powerpc fixes from Michael Ellerman:
      
       - A fix for undetected data corruption on Power9 Nimbus <= DD2.1 in the
         emulation of VSX loads. The affected CPUs were not widely available.
      
       - Two fixes for machine check handling in guests under PowerVM.
      
       - A fix for our recent changes to SMP setup, when
         CONFIG_CPUMASK_OFFSTACK=y.
      
       - Three fixes for races in the handling of some of our powernv sysfs
         attributes.
      
       - One change to remove TM from the set of Power10 CPU features.
      
       - A couple of other minor fixes.
      
      Thanks to: Aneesh Kumar K.V, Christophe Leroy, Ganesh Goudar, Jordan
      Niethe, Mahesh Salgaonkar, Michael Neuling, Oliver O'Halloran, Qian Cai,
      Srikar Dronamraju, Vasant Hegde.
      
      * tag 'powerpc-5.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
        powerpc/pseries: Avoid using addr_to_pfn in real mode
        powerpc/uaccess: Don't use "m<>" constraint with GCC 4.9
        powerpc/eeh: Fix eeh_dev_check_failure() for PE#0
        powerpc/64s: Remove TM from Power10 features
        selftests/powerpc: Make alignment handler test P9N DD2.1 vector CI load workaround
        powerpc: Fix undetected data corruption with P9N DD2.1 VSX CI load emulation
        powerpc/powernv/dump: Handle multiple writes to ack attribute
        powerpc/powernv/dump: Fix race while processing OPAL dump
        powerpc/smp: Use GFP_ATOMIC while allocating tmp mask
        powerpc/smp: Remove unnecessary variable
        powerpc/mce: Avoid nmi_enter/exit in real mode on pseries hash
        powerpc/opal_elog: Handle multiple writes to ack attribute