1. 03 May, 2018 2 commits
    • Coly Li's avatar
      bcache: set CACHE_SET_IO_DISABLE in bch_cached_dev_error() · 6147305c
      Coly Li authored
      Commit c7b7bd07 ("bcache: add io_disable to struct cached_dev") tries
      to stop bcache device by calling bcache_device_stop() when too many I/O
      errors happened on backing device. But if there is internal I/O happening
      on cache device (writeback scan, garbage collection, etc), a regular I/O
      request triggers the internal I/Os may still holds a refcount of dc->count,
      and the refcount may only be dropped after the internal I/O stopped.
      
      By this patch, bch_cached_dev_error() will check if the backing device is
      attached to a cache set, if yes that CACHE_SET_IO_DISABLE will be set to
      flags of this cache set. Then internal I/Os on cache device will be
      rejected and stopped immediately, and the bcache device can be stopped.
      
      For people who are not familiar with the interesting refcount dependance,
      let me explain a bit more how the fix works. Example the writeback thread
      will scan cache device for dirty data writeback purpose. Before it stopps,
      it holds a refcount of dc->count. When CACHE_SET_IO_DISABLE bit is set,
      the internal I/O will stopped and the while-loop in bch_writeback_thread()
      quits and calls cached_dev_put() to drop dc->count. If this is the last
      refcount to drop, then cached_dev_detach_finish() will be called. In this
      call back function, in turn closure_put(dc->disk.cl) is called to drop a
      refcount of closure dc->disk.cl. If this is the last refcount of this
      closure to drop, then cached_dev_flush() will be called. Then the cached
      device is freed. So if CACHE_SET_IO_DISABLE is not set, the bache device
      can not be stopped until all inernal cache device I/O stopped. For large
      size cache device, and writeback thread competes locks with gc thread,
      there might be a quite long time to wait.
      
      Fixes: c7b7bd07 ("bcache: add io_disable to struct cached_dev")
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Reviewed-by: default avatarHannes Reinecke <hare@suse.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      6147305c
    • Coly Li's avatar
      bcache: store disk name in struct cache and struct cached_dev · 6e916a7e
      Coly Li authored
      Current code uses bdevname() or bio_devname() to reference gendisk
      disk name when bcache needs to display the disk names in kernel message.
      It was safe before bcache device failure handling patch set merged in,
      because when devices are failed, there was deadlock to prevent bcache
      printing error messages with gendisk disk name. But after the failure
      handling patch set merged, the deadlock is fixed, so it is possible
      that the gendisk structure bdev->hd_disk is released when bdevname() is
      called to reference bdev->bd_disk->disk_name[]. This is why I receive
      bug report of NULL pointers deference panic.
      
      This patch stores gendisk disk name in a buffer inside struct cache and
      struct cached_dev, then print out the offline device name won't reference
      bdev->hd_disk anymore. And this patch also avoids extra function calls
      of bdevname() and bio_devnmae().
      
      Changelog:
      v3, add Reviewed-by from Hannes.
      v2, call bdevname() earlier in register_bdev()
      v1, first version with segguestion from Junhui Tang.
      
      Fixes: c7b7bd07 ("bcache: add io_disable to struct cached_dev")
      Fixes: 5138ac67 ("bcache: fix misleading error message in bch_count_io_errors()")
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Reviewed-by: default avatarHannes Reinecke <hare@suse.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      6e916a7e
  2. 26 Apr, 2018 6 commits
    • Omar Sandoval's avatar
      blk-mq: fix sysfs inflight counter · bf0ddaba
      Omar Sandoval authored
      When the blk-mq inflight implementation was added, /proc/diskstats was
      converted to use it, but /sys/block/$dev/inflight was not. Fix it by
      adding another helper to count in-flight requests by data direction.
      
      Fixes: f299b7c7 ("blk-mq: provide internal in-flight variant")
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      bf0ddaba
    • Omar Sandoval's avatar
      blk-mq: count allocated but not started requests in iostats inflight · 6131837b
      Omar Sandoval authored
      In the legacy block case, we increment the counter right after we
      allocate the request, not when the driver handles it. In both the legacy
      and blk-mq cases, part_inc_in_flight() is called from
      blk_account_io_start() right after we've allocated the request. blk-mq
      only considers requests started requests as inflight, but this is
      inconsistent with the legacy definition and the intention in the code.
      This removes the started condition and instead counts all allocated
      requests.
      
      Fixes: f299b7c7 ("blk-mq: provide internal in-flight variant")
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      6131837b
    • Linus Torvalds's avatar
      Merge tag 'for_v4.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs · 69bfd470
      Linus Torvalds authored
      Pull fsnotify fix from Jan Kara:
       "A fix of a fsnotify race causing panics / softlockups"
      
      * tag 'for_v4.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
        fsnotify: Fix fsnotify_mark_connector race
      69bfd470
    • Linus Torvalds's avatar
      Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · 3442097b
      Linus Torvalds authored
      Pull SCSI fixes from James Bottomley:
       "Eight bug fixes, one spelling update and one tracepoint addition.
      
        The most serious is probably the mptsas write same fix because it
        means anyone using these controllers sees errors when modern
        filesystems try to issue discards"
      
      * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
        scsi: target: fix crash with iscsi target and dvd
        scsi: sd_zbc: Avoid that resetting a zone fails sporadically
        scsi: sd: Defer spinning up drive while SANITIZE is in progress
        scsi: megaraid_sas: Do not log an error if FW successfully initializes.
        scsi: ufs: add trace event for ufs upiu
        scsi: core: remove reference to scsi_show_extd_sense()
        scsi: mptsas: Disable WRITE SAME
        scsi: fnic: fix spelling mistake in fnic stats "Abord" -> "Abort"
        scsi: scsi_debug: IMMED related delay adjustments
        scsi: iscsi: respond to netlink with unicast when appropriate
      3442097b
    • Linus Torvalds's avatar
      Merge tag 'for-linus-20180425' of git://git.kernel.dk/linux-block · 8fba70b0
      Linus Torvalds authored
      Pull block updates from Jens Axboe:
       "I ended up sitting on this about a week longer than I wanted to, since
        we were hashing out details with a timeout change. I've now killed
        that patch, so we can flush the existing queue in due time.
      
        This contains:
      
         - Fix for an old regression, where entering the queue can be
           disturbed by a signal to the process. This can cause spurious EIO.
           Fix from Alan Jenkins.
      
         - cdrom information leak fix from Dan.
      
         - Trivial helper for testing queue FUA from Dave Chinner, part of his
           O_DIRECT FUA series.
      
         - Series of swim fixes from Finn that actually makes it work again.
      
         - Loop O_DIRECT corruption fix, which caused data corruption in
           production for us. From me.
      
         - BFQ crash fix from me.
      
         - bcache maintainer update. Michael no longer has the time to do it,
           Coly has stepped up to serve as the new maintainer.
      
         - blkcg locking fixes from Jiang Biao.
      
         - Revert of a change from this merge window from Ming, that causes an
           issue on some hardware.
      
         - Minor clarification doc addition from Linus Walleij"
      
      * tag 'for-linus-20180425' of git://git.kernel.dk/linux-block: (22 commits)
        Revert "blk-mq: remove code for dealing with remapping queue"
        block: mq: Add some minor doc for core structs
        bcache: mark Coly Li as bcache maintainer
        MAINTAINERS: Remove me as maintainer of bcache
        blkcg: init root blkcg_gq under lock
        blkcg: small fix on comment in blkcg_init_queue
        blkcg: don't hold blkcg lock when deactivating policy
        block: add blk_queue_fua() helper function
        cdrom: information leak in cdrom_ioctl_media_changed()
        bfq-iosched: ensure to clear bic/bfqq pointers when preparing request
        blk-mq: start request gstate with gen 1
        block/swim: Select appropriate drive on device open
        block/swim: Fix IO error at end of medium
        block/swim: Check drive type
        block/swim: Rename macros to avoid inconsistent inverted logic
        block/swim: Don't log an error message for an invalid ioctl
        block/swim: Remove extra put_disk() call from error path
        block/swim: Fix array bounds check
        m68k/mac: Don't remap SWIM MMIO region
        loop: handle short DIO reads
        ...
      8fba70b0
    • Linus Torvalds's avatar
      Merge tag 'riscv-for-linus-4.17-rc3' of... · c6dc3e71
      Linus Torvalds authored
      Merge tag 'riscv-for-linus-4.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/palmer/riscv-linux
      
      Pull RISC-V fixes from Palmer Dabbelt:
       "This contains three small fixes related to the RISC-V port that I'd
        like to target for 4.17-rc3:
      
         - a Kconfig cleanup to select DMA_DIRECT_OPS instead of redefining it
           in arch/riscv
      
         - the removal of asm/handle_irq.h, which doesn't exist, from our arch
           header list
      
         - the addition of "-no-pie" the link rules for our VDSO-related
           files, which fixes the build on systems where PIE is enabled by
           default"
      
      * tag 'riscv-for-linus-4.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/palmer/riscv-linux:
        RISC-V: build vdso-dummy.o with -no-pie
        riscv: there is no <asm/handle_irq.h>
        riscv: select DMA_DIRECT_OPS instead of redefining it
      c6dc3e71
  3. 25 Apr, 2018 6 commits
  4. 24 Apr, 2018 22 commits
  5. 23 Apr, 2018 4 commits
    • David S. Miller's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf · 77621f02
      David S. Miller authored
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter/IPVS fixes for net
      
      The following patchset contains Netfilter/IPVS fixes for your net tree,
      they are:
      
      1) Fix SIP conntrack with phones sending session descriptions for different
         media types but same port numbers, from Florian Westphal.
      
      2) Fix incorrect rtnl_lock mutex logic from IPVS sync thread, from Julian
         Anastasov.
      
      3) Skip compat array allocation in ebtables if there is no entries, also
         from Florian.
      
      4) Do not lose left/right bits when shifting marks from xt_connmark, from
         Jack Ma.
      
      5) Silence false positive memleak in conntrack extensions, from Cong Wang.
      
      6) Fix CONFIG_NF_REJECT_IPV6=m link problems, from Arnd Bergmann.
      
      7) Cannot kfree rule that is already in list in nf_tables, switch order
         so this error handling is not required, from Florian Westphal.
      
      8) Release set name in error path, from Florian.
      
      9) include kmemleak.h in nf_conntrack_extend.c, from Stepheh Rothwell.
      
      10) NAT chain and extensions depend on NF_TABLES.
      
      11) Out of bound access when renaming chains, from Taehee Yoo.
      
      12) Incorrect casting in xt_connmark leads to wrong bitshifting.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      77621f02
    • Eric Dumazet's avatar
      ipv6: add RTA_TABLE and RTA_PREFSRC to rtm_ipv6_policy · aa8f8778
      Eric Dumazet authored
      KMSAN reported use of uninit-value that I tracked to lack
      of proper size check on RTA_TABLE attribute.
      
      I also believe RTA_PREFSRC lacks a similar check.
      
      Fixes: 86872cb5 ("[IPv6] route: FIB6 configuration using struct fib6_config")
      Fixes: c3968a85 ("ipv6: RTA_PREFSRC support for ipv6 route source address selection")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarsyzbot <syzkaller@googlegroups.com>
      Acked-by: default avatarDavid Ahern <dsahern@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      aa8f8778
    • Xin Long's avatar
      bonding: do not set slave_dev npinfo before slave_enable_netpoll in bond_enslave · ddea788c
      Xin Long authored
      After Commit 8a8efa22 ("bonding: sync netpoll code with bridge"), it
      would set slave_dev npinfo in slave_enable_netpoll when enslaving a dev
      if bond->dev->npinfo was set.
      
      However now slave_dev npinfo is set with bond->dev->npinfo before calling
      slave_enable_netpoll. With slave_dev npinfo set, __netpoll_setup called
      in slave_enable_netpoll will not call slave dev's .ndo_netpoll_setup().
      It causes that the lower dev of this slave dev can't set its npinfo.
      
      One way to reproduce it:
      
        # modprobe bonding
        # brctl addbr br0
        # brctl addif br0 eth1
        # ifconfig bond0 192.168.122.1/24 up
        # ifenslave bond0 eth2
        # systemctl restart netconsole
        # ifenslave bond0 br0
        # ifconfig eth2 down
        # systemctl restart netconsole
      
      The netpoll won't really work.
      
      This patch is to remove that slave_dev npinfo setting in bond_enslave().
      
      Fixes: 8a8efa22 ("bonding: sync netpoll code with bridge")
      Signed-off-by: default avatarXin Long <lucien.xin@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ddea788c
    • Jann Horn's avatar
      tcp: don't read out-of-bounds opsize · 7e5a206a
      Jann Horn authored
      The old code reads the "opsize" variable from out-of-bounds memory (first
      byte behind the segment) if a broken TCP segment ends directly after an
      opcode that is neither EOL nor NOP.
      
      The result of the read isn't used for anything, so the worst thing that
      could theoretically happen is a pagefault; and since the physmap is usually
      mostly contiguous, even that seems pretty unlikely.
      
      The following C reproducer triggers the uninitialized read - however, you
      can't actually see anything happen unless you put something like a
      pr_warn() in tcp_parse_md5sig_option() to print the opsize.
      
      ====================================
      #define _GNU_SOURCE
      #include <arpa/inet.h>
      #include <stdlib.h>
      #include <errno.h>
      #include <stdarg.h>
      #include <net/if.h>
      #include <linux/if.h>
      #include <linux/ip.h>
      #include <linux/tcp.h>
      #include <linux/in.h>
      #include <linux/if_tun.h>
      #include <err.h>
      #include <sys/types.h>
      #include <sys/stat.h>
      #include <fcntl.h>
      #include <string.h>
      #include <stdio.h>
      #include <unistd.h>
      #include <sys/ioctl.h>
      #include <assert.h>
      
      void systemf(const char *command, ...) {
        char *full_command;
        va_list ap;
        va_start(ap, command);
        if (vasprintf(&full_command, command, ap) == -1)
          err(1, "vasprintf");
        va_end(ap);
        printf("systemf: <<<%s>>>\n", full_command);
        system(full_command);
      }
      
      char *devname;
      
      int tun_alloc(char *name) {
        int fd = open("/dev/net/tun", O_RDWR);
        if (fd == -1)
          err(1, "open tun dev");
        static struct ifreq req = { .ifr_flags = IFF_TUN|IFF_NO_PI };
        strcpy(req.ifr_name, name);
        if (ioctl(fd, TUNSETIFF, &req))
          err(1, "TUNSETIFF");
        devname = req.ifr_name;
        printf("device name: %s\n", devname);
        return fd;
      }
      
      #define IPADDR(a,b,c,d) (((a)<<0)+((b)<<8)+((c)<<16)+((d)<<24))
      
      void sum_accumulate(unsigned int *sum, void *data, int len) {
        assert((len&2)==0);
        for (int i=0; i<len/2; i++) {
          *sum += ntohs(((unsigned short *)data)[i]);
        }
      }
      
      unsigned short sum_final(unsigned int sum) {
        sum = (sum >> 16) + (sum & 0xffff);
        sum = (sum >> 16) + (sum & 0xffff);
        return htons(~sum);
      }
      
      void fix_ip_sum(struct iphdr *ip) {
        unsigned int sum = 0;
        sum_accumulate(&sum, ip, sizeof(*ip));
        ip->check = sum_final(sum);
      }
      
      void fix_tcp_sum(struct iphdr *ip, struct tcphdr *tcp) {
        unsigned int sum = 0;
        struct {
          unsigned int saddr;
          unsigned int daddr;
          unsigned char pad;
          unsigned char proto_num;
          unsigned short tcp_len;
        } fakehdr = {
          .saddr = ip->saddr,
          .daddr = ip->daddr,
          .proto_num = ip->protocol,
          .tcp_len = htons(ntohs(ip->tot_len) - ip->ihl*4)
        };
        sum_accumulate(&sum, &fakehdr, sizeof(fakehdr));
        sum_accumulate(&sum, tcp, tcp->doff*4);
        tcp->check = sum_final(sum);
      }
      
      int main(void) {
        int tun_fd = tun_alloc("inject_dev%d");
        systemf("ip link set %s up", devname);
        systemf("ip addr add 192.168.42.1/24 dev %s", devname);
      
        struct {
          struct iphdr ip;
          struct tcphdr tcp;
          unsigned char tcp_opts[20];
        } __attribute__((packed)) syn_packet = {
          .ip = {
            .ihl = sizeof(struct iphdr)/4,
            .version = 4,
            .tot_len = htons(sizeof(syn_packet)),
            .ttl = 30,
            .protocol = IPPROTO_TCP,
            /* FIXUP check */
            .saddr = IPADDR(192,168,42,2),
            .daddr = IPADDR(192,168,42,1)
          },
          .tcp = {
            .source = htons(1),
            .dest = htons(1337),
            .seq = 0x12345678,
            .doff = (sizeof(syn_packet.tcp)+sizeof(syn_packet.tcp_opts))/4,
            .syn = 1,
            .window = htons(64),
            .check = 0 /*FIXUP*/
          },
          .tcp_opts = {
            /* INVALID: trailing MD5SIG opcode after NOPs */
            1, 1, 1, 1, 1,
            1, 1, 1, 1, 1,
            1, 1, 1, 1, 1,
            1, 1, 1, 1, 19
          }
        };
        fix_ip_sum(&syn_packet.ip);
        fix_tcp_sum(&syn_packet.ip, &syn_packet.tcp);
        while (1) {
          int write_res = write(tun_fd, &syn_packet, sizeof(syn_packet));
          if (write_res != sizeof(syn_packet))
            err(1, "packet write failed");
        }
      }
      ====================================
      
      Fixes: cfb6eeb4 ("[TCP]: MD5 Signature Option (RFC2385) support.")
      Signed-off-by: default avatarJann Horn <jannh@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      7e5a206a