1. 15 Feb, 2017 3 commits
  2. 14 Feb, 2017 1 commit
  3. 13 Feb, 2017 10 commits
    • Shaohua Li's avatar
      md/raid5-cache: exclude reclaiming stripes in reclaim check · e33fbb9c
      Shaohua Li authored
      stripes which are being reclaimed are still accounted into cached
      stripes. The reclaim takes time. r5c_do_reclaim isn't aware of the
      stripes and does unnecessary stripe reclaim. In practice, I saw one
      stripe is reclaimed one time. This will cause bad IO pattern. Fixing
      this by excluding the reclaing stripes in the check.
      
      Cc: Song Liu <songliubraving@fb.com>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      e33fbb9c
    • Shaohua Li's avatar
      md/raid5-cache: stripe reclaim only counts valid stripes · e8fd52ee
      Shaohua Li authored
      When log space is tight, we try to reclaim stripes from log head. There
      are stripes which can't be reclaimed right now if some conditions are
      met. We skip such stripes but accidentally count them, which might cause
      no stripes are claimed. Fixing this by only counting valid stripes.
      
      Cc: Song Liu <songliubraving@fb.com>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      e8fd52ee
    • Shaohua Li's avatar
      MD: add doc for raid5-cache · 5a6265f9
      Shaohua Li authored
      I'm starting document of the raid5-cache feature. Please note this is a
      kernel doc instead of a mdadm manual, so I don't add the details about
      how to use the feature in mdadm side.
      
      Cc: NeilBrown <neilb@suse.com>
      Reviewed-by: default avatarSong Liu <songliubraving@fb.com>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      5a6265f9
    • Shaohua Li's avatar
      1601c590
    • NeilBrown's avatar
      md: ensure md devices are freed before module is unloaded. · 9356863c
      NeilBrown authored
      Commit: cbd19983 ("md: Fix unfortunate interaction with evms")
      change mddev_put() so that it would not destroy an md device while
      ->ctime was non-zero.
      
      Unfortunately, we didn't make sure to clear ->ctime when unloading
      the module, so it is possible for an md device to remain after
      module unload.  An attempt to open such a device will trigger
      an invalid memory reference in:
        get_gendisk -> kobj_lookup -> exact_lock -> get_disk
      
      when tring to access disk->fops, which was in the module that has
      been removed.
      
      So ensure we clear ->ctime in md_exit(), and explain how that is useful,
      as it isn't immediately obvious when looking at the code.
      
      Fixes: cbd19983 ("md: Fix unfortunate interaction with evms")
      Tested-by: default avatarGuoqing Jiang <gqjiang@suse.com>
      Signed-off-by: default avatarNeilBrown <neilb@suse.com>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      9356863c
    • Song Liu's avatar
      md/r5cache: improve journal device efficiency · 39b99586
      Song Liu authored
      It is important to be able to flush all stripes in raid5-cache.
      Therefore, we need reserve some space on the journal device for
      these flushes. If flush operation includes pending writes to the
      stripe, we need to reserve (conf->raid_disk + 1) pages per stripe
      for the flush out. This reduces the efficiency of journal space.
      If we exclude these pending writes from flush operation, we only
      need (conf->max_degraded + 1) pages per stripe.
      
      With this patch, when log space is critical (R5C_LOG_CRITICAL=1),
      pending writes will be excluded from stripe flush out. Therefore,
      we can reduce reserved space for flush out and thus improve journal
      device efficiency.
      Signed-off-by: default avatarSong Liu <songliubraving@fb.com>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      39b99586
    • Song Liu's avatar
      md/r5cache: enable chunk_aligned_read with write back cache · 03b047f4
      Song Liu authored
      Chunk aligned read significantly reduces CPU usage of raid456.
      However, it is not safe to fully bypass the write back cache.
      This patch enables chunk aligned read with write back cache.
      
      For chunk aligned read, we track stripes in write back cache at
      a bigger granularity, "big_stripe". Each chunk may contain more
      than one stripe (for example, a 256kB chunk contains 64 4kB-page,
      so this chunk contain 64 stripes). For chunk_aligned_read, these
      stripes are grouped into one big_stripe, so we only need one lookup
      for the whole chunk.
      
      For each big_stripe, struct big_stripe_info tracks how many stripes
      of this big_stripe are in the write back cache. We count how many
      stripes of this big_stripe are in the write back cache. These
      counters are tracked in a radix tree (big_stripe_tree).
      r5c_tree_index() is used to calculate keys for the radix tree.
      
      chunk_aligned_read() calls r5c_big_stripe_cached() to look up
      big_stripe of each chunk in the tree. If this big_stripe is in the
      tree, chunk_aligned_read() aborts. This look up is protected by
      rcu_read_lock().
      
      It is necessary to remember whether a stripe is counted in
      big_stripe_tree. Instead of adding new flag, we reuses existing flags:
      STRIPE_R5C_PARTIAL_STRIPE and STRIPE_R5C_FULL_STRIPE. If either of these
      two flags are set, the stripe is counted in big_stripe_tree. This
      requires moving set_bit(STRIPE_R5C_PARTIAL_STRIPE) to
      r5c_try_caching_write(); and moving clear_bit of
      STRIPE_R5C_PARTIAL_STRIPE and STRIPE_R5C_FULL_STRIPE to
      r5c_finish_stripe_write_out().
      Signed-off-by: default avatarSong Liu <songliubraving@fb.com>
      Reviewed-by: default avatarNeilBrown <neilb@suse.com>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      03b047f4
    • Song Liu's avatar
      EXPORT_SYMBOL radix_tree_replace_slot · 10257d71
      Song Liu authored
      It will be used in drivers/md/raid5-cache.c
      Signed-off-by: default avatarSong Liu <songliubraving@fb.com>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      10257d71
    • Shaohua Li's avatar
      raid5: only dispatch IO from raid5d for harddisk raid · 765d704d
      Shaohua Li authored
      We made raid5 stripe handling multi-thread before. It works well for
      SSD. But for harddisk, the multi-threading creates more disk seek, so
      not always improve performance. For several hard disks based raid5,
      multi-threading is required as raid5d becames a bottleneck especially
      for sequential write.
      
      To overcome the disk seek issue, we only dispatch IO from raid5d if the
      array is harddisk based. Other threads can still handle stripes, but
      can't dispatch IO.
      
      Idealy, we should control IO dispatching order according to IO position
      interrnally. Right now we still depend on block layer, which isn't very
      efficient sometimes though.
      
      My setup has 9 harddisks, each disk can do around 180M/s sequential
      write. So in theory, the raid5 can do 180 * 8 = 1440M/s sequential
      write. The test machine uses an ATOM CPU. I measure sequential write
      with large iodepth bandwidth to raid array:
      
      without patch: ~600M/s
      without patch and group_thread_cnt=4: 750M/s
      with patch and group_thread_cnt=4: 950M/s
      with patch, group_thread_cnt=4, skip_copy=1: 1150M/s
      
      We are pretty close to the maximum bandwidth in the large iodepth
      iodepth case. The performance gap of small iodepth sequential write
      between software raid and theory value is still very big though, because
      we don't have an efficient pipeline.
      
      Cc: NeilBrown <neilb@suse.com>
      Cc: Song Liu <songliubraving@fb.com>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      765d704d
    • colyli@suse.de's avatar
      md linear: fix a race between linear_add() and linear_congested() · 03a9e24e
      colyli@suse.de authored
      Recently I receive a bug report that on Linux v3.0 based kerenl, hot add
      disk to a md linear device causes kernel crash at linear_congested(). From
      the crash image analysis, I find in linear_congested(), mddev->raid_disks
      contains value N, but conf->disks[] only has N-1 pointers available. Then
      a NULL pointer deference crashes the kernel.
      
      There is a race between linear_add() and linear_congested(), RCU stuffs
      used in these two functions cannot avoid the race. Since Linuv v4.0
      RCU code is replaced by introducing mddev_suspend().  After checking the
      upstream code, it seems linear_congested() is not called in
      generic_make_request() code patch, so mddev_suspend() cannot provent it
      from being called. The possible race still exists.
      
      Here I explain how the race still exists in current code.  For a machine
      has many CPUs, on one CPU, linear_add() is called to add a hard disk to a
      md linear device; at the same time on other CPU, linear_congested() is
      called to detect whether this md linear device is congested before issuing
      an I/O request onto it.
      
      Now I use a possible code execution time sequence to demo how the possible
      race happens,
      
      seq    linear_add()                linear_congested()
       0                                 conf=mddev->private
       1   oldconf=mddev->private
       2   mddev->raid_disks++
       3                              for (i=0; i<mddev->raid_disks;i++)
       4                                bdev_get_queue(conf->disks[i].rdev->bdev)
       5   mddev->private=newconf
      
      In linear_add() mddev->raid_disks is increased in time seq 2, and on
      another CPU in linear_congested() the for-loop iterates conf->disks[i] by
      the increased mddev->raid_disks in time seq 3,4. But conf with one more
      element (which is a pointer to struct dev_info type) to conf->disks[] is
      not updated yet, accessing its structure member in time seq 4 will cause a
      NULL pointer deference fault.
      
      To fix this race, there are 2 parts of modification in the patch,
       1) Add 'int raid_disks' in struct linear_conf, as a copy of
          mddev->raid_disks. It is initialized in linear_conf(), always being
          consistent with pointers number of 'struct dev_info disks[]'. When
          iterating conf->disks[] in linear_congested(), use conf->raid_disks to
          replace mddev->raid_disks in the for-loop, then NULL pointer deference
          will not happen again.
       2) RCU stuffs are back again, and use kfree_rcu() in linear_add() to
          free oldconf memory. Because oldconf may be referenced as mddev->private
          in linear_congested(), kfree_rcu() makes sure that its memory will not
          be released until no one uses it any more.
      Also some code comments are added in this patch, to make this modification
      to be easier understandable.
      
      This patch can be applied for kernels since v4.0 after commit:
      3be260cc ("md/linear: remove rcu protections in favour of
      suspend/resume"). But this bug is reported on Linux v3.0 based kernel, for
      people who maintain kernels before Linux v4.0, they need to do some back
      back port to this patch.
      
      Changelog:
       - V3: add 'int raid_disks' in struct linear_conf, and use kfree_rcu() to
             replace rcu_call() in linear_add().
       - v2: add RCU stuffs by suggestion from Shaohua and Neil.
       - v1: initial effort.
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Neil Brown <neilb@suse.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      03a9e24e
  4. 12 Feb, 2017 1 commit
  5. 11 Feb, 2017 8 commits
  6. 10 Feb, 2017 17 commits
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 1ee18329
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) If the timing is wrong we can indefinitely stop generating new ipv6
          temporary addresses, from Marcus Huewe.
      
       2) Don't double free per-cpu stats in ipv6 SIT tunnel driver, from Cong
          Wang.
      
       3) Put protections in place so that AF_PACKET is not able to submit
          packets which don't even have a link level header to drivers. From
          Willem de Bruijn.
      
       4) Fix memory leaks in ipv4 and ipv6 multicast code, from Hangbin Liu.
      
       5) Don't use udp_ioctl() in l2tp code, UDP version expects a UDP socket
          and that doesn't go over very well when it is passed an L2TP one.
          Fix from Eric Dumazet.
      
       6) Don't crash on NULL pointer in phy_attach_direct(), from Florian
          Fainelli.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
        l2tp: do not use udp_ioctl()
        xen-netfront: Delete rx_refill_timer in xennet_disconnect_backend()
        NET: mkiss: Fix panic
        net: hns: Fix the device being used for dma mapping during TX
        net: phy: Initialize mdio clock at probe function
        igmp, mld: Fix memory leak in igmpv3/mld_del_delrec()
        xen-netfront: Improve error handling during initialization
        sierra_net: Skip validating irrelevant fields for IDLE LSIs
        sierra_net: Add support for IPv6 and Dual-Stack Link Sense Indications
        kcm: fix 0-length case for kcm_sendmsg()
        xen-netfront: Rework the fix for Rx stall during OOM and network stress
        net: phy: Fix PHY module checks and NULL deref in phy_attach_direct()
        net: thunderx: Fix PHY autoneg for SGMII QLM mode
        net: dsa: Do not destroy invalid network devices
        ping: fix a null pointer dereference
        packet: round up linear to header len
        net: introduce device min_header_len
        sit: fix a double free on error path
        lwtunnel: valid encap attr check should return 0 when lwtunnel is disabled
        ipv6: addrconf: fix generation of new temporary addresses
      1ee18329
    • Linus Torvalds's avatar
      Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma · a9dbf5c8
      Linus Torvalds authored
      Pull rdma fixes from Doug Ledford:
       "Third round of -rc fixes for 4.10 kernel:
      
         - two security related issues in the rxe driver
      
         - one compile issue in the RDMA uapi header"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma:
        RDMA: Don't reference kernel private header from UAPI header
        IB/rxe: Fix mem_check_range integer overflow
        IB/rxe: Fix resid update
      a9dbf5c8
    • Linus Torvalds's avatar
      Merge branch 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux · aca9fa0c
      Linus Torvalds authored
      Pull i2c bugfixes from Wolfram Sang:
       "Two bugfixes (proper IO mapping and use of mutex) for a driver feature
        we introduced in this cycle"
      
      * 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
        i2c: piix4: Request the SMBUS semaphore inside the mutex
        i2c: piix4: Fix request_region size
      aca9fa0c
    • Linus Torvalds's avatar
      Merge tag 'mmc-v4.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc · fc6f41ba
      Linus Torvalds authored
      Pull MMC host fix from Ulf Hansson:
       "mmci: Fix hang while waiting for busy-end interrupt"
      
      * tag 'mmc-v4.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
        mmc: mmci: avoid clearing ST Micro busy end interrupt mistakenly
      fc6f41ba
    • Linus Torvalds's avatar
      Merge tag 'sound-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound · 1f369d16
      Linus Torvalds authored
      Pull sound fixes from Takashi Iwai:
       "Here are some last-minute fixes: two fixes for races in ALSA sequencer
        queue spotted by syzkaller, a revert for a regression of LINE6 driver
        (since 4.9), and a trivial new codec ID addition for Nvidia HDMI"
      
      * tag 'sound-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
        ALSA: hda - adding a new NV HDMI/DP codec ID in the driver
        ALSA: seq: Fix race at creating a queue
        Revert "ALSA: line6: Only determine control port properties if needed"
        ALSA: seq: Don't handle loop timeout at snd_seq_pool_done()
      1f369d16
    • Linus Torvalds's avatar
      Merge tag 'nfsd-4.10-3' of git://linux-nfs.org/~bfields/linux · 7fe654dc
      Linus Torvalds authored
      Pull nfsd revert from Bruce Fields:
       "This patch turned out to have a couple problems. The problems are
        fixable, but at least one of the fixes is a little ugly. The original
        bug has always been there, so we can wait another week or two to get
        this right"
      
      * tag 'nfsd-4.10-3' of git://linux-nfs.org/~bfields/linux:
        nfsd: Revert "nfsd: special case truncates some more"
      7fe654dc
    • Linus Torvalds's avatar
      Merge tag 'powerpc-4.10-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · 3ebc7033
      Linus Torvalds authored
      Pull powerpc fixes friom Michael Ellerman:
       "Apologies for the late pull request, but Ben has been busy finding bugs.
      
         - Userspace was semi-randomly segfaulting on radix due to us
           incorrectly handling a fault triggered by autonuma, caused by a
           patch we merged earlier in v4.10 to prevent the kernel executing
           userspace.
      
         - We weren't marking host IPIs properly for KVM in the OPAL ICP
           backend.
      
         - The ERAT flushing on radix was missing an isync and was incorrectly
           marked as DD1 only.
      
         - The powernv CPU hotplug code was missing a wakeup type and failing
           to flush the interrupt correctly when using OPAL ICP
      
        Thanks to Benjamin Herrenschmidt"
      
      * tag 'powerpc-4.10-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
        powerpc/powernv: Properly set "host-ipi" on IPIs
        powerpc/powernv: Fix CPU hotplug to handle waking on HVI
        powerpc/mm/radix: Update ERAT flushes when invalidating TLB
        powerpc/mm: Fix spurrious segfaults on radix with autonuma
      3ebc7033
    • Eric Dumazet's avatar
      l2tp: do not use udp_ioctl() · 72fb96e7
      Eric Dumazet authored
      udp_ioctl(), as its name suggests, is used by UDP protocols,
      but is also used by L2TP :(
      
      L2TP should use its own handler, because it really does not
      look the same.
      
      SIOCINQ for instance should not assume UDP checksum or headers.
      
      Thanks to Andrey and syzkaller team for providing the report
      and a nice reproducer.
      
      While crashes only happen on recent kernels (after commit
      7c13f97f ("udp: do fwd memory scheduling on dequeue")), this
      probably needs to be backported to older kernels.
      
      Fixes: 7c13f97f ("udp: do fwd memory scheduling on dequeue")
      Fixes: 85584672 ("udp: Fix udp_poll() and ioctl()")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarAndrey Konovalov <andreyknvl@google.com>
      Acked-by: default avatarPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      72fb96e7
    • Chris Mason's avatar
      Merge branch 'for-chris' of... · f3c7bfbd
      Chris Mason authored
      Merge branch 'for-chris' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.10
      f3c7bfbd
    • Boris Ostrovsky's avatar
      xen-netfront: Delete rx_refill_timer in xennet_disconnect_backend() · 74470954
      Boris Ostrovsky authored
      rx_refill_timer should be deleted as soon as we disconnect from the
      backend since otherwise it is possible for the timer to go off before
      we get to xennet_destroy_queues(). If this happens we may dereference
      queue->rx.sring which is set to NULL in xennet_disconnect_backend().
      Signed-off-by: default avatarBoris Ostrovsky <boris.ostrovsky@oracle.com>
      CC: stable@vger.kernel.org
      Reviewed-by: default avatarJuergen Gross <jgross@suse.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      74470954
    • Ralf Baechle's avatar
      NET: mkiss: Fix panic · 7ba1b689
      Ralf Baechle authored
      If a USB-to-serial adapter is unplugged, the driver re-initializes, with
      dev->hard_header_len and dev->addr_len set to zero, instead of the correct
      values.  If then a packet is sent through the half-dead interface, the
      kernel will panic due to running out of headroom in the skb when pushing
      for the AX.25 headers resulting in this panic:
      
      [<c0595468>] (skb_panic) from [<c0401f70>] (skb_push+0x4c/0x50)
      [<c0401f70>] (skb_push) from [<bf0bdad4>] (ax25_hard_header+0x34/0xf4 [ax25])
      [<bf0bdad4>] (ax25_hard_header [ax25]) from [<bf0d05d4>] (ax_header+0x38/0x40 [mkiss])
      [<bf0d05d4>] (ax_header [mkiss]) from [<c041b584>] (neigh_compat_output+0x8c/0xd8)
      [<c041b584>] (neigh_compat_output) from [<c043e7a8>] (ip_finish_output+0x2a0/0x914)
      [<c043e7a8>] (ip_finish_output) from [<c043f948>] (ip_output+0xd8/0xf0)
      [<c043f948>] (ip_output) from [<c043f04c>] (ip_local_out_sk+0x44/0x48)
      
      This patch makes mkiss behave like the 6pack driver. 6pack does not
      panic.  In 6pack.c sp_setup() (same function name here) the values for
      dev->hard_header_len and dev->addr_len are set to the same values as in
      my mkiss patch.
      
      [ralf@linux-mips.org: Massages original submission to conform to the usual
      standards for patch submissions.]
      Signed-off-by: default avatarThomas Osterried <thomas@osterried.de>
      Signed-off-by: default avatarRalf Baechle <ralf@linux-mips.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      7ba1b689
    • Kejian Yan's avatar
      net: hns: Fix the device being used for dma mapping during TX · b85ea006
      Kejian Yan authored
      This patch fixes the device being used to DMA map skb->data.
      Erroneous device assignment causes the crash when SMMU is enabled.
      This happens during TX since buffer gets DMA mapped with device
      correspondign to net_device and gets unmapped using the device
      related to DSAF.
      Signed-off-by: default avatarKejian Yan <yankejian@huawei.com>
      Reviewed-by: default avatarYisen Zhuang <yisen.zhuang@huawei.com>
      Signed-off-by: default avatarSalil Mehta <salil.mehta@huawei.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      b85ea006
    • Thomas Gleixner's avatar
      Merge tag 'irqchip-fixes-4.10' of git://git.infradead.org/users/jcooper/linux into irq/urgent · d128dfb5
      Thomas Gleixner authored
      Pull irqchip fixes for v4.10 from Jason Cooper
      
      - keystone: Fix scheduling while atomic for realtime
      - mxs: Enable SKIP_SET_WAKE and MASK_ON_SUSPEND
      d128dfb5
    • Andrey Ryabinin's avatar
      x86/mm/ptdump: Fix soft lockup in page table walker · 146fbb76
      Andrey Ryabinin authored
      CONFIG_KASAN=y needs a lot of virtual memory mapped for its shadow.
      In that case ptdump_walk_pgd_level_core() takes a lot of time to
      walk across all page tables and doing this without
      a rescheduling causes soft lockups:
      
       NMI watchdog: BUG: soft lockup - CPU#3 stuck for 23s! [swapper/0:1]
       ...
       Call Trace:
        ptdump_walk_pgd_level_core+0x40c/0x550
        ptdump_walk_pgd_level_checkwx+0x17/0x20
        mark_rodata_ro+0x13b/0x150
        kernel_init+0x2f/0x120
        ret_from_fork+0x2c/0x40
      
      I guess that this issue might arise even without KASAN on huge machines
      with several terabytes of RAM.
      
      Stick cond_resched() in pgd loop to fix this.
      Reported-by: default avatarTobias Regnery <tobias.regnery@gmail.com>
      Signed-off-by: default avatarAndrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: kasan-dev@googlegroups.com
      Cc: Alexander Potapenko <glider@google.com>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/20170210095405.31802-1-aryabinin@virtuozzo.comSigned-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      146fbb76
    • Thomas Gleixner's avatar
      x86/tsc: Make the TSC ADJUST sanitizing work for tsc_reliable · 5f2e71e7
      Thomas Gleixner authored
      When the TSC is marked reliable then the synchronization check is skipped,
      but that also skips the TSC ADJUST sanitizing code. So on a machine with a
      wreckaged BIOS the TSC deviation between CPUs might go unnoticed.
      
      Let the TSC adjust sanitizing code run unconditionally and just skip the
      expensive synchronization checks when TSC is marked reliable.
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Olof Johansson <olof@lixom.net>
      Link: http://lkml.kernel.org/r/20170209151231.491189912@linutronix.deSigned-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      5f2e71e7
    • Thomas Gleixner's avatar
      x86/tsc: Avoid the large time jump when sanitizing TSC ADJUST · f2e04214
      Thomas Gleixner authored
      Olof reported that on a machine which has a BIOS wreckaged TSC the
      timestamps in dmesg are making a large jump because the TSC value is
      jumping forward after resetting the TSC ADJUST register to a sane value.
      
      This can be avoided by calling the TSC ADJUST saniziting function before
      initializing the per cpu sched clock machinery. That takes the offset into
      account and avoid the time jump.
      
      What cannot be avoided is that the 'Firmware Bug' warnings on the secondary
      CPUs are printed with the large time offsets because it would be too much
      effort and ugly hackery to print those warnings into a buffer and emit them
      after the adjustemt on the starting CPUs. It's a firmware bug and should be
      fixed in firmware. The weird timestamps are collateral damage and just
      illustrate the sillyness of the BIOS folks:
      
      [    0.397445] smp: Bringing up secondary CPUs ...
      [    0.402100] x86: Booting SMP configuration:
      [    0.406343] .... node  #0, CPUs:      #1
      [1265776479.930667] [Firmware Bug]: TSC ADJUST differs: Reference CPU0: -2978888639075328 CPU1: -2978888639183101
      [1265776479.944664] TSC ADJUST synchronize: Reference CPU0: 0 CPU1: -2978888639183101
      [    0.508119]  #2
      [1265776480.032346] [Firmware Bug]: TSC ADJUST differs: Reference CPU0: -2978888639075328 CPU2: -2978888639183677
      [1265776480.044192] TSC ADJUST synchronize: Reference CPU0: 0 CPU2: -2978888639183677
      [    0.607643]  #3
      [1265776480.131874] [Firmware Bug]: TSC ADJUST differs: Reference CPU0: -2978888639075328 CPU3: -2978888639184530
      [1265776480.143720] TSC ADJUST synchronize: Reference CPU0: 0 CPU3: -2978888639184530
      [    0.707108] smp: Brought up 1 node, 4 CPUs
      [    0.711271] smpboot: Total of 4 processors activated (21698.88 BogoMIPS)
      Reported-by: default avatarOlof Johansson <olof@lixom.net>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/20170209151231.411460506@linutronix.deSigned-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      f2e04214
    • Frederic Weisbecker's avatar
      tick/nohz: Fix possible missing clock reprog after tick soft restart · 7bdb59f1
      Frederic Weisbecker authored
      ts->next_tick keeps track of the next tick deadline in order to optimize
      clock programmation on irq exit and avoid redundant clock device writes.
      
      Now if ts->next_tick missed an update, we may spuriously miss a clock
      reprog later as the nohz code is fooled by an obsolete next_tick value.
      
      This is what happens here on a specific path: when we observe an
      expired timer from the nohz update code on irq exit, we perform a soft
      tick restart which simply fires the closest possible tick without
      actually exiting the nohz mode and restoring a periodic state. But we
      forget to update ts->next_tick accordingly.
      
      As a result, after the next tick resulting from such soft tick restart,
      the nohz code sees a stale value on ts->next_tick which doesn't match
      the clock deadline that just expired. If that obsolete ts->next_tick
      value happens to collide with the actual next tick deadline to be
      scheduled, we may spuriously bypass the clock reprogramming. In the
      worst case, the tick may never fire again.
      
      Fix this with a ts->next_tick reset on soft tick restart.
      Signed-off-by: default avatarFrederic Weisbecker <fweisbec@gmail.com>
      Reviewed: Wanpeng Li <wanpeng.li@hotmail.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/1486485894-29173-1-git-send-email-fweisbec@gmail.comSigned-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      7bdb59f1