1. 27 Mar, 2015 2 commits
  2. 23 Mar, 2015 4 commits
    • sched/rt: Use IPI to trigger RT task push migration instead of pulling · b6366f04
      Steven Rostedt authored
      When debugging the latencies on a 40 core box, where we hit 300 to
      500 microsecond latencies, I found there was a huge contention on the
      runqueue locks.
      
      Investigating it further, running ftrace, I found that it was due to
      the pulling of RT tasks.
      
      The test that was run was the following:
      
       cyclictest --numa -p95 -m -d0 -i100
      
      This created a thread on each CPU, that would set its wakeup in iterations
      of 100 microseconds. The -d0 means that all the threads had the same
      interval (100us). Each thread sleeps for 100us and wakes up and measures
      its latencies.
      
      cyclictest is maintained at:
       git://git.kernel.org/pub/scm/linux/kernel/git/clrkwllms/rt-tests.git
      
      What happened was another RT task would be scheduled on one of the CPUs
      that was running our test, when the other CPU tests went to sleep and
      scheduled idle. This caused the "pull" operation to execute on all
      these CPUs. Each one of these saw the RT task that was overloaded on
      the CPU of the test that was still running, and each one tried
      to grab that task in a thundering herd way.
      
      To grab the task, each thread would do a double rq lock grab, grabbing
      its own lock as well as the rq of the overloaded CPU. As the sched
      domains on this box were rather flat for its size, I saw up to 12 CPUs
      block on this lock at once. This caused a ripple effect with the
      rq locks, especially since the taking was done via a double rq lock, which
      means that several of the CPUs had their own rq locks held while trying
      to take this rq lock. As these locks were blocked, any wakeups or load
      balancing on these CPUs would also block on these locks, and the wait
      time escalated.
      
      I've tried various methods to lessen the load, but things like an
      atomic counter to only let one CPU grab the task won't work, because
      the task may have a limited affinity, and we may pick the wrong
      CPU to take that lock and do the pull, only to find out that the
      CPU we picked isn't in the task's affinity.
      
      Instead of doing the PULL, I now have the CPUs that want the pull to
      send over an IPI to the overloaded CPU, and let that CPU pick what
      CPU to push the task to. No more need to grab the rq lock, and the
      push/pull algorithm still works fine.
      
      With this patch, the latency dropped to just 150us over a 20 hour run.
      Without the patch, the huge latencies would trigger in seconds.
      
      I've created a new sched feature called RT_PUSH_IPI, which is enabled
      by default.
      
      When RT_PUSH_IPI is not enabled, the old method of grabbing the rq locks
      and having the pulling CPU do the work is implemented. When RT_PUSH_IPI
      is enabled, the IPI is sent to the overloaded CPU to do a push.
      
      To enable or disable this at run time:
      
       # mount -t debugfs nodev /sys/kernel/debug
       # echo RT_PUSH_IPI > /sys/kernel/debug/sched_features
      or
       # echo NO_RT_PUSH_IPI > /sys/kernel/debug/sched_features
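
The sched_features file lists feature names separated by spaces, with a `NO_` prefix on the disabled ones, and the same token is what gets written back to toggle a feature. A minimal sketch of reading that format follows; the helper names and the sample feature line are made up for illustration and are not part of the patch.

```python
def feature_enabled(features_text, name):
    """Return True or False for a feature found in sched_features text.

    Returns None when the feature is not listed at all.
    """
    for token in features_text.split():
        if token == name:
            return True
        if token == "NO_" + name:
            return False
    return None


def toggle_token(name, enable):
    """Build the token to write to sched_features for the wanted state."""
    return name if enable else "NO_" + name


# Hypothetical file contents, only for demonstration.
sample = "GENTLE_FAIR_SLEEPERS NO_HRTICK RT_PUSH_IPI"
print(feature_enabled(sample, "RT_PUSH_IPI"))
print(toggle_token("RT_PUSH_IPI", enable=False))
```

On a real system the text would come from reading /sys/kernel/debug/sched_features as root, once debugfs is mounted as shown above.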
      
      Update: This original patch would send an IPI to all CPUs in the RT overload
      list. But that could theoretically cause the reverse issue. That is, there
      could be lots of overloaded RT queues and one CPU lowers its priority. It would
      then send an IPI to all the overloaded RT queues and they could then all try
      to grab the rq lock of the CPU lowering its priority, and then we have the
      same problem.
      
      The latest design sends out only one IPI to the first overloaded CPU. It tries to
      push any tasks that it can, and then looks for the next overloaded CPU that can
      push to the source CPU. The IPIs stop when all overloaded CPUs that have pushable
      tasks that have priorities greater than the source CPU are covered. In case the
      source CPU lowers its priority again, a flag is set to tell the IPI traversal to
      restart with the first RT overloaded CPU after the source CPU.
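
The one-IPI-at-a-time traversal described above can be sketched as a small model. This is a simplified illustration, not the kernel code: the function name, the dictionary of per-CPU pushable priorities, and the convention that a larger number means a higher priority are all assumptions, and the restart flag and the actual task push are not modeled.

```python
def ipi_chain(source_cpu, source_prio, pushable_prio, num_cpus):
    """Return the CPUs that would receive the IPI, in visiting order.

    pushable_prio maps an overloaded CPU to the priority of its best
    pushable task (assumed: larger number = higher priority). The walk
    starts at the first CPU after source_cpu and wraps around once.
    """
    visited = []
    for step in range(1, num_cpus):
        cpu = (source_cpu + step) % num_cpus
        prio = pushable_prio.get(cpu)
        # Only overloaded CPUs whose pushable task beats the source CPU's
        # current priority are worth interrupting.
        if prio is not None and prio > source_prio:
            visited.append(cpu)
    return visited


# Hypothetical 8-CPU box: CPU 2 just lowered its priority to 10.
overloaded = {0: 50, 3: 5, 5: 40, 6: 90}
chain = ipi_chain(source_cpu=2, source_prio=10,
                  pushable_prio=overloaded, num_cpus=8)
print(chain)
```

In this model CPU 3 is skipped because its pushable task does not outrank the source CPU, and only one CPU in the returned order is interrupted at any moment, which is what avoids the herd on the source CPU's rq lock.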
      Parts-suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Joern Engel <joern@purestorage.com>
      Cc: Clark Williams <williams@redhat.com>
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20150318144946.2f3cc982@gandalf.local.home
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b6366f04
    • irq_work: Fix build failure when CONFIG_IRQ_WORK is not defined · 71ad00d6
      Steven Rostedt authored
      When CONFIG_IRQ_WORK is not defined (difficult to do, as it also
      requires CONFIG_PRINTK not to be defined), we get a build failure:
      
      	kernel/built-in.o: In function `flush_smp_call_function_queue':
      	kernel/smp.c:263: undefined reference to `irq_work_run'
      	kernel/smp.c:263: undefined reference to `irq_work_run'
      	Makefile:933: recipe for target 'vmlinux' failed
      
      Simplest thing to do is to make irq_work_run() a nop when not set.
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Link: http://lkml.kernel.org/r/20150319101851.4d224d9b@gandalf.local.home
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      71ad00d6
    • sched: Fix RLIMIT_RTTIME when PI-boosting to RT · 746db944
      Brian Silverman authored
      When non-realtime tasks get priority-inheritance boosted to a realtime
      scheduling class, RLIMIT_RTTIME starts to apply to them. However, the
      counter used for checking this (the same one used for SCHED_RR
      timeslices) was not getting reset. This meant that tasks running with a
      non-realtime scheduling class which are repeatedly boosted to a realtime
      one, but never block while they are running realtime, eventually hit the
      timeout without ever running for a time over the limit. This patch
      resets the realtime timeslice counter when un-PI-boosting from an RT to
      a non-RT scheduling class.
      
      I have some test code with two threads and a shared PTHREAD_PRIO_INHERIT
      mutex which induces priority boosting and spins while boosted that gets
      killed by a SIGXCPU on non-fixed kernels but doesn't with this patch
      applied. It happens much faster with a CONFIG_PREEMPT_RT kernel, and
      does happen eventually with PREEMPT_VOLUNTARY kernels.
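
The accumulation problem described above can be shown with a toy model. It is a sketch under simplifying assumptions: time is counted in abstract ticks, the task never blocks while boosted, and the function and parameter names are invented for illustration rather than taken from the kernel.

```python
def boost_that_hits_limit(boost_ticks, boosts, limit_ticks, reset_on_deboost):
    """Return the 1-based boost period in which the limit is exceeded.

    Returns None if the task survives all boost periods. The counter only
    advances while the task runs boosted to a realtime class.
    """
    timeout = 0
    for period in range(1, boosts + 1):
        for _ in range(boost_ticks):
            timeout += 1
            if timeout > limit_ticks:
                return period  # SIGXCPU would be delivered here
        if reset_on_deboost:
            # Behaviour with the patch: dropping back to a non-RT class
            # clears the counter.
            timeout = 0
    return None


# Each boost lasts 3 ticks, well under the 10-tick limit.
print(boost_that_hits_limit(3, 10, 10, reset_on_deboost=False))
print(boost_that_hits_limit(3, 10, 10, reset_on_deboost=True))
```

Without the reset the counter carries over from one boost to the next, so the task is signalled even though no single realtime stretch came close to the limit.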
      Signed-off-by: Brian Silverman <brian@peloton-tech.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: austin@peloton-tech.com
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/1424305436-6716-1-git-send-email-brian@peloton-tech.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      746db944
  3. 22 Mar, 2015 7 commits
  4. 21 Mar, 2015 11 commits
    • Merge branch 'fixes' of git://git.infradead.org/users/vkoul/slave-dma · f8975224
      Linus Torvalds authored
      Pull slave dmaengine fixes from Vinod Koul:
       "Four fixes for dw, pl08x, imx-sdma and at_hdmac driver.  Nothing
        unusual here, simple fixes to these drivers"
      
      * 'fixes' of git://git.infradead.org/users/vkoul/slave-dma:
        dmaengine: pl08x: Define capabilities for generic capabilities reporting
        dmaengine: dw: append MODULE_ALIAS for platform driver
        dmaengine: imx-sdma: switch to dynamic context mode after script loaded
        dmaengine: at_hdmac: Fix calculation of the residual bytes
      f8975224
    • Merge tag 'pm+acpi-4.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · 3d7a6db5
      Linus Torvalds authored
      Pull power management and ACPI fixes from Rafael Wysocki:
       "These are fixes for recent regressions (PCI/ACPI resources and at91
        RTC locking), a stable-candidate powercap RAPL driver fix and two ARM
        cpuidle fixes (one stable-candidate too).
      
        Specifics:
      
         - Revert a recent PCI commit related to IRQ resources management that
           introduced a regression for drivers attempting to bind to devices
           whose previous drivers did not balance pci_enable_device() and
           pci_disable_device() as expected (Rafael J Wysocki).
      
         - Fix a deadlock in at91_rtc_interrupt() introduced by a typo in a
           recent commit related to wakeup interrupt handling (Dan Carpenter).
      
         - Allow the power capping RAPL (Running-Average Power Limit) driver
           to use different energy units for domains within one CPU package
           which is necessary to handle Intel Haswell EP processors correctly
           (Jacob Pan).
      
         - Improve the cpuidle mvebu driver's handling of Armada XP SoCs by
           updating the target residency and exit latency numbers for those
           chips (Sebastien Rannou).
      
         - Prevent the cpuidle mvebu driver from calling cpu_pm_enter() twice
           in a row before cpu_pm_exit() is called on the same CPU which
           breaks the core's assumptions regarding the usage of those
           functions (Gregory Clement)"
      
      * tag 'pm+acpi-4.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        Revert "x86/PCI: Refine the way to release PCI IRQ resources"
        rtc: at91rm9200: double locking bug in at91_rtc_interrupt()
        powercap / RAPL: handle domains with different energy units
        cpuidle: mvebu: Update cpuidle thresholds for Armada XP SOCs
        cpuidle: mvebu: Fix the CPU PM notifier usage
      3d7a6db5
    • Merge git://people.freedesktop.org/~airlied/linux · 97448d5b
      Linus Torvalds authored
      Pull drm updates from Dave Airlie:
       "A bunch of fixes across drivers:
      
        radeon:
           disable two ended allocation for now, it breaks some stuff
      
        amdkfd:
           misc fixes
      
        nouveau:
           fix irq loop problem, add basic support for GM206 (new hw)
      
        i915:
           fix some WARNs people were seeing
      
        exynos:
           fix some iommu interactions causing boot failures"
      
      * git://people.freedesktop.org/~airlied/linux:
        drm/radeon: drop ttm two ended allocation
        drm/exynos: fix the initialization order in FIMD
        drm/exynos: fix typo config name correctly.
        drm/exynos: Check for NULL dereference of crtc
        drm/exynos: IS_ERR() vs NULL bug
        drm/exynos: remove unused files
        drm/i915: Make sure the primary plane is enabled before reading out the fb state
        drm/nouveau/bios: fix i2c table parsing for dcb 4.1
        drm/nouveau/device/gm100: Basic GM206 bring up (as copy of GM204)
        drm/nouveau/device: post write to NV_PMC_BOOT_1 when flipping endian switch
        drm/nouveau/gr/gf100: fix some accidental or'ing of buffer addresses
        drm/nouveau/fifo/nv04: remove the loop from the interrupt handler
        drm/radeon: Changing number of compute pipe lines
        drm/amdkfd: Fix SDMA queue init. in non-HWS mode
        drm/amdkfd: destroy mqd when destroying kernel queue
        drm/i915: Ensure plane->state->fb stays in sync with plane->fb
      97448d5b
    • Merge tag 'devicetree-fixes-for-4.0-part2' of... · bb8ef2fb
      Linus Torvalds authored
      Merge tag 'devicetree-fixes-for-4.0-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux
      
      Pull more DeviceTree fixes from Rob Herring:
      
       - revert setting stdout-path as preferred console.  This caused
         regressions in PowerMACs and other systems.
      
       - yet another fix for stdout-path option parsing.
      
       - fix error path handling in of_irq_parse_one
      
      * tag 'devicetree-fixes-for-4.0-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
        Revert "of: Fix premature bootconsole disable with 'stdout-path'"
        of: handle both '/' and ':' in path strings
        of: unittest: Add option string test case with longer path
        of/irq: Fix of_irq_parse_one() returned error codes
      bb8ef2fb
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending · e477f3e0
      Linus Torvalds authored
      Pull SCSI target fixes from Nicholas Bellinger:
       "Here are current target-pending fixes for v4.0-rc5 code that have made
        their way into the queue over the last weeks.
      
        The fixes this round include:
      
         - Fix long-standing iser-target logout bug related to early
           conn_logout_comp completion, resulting in iscsi_conn use-after-free
           OOpsen.  (Sagi + nab)
      
         - Fix long-standing tcm_fc bug in ft_invl_hw_context() failure
           handling for DDP hw offload.  (DanC)
      
         - Fix incorrect use of unprotected __transport_register_session() in
           tcm_qla2xxx + other single local se_node_acl fabrics.  (Bart)
      
         - Fix reference leak in target_submit_cmd() -> target_get_sess_cmd()
           for ack_kref=1 failure path.  (Bart)
      
         - Fix pSCSI backend ->get_device_type() statistics OOPs with
           un-configured device.  (Olaf + nab)
      
         - Fix virtual LUN=0 target_configure_device failure OOPs at modprobe
           time.  (Claudio + nab)
      
         - Fix FUA write false positive failure regression in v4.0-rc1 code.
           (Christophe Vu-Brugier + HCH)"
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending:
        target: do not reject FUA CDBs when write cache is enabled but emulate_write_cache is 0
        target: Fix virtual LUN=0 target_configure_device failure OOPs
        target/pscsi: Fix NULL pointer dereference in get_device_type
        tcm_fc: missing curly braces in ft_invl_hw_context()
        target: Fix reference leak in target_get_sess_cmd() error path
        loop/usb/vhost-scsi/xen-scsiback: Fix use of __transport_register_session
        tcm_qla2xxx: Fix incorrect use of __transport_register_session
        iscsi-target: Avoid early conn_logout_comp for iser connections
        Revert "iscsi-target: Avoid IN_LOGOUT failure case for iser-target"
        target: Disallow changing of WRITE cache/FUA attrs after export
      e477f3e0
    • Merge tag 'dm-4.0-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm · da6b9a20
      Linus Torvalds authored
      Pull devicemapper fixes from Mike Snitzer:
       "A handful of stable fixes for DM:
         - fix thin target to always zero-fill reads to unprovisioned blocks
         - fix to interlock device destruction's suspend from internal
           suspends
         - fix 2 snapshot exception store handover bugs
         - fix dm-io to cope with DISCARD and WRITE_SAME capabilities changing"
      
      * tag 'dm-4.0-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
        dm io: deal with wandering queue limits when handling REQ_DISCARD and REQ_WRITE_SAME
        dm snapshot: suspend merging snapshot when doing exception handover
        dm snapshot: suspend origin when doing exception handover
        dm: hold suspend_lock while suspending device during device deletion
        dm thin: fix to consistently zero-fill reads to unprovisioned blocks
      da6b9a20
    • Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs · 521d4746
      Linus Torvalds authored
      Pull btrfs fixes from Chris Mason:
       "Most of these are fixing extent reservation accounting, or corners
        with tree writeback during commit.
      
        Josef's set does add a test, which isn't strictly a fix, but it'll
        keep us from making this same mistake again"
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
        Btrfs: fix outstanding_extents accounting in DIO
        Btrfs: add sanity test for outstanding_extents accounting
        Btrfs: just free dummy extent buffers
        Btrfs: account merges/splits properly
        Btrfs: prepare block group cache before writing
        Btrfs: fix ASSERT(list_empty(&cur_trans->dirty_bgs_list)
        Btrfs: account for the correct number of extents for delalloc reservations
        Btrfs: fix merge delalloc logic
        Btrfs: fix comp_oper to get right order
        Btrfs: catch transaction abortion after waiting for it
        btrfs: fix sizeof format specifier in btrfs_check_super_valid()
      521d4746
    • Merge branch 'for-4.0' of git://linux-nfs.org/~bfields/linux · 0d122f74
      Linus Torvalds authored
      Pull nfsd bugfix from Bruce Fields:
       "This is a fix for a crash easily triggered by 4.1 activity to a server
        built with CONFIG_NFSD_PNFS.
      
        There are some more bugfixes queued up that I intend to pass along
        next week, but this is the most critical"
      
      * 'for-4.0' of git://linux-nfs.org/~bfields/linux:
        Subject: nfsd: don't recursively call nfsd4_cb_layout_fail
      0d122f74
    • Merge tag 'upstream-4.0-rc5' of git://git.infradead.org/linux-ubifs · c6ef8145
      Linus Torvalds authored
      Pull UBI fix from Artem Bityutskiy:
       "This fixes a bug introduced during the v4.0 merge window where we
        forgot to put braces where they should be"
      
      * tag 'upstream-4.0-rc5' of git://git.infradead.org/linux-ubifs:
        UBI: fix missing brace control flow
      c6ef8145
    • Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux · 60ed380e
      Linus Torvalds authored
      Pull arm64 fixes from Catalin Marinas:
      
       - mm switching fix where the kernel pgd ends up in the user TTBR0 after
         returning from an EFI run-time services call
      
       - fix __GFP_ZERO handling for atomic pool and CMA DMA allocations (the
         generic code does get the gfp flags, so it's left with the arch code
         to memzero accordingly)
      
      * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
        arm64: Honor __GFP_ZERO in dma allocations
        arm64: efi: don't restore TTBR0 if active_mm points at init_mm
      60ed380e
    • Merge branch 'fixes' of git://ftp.arm.linux.org.uk/~rmk/linux-arm · 62a202d7
      Linus Torvalds authored
      Pull ARM fixes from Russell King:
       "Another few ARM fixes.  Fabrice fixed the L2 cache DT parsing to allow
        prefetch configuration to be specified even when the cache size
        parsing fails.
      
        Laura noticed that the setting of page attributes wasn't working for
        modules due to is_module_addr() always returning false.
      
        Marc Gonzalez (aka Mason) noticed a potential latent bug with the way
        we read one of the CPUID registers (where we could attempt to read a
        non-present CPUID register which may fault.)
      
        I've fixed an issue where 32-bit DMA masks were failing with memory
        which extended to the top of physical address space, and I've also
        added debugging output of the page tables when we hit a data access
        exception which we don't specifically handle - prompted by the lack of
        information in a bug report"
      
      * 'fixes' of git://ftp.arm.linux.org.uk/~rmk/linux-arm:
        ARM: 8313/1: Use read_cpuid_ext() macro instead of inline asm
        ARM: 8311/1: Don't use is_module_addr in setting page attributes
        ARM: 8310/1: l2c: Fix prefetch settings dt parsing
        ARM: dump pgd, pmd and pte states on unhandled data abort faults
        ARM: dma-api: fix off-by-one error in __dma_supported()
      62a202d7
  5. 20 Mar, 2015 16 commits