1. 10 Oct, 2017 24 commits
    • sched/rt: Simplify the IPI based RT balancing logic · 4bdced5c
      Steven Rostedt (Red Hat) authored
      When a CPU lowers its priority (schedules out a high priority task for a
      lower priority one), it checks the rto_mask to see whether any other CPU
      has overloaded RT tasks (more than one). If so, and if a non-running RT
      task there has a higher priority than the new priority of the next task
      to run on the current CPU, it will request to pull one of those tasks to
      itself.
      
      When we deal with a large number of CPUs, the original pull logic suffered
      from large lock contention on a single CPU run queue, which caused a huge
      latency across all CPUs. This was caused by only one CPU having overloaded
      RT tasks while a bunch of other CPUs were lowering their priority. To
      solve this issue, commit:
      
        b6366f04 ("sched/rt: Use IPI to trigger RT task push migration instead of pulling")
      
      changed the way to request a pull. Instead of grabbing the lock of the
      overloaded CPU's runqueue, it simply sent an IPI to that CPU to do the work.
      
      Although the IPI logic worked very well in removing the large latency build
      up, it still could suffer from a large number of IPIs being sent to a single
      CPU. On an 80 CPU box, I measured over 200us of processing IPIs. Worse yet,
      when I tested this on a 120 CPU box, with a stress test that had lots of
      RT tasks scheduling on all CPUs, it actually triggered the hard lockup
      detector! One CPU had so many IPIs sent to it, and due to the restart
      mechanism that is triggered when the source run queue has a priority status
      change, the CPU spent minutes! processing the IPIs.
      
      Thinking about this further, I realized there's no reason for each run queue
      to send its own IPI. All CPUs with overloaded tasks must be scanned
      regardless of whether one or many CPUs are lowering their priority, because
      there's currently no way to find the CPU with the highest priority task that
      can schedule to one of these CPUs. So there really only needs to be one IPI
      being sent around at a time.
      
      This greatly simplifies the code!
      
      The new approach is to have each root domain have its own irq work, as the
      rto_mask is per root domain. The root domain has the following fields
      attached to it:
      
        rto_push_work	 - the irq work to process each CPU set in rto_mask
        rto_lock	 - the lock to protect some of the other rto fields
        rto_loop_start - an atomic that keeps contention down on rto_lock;
      		    the first CPU scheduling in a lower priority task
      		    is the one to kick off the process.
        rto_loop_next	 - an atomic that gets incremented for each CPU that
      		    schedules in a lower priority task.
        rto_loop	 - a variable protected by rto_lock that is used to
      		    compare against rto_loop_next
        rto_cpu	 - The cpu to send the next IPI to, also protected by
      		    the rto_lock.
      
      When a CPU schedules in a lower priority task and wants to make sure
      overloaded CPUs know about it, it increments rto_loop_next. Then it
      atomically sets rto_loop_start with a cmpxchg. If the old value is not "0",
      then it is done, as another CPU is kicking off the IPI loop. If the old
      value is "0", then it will take the rto_lock to synchronize with a possible
      IPI being sent around to the overloaded CPUs.
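
      A sketch of that gate, assuming the kernel's atomic_t API (these are the
      rto_start_trylock()/rto_start_unlock() helpers credited at the end of
      this message; the exact memory-ordering variants may differ):

        static inline bool rto_start_trylock(atomic_t *v)
        {
                /* Only the caller that sees the old value "0" kicks off the loop */
                return !atomic_cmpxchg_acquire(v, 0, 1);
        }

        static inline void rto_start_unlock(atomic_t *v)
        {
                /* Released only once the IPI loop has fully finished */
                atomic_set_release(v, 0);
        }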
      
      If rto_cpu is greater than or equal to nr_cpu_ids, then there's either no
      IPI being sent around, or one is about to finish. Then rto_cpu is set to the
      first CPU in rto_mask and an IPI is sent to that CPU. If there are no CPUs set
      in rto_mask, then there's nothing to be done.
      
      When the CPU receives the IPI, it will first try to push any RT tasks that are
      queued on the CPU but can't run because a higher priority RT task is
      currently running on that CPU.
      
      Then it takes the rto_lock and looks for the next CPU in the rto_mask. If it
      finds one, it simply sends an IPI to that CPU and the process continues.
      
      If there's no more CPUs in the rto_mask, then rto_loop is compared with
      rto_loop_next. If they match, everything is done and the process is over. If
      they do not match, then a CPU scheduled in a lower priority task as the IPI
      was being passed around, and the process needs to start again. The first CPU
      in rto_mask is sent the IPI.
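
      In code, the end-of-loop decision described above reduces to a check of
      roughly this shape (a sketch; the field names follow the list above and
      the helper name is illustrative, not necessarily the final kernel code):

        /* Called with rto_lock held, after no further CPU was found in rto_mask */
        static bool rto_loop_finished(struct root_domain *rd)
        {
                int next = atomic_read(&rd->rto_loop_next);

                if (rd->rto_loop == next)
                        return true;    /* no CPU lowered its priority meanwhile */

                rd->rto_loop = next;    /* record what we have seen and go again */
                return false;
        }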
      
      This change removes this duplication of work in the IPI logic, and greatly
      lowers the latency caused by the IPIs. This removed the lockup happening on
      the 120 CPU machine. It also simplifies the code tremendously. What else
      could anyone ask for?
      
      Thanks to Peter Zijlstra for simplifying the rto_loop_start atomic logic and
      supplying me with the rto_start_trylock() and rto_start_unlock() helper
      functions.
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Clark Williams <williams@redhat.com>
      Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
      Cc: John Kacur <jkacur@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Scott Wood <swood@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20170424114732.1aac6dc4@gandalf.local.home
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      4bdced5c
    • block/ioprio: Use a helper to check for RT prio · 36436440
      Sebastian Andrzej Siewior authored
      A side-effect compared to the old code is that SCHED_DEADLINE is now also
      recognized.
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20171004154901.26904-2-bigeasy@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      36436440
    • sched/rt: Add a helper to test for a RT task · ff0d4a9d
      Sebastian Andrzej Siewior authored
      This helper returns true if a task has elevated priority which is true
      for RT tasks (SCHED_RR and SCHED_FIFO) and also for SCHED_DEADLINE.
      A task which runs at RT priority due to PI-boosting is not considered as
      one with elevated priority.
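
      A sketch of the helper's shape based on the description above (assuming
      the usual kernel struct task_struct and SCHED_* policy constants; the
      helper's actual name in the kernel headers may differ). Checking ->policy
      rather than the effective priority is what keeps PI-boosted tasks out:

        static inline bool task_has_elevated_policy(const struct task_struct *p)
        {
                return p->policy == SCHED_FIFO ||
                       p->policy == SCHED_RR   ||
                       p->policy == SCHED_DEADLINE;
        }
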
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20171004154901.26904-1-bigeasy@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      ff0d4a9d
    • sched/fair: Fix usage of find_idlest_group() when the local group is idlest · 93f50f90
      Brendan Jackman authored
      find_idlest_group() returns NULL when the local group is idlest. The
      caller then continues the find_idlest_group() search at a lower level
      of the current CPU's sched_domain hierarchy. find_idlest_group_cpu() is
      not consulted and, crucially, @new_cpu is not updated. This means the
      search is pointless and we return @prev_cpu from select_task_rq_fair().
      
      This is fixed by initialising @new_cpu to @cpu instead of @prev_cpu.
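
      In code, the fix amounts to the initial value of the slow-path variable
      in select_task_rq_fair() (a one-line sketch):

        int new_cpu = cpu;      /* was: int new_cpu = prev_cpu; */
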
      Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20171005114516.18617-6-brendan.jackman@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      93f50f90
    • sched/fair: Fix usage of find_idlest_group() when no groups are allowed · 6fee85cc
      Brendan Jackman authored
      When 'p' is not allowed on any of the CPUs in the sched_domain, we
      currently return NULL from find_idlest_group(), and pointlessly
      continue the search on lower sched_domain levels (where 'p' is also not
      allowed) before returning prev_cpu regardless (as we have not updated
      new_cpu).
      
      Add an explicit check for this case, and add a comment to
      find_idlest_group(). Now when find_idlest_group() returns NULL, it always
      means that the local group is allowed and idlest.
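
      One way to express the added early exit, sketched with the standard
      cpumask helpers (the exact placement in the slow path may differ):

        /* If 'p' may not run anywhere in this domain, the search is moot */
        if (!cpumask_intersects(sched_domain_span(sd), &p->cpus_allowed))
                return prev_cpu;
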
      Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20171005114516.18617-5-brendan.jackman@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      6fee85cc
    • sched/fair: Fix find_idlest_group() when local group is not allowed · 0d10ab95
      Brendan Jackman authored
      When the local group is not allowed we do not modify this_*_load from
      their initial value of 0. That means that the load checks at the end
      of find_idlest_group cause us to incorrectly return NULL. Fixing the
      initial values to ULONG_MAX means we will instead return the idlest
      remote group in that case.
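
      A sketch of the changed initial values in find_idlest_group() (variable
      names paraphrased from the description above):

        unsigned long this_runnable_load = ULONG_MAX;   /* was 0 */
        unsigned long this_avg_load = ULONG_MAX;        /* was 0 */
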
      Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20171005114516.18617-4-brendan.jackman@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      0d10ab95
    • sched/fair: Remove unnecessary comparison with -1 · e90381ea
      Brendan Jackman authored
      Since commit:
      
        83a0a96a ("sched/fair: Leverage the idle state info when choosing the "idlest" cpu")
      
      find_idlest_group_cpu() (formerly find_idlest_cpu) no longer returns -1,
      so we can simplify the checking of the return value in find_idlest_cpu().
      Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20171005114516.18617-3-brendan.jackman@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      e90381ea
    • sched/fair: Move select_task_rq_fair() slow-path into its own function · 18bd1b4b
      Brendan Jackman authored
      In preparation for changes that would otherwise require adding a new
      level of indentation to the while(sd) loop, create a new function
      find_idlest_cpu() which contains this loop, and rename the existing
      find_idlest_cpu() to find_idlest_group_cpu().
      
      Code inside the while(sd) loop is unchanged. @new_cpu is added as a
      variable in the new function, with the same initial value as the
      @new_cpu in select_task_rq_fair().
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20171005114516.18617-2-brendan.jackman@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      18bd1b4b
    • sched/fair: Force balancing on NOHZ balance if local group has capacity · 583ffd99
      Brendan Jackman authored
      The "goto force_balance" here is intended to mitigate the fact that
      avg_load calculations can result in bad placement decisions when
      priority is asymmetrical.
      
      The original commit that adds it:
      
        fab47622 ("sched: Force balancing on newidle balance if local group has capacity")
      
      explains:
      
          Under certain situations, such as a niced down task (i.e. nice =
          -15) in the presence of nr_cpus NICE0 tasks, the niced task lands
          on a sched group and kicks away other tasks because of its large
          weight. This leads to sub-optimal utilization of the
          machine. Even though the sched group has capacity, it does not
          pull tasks because sds.this_load >> sds.max_load, and f_b_g()
          returns NULL.
      
      A similar but inverted issue also affects ARM big.LITTLE (asymmetrical CPU
      capacity) systems - consider 8 always-running, same-priority tasks on a
      system with 4 "big" and 4 "little" CPUs. Suppose that 5 of them end up on
      the "big" CPUs (which will be represented by one sched_group in the DIE
      sched_domain) and 3 on the "little" (the other sched_group in DIE), leaving
      one CPU unused. Because the "big" group has a higher group_capacity its
      avg_load may not present an imbalance that would cause migrating a
      task to the idle "little".
      
      The force_balance case here solves the problem but currently only for
      CPU_NEWLY_IDLE balances, which in theory might never happen on the
      unused CPU. Including CPU_IDLE in the force_balance case means
      there's an upper bound on the time before we can attempt to solve the
      underutilization: after DIE's sd->balance_interval has passed the
      next nohz balance kick will help us out.
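
      A sketch of the widened condition in find_busiest_group() (the idle-type
      constants and group-statistics names are the kernel's; the exact final
      form of the test may differ):

        /* Let both newly-idle and (nohz/periodic) idle balancing force it */
        if (env->idle != CPU_NOT_IDLE &&
            group_has_capacity(env, local) &&
            busiest->group_no_capacity)
                goto force_balance;
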
      Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20170807163900.25180-1-brendan.jackman@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      583ffd99
    • sched/fair: Sync task util before slow-path wakeup · ea16f0ea
      Brendan Jackman authored
      We use task_util() in find_idlest_group() via capacity_spare_wake().
      This task_util() is updated in wake_cap(). However wake_cap() is not the
      only reason for ending up in find_idlest_group() - we could have been sent
      there by wake_wide(). So explicitly sync the task util with prev_cpu
      when we are about to head to find_idlest_group().
      
      We could simply do this at the beginning of
      select_task_rq_fair() (i.e. irrespective of whether we're heading to
      select_idle_sibling() or find_idlest_group() & co), but I didn't want to
      slow down the select_idle_sibling() path more than necessary.
      
      Don't do this during fork balancing, we won't need the task_util and
      we'd just clobber the last_update_time, which is supposed to be 0.
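
      A sketch of the added sync, assuming the existing sync_entity_load_avg()
      helper in kernel/sched/fair.c (placement paraphrased: just before the
      find_idlest_group() path is entered):

        /* Bring p's utilization up to date against prev_cpu, except for fork
         * balancing where last_update_time must stay 0 */
        if (!(sd_flag & SD_BALANCE_FORK))
                sync_entity_load_avg(&p->se);
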
      Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andres Oportus <andresoportus@google.com>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Link: http://lkml.kernel.org/r/20170808095519.10077-1-brendan.jackman@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      ea16f0ea
    • sched/fair: Search a task from the tail of the queue · 93824900
      Uladzislau Rezki authored
      As a first step this patch makes the cfs_tasks list an MRU one:
      when the next task is picked to run on a physical CPU, it is
      moved to the front of the list.

      Therefore, the cfs_tasks list is more or less sorted (except for
      woken tasks), starting from tasks that were recently given CPU
      time toward tasks with the maximum wait time in the run-queue,
      i.e. an MRU list.

      Second, as part of the load balance operation, this approach
      starts detach_tasks()/detach_one_task() from the tail of the
      queue instead of the head, giving some advantages:

       - it tends to pick the task with the highest wait time;

       - tasks located at the tail are less likely to be cache-hot,
         therefore can_migrate_task() is more likely to let them move.

      hackbench shows slightly better performance. For example, doing
      1000 samples with 40 groups on an i5-3320M CPU gives the figures
      below:
      
       default: 0.657 avg
       patched: 0.646 avg
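
      The two sides of the change, sketched with the kernel list API (call
      sites paraphrased; see pick_next_task_fair() and detach_tasks()):

        /* When a task is picked to run, move it to the head of cfs_tasks,
         * keeping the list roughly MRU-ordered: */
        list_move(&p->se.group_node, &rq->cfs_tasks);

        /* When detaching tasks for load balancing, start from the tail: the
         * task that has waited longest and is least likely to be cache-hot */
        p = list_last_entry(tasks, struct task_struct, se.group_node);
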
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Kirill Tkhai <tkhai@yandex.ru>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Link: http://lkml.kernel.org/r/20170913102430.8985-2-urezki@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      93824900
    • sched/topology: Introduce NUMA identity node sched domain · 051f3ca0
      Suravee Suthikulpanit authored
      On AMD Family17h-based (EPYC) systems, a logical NUMA node can contain
      up to 8 cores (16 threads) with the following topology.
      
                   ----------------------------
               C0  | T0 T1 |    ||    | T0 T1 | C4
                   --------|    ||    |--------
               C1  | T0 T1 | L3 || L3 | T0 T1 | C5
                   --------|    ||    |--------
               C2  | T0 T1 | #0 || #1 | T0 T1 | C6
                   --------|    ||    |--------
               C3  | T0 T1 |    ||    | T0 T1 | C7
                   ----------------------------
      
      Here, there are 2 last-level (L3) caches per logical NUMA node.
      A socket can contain up to 4 NUMA nodes, and a system can support
      up to 2 sockets. With a full system configuration, the current scheduler
      creates 4 sched domains:
      
        domain0 SMT       (span a core)
        domain1 MC        (span a last-level-cache)
        domain2 NUMA      (span a socket: 4 nodes)
        domain3 NUMA      (span a system: 8 nodes)
      
      Note that there is no domain to represent cpus spanning a logical
      NUMA node.  With this hierarchy of sched domains, the scheduler does
      not balance properly in the following cases:
      
      Case1:
      
       When running 8 tasks, a properly balanced system should
       schedule a task per logical NUMA node. This is not the case for
       the current scheduler.
      
      Case2:
      
       In some cases, threads are scheduled on the same cpu, while other
       cpus are idle. This results in run-to-run inconsistency. For example:
      
        taskset -c 0-7 sysbench --num-threads=8 --test=cpu \
                                --cpu-max-prime=100000 run
      
      Total execution time ranges from 25.1s to 33.5s depending on threads
      placement, where 25.1s is when all 8 threads are balanced properly
      on 8 cpus.
      
      Introduce a NUMA identity node sched domain, which is based on how the
      SRAT/SLIT tables define a logical NUMA node. This results in the following
      hierarchy of sched domains on the same system described above.
      
        domain0 SMT       (span a core)
        domain1 MC        (span a last-level-cache)
        domain2 NODE      (span a logical NUMA node)
        domain3 NUMA      (span a socket: 4 nodes)
        domain4 NUMA      (span a system: 8 nodes)
      
      This fixes the improper load balancing cases mentioned above.
      Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: bp@suse.de
      Link: http://lkml.kernel.org/r/1504768805-46716-1-git-send-email-suravee.suthikulpanit@amd.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      051f3ca0
    • sched/topology: Restore SD_PREFER_SIBLING on MC domains · ed4ad1ca
      Peter Zijlstra authored
      The normal x86_topology on NHM+ machines degenerates because the MC
      and CPU domains are of the same size, therefore MC inherits
      SD_PREFER_SIBLING from CPU (which then gets taken out). The result is
      that we'll spread tasks across the first NUMA level in order to
      maximize cache utilization.
      
      However, for the x86_numa_in_package_topology we lose the CPU domain,
      and we'll not have SD_PREFER_SIBLING set anywhere, giving a distinct
      difference in behaviour.
      
      Commit:
      
        8e7fbcbc ("sched: Remove stale power aware scheduling remnants and dysfunctional knobs")
      
      failed to preserve SD_PREFER_SIBLING for the !power_saving
      case on both CPU and MC.
      
      Then commit:
      
        6956dc56 ("sched/numa: Add SD_PERFER_SIBLING to CPU domain")
      
      adds it back to the CPU but not MC.
      
      Restore that now, such that we get consistent spreading behaviour wrt
      L3 and NUMA.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      ed4ad1ca
    • sched/deadline: Use C bitfields for the state flags · 799ba82d
      luca abeni authored
      Ask the compiler to use a single bit for storing true / false values,
      instead of wasting the size of a whole int value.
      Tested with gcc 5.4.0 on x86_64, and the compiler produces the expected
      Assembly (similar to the Assembly code generated when explicitly accessing
      the bits with bitmasks, "&" and "|").
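
      A sketch of the resulting flag declarations (flag names as carried by
      the deadline scheduling entity around this time; the point is the ": 1"
      bitfield width replacing full int fields):

        unsigned int    dl_throttled      : 1;
        unsigned int    dl_boosted        : 1;
        unsigned int    dl_yielded        : 1;
        unsigned int    dl_non_contending : 1;
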
      Signed-off-by: luca abeni <luca.abeni@santannapisa.it>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1504778971-13573-5-git-send-email-luca.abeni@santannapisa.it
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      799ba82d
    • sched/deadline: Rename __dl_clear() to __dl_sub() · 8c0944ce
      Peter Zijlstra authored
      __dl_sub() is more meaningful as a name, and is more consistent
      with the naming of the dual function (__dl_add()).
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1504778971-13573-4-git-send-email-luca.abeni@santannapisa.it
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      8c0944ce
    • sched/deadline: Fix switching to -deadline · 295d6d5e
      Luca Abeni authored
      Fix a bug introduced in:
      
        72f9f3fd ("sched/deadline: Remove dl_new from struct sched_dl_entity")
      
      After that commit, when switching to -deadline if the scheduling
      deadline of a task is in the past then switched_to_dl() calls
      setup_new_entity() to properly initialize the scheduling deadline
      and runtime.
      
      The problem is that the task is enqueued _before_ having its parameters
      initialized by setup_new_entity(), and this can cause problems.
      For example, a task with its out-of-date deadline in the past will
      potentially be enqueued as the highest priority one; however, its
      adjusted deadline may not be the earliest one.
      
      This patch fixes the problem by initializing the task's parameters before
      enqueuing it.
      Signed-off-by: luca abeni <luca.abeni@santannapisa.it>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1504778971-13573-3-git-send-email-luca.abeni@santannapisa.it
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      295d6d5e
    • sched/headers: Remove duplicate prototype of __dl_clear_params() · e964d350
      luca abeni authored
      Signed-off-by: luca abeni <luca.abeni@santannapisa.it>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1504778971-13573-2-git-send-email-luca.abeni@santannapisa.it
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      e964d350
    • sched/debug: Rename task-state printing helpers · 1d48b080
      Peter Zijlstra authored
      Steve requested better names for the new task-state helper functions.
      
      So introduce the concept of task-state index for the printing and
      rename __get_task_state() to task_state_index() and
      __task_state_to_char() to task_index_to_char().
      Requested-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20170929115016.pzlqc7ss3ccystyg@hirez.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      1d48b080
    • sched/idle: Move quiet_vmstate() into the NOHZ code · 62cb1188
      Peter Zijlstra authored
      quiet_vmstat() is an expensive function that only makes sense when we
      go into NOHZ.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: aubrey.li@linux.intel.com
      Cc: cl@linux.com
      Cc: fweisbec@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      62cb1188
    • Ingo Molnar
    • sched/core: Ensure load_balance() respects the active_mask · 024c9d2f
      Peter Zijlstra authored
      While load_balance() masks the source CPUs against active_mask, it had
      a hole against the destination CPU. Ensure the destination CPU is also
      part of the 'domain-mask & active-mask' set.
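
      A sketch of the constraint being enforced, using the real cpumask
      helpers (the exact spot and variable names in load_balance() are
      paraphrased):

        /* Candidate CPUs are the domain's span restricted to active CPUs,
         * so an inactive destination CPU can never be picked */
        cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
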
      Reported-by: Levin, Alexander (Sasha Levin) <alexander.levin@verizon.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 77d1dfda ("sched/topology, cpuset: Avoid spurious/wrong domain rebuilds")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      024c9d2f
    • sched/core: Address more wake_affine() regressions · f2cdd9cc
      Peter Zijlstra authored
      The trivial wake_affine_idle() implementation is very good for a
      number of workloads, but it comes apart at the moment there are no
      idle CPUs left, IOW, the overloaded case.
      
      hackbench:
      
      		NO_WA_WEIGHT		WA_WEIGHT
      
      hackbench-20  : 7.362717561 seconds	6.450509391 seconds
      
      (win)
      
      netperf:
      
      		  NO_WA_WEIGHT		WA_WEIGHT
      
      TCP_SENDFILE-1	: Avg: 54524.6		Avg: 52224.3
      TCP_SENDFILE-10	: Avg: 48185.2          Avg: 46504.3
      TCP_SENDFILE-20	: Avg: 29031.2          Avg: 28610.3
      TCP_SENDFILE-40	: Avg: 9819.72          Avg: 9253.12
      TCP_SENDFILE-80	: Avg: 5355.3           Avg: 4687.4
      
      TCP_STREAM-1	: Avg: 41448.3          Avg: 42254
      TCP_STREAM-10	: Avg: 24123.2          Avg: 25847.9
      TCP_STREAM-20	: Avg: 15834.5          Avg: 18374.4
      TCP_STREAM-40	: Avg: 5583.91          Avg: 5599.57
      TCP_STREAM-80	: Avg: 2329.66          Avg: 2726.41
      
      TCP_RR-1	: Avg: 80473.5          Avg: 82638.8
      TCP_RR-10	: Avg: 72660.5          Avg: 73265.1
      TCP_RR-20	: Avg: 52607.1          Avg: 52634.5
      TCP_RR-40	: Avg: 57199.2          Avg: 56302.3
      TCP_RR-80	: Avg: 25330.3          Avg: 26867.9
      
      UDP_RR-1	: Avg: 108266           Avg: 107844
      UDP_RR-10	: Avg: 95480            Avg: 95245.2
      UDP_RR-20	: Avg: 68770.8          Avg: 68673.7
      UDP_RR-40	: Avg: 76231            Avg: 75419.1
      UDP_RR-80	: Avg: 34578.3          Avg: 35639.1
      
      UDP_STREAM-1	: Avg: 64684.3          Avg: 66606
      UDP_STREAM-10	: Avg: 52701.2          Avg: 52959.5
      UDP_STREAM-20	: Avg: 30376.4          Avg: 29704
      UDP_STREAM-40	: Avg: 15685.8          Avg: 15266.5
      UDP_STREAM-80	: Avg: 8415.13          Avg: 7388.97
      
      (wins and losses)
      
      sysbench:
      
      		    NO_WA_WEIGHT		WA_WEIGHT
      
      sysbench-mysql-2  :  2135.17 per sec.		 2142.51 per sec.
      sysbench-mysql-5  :  4809.68 per sec.            4800.19 per sec.
      sysbench-mysql-10 :  9158.59 per sec.            9157.05 per sec.
      sysbench-mysql-20 : 14570.70 per sec.           14543.55 per sec.
      sysbench-mysql-40 : 22130.56 per sec.           22184.82 per sec.
      sysbench-mysql-80 : 20995.56 per sec.           21904.18 per sec.
      
      sysbench-psql-2   :  1679.58 per sec.            1705.06 per sec.
      sysbench-psql-5   :  3797.69 per sec.            3879.93 per sec.
      sysbench-psql-10  :  7253.22 per sec.            7258.06 per sec.
      sysbench-psql-20  : 11166.75 per sec.           11220.00 per sec.
      sysbench-psql-40  : 17277.28 per sec.           17359.78 per sec.
      sysbench-psql-80  : 17112.44 per sec.           17221.16 per sec.
      
      (increase on the top end)
      
      tbench:
      
      NO_WA_WEIGHT
      
      Throughput 685.211 MB/sec   2 clients   2 procs  max_latency=0.123 ms
      Throughput 1596.64 MB/sec   5 clients   5 procs  max_latency=0.119 ms
      Throughput 2985.47 MB/sec  10 clients  10 procs  max_latency=0.262 ms
      Throughput 4521.15 MB/sec  20 clients  20 procs  max_latency=0.506 ms
      Throughput 9438.1  MB/sec  40 clients  40 procs  max_latency=2.052 ms
      Throughput 8210.5  MB/sec  80 clients  80 procs  max_latency=8.310 ms
      
      WA_WEIGHT
      
      Throughput 697.292 MB/sec   2 clients   2 procs  max_latency=0.127 ms
      Throughput 1596.48 MB/sec   5 clients   5 procs  max_latency=0.080 ms
      Throughput 2975.22 MB/sec  10 clients  10 procs  max_latency=0.254 ms
      Throughput 4575.14 MB/sec  20 clients  20 procs  max_latency=0.502 ms
      Throughput 9468.65 MB/sec  40 clients  40 procs  max_latency=2.069 ms
      Throughput 8631.73 MB/sec  80 clients  80 procs  max_latency=8.605 ms
      
      (increase on the top end)
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      f2cdd9cc
    • sched/core: Fix wake_affine() performance regression · d153b153
      Peter Zijlstra authored
      Eric reported a sysbench regression against commit:
      
        3fed382b ("sched/numa: Implement NUMA node level wake_affine()")
      
      Similarly, Rik was looking at the NAS-lu.C benchmark, which regressed
      against his v3.10 enterprise kernel.
      
      PRE (current tip/master):
      
       ivb-ep sysbench:
      
         2: [30 secs]     transactions:                        64110  (2136.94 per sec.)
         5: [30 secs]     transactions:                        143644 (4787.99 per sec.)
        10: [30 secs]     transactions:                        274298 (9142.93 per sec.)
        20: [30 secs]     transactions:                        418683 (13955.45 per sec.)
        40: [30 secs]     transactions:                        320731 (10690.15 per sec.)
        80: [30 secs]     transactions:                        355096 (11834.28 per sec.)
      
       hsw-ex NAS:
      
       OMP_PROC_BIND/lu.C.x_threads_144_run_1.log: Time in seconds =                    18.01
       OMP_PROC_BIND/lu.C.x_threads_144_run_2.log: Time in seconds =                    17.89
       OMP_PROC_BIND/lu.C.x_threads_144_run_3.log: Time in seconds =                    17.93
       lu.C.x_threads_144_run_1.log: Time in seconds =                   434.68
       lu.C.x_threads_144_run_2.log: Time in seconds =                   405.36
       lu.C.x_threads_144_run_3.log: Time in seconds =                   433.83
      
      POST (+patch):
      
       ivb-ep sysbench:
      
         2: [30 secs]     transactions:                        64494  (2149.75 per sec.)
         5: [30 secs]     transactions:                        145114 (4836.99 per sec.)
        10: [30 secs]     transactions:                        278311 (9276.69 per sec.)
        20: [30 secs]     transactions:                        437169 (14571.60 per sec.)
        40: [30 secs]     transactions:                        669837 (22326.73 per sec.)
        80: [30 secs]     transactions:                        631739 (21055.88 per sec.)
      
       hsw-ex NAS:
      
       lu.C.x_threads_144_run_1.log: Time in seconds =                    23.36
       lu.C.x_threads_144_run_2.log: Time in seconds =                    22.96
       lu.C.x_threads_144_run_3.log: Time in seconds =                    22.52
      
      This patch takes out all the shiny wake_affine() stuff and goes back to
      utter basics. Between the two CPUs involved with the wakeup (the CPU
      doing the wakeup and the CPU we ran on previously) pick the CPU we can
      run on _now_.
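
      A sketch of that "utter basics" policy (an illustrative helper; the
      wake_affine_idle() added here also considers sync wakeups):

        static int pick_wakeup_cpu(int this_cpu, int prev_cpu)
        {
                /* Prefer the CPU doing the wakeup if we can run there now */
                if (idle_cpu(this_cpu))
                        return this_cpu;
                return prev_cpu;
        }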
      
      This restores much of the regressions against the older kernels,
      but leaves some ground in the overloaded case. The default-enabled
      WA_WEIGHT (which will be introduced in the next patch) is an attempt
      to address the overloaded situation.
      Reported-by: Eric Farman <farman@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Rosato <mjrosato@linux.vnet.ibm.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: jinpuwang@gmail.com
      Cc: vcaputo@pengaru.com
      Fixes: 3fed382b ("sched/numa: Implement NUMA node level wake_affine()")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      d153b153
    • Merge branch 'ppc-bundle' (bundle from Michael Ellerman) · 529a86e0
      Linus Torvalds authored
      Merge powerpc transactional memory fixes from Michael Ellerman:
       "I figured I'd still send you the commits using a bundle to make sure
        it works in case I need to do it again in future"
      
      This fixes transactional memory state restore for powerpc.
      
      * bundle'd patches from Michael Ellerman:
        powerpc/tm: Fix illegal TM state in signal handler
        powerpc/64s: Use emergency stack for kernel TM Bad Thing program checks
      529a86e0
  2. 09 Oct, 2017 16 commits
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · ff33952e
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) Fix object leak on IPSEC offload failure, from Steffen Klassert.
      
       2) Fix range checks in ipset address range addition operations, from
          Jozsef Kadlecsik.
      
       3) Fix pernet ops unregistration order in ipset, from Florian Westphal.
      
       4) Add missing netlink attribute policy for nl80211 packet pattern
          attrs, from Peng Xu.
      
       5) Fix PPP device destruction race, from Guillaume Nault.
      
       6) Write marks get lost when BPF verifier processes R1=R2 register
          assignments, causing incorrect liveness information and less state
          pruning. Fix from Alexei Starovoitov.
      
       7) Fix blockhole routes so that they are marked dead and therefore not
          cached in sockets, otherwise IPSEC stops working. From Steffen
          Klassert.
      
       8) Fix broadcast handling of UDP socket early demux, from Paolo Abeni.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (37 commits)
        cdc_ether: flag the u-blox TOBY-L2 and SARA-U2 as wwan
        net: thunderx: mark expected switch fall-throughs in nicvf_main()
        udp: fix bcast packet reception
        netlink: do not set cb_running if dump's start() errs
        ipv4: Fix traffic triggered IPsec connections.
        ipv6: Fix traffic triggered IPsec connections.
        ixgbe: incorrect XDP ring accounting in ethtool tx_frame param
        net: ixgbe: Use new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag
        Revert commit 1a8b6d76 ("net:add one common config...")
        ixgbe: fix masking of bits read from IXGBE_VXLANCTRL register
        ixgbe: Return error when getting PHY address if PHY access is not supported
        netfilter: xt_bpf: Fix XT_BPF_MODE_FD_PINNED mode of 'xt_bpf_info_v1'
        netfilter: SYNPROXY: skip non-tcp packet in {ipv4, ipv6}_synproxy_hook
        tipc: Unclone message at secondary destination lookup
        tipc: correct initialization of skb list
        gso: fix payload length when gso_size is zero
        mlxsw: spectrum_router: Avoid expensive lookup during route removal
        bpf: fix liveness marking
        doc: Fix typo "8023.ad" in bonding documentation
        ipv6: fix net.ipv6.conf.all.accept_dad behaviour for real
        ...
      ff33952e
    • cdc_ether: flag the u-blox TOBY-L2 and SARA-U2 as wwan · fdfbad32
      Aleksander Morgado authored
      The u-blox TOBY-L2 is a LTE Cat 4 module with HSPA+ and 2G fallback.
      This module allows switching to different USB profiles with the
      'AT+UUSBCONF' command, and provides an ECM network interface when the
      'AT+UUSBCONF=2' profile is selected.
      
      The u-blox SARA-U2 is a HSPA module with 2G fallback. The default USB
      configuration includes an ECM network interface.
      
      Both these modules are controlled via AT commands through one of the
      TTYs exposed. Connecting these modules may be done just by activating
      the desired PDP context with 'AT+CGACT=1,<cid>' and then running DHCP
      on the ECM interface.
      Signed-off-by: Aleksander Morgado <aleksander@aleksander.es>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fdfbad32
    • Merge tag 'nfs-for-4.14-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs · 68ebe3cb
      Linus Torvalds authored
      Pull NFS client bugfixes from Trond Myklebust:
       "Hightlights include:
      
        stable fixes:
         - nfs/filelayout: fix oops when freeing filelayout segment
         - NFS: Fix uninitialized rpc_wait_queue
      
        bugfixes:
         - NFSv4/pnfs: Fix an infinite layoutget loop
         - nfs: RPC_MAX_AUTH_SIZE is in bytes"
      
      * tag 'nfs-for-4.14-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
        NFSv4/pnfs: Fix an infinite layoutget loop
        nfs/filelayout: fix oops when freeing filelayout segment
        sunrpc: remove redundant initialization of sock
        NFS: Fix uninitialized rpc_wait_queue
        NFS: Cleanup error handling in nfs_idmap_request_key()
        nfs: RPC_MAX_AUTH_SIZE is in bytes
      68ebe3cb
    • net: thunderx: mark expected switch fall-throughs in nicvf_main() · 1a2ace56
      Gustavo A. R. Silva authored
      In preparation to enabling -Wimplicit-fallthrough, mark switch cases
      where we are expecting to fall through.
      
      Cc: Sunil Goutham <sgoutham@cavium.com>
      Cc: Robert Richter <rric@kernel.org>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: netdev@vger.kernel.org
      Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1a2ace56
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf · fb60bccc
      David S. Miller authored
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter/IPVS fixes for net
      
      The following patchset contains Netfilter/IPVS fixes for your net tree,
      they are:
      
      1) Fix packet drops due to incorrect ECN handling in IPVS, from Vadim
         Fedorenko.
      
      2) Fix splat with mark restoration in xt_socket with non-full-sock,
         patch from Subash Abhinov Kasiviswanathan.
      
       3) ipset bogusly bails out when adding an IPv4 range containing more than
         2^31 addresses, from Jozsef Kadlecsik.
      
      4) Incorrect pernet unregistration order in ipset, from Florian Westphal.
      
       5) Races between dump and swap in ipset result in BUG_ON splats, from
         Ross Lagerwall.
      
      6) Fix chain renames in nf_tables, from JingPiao Chen.
      
      7) Fix race in pernet codepath with ebtables table registration, from
         Artem Savkov.
      
      8) Memory leak in error path in set name allocation in nf_tables, patch
         from Arvind Yadav.
      
      9) Don't dump chain counters if they are not available, this fixes a
         crash when listing the ruleset.
      
      10) Fix out of bound memory read in strlcpy() in x_tables compat code,
          from Eric Dumazet.
      
      11) Make sure we only process TCP packets in SYNPROXY hooks, patch from
          Lin Zhang.
      
      12) Cannot load rules incrementally anymore after xt_bpf with pinned
          objects, added in revision 1. From Shmulik Ladkani.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fb60bccc
    • Merge branch '10GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-queue · 5766cd68
      David S. Miller authored
      Jeff Kirsher says:
      
      ====================
      Intel Wired LAN Driver Updates 2017-10-09
      
      This series contains updates to ixgbe and arch/Kconfig.
      
      Mark fixes a case where PHY register access is not supported and we were
      returning a PHY address, when we should have been returning -EOPNOTSUPP.
      
      Sabrina Dubroca fixes the use of a logical "and" when it should have been
      the bitwise "and" operator.
      
      Ding Tianhong reverts the commit that added the Kconfig bool option
      ARCH_WANT_RELAX_ORDER, since there is now a new flag
      PCI_DEV_FLAGS_NO_RELAXED_ORDERING that has been added to indicate that
      Relaxed Ordering Attributes should not be used for Transaction Layer
      Packets.  Then follows up with making the needed changes to ixgbe to
      use the new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag.
      
      John Fastabend fixes an issue in the ring accounting when the transmit
      ring parameters are changed via ethtool when an XDP program is attached.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5766cd68
    • udp: fix bcast packet reception · 996b44fc
      Paolo Abeni authored
      The commit bc044e8d ("udp: perform source validation for
      mcast early demux") does not take into account that broadcast packets
      land in the same code path and need different checks for the
      source address - notably, a zero source address is valid for bcast
      and invalid for mcast.

      As a result, the 2nd and later broadcast packets with a 0 source address
      landing on the same socket are dropped. This breaks dhcp servers.

      Since we don't have stringent performance requirements for ingress
      broadcast traffic, fix it by disabling UDP early demux for such traffic.
      Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Fixes: bc044e8d ("udp: perform source validation for mcast early demux")
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      996b44fc
    • netlink: do not set cb_running if dump's start() errs · 41c87425
      Jason A. Donenfeld authored
      It turns out that multiple places can call netlink_dump(), which means
      it's still possible to dereference partially initialized values in
      dump() that were the result of a faulty returned start().
      
      This fixes the issue by calling start() _before_ setting cb_running to
      true, so that there's no chance at all of hitting the dump() function
      through any indirect paths.
      
      It also moves the call to start() to be when the mutex is held. This has
      the nice side effect of serializing invocations to start(), which is
      likely desirable anyway. It also prevents any possible other races that
      might come out of this logic.
      
      In testing this with several different pieces of tricky code to trigger
      these issues, this commit fixes all avenues that I'm aware of.
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Cc: Johannes Berg <johannes@sipsolutions.net>
      Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      41c87425
    • Merge tag 'mac80211-for-davem-2017-10-09' of... · 6df4d17c
      David S. Miller authored
      Merge tag 'mac80211-for-davem-2017-10-09' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
      
      Johannes Berg says:
      
      ====================
      pull-request: mac80211 2017-10-09
      
      The QCA folks found another netlink problem - we were missing validation
      of some attributes. It's not super problematic since one can only read a
      few bytes beyond the message (and that memory must exist), but here's the
      fix for it.
      
      I thought perhaps we can make nla_parse_nested() require a policy, but
      given the two-stage validation/parsing in regular netlink that won't work.
      
      Please pull and let me know if there's any problem.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6df4d17c
    • Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec · 93b03193
      David S. Miller authored
      Steffen Klassert says:
      
      ====================
      pull request (net): ipsec 2017-10-09
      
      1) Fix some error paths of the IPsec offloading API.
      
      2) Fix a NULL pointer dereference when IPsec is used
         with vti. From Alexey Kodanev.
      
      3) Don't call xfrm_policy_cache_flush under xfrm_state_lock,
         it triggers several locking warnings. From Artem Savkov.
      
      Please pull or let me know if there are problems.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      93b03193
    • ipv4: Fix traffic triggered IPsec connections. · 6c0e7284
      Steffen Klassert authored
      A recent patch removed the dst_free() on the allocated
      dst_entry in ipv4_blackhole_route(). The dst_free() marked the
      dst_entry as dead and added it to the gc list. I.e. it was set up
      for one-time usage. As a result we may now have a blackhole
      route cached at a socket on some IPsec scenarios. This makes the
      connection unusable.
      
      Fix this by marking the dst_entry directly at allocation time
      as 'dead', so it is used only once.
      
      Fixes: b838d5e1 ("ipv4: mark DST_NOGC and remove the operation of dst_free()")
      Reported-by: Tobias Brunner <tobias@strongswan.org>
      Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6c0e7284
    • ipv6: Fix traffic triggered IPsec connections. · 62cf27e5
      Steffen Klassert authored
      A recent patch removed the dst_free() on the allocated
      dst_entry in ipv6_blackhole_route(). The dst_free() marked
      the dst_entry as dead and added it to the gc list. I.e. it
      was set up for one-time usage. As a result we may now have
      a blackhole route cached at a socket on some IPsec scenarios.
      This makes the connection unusable.
      
      Fix this by marking the dst_entry directly at allocation time
      as 'dead', so it is used only once.
      
      Fixes: 587fea74 ("ipv6: mark DST_NOGC and remove the operation of dst_free()")
      Reported-by: Tobias Brunner <tobias@strongswan.org>
      Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      62cf27e5
    • ixgbe: incorrect XDP ring accounting in ethtool tx_frame param · 8e679021
      John Fastabend authored
      Changing the TX ring parameters with an XDP program attached may
      cause the XDP queues to be cleared and the TX rings to be incorrectly
      configured.
      
      Fix by doing correct ring accounting in setup call.
      
      Fixes: 33fdc82f ("ixgbe: add support for XDP_TX action")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      8e679021
    • net: ixgbe: Use new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag · 5e0fac63
      Ding Tianhong authored
      The ixgbe driver used a compile-time check to determine whether it can
      send TLPs to the Root Port with the Relaxed Ordering Attribute set, which
      is too inconvenient. Now that the new flag PCI_DEV_FLAGS_NO_RELAXED_ORDERING
      has been added to the kernel, we can check bit 4 in the PCIe Device Control
      register to determine whether we should use the Relaxed Ordering Attributes
      or not, so use this new way in the ixgbe driver.
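
      A sketch of the check the driver can now make (pci_dev->dev_flags and
      PCI_DEV_FLAGS_NO_RELAXED_ORDERING are the real kernel definitions; the
      helper name here is illustrative):

        static bool relaxed_ordering_usable(struct pci_dev *pdev)
        {
                /* Set by the PCI core when Relaxed Ordering must not be used
                 * on the path to this device's Root Port */
                return !(pdev->dev_flags & PCI_DEV_FLAGS_NO_RELAXED_ORDERING);
        }
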
      Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
      Acked-by: Emil Tantilov <emil.s.tantilov@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      5e0fac63
    • Revert commit 1a8b6d76 ("net:add one common config...") · f4986d25
      Ding Tianhong authored
      The new flag PCI_DEV_FLAGS_NO_RELAXED_ORDERING has been added
      to indicate that Relaxed Ordering Attributes (RO) should not
      be used for Transaction Layer Packets (TLP) targeted toward
      the affected Root Ports. It clears bit 4 in the PCIe
      Device Control register, so PCIe device drivers can
      query the PCIe configuration space to determine whether they can send
      TLPs to the Root Port with the Relaxed Ordering Attributes set.

      With this new flag we don't need the config ARCH_WANT_RELAX_ORDER
      to control the Relaxed Ordering Attributes for the ixgbe drivers,
      as commit 1a8b6d76 ("net:add one common config...") did,
      so revert that commit.
      Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      f4986d25
    • ixgbe: fix masking of bits read from IXGBE_VXLANCTRL register · a39221ce
      Sabrina Dubroca authored
      In ixgbe_clear_udp_tunnel_port(), we read the IXGBE_VXLANCTRL register
      and then try to mask some bits out of the value, using the logical AND
      instead of the bitwise AND operator.
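
      The bug class in a self-contained form (values and names illustrative):

        #include <stdint.h>

        uint32_t clear_bits_buggy(uint32_t reg, uint32_t mask)
        {
                return reg && ~mask;    /* logical AND: collapses to 0 or 1 */
        }

        uint32_t clear_bits_fixed(uint32_t reg, uint32_t mask)
        {
                return reg & ~mask;     /* bitwise AND: clears only the masked bits */
        }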
      
      Fixes: a21d0822 ("ixgbe: add support for geneve Rx offload")
      Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      a39221ce