Commit 00bcf5cd authored by Linus Torvalds

Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking updates from Ingo Molnar:
 "The main changes in this cycle were:

   - rwsem micro-optimizations (Davidlohr Bueso)

   - Improve the implementation and optimize the performance of
     percpu-rwsems. (Peter Zijlstra.)

   - Convert all lglock users to better facilities such as percpu-rwsems
     or percpu-spinlocks and remove lglocks. (Peter Zijlstra)

   - Remove the ticket (spin)lock implementation. (Peter Zijlstra)

   - Korean translation of memory-barriers.txt and related fixes to the
     English document. (SeongJae Park)

   - misc fixes and cleanups"

* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
  x86/cmpxchg, locking/atomics: Remove superfluous definitions
  x86, locking/spinlocks: Remove ticket (spin)lock implementation
  locking/lglock: Remove lglock implementation
  stop_machine: Remove stop_cpus_lock and lg_double_lock/unlock()
  fs/locks: Use percpu_down_read_preempt_disable()
  locking/percpu-rwsem: Add down_read_preempt_disable()
  fs/locks: Replace lg_local with a per-cpu spinlock
  fs/locks: Replace lg_global with a percpu-rwsem
  locking/percpu-rwsem: Add DEFINE_STATIC_PERCPU_RWSEM and percpu_rwsem_assert_held()
  locking/pv-qspinlock: Use cmpxchg_release() in __pv_queued_spin_unlock()
  locking/rwsem, x86: Drop a bogus cc clobber
  futex: Add some more function commentry
  locking/hung_task: Show all locks
  locking/rwsem: Scan the wait_list for readers only once
  locking/rwsem: Remove a few useless comments
  locking/rwsem: Return void in __rwsem_mark_wake()
  locking, rcu, cgroup: Avoid synchronize_sched() in __cgroup_procs_write()
  locking/Documentation: Add Korean translation
  locking/Documentation: Fix a typo of example result
  locking/Documentation: Fix wrong section reference
  ...
parents de956b8f 08645077
lglock - local/global locks for mostly local access patterns
------------------------------------------------------------
Origin: Nick Piggin's VFS scalability series introduced during
2.6.35++ [1] [2]
Location: kernel/locking/lglock.c
include/linux/lglock.h
Users: currently only the VFS and stop_machine related code
Design Goal:
------------
Improve scalability of globally used large data sets that are
distributed over all CPUs as per_cpu elements.
lglocks are intended for managing global data structures that are
partitioned over all CPUs as per_cpu elements but can mostly be handled
by CPU-local actions: the majority of accesses are CPU-local reads,
with occasional CPU-local writes and very infrequent global write
access.
* deal with things locally whenever possible
   - very fast access to the local per_cpu data
   - reasonably fast access to specific per_cpu data on a different
     CPU
* while making global action possible when needed
   - by expensive access to all CPUs' locks - effectively
     resulting in a globally visible critical section.
Design:
-------
Basically it is an array of per_cpu spinlocks, with lg_local_lock/unlock
accessing the local CPU's lock object and lg_local_lock_cpu/unlock_cpu
accessing a remote CPU's lock object. lg_local_lock has to disable
preemption as migration protection, so that the reference to the local
CPU's lock does not go out of scope.

Because lg_local_lock/unlock only touch CPU-local resources, they are
fast. Taking the local lock of a different CPU is more expensive but
still relatively cheap.
One can relax the migration constraints by acquiring the current CPU's
lock with lg_local_lock_cpu, remembering the cpu, and releasing that
lock at the end of the critical section even if migrated. This should
give most of the performance benefit without inhibiting migration,
though it needs careful consideration of lglock nesting and of
deadlocks with lg_global_lock.
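A minimal sketch of that pattern (the lock example_lglock and the
protected per_cpu data are hypothetical; the lock is assumed to be
declared and initialized as described in the "Declaration and
initialization" section below):

  static void example_relaxed_update(void)
  {
          /*
           * Any CPU's lock would be correct here; the cpu id is only a
           * locality hint, so no migration protection is needed as long
           * as the same cpu's lock is released again.
           */
          int cpu = raw_smp_processor_id();

          lg_local_lock_cpu(&example_lglock, cpu);
          /* ... modify the per_cpu data that belongs to 'cpu' ... */
          lg_local_unlock_cpu(&example_lglock, cpu);
  }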
The lg_global_lock/unlock locks all underlying spinlocks of all
possible CPUs (including those off-line). The preemption disable/enable
are needed in the non-RT kernels to prevent deadlocks like:
                     on cpu 1

     task A                          task B
  lg_global_lock
     got cpu 0 lock
              <<<< preempt <<<<
                                     lg_local_lock_cpu for cpu 0
                                        spin on cpu 0 lock
On -RT this deadlock scenario is avoided because the arch_spin_locks in
the lglocks are replaced by rt_mutexes, which resolve the above deadlock
by boosting the lock-holder.
Implementation:
---------------
The initial lglock implementation from Nick Piggin used some complex
macros to generate the lglock/brlock in lglock.h - they were later
turned into a set of functions by Andi Kleen [7]. The change to
functions was motivated by the presence of multiple lock users and by
functions being easier to maintain than the generating macros. This
change to functions is also the basis for eliminating the restriction
that lglocks cannot be initialized in kernel modules (the remaining
problem is that the locks are not explicitly initialized - see
lockdep-design.txt).
Declaration and initialization:
-------------------------------
#include <linux/lglock.h>
DEFINE_LGLOCK(name)
or:
DEFINE_STATIC_LGLOCK(name);
lg_lock_init(&name, "lockdep_name_string");
On UP this is mapped to DEFINE_SPINLOCK(name) in both cases. Note also
that as of 3.18-rc6 all declarations in use are of the _STATIC_ variant
(and it seems that the non-static variant was never in use).
lg_lock_init() only initializes the lockdep map.
Usage:
------
From the locking semantics it is a spinlock. It could be called a
locality aware spinlock. lg_local_* behaves like a per_cpu
spinlock and lg_global_* like a global spinlock.
No surprises in the API.
  lg_local_lock(*lglock);
     access to protected per_cpu object on this CPU
  lg_local_unlock(*lglock);

  lg_local_lock_cpu(*lglock, cpu);
     access to protected per_cpu object on other CPU cpu
  lg_local_unlock_cpu(*lglock, cpu);

  lg_global_lock(*lglock);
     access all protected per_cpu objects on all CPUs
  lg_global_unlock(*lglock);
There are no _trylock variants of the lglocks.
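As a self-contained sketch of this usage pattern (the lock, the per_cpu
list and the callback are hypothetical; lg_lock_init(&example_lglock,
"example_lglock") and INIT_LIST_HEAD() for each per-CPU list head are
assumed to be done once at init time):

  #include <linux/lglock.h>
  #include <linux/list.h>
  #include <linux/percpu.h>

  DEFINE_STATIC_LGLOCK(example_lglock);
  static DEFINE_PER_CPU(struct list_head, example_list);

  /* Fast path: add an entry to this CPU's list under the local lock. */
  static void example_add(struct list_head *entry)
  {
          lg_local_lock(&example_lglock);         /* disables preemption */
          list_add(entry, this_cpu_ptr(&example_list));
          lg_local_unlock(&example_lglock);
  }

  /* Slow path: walk every CPU's list under the global lock. */
  static void example_walk_all(void (*fn)(struct list_head *))
  {
          struct list_head *pos;
          int cpu;

          lg_global_lock(&example_lglock);
          for_each_possible_cpu(cpu)
                  list_for_each(pos, per_cpu_ptr(&example_list, cpu))
                          fn(pos);
          lg_global_unlock(&example_lglock);
  }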
Note that lg_global_lock/unlock has to iterate over all possible CPUs
rather than only the actually present CPUs, or a CPU could go off-line
with a held lock [4], and that makes it very expensive. A discussion of
these issues can be found at [5].
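For illustration, a simplified approximation of the global lock (lockdep
annotations and other details of kernel/locking/lglock.c are omitted,
so treat this as a sketch rather than the exact upstream code):

  void lg_global_lock(struct lglock *lg)
  {
          int i;

          preempt_disable();
          /* All possible CPUs, not just the online ones - see above. */
          for_each_possible_cpu(i)
                  arch_spin_lock(per_cpu_ptr(lg->lock, i));
  }

  void lg_global_unlock(struct lglock *lg)
  {
          int i;

          for_each_possible_cpu(i)
                  arch_spin_unlock(per_cpu_ptr(lg->lock, i));
          preempt_enable();
  }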
Constraints:
------------
* currently the declaration of lglocks in kernel modules is not
  possible, though this should be doable with little change.
* lglocks are not recursive.
* suitable for code that can do most operations on the CPU local data
  and will very rarely need the global lock
* lg_global_lock/unlock is *very* expensive and does not scale
* on UP systems all lg_* primitives are simply spinlocks
* in PREEMPT_RT the spinlock becomes an rt-mutex and can sleep but
  does not change the task's state while sleeping [6].
* in PREEMPT_RT the preempt_disable/enable in lg_local_lock/unlock
  is downgraded to a migrate_disable/enable, the other
  preempt_disable/enable are downgraded to barriers [6].
  The deadlock noted for non-RT above is resolved because rt_mutexes
  boost the lock-holder in this case, which arch_spin_locks do not do.
lglocks were designed for very specific problems in the VFS and are
probably only the right answer in these corner cases. Any new user that
looks at lglocks probably wants to look at the seqlock and RCU
alternatives as the first choice. There are also efforts to resolve the
RCU issues that currently prevent using RCU in place of the few
remaining lglocks.
Note on brlock history:
-----------------------
The 'Big Reader' read-write spinlocks were originally introduced by
Ingo Molnar in 2000 (2.4/2.5 kernel series) and removed in 2003. They
were later reintroduced by the VFS scalability patch set in the 2.6
series as the "big reader lock" brlock [2] variant of lglock, which was
then replaced by seqlock primitives or by RCU based primitives in the
3.13 kernel series, as had been suggested in [3] in 2003. The brlock
was entirely removed in the 3.13 kernel series.
Link: 1 http://lkml.org/lkml/2010/8/2/81
Link: 2 http://lwn.net/Articles/401738/
Link: 3 http://lkml.org/lkml/2003/3/9/205
Link: 4 https://lkml.org/lkml/2011/8/24/185
Link: 5 http://lkml.org/lkml/2011/12/18/189
Link: 6 https://www.kernel.org/pub/linux/kernel/projects/rt/
patch series - lglocks-rt.patch.patch
Link: 7 http://lkml.org/lkml/2012/3/5/26
...@@ -609,7 +609,7 @@ A data-dependency barrier must also order against dependent writes: ...@@ -609,7 +609,7 @@ A data-dependency barrier must also order against dependent writes:
The data-dependency barrier must order the read into Q with the store The data-dependency barrier must order the read into Q with the store
into *Q. This prohibits this outcome: into *Q. This prohibits this outcome:
(Q == B) && (B == 4) (Q == &B) && (B == 4)
Please note that this pattern should be rare. After all, the whole point Please note that this pattern should be rare. After all, the whole point
of dependency ordering is to -prevent- writes to the data structure, along of dependency ordering is to -prevent- writes to the data structure, along
...@@ -1928,6 +1928,7 @@ There are some more advanced barrier functions: ...@@ -1928,6 +1928,7 @@ There are some more advanced barrier functions:
See Documentation/DMA-API.txt for more information on consistent memory. See Documentation/DMA-API.txt for more information on consistent memory.
MMIO WRITE BARRIER MMIO WRITE BARRIER
------------------ ------------------
...@@ -2075,7 +2076,7 @@ systems, and so cannot be counted on in such a situation to actually achieve ...@@ -2075,7 +2076,7 @@ systems, and so cannot be counted on in such a situation to actually achieve
anything at all - especially with respect to I/O accesses - unless combined anything at all - especially with respect to I/O accesses - unless combined
with interrupt disabling operations. with interrupt disabling operations.
See also the section on "Inter-CPU locking barrier effects". See also the section on "Inter-CPU acquiring barrier effects".
As an example, consider the following: As an example, consider the following:
......
...@@ -705,7 +705,6 @@ config PARAVIRT_DEBUG ...@@ -705,7 +705,6 @@ config PARAVIRT_DEBUG
config PARAVIRT_SPINLOCKS config PARAVIRT_SPINLOCKS
bool "Paravirtualization layer for spinlocks" bool "Paravirtualization layer for spinlocks"
depends on PARAVIRT && SMP depends on PARAVIRT && SMP
select UNINLINE_SPIN_UNLOCK if !QUEUED_SPINLOCKS
---help--- ---help---
Paravirtualized spinlocks allow a pvops backend to replace the Paravirtualized spinlocks allow a pvops backend to replace the
spinlock implementation with something virtualization-friendly spinlock implementation with something virtualization-friendly
...@@ -718,7 +717,7 @@ config PARAVIRT_SPINLOCKS ...@@ -718,7 +717,7 @@ config PARAVIRT_SPINLOCKS
config QUEUED_LOCK_STAT config QUEUED_LOCK_STAT
bool "Paravirt queued spinlock statistics" bool "Paravirt queued spinlock statistics"
depends on PARAVIRT_SPINLOCKS && DEBUG_FS && QUEUED_SPINLOCKS depends on PARAVIRT_SPINLOCKS && DEBUG_FS
---help--- ---help---
Enable the collection of statistical data on the slowpath Enable the collection of statistical data on the slowpath
behavior of paravirtualized queued spinlocks and report behavior of paravirtualized queued spinlocks and report
......
...@@ -158,53 +158,9 @@ extern void __add_wrong_size(void) ...@@ -158,53 +158,9 @@ extern void __add_wrong_size(void)
* value of "*ptr". * value of "*ptr".
* *
* xadd() is locked when multiple CPUs are online * xadd() is locked when multiple CPUs are online
* xadd_sync() is always locked
* xadd_local() is never locked
*/ */
#define __xadd(ptr, inc, lock) __xchg_op((ptr), (inc), xadd, lock) #define __xadd(ptr, inc, lock) __xchg_op((ptr), (inc), xadd, lock)
#define xadd(ptr, inc) __xadd((ptr), (inc), LOCK_PREFIX) #define xadd(ptr, inc) __xadd((ptr), (inc), LOCK_PREFIX)
#define xadd_sync(ptr, inc) __xadd((ptr), (inc), "lock; ")
#define xadd_local(ptr, inc) __xadd((ptr), (inc), "")
#define __add(ptr, inc, lock) \
({ \
__typeof__ (*(ptr)) __ret = (inc); \
switch (sizeof(*(ptr))) { \
case __X86_CASE_B: \
asm volatile (lock "addb %b1, %0\n" \
: "+m" (*(ptr)) : "qi" (inc) \
: "memory", "cc"); \
break; \
case __X86_CASE_W: \
asm volatile (lock "addw %w1, %0\n" \
: "+m" (*(ptr)) : "ri" (inc) \
: "memory", "cc"); \
break; \
case __X86_CASE_L: \
asm volatile (lock "addl %1, %0\n" \
: "+m" (*(ptr)) : "ri" (inc) \
: "memory", "cc"); \
break; \
case __X86_CASE_Q: \
asm volatile (lock "addq %1, %0\n" \
: "+m" (*(ptr)) : "ri" (inc) \
: "memory", "cc"); \
break; \
default: \
__add_wrong_size(); \
} \
__ret; \
})
/*
* add_*() adds "inc" to "*ptr"
*
* __add() takes a lock prefix
* add_smp() is locked when multiple CPUs are online
* add_sync() is always locked
*/
#define add_smp(ptr, inc) __add((ptr), (inc), LOCK_PREFIX)
#define add_sync(ptr, inc) __add((ptr), (inc), "lock; ")
#define __cmpxchg_double(pfx, p1, p2, o1, o2, n1, n2) \ #define __cmpxchg_double(pfx, p1, p2, o1, o2, n1, n2) \
({ \ ({ \
......
...@@ -661,8 +661,6 @@ static inline void __set_fixmap(unsigned /* enum fixed_addresses */ idx, ...@@ -661,8 +661,6 @@ static inline void __set_fixmap(unsigned /* enum fixed_addresses */ idx,
#if defined(CONFIG_SMP) && defined(CONFIG_PARAVIRT_SPINLOCKS) #if defined(CONFIG_SMP) && defined(CONFIG_PARAVIRT_SPINLOCKS)
#ifdef CONFIG_QUEUED_SPINLOCKS
static __always_inline void pv_queued_spin_lock_slowpath(struct qspinlock *lock, static __always_inline void pv_queued_spin_lock_slowpath(struct qspinlock *lock,
u32 val) u32 val)
{ {
...@@ -684,22 +682,6 @@ static __always_inline void pv_kick(int cpu) ...@@ -684,22 +682,6 @@ static __always_inline void pv_kick(int cpu)
PVOP_VCALL1(pv_lock_ops.kick, cpu); PVOP_VCALL1(pv_lock_ops.kick, cpu);
} }
#else /* !CONFIG_QUEUED_SPINLOCKS */
static __always_inline void __ticket_lock_spinning(struct arch_spinlock *lock,
__ticket_t ticket)
{
PVOP_VCALLEE2(pv_lock_ops.lock_spinning, lock, ticket);
}
static __always_inline void __ticket_unlock_kick(struct arch_spinlock *lock,
__ticket_t ticket)
{
PVOP_VCALL2(pv_lock_ops.unlock_kick, lock, ticket);
}
#endif /* CONFIG_QUEUED_SPINLOCKS */
#endif /* SMP && PARAVIRT_SPINLOCKS */ #endif /* SMP && PARAVIRT_SPINLOCKS */
#ifdef CONFIG_X86_32 #ifdef CONFIG_X86_32
......
...@@ -301,23 +301,16 @@ struct pv_mmu_ops { ...@@ -301,23 +301,16 @@ struct pv_mmu_ops {
struct arch_spinlock; struct arch_spinlock;
#ifdef CONFIG_SMP #ifdef CONFIG_SMP
#include <asm/spinlock_types.h> #include <asm/spinlock_types.h>
#else
typedef u16 __ticket_t;
#endif #endif
struct qspinlock; struct qspinlock;
struct pv_lock_ops { struct pv_lock_ops {
#ifdef CONFIG_QUEUED_SPINLOCKS
void (*queued_spin_lock_slowpath)(struct qspinlock *lock, u32 val); void (*queued_spin_lock_slowpath)(struct qspinlock *lock, u32 val);
struct paravirt_callee_save queued_spin_unlock; struct paravirt_callee_save queued_spin_unlock;
void (*wait)(u8 *ptr, u8 val); void (*wait)(u8 *ptr, u8 val);
void (*kick)(int cpu); void (*kick)(int cpu);
#else /* !CONFIG_QUEUED_SPINLOCKS */
struct paravirt_callee_save lock_spinning;
void (*unlock_kick)(struct arch_spinlock *lock, __ticket_t ticket);
#endif /* !CONFIG_QUEUED_SPINLOCKS */
}; };
/* This contains all the paravirt structures: we get a convenient /* This contains all the paravirt structures: we get a convenient
......
...@@ -154,7 +154,7 @@ static inline bool __down_write_trylock(struct rw_semaphore *sem) ...@@ -154,7 +154,7 @@ static inline bool __down_write_trylock(struct rw_semaphore *sem)
: "+m" (sem->count), "=&a" (tmp0), "=&r" (tmp1), : "+m" (sem->count), "=&a" (tmp0), "=&r" (tmp1),
CC_OUT(e) (result) CC_OUT(e) (result)
: "er" (RWSEM_ACTIVE_WRITE_BIAS) : "er" (RWSEM_ACTIVE_WRITE_BIAS)
: "memory", "cc"); : "memory");
return result; return result;
} }
......
...@@ -20,187 +20,13 @@ ...@@ -20,187 +20,13 @@
* (the type definitions are in asm/spinlock_types.h) * (the type definitions are in asm/spinlock_types.h)
*/ */
#ifdef CONFIG_X86_32
# define LOCK_PTR_REG "a"
#else
# define LOCK_PTR_REG "D"
#endif
#if defined(CONFIG_X86_32) && (defined(CONFIG_X86_PPRO_FENCE))
/*
* On PPro SMP, we use a locked operation to unlock
* (PPro errata 66, 92)
*/
# define UNLOCK_LOCK_PREFIX LOCK_PREFIX
#else
# define UNLOCK_LOCK_PREFIX
#endif
/* How long a lock should spin before we consider blocking */ /* How long a lock should spin before we consider blocking */
#define SPIN_THRESHOLD (1 << 15) #define SPIN_THRESHOLD (1 << 15)
extern struct static_key paravirt_ticketlocks_enabled; extern struct static_key paravirt_ticketlocks_enabled;
static __always_inline bool static_key_false(struct static_key *key); static __always_inline bool static_key_false(struct static_key *key);
#ifdef CONFIG_QUEUED_SPINLOCKS
#include <asm/qspinlock.h> #include <asm/qspinlock.h>
#else
#ifdef CONFIG_PARAVIRT_SPINLOCKS
static inline void __ticket_enter_slowpath(arch_spinlock_t *lock)
{
set_bit(0, (volatile unsigned long *)&lock->tickets.head);
}
#else /* !CONFIG_PARAVIRT_SPINLOCKS */
static __always_inline void __ticket_lock_spinning(arch_spinlock_t *lock,
__ticket_t ticket)
{
}
static inline void __ticket_unlock_kick(arch_spinlock_t *lock,
__ticket_t ticket)
{
}
#endif /* CONFIG_PARAVIRT_SPINLOCKS */
static inline int __tickets_equal(__ticket_t one, __ticket_t two)
{
return !((one ^ two) & ~TICKET_SLOWPATH_FLAG);
}
static inline void __ticket_check_and_clear_slowpath(arch_spinlock_t *lock,
__ticket_t head)
{
if (head & TICKET_SLOWPATH_FLAG) {
arch_spinlock_t old, new;
old.tickets.head = head;
new.tickets.head = head & ~TICKET_SLOWPATH_FLAG;
old.tickets.tail = new.tickets.head + TICKET_LOCK_INC;
new.tickets.tail = old.tickets.tail;
/* try to clear slowpath flag when there are no contenders */
cmpxchg(&lock->head_tail, old.head_tail, new.head_tail);
}
}
static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
{
return __tickets_equal(lock.tickets.head, lock.tickets.tail);
}
/*
* Ticket locks are conceptually two parts, one indicating the current head of
* the queue, and the other indicating the current tail. The lock is acquired
* by atomically noting the tail and incrementing it by one (thus adding
* ourself to the queue and noting our position), then waiting until the head
* becomes equal to the the initial value of the tail.
*
* We use an xadd covering *both* parts of the lock, to increment the tail and
* also load the position of the head, which takes care of memory ordering
* issues and should be optimal for the uncontended case. Note the tail must be
* in the high part, because a wide xadd increment of the low part would carry
* up and contaminate the high part.
*/
static __always_inline void arch_spin_lock(arch_spinlock_t *lock)
{
register struct __raw_tickets inc = { .tail = TICKET_LOCK_INC };
inc = xadd(&lock->tickets, inc);
if (likely(inc.head == inc.tail))
goto out;
for (;;) {
unsigned count = SPIN_THRESHOLD;
do {
inc.head = READ_ONCE(lock->tickets.head);
if (__tickets_equal(inc.head, inc.tail))
goto clear_slowpath;
cpu_relax();
} while (--count);
__ticket_lock_spinning(lock, inc.tail);
}
clear_slowpath:
__ticket_check_and_clear_slowpath(lock, inc.head);
out:
barrier(); /* make sure nothing creeps before the lock is taken */
}
static __always_inline int arch_spin_trylock(arch_spinlock_t *lock)
{
arch_spinlock_t old, new;
old.tickets = READ_ONCE(lock->tickets);
if (!__tickets_equal(old.tickets.head, old.tickets.tail))
return 0;
new.head_tail = old.head_tail + (TICKET_LOCK_INC << TICKET_SHIFT);
new.head_tail &= ~TICKET_SLOWPATH_FLAG;
/* cmpxchg is a full barrier, so nothing can move before it */
return cmpxchg(&lock->head_tail, old.head_tail, new.head_tail) == old.head_tail;
}
static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
{
if (TICKET_SLOWPATH_FLAG &&
static_key_false(&paravirt_ticketlocks_enabled)) {
__ticket_t head;
BUILD_BUG_ON(((__ticket_t)NR_CPUS) != NR_CPUS);
head = xadd(&lock->tickets.head, TICKET_LOCK_INC);
if (unlikely(head & TICKET_SLOWPATH_FLAG)) {
head &= ~TICKET_SLOWPATH_FLAG;
__ticket_unlock_kick(lock, (head + TICKET_LOCK_INC));
}
} else
__add(&lock->tickets.head, TICKET_LOCK_INC, UNLOCK_LOCK_PREFIX);
}
static inline int arch_spin_is_locked(arch_spinlock_t *lock)
{
struct __raw_tickets tmp = READ_ONCE(lock->tickets);
return !__tickets_equal(tmp.tail, tmp.head);
}
static inline int arch_spin_is_contended(arch_spinlock_t *lock)
{
struct __raw_tickets tmp = READ_ONCE(lock->tickets);
tmp.head &= ~TICKET_SLOWPATH_FLAG;
return (__ticket_t)(tmp.tail - tmp.head) > TICKET_LOCK_INC;
}
#define arch_spin_is_contended arch_spin_is_contended
static __always_inline void arch_spin_lock_flags(arch_spinlock_t *lock,
unsigned long flags)
{
arch_spin_lock(lock);
}
static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
{
__ticket_t head = READ_ONCE(lock->tickets.head);
for (;;) {
struct __raw_tickets tmp = READ_ONCE(lock->tickets);
/*
* We need to check "unlocked" in a loop, tmp.head == head
* can be false positive because of overflow.
*/
if (__tickets_equal(tmp.head, tmp.tail) ||
!__tickets_equal(tmp.head, head))
break;
cpu_relax();
}
}
#endif /* CONFIG_QUEUED_SPINLOCKS */
/* /*
* Read-write spinlocks, allowing multiple readers * Read-write spinlocks, allowing multiple readers
......
...@@ -23,20 +23,7 @@ typedef u32 __ticketpair_t; ...@@ -23,20 +23,7 @@ typedef u32 __ticketpair_t;
#define TICKET_SHIFT (sizeof(__ticket_t) * 8) #define TICKET_SHIFT (sizeof(__ticket_t) * 8)
#ifdef CONFIG_QUEUED_SPINLOCKS
#include <asm-generic/qspinlock_types.h> #include <asm-generic/qspinlock_types.h>
#else
typedef struct arch_spinlock {
union {
__ticketpair_t head_tail;
struct __raw_tickets {
__ticket_t head, tail;
} tickets;
};
} arch_spinlock_t;
#define __ARCH_SPIN_LOCK_UNLOCKED { { 0 } }
#endif /* CONFIG_QUEUED_SPINLOCKS */
#include <asm-generic/qrwlock_types.h> #include <asm-generic/qrwlock_types.h>
......
...@@ -575,9 +575,6 @@ static void kvm_kick_cpu(int cpu) ...@@ -575,9 +575,6 @@ static void kvm_kick_cpu(int cpu)
kvm_hypercall2(KVM_HC_KICK_CPU, flags, apicid); kvm_hypercall2(KVM_HC_KICK_CPU, flags, apicid);
} }
#ifdef CONFIG_QUEUED_SPINLOCKS
#include <asm/qspinlock.h> #include <asm/qspinlock.h>
static void kvm_wait(u8 *ptr, u8 val) static void kvm_wait(u8 *ptr, u8 val)
...@@ -606,243 +603,6 @@ static void kvm_wait(u8 *ptr, u8 val) ...@@ -606,243 +603,6 @@ static void kvm_wait(u8 *ptr, u8 val)
local_irq_restore(flags); local_irq_restore(flags);
} }
#else /* !CONFIG_QUEUED_SPINLOCKS */
enum kvm_contention_stat {
TAKEN_SLOW,
TAKEN_SLOW_PICKUP,
RELEASED_SLOW,
RELEASED_SLOW_KICKED,
NR_CONTENTION_STATS
};
#ifdef CONFIG_KVM_DEBUG_FS
#define HISTO_BUCKETS 30
static struct kvm_spinlock_stats
{
u32 contention_stats[NR_CONTENTION_STATS];
u32 histo_spin_blocked[HISTO_BUCKETS+1];
u64 time_blocked;
} spinlock_stats;
static u8 zero_stats;
static inline void check_zero(void)
{
u8 ret;
u8 old;
old = READ_ONCE(zero_stats);
if (unlikely(old)) {
ret = cmpxchg(&zero_stats, old, 0);
/* This ensures only one fellow resets the stat */
if (ret == old)
memset(&spinlock_stats, 0, sizeof(spinlock_stats));
}
}
static inline void add_stats(enum kvm_contention_stat var, u32 val)
{
check_zero();
spinlock_stats.contention_stats[var] += val;
}
static inline u64 spin_time_start(void)
{
return sched_clock();
}
static void __spin_time_accum(u64 delta, u32 *array)
{
unsigned index;
index = ilog2(delta);
check_zero();
if (index < HISTO_BUCKETS)
array[index]++;
else
array[HISTO_BUCKETS]++;
}
static inline void spin_time_accum_blocked(u64 start)
{
u32 delta;
delta = sched_clock() - start;
__spin_time_accum(delta, spinlock_stats.histo_spin_blocked);
spinlock_stats.time_blocked += delta;
}
static struct dentry *d_spin_debug;
static struct dentry *d_kvm_debug;
static struct dentry *kvm_init_debugfs(void)
{
d_kvm_debug = debugfs_create_dir("kvm-guest", NULL);
if (!d_kvm_debug)
printk(KERN_WARNING "Could not create 'kvm' debugfs directory\n");
return d_kvm_debug;
}
static int __init kvm_spinlock_debugfs(void)
{
struct dentry *d_kvm;
d_kvm = kvm_init_debugfs();
if (d_kvm == NULL)
return -ENOMEM;
d_spin_debug = debugfs_create_dir("spinlocks", d_kvm);
debugfs_create_u8("zero_stats", 0644, d_spin_debug, &zero_stats);
debugfs_create_u32("taken_slow", 0444, d_spin_debug,
&spinlock_stats.contention_stats[TAKEN_SLOW]);
debugfs_create_u32("taken_slow_pickup", 0444, d_spin_debug,
&spinlock_stats.contention_stats[TAKEN_SLOW_PICKUP]);
debugfs_create_u32("released_slow", 0444, d_spin_debug,
&spinlock_stats.contention_stats[RELEASED_SLOW]);
debugfs_create_u32("released_slow_kicked", 0444, d_spin_debug,
&spinlock_stats.contention_stats[RELEASED_SLOW_KICKED]);
debugfs_create_u64("time_blocked", 0444, d_spin_debug,
&spinlock_stats.time_blocked);
debugfs_create_u32_array("histo_blocked", 0444, d_spin_debug,
spinlock_stats.histo_spin_blocked, HISTO_BUCKETS + 1);
return 0;
}
fs_initcall(kvm_spinlock_debugfs);
#else /* !CONFIG_KVM_DEBUG_FS */
static inline void add_stats(enum kvm_contention_stat var, u32 val)
{
}
static inline u64 spin_time_start(void)
{
return 0;
}
static inline void spin_time_accum_blocked(u64 start)
{
}
#endif /* CONFIG_KVM_DEBUG_FS */
struct kvm_lock_waiting {
struct arch_spinlock *lock;
__ticket_t want;
};
/* cpus 'waiting' on a spinlock to become available */
static cpumask_t waiting_cpus;
/* Track spinlock on which a cpu is waiting */
static DEFINE_PER_CPU(struct kvm_lock_waiting, klock_waiting);
__visible void kvm_lock_spinning(struct arch_spinlock *lock, __ticket_t want)
{
struct kvm_lock_waiting *w;
int cpu;
u64 start;
unsigned long flags;
__ticket_t head;
if (in_nmi())
return;
w = this_cpu_ptr(&klock_waiting);
cpu = smp_processor_id();
start = spin_time_start();
/*
* Make sure an interrupt handler can't upset things in a
* partially setup state.
*/
local_irq_save(flags);
/*
* The ordering protocol on this is that the "lock" pointer
* may only be set non-NULL if the "want" ticket is correct.
* If we're updating "want", we must first clear "lock".
*/
w->lock = NULL;
smp_wmb();
w->want = want;
smp_wmb();
w->lock = lock;
add_stats(TAKEN_SLOW, 1);
/*
* This uses set_bit, which is atomic but we should not rely on its
* reordering gurantees. So barrier is needed after this call.
*/
cpumask_set_cpu(cpu, &waiting_cpus);
barrier();
/*
* Mark entry to slowpath before doing the pickup test to make
* sure we don't deadlock with an unlocker.
*/
__ticket_enter_slowpath(lock);
/* make sure enter_slowpath, which is atomic does not cross the read */
smp_mb__after_atomic();
/*
* check again make sure it didn't become free while
* we weren't looking.
*/
head = READ_ONCE(lock->tickets.head);
if (__tickets_equal(head, want)) {
add_stats(TAKEN_SLOW_PICKUP, 1);
goto out;
}
/*
* halt until it's our turn and kicked. Note that we do safe halt
* for irq enabled case to avoid hang when lock info is overwritten
* in irq spinlock slowpath and no spurious interrupt occur to save us.
*/
if (arch_irqs_disabled_flags(flags))
halt();
else
safe_halt();
out:
cpumask_clear_cpu(cpu, &waiting_cpus);
w->lock = NULL;
local_irq_restore(flags);
spin_time_accum_blocked(start);
}
PV_CALLEE_SAVE_REGS_THUNK(kvm_lock_spinning);
/* Kick vcpu waiting on @lock->head to reach value @ticket */
static void kvm_unlock_kick(struct arch_spinlock *lock, __ticket_t ticket)
{
int cpu;
add_stats(RELEASED_SLOW, 1);
for_each_cpu(cpu, &waiting_cpus) {
const struct kvm_lock_waiting *w = &per_cpu(klock_waiting, cpu);
if (READ_ONCE(w->lock) == lock &&
READ_ONCE(w->want) == ticket) {
add_stats(RELEASED_SLOW_KICKED, 1);
kvm_kick_cpu(cpu);
break;
}
}
}
#endif /* !CONFIG_QUEUED_SPINLOCKS */
/* /*
* Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present. * Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present.
*/ */
...@@ -854,16 +614,11 @@ void __init kvm_spinlock_init(void) ...@@ -854,16 +614,11 @@ void __init kvm_spinlock_init(void)
if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT)) if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT))
return; return;
#ifdef CONFIG_QUEUED_SPINLOCKS
__pv_init_lock_hash(); __pv_init_lock_hash();
pv_lock_ops.queued_spin_lock_slowpath = __pv_queued_spin_lock_slowpath; pv_lock_ops.queued_spin_lock_slowpath = __pv_queued_spin_lock_slowpath;
pv_lock_ops.queued_spin_unlock = PV_CALLEE_SAVE(__pv_queued_spin_unlock); pv_lock_ops.queued_spin_unlock = PV_CALLEE_SAVE(__pv_queued_spin_unlock);
pv_lock_ops.wait = kvm_wait; pv_lock_ops.wait = kvm_wait;
pv_lock_ops.kick = kvm_kick_cpu; pv_lock_ops.kick = kvm_kick_cpu;
#else /* !CONFIG_QUEUED_SPINLOCKS */
pv_lock_ops.lock_spinning = PV_CALLEE_SAVE(kvm_lock_spinning);
pv_lock_ops.unlock_kick = kvm_unlock_kick;
#endif
} }
static __init int kvm_spinlock_init_jump(void) static __init int kvm_spinlock_init_jump(void)
......
...@@ -8,7 +8,6 @@ ...@@ -8,7 +8,6 @@
#include <asm/paravirt.h> #include <asm/paravirt.h>
#ifdef CONFIG_QUEUED_SPINLOCKS
__visible void __native_queued_spin_unlock(struct qspinlock *lock) __visible void __native_queued_spin_unlock(struct qspinlock *lock)
{ {
native_queued_spin_unlock(lock); native_queued_spin_unlock(lock);
...@@ -21,19 +20,13 @@ bool pv_is_native_spin_unlock(void) ...@@ -21,19 +20,13 @@ bool pv_is_native_spin_unlock(void)
return pv_lock_ops.queued_spin_unlock.func == return pv_lock_ops.queued_spin_unlock.func ==
__raw_callee_save___native_queued_spin_unlock; __raw_callee_save___native_queued_spin_unlock;
} }
#endif
struct pv_lock_ops pv_lock_ops = { struct pv_lock_ops pv_lock_ops = {
#ifdef CONFIG_SMP #ifdef CONFIG_SMP
#ifdef CONFIG_QUEUED_SPINLOCKS
.queued_spin_lock_slowpath = native_queued_spin_lock_slowpath, .queued_spin_lock_slowpath = native_queued_spin_lock_slowpath,
.queued_spin_unlock = PV_CALLEE_SAVE(__native_queued_spin_unlock), .queued_spin_unlock = PV_CALLEE_SAVE(__native_queued_spin_unlock),
.wait = paravirt_nop, .wait = paravirt_nop,
.kick = paravirt_nop, .kick = paravirt_nop,
#else /* !CONFIG_QUEUED_SPINLOCKS */
.lock_spinning = __PV_IS_CALLEE_SAVE(paravirt_nop),
.unlock_kick = paravirt_nop,
#endif /* !CONFIG_QUEUED_SPINLOCKS */
#endif /* SMP */ #endif /* SMP */
}; };
EXPORT_SYMBOL(pv_lock_ops); EXPORT_SYMBOL(pv_lock_ops);
......
...@@ -10,7 +10,7 @@ DEF_NATIVE(pv_mmu_ops, write_cr3, "mov %eax, %cr3"); ...@@ -10,7 +10,7 @@ DEF_NATIVE(pv_mmu_ops, write_cr3, "mov %eax, %cr3");
DEF_NATIVE(pv_mmu_ops, read_cr3, "mov %cr3, %eax"); DEF_NATIVE(pv_mmu_ops, read_cr3, "mov %cr3, %eax");
DEF_NATIVE(pv_cpu_ops, clts, "clts"); DEF_NATIVE(pv_cpu_ops, clts, "clts");
#if defined(CONFIG_PARAVIRT_SPINLOCKS) && defined(CONFIG_QUEUED_SPINLOCKS) #if defined(CONFIG_PARAVIRT_SPINLOCKS)
DEF_NATIVE(pv_lock_ops, queued_spin_unlock, "movb $0, (%eax)"); DEF_NATIVE(pv_lock_ops, queued_spin_unlock, "movb $0, (%eax)");
#endif #endif
...@@ -49,7 +49,7 @@ unsigned native_patch(u8 type, u16 clobbers, void *ibuf, ...@@ -49,7 +49,7 @@ unsigned native_patch(u8 type, u16 clobbers, void *ibuf,
PATCH_SITE(pv_mmu_ops, read_cr3); PATCH_SITE(pv_mmu_ops, read_cr3);
PATCH_SITE(pv_mmu_ops, write_cr3); PATCH_SITE(pv_mmu_ops, write_cr3);
PATCH_SITE(pv_cpu_ops, clts); PATCH_SITE(pv_cpu_ops, clts);
#if defined(CONFIG_PARAVIRT_SPINLOCKS) && defined(CONFIG_QUEUED_SPINLOCKS) #if defined(CONFIG_PARAVIRT_SPINLOCKS)
case PARAVIRT_PATCH(pv_lock_ops.queued_spin_unlock): case PARAVIRT_PATCH(pv_lock_ops.queued_spin_unlock):
if (pv_is_native_spin_unlock()) { if (pv_is_native_spin_unlock()) {
start = start_pv_lock_ops_queued_spin_unlock; start = start_pv_lock_ops_queued_spin_unlock;
......
...@@ -19,7 +19,7 @@ DEF_NATIVE(pv_cpu_ops, swapgs, "swapgs"); ...@@ -19,7 +19,7 @@ DEF_NATIVE(pv_cpu_ops, swapgs, "swapgs");
DEF_NATIVE(, mov32, "mov %edi, %eax"); DEF_NATIVE(, mov32, "mov %edi, %eax");
DEF_NATIVE(, mov64, "mov %rdi, %rax"); DEF_NATIVE(, mov64, "mov %rdi, %rax");
#if defined(CONFIG_PARAVIRT_SPINLOCKS) && defined(CONFIG_QUEUED_SPINLOCKS) #if defined(CONFIG_PARAVIRT_SPINLOCKS)
DEF_NATIVE(pv_lock_ops, queued_spin_unlock, "movb $0, (%rdi)"); DEF_NATIVE(pv_lock_ops, queued_spin_unlock, "movb $0, (%rdi)");
#endif #endif
...@@ -61,7 +61,7 @@ unsigned native_patch(u8 type, u16 clobbers, void *ibuf, ...@@ -61,7 +61,7 @@ unsigned native_patch(u8 type, u16 clobbers, void *ibuf,
PATCH_SITE(pv_cpu_ops, clts); PATCH_SITE(pv_cpu_ops, clts);
PATCH_SITE(pv_mmu_ops, flush_tlb_single); PATCH_SITE(pv_mmu_ops, flush_tlb_single);
PATCH_SITE(pv_cpu_ops, wbinvd); PATCH_SITE(pv_cpu_ops, wbinvd);
#if defined(CONFIG_PARAVIRT_SPINLOCKS) && defined(CONFIG_QUEUED_SPINLOCKS) #if defined(CONFIG_PARAVIRT_SPINLOCKS)
case PARAVIRT_PATCH(pv_lock_ops.queued_spin_unlock): case PARAVIRT_PATCH(pv_lock_ops.queued_spin_unlock):
if (pv_is_native_spin_unlock()) { if (pv_is_native_spin_unlock()) {
start = start_pv_lock_ops_queued_spin_unlock; start = start_pv_lock_ops_queued_spin_unlock;
......
...@@ -21,8 +21,6 @@ static DEFINE_PER_CPU(int, lock_kicker_irq) = -1; ...@@ -21,8 +21,6 @@ static DEFINE_PER_CPU(int, lock_kicker_irq) = -1;
static DEFINE_PER_CPU(char *, irq_name); static DEFINE_PER_CPU(char *, irq_name);
static bool xen_pvspin = true; static bool xen_pvspin = true;
#ifdef CONFIG_QUEUED_SPINLOCKS
#include <asm/qspinlock.h> #include <asm/qspinlock.h>
static void xen_qlock_kick(int cpu) static void xen_qlock_kick(int cpu)
...@@ -71,207 +69,6 @@ static void xen_qlock_wait(u8 *byte, u8 val) ...@@ -71,207 +69,6 @@ static void xen_qlock_wait(u8 *byte, u8 val)
xen_poll_irq(irq); xen_poll_irq(irq);
} }
#else /* CONFIG_QUEUED_SPINLOCKS */
enum xen_contention_stat {
TAKEN_SLOW,
TAKEN_SLOW_PICKUP,
TAKEN_SLOW_SPURIOUS,
RELEASED_SLOW,
RELEASED_SLOW_KICKED,
NR_CONTENTION_STATS
};
#ifdef CONFIG_XEN_DEBUG_FS
#define HISTO_BUCKETS 30
static struct xen_spinlock_stats
{
u32 contention_stats[NR_CONTENTION_STATS];
u32 histo_spin_blocked[HISTO_BUCKETS+1];
u64 time_blocked;
} spinlock_stats;
static u8 zero_stats;
static inline void check_zero(void)
{
u8 ret;
u8 old = READ_ONCE(zero_stats);
if (unlikely(old)) {
ret = cmpxchg(&zero_stats, old, 0);
/* This ensures only one fellow resets the stat */
if (ret == old)
memset(&spinlock_stats, 0, sizeof(spinlock_stats));
}
}
static inline void add_stats(enum xen_contention_stat var, u32 val)
{
check_zero();
spinlock_stats.contention_stats[var] += val;
}
static inline u64 spin_time_start(void)
{
return xen_clocksource_read();
}
static void __spin_time_accum(u64 delta, u32 *array)
{
unsigned index = ilog2(delta);
check_zero();
if (index < HISTO_BUCKETS)
array[index]++;
else
array[HISTO_BUCKETS]++;
}
static inline void spin_time_accum_blocked(u64 start)
{
u32 delta = xen_clocksource_read() - start;
__spin_time_accum(delta, spinlock_stats.histo_spin_blocked);
spinlock_stats.time_blocked += delta;
}
#else /* !CONFIG_XEN_DEBUG_FS */
static inline void add_stats(enum xen_contention_stat var, u32 val)
{
}
static inline u64 spin_time_start(void)
{
return 0;
}
static inline void spin_time_accum_blocked(u64 start)
{
}
#endif /* CONFIG_XEN_DEBUG_FS */
struct xen_lock_waiting {
struct arch_spinlock *lock;
__ticket_t want;
};
static DEFINE_PER_CPU(struct xen_lock_waiting, lock_waiting);
static cpumask_t waiting_cpus;
__visible void xen_lock_spinning(struct arch_spinlock *lock, __ticket_t want)
{
int irq = __this_cpu_read(lock_kicker_irq);
struct xen_lock_waiting *w = this_cpu_ptr(&lock_waiting);
int cpu = smp_processor_id();
u64 start;
__ticket_t head;
unsigned long flags;
/* If kicker interrupts not initialized yet, just spin */
if (irq == -1)
return;
start = spin_time_start();
/*
* Make sure an interrupt handler can't upset things in a
* partially setup state.
*/
local_irq_save(flags);
/*
* We don't really care if we're overwriting some other
* (lock,want) pair, as that would mean that we're currently
* in an interrupt context, and the outer context had
* interrupts enabled. That has already kicked the VCPU out
* of xen_poll_irq(), so it will just return spuriously and
* retry with newly setup (lock,want).
*
* The ordering protocol on this is that the "lock" pointer
* may only be set non-NULL if the "want" ticket is correct.
* If we're updating "want", we must first clear "lock".
*/
w->lock = NULL;
smp_wmb();
w->want = want;
smp_wmb();
w->lock = lock;
/* This uses set_bit, which atomic and therefore a barrier */
cpumask_set_cpu(cpu, &waiting_cpus);
add_stats(TAKEN_SLOW, 1);
/* clear pending */
xen_clear_irq_pending(irq);
/* Only check lock once pending cleared */
barrier();
/*
* Mark entry to slowpath before doing the pickup test to make
* sure we don't deadlock with an unlocker.
*/
__ticket_enter_slowpath(lock);
/* make sure enter_slowpath, which is atomic does not cross the read */
smp_mb__after_atomic();
/*
* check again make sure it didn't become free while
* we weren't looking
*/
head = READ_ONCE(lock->tickets.head);
if (__tickets_equal(head, want)) {
add_stats(TAKEN_SLOW_PICKUP, 1);
goto out;
}
/* Allow interrupts while blocked */
local_irq_restore(flags);
/*
* If an interrupt happens here, it will leave the wakeup irq
* pending, which will cause xen_poll_irq() to return
* immediately.
*/
/* Block until irq becomes pending (or perhaps a spurious wakeup) */
xen_poll_irq(irq);
add_stats(TAKEN_SLOW_SPURIOUS, !xen_test_irq_pending(irq));
local_irq_save(flags);
kstat_incr_irq_this_cpu(irq);
out:
cpumask_clear_cpu(cpu, &waiting_cpus);
w->lock = NULL;
local_irq_restore(flags);
spin_time_accum_blocked(start);
}
PV_CALLEE_SAVE_REGS_THUNK(xen_lock_spinning);
static void xen_unlock_kick(struct arch_spinlock *lock, __ticket_t next)
{
int cpu;
add_stats(RELEASED_SLOW, 1);
for_each_cpu(cpu, &waiting_cpus) {
const struct xen_lock_waiting *w = &per_cpu(lock_waiting, cpu);
/* Make sure we read lock before want */
if (READ_ONCE(w->lock) == lock &&
READ_ONCE(w->want) == next) {
add_stats(RELEASED_SLOW_KICKED, 1);
xen_send_IPI_one(cpu, XEN_SPIN_UNLOCK_VECTOR);
break;
}
}
}
#endif /* CONFIG_QUEUED_SPINLOCKS */
static irqreturn_t dummy_handler(int irq, void *dev_id) static irqreturn_t dummy_handler(int irq, void *dev_id)
{ {
BUG(); BUG();
...@@ -334,16 +131,12 @@ void __init xen_init_spinlocks(void) ...@@ -334,16 +131,12 @@ void __init xen_init_spinlocks(void)
return; return;
} }
printk(KERN_DEBUG "xen: PV spinlocks enabled\n"); printk(KERN_DEBUG "xen: PV spinlocks enabled\n");
#ifdef CONFIG_QUEUED_SPINLOCKS
__pv_init_lock_hash(); __pv_init_lock_hash();
pv_lock_ops.queued_spin_lock_slowpath = __pv_queued_spin_lock_slowpath; pv_lock_ops.queued_spin_lock_slowpath = __pv_queued_spin_lock_slowpath;
pv_lock_ops.queued_spin_unlock = PV_CALLEE_SAVE(__pv_queued_spin_unlock); pv_lock_ops.queued_spin_unlock = PV_CALLEE_SAVE(__pv_queued_spin_unlock);
pv_lock_ops.wait = xen_qlock_wait; pv_lock_ops.wait = xen_qlock_wait;
pv_lock_ops.kick = xen_qlock_kick; pv_lock_ops.kick = xen_qlock_kick;
#else
pv_lock_ops.lock_spinning = PV_CALLEE_SAVE(xen_lock_spinning);
pv_lock_ops.unlock_kick = xen_unlock_kick;
#endif
} }
/* /*
...@@ -372,44 +165,3 @@ static __init int xen_parse_nopvspin(char *arg) ...@@ -372,44 +165,3 @@ static __init int xen_parse_nopvspin(char *arg)
} }
early_param("xen_nopvspin", xen_parse_nopvspin); early_param("xen_nopvspin", xen_parse_nopvspin);
#if defined(CONFIG_XEN_DEBUG_FS) && !defined(CONFIG_QUEUED_SPINLOCKS)
static struct dentry *d_spin_debug;
static int __init xen_spinlock_debugfs(void)
{
struct dentry *d_xen = xen_init_debugfs();
if (d_xen == NULL)
return -ENOMEM;
if (!xen_pvspin)
return 0;
d_spin_debug = debugfs_create_dir("spinlocks", d_xen);
debugfs_create_u8("zero_stats", 0644, d_spin_debug, &zero_stats);
debugfs_create_u32("taken_slow", 0444, d_spin_debug,
&spinlock_stats.contention_stats[TAKEN_SLOW]);
debugfs_create_u32("taken_slow_pickup", 0444, d_spin_debug,
&spinlock_stats.contention_stats[TAKEN_SLOW_PICKUP]);
debugfs_create_u32("taken_slow_spurious", 0444, d_spin_debug,
&spinlock_stats.contention_stats[TAKEN_SLOW_SPURIOUS]);
debugfs_create_u32("released_slow", 0444, d_spin_debug,
&spinlock_stats.contention_stats[RELEASED_SLOW]);
debugfs_create_u32("released_slow_kicked", 0444, d_spin_debug,
&spinlock_stats.contention_stats[RELEASED_SLOW_KICKED]);
debugfs_create_u64("time_blocked", 0444, d_spin_debug,
&spinlock_stats.time_blocked);
debugfs_create_u32_array("histo_blocked", 0444, d_spin_debug,
spinlock_stats.histo_spin_blocked, HISTO_BUCKETS + 1);
return 0;
}
fs_initcall(xen_spinlock_debugfs);
#endif /* CONFIG_XEN_DEBUG_FS */
...@@ -79,6 +79,7 @@ config EXPORTFS_BLOCK_OPS ...@@ -79,6 +79,7 @@ config EXPORTFS_BLOCK_OPS
config FILE_LOCKING config FILE_LOCKING
bool "Enable POSIX file locking API" if EXPERT bool "Enable POSIX file locking API" if EXPERT
default y default y
select PERCPU_RWSEM
help help
This option enables standard file locking support, required This option enables standard file locking support, required
for filesystems like NFS and for the flock() system for filesystems like NFS and for the flock() system
......
...@@ -127,7 +127,6 @@ ...@@ -127,7 +127,6 @@
#include <linux/pid_namespace.h> #include <linux/pid_namespace.h>
#include <linux/hashtable.h> #include <linux/hashtable.h>
#include <linux/percpu.h> #include <linux/percpu.h>
#include <linux/lglock.h>
#define CREATE_TRACE_POINTS #define CREATE_TRACE_POINTS
#include <trace/events/filelock.h> #include <trace/events/filelock.h>
...@@ -158,12 +157,18 @@ int lease_break_time = 45; ...@@ -158,12 +157,18 @@ int lease_break_time = 45;
/* /*
* The global file_lock_list is only used for displaying /proc/locks, so we * The global file_lock_list is only used for displaying /proc/locks, so we
* keep a list on each CPU, with each list protected by its own spinlock via * keep a list on each CPU, with each list protected by its own spinlock.
* the file_lock_lglock. Note that alterations to the list also require that * Global serialization is done using file_rwsem.
* the relevant flc_lock is held. *
* Note that alterations to the list also require that the relevant flc_lock is
* held.
*/ */
DEFINE_STATIC_LGLOCK(file_lock_lglock); struct file_lock_list_struct {
static DEFINE_PER_CPU(struct hlist_head, file_lock_list); spinlock_t lock;
struct hlist_head hlist;
};
static DEFINE_PER_CPU(struct file_lock_list_struct, file_lock_list);
DEFINE_STATIC_PERCPU_RWSEM(file_rwsem);
/* /*
* The blocked_hash is used to find POSIX lock loops for deadlock detection. * The blocked_hash is used to find POSIX lock loops for deadlock detection.
...@@ -587,15 +592,23 @@ static int posix_same_owner(struct file_lock *fl1, struct file_lock *fl2) ...@@ -587,15 +592,23 @@ static int posix_same_owner(struct file_lock *fl1, struct file_lock *fl2)
/* Must be called with the flc_lock held! */ /* Must be called with the flc_lock held! */
static void locks_insert_global_locks(struct file_lock *fl) static void locks_insert_global_locks(struct file_lock *fl)
{ {
lg_local_lock(&file_lock_lglock); struct file_lock_list_struct *fll = this_cpu_ptr(&file_lock_list);
percpu_rwsem_assert_held(&file_rwsem);
spin_lock(&fll->lock);
fl->fl_link_cpu = smp_processor_id(); fl->fl_link_cpu = smp_processor_id();
hlist_add_head(&fl->fl_link, this_cpu_ptr(&file_lock_list)); hlist_add_head(&fl->fl_link, &fll->hlist);
lg_local_unlock(&file_lock_lglock); spin_unlock(&fll->lock);
} }
/* Must be called with the flc_lock held! */ /* Must be called with the flc_lock held! */
static void locks_delete_global_locks(struct file_lock *fl) static void locks_delete_global_locks(struct file_lock *fl)
{ {
struct file_lock_list_struct *fll;
percpu_rwsem_assert_held(&file_rwsem);
/* /*
* Avoid taking lock if already unhashed. This is safe since this check * Avoid taking lock if already unhashed. This is safe since this check
* is done while holding the flc_lock, and new insertions into the list * is done while holding the flc_lock, and new insertions into the list
...@@ -603,9 +616,11 @@ static void locks_delete_global_locks(struct file_lock *fl) ...@@ -603,9 +616,11 @@ static void locks_delete_global_locks(struct file_lock *fl)
*/ */
if (hlist_unhashed(&fl->fl_link)) if (hlist_unhashed(&fl->fl_link))
return; return;
lg_local_lock_cpu(&file_lock_lglock, fl->fl_link_cpu);
fll = per_cpu_ptr(&file_lock_list, fl->fl_link_cpu);
spin_lock(&fll->lock);
hlist_del_init(&fl->fl_link); hlist_del_init(&fl->fl_link);
lg_local_unlock_cpu(&file_lock_lglock, fl->fl_link_cpu); spin_unlock(&fll->lock);
} }
static unsigned long static unsigned long
...@@ -915,6 +930,7 @@ static int flock_lock_inode(struct inode *inode, struct file_lock *request) ...@@ -915,6 +930,7 @@ static int flock_lock_inode(struct inode *inode, struct file_lock *request)
return -ENOMEM; return -ENOMEM;
} }
percpu_down_read_preempt_disable(&file_rwsem);
spin_lock(&ctx->flc_lock); spin_lock(&ctx->flc_lock);
if (request->fl_flags & FL_ACCESS) if (request->fl_flags & FL_ACCESS)
goto find_conflict; goto find_conflict;
...@@ -955,6 +971,7 @@ static int flock_lock_inode(struct inode *inode, struct file_lock *request) ...@@ -955,6 +971,7 @@ static int flock_lock_inode(struct inode *inode, struct file_lock *request)
out: out:
spin_unlock(&ctx->flc_lock); spin_unlock(&ctx->flc_lock);
percpu_up_read_preempt_enable(&file_rwsem);
if (new_fl) if (new_fl)
locks_free_lock(new_fl); locks_free_lock(new_fl);
locks_dispose_list(&dispose); locks_dispose_list(&dispose);
...@@ -991,6 +1008,7 @@ static int posix_lock_inode(struct inode *inode, struct file_lock *request, ...@@ -991,6 +1008,7 @@ static int posix_lock_inode(struct inode *inode, struct file_lock *request,
new_fl2 = locks_alloc_lock(); new_fl2 = locks_alloc_lock();
} }
percpu_down_read_preempt_disable(&file_rwsem);
spin_lock(&ctx->flc_lock); spin_lock(&ctx->flc_lock);
/* /*
* New lock request. Walk all POSIX locks and look for conflicts. If * New lock request. Walk all POSIX locks and look for conflicts. If
...@@ -1162,6 +1180,7 @@ static int posix_lock_inode(struct inode *inode, struct file_lock *request, ...@@ -1162,6 +1180,7 @@ static int posix_lock_inode(struct inode *inode, struct file_lock *request,
} }
out: out:
spin_unlock(&ctx->flc_lock); spin_unlock(&ctx->flc_lock);
percpu_up_read_preempt_enable(&file_rwsem);
/* /*
* Free any unused locks. * Free any unused locks.
*/ */
...@@ -1436,6 +1455,7 @@ int __break_lease(struct inode *inode, unsigned int mode, unsigned int type) ...@@ -1436,6 +1455,7 @@ int __break_lease(struct inode *inode, unsigned int mode, unsigned int type)
return error; return error;
} }
percpu_down_read_preempt_disable(&file_rwsem);
spin_lock(&ctx->flc_lock); spin_lock(&ctx->flc_lock);
time_out_leases(inode, &dispose); time_out_leases(inode, &dispose);
...@@ -1487,9 +1507,13 @@ int __break_lease(struct inode *inode, unsigned int mode, unsigned int type) ...@@ -1487,9 +1507,13 @@ int __break_lease(struct inode *inode, unsigned int mode, unsigned int type)
locks_insert_block(fl, new_fl); locks_insert_block(fl, new_fl);
trace_break_lease_block(inode, new_fl); trace_break_lease_block(inode, new_fl);
spin_unlock(&ctx->flc_lock); spin_unlock(&ctx->flc_lock);
percpu_up_read_preempt_enable(&file_rwsem);
locks_dispose_list(&dispose); locks_dispose_list(&dispose);
error = wait_event_interruptible_timeout(new_fl->fl_wait, error = wait_event_interruptible_timeout(new_fl->fl_wait,
!new_fl->fl_next, break_time); !new_fl->fl_next, break_time);
percpu_down_read_preempt_disable(&file_rwsem);
spin_lock(&ctx->flc_lock); spin_lock(&ctx->flc_lock);
trace_break_lease_unblock(inode, new_fl); trace_break_lease_unblock(inode, new_fl);
locks_delete_block(new_fl); locks_delete_block(new_fl);
...@@ -1506,6 +1530,7 @@ int __break_lease(struct inode *inode, unsigned int mode, unsigned int type) ...@@ -1506,6 +1530,7 @@ int __break_lease(struct inode *inode, unsigned int mode, unsigned int type)
} }
out: out:
spin_unlock(&ctx->flc_lock); spin_unlock(&ctx->flc_lock);
percpu_up_read_preempt_enable(&file_rwsem);
locks_dispose_list(&dispose); locks_dispose_list(&dispose);
locks_free_lock(new_fl); locks_free_lock(new_fl);
return error; return error;
...@@ -1660,6 +1685,7 @@ generic_add_lease(struct file *filp, long arg, struct file_lock **flp, void **pr ...@@ -1660,6 +1685,7 @@ generic_add_lease(struct file *filp, long arg, struct file_lock **flp, void **pr
return -EINVAL; return -EINVAL;
} }
percpu_down_read_preempt_disable(&file_rwsem);
spin_lock(&ctx->flc_lock); spin_lock(&ctx->flc_lock);
time_out_leases(inode, &dispose); time_out_leases(inode, &dispose);
error = check_conflicting_open(dentry, arg, lease->fl_flags); error = check_conflicting_open(dentry, arg, lease->fl_flags);
...@@ -1730,6 +1756,7 @@ generic_add_lease(struct file *filp, long arg, struct file_lock **flp, void **pr ...@@ -1730,6 +1756,7 @@ generic_add_lease(struct file *filp, long arg, struct file_lock **flp, void **pr
lease->fl_lmops->lm_setup(lease, priv); lease->fl_lmops->lm_setup(lease, priv);
out: out:
spin_unlock(&ctx->flc_lock); spin_unlock(&ctx->flc_lock);
percpu_up_read_preempt_enable(&file_rwsem);
locks_dispose_list(&dispose); locks_dispose_list(&dispose);
if (is_deleg) if (is_deleg)
inode_unlock(inode); inode_unlock(inode);
...@@ -1752,6 +1779,7 @@ static int generic_delete_lease(struct file *filp, void *owner) ...@@ -1752,6 +1779,7 @@ static int generic_delete_lease(struct file *filp, void *owner)
return error; return error;
} }
percpu_down_read_preempt_disable(&file_rwsem);
spin_lock(&ctx->flc_lock); spin_lock(&ctx->flc_lock);
list_for_each_entry(fl, &ctx->flc_lease, fl_list) { list_for_each_entry(fl, &ctx->flc_lease, fl_list) {
if (fl->fl_file == filp && if (fl->fl_file == filp &&
...@@ -1764,6 +1792,7 @@ static int generic_delete_lease(struct file *filp, void *owner) ...@@ -1764,6 +1792,7 @@ static int generic_delete_lease(struct file *filp, void *owner)
if (victim) if (victim)
error = fl->fl_lmops->lm_change(victim, F_UNLCK, &dispose); error = fl->fl_lmops->lm_change(victim, F_UNLCK, &dispose);
spin_unlock(&ctx->flc_lock); spin_unlock(&ctx->flc_lock);
percpu_up_read_preempt_enable(&file_rwsem);
locks_dispose_list(&dispose); locks_dispose_list(&dispose);
return error; return error;
} }
...@@ -2703,9 +2732,9 @@ static void *locks_start(struct seq_file *f, loff_t *pos) ...@@ -2703,9 +2732,9 @@ static void *locks_start(struct seq_file *f, loff_t *pos)
struct locks_iterator *iter = f->private; struct locks_iterator *iter = f->private;
iter->li_pos = *pos + 1; iter->li_pos = *pos + 1;
lg_global_lock(&file_lock_lglock); percpu_down_write(&file_rwsem);
spin_lock(&blocked_lock_lock); spin_lock(&blocked_lock_lock);
return seq_hlist_start_percpu(&file_lock_list, &iter->li_cpu, *pos); return seq_hlist_start_percpu(&file_lock_list.hlist, &iter->li_cpu, *pos);
} }
static void *locks_next(struct seq_file *f, void *v, loff_t *pos) static void *locks_next(struct seq_file *f, void *v, loff_t *pos)
...@@ -2713,14 +2742,14 @@ static void *locks_next(struct seq_file *f, void *v, loff_t *pos) ...@@ -2713,14 +2742,14 @@ static void *locks_next(struct seq_file *f, void *v, loff_t *pos)
struct locks_iterator *iter = f->private; struct locks_iterator *iter = f->private;
++iter->li_pos; ++iter->li_pos;
return seq_hlist_next_percpu(v, &file_lock_list, &iter->li_cpu, pos); return seq_hlist_next_percpu(v, &file_lock_list.hlist, &iter->li_cpu, pos);
} }
static void locks_stop(struct seq_file *f, void *v) static void locks_stop(struct seq_file *f, void *v)
__releases(&blocked_lock_lock) __releases(&blocked_lock_lock)
{ {
spin_unlock(&blocked_lock_lock); spin_unlock(&blocked_lock_lock);
lg_global_unlock(&file_lock_lglock); percpu_up_write(&file_rwsem);
} }
static const struct seq_operations locks_seq_operations = { static const struct seq_operations locks_seq_operations = {
...@@ -2761,10 +2790,13 @@ static int __init filelock_init(void) ...@@ -2761,10 +2790,13 @@ static int __init filelock_init(void)
filelock_cache = kmem_cache_create("file_lock_cache", filelock_cache = kmem_cache_create("file_lock_cache",
sizeof(struct file_lock), 0, SLAB_PANIC, NULL); sizeof(struct file_lock), 0, SLAB_PANIC, NULL);
lg_lock_init(&file_lock_lglock, "file_lock_lglock");
for_each_possible_cpu(i) for_each_possible_cpu(i) {
INIT_HLIST_HEAD(per_cpu_ptr(&file_lock_list, i)); struct file_lock_list_struct *fll = per_cpu_ptr(&file_lock_list, i);
spin_lock_init(&fll->lock);
INIT_HLIST_HEAD(&fll->hlist);
}
return 0; return 0;
} }
......
/*
* Specialised local-global spinlock. Can only be declared as global variables
* to avoid overhead and keep things simple (and we don't want to start using
* these inside dynamically allocated structures).
*
* "local/global locks" (lglocks) can be used to:
*
* - Provide fast exclusive access to per-CPU data, with exclusive access to
* another CPU's data allowed but possibly subject to contention, and to
* provide very slow exclusive access to all per-CPU data.
* - Or to provide very fast and scalable read serialisation, and to provide
* very slow exclusive serialisation of data (not necessarily per-CPU data).
*
* Brlocks are also implemented as a short-hand notation for the latter use
* case.
*
* Copyright 2009, 2010, Nick Piggin, Novell Inc.
*/
#ifndef __LINUX_LGLOCK_H
#define __LINUX_LGLOCK_H
#include <linux/spinlock.h>
#include <linux/lockdep.h>
#include <linux/percpu.h>
#include <linux/cpu.h>
#include <linux/notifier.h>
#ifdef CONFIG_SMP
#ifdef CONFIG_DEBUG_LOCK_ALLOC
#define LOCKDEP_INIT_MAP lockdep_init_map
#else
#define LOCKDEP_INIT_MAP(a, b, c, d)
#endif
struct lglock {
arch_spinlock_t __percpu *lock;
#ifdef CONFIG_DEBUG_LOCK_ALLOC
struct lock_class_key lock_key;
struct lockdep_map lock_dep_map;
#endif
};
#define DEFINE_LGLOCK(name) \
static DEFINE_PER_CPU(arch_spinlock_t, name ## _lock) \
= __ARCH_SPIN_LOCK_UNLOCKED; \
struct lglock name = { .lock = &name ## _lock }
#define DEFINE_STATIC_LGLOCK(name) \
static DEFINE_PER_CPU(arch_spinlock_t, name ## _lock) \
= __ARCH_SPIN_LOCK_UNLOCKED; \
static struct lglock name = { .lock = &name ## _lock }
void lg_lock_init(struct lglock *lg, char *name);
void lg_local_lock(struct lglock *lg);
void lg_local_unlock(struct lglock *lg);
void lg_local_lock_cpu(struct lglock *lg, int cpu);
void lg_local_unlock_cpu(struct lglock *lg, int cpu);
void lg_double_lock(struct lglock *lg, int cpu1, int cpu2);
void lg_double_unlock(struct lglock *lg, int cpu1, int cpu2);
void lg_global_lock(struct lglock *lg);
void lg_global_unlock(struct lglock *lg);
#else
/* When !CONFIG_SMP, map lglock to spinlock */
#define lglock spinlock
#define DEFINE_LGLOCK(name) DEFINE_SPINLOCK(name)
#define DEFINE_STATIC_LGLOCK(name) static DEFINE_SPINLOCK(name)
#define lg_lock_init(lg, name) spin_lock_init(lg)
#define lg_local_lock spin_lock
#define lg_local_unlock spin_unlock
#define lg_local_lock_cpu(lg, cpu) spin_lock(lg)
#define lg_local_unlock_cpu(lg, cpu) spin_unlock(lg)
#define lg_global_lock spin_lock
#define lg_global_unlock spin_unlock
#endif
#endif
...@@ -10,32 +10,122 @@ ...@@ -10,32 +10,122 @@
struct percpu_rw_semaphore { struct percpu_rw_semaphore {
struct rcu_sync rss; struct rcu_sync rss;
unsigned int __percpu *fast_read_ctr; unsigned int __percpu *read_count;
struct rw_semaphore rw_sem; struct rw_semaphore rw_sem;
atomic_t slow_read_ctr; wait_queue_head_t writer;
wait_queue_head_t write_waitq; int readers_block;
}; };
extern void percpu_down_read(struct percpu_rw_semaphore *); #define DEFINE_STATIC_PERCPU_RWSEM(name) \
extern int percpu_down_read_trylock(struct percpu_rw_semaphore *); static DEFINE_PER_CPU(unsigned int, __percpu_rwsem_rc_##name); \
extern void percpu_up_read(struct percpu_rw_semaphore *); static struct percpu_rw_semaphore name = { \
.rss = __RCU_SYNC_INITIALIZER(name.rss, RCU_SCHED_SYNC), \
.read_count = &__percpu_rwsem_rc_##name, \
.rw_sem = __RWSEM_INITIALIZER(name.rw_sem), \
.writer = __WAIT_QUEUE_HEAD_INITIALIZER(name.writer), \
}
extern int __percpu_down_read(struct percpu_rw_semaphore *, int);
extern void __percpu_up_read(struct percpu_rw_semaphore *);
static inline void percpu_down_read_preempt_disable(struct percpu_rw_semaphore *sem)
{
might_sleep();
rwsem_acquire_read(&sem->rw_sem.dep_map, 0, 0, _RET_IP_);
preempt_disable();
/*
* We are in an RCU-sched read-side critical section, so the writer
* cannot both change sem->state from readers_fast and start checking
* counters while we are here. So if we see !sem->state, we know that
* the writer won't be checking until we're past the preempt_enable()
* and that one the synchronize_sched() is done, the writer will see
* anything we did within this RCU-sched read-size critical section.
*/
__this_cpu_inc(*sem->read_count);
if (unlikely(!rcu_sync_is_idle(&sem->rss)))
__percpu_down_read(sem, false); /* Unconditional memory barrier */
barrier();
/*
* The barrier() prevents the compiler from
* bleeding the critical section out.
*/
}
static inline void percpu_down_read(struct percpu_rw_semaphore *sem)
{
percpu_down_read_preempt_disable(sem);
preempt_enable();
}
static inline int percpu_down_read_trylock(struct percpu_rw_semaphore *sem)
{
int ret = 1;
preempt_disable();
/*
* Same as in percpu_down_read().
*/
__this_cpu_inc(*sem->read_count);
if (unlikely(!rcu_sync_is_idle(&sem->rss)))
ret = __percpu_down_read(sem, true); /* Unconditional memory barrier */
preempt_enable();
/*
* The barrier() from preempt_enable() prevents the compiler from
* bleeding the critical section out.
*/
if (ret)
rwsem_acquire_read(&sem->rw_sem.dep_map, 0, 1, _RET_IP_);
return ret;
}
static inline void percpu_up_read_preempt_enable(struct percpu_rw_semaphore *sem)
{
/*
* The barrier() prevents the compiler from
* bleeding the critical section out.
*/
barrier();
/*
* Same as in percpu_down_read().
*/
if (likely(rcu_sync_is_idle(&sem->rss)))
__this_cpu_dec(*sem->read_count);
else
__percpu_up_read(sem); /* Unconditional memory barrier */
preempt_enable();
rwsem_release(&sem->rw_sem.dep_map, 1, _RET_IP_);
}
static inline void percpu_up_read(struct percpu_rw_semaphore *sem)
{
preempt_disable();
percpu_up_read_preempt_enable(sem);
}
extern void percpu_down_write(struct percpu_rw_semaphore *);
extern void percpu_up_write(struct percpu_rw_semaphore *);

extern int __percpu_init_rwsem(struct percpu_rw_semaphore *,
				const char *, struct lock_class_key *);
extern void percpu_free_rwsem(struct percpu_rw_semaphore *);

-#define percpu_init_rwsem(brw)					\
+#define percpu_init_rwsem(sem)					\
({								\
	static struct lock_class_key rwsem_key;			\
-	__percpu_init_rwsem(brw, #brw, &rwsem_key);		\
+	__percpu_init_rwsem(sem, #sem, &rwsem_key);		\
})

#define percpu_rwsem_is_held(sem) lockdep_is_held(&(sem)->rw_sem)
+#define percpu_rwsem_assert_held(sem)				\
+	lockdep_assert_held(&(sem)->rw_sem)

static inline void percpu_rwsem_release(struct percpu_rw_semaphore *sem,
					bool read, unsigned long ip)
{
......
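
A minimal usage sketch of the percpu-rwsem API above, assuming a kernel with this series applied; example_rwsem and both functions are illustrative, not taken from the patch.

#include <linux/percpu-rwsem.h>

DEFINE_STATIC_PERCPU_RWSEM(example_rwsem);

static void example_reader(void)
{
	percpu_down_read(&example_rwsem);
	/* Read-side section: normally just a per-CPU increment/decrement. */
	percpu_up_read(&example_rwsem);
}

static void example_writer(void)
{
	/* Forces readers onto the slow path and waits for them to drain. */
	percpu_down_write(&example_rwsem);
	/* Exclusive section. */
	percpu_up_write(&example_rwsem);
}

Callers that want to keep preemption disabled across the read side, as fs/locks.c now does, would use percpu_down_read_preempt_disable()/percpu_up_read_preempt_enable() instead.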
@@ -59,6 +59,7 @@ static inline bool rcu_sync_is_idle(struct rcu_sync *rsp)
}

extern void rcu_sync_init(struct rcu_sync *, enum rcu_sync_type);
+extern void rcu_sync_enter_start(struct rcu_sync *);
extern void rcu_sync_enter(struct rcu_sync *);
extern void rcu_sync_exit(struct rcu_sync *);
extern void rcu_sync_dtor(struct rcu_sync *);
......
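
A sketch of where the new rcu_sync_enter_start() slots into the rcu_sync lifecycle, inferred from the declarations above and the cgroup call site below; the example_ names are made up.

#include <linux/init.h>
#include <linux/rcu_sync.h>

static struct rcu_sync example_rss;

static int __init example_init(void)
{
	rcu_sync_init(&example_rss, RCU_SCHED_SYNC);
	/*
	 * Optionally force readers onto the slow path right away,
	 * without paying for a synchronize_sched() at boot.
	 */
	rcu_sync_enter_start(&example_rss);
	return 0;
}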
@@ -5627,6 +5627,12 @@ int __init cgroup_init(void)
	BUG_ON(cgroup_init_cftypes(NULL, cgroup_dfl_base_files));
	BUG_ON(cgroup_init_cftypes(NULL, cgroup_legacy_base_files));

+	/*
+	 * The latency of the synchronize_sched() is too high for cgroups,
+	 * avoid it at the cost of forcing all readers into the slow path.
+	 */
+	rcu_sync_enter_start(&cgroup_threadgroup_rwsem.rss);
+
	get_user_ns(init_cgroup_ns.user_ns);

	mutex_lock(&cgroup_mutex);
......
@@ -381,8 +381,12 @@ static inline int hb_waiters_pending(struct futex_hash_bucket *hb)
#endif
}

-/*
- * We hash on the keys returned from get_futex_key (see below).
+/**
+ * hash_futex - Return the hash bucket in the global hash
+ * @key:	Pointer to the futex key for which the hash is calculated
+ *
+ * We hash on the keys returned from get_futex_key (see below) and return the
+ * corresponding hash bucket in the global hash.
 */
static struct futex_hash_bucket *hash_futex(union futex_key *key)
{
@@ -392,7 +396,12 @@ static struct futex_hash_bucket *hash_futex(union futex_key *key)
	return &futex_queues[hash & (futex_hashsize - 1)];
}

-/*
+/**
+ * match_futex - Check whether two futex keys are equal
+ * @key1:	Pointer to key1
+ * @key2:	Pointer to key2
+ *
 * Return 1 if two futex_keys are equal, 0 otherwise.
 */
static inline int match_futex(union futex_key *key1, union futex_key *key2)
......
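
The two kerneldoc additions above describe helpers that are almost always used together; the sketch below is simplified from the lookup loops in kernel/futex.c, with locking and key reference counting omitted, and example_find_waiter is a made-up name.

/*
 * Illustrative fragment; relies on the futex_q/futex_hash_bucket types
 * and helpers local to kernel/futex.c.
 */
static struct futex_q *example_find_waiter(union futex_key *key)
{
	struct futex_hash_bucket *hb = hash_futex(key);
	struct futex_q *this, *next;

	plist_for_each_entry_safe(this, next, &hb->chain, list) {
		if (match_futex(&this->key, key))
			return this;
	}

	return NULL;
}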
@@ -117,7 +117,7 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout)
	pr_err("\"echo 0 > /proc/sys/kernel/hung_task_timeout_secs\""
		" disables this message.\n");
	sched_show_task(t);
-	debug_show_held_locks(t);
+	debug_show_all_locks();
	touch_nmi_watchdog();
......
@@ -18,7 +18,6 @@ obj-$(CONFIG_LOCKDEP) += lockdep_proc.o
endif
obj-$(CONFIG_SMP) += spinlock.o
obj-$(CONFIG_LOCK_SPIN_ON_OWNER) += osq_lock.o
-obj-$(CONFIG_SMP) += lglock.o
obj-$(CONFIG_PROVE_LOCKING) += spinlock.o
obj-$(CONFIG_QUEUED_SPINLOCKS) += qspinlock.o
obj-$(CONFIG_RT_MUTEXES) += rtmutex.o
......
/* See include/linux/lglock.h for description */
#include <linux/module.h>
#include <linux/lglock.h>
#include <linux/cpu.h>
#include <linux/string.h>
/*
* Note there is no uninit, so lglocks cannot be defined in
* modules (but it's fine to use them from there)
* Could be added though, just undo lg_lock_init
*/
void lg_lock_init(struct lglock *lg, char *name)
{
LOCKDEP_INIT_MAP(&lg->lock_dep_map, name, &lg->lock_key, 0);
}
EXPORT_SYMBOL(lg_lock_init);
void lg_local_lock(struct lglock *lg)
{
arch_spinlock_t *lock;
preempt_disable();
lock_acquire_shared(&lg->lock_dep_map, 0, 0, NULL, _RET_IP_);
lock = this_cpu_ptr(lg->lock);
arch_spin_lock(lock);
}
EXPORT_SYMBOL(lg_local_lock);
void lg_local_unlock(struct lglock *lg)
{
arch_spinlock_t *lock;
lock_release(&lg->lock_dep_map, 1, _RET_IP_);
lock = this_cpu_ptr(lg->lock);
arch_spin_unlock(lock);
preempt_enable();
}
EXPORT_SYMBOL(lg_local_unlock);
void lg_local_lock_cpu(struct lglock *lg, int cpu)
{
arch_spinlock_t *lock;
preempt_disable();
lock_acquire_shared(&lg->lock_dep_map, 0, 0, NULL, _RET_IP_);
lock = per_cpu_ptr(lg->lock, cpu);
arch_spin_lock(lock);
}
EXPORT_SYMBOL(lg_local_lock_cpu);
void lg_local_unlock_cpu(struct lglock *lg, int cpu)
{
arch_spinlock_t *lock;
lock_release(&lg->lock_dep_map, 1, _RET_IP_);
lock = per_cpu_ptr(lg->lock, cpu);
arch_spin_unlock(lock);
preempt_enable();
}
EXPORT_SYMBOL(lg_local_unlock_cpu);
void lg_double_lock(struct lglock *lg, int cpu1, int cpu2)
{
BUG_ON(cpu1 == cpu2);
/* lock in cpu order, just like lg_global_lock */
if (cpu2 < cpu1)
swap(cpu1, cpu2);
preempt_disable();
lock_acquire_shared(&lg->lock_dep_map, 0, 0, NULL, _RET_IP_);
arch_spin_lock(per_cpu_ptr(lg->lock, cpu1));
arch_spin_lock(per_cpu_ptr(lg->lock, cpu2));
}
void lg_double_unlock(struct lglock *lg, int cpu1, int cpu2)
{
lock_release(&lg->lock_dep_map, 1, _RET_IP_);
arch_spin_unlock(per_cpu_ptr(lg->lock, cpu1));
arch_spin_unlock(per_cpu_ptr(lg->lock, cpu2));
preempt_enable();
}
void lg_global_lock(struct lglock *lg)
{
int i;
preempt_disable();
lock_acquire_exclusive(&lg->lock_dep_map, 0, 0, NULL, _RET_IP_);
for_each_possible_cpu(i) {
arch_spinlock_t *lock;
lock = per_cpu_ptr(lg->lock, i);
arch_spin_lock(lock);
}
}
EXPORT_SYMBOL(lg_global_lock);
void lg_global_unlock(struct lglock *lg)
{
int i;
lock_release(&lg->lock_dep_map, 1, _RET_IP_);
for_each_possible_cpu(i) {
arch_spinlock_t *lock;
lock = per_cpu_ptr(lg->lock, i);
arch_spin_unlock(lock);
}
preempt_enable();
}
EXPORT_SYMBOL(lg_global_unlock);
@@ -8,152 +8,186 @@
#include <linux/sched.h> #include <linux/sched.h>
#include <linux/errno.h> #include <linux/errno.h>
int __percpu_init_rwsem(struct percpu_rw_semaphore *brw, int __percpu_init_rwsem(struct percpu_rw_semaphore *sem,
const char *name, struct lock_class_key *rwsem_key) const char *name, struct lock_class_key *rwsem_key)
{ {
brw->fast_read_ctr = alloc_percpu(int); sem->read_count = alloc_percpu(int);
if (unlikely(!brw->fast_read_ctr)) if (unlikely(!sem->read_count))
return -ENOMEM; return -ENOMEM;
/* ->rw_sem represents the whole percpu_rw_semaphore for lockdep */ /* ->rw_sem represents the whole percpu_rw_semaphore for lockdep */
__init_rwsem(&brw->rw_sem, name, rwsem_key); rcu_sync_init(&sem->rss, RCU_SCHED_SYNC);
rcu_sync_init(&brw->rss, RCU_SCHED_SYNC); __init_rwsem(&sem->rw_sem, name, rwsem_key);
atomic_set(&brw->slow_read_ctr, 0); init_waitqueue_head(&sem->writer);
init_waitqueue_head(&brw->write_waitq); sem->readers_block = 0;
return 0; return 0;
} }
EXPORT_SYMBOL_GPL(__percpu_init_rwsem); EXPORT_SYMBOL_GPL(__percpu_init_rwsem);
void percpu_free_rwsem(struct percpu_rw_semaphore *brw) void percpu_free_rwsem(struct percpu_rw_semaphore *sem)
{ {
/* /*
* XXX: temporary kludge. The error path in alloc_super() * XXX: temporary kludge. The error path in alloc_super()
* assumes that percpu_free_rwsem() is safe after kzalloc(). * assumes that percpu_free_rwsem() is safe after kzalloc().
*/ */
if (!brw->fast_read_ctr) if (!sem->read_count)
return; return;
rcu_sync_dtor(&brw->rss); rcu_sync_dtor(&sem->rss);
free_percpu(brw->fast_read_ctr); free_percpu(sem->read_count);
brw->fast_read_ctr = NULL; /* catch use after free bugs */ sem->read_count = NULL; /* catch use after free bugs */
} }
EXPORT_SYMBOL_GPL(percpu_free_rwsem); EXPORT_SYMBOL_GPL(percpu_free_rwsem);
/* int __percpu_down_read(struct percpu_rw_semaphore *sem, int try)
* This is the fast-path for down_read/up_read. If it succeeds we rely {
* on the barriers provided by rcu_sync_enter/exit; see the comments in /*
* percpu_down_write() and percpu_up_write(). * Due to having preemption disabled the decrement happens on
* the same CPU as the increment, avoiding the
* increment-on-one-CPU-and-decrement-on-another problem.
* *
* If this helper fails the callers rely on the normal rw_semaphore and * If the reader misses the writer's assignment of readers_block, then
* atomic_dec_and_test(), so in this case we have the necessary barriers. * the writer is guaranteed to see the reader's increment.
*
* Conversely, any readers that increment their sem->read_count after
* the writer looks are guaranteed to see the readers_block value,
* which in turn means that they are guaranteed to immediately
* decrement their sem->read_count, so that it doesn't matter that the
* writer missed them.
*/ */
static bool update_fast_ctr(struct percpu_rw_semaphore *brw, unsigned int val)
{
bool success;
preempt_disable(); smp_mb(); /* A matches D */
success = rcu_sync_is_idle(&brw->rss);
if (likely(success))
__this_cpu_add(*brw->fast_read_ctr, val);
preempt_enable();
return success; /*
} * If !readers_block the critical section starts here, matched by the
* release in percpu_up_write().
*/
if (likely(!smp_load_acquire(&sem->readers_block)))
return 1;
/* /*
* Like the normal down_read() this is not recursive, the writer can * Per the above comment; we still have preemption disabled and
* come after the first percpu_down_read() and create the deadlock. * will thus decrement on the same CPU as we incremented.
*
* Note: returns with lock_is_held(brw->rw_sem) == T for lockdep,
* percpu_up_read() does rwsem_release(). This pairs with the usage
* of ->rw_sem in percpu_down/up_write().
*/ */
void percpu_down_read(struct percpu_rw_semaphore *brw) __percpu_up_read(sem);
{
might_sleep();
rwsem_acquire_read(&brw->rw_sem.dep_map, 0, 0, _RET_IP_);
if (likely(update_fast_ctr(brw, +1))) if (try)
return; return 0;
/* Avoid rwsem_acquire_read() and rwsem_release() */ /*
__down_read(&brw->rw_sem); * We either call schedule() in the wait, or we'll fall through
atomic_inc(&brw->slow_read_ctr); * and reschedule on the preempt_enable() in percpu_down_read().
__up_read(&brw->rw_sem); */
} preempt_enable_no_resched();
EXPORT_SYMBOL_GPL(percpu_down_read);
int percpu_down_read_trylock(struct percpu_rw_semaphore *brw) /*
{ * Avoid lockdep for the down/up_read() we already have them.
if (unlikely(!update_fast_ctr(brw, +1))) { */
if (!__down_read_trylock(&brw->rw_sem)) __down_read(&sem->rw_sem);
return 0; this_cpu_inc(*sem->read_count);
atomic_inc(&brw->slow_read_ctr); __up_read(&sem->rw_sem);
__up_read(&brw->rw_sem);
}
rwsem_acquire_read(&brw->rw_sem.dep_map, 0, 1, _RET_IP_); preempt_disable();
return 1; return 1;
} }
EXPORT_SYMBOL_GPL(__percpu_down_read);
void percpu_up_read(struct percpu_rw_semaphore *brw) void __percpu_up_read(struct percpu_rw_semaphore *sem)
{ {
rwsem_release(&brw->rw_sem.dep_map, 1, _RET_IP_); smp_mb(); /* B matches C */
/*
if (likely(update_fast_ctr(brw, -1))) * In other words, if they see our decrement (presumably to aggregate
return; * zero, as that is the only time it matters) they will also see our
* critical section.
*/
__this_cpu_dec(*sem->read_count);
/* false-positive is possible but harmless */ /* Prod writer to recheck readers_active */
if (atomic_dec_and_test(&brw->slow_read_ctr)) wake_up(&sem->writer);
wake_up_all(&brw->write_waitq);
} }
EXPORT_SYMBOL_GPL(percpu_up_read); EXPORT_SYMBOL_GPL(__percpu_up_read);
#define per_cpu_sum(var) \
({ \
typeof(var) __sum = 0; \
int cpu; \
compiletime_assert_atomic_type(__sum); \
for_each_possible_cpu(cpu) \
__sum += per_cpu(var, cpu); \
__sum; \
})
static int clear_fast_ctr(struct percpu_rw_semaphore *brw) /*
* Return true if the modular sum of the sem->read_count per-CPU variable is
* zero. If this sum is zero, then it is stable due to the fact that if any
* newly arriving readers increment a given counter, they will immediately
* decrement that same counter.
*/
static bool readers_active_check(struct percpu_rw_semaphore *sem)
{ {
unsigned int sum = 0; if (per_cpu_sum(*sem->read_count) != 0)
int cpu; return false;
/*
* If we observed the decrement; ensure we see the entire critical
* section.
*/
for_each_possible_cpu(cpu) { smp_mb(); /* C matches B */
sum += per_cpu(*brw->fast_read_ctr, cpu);
per_cpu(*brw->fast_read_ctr, cpu) = 0;
}
return sum; return true;
} }
void percpu_down_write(struct percpu_rw_semaphore *brw) void percpu_down_write(struct percpu_rw_semaphore *sem)
{ {
/* Notify readers to take the slow path. */
rcu_sync_enter(&sem->rss);
down_write(&sem->rw_sem);
/* /*
* Make rcu_sync_is_idle() == F and thus disable the fast-path in * Notify new readers to block; up until now, and thus throughout the
* percpu_down_read() and percpu_up_read(), and wait for gp pass. * longish rcu_sync_enter() above, new readers could still come in.
*
* The latter synchronises us with the preceding readers which used
* the fast-past, so we can not miss the result of __this_cpu_add()
* or anything else inside their criticial sections.
*/ */
rcu_sync_enter(&brw->rss); WRITE_ONCE(sem->readers_block, 1);
/* exclude other writers, and block the new readers completely */ smp_mb(); /* D matches A */
down_write(&brw->rw_sem);
/* nobody can use fast_read_ctr, move its sum into slow_read_ctr */ /*
atomic_add(clear_fast_ctr(brw), &brw->slow_read_ctr); * If they don't see our writer of readers_block, then we are
* guaranteed to see their sem->read_count increment, and therefore
* will wait for them.
*/
/* wait for all readers to complete their percpu_up_read() */ /* Wait for all now active readers to complete. */
wait_event(brw->write_waitq, !atomic_read(&brw->slow_read_ctr)); wait_event(sem->writer, readers_active_check(sem));
} }
EXPORT_SYMBOL_GPL(percpu_down_write); EXPORT_SYMBOL_GPL(percpu_down_write);
void percpu_up_write(struct percpu_rw_semaphore *brw) void percpu_up_write(struct percpu_rw_semaphore *sem)
{ {
/* release the lock, but the readers can't use the fast-path */
up_write(&brw->rw_sem);
/* /*
* Enable the fast-path in percpu_down_read() and percpu_up_read() * Signal the writer is done, no fast path yet.
* but only after another gp pass; this adds the necessary barrier *
* to ensure the reader can't miss the changes done by us. * One reason that we cannot just immediately flip to readers_fast is
* that new readers might fail to see the results of this writer's
* critical section.
*
* Therefore we force it through the slow path which guarantees an
* acquire and thereby guarantees the critical section's consistency.
*/
smp_store_release(&sem->readers_block, 0);
/*
* Release the write lock, this will allow readers back in the game.
*/
up_write(&sem->rw_sem);
/*
* Once this completes (at least one RCU-sched grace period hence) the
* reader fast path will be available again. Safe to use outside the
* exclusive write lock because its counting.
*/ */
rcu_sync_exit(&brw->rss); rcu_sync_exit(&sem->rss);
} }
EXPORT_SYMBOL_GPL(percpu_up_write);
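
The file-local per_cpu_sum() macro introduced above is worth calling out: a reader may take the lock on one CPU and release it on another, so an individual counter can go "negative"; only the modular sum over all CPUs is meaningful, and readers_active_check() explains why that sum is stable once it hits zero. A minimal sketch of the macro's use, with a made-up counter and reusing per_cpu_sum() from this file:

/* Illustrative only: 'example_count' is a made-up per-CPU counter. */
static DEFINE_PER_CPU(unsigned int, example_count);

static bool example_all_released(void)
{
	/* Stable only under the same rules as readers_active_check() above. */
	return per_cpu_sum(example_count) == 0;
}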
@@ -70,11 +70,14 @@ struct pv_node {
static inline bool pv_queued_spin_steal_lock(struct qspinlock *lock)
{
	struct __qspinlock *l = (void *)lock;
-	int ret = !(atomic_read(&lock->val) & _Q_LOCKED_PENDING_MASK) &&
-		  (cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0);

-	qstat_inc(qstat_pv_lock_stealing, ret);
-	return ret;
+	if (!(atomic_read(&lock->val) & _Q_LOCKED_PENDING_MASK) &&
+	    (cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
+		qstat_inc(qstat_pv_lock_stealing, true);
+		return true;
+	}
+
+	return false;
}

/*
@@ -257,7 +260,6 @@ static struct pv_node *pv_unhash(struct qspinlock *lock)
static inline bool
pv_wait_early(struct pv_node *prev, int loop)
{
-
	if ((loop & PV_PREV_CHECK_MASK) != 0)
		return false;

@@ -286,12 +288,10 @@ static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev)
{
	struct pv_node *pn = (struct pv_node *)node;
	struct pv_node *pp = (struct pv_node *)prev;
-	int waitcnt = 0;
	int loop;
	bool wait_early;

-	/* waitcnt processing will be compiled out if !QUEUED_LOCK_STAT */
-	for (;; waitcnt++) {
+	for (;;) {
		for (wait_early = false, loop = SPIN_THRESHOLD; loop; loop--) {
			if (READ_ONCE(node->locked))
				return;
@@ -315,7 +315,6 @@ static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev)
		if (!READ_ONCE(node->locked)) {
			qstat_inc(qstat_pv_wait_node, true);
-			qstat_inc(qstat_pv_wait_again, waitcnt);
			qstat_inc(qstat_pv_wait_early, wait_early);
			pv_wait(&pn->state, vcpu_halted);
		}
@@ -456,12 +455,9 @@ pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node)
		pv_wait(&l->locked, _Q_SLOW_VAL);

		/*
-		 * The unlocker should have freed the lock before kicking the
-		 * CPU. So if the lock is still not free, it is a spurious
-		 * wakeup or another vCPU has stolen the lock. The current
-		 * vCPU should spin again.
+		 * Because of lock stealing, the queue head vCPU may not be
+		 * able to acquire the lock before it has to wait again.
		 */
-		qstat_inc(qstat_pv_spurious_wakeup, READ_ONCE(l->locked));
	}

	/*
@@ -544,7 +540,7 @@ __visible void __pv_queued_spin_unlock(struct qspinlock *lock)
	 * unhash. Otherwise it would be possible to have multiple @lock
	 * entries, which would be BAD.
	 */
-	locked = cmpxchg(&l->locked, _Q_LOCKED_VAL, 0);
+	locked = cmpxchg_release(&l->locked, _Q_LOCKED_VAL, 0);
	if (likely(locked == _Q_LOCKED_VAL))
		return;
......
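
The only functional change in __pv_queued_spin_unlock() above is relaxing cmpxchg() to cmpxchg_release(): an unlock needs RELEASE ordering so the critical section cannot leak past the store, and the full barrier implied by a plain cmpxchg() is wasted work on weakly ordered architectures. A minimal sketch of that acquire/release pairing on a made-up flag (not the qspinlock word itself):

#include <linux/atomic.h>

static atomic_t example_locked = ATOMIC_INIT(0);

static bool example_trylock(void)
{
	/* ACQUIRE: the critical section may not hoist above this. */
	return atomic_cmpxchg_acquire(&example_locked, 0, 1) == 0;
}

static void example_unlock(void)
{
	/* RELEASE: the critical section may not sink below this. */
	atomic_cmpxchg_release(&example_locked, 1, 0);
}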
@@ -24,8 +24,8 @@
 * pv_latency_wake	- average latency (ns) from vCPU kick to wakeup
 * pv_lock_slowpath	- # of locking operations via the slowpath
 * pv_lock_stealing	- # of lock stealing operations
- * pv_spurious_wakeup	- # of spurious wakeups
- * pv_wait_again	- # of vCPU wait's that happened after a vCPU kick
+ * pv_spurious_wakeup	- # of spurious wakeups in non-head vCPUs
+ * pv_wait_again	- # of wait's after a queue head vCPU kick
 * pv_wait_early	- # of early vCPU wait's
 * pv_wait_head		- # of vCPU wait's at the queue head
 * pv_wait_node		- # of vCPU wait's at a non-head queue node
......
@@ -121,16 +121,19 @@ enum rwsem_wake_type {
* - woken process blocks are discarded from the list after having task zeroed * - woken process blocks are discarded from the list after having task zeroed
* - writers are only marked woken if downgrading is false * - writers are only marked woken if downgrading is false
*/ */
static struct rw_semaphore * static void __rwsem_mark_wake(struct rw_semaphore *sem,
__rwsem_mark_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type,
enum rwsem_wake_type wake_type, struct wake_q_head *wake_q) struct wake_q_head *wake_q)
{ {
struct rwsem_waiter *waiter; struct rwsem_waiter *waiter, *tmp;
struct task_struct *tsk; long oldcount, woken = 0, adjustment = 0;
struct list_head *next;
long oldcount, woken, loop, adjustment; /*
* Take a peek at the queue head waiter such that we can determine
* the wakeup(s) to perform.
*/
waiter = list_first_entry(&sem->wait_list, struct rwsem_waiter, list);
waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
if (waiter->type == RWSEM_WAITING_FOR_WRITE) { if (waiter->type == RWSEM_WAITING_FOR_WRITE) {
if (wake_type == RWSEM_WAKE_ANY) { if (wake_type == RWSEM_WAKE_ANY) {
/* /*
@@ -142,19 +145,19 @@ __rwsem_mark_wake(struct rw_semaphore *sem,
*/ */
wake_q_add(wake_q, waiter->task); wake_q_add(wake_q, waiter->task);
} }
goto out;
return;
} }
/* Writers might steal the lock before we grant it to the next reader. /*
* Writers might steal the lock before we grant it to the next reader.
* We prefer to do the first reader grant before counting readers * We prefer to do the first reader grant before counting readers
* so we can bail out early if a writer stole the lock. * so we can bail out early if a writer stole the lock.
*/ */
adjustment = 0;
if (wake_type != RWSEM_WAKE_READ_OWNED) { if (wake_type != RWSEM_WAKE_READ_OWNED) {
adjustment = RWSEM_ACTIVE_READ_BIAS; adjustment = RWSEM_ACTIVE_READ_BIAS;
try_reader_grant: try_reader_grant:
oldcount = atomic_long_fetch_add(adjustment, &sem->count); oldcount = atomic_long_fetch_add(adjustment, &sem->count);
if (unlikely(oldcount < RWSEM_WAITING_BIAS)) { if (unlikely(oldcount < RWSEM_WAITING_BIAS)) {
/* /*
* If the count is still less than RWSEM_WAITING_BIAS * If the count is still less than RWSEM_WAITING_BIAS
@@ -164,7 +167,8 @@ __rwsem_mark_wake(struct rw_semaphore *sem,
*/ */
if (atomic_long_add_return(-adjustment, &sem->count) < if (atomic_long_add_return(-adjustment, &sem->count) <
RWSEM_WAITING_BIAS) RWSEM_WAITING_BIAS)
goto out; return;
/* Last active locker left. Retry waking readers. */ /* Last active locker left. Retry waking readers. */
goto try_reader_grant; goto try_reader_grant;
} }
@@ -176,38 +180,23 @@ __rwsem_mark_wake(struct rw_semaphore *sem,
rwsem_set_reader_owned(sem); rwsem_set_reader_owned(sem);
} }
/* Grant an infinite number of read locks to the readers at the front /*
* of the queue. Note we increment the 'active part' of the count by * Grant an infinite number of read locks to the readers at the front
* the number of readers before waking any processes up. * of the queue. We know that woken will be at least 1 as we accounted
* for above. Note we increment the 'active part' of the count by the
* number of readers before waking any processes up.
*/ */
woken = 0; list_for_each_entry_safe(waiter, tmp, &sem->wait_list, list) {
do { struct task_struct *tsk;
woken++;
if (waiter->list.next == &sem->wait_list) if (waiter->type == RWSEM_WAITING_FOR_WRITE)
break; break;
waiter = list_entry(waiter->list.next, woken++;
struct rwsem_waiter, list);
} while (waiter->type != RWSEM_WAITING_FOR_WRITE);
adjustment = woken * RWSEM_ACTIVE_READ_BIAS - adjustment;
if (waiter->type != RWSEM_WAITING_FOR_WRITE)
/* hit end of list above */
adjustment -= RWSEM_WAITING_BIAS;
if (adjustment)
atomic_long_add(adjustment, &sem->count);
next = sem->wait_list.next;
loop = woken;
do {
waiter = list_entry(next, struct rwsem_waiter, list);
next = waiter->list.next;
tsk = waiter->task; tsk = waiter->task;
wake_q_add(wake_q, tsk); wake_q_add(wake_q, tsk);
list_del(&waiter->list);
/* /*
* Ensure that the last operation is setting the reader * Ensure that the last operation is setting the reader
* waiter to nil such that rwsem_down_read_failed() cannot * waiter to nil such that rwsem_down_read_failed() cannot
@@ -215,13 +204,16 @@ __rwsem_mark_wake(struct rw_semaphore *sem,
* to the task to wakeup. * to the task to wakeup.
*/ */
smp_store_release(&waiter->task, NULL); smp_store_release(&waiter->task, NULL);
} while (--loop); }
sem->wait_list.next = next; adjustment = woken * RWSEM_ACTIVE_READ_BIAS - adjustment;
next->prev = &sem->wait_list; if (list_empty(&sem->wait_list)) {
/* hit end of list above */
adjustment -= RWSEM_WAITING_BIAS;
}
out: if (adjustment)
return sem; atomic_long_add(adjustment, &sem->count);
} }
/* /*
@@ -235,7 +227,6 @@ struct rw_semaphore __sched *rwsem_down_read_failed(struct rw_semaphore *sem)
struct task_struct *tsk = current; struct task_struct *tsk = current;
WAKE_Q(wake_q); WAKE_Q(wake_q);
/* set up my own style of waitqueue */
waiter.task = tsk; waiter.task = tsk;
waiter.type = RWSEM_WAITING_FOR_READ; waiter.type = RWSEM_WAITING_FOR_READ;
@@ -247,7 +238,8 @@ struct rw_semaphore __sched *rwsem_down_read_failed(struct rw_semaphore *sem)
/* we're now waiting on the lock, but no longer actively locking */ /* we're now waiting on the lock, but no longer actively locking */
count = atomic_long_add_return(adjustment, &sem->count); count = atomic_long_add_return(adjustment, &sem->count);
/* If there are no active locks, wake the front queued process(es). /*
* If there are no active locks, wake the front queued process(es).
* *
* If there are no writers and we are first in the queue, * If there are no writers and we are first in the queue,
* wake our own waiter to join the existing active readers ! * wake our own waiter to join the existing active readers !
@@ -255,7 +247,7 @@ struct rw_semaphore __sched *rwsem_down_read_failed(struct rw_semaphore *sem)
if (count == RWSEM_WAITING_BIAS || if (count == RWSEM_WAITING_BIAS ||
(count > RWSEM_WAITING_BIAS && (count > RWSEM_WAITING_BIAS &&
adjustment != -RWSEM_ACTIVE_READ_BIAS)) adjustment != -RWSEM_ACTIVE_READ_BIAS))
sem = __rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q); __rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
raw_spin_unlock_irq(&sem->wait_lock); raw_spin_unlock_irq(&sem->wait_lock);
wake_up_q(&wake_q); wake_up_q(&wake_q);
@@ -505,7 +497,7 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
if (count > RWSEM_WAITING_BIAS) { if (count > RWSEM_WAITING_BIAS) {
WAKE_Q(wake_q); WAKE_Q(wake_q);
sem = __rwsem_mark_wake(sem, RWSEM_WAKE_READERS, &wake_q); __rwsem_mark_wake(sem, RWSEM_WAKE_READERS, &wake_q);
/* /*
* The wakeup is normally called _after_ the wait_lock * The wakeup is normally called _after_ the wait_lock
* is released, but given that we are proactively waking * is released, but given that we are proactively waking
@@ -614,9 +606,8 @@ struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
raw_spin_lock_irqsave(&sem->wait_lock, flags); raw_spin_lock_irqsave(&sem->wait_lock, flags);
locked: locked:
/* do nothing if list empty */
if (!list_empty(&sem->wait_list)) if (!list_empty(&sem->wait_list))
sem = __rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q); __rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
raw_spin_unlock_irqrestore(&sem->wait_lock, flags); raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
wake_up_q(&wake_q); wake_up_q(&wake_q);
@@ -638,9 +629,8 @@ struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
raw_spin_lock_irqsave(&sem->wait_lock, flags); raw_spin_lock_irqsave(&sem->wait_lock, flags);
/* do nothing if list empty */
if (!list_empty(&sem->wait_list)) if (!list_empty(&sem->wait_list))
sem = __rwsem_mark_wake(sem, RWSEM_WAKE_READ_OWNED, &wake_q); __rwsem_mark_wake(sem, RWSEM_WAKE_READ_OWNED, &wake_q);
raw_spin_unlock_irqrestore(&sem->wait_lock, flags); raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
wake_up_q(&wake_q); wake_up_q(&wake_q);
......
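
The __rwsem_mark_wake() rewrite above leans on the wake_q mechanism: waiters are collected while the wait lock is held and only actually woken after it is dropped. A stripped-down sketch of that pattern with made-up types and names (example_waiter, example_wake_all):

#include <linux/list.h>
#include <linux/sched.h>
#include <linux/spinlock.h>

struct example_waiter {
	struct list_head	list;
	struct task_struct	*task;
};

static void example_wake_all(spinlock_t *lock, struct list_head *waiters)
{
	struct example_waiter *w, *tmp;
	WAKE_Q(wake_q);

	spin_lock(lock);
	list_for_each_entry_safe(w, tmp, waiters, list) {
		list_del(&w->list);
		wake_q_add(&wake_q, w->task);
	}
	spin_unlock(lock);

	wake_up_q(&wake_q);	/* wakeups happen without the lock held */
}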
@@ -68,6 +68,8 @@ void rcu_sync_lockdep_assert(struct rcu_sync *rsp)
	RCU_LOCKDEP_WARN(!gp_ops[rsp->gp_type].held(),
			 "suspicious rcu_sync_is_idle() usage");
}
+EXPORT_SYMBOL_GPL(rcu_sync_lockdep_assert);
#endif

/**
@@ -82,6 +84,18 @@ void rcu_sync_init(struct rcu_sync *rsp, enum rcu_sync_type type)
	rsp->gp_type = type;
}

+/**
+ * Must be called after rcu_sync_init() and before first use.
+ *
+ * Ensures rcu_sync_is_idle() returns false and rcu_sync_{enter,exit}()
+ * pairs turn into NO-OPs.
+ */
+void rcu_sync_enter_start(struct rcu_sync *rsp)
+{
+	rsp->gp_count++;
+	rsp->gp_state = GP_PASSED;
+}
+
/**
 * rcu_sync_enter() - Force readers onto slowpath
 * @rsp: Pointer to rcu_sync structure to use for synchronization
......
@@ -20,7 +20,6 @@
#include <linux/kallsyms.h>
#include <linux/smpboot.h>
#include <linux/atomic.h>
-#include <linux/lglock.h>
#include <linux/nmi.h>

/*
@@ -47,13 +46,9 @@ struct cpu_stopper {
static DEFINE_PER_CPU(struct cpu_stopper, cpu_stopper);
static bool stop_machine_initialized = false;

-/*
- * Avoids a race between stop_two_cpus and global stop_cpus, where
- * the stoppers could get queued up in reverse order, leading to
- * system deadlock. Using an lglock means stop_two_cpus remains
- * relatively cheap.
- */
-DEFINE_STATIC_LGLOCK(stop_cpus_lock);
+/* static data for stop_cpus */
+static DEFINE_MUTEX(stop_cpus_mutex);
+static bool stop_cpus_in_progress;

static void cpu_stop_init_done(struct cpu_stop_done *done, unsigned int nr_todo)
{
@@ -230,14 +225,26 @@ static int cpu_stop_queue_two_works(int cpu1, struct cpu_stop_work *work1,
	struct cpu_stopper *stopper1 = per_cpu_ptr(&cpu_stopper, cpu1);
	struct cpu_stopper *stopper2 = per_cpu_ptr(&cpu_stopper, cpu2);
	int err;
-
-	lg_double_lock(&stop_cpus_lock, cpu1, cpu2);
+retry:
	spin_lock_irq(&stopper1->lock);
	spin_lock_nested(&stopper2->lock, SINGLE_DEPTH_NESTING);

	err = -ENOENT;
	if (!stopper1->enabled || !stopper2->enabled)
		goto unlock;
+	/*
+	 * Ensure that if we race with __stop_cpus() the stoppers won't get
+	 * queued up in reverse order leading to system deadlock.
+	 *
+	 * We can't miss stop_cpus_in_progress if queue_stop_cpus_work() has
+	 * queued a work on cpu1 but not on cpu2, we hold both locks.
+	 *
+	 * It can be falsely true but it is safe to spin until it is cleared,
+	 * queue_stop_cpus_work() does everything under preempt_disable().
+	 */
+	err = -EDEADLK;
+	if (unlikely(stop_cpus_in_progress))
+		goto unlock;

	err = 0;
	__cpu_stop_queue_work(stopper1, work1);
@@ -245,8 +252,12 @@ static int cpu_stop_queue_two_works(int cpu1, struct cpu_stop_work *work1,
unlock:
	spin_unlock(&stopper2->lock);
	spin_unlock_irq(&stopper1->lock);
-	lg_double_unlock(&stop_cpus_lock, cpu1, cpu2);

+	if (unlikely(err == -EDEADLK)) {
+		while (stop_cpus_in_progress)
+			cpu_relax();
+		goto retry;
+	}
	return err;
}

/**
@@ -316,9 +327,6 @@ bool stop_one_cpu_nowait(unsigned int cpu, cpu_stop_fn_t fn, void *arg,
	return cpu_stop_queue_work(cpu, work_buf);
}

-/* static data for stop_cpus */
-static DEFINE_MUTEX(stop_cpus_mutex);
-
static bool queue_stop_cpus_work(const struct cpumask *cpumask,
				 cpu_stop_fn_t fn, void *arg,
				 struct cpu_stop_done *done)
@@ -332,7 +340,8 @@ static bool queue_stop_cpus_work(const struct cpumask *cpumask,
	 * preempted by a stopper which might wait for other stoppers
	 * to enter @fn which can lead to deadlock.
	 */
-	lg_global_lock(&stop_cpus_lock);
+	preempt_disable();
+	stop_cpus_in_progress = true;
	for_each_cpu(cpu, cpumask) {
		work = &per_cpu(cpu_stopper.stop_work, cpu);
		work->fn = fn;
@@ -341,7 +350,8 @@ static bool queue_stop_cpus_work(const struct cpumask *cpumask,
		if (cpu_stop_queue_work(cpu, work))
			queued = true;
	}
-	lg_global_unlock(&stop_cpus_lock);
+	stop_cpus_in_progress = false;
+	preempt_enable();

	return queued;
}
......
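
What replaces lg_double_lock()/lg_global_lock() above is a plain flag plus retry: queue_stop_cpus_work() publishes stop_cpus_in_progress for the duration of its preemption-disabled queueing loop, and cpu_stop_queue_two_works() backs off and spins until the flag clears. A generic sketch of that scheme with made-up names (example_*); the real code's locking details are elided in the comments:

static bool example_in_progress;

static void example_queue_on_all_cpus(void)
{
	preempt_disable();		/* keeps the window short and non-preemptible */
	example_in_progress = true;
	/* ... queue one work item per CPU, in a fixed order ... */
	example_in_progress = false;
	preempt_enable();
}

static int example_queue_on_two_cpus(void)
{
retry:
	/* ... take both per-CPU locks ... */
	if (unlikely(READ_ONCE(example_in_progress))) {
		/* ... drop both locks ... */
		while (READ_ONCE(example_in_progress))
			cpu_relax();
		goto retry;
	}
	/* ... queue the two work items, then drop both locks ... */
	return 0;
}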