Commit 75063600 authored by Linus Torvalds's avatar Linus Torvalds

Merge branch 'futexes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

* 'futexes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  futex: fix restart in wait_requeue_pi
  futex: fix restart for early wakeup in futex_wait_requeue_pi()
  futex: cleanup error exit
  futex: remove the wait queue
  futex: add requeue-pi documentation
  futex: remove FUTEX_REQUEUE_PI (non CMP)
  futex: fix futex_wait_setup key handling
  sparc64: extend TI_RESTART_BLOCK space by 8 bytes
  futex: fixup unlocked requeue pi case
  futex: add requeue_pi functionality
  futex: split out futex value validation code
  futex: distangle futex_requeue()
  futex: add FUTEX_HAS_TIMEOUT flag to restart.futex.flags
  rt_mutex: add proxy lock routines
  futex: split out fixup owner logic from futex_lock_pi()
  futex: split out atomic logic from futex_lock_pi()
  futex: add helper to find the top prio waiter of a futex
  futex: separate futex_wait_queue_me() logic from futex_wait()
parents be15f9d6 2070887f
Futex Requeue PI
----------------
Requeueing of tasks from a non-PI futex to a PI futex requires
special handling in order to ensure the underlying rt_mutex is never
left without an owner if it has waiters; doing so would break the PI
boosting logic [see rt-mutex-desgin.txt] For the purposes of
brevity, this action will be referred to as "requeue_pi" throughout
this document. Priority inheritance is abbreviated throughout as
"PI".
Motivation
----------
Without requeue_pi, the glibc implementation of
pthread_cond_broadcast() must resort to waking all the tasks waiting
on a pthread_condvar and letting them try to sort out which task
gets to run first in classic thundering-herd formation. An ideal
implementation would wake the highest-priority waiter, and leave the
rest to the natural wakeup inherent in unlocking the mutex
associated with the condvar.
Consider the simplified glibc calls:
/* caller must lock mutex */
pthread_cond_wait(cond, mutex)
{
lock(cond->__data.__lock);
unlock(mutex);
do {
unlock(cond->__data.__lock);
futex_wait(cond->__data.__futex);
lock(cond->__data.__lock);
} while(...)
unlock(cond->__data.__lock);
lock(mutex);
}
pthread_cond_broadcast(cond)
{
lock(cond->__data.__lock);
unlock(cond->__data.__lock);
futex_requeue(cond->data.__futex, cond->mutex);
}
Once pthread_cond_broadcast() requeues the tasks, the cond->mutex
has waiters. Note that pthread_cond_wait() attempts to lock the
mutex only after it has returned to user space. This will leave the
underlying rt_mutex with waiters, and no owner, breaking the
previously mentioned PI-boosting algorithms.
In order to support PI-aware pthread_condvar's, the kernel needs to
be able to requeue tasks to PI futexes. This support implies that
upon a successful futex_wait system call, the caller would return to
user space already holding the PI futex. The glibc implementation
would be modified as follows:
/* caller must lock mutex */
pthread_cond_wait_pi(cond, mutex)
{
lock(cond->__data.__lock);
unlock(mutex);
do {
unlock(cond->__data.__lock);
futex_wait_requeue_pi(cond->__data.__futex);
lock(cond->__data.__lock);
} while(...)
unlock(cond->__data.__lock);
/* the kernel acquired the the mutex for us */
}
pthread_cond_broadcast_pi(cond)
{
lock(cond->__data.__lock);
unlock(cond->__data.__lock);
futex_requeue_pi(cond->data.__futex, cond->mutex);
}
The actual glibc implementation will likely test for PI and make the
necessary changes inside the existing calls rather than creating new
calls for the PI cases. Similar changes are needed for
pthread_cond_timedwait() and pthread_cond_signal().
Implementation
--------------
In order to ensure the rt_mutex has an owner if it has waiters, it
is necessary for both the requeue code, as well as the waiting code,
to be able to acquire the rt_mutex before returning to user space.
The requeue code cannot simply wake the waiter and leave it to
acquire the rt_mutex as it would open a race window between the
requeue call returning to user space and the waiter waking and
starting to run. This is especially true in the uncontended case.
The solution involves two new rt_mutex helper routines,
rt_mutex_start_proxy_lock() and rt_mutex_finish_proxy_lock(), which
allow the requeue code to acquire an uncontended rt_mutex on behalf
of the waiter and to enqueue the waiter on a contended rt_mutex.
Two new system calls provide the kernel<->user interface to
requeue_pi: FUTEX_WAIT_REQUEUE_PI and FUTEX_REQUEUE_CMP_PI.
FUTEX_WAIT_REQUEUE_PI is called by the waiter (pthread_cond_wait()
and pthread_cond_timedwait()) to block on the initial futex and wait
to be requeued to a PI-aware futex. The implementation is the
result of a high-speed collision between futex_wait() and
futex_lock_pi(), with some extra logic to check for the additional
wake-up scenarios.
FUTEX_REQUEUE_CMP_PI is called by the waker
(pthread_cond_broadcast() and pthread_cond_signal()) to requeue and
possibly wake the waiting tasks. Internally, this system call is
still handled by futex_requeue (by passing requeue_pi=1). Before
requeueing, futex_requeue() attempts to acquire the requeue target
PI futex on behalf of the top waiter. If it can, this waiter is
woken. futex_requeue() then proceeds to requeue the remaining
nr_wake+nr_requeue tasks to the PI futex, calling
rt_mutex_start_proxy_lock() prior to each requeue to prepare the
task as a waiter on the underlying rt_mutex. It is possible that
the lock can be acquired at this stage as well, if so, the next
waiter is woken to finish the acquisition of the lock.
FUTEX_REQUEUE_PI accepts nr_wake and nr_requeue as arguments, but
their sum is all that really matters. futex_requeue() will wake or
requeue up to nr_wake + nr_requeue tasks. It will wake only as many
tasks as it can acquire the lock for, which in the majority of cases
should be 0 as good programming practice dictates that the caller of
either pthread_cond_broadcast() or pthread_cond_signal() acquire the
mutex prior to making the call. FUTEX_REQUEUE_PI requires that
nr_wake=1. nr_requeue should be INT_MAX for broadcast and 0 for
signal.
...@@ -102,8 +102,8 @@ struct thread_info { ...@@ -102,8 +102,8 @@ struct thread_info {
#define TI_KERN_CNTD1 0x00000488 #define TI_KERN_CNTD1 0x00000488
#define TI_PCR 0x00000490 #define TI_PCR 0x00000490
#define TI_RESTART_BLOCK 0x00000498 #define TI_RESTART_BLOCK 0x00000498
#define TI_KUNA_REGS 0x000004c0 #define TI_KUNA_REGS 0x000004c8
#define TI_KUNA_INSN 0x000004c8 #define TI_KUNA_INSN 0x000004d0
#define TI_FPREGS 0x00000500 #define TI_FPREGS 0x00000500
/* We embed this in the uppermost byte of thread_info->flags */ /* We embed this in the uppermost byte of thread_info->flags */
......
...@@ -23,6 +23,8 @@ union ktime; ...@@ -23,6 +23,8 @@ union ktime;
#define FUTEX_TRYLOCK_PI 8 #define FUTEX_TRYLOCK_PI 8
#define FUTEX_WAIT_BITSET 9 #define FUTEX_WAIT_BITSET 9
#define FUTEX_WAKE_BITSET 10 #define FUTEX_WAKE_BITSET 10
#define FUTEX_WAIT_REQUEUE_PI 11
#define FUTEX_CMP_REQUEUE_PI 12
#define FUTEX_PRIVATE_FLAG 128 #define FUTEX_PRIVATE_FLAG 128
#define FUTEX_CLOCK_REALTIME 256 #define FUTEX_CLOCK_REALTIME 256
...@@ -38,6 +40,10 @@ union ktime; ...@@ -38,6 +40,10 @@ union ktime;
#define FUTEX_TRYLOCK_PI_PRIVATE (FUTEX_TRYLOCK_PI | FUTEX_PRIVATE_FLAG) #define FUTEX_TRYLOCK_PI_PRIVATE (FUTEX_TRYLOCK_PI | FUTEX_PRIVATE_FLAG)
#define FUTEX_WAIT_BITSET_PRIVATE (FUTEX_WAIT_BITS | FUTEX_PRIVATE_FLAG) #define FUTEX_WAIT_BITSET_PRIVATE (FUTEX_WAIT_BITS | FUTEX_PRIVATE_FLAG)
#define FUTEX_WAKE_BITSET_PRIVATE (FUTEX_WAKE_BITS | FUTEX_PRIVATE_FLAG) #define FUTEX_WAKE_BITSET_PRIVATE (FUTEX_WAKE_BITS | FUTEX_PRIVATE_FLAG)
#define FUTEX_WAIT_REQUEUE_PI_PRIVATE (FUTEX_WAIT_REQUEUE_PI | \
FUTEX_PRIVATE_FLAG)
#define FUTEX_CMP_REQUEUE_PI_PRIVATE (FUTEX_CMP_REQUEUE_PI | \
FUTEX_PRIVATE_FLAG)
/* /*
* Support for robust futexes: the kernel cleans up held futexes at * Support for robust futexes: the kernel cleans up held futexes at
......
...@@ -21,13 +21,14 @@ struct restart_block { ...@@ -21,13 +21,14 @@ struct restart_block {
struct { struct {
unsigned long arg0, arg1, arg2, arg3; unsigned long arg0, arg1, arg2, arg3;
}; };
/* For futex_wait */ /* For futex_wait and futex_wait_requeue_pi */
struct { struct {
u32 *uaddr; u32 *uaddr;
u32 val; u32 val;
u32 flags; u32 flags;
u32 bitset; u32 bitset;
u64 time; u64 time;
u32 *uaddr2;
} futex; } futex;
/* For nanosleep */ /* For nanosleep */
struct { struct {
......
This diff is collapsed.
This diff is collapsed.
...@@ -120,6 +120,14 @@ extern void rt_mutex_init_proxy_locked(struct rt_mutex *lock, ...@@ -120,6 +120,14 @@ extern void rt_mutex_init_proxy_locked(struct rt_mutex *lock,
struct task_struct *proxy_owner); struct task_struct *proxy_owner);
extern void rt_mutex_proxy_unlock(struct rt_mutex *lock, extern void rt_mutex_proxy_unlock(struct rt_mutex *lock,
struct task_struct *proxy_owner); struct task_struct *proxy_owner);
extern int rt_mutex_start_proxy_lock(struct rt_mutex *lock,
struct rt_mutex_waiter *waiter,
struct task_struct *task,
int detect_deadlock);
extern int rt_mutex_finish_proxy_lock(struct rt_mutex *lock,
struct hrtimer_sleeper *to,
struct rt_mutex_waiter *waiter,
int detect_deadlock);
#ifdef CONFIG_DEBUG_RT_MUTEXES #ifdef CONFIG_DEBUG_RT_MUTEXES
# include "rtmutex-debug.h" # include "rtmutex-debug.h"
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment