Commit 6dfc8897 authored by Ingo Molnar, committed by Linus Torvalds

[PATCH] shared thread signals

Support POSIX-compliant thread signals at the kernel level, with usable
debugging (broadcast SIGSTOP, SIGCONT) and thread group management
(broadcast SIGKILL), and load-balance 'process' signals between
threads for better signal performance.

Changes:

- POSIX thread semantics for signals

there are 7 'types' of actions a signal can take: specific, load-balance,
kill-all, kill-all+core, stop-all, continue-all and ignore. Per the POSIX
specification, each signal has one of these types defined for both the
'handler defined' and the 'handler not defined (kernel default)' case.
Here is the table:

 ----------------------------------------------------------
 |                    |  userspace       |  kernel        |
 ----------------------------------------------------------
 |  SIGHUP            |  load-balance    |  kill-all      |
 |  SIGINT            |  load-balance    |  kill-all      |
 |  SIGQUIT           |  load-balance    |  kill-all+core |
 |  SIGILL            |  specific        |  kill-all+core |
 |  SIGTRAP           |  specific        |  kill-all+core |
 |  SIGABRT/SIGIOT    |  specific        |  kill-all+core |
 |  SIGBUS            |  specific        |  kill-all+core |
 |  SIGFPE            |  specific        |  kill-all+core |
 |  SIGKILL           |  n/a             |  kill-all      |
 |  SIGUSR1           |  load-balance    |  kill-all      |
 |  SIGSEGV           |  specific        |  kill-all+core |
 |  SIGUSR2           |  load-balance    |  kill-all      |
 |  SIGPIPE           |  specific        |  kill-all      |
 |  SIGALRM           |  load-balance    |  kill-all      |
 |  SIGTERM           |  load-balance    |  kill-all      |
 |  SIGCHLD           |  load-balance    |  ignore        |
 |  SIGCONT           |  load-balance    |  continue-all  |
 |  SIGSTOP           |  n/a             |  stop-all      |
 |  SIGTSTP           |  load-balance    |  stop-all      |
 |  SIGTTIN           |  load-balance    |  stop-all      |
 |  SIGTTOU           |  load-balance    |  stop-all      |
 |  SIGURG            |  load-balance    |  ignore        |
 |  SIGXCPU           |  specific        |  kill-all+core |
 |  SIGXFSZ           |  specific        |  kill-all+core |
 |  SIGVTALRM         |  load-balance    |  kill-all      |
 |  SIGPROF           |  specific        |  kill-all      |
 |  SIGPOLL/SIGIO     |  load-balance    |  kill-all      |
 |  SIGSYS/SIGUNUSED  |  specific        |  kill-all+core |
 |  SIGSTKFLT         |  specific        |  kill-all      |
 |  SIGWINCH          |  load-balance    |  ignore        |
 |  SIGPWR            |  load-balance    |  kill-all      |
 |  SIGRTMIN-SIGRTMAX |  load-balance    |  kill-all      |
 ----------------------------------------------------------

as you can see from the table, signals that have handlers defined never
get broadcast - they are either specific or load-balanced.

- CLONE_THREAD implies CLONE_SIGHAND

It does not make much sense to have a thread group that does not share
signal handlers. In fact, in the patch I'm using the signal spinlock to
lock access to the thread group. I made the siglock IRQ-safe, so we can
load-balance signals from interrupt contexts as well. (We cannot take the
tasklist lock in write mode from IRQ handlers.)

This is not as clean as I'd like it to be, but it's the best I could come
up with so far.

- thread group list management reworked.

Threads are now removed from the group when the thread is unhashed from the
PID table. This makes the most sense, and it also helps another feature
that relies on an intact thread group list: multithreaded coredumps.

- child reparenting reworked.

The O(N) algorithm in forget_original_parent() causes massive performance
problems if a large number of threads exit from the group. Performance
improves more than 10-fold if the following simple rules are used
instead:

 - reparent children to the *previous* thread [exiting or not]
 - if a thread is detached then reparent to init.

- fast broadcasting of kernel-internal SIGSTOP, SIGCONT, SIGKILL, etc.

Kernel-internal broadcast signals are a potential DoS problem, since
they could generate massive numbers of GFP_ATOMIC allocations of siginfo
structures. The important observation is that the siginfo structure does
not actually have to be allocated and queued: the signal processing code
has all the information it needs, and none of these signals carries any
payload in the siginfo structure. This makes a broadcast SIGKILL a
very simple operation: every thread gets bit 9 (SIGKILL) set in its
pending bitmask. The speedup due to this was significant - and the
robustness win is invaluable.

- sys_execve() should not kill off 'all other' threads.

The 'exec kills all threads if the master thread does the exec()' rule is a
POSIX(-ish) thing that should not be hardcoded in the kernel in this case.

To handle POSIX exec() semantics, glibc uses a special syscall that
kills 'all but self' threads: sys_exit_allbutself().

The straightforward exec() implementation just calls sys_exit_allbutself()
and then sys_execve().

(This syscall is also used internally when the thread group leader
sys_exit()s or sys_execve()s, to ensure the integrity of the thread
group.)
parent 36780249
......@@ -504,6 +504,8 @@ static inline int make_private_signals(void)
{
struct signal_struct * newsig;
remove_thread_group(current, current->sig);
if (atomic_read(&current->sig->count) <= 1)
return 0;
newsig = kmem_cache_alloc(sigact_cachep, GFP_KERNEL);
......@@ -575,42 +577,10 @@ static inline void flush_old_files(struct files_struct * files)
*/
static void de_thread(struct task_struct *tsk)
{
struct task_struct *sub;
struct list_head *head, *ptr;
struct siginfo info;
int pause;
write_lock_irq(&tasklist_lock);
if (tsk->tgid != tsk->pid) {
/* subsidiary thread - just escapes the group */
list_del_init(&tsk->thread_group);
tsk->tgid = tsk->pid;
pause = 0;
}
else {
/* master thread - kill all subsidiary threads */
info.si_signo = SIGKILL;
info.si_errno = 0;
info.si_code = SI_DETHREAD;
info.si_pid = current->pid;
info.si_uid = current->uid;
head = tsk->thread_group.next;
list_del_init(&tsk->thread_group);
list_for_each(ptr,head) {
sub = list_entry(ptr,struct task_struct,thread_group);
send_sig_info(SIGKILL,&info,sub);
}
pause = 1;
}
write_unlock_irq(&tasklist_lock);
/* give the subsidiary threads a chance to clean themselves up */
if (pause) yield();
if (!list_empty(&tsk->thread_group))
BUG();
/* An exec() starts a new thread group: */
tsk->tgid = tsk->pid;
}
int flush_old_exec(struct linux_binprm * bprm)
......@@ -633,6 +603,8 @@ int flush_old_exec(struct linux_binprm * bprm)
if (retval) goto mmap_failed;
/* This is the point of no return */
de_thread(current);
release_old_signals(oldsig);
current->sas_ss_sp = current->sas_ss_size = 0;
......@@ -651,9 +623,6 @@ int flush_old_exec(struct linux_binprm * bprm)
flush_thread();
if (!list_empty(&current->thread_group))
de_thread(current);
if (bprm->e_uid != current->euid || bprm->e_gid != current->egid ||
permission(bprm->file->f_dentry->d_inode,MAY_READ))
current->mm->dumpable = 0;
......
......@@ -158,6 +158,8 @@ typedef struct {
#define rwlock_init(x) do { *(x) = RW_LOCK_UNLOCKED; } while(0)
#define rwlock_is_locked(x) ((x)->lock != RW_LOCK_BIAS)
/*
* On x86, we implement read-write locks as a 32-bit counter
* with the high bit (sign) being the "contended" bit.
......
......@@ -211,6 +211,11 @@ struct signal_struct {
atomic_t count;
struct k_sigaction action[_NSIG];
spinlock_t siglock;
/* current thread group signal load-balancing target: */
task_t *curr_target;
struct sigpending shared_pending;
};
/*
......@@ -356,7 +361,7 @@ struct task_struct {
spinlock_t sigmask_lock; /* Protects signal and blocked */
struct signal_struct *sig;
sigset_t blocked;
sigset_t blocked, real_blocked, shared_unblocked;
struct sigpending pending;
unsigned long sas_ss_sp;
......@@ -431,6 +436,7 @@ extern void set_cpus_allowed(task_t *p, unsigned long new_mask);
extern void set_user_nice(task_t *p, long nice);
extern int task_prio(task_t *p);
extern int task_nice(task_t *p);
extern int task_curr(task_t *p);
extern int idle_cpu(int cpu);
void yield(void);
......@@ -535,7 +541,7 @@ extern void proc_caches_init(void);
extern void flush_signals(struct task_struct *);
extern void flush_signal_handlers(struct task_struct *);
extern void sig_exit(int, int, struct siginfo *);
extern int dequeue_signal(sigset_t *, siginfo_t *);
extern int dequeue_signal(struct sigpending *pending, sigset_t *mask, siginfo_t *info);
extern void block_all_signals(int (*notifier)(void *priv), void *priv,
sigset_t *mask);
extern void unblock_all_signals(void);
......@@ -654,6 +660,7 @@ extern void exit_thread(void);
extern void exit_mm(struct task_struct *);
extern void exit_files(struct task_struct *);
extern void exit_sighand(struct task_struct *);
extern void remove_thread_group(struct task_struct *tsk, struct signal_struct *sig);
extern void reparent_to_init(void);
extern void daemonize(void);
......@@ -786,8 +793,29 @@ static inline struct task_struct *younger_sibling(struct task_struct *p)
#define for_each_thread(task) \
for (task = next_thread(current) ; task != current ; task = next_thread(task))
#define next_thread(p) \
list_entry((p)->thread_group.next, struct task_struct, thread_group)
static inline task_t *next_thread(task_t *p)
{
if (!p->sig)
BUG();
#if CONFIG_SMP
if (!spin_is_locked(&p->sig->siglock) &&
!rwlock_is_locked(&tasklist_lock))
BUG();
#endif
return list_entry((p)->thread_group.next, task_t, thread_group);
}
static inline task_t *prev_thread(task_t *p)
{
if (!p->sig)
BUG();
#if CONFIG_SMP
if (!spin_is_locked(&p->sig->siglock) &&
!rwlock_is_locked(&tasklist_lock))
BUG();
#endif
return list_entry((p)->thread_group.prev, task_t, thread_group);
}
#define thread_group_leader(p) (p->pid == p->tgid)
......@@ -903,21 +931,8 @@ static inline void cond_resched(void)
This is required every time the blocked sigset_t changes.
All callers should hold t->sigmask_lock. */
static inline void recalc_sigpending_tsk(struct task_struct *t)
{
if (has_pending_signals(&t->pending.signal, &t->blocked))
set_tsk_thread_flag(t, TIF_SIGPENDING);
else
clear_tsk_thread_flag(t, TIF_SIGPENDING);
}
static inline void recalc_sigpending(void)
{
if (has_pending_signals(&current->pending.signal, &current->blocked))
set_thread_flag(TIF_SIGPENDING);
else
clear_thread_flag(TIF_SIGPENDING);
}
extern FASTCALL(void recalc_sigpending_tsk(struct task_struct *t));
extern void recalc_sigpending(void);
/*
* Wrappers for p->thread_info->cpu access. No-op on UP.
......
......@@ -36,7 +36,6 @@ static inline void __unhash_process(struct task_struct *p)
nr_threads--;
unhash_pid(p);
REMOVE_LINKS(p);
list_del(&p->thread_group);
p->pid = 0;
proc_dentry = p->proc_dentry;
if (unlikely(proc_dentry != NULL)) {
......@@ -73,6 +72,7 @@ static void release_task(struct task_struct * p)
}
BUG_ON(!list_empty(&p->ptrace_list) || !list_empty(&p->ptrace_children));
unhash_process(p);
exit_sighand(p);
release_thread(p);
if (p != current) {
......@@ -244,7 +244,8 @@ void daemonize(void)
static void reparent_thread(task_t *p, task_t *reaper, task_t *child_reaper)
{
/* We dont want people slaying init */
p->exit_signal = SIGCHLD;
if (p->exit_signal != -1)
p->exit_signal = SIGCHLD;
p->self_exec_id++;
/* Make sure we're not reparenting to ourselves */
......@@ -412,18 +413,15 @@ void exit_mm(struct task_struct *tsk)
*/
static inline void forget_original_parent(struct task_struct * father)
{
struct task_struct *p, *reaper;
struct task_struct *p, *reaper = father;
struct list_head *_p;
read_lock(&tasklist_lock);
write_lock_irq(&tasklist_lock);
/* Next in our thread group, if they're not already exiting */
reaper = father;
do {
reaper = next_thread(reaper);
if (!(reaper->flags & PF_EXITING))
break;
} while (reaper != father);
if (father->exit_signal != -1)
reaper = prev_thread(reaper);
else
reaper = child_reaper;
if (reaper == father)
reaper = child_reaper;
......@@ -444,7 +442,7 @@ static inline void forget_original_parent(struct task_struct * father)
p = list_entry(_p,struct task_struct,ptrace_list);
reparent_thread(p, reaper, child_reaper);
}
read_unlock(&tasklist_lock);
write_unlock_irq(&tasklist_lock);
}
static inline void zap_thread(task_t *p, task_t *father, int traced)
......@@ -604,7 +602,6 @@ NORET_TYPE void do_exit(long code)
__exit_files(tsk);
__exit_fs(tsk);
exit_namespace(tsk);
exit_sighand(tsk);
exit_thread();
if (current->leader)
......@@ -763,6 +760,8 @@ asmlinkage long sys_wait4(pid_t pid,unsigned int * stat_addr, int options, struc
if (options & __WNOTHREAD)
break;
tsk = next_thread(tsk);
if (tsk->sig != current->sig)
BUG();
} while (tsk != current);
read_unlock(&tasklist_lock);
if (flag) {
......
......@@ -630,6 +630,9 @@ static inline int copy_sighand(unsigned long clone_flags, struct task_struct * t
spin_lock_init(&sig->siglock);
atomic_set(&sig->count, 1);
memcpy(tsk->sig->action, current->sig->action, sizeof(tsk->sig->action));
sig->curr_target = NULL;
init_sigpending(&sig->shared_pending);
return 0;
}
......@@ -664,6 +667,12 @@ static struct task_struct *copy_process(unsigned long clone_flags,
if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
return ERR_PTR(-EINVAL);
/*
* Thread groups must share signals as well:
*/
if (clone_flags & CLONE_THREAD)
clone_flags |= CLONE_SIGHAND;
retval = security_ops->task_create(clone_flags);
if (retval)
goto fork_out;
......@@ -843,8 +852,10 @@ static struct task_struct *copy_process(unsigned long clone_flags,
p->parent = p->real_parent;
if (clone_flags & CLONE_THREAD) {
spin_lock(&current->sig->siglock);
p->tgid = current->tgid;
list_add(&p->thread_group, &current->thread_group);
spin_unlock(&current->sig->siglock);
}
SET_LINKS(p);
......
......@@ -1335,6 +1335,15 @@ int task_nice(task_t *p)
return TASK_NICE(p);
}
/**
* task_curr - is this task currently executing on a CPU?
* @p: the task in question.
*/
int task_curr(task_t *p)
{
return cpu_curr(task_cpu(p)) == p;
}
/**
* idle_cpu - is a given cpu idle currently?
* @cpu: the processor in question.
......