[PATCH] signal-fixes-2.5.59-A4

this is the current threading patchset, which accumulated during the
past two weeks. It consists of a big set of changes from Roland, to
make threaded signals work. There were still tons of testcases and
boundary conditions (mostly in the signal/exit/ptrace area) that we did
not handle correctly.

Roland's thread-signal semantics/behavior/ptrace fixes:

 - fix signal delivery race with do_exit() => signals are re-queued to
   the 'process' if do_exit() finds pending unhandled ones. This
   prevents signals getting lost upon thread-sys_exit().

 - a non-main thread has died on one processor and gone to TASK_ZOMBIE,
   but before it's gotten to release_task a sys_wait4 on the other
   processor reaps it. It's only because it's ptraced that this gets
   through eligible_child. Somewhere in there the main thread is also
   dying so it reparents the child thread to hit that case. This means
   that there is a race where P might be totally invalid.

 - forget_original_parent is not doing the right thing when the group
   leader dies, i.e. reparenting threads to init when there is a zombie
   group leader. Perhaps it doesn't matter for any practical purpose
   without ptrace, though it makes for ppid=1 for each thread in core
   dumps, which looks funny. Incidentally, SIGCHLD here really should
   be p->exit_signal.

 - one of the gdb tests makes a questionable assumption about what kill
   will do when it has some threads stopped by ptrace and others
   running.

exit races:

1. Processor A is in sys_wait4 case TASK_STOPPED considering task P.
   Processor B is about to resume P and then switch to it.

   While A is inside that case block, B starts running P and it clears
   P->exit_code, or takes a pending fatal signal and sets it to a new
   value. Depending on the interleaving, the possible failure modes are:

   a. A gets to its put_user after B has cleared P->exit_code
      => returns with WIFSTOPPED, WSTOPSIG==0
   b. A gets to its put_user after B has set P->exit_code anew
      => returns with e.g. WIFSTOPPED, WSTOPSIG==SIGKILL

   A can spend an arbitrarily long time in that case block, because
   there's getrusage and put_user that can take page faults, and
   write_lock'ing of the tasklist_lock that can block. But even if it's
   short the race is there in principle.

2. This is new with NPTL, i.e. CLONE_THREAD.
   Two processors A and B are both in sys_wait4 case TASK_STOPPED
   considering task P.

   Both get through their tests and fetches of P->exit_code before
   either gets to P->exit_code = 0. => two threads return the same pid
   from waitpid.

   In other interleavings where one processor gets to its put_user
   after the other has cleared P->exit_code, it's like case 1(a).

3. SMP races with stop/cont signals

   First, take:

        kill(pid, SIGSTOP);
        kill(pid, SIGCONT);

   or:

        kill(pid, SIGSTOP);
        kill(pid, SIGKILL);

   It's possible for this to leave the process stopped with a pending
   SIGCONT/SIGKILL. That's a state that should never be possible.
   Moreover, kill(pid, SIGKILL) without any repetition should always be
   enough to kill a process. (Likewise SIGCONT, when you know it's
   sequenced after the last stop signal, must be sufficient to resume a
   process.)

4. take:

        kill(pid, SIGKILL);     // or any fatal signal
        kill(pid, SIGCONT);     // or SIGKILL

   it's possible for this to cause pid to be reaped with status 0
   instead of its true termination status. The equivalent scenario
   happens when the process being killed is in an _exit call or a
   trap-induced fatal signal before the kills.

plus i've done stability fixes for bugs that popped up during
beta-testing, and minor tidying of Roland's changes:

 - a rare tasklist corruption during exec, causing some very spurious
   and colorful crashes.

 - a copy_process()-related dereference of already freed thread
   structure if hit with a SIGKILL at the wrong moment.

 - SMP spinlock deadlocks in the signal code

this patchset has been tested quite well in the 2.4 backport of the
threading changes - and i've done some stresstesting on 2.5.59 SMP as
well, and did an x86 UP testcompile + testboot as well.
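
The signal sequences in exit races 3 and 4 can be exercised from user space. The following is only a minimal sketch, not part of the patch or of any test suite mentioned above: the helper names `spawn_idle_child` and `kill_pair_status` and the 10-second timeout are assumptions made for illustration. It sends two signals back to back to a freshly forked child and returns the status that waitpid() reports.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that idles in pause() forever. */
static pid_t spawn_idle_child(void)
{
	pid_t pid = fork();

	if (pid == 0) {
		for (;;)
			pause();
	}
	return pid;
}

/*
 * Hypothetical helper: send `first` and then `second` to a fresh child
 * and return the wait status of its termination, or -1 on error.
 *
 * If the child were left stopped with the second signal still pending
 * (the state race 3 describes), waitpid() would never return, so an
 * alarm turns a hang into a visible SIGALRM death of the caller.
 */
int kill_pair_status(int first, int second)
{
	int status = -1;
	pid_t pid = spawn_idle_child();
	pid_t reaped;

	if (pid < 0)
		return -1;

	kill(pid, first);
	kill(pid, second);

	alarm(10);		/* assumed timeout, chosen arbitrarily */
	reaped = waitpid(pid, &status, 0);
	alarm(0);

	return reaped == pid ? status : -1;
}
```

With the semantics the message asks for, `kill_pair_status(SIGSTOP, SIGKILL)` and `kill_pair_status(SIGKILL, SIGCONT)` should both yield a status for which `WIFSIGNALED` holds and `WTERMSIG` is `SIGKILL`, never a status of 0. A single passing run does not prove the absence of an SMP race; a check like this is only useful when run in a loop under load.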

Commit ebf5ebe3 authored Feb 05, 2003 by Ingo Molnar Committed by Linus Torvalds Feb 05, 2003
6 changed files
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -587,7 +587,7 @@ static inline int de_thread(struct signal_struct *oldsig)
 		return -EAGAIN;
 	}
 	oldsig->group_exit = 1;
-	__broadcast_thread_group(current, SIGKILL);
+	zap_other_threads(current);

 	/*
 	 * Account for the thread group leader hanging around:
@@ -659,7 +659,8 @@ static inline int de_thread(struct signal_struct *oldsig)
 			current->ptrace = ptrace;
 			__ptrace_link(current, parent);
 		}
-		
+
+		list_del(&current->tasks);
 		list_add_tail(&current->tasks, &init_task.tasks);
 		current->exit_signal = SIGCHLD;
 		state = leader->state;
@@ -680,6 +681,7 @@ static inline int de_thread(struct signal_struct *oldsig)
 	newsig->group_exit = 0;
 	newsig->group_exit_code = 0;
 	newsig->group_exit_task = NULL;
+	newsig->group_stop_count = 0;
 	memcpy(newsig->action, current->sig->action, sizeof(newsig->action));
 	init_sigpending(&newsig->shared_pending);


--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -235,6 +235,9 @@ struct signal_struct {
 	int			group_exit;
 	int			group_exit_code;
 	struct task_struct	*group_exit_task;
+
+	/* thread group stop support, overloads group_exit_code too */
+	int			group_stop_count;
 };

 /*
@@ -508,7 +511,6 @@ extern int in_egroup_p(gid_t);
 extern void proc_caches_init(void);
 extern void flush_signals(struct task_struct *);
 extern void flush_signal_handlers(struct task_struct *);
-extern void sig_exit(int, int, struct siginfo *);
 extern int dequeue_signal(sigset_t *mask, siginfo_t *info);
 extern void block_all_signals(int (*notifier)(void *priv), void *priv,
 			      sigset_t *mask);
@@ -525,7 +527,7 @@ extern void do_notify_parent(struct task_struct *, int);
 extern void force_sig(int, struct task_struct *);
 extern void force_sig_specific(int, struct task_struct *);
 extern int send_sig(int, struct task_struct *, int);
-extern int __broadcast_thread_group(struct task_struct *p, int sig);
+extern void zap_other_threads(struct task_struct *p);
 extern int kill_pg(pid_t, int, int);
 extern int kill_sl(pid_t, int, int);
 extern int kill_proc(pid_t, int, int);
@@ -590,6 +592,8 @@ extern void exit_files(struct task_struct *);
 extern void exit_sighand(struct task_struct *);
 extern void __exit_sighand(struct task_struct *);

+extern NORET_TYPE void do_group_exit(int);
+
 extern void reparent_to_init(void);
 extern void daemonize(void);
 extern task_t *child_reaper;
@@ -762,6 +766,8 @@ static inline void cond_resched_lock(spinlock_t * lock)
 extern FASTCALL(void recalc_sigpending_tsk(struct task_struct *t));
 extern void recalc_sigpending(void);

+extern void signal_wake_up(struct task_struct *t, int resume_stopped);
+
 /*
 * Wrappers for p->thread_info->cpu access. No-op on UP.
 */

--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -647,7 +647,7 @@ NORET_TYPE void do_exit(long code)
 	exit_namespace(tsk);
 	exit_thread();

-	if (current->leader)
+	if (tsk->leader)
 		disassociate_ctty(1);

 	module_put(tsk->thread_info->exec_domain->module);
@@ -657,8 +657,31 @@ NORET_TYPE void do_exit(long code)
 	tsk->exit_code = code;
 	exit_notify();
 	preempt_disable();
-	if (current->exit_signal == -1)
-		release_task(current);
+	if (signal_pending(tsk) && !tsk->sig->group_exit
+	    && !thread_group_empty(tsk)) {
+		/*
+		 * This occurs when there was a race between our exit
+		 * syscall and a group signal choosing us as the one to
+		 * wake up.  It could be that we are the only thread
+		 * alerted to check for pending signals, but another thread
+		 * should be woken now to take the signal since we will not.
+		 * Now we'll wake all the threads in the group just to make
+		 * sure someone gets all the pending signals.
+		 */
+		struct task_struct *t;
+		read_lock(&tasklist_lock);
+		spin_lock_irq(&tsk->sig->siglock);
+		for (t = next_thread(tsk); t != tsk; t = next_thread(t))
+			if (!signal_pending(t) && !(t->flags & PF_EXITING)) {
+				recalc_sigpending_tsk(t);
+				if (signal_pending(t))
+					signal_wake_up(t, 0);
+			}
+		spin_unlock_irq(&tsk->sig->siglock);
+		read_unlock(&tasklist_lock);
+	}
+	if (tsk->exit_signal == -1)
+		release_task(tsk);
 	schedule();
 	BUG();
 /*
@@ -710,31 +733,44 @@ task_t *next_thread(task_t *p)
 }

 /*
- * this kills every thread in the thread group. Note that any externally
- * wait4()-ing process will get the correct exit code - even if this 
- * thread is not the thread group leader.
+ * Take down every thread in the group.  This is called by fatal signals
+ * as well as by sys_exit_group (below).
 */
-asmlinkage long sys_exit_group(int error_code)
+NORET_TYPE void
+do_group_exit(int exit_code)
 {
-	unsigned int exit_code = (error_code & 0xff) << 8;
-
-	if (!thread_group_empty(current)) {
-		struct signal_struct *sig = current->sig;
+	BUG_ON(exit_code & 0x80); /* core dumps don't get here */

+	if (current->sig->group_exit)
+		exit_code = current->sig->group_exit_code;
+	else if (!thread_group_empty(current)) {
+		struct signal_struct *const sig = current->sig;
+		read_lock(&tasklist_lock);
 		spin_lock_irq(&sig->siglock);
-		if (sig->group_exit) {
-			spin_unlock_irq(&sig->siglock);
-
-			/* another thread was faster: */
-			do_exit(sig->group_exit_code);
-		}
+		if (sig->group_exit)
+			/* Another thread got here before we took the lock.  */
+			exit_code = sig->group_exit_code;
+		else {
 		sig->group_exit = 1;
 		sig->group_exit_code = exit_code;
-		__broadcast_thread_group(current, SIGKILL);
+			zap_other_threads(current);
+		}
 		spin_unlock_irq(&sig->siglock);
+		read_unlock(&tasklist_lock);
 	}

 	do_exit(exit_code);
+	/* NOTREACHED */
+}
+
+/*
+ * this kills every thread in the thread group. Note that any externally
+ * wait4()-ing process will get the correct exit code - even if this
+ * thread is not the thread group leader.
+ */
+asmlinkage long sys_exit_group(int error_code)
+{
+	do_group_exit((error_code & 0xff) << 8);
 }

 static int eligible_child(pid_t pid, int options, task_t *p)
@@ -800,6 +836,8 @@ asmlinkage long sys_wait4(pid_t pid,unsigned int * stat_addr, int options, struc
 		int ret;

 		list_for_each(_p,&tsk->children) {
+			int exit_code;
+
 			p = list_entry(_p,struct task_struct,sibling);

 			ret = eligible_child(pid, options, p);
@@ -813,20 +851,69 @@ asmlinkage long sys_wait4(pid_t pid,unsigned int * stat_addr, int options, struc
 					continue;
 				if (!(options & WUNTRACED) && !(p->ptrace & PT_PTRACED))
 					continue;
+				if (ret == 2 && !(p->ptrace & PT_PTRACED) &&
+				    p->sig && p->sig->group_stop_count > 0)
+					/*
+					 * A group stop is in progress and
+					 * we are the group leader.  We won't
+					 * report until all threads have
+					 * stopped.
+					 */
+					continue;
 				read_unlock(&tasklist_lock);

 				/* move to end of parent's list to avoid starvation */
 				write_lock_irq(&tasklist_lock);
 				remove_parent(p);
 				add_parent(p, p->parent);
+
+				/*
+				 * This uses xchg to be atomic with
+				 * the thread resuming and setting it.
+				 * It must also be done with the write
+				 * lock held to prevent a race with the
+				 * TASK_ZOMBIE case (below).
+				 */
+				exit_code = xchg(&p->exit_code, 0);
+				if (unlikely(p->state > TASK_STOPPED)) {
+					/*
+					 * The task resumed and then died.
+					 * Let the next iteration catch it
+					 * in TASK_ZOMBIE.  Note that
+					 * exit_code might already be zero
+					 * here if it resumed and did
+					 * _exit(0).  The task itself is
+					 * dead and won't touch exit_code
+					 * again; other processors in
+					 * this function are locked out.
+					 */
+					p->exit_code = exit_code;
+					exit_code = 0;
+				}
+				if (unlikely(exit_code == 0)) {
+					/*
+					 * Another thread in this function
+					 * got to it first, or it resumed,
+					 * or it resumed and then died.
+					 */
+					write_unlock_irq(&tasklist_lock);
+					continue;
+				}
+				/*
+				 * Make sure this doesn't get reaped out from
+				 * under us while we are examining it below.
+				 * We don't want to keep holding onto the
+				 * tasklist_lock while we call getrusage and
+				 * possibly take page faults for user memory.
+				 */
+				get_task_struct(p);
 				write_unlock_irq(&tasklist_lock);
 				retval = ru ? getrusage(p, RUSAGE_BOTH, ru) : 0; 
 				if (!retval && stat_addr) 
-					retval = put_user((p->exit_code << 8) | 0x7f, stat_addr);
-				if (!retval) {
-					p->exit_code = 0;
+					retval = put_user((exit_code << 8) | 0x7f, stat_addr);
+				if (!retval)
 					retval = p->pid;
-				}
+				put_task_struct(p);
 				goto end_wait4;
 			case TASK_ZOMBIE:
 				/*
@@ -841,6 +928,13 @@ asmlinkage long sys_wait4(pid_t pid,unsigned int * stat_addr, int options, struc
 				state = xchg(&p->state, TASK_DEAD);
 				if (state != TASK_ZOMBIE)
 					continue;
+				if (unlikely(p->exit_signal == -1))
+					/*
+					 * This can only happen in a race with
+					 * a ptraced thread dying on another
+					 * processor.
+					 */
+					continue;
 				read_unlock(&tasklist_lock);

 				retval = ru ? getrusage(p, RUSAGE_BOTH, ru) : 0;
@@ -857,11 +951,17 @@ asmlinkage long sys_wait4(pid_t pid,unsigned int * stat_addr, int options, struc
 				retval = p->pid;
 				if (p->real_parent != p->parent) {
 					write_lock_irq(&tasklist_lock);
+					/* Double-check with lock held.  */
+					if (p->real_parent != p->parent) {
 					__ptrace_unlink(p);
-					do_notify_parent(p, SIGCHLD);
+						do_notify_parent(
+							p, p->exit_signal);
 					p->state = TASK_ZOMBIE;
+						p = NULL;
+					}
 					write_unlock_irq(&tasklist_lock);
-				} else
+				}
+				if (p != NULL)
 					release_task(p);
 				goto end_wait4;
 			default:

--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -680,6 +680,7 @@ static inline int copy_sighand(unsigned long clone_flags, struct task_struct * t
 	sig->group_exit = 0;
 	sig->group_exit_code = 0;
 	sig->group_exit_task = NULL;
+	sig->group_stop_count = 0;
 	memcpy(sig->action, current->sig->action, sizeof(sig->action));
 	sig->curr_target = NULL;
 	init_sigpending(&sig->shared_pending);
@@ -801,7 +802,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	spin_lock_init(&p->alloc_lock);
 	spin_lock_init(&p->switch_lock);

-	clear_tsk_thread_flag(p,TIF_SIGPENDING);
+	clear_tsk_thread_flag(p, TIF_SIGPENDING);
 	init_sigpending(&p->pending);

 	p->it_real_value = p->it_virt_value = p->it_prof_value = 0;
@@ -910,6 +911,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	 */
 	if (sigismember(&current->pending.signal, SIGKILL)) {
 		write_unlock_irq(&tasklist_lock);
+		retval = -EINTR;
 		goto bad_fork_cleanup_namespace;
 	}

@@ -934,6 +936,17 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 		}
 		p->tgid = current->tgid;
 		p->group_leader = current->group_leader;
+
+		if (current->sig->group_stop_count > 0) {
+			/*
+			 * There is an all-stop in progress for the group.
+			 * We ourselves will stop as soon as we check signals.
+			 * Make the new thread part of that group stop too.
+			 */
+			current->sig->group_stop_count++;
+			set_tsk_thread_flag(p, TIF_SIGPENDING);
+		}
+
 		spin_unlock(&current->sig->siglock);
 	}

@@ -1036,8 +1049,13 @@ struct task_struct *do_fork(unsigned long clone_flags,
 			init_completion(&vfork);
 		}

-		if (p->ptrace & PT_PTRACED)
-			send_sig(SIGSTOP, p, 1);
+		if (p->ptrace & PT_PTRACED) {
+			/*
+			 * We'll start up with an immediate SIGSTOP.
+			 */
+			sigaddset(&p->pending.signal, SIGSTOP);
+			set_tsk_thread_flag(p, TIF_SIGPENDING);
+		}

 		wake_up_forked_process(p);		/* do this last */
 		++total_forks;

--- a/kernel/signal.c
+++ b/kernel/signal.c
--- a/kernel/suspend.c
+++ b/kernel/suspend.c
@@ -65,7 +65,6 @@
 #include <asm/pgtable.h>
 #include <asm/io.h>

-extern void signal_wake_up(struct task_struct *t);
 extern int sys_sync(void);

 unsigned char software_suspend_enabled = 0;
@@ -220,7 +219,7 @@ int freeze_processes(void)
 			   without locking */
 			p->flags |= PF_FREEZE;
 			spin_lock_irqsave(&p->sig->siglock, flags);
-			signal_wake_up(p);
+			signal_wake_up(p, 0);
 			spin_unlock_irqrestore(&p->sig->siglock, flags);
 			todo++;
 		} while_each_thread(g, p);
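
The sys_wait4 TASK_STOPPED hunk in kernel/exit.c claims the stop report with `xchg(&p->exit_code, 0)`, so that only one waiter can observe a non-zero code (exit race 2 in the commit message). The idea can be sketched in user space with C11 atomics and pthreads. This is an illustration under stated assumptions, not kernel code: `exit_code`, `winners`, `waiter` and `race_waiters` are hypothetical names, and `atomic_exchange` stands in for the kernel's `xchg`.

```c
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the exit_code field of a stopped task. */
static atomic_int exit_code;

/* Number of waiters that obtained a non-zero code, i.e. "won". */
static atomic_int winners;

/*
 * One waiter: fetch the code and clear it in a single atomic step.
 * A separate load followed by a later store of 0 would let two
 * waiters both see the non-zero value.
 */
static void *waiter(void *arg)
{
	(void)arg;
	if (atomic_exchange(&exit_code, 0) != 0)
		atomic_fetch_add(&winners, 1);
	return NULL;
}

/*
 * Race up to 64 waiters against a single stop report and return how
 * many of them claimed it, or -1 on error.
 */
int race_waiters(int nthreads, int stop_code)
{
	pthread_t tid[64];
	int started = 0;
	int i;

	if (nthreads < 1 || nthreads > 64)
		return -1;

	atomic_store(&exit_code, stop_code);
	atomic_store(&winners, 0);

	for (i = 0; i < nthreads; i++) {
		if (pthread_create(&tid[i], NULL, waiter, NULL) != 0)
			break;
		started++;
	}
	for (i = 0; i < started; i++)
		pthread_join(tid[i], NULL);

	return started == nthreads ? atomic_load(&winners) : -1;
}
```

For any non-zero `stop_code`, `race_waiters` returns 1 regardless of thread count; with a code of 0 nobody wins, which corresponds to the `exit_code == 0` branch in the patch where the waiter drops the lock and continues. The sketch does not model the extra `p->state > TASK_STOPPED` check or the tasklist_lock, both of which the real code needs for the resume-then-die case.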