From: Martin Pieuchot <mpi@grenadille.net>
Subject: Re: Faster _exit(2) for a faster userland: R.I.P the reaper
To: tech@openbsd.org
Date: Fri, 2 May 2025 17:04:46 +0200

On 02/05/25(Fri) 13:30, Martin Pieuchot wrote:
> In the past 6 months, since Valencia, I investigated two different
> performance issues:
>  - "Why are configure scripts slow?" (With aja@ and jan@), and
>  - "Why is single-threaded page fault performance so bad?" (by myself)
> 
> I don't even want to share numbers.  That said, both issues led me to
> exit1() in the kernel: too many context switches related to the
> reaper, the extremely costly insertion of many pages in the RB-tree
> when tearing down a VM space, as well as the serialization of such
> teardowns in LIFO order...
> 
> The diff below is not a complete answer to all these points, however I
> believe it to be the most complex piece.  It already greatly improves
> performance even if processes are now charged more %sys time due to
> reaping their own address space.
> 
> The diff below has been tested on amd64, arm64 and i386.  It includes
> multiple pieces that can be reviewed independently:
> 
> - arch/amd64/amd64/locore.S: always update `ci_proc_pmap' even if the
>   %cr3 values match.  This is required because we now use pmap_kernel
>   on non-kernel threads to be able to reap the user pmap & user space.
> 
> - kern/kern_event.c:  Use a mutex instead of a rwlock for the
>   `kqueue_ps_list_lock'.  This is necessary to remove a possible sleep
>   point in knote_processexit().
> 
> - kern/kern_rwlock.c:  Add two assertwaitok() to ensure non-contended
>   rwlocks are not taken in code paths that MUST NOT sleep.
> 
> - kern/subr_xxx.c:  Add a check to ensure an SDEAD thread never
>   executes a code path that might sleep.
> 
> The rest is a shuffling of the current exit1() logic which includes:
> 
> - Remove an extra synchronization for `ps_mainproc' for multi-threaded
>   processes.  Rely instead on single_thread_set().  That means the last
>   thread will clean up per-process state and free its siblings' state
>   and stacks.  As a bonus `ps_mainproc' is also killed.
> 
> - Move uvm_exit() inside exit1().  This is still executed without the
>   kernel lock and now in parallel.  We now borrow proc0's vmspace and
>   pmap to finish the execution of the dead process.
> 
> - Move re-parenting and NOTE_EXIT notification to exit1().
> 
> - Change dowait6() to allow init(8) to reap non-zombie processes.
> 
> A lot more cleanups and improvements can be done on top of this.  We
> should now be able to call mi_switch() instead of sched_toidle() after
> cpu_exit().  This should remove an extra context switch and give us
> another performance/latency boost.
> 
> Accounting could also certainly be improved.  I must admit I don't
> understand the API and I'd appreciate it if someone (claudio@?) could look
> at the many tuagg_*, ruadd(), calcru() & co.
> 
> I'll look at improving the tear down of the VM space and pmap next.
> 
> I'd appreciate test reports on many different setups as well as other
> architectures.

Below are the libkvm bits related to the removal of `ps_mainproc'.

Index: kvm_proc2.c
===================================================================
RCS file: /cvs/src/lib/libkvm/kvm_proc2.c,v
diff -u -p -r1.39 kvm_proc2.c
--- kvm_proc2.c	8 Jul 2024 13:18:26 -0000	1.39
+++ kvm_proc2.c	2 May 2025 15:02:18 -0000
@@ -294,7 +294,7 @@ kvm_proclist(kvm_t *kd, int op, int arg,
 
 #define do_copy_str(_d, _s, _l)	kvm_read(kd, (u_long)(_s), (_d), (_l)-1)
 		FILL_KPROC(&kp, do_copy_str, &proc, &process,
-		    &ucred, &pgrp, process.ps_mainproc, pr, &sess,
+		    &ucred, &pgrp, TAILQ_FIRST(&process.ps_threads), pr, &sess,
 		    vmp, limp, sap, &process.ps_tu, 0, 1);
 
 		/* stuff that's too painful to generalize */
@@ -331,10 +331,11 @@ kvm_proclist(kvm_t *kd, int op, int arg,
 
 		/* update %cpu for all threads */
 		if (dothreads) {
-			if (KREAD(kd, (u_long)process.ps_mainproc, &proc)) {
+			if (KREAD(kd, (u_long)TAILQ_FIRST(&process.ps_threads),
+				&proc)) {
 				_kvm_err(kd, kd->program,
 				    "can't read proc at %lx",
-				    (u_long)process.ps_mainproc);
+				    (u_long)TAILQ_FIRST(&process.ps_threads));
 				return (-1);
 			}
 			kp.p_pctcpu = proc.p_pctcpu;