From: Martin Pieuchot
Subject: Re: Faster _exit(2) for a faster userland: R.I.P the reaper
To: tech@openbsd.org
Date: Fri, 2 May 2025 17:04:46 +0200

On 02/05/25(Fri) 13:30, Martin Pieuchot wrote:
> In the past 6 months, since Valencia, I have investigated two different
> performance issues:
> - "Why are configure scripts slow?" (with aja@ and jan@), and
> - "Why is single-threaded page fault performance so bad?" (by myself)
> 
> I don't even want to share numbers.  That said, both issues led me to
> exit1() in the kernel: too many context switches related to the reaper,
> the extremely costly insertion of many pages into the RB-tree when
> tearing down a VM space, as well as the serialization of such teardowns
> in LIFO order...
> 
> The diff below is not a complete answer to all these points, but I
> believe it to be the most complex piece.  It already greatly improves
> performance, even if processes are now charged a larger amount of
> %sys time due to reaping their own address space.
> 
> The diff below has been tested on amd64, arm64 and i386.  It includes
> multiple pieces that can be reviewed independently:
> 
> - arch/amd64/amd64/locore.S: always update `ci_proc_pmap', even if the
>   %cr3 values are the same.  This is required because we now use
>   pmap_kernel on non-kernel threads to be able to reap the user pmap &
>   user space.
> 
> - kern/kern_event.c: use a mutex instead of a rwlock for
>   `kqueue_ps_list_lock'.  This is necessary to remove a possible sleep
>   point in knote_processexit().
> 
> - kern/kern_rwlock.c: add two assertwaitok() calls to ensure
>   non-contended rwlocks are not taken in code paths that MUST NOT
>   sleep.
> 
> - kern/subr_xxx.c: add a check to ensure SDEAD threads never execute a
>   code path that might sleep.
> 
> The rest is a shuffling of the current exit1() logic, which includes:
> 
> - Remove an extra synchronization for `ps_mainproc' for multi-threaded
>   processes.  Rely instead on single_thread_set().  That means the last
>   thread will clean up per-process state and free its siblings' state
>   and stacks.  As a bonus, `ps_mainproc' is also killed.
> 
> - Move uvm_exit() inside exit1().  This is still executed without the
>   kernel lock and now in parallel.  We now borrow proc0's vmspace and
>   pmap to finish the execution of the dead process.
> 
> - Move re-parenting and NOTE_EXIT notification to exit1().
> 
> - Change dowait6() to allow init(8) to reap non-zombie processes.
> 
> A lot more cleanups and improvements can be done on top of this.  We
> should now be able to call mi_switch() instead of sched_toidle() after
> cpu_exit().  This should remove an extra context switch and give us
> another performance/latency boost.
> 
> Accounting could also certainly be improved.  I must admit I don't
> understand the API and I'd appreciate it if someone (claudio@?) could
> look at the many tuagg_*, ruadd(), calcru() & co.
> 
> I'll look at improving the teardown of the VM space and pmap next.
> 
> I'd appreciate test reports on many different setups as well as on
> other architectures.

Below are the libkvm bits related to the removal of `ps_mainproc'.
Index: kvm_proc2.c
===================================================================
RCS file: /cvs/src/lib/libkvm/kvm_proc2.c,v
diff -u -p -r1.39 kvm_proc2.c
--- kvm_proc2.c	8 Jul 2024 13:18:26 -0000	1.39
+++ kvm_proc2.c	2 May 2025 15:02:18 -0000
@@ -294,7 +294,7 @@ kvm_proclist(kvm_t *kd, int op, int arg,
 #define do_copy_str(_d, _s, _l)	kvm_read(kd, (u_long)(_s), (_d), (_l)-1)
 
 		FILL_KPROC(&kp, do_copy_str, &proc, &process,
-		    &ucred, &pgrp, process.ps_mainproc, pr, &sess,
+		    &ucred, &pgrp, TAILQ_FIRST(&process.ps_threads), pr, &sess,
 		    vmp, limp, sap, &process.ps_tu, 0, 1);
 
 		/* stuff that's too painful to generalize */
@@ -331,10 +331,11 @@ kvm_proclist(kvm_t *kd, int op, int arg,
 
 		/* update %cpu for all threads */
 		if (dothreads) {
-			if (KREAD(kd, (u_long)process.ps_mainproc, &proc)) {
+			if (KREAD(kd, (u_long)TAILQ_FIRST(&process.ps_threads),
+			    &proc)) {
 				_kvm_err(kd, kd->program,
 				    "can't read proc at %lx",
-				    (u_long)process.ps_mainproc);
+				    (u_long)TAILQ_FIRST(&process.ps_threads));
 				return (-1);
 			}
 			kp.p_pctcpu = proc.p_pctcpu;
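
[Editorial aside, not part of the posted diff: the hunks above replace the
removed `ps_mainproc' pointer with TAILQ_FIRST() on the process's thread
list.  Below is a minimal, self-contained sketch of that <sys/queue.h>
access pattern.  The struct and field names are simplified, hypothetical
stand-ins; only the `ps_threads' head and the TAILQ_FIRST() usage mirror
what the real code does, and in libkvm the resulting pointer is of course
a kernel address that still has to be read with KREAD().]

/*
 * Sketch only: a simplified "process with a TAILQ of threads", showing
 * why the head of the list can stand in for a cached main-thread pointer.
 */
#include <sys/queue.h>
#include <stdio.h>

struct thread {
	int			t_tid;
	TAILQ_ENTRY(thread)	t_link;		/* linkage on the process's list */
};

TAILQ_HEAD(threadlist, thread);

struct process {
	struct threadlist	ps_threads;	/* all live threads, head first */
};

int
main(void)
{
	struct process pr;
	struct thread t1 = { .t_tid = 100001 };
	struct thread t2 = { .t_tid = 100002 };

	TAILQ_INIT(&pr.ps_threads);
	TAILQ_INSERT_TAIL(&pr.ps_threads, &t1, t_link);
	TAILQ_INSERT_TAIL(&pr.ps_threads, &t2, t_link);

	/* The head of the list is what the cached pointer used to name. */
	printf("first thread: %d\n", TAILQ_FIRST(&pr.ps_threads)->t_tid);
	return 0;
}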