Faster _exit(2) for a faster userland: R.I.P the reaper
Tested on AMD Ryzen 7 8845HS.
cpu0 at mainbus0: apid 0 (boot processor)
cpu0: AMD Ryzen 7 8845HS w/ Radeon 780M Graphics, 3800.01 MHz, 19-75-02, patch 0a705206
cpu0: cpuid 1 edx=178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,HTT> ecx=76f8320b<SSE3,PCLMUL,MWAIT,SSSE3,FMA3,CX16,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,AES,XSAVE,AVX,F16C,RDRAND>
cpu0: cpuid 6 eax=4<ARAT> ecx=1<EFFFREQ>
cpu0: cpuid 7.0 ebx=f1bf97a9<FSGSBASE,BMI1,AVX2,SMEP,BMI2,ERMS,INVPCID,PQM,AVX512F,AVX512DQ,RDSEED,ADX,SMAP,AVX512IFMA,CLFLUSHOPT,CLWB,AVX512CD,SHA,AVX512BW,AVX512VL> ecx=405fce<AVX512VBMI,UMIP,PKU> edx=10000010<L1DF>
cpu0: cpuid d.1 eax=f<XSAVEOPT,XSAVEC,XGETBV1,XSAVES>
cpu0: cpuid 80000001 edx=2fd3fbff<NXE,MMXX,FFXSR,PAGE1GB,RDTSCP,LONG> ecx=75c237ff<LAHF,CMPLEG,SVM,EAPICSP,AMCR8,ABM,SSE4A,MASSE,3DNOWP,OSVW,IBS,SKINIT,TCE,TOPEXT,CPCTR,DBKP,PCTRL3,MWAITX>
cpu0: cpuid 80000007 edx=e799<HWPSTATE,ITSC>
cpu0: cpuid 80000008 ebx=191ef257<IBPB,IBRS,STIBP,STIBP_ALL,IBRS_PREF,IBRS_SM,SSBD>
cpu0: cpuid 8000001F eax=1<SME>
cpu0: 32KB 64b/line 8-way D-cache, 32KB 64b/line 8-way I-cache, 1MB 64b/line 8-way L2 cache, 16MB 64b/line 16-way L3 cache
cpu0: smt 0, core 0, package 0
cpu0: apic clock running at 24MHz
cpu0: mwait min=64, max=64, C-substates=1.1, IBE
acpicpu0 at acpi0: C3(0@350 io@0x415), C2(0@18 io@0x414), C1(0@1 mwait), PSS
cpu0: 3800 MHz: speeds: 3800 2200 1600 MHz
Kernel build without your diff:
136.93 real 544.57 user 280.55 sys
Kernel build with your diff:
125.83 real 538.58 user 249.35 sys
127.12 real 536.06 user 256.71 sys
All in all, a promising result. Thank you, I will continue to test it
in my daily work.
Rafael
On Fri May 02, 2025 at 01:30:01PM +0200, Martin Pieuchot wrote:
> In the past 6 months, since Valencia, I investigated two different
> performance issues:
> - "Why configure scripts are slow?" (With aja@ and jan@), and
> - "Why single-threaded page fault performances are so bad?" (by myself)
>
> I don't even want to share numbers. That said, both issues led me to
> exit1() in the kernel: too many context switches related to the reaper,
> the extremely costly insertion of many pages into the RB-tree when
> tearing down a VM space, and the serialization of such teardowns in
> LIFO order...
>
> The diff below is not a complete answer to all these points, but I
> believe it to be the most complex piece. It already greatly improves
> performance, even though processes are now charged a bigger amount of
> %sys time because they reap their own address space.
>
> The diff below has been tested on amd64, arm64 and i386. It includes
> multiple pieces that can be reviewed independently:
>
> - arch/amd64/amd64/locore.S: always update `ci_proc_pmap', even if
> %cr3 is unchanged. This is required because we now use pmap_kernel()
> on non-kernel threads to be able to reap the user pmap & user space.
>
> - kern/kern_event.c: Use a mutex instead of a rwlock for
> `kqueue_ps_list_lock'. This is necessary to remove a possible sleep
> point in knote_processexit().
>
> - kern/kern_rwlock.c: Add two assertwaitok() calls to ensure that
> rwlocks are not taken, even uncontended, in code paths that MUST NOT
> sleep (see the sketch after this list).
>
> - kern/subr_xxx.c: Add a check to ensure SDEAD threads never execute
> a code path that might sleep.
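> 
> To illustrate the rule these asserts enforce, here is a minimal sketch
> (not code from the diff; `some_rwlock' is a made-up example): once the
> exiting thread is marked SDEAD it may only take spinning locks, and
> assertwaitok() now panics on the sleeping alternative:
> 
> 	mtx_enter(&pr->ps_mtx);		/* fine: mutexes spin */
> 	mtx_leave(&pr->ps_mtx);
> 
> 	rw_enter_write(&some_rwlock);	/* now panics under DIAGNOSTIC:
> 					 * rw_enter_write() may sleep and
> 					 * assertwaitok() checks SDEAD */
> 	rw_exit_write(&some_rwlock);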
>
> The rest is a reshuffling of the current exit1() logic, which includes:
>
> - Remove an extra synchronization on `ps_mainproc' for multi-threaded
> processes. Rely instead on single_thread_set(). That means the last
> exiting thread cleans up the per-process state and frees its siblings'
> state and stacks (condensed in the sketch after this list). As a
> bonus, `ps_mainproc' is also killed.
>
> - Move uvm_exit() inside exit1(). This is still executed without the
> kernel lock and now in parallel. We now borrow proc0's vmspace and
> pmap to finish the execution of the dead process.
>
> - Move re-parenting and NOTE_EXIT notification to exit1().
>
> - Change dowait6() to allow init(8) to reap non-zombie processes.
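> 
> Condensed, the bookkeeping that replaces the main-thread/reaper dance
> looks as follows. This is a summary of the diff below, not new code;
> locking details and accounting are omitted:
> 
> 	mtx_enter(&pr->ps_mtx);
> 	pr->ps_exitcnt++;
> 	if (pr->ps_exitcnt == pr->ps_threadcnt)
> 		lastthread = 1;
> 	mtx_leave(&pr->ps_mtx);
> 
> 	if (lastthread) {
> 		/* Reap our own address space, borrowing proc0's. */
> 		uvm_exit(pr);
> 		/* Free the siblings that already went through exit1(). */
> 		TAILQ_FOREACH_SAFE(q, &pr->ps_threads, p_thr_link, qn) {
> 			if (q == curproc)
> 				continue;
> 			TAILQ_REMOVE(&pr->ps_threads, q, p_thr_link);
> 			proc_free(q);
> 		}
> 	}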
>
> A lot more cleanups and improvements can be done on top of this. We
> should now be able to call mi_switch() instead of sched_toidle() after
> cpu_exit(). This should remove an extra context switch and give us
> another performance/latency boost.
>
> Accounting could also certainly be improved. I must admit I don't
> understand the API and I'd appreciate it if someone (claudio@?) could
> look at the many tuagg_*, ruadd(), calcru() & co.
>
> I'll look at improving the teardown of the VM space and pmap next.
>
> I'd appreciate test reports on many different setups as well as on
> other architectures.
>
> Thanks,
> Martin
>
> Index: arch/amd64/amd64/locore.S
> ===================================================================
> RCS file: /cvs/src/sys/arch/amd64/amd64/locore.S,v
> diff -u -p -r1.150 locore.S
> --- arch/amd64/amd64/locore.S 2 Feb 2025 05:45:20 -0000 1.150
> +++ arch/amd64/amd64/locore.S 30 Apr 2025 09:41:57 -0000
> @@ -409,13 +409,14 @@ restore_saved:
> cmpq %rcx,CPUVAR(PROC_PMAP)
> jnz .Lbogus_proc_pmap
> #endif
> - /* record which pmap this CPU should get IPIs for */
> - movq %rbx,CPUVAR(PROC_PMAP)
>
> .Lset_cr3:
> movq %rax,%cr3 /* %rax used below too */
>
> .Lsame_cr3:
> + /* record which pmap this CPU should get IPIs for */
> + movq %rbx,CPUVAR(PROC_PMAP)
> +
> /*
> * If we switched from a userland thread with a shallow call stack
> * (e.g interrupt->ast->mi_ast->prempt->mi_switch->cpu_switchto)
> Index: kern/init_main.c
> ===================================================================
> RCS file: /cvs/src/sys/kern/init_main.c,v
> diff -u -p -r1.328 init_main.c
> --- kern/init_main.c 1 Jan 2025 07:44:54 -0000 1.328
> +++ kern/init_main.c 1 May 2025 13:00:03 -0000
> @@ -117,7 +117,6 @@ struct plimit limit0;
> struct vmspace vmspace0;
> struct sigacts sigacts0;
> struct process *initprocess;
> -struct proc *reaperproc;
>
> extern struct user *proc0paddr;
>
> @@ -496,10 +495,6 @@ main(void *framep)
> /* Create the pageout daemon kernel thread. */
> if (kthread_create(uvm_pageout, NULL, NULL, "pagedaemon"))
> panic("fork pagedaemon");
> -
> - /* Create the reaper daemon kernel thread. */
> - if (kthread_create(reaper, NULL, &reaperproc, "reaper"))
> - panic("fork reaper");
>
> /* Create the cleaner daemon kernel thread. */
> if (kthread_create(buf_daemon, NULL, &cleanerproc, "cleaner"))
> Index: kern/kern_event.c
> ===================================================================
> RCS file: /cvs/src/sys/kern/kern_event.c,v
> diff -u -p -r1.201 kern_event.c
> --- kern/kern_event.c 10 Feb 2025 16:45:46 -0000 1.201
> +++ kern/kern_event.c 1 May 2025 12:20:55 -0000
> @@ -183,7 +183,7 @@ const struct filterops timer_filtops = {
> struct pool knote_pool;
> struct pool kqueue_pool;
> struct mutex kqueue_klist_lock = MUTEX_INITIALIZER(IPL_MPFLOOR);
> -struct rwlock kqueue_ps_list_lock = RWLOCK_INITIALIZER("kqpsl");
> +struct mutex kqueue_ps_list_lock = MUTEX_INITIALIZER(IPL_MPFLOOR);
> int kq_ntimeouts = 0;
> int kq_timeoutmax = (4 * 1024);
>
> @@ -340,7 +340,7 @@ int
> filt_procattach(struct knote *kn)
> {
> struct process *pr;
> - int nolock;
> + int locked = 0;
>
> if ((curproc->p_p->ps_flags & PS_PLEDGE) &&
> (curproc->p_pledge & PLEDGE_PROC) == 0)
> @@ -368,18 +368,17 @@ filt_procattach(struct knote *kn)
> kn->kn_data = kn->kn_sdata; /* ppid */
> kn->kn_fflags = NOTE_CHILD;
> kn->kn_flags &= ~EV_FLAG1;
> - rw_assert_wrlock(&kqueue_ps_list_lock);
> + MUTEX_ASSERT_LOCKED(&kqueue_ps_list_lock);
> + locked = 1;
> }
>
> - /* this needs both the ps_mtx and exclusive kqueue_ps_list_lock. */
> - nolock = (rw_status(&kqueue_ps_list_lock) == RW_WRITE);
> - if (!nolock)
> - rw_enter_write(&kqueue_ps_list_lock);
> + if (!locked)
> + mtx_enter(&kqueue_ps_list_lock);
> mtx_enter(&pr->ps_mtx);
> klist_insert_locked(&pr->ps_klist, kn);
> mtx_leave(&pr->ps_mtx);
> - if (!nolock)
> - rw_exit_write(&kqueue_ps_list_lock);
> + if (!locked)
> + mtx_leave(&kqueue_ps_list_lock);
>
> KERNEL_UNLOCK();
>
> @@ -404,8 +403,8 @@ filt_procdetach(struct knote *kn)
> struct process *pr = kn->kn_ptr.p_process;
> int status;
>
> - /* this needs both the ps_mtx and exclusive kqueue_ps_list_lock. */
> - rw_enter_write(&kqueue_ps_list_lock);
> + /* this needs both the ps_mtx and kqueue_ps_list_lock. */
> + mtx_enter(&kqueue_ps_list_lock);
> mtx_enter(&pr->ps_mtx);
> status = kn->kn_status;
>
> @@ -413,7 +412,7 @@ filt_procdetach(struct knote *kn)
> klist_remove_locked(&pr->ps_klist, kn);
>
> mtx_leave(&pr->ps_mtx);
> - rw_exit_write(&kqueue_ps_list_lock);
> + mtx_leave(&kqueue_ps_list_lock);
> }
>
> int
> @@ -435,6 +434,7 @@ filt_proc(struct knote *kn, long hint)
> kn->kn_fflags |= event;
>
> /*
> + KASSERT((p->p_p->ps_flags & PS_ZOMBIE) == 0);
> * process is gone, so flag the event as finished and remove it
> * from the process's klist
> */
> @@ -471,7 +471,7 @@ filt_proc(struct knote *kn, long hint)
> kev.data = kn->kn_id; /* parent */
> kev.udata = kn->kn_udata; /* preserve udata */
>
> - rw_assert_wrlock(&kqueue_ps_list_lock);
> + MUTEX_ASSERT_LOCKED(&kqueue_ps_list_lock);
> mtx_leave(&pr->ps_mtx);
> error = kqueue_register(kq, &kev, 0, NULL);
> mtx_enter(&pr->ps_mtx);
> @@ -531,12 +531,12 @@ filt_sigattach(struct knote *kn)
> kn->kn_ptr.p_process = pr;
> kn->kn_flags |= EV_CLEAR; /* automatically set */
>
> - /* this needs both the ps_mtx and exclusive kqueue_ps_list_lock. */
> - rw_enter_write(&kqueue_ps_list_lock);
> + /* this needs both the ps_mtx and kqueue_ps_list_lock. */
> + mtx_enter(&kqueue_ps_list_lock);
> mtx_enter(&pr->ps_mtx);
> klist_insert_locked(&pr->ps_klist, kn);
> mtx_leave(&pr->ps_mtx);
> - rw_exit_write(&kqueue_ps_list_lock);
> + mtx_leave(&kqueue_ps_list_lock);
>
> return (0);
> }
> @@ -546,11 +546,11 @@ filt_sigdetach(struct knote *kn)
> {
> struct process *pr = kn->kn_ptr.p_process;
>
> - rw_enter_write(&kqueue_ps_list_lock);
> + mtx_enter(&kqueue_ps_list_lock);
> mtx_enter(&pr->ps_mtx);
> klist_remove_locked(&pr->ps_klist, kn);
> mtx_leave(&pr->ps_mtx);
> - rw_exit_write(&kqueue_ps_list_lock);
> + mtx_leave(&kqueue_ps_list_lock);
> }
>
> int
> @@ -2058,12 +2058,12 @@ knote_fdclose(struct proc *p, int fd)
> void
> knote_processexit(struct process *pr)
> {
> - /* this needs both the ps_mtx and exclusive kqueue_ps_list_lock. */
> - rw_enter_write(&kqueue_ps_list_lock);
> + /* this needs both the ps_mtx and kqueue_ps_list_lock. */
> + mtx_enter(&kqueue_ps_list_lock);
> mtx_enter(&pr->ps_mtx);
> knote_locked(&pr->ps_klist, NOTE_EXIT);
> mtx_leave(&pr->ps_mtx);
> - rw_exit_write(&kqueue_ps_list_lock);
> + mtx_leave(&kqueue_ps_list_lock);
>
> /* remove other knotes hanging off the process */
> klist_invalidate(&pr->ps_klist);
> @@ -2072,12 +2072,12 @@ knote_processexit(struct process *pr)
> void
> knote_processfork(struct process *pr, pid_t pid)
> {
> - /* this needs both the ps_mtx and exclusive kqueue_ps_list_lock. */
> - rw_enter_write(&kqueue_ps_list_lock);
> + /* this needs both the ps_mtx and kqueue_ps_list_lock. */
> + mtx_enter(&kqueue_ps_list_lock);
> mtx_enter(&pr->ps_mtx);
> knote_locked(&pr->ps_klist, NOTE_FORK | pid);
> mtx_leave(&pr->ps_mtx);
> - rw_exit_write(&kqueue_ps_list_lock);
> + mtx_leave(&kqueue_ps_list_lock);
> }
>
> void
> Index: kern/kern_exit.c
> ===================================================================
> RCS file: /cvs/src/sys/kern/kern_exit.c,v
> diff -u -p -r1.245 kern_exit.c
> --- kern/kern_exit.c 2 May 2025 05:04:38 -0000 1.245
> +++ kern/kern_exit.c 2 May 2025 10:43:55 -0000
> @@ -118,16 +118,22 @@ exit1(struct proc *p, int xexit, int xsi
> {
> struct process *pr, *qr, *nqr;
> struct rusage *rup;
> + int wakeinit = 0, lastthread = 0;
>
> atomic_setbits_int(&p->p_flag, P_WEXIT);
>
> pr = p->p_p;
>
> - /* single-threaded? */
> + /*
> + * For multi-threaded processes, the first thread reaching exit1()
> + * for full process exit (not thread exit) notifies and waits for
> + * its siblings via the single-thread API.
> + * Once they have all reached exit1(), they wake it up. Then it
> + * releases per-process resources (address space, PID, etc).
> + */
> if (!P_HASSIBLING(p)) {
> flags = EXIT_NORMAL;
> } else {
> - /* nope, multi-threaded */
> if (flags == EXIT_NORMAL)
> single_thread_set(p, SINGLE_EXIT);
> }
> @@ -157,11 +163,10 @@ exit1(struct proc *p, int xexit, int xsi
> refcnt_finalize(&pr->ps_refcnt, "psdtor");
> }
>
> - /* unlink ourselves from the active threads */
> mtx_enter(&pr->ps_mtx);
> - TAILQ_REMOVE(&pr->ps_threads, p, p_thr_link);
> - pr->ps_threadcnt--;
> pr->ps_exitcnt++;
> + if (pr->ps_exitcnt == pr->ps_threadcnt)
> + lastthread = 1;
>
> /*
> * if somebody else wants to take us to single threaded mode
> @@ -170,16 +175,9 @@ exit1(struct proc *p, int xexit, int xsi
> if (pr->ps_single != NULL || ISSET(pr->ps_flags, PS_STOPPING))
> process_suspend_signal(pr);
>
> - /* proc is off ps_threads list so update accounting of process now */
> + /* update accounting of process now */
> tuagg_add_runtime();
> tuagg_add_process(pr, p);
> -
> - if ((p->p_flag & P_THREAD) == 0) {
> - /* main thread gotta wait because it has the pid, et al */
> - while (pr->ps_threadcnt + pr->ps_exitcnt > 1)
> - msleep_nsec(&pr->ps_threads, &pr->ps_mtx, PWAIT,
> - "thrdeath", INFSLP);
> - }
> mtx_leave(&pr->ps_mtx);
>
> rup = pr->ps_ru;
> @@ -193,7 +191,7 @@ exit1(struct proc *p, int xexit, int xsi
> }
> }
> p->p_siglist = 0;
> - if ((p->p_flag & P_THREAD) == 0)
> + if (lastthread)
> pr->ps_siglist = 0;
>
> kqpoll_exit();
> @@ -202,7 +200,9 @@ exit1(struct proc *p, int xexit, int xsi
> kcov_exit(p);
> #endif
>
> - if ((p->p_flag & P_THREAD) == 0) {
> + if (lastthread) {
> + struct proc *q, *qn;
> +
> if (pr->ps_flags & PS_PROFIL)
> stopprofclock(pr);
>
> @@ -241,6 +241,32 @@ exit1(struct proc *p, int xexit, int xsi
> */
> if (pr->ps_pptr->ps_sigacts->ps_sigflags & SAS_NOCLDWAIT)
> atomic_setbits_int(&pr->ps_flags, PS_NOZOMBIE);
> +
> + /*
> + * Free the VM resources we're still holding on to.
> + * We must do this from a valid thread because doing
> + * so may block.
> + */
> +#ifdef MULTIPROCESSOR
> + __mp_release_all(&kernel_lock);
> +#endif
> + uvm_exit(pr);
> + KERNEL_LOCK();
> +
> + /* Free siblings. */
> + TAILQ_FOREACH_SAFE(q, &pr->ps_threads, p_thr_link, qn) {
> + if (q == curproc)
> + continue;
> + mtx_enter(&pr->ps_mtx);
> + TAILQ_REMOVE(&pr->ps_threads, q, p_thr_link);
> + pr->ps_threadcnt--;
> + pr->ps_exitcnt--;
> + /* account the remainder of time spent in exit1() */
> + tuagg_add_process(pr, q);
> + mtx_leave(&pr->ps_mtx);
> + WITNESS_THREAD_EXIT(q);
> + proc_free(q);
> + }
> }
>
> p->p_fd = NULL; /* zap the thread's copy */
> @@ -256,11 +282,7 @@ exit1(struct proc *p, int xexit, int xsi
>
> /*
> * Remove proc from pidhash chain and allproc so looking
> - * it up won't work. We will put the proc on the
> - * deadproc list later (using the p_hash member), and
> - * wake up the reaper when we do. If this is the last
> - * thread of a process that isn't PS_NOZOMBIE, we'll put
> - * the process on the zombprocess list below.
> + * it up won't work.
> */
> /*
> * NOTE: WE ARE NO LONGER ALLOWED TO SLEEP!
> @@ -270,7 +292,7 @@ exit1(struct proc *p, int xexit, int xsi
> LIST_REMOVE(p, p_hash);
> LIST_REMOVE(p, p_list);
>
> - if ((p->p_flag & P_THREAD) == 0) {
> + if (lastthread) {
> LIST_REMOVE(pr, ps_hash);
> LIST_REMOVE(pr, ps_list);
>
> @@ -291,7 +313,7 @@ exit1(struct proc *p, int xexit, int xsi
> */
> qr = LIST_FIRST(&pr->ps_children);
> if (qr) /* only need this if any child is S_ZOMB */
> - wakeup(initprocess);
> + wakeinit = 1;
> for (; qr != NULL; qr = nqr) {
> nqr = LIST_NEXT(qr, ps_sibling);
> /*
> @@ -340,7 +362,7 @@ exit1(struct proc *p, int xexit, int xsi
> */
> p->p_pctcpu = 0;
>
> - if ((p->p_flag & P_THREAD) == 0) {
> + if (lastthread) {
> /*
> * Final thread has died, so add on our children's rusage
> * and calculate the total times.
> @@ -361,32 +383,36 @@ exit1(struct proc *p, int xexit, int xsi
> if (pr->ps_flags & PS_NOZOMBIE) {
> struct process *ppr = pr->ps_pptr;
> process_reparent(pr, initprocess);
> + wakeinit = 1;
> atomic_setbits_int(&ppr->ps_flags, PS_WAITEVENT);
> wakeup(ppr);
> + } else {
> + /* Process is now a true zombie. */
> + atomic_setbits_int(&pr->ps_flags, PS_ZOMBIE);
> }
> mtx_leave(&pr->ps_mtx);
> - }
>
> - /* just a thread? check if last one standing. */
> - if (p->p_flag & P_THREAD) {
> - /* scheduler_wait_hook(pr->ps_mainproc, p); XXX */
> - mtx_enter(&pr->ps_mtx);
> - pr->ps_exitcnt--;
> - if (pr->ps_threadcnt + pr->ps_exitcnt == 1)
> - wakeup(&pr->ps_threads);
> - mtx_leave(&pr->ps_mtx);
> - }
> + /* Notify listeners of our demise and clean up. */
> + KERNEL_ASSERT_LOCKED();
> + knote_processexit(pr);
> +
> + if (pr->ps_flags & PS_ZOMBIE) {
> + /* Post SIGCHLD and wake up parent. */
> + prsignal(pr->ps_pptr, SIGCHLD);
> + atomic_setbits_int(&pr->ps_pptr->ps_flags,
> + PS_WAITEVENT);
> + wakeup(pr->ps_pptr);
> + }
>
> - /*
> - * Other substructures are freed from reaper and wait().
> - */
> + if (wakeinit)
> + wakeup(initprocess);
> + }
>
> /*
> - * Finally, call machine-dependent code to switch to a new
> - * context (possibly the idle context). Once we are no longer
> - * using the dead process's vmspace and stack, exit2() will be
> - * called to schedule those resources to be released by the
> - * reaper thread.
> + * Finally, call machine-dependent code to free MD per-thread
> + * resources and switch to a new context. Once we are no longer
> + * using the dead process's stack, it will be freed, along with
> + * other substructures from wait().
> *
> * Note that cpu_exit() will end with a call equivalent to
> * cpu_switch(), finishing our execution (pun intended).
> @@ -395,120 +421,27 @@ exit1(struct proc *p, int xexit, int xsi
> panic("cpu_exit returned");
> }
>
> -/*
> - * Locking of this proclist is special; it's accessed in a
> - * critical section of process exit, and thus locking it can't
> - * modify interrupt state. We use a simple spin lock for this
> - * proclist. We use the p_hash member to linkup to deadproc.
> - */
> -struct mutex deadproc_mutex =
> - MUTEX_INITIALIZER_FLAGS(IPL_NONE, "deadproc", MTX_NOWITNESS);
> -struct proclist deadproc = LIST_HEAD_INITIALIZER(deadproc);
> -
> -/*
> - * We are called from sched_idle() once it is safe to schedule the
> - * dead process's resources to be freed. So this is not allowed to sleep.
> - *
> - * We lock the deadproc list, place the proc on that list (using
> - * the p_hash member), and wake up the reaper.
> - */
> -void
> -exit2(struct proc *p)
> -{
> - /* account the remainder of time spent in exit1() */
> - mtx_enter(&p->p_p->ps_mtx);
> - tuagg_add_process(p->p_p, p);
> - mtx_leave(&p->p_p->ps_mtx);
> -
> - mtx_enter(&deadproc_mutex);
> - LIST_INSERT_HEAD(&deadproc, p, p_hash);
> - mtx_leave(&deadproc_mutex);
> -
> - wakeup(&deadproc);
> -}
> -
> void
> proc_free(struct proc *p)
> {
> + uvm_uarea_free(p);
> + p->p_vmspace = NULL; /* zap the thread's copy */
> +
> crfree(p->p_ucred);
> pool_put(&proc_pool, p);
> atomic_dec_int(&nthreads);
> }
>
> -/*
> - * Process reaper. This is run by a kernel thread to free the resources
> - * of a dead process. Once the resources are free, the process becomes
> - * a zombie, and the parent is allowed to read the undead's status.
> - */
> -void
> -reaper(void *arg)
> -{
> - struct proc *p;
> -
> - KERNEL_UNLOCK();
> -
> - SCHED_ASSERT_UNLOCKED();
> -
> - for (;;) {
> - mtx_enter(&deadproc_mutex);
> - while ((p = LIST_FIRST(&deadproc)) == NULL)
> - msleep_nsec(&deadproc, &deadproc_mutex, PVM, "reaper",
> - INFSLP);
> -
> - /* Remove us from the deadproc list. */
> - LIST_REMOVE(p, p_hash);
> - mtx_leave(&deadproc_mutex);
> -
> - WITNESS_THREAD_EXIT(p);
> -
> - /*
> - * Free the VM resources we're still holding on to.
> - * We must do this from a valid thread because doing
> - * so may block.
> - */
> - uvm_uarea_free(p);
> - p->p_vmspace = NULL; /* zap the thread's copy */
> -
> - if (p->p_flag & P_THREAD) {
> - /* Just a thread */
> - proc_free(p);
> - } else {
> - struct process *pr = p->p_p;
> -
> - /* Release the rest of the process's vmspace */
> - uvm_exit(pr);
> -
> - KERNEL_LOCK();
> - if ((pr->ps_flags & PS_NOZOMBIE) == 0) {
> - /* Process is now a true zombie. */
> - atomic_setbits_int(&pr->ps_flags, PS_ZOMBIE);
> - }
> -
> - /* Notify listeners of our demise and clean up. */
> - knote_processexit(pr);
> -
> - if (pr->ps_flags & PS_ZOMBIE) {
> - /* Post SIGCHLD and wake up parent. */
> - prsignal(pr->ps_pptr, SIGCHLD);
> - atomic_setbits_int(&pr->ps_pptr->ps_flags,
> - PS_WAITEVENT);
> - wakeup(pr->ps_pptr);
> - } else {
> - /* No one will wait for us, just zap it. */
> - process_zap(pr);
> - }
> - KERNEL_UNLOCK();
> - }
> - }
> -}
> -
> int
> dowait6(struct proc *q, idtype_t idtype, id_t id, int *statusp, int options,
> struct rusage *rusage, siginfo_t *info, register_t *retval)
> {
> int nfound;
> struct process *pr;
> - int error;
> + int error, isinit;
> +
> + /* init must look at PS_NOZOMBIE to reap them. */
> + isinit = (curproc->p_p == initprocess);
>
> if (info != NULL)
> memset(info, 0, sizeof(*info));
> @@ -518,7 +451,7 @@ loop:
> nfound = 0;
> LIST_FOREACH(pr, &q->p_p->ps_children, ps_sibling) {
> mtx_enter(&pr->ps_mtx);
> - if ((pr->ps_flags & PS_NOZOMBIE) ||
> + if ((!isinit && (pr->ps_flags & PS_NOZOMBIE)) ||
> (idtype == P_PID && id != pr->ps_pid) ||
> (idtype == P_PGID && id != pr->ps_pgid)) {
> mtx_leave(&pr->ps_mtx);
> @@ -764,7 +697,7 @@ proc_finish_wait(struct proc *waiter, st
> wakeup(tr);
> } else {
> mtx_leave(&pr->ps_mtx);
> - scheduler_wait_hook(waiter, pr->ps_mainproc);
> + scheduler_wait_hook(waiter, TAILQ_FIRST(&pr->ps_threads));
> rup = &waiter->p_p->ps_cru;
> ruadd(rup, pr->ps_ru);
> LIST_REMOVE(pr, ps_list); /* off zombprocess */
> @@ -837,7 +770,9 @@ void
> process_zap(struct process *pr)
> {
> struct vnode *otvp;
> - struct proc *p = pr->ps_mainproc;
> + struct proc *p = TAILQ_FIRST(&pr->ps_threads);
> +
> + TAILQ_REMOVE(&pr->ps_threads, p, p_thr_link);
>
> /*
> * Finally finished with old proc entry.
> @@ -860,7 +795,7 @@ process_zap(struct process *pr)
> if (otvp)
> vrele(otvp);
>
> - KASSERT(pr->ps_threadcnt == 0);
> + KASSERT(pr->ps_threadcnt == 1);
> KASSERT(pr->ps_exitcnt == 1);
> if (pr->ps_ptstat != NULL)
> free(pr->ps_ptstat, M_SUBPROC, sizeof(*pr->ps_ptstat));
> @@ -872,5 +807,6 @@ process_zap(struct process *pr)
> pool_put(&process_pool, pr);
> nprocesses--;
>
> + WITNESS_THREAD_EXIT(p);
> proc_free(p);
> }
> Index: kern/kern_fork.c
> ===================================================================
> RCS file: /cvs/src/sys/kern/kern_fork.c,v
> diff -u -p -r1.270 kern_fork.c
> --- kern/kern_fork.c 14 Apr 2025 09:15:24 -0000 1.270
> +++ kern/kern_fork.c 30 Apr 2025 09:56:57 -0000
> @@ -182,7 +182,6 @@ process_initialize(struct process *pr, s
> refcnt_init(&pr->ps_refcnt);
>
> /* initialize the thread links */
> - pr->ps_mainproc = p;
> TAILQ_INIT(&pr->ps_threads);
> TAILQ_INSERT_TAIL(&pr->ps_threads, p, p_thr_link);
> pr->ps_threadcnt = 1;
> Index: kern/kern_rwlock.c
> ===================================================================
> RCS file: /cvs/src/sys/kern/kern_rwlock.c,v
> diff -u -p -r1.55 kern_rwlock.c
> --- kern/kern_rwlock.c 29 Jan 2025 15:10:09 -0000 1.55
> +++ kern/kern_rwlock.c 30 Apr 2025 14:27:15 -0000
> @@ -232,7 +232,10 @@ rw_do_enter_write(struct rwlock *rwl, in
> if (!ISSET(flags, RW_NOSLEEP))
> WITNESS_CHECKORDER(&rwl->rwl_lock_obj, lop_flags, NULL);
> #endif
> -
> +#ifdef DIAGNOSTIC
> + if (!ISSET(flags, RW_NOSLEEP))
> + assertwaitok();
> +#endif
> owner = rw_cas(&rwl->rwl_owner, 0, self);
> if (owner == 0) {
> /* wow, we won. so easy */
> @@ -351,7 +354,10 @@ rw_do_enter_read(struct rwlock *rwl, int
> if (!ISSET(flags, RW_NOSLEEP))
> WITNESS_CHECKORDER(&rwl->rwl_lock_obj, lop_flags, NULL);
> #endif
> -
> +#ifdef DIAGNOSTIC
> + if (!ISSET(flags, RW_NOSLEEP))
> + assertwaitok();
> +#endif
> owner = rw_cas(&rwl->rwl_owner, 0, RWLOCK_READ_INCR);
> if (owner == 0) {
> /* ermagerd, we won! */
> Index: kern/kern_sched.c
> ===================================================================
> RCS file: /cvs/src/sys/kern/kern_sched.c,v
> diff -u -p -r1.104 kern_sched.c
> --- kern/kern_sched.c 10 Mar 2025 09:28:56 -0000 1.104
> +++ kern/kern_sched.c 1 May 2025 13:01:12 -0000
> @@ -160,17 +160,10 @@ sched_idle(void *v)
>
> while (1) {
> while (!cpu_is_idle(curcpu())) {
> - struct proc *dead;
> -
> SCHED_LOCK();
> p->p_stat = SSLEEP;
> mi_switch();
> SCHED_UNLOCK();
> -
> - while ((dead = LIST_FIRST(&spc->spc_deadproc))) {
> - LIST_REMOVE(dead, p_hash);
> - exit2(dead);
> - }
> }
>
> splassert(IPL_NONE);
> @@ -212,10 +205,6 @@ sched_idle(void *v)
> void
> sched_exit(struct proc *p)
> {
> - struct schedstate_percpu *spc = &curcpu()->ci_schedstate;
> -
> - LIST_INSERT_HEAD(&spc->spc_deadproc, p, p_hash);
> -
> tuagg_add_runtime();
>
> KERNEL_ASSERT_LOCKED();
> Index: kern/kern_sig.c
> ===================================================================
> RCS file: /cvs/src/sys/kern/kern_sig.c,v
> diff -u -p -r1.364 kern_sig.c
> --- kern/kern_sig.c 10 Mar 2025 09:28:56 -0000 1.364
> +++ kern/kern_sig.c 29 Apr 2025 09:21:10 -0000
> @@ -1602,7 +1602,7 @@ process_stop(struct process *pr, int fla
> KASSERT(ISSET(p->p_flag, P_SUSPSIG | P_SUSPSINGLE) == 0);
> }
>
> - pr->ps_suspendcnt = pr->ps_threadcnt;
> + pr->ps_suspendcnt = (pr->ps_threadcnt - pr->ps_exitcnt);
> TAILQ_FOREACH(q, &pr->ps_threads, p_thr_link) {
> if (q == p)
> continue;
> Index: kern/kern_sysctl.c
> ===================================================================
> RCS file: /cvs/src/sys/kern/kern_sysctl.c,v
> diff -u -p -r1.465 kern_sysctl.c
> --- kern/kern_sysctl.c 27 Apr 2025 00:58:55 -0000 1.465
> +++ kern/kern_sysctl.c 30 Apr 2025 09:57:00 -0000
> @@ -2018,7 +2018,7 @@ fill_kproc(struct process *pr, struct ki
>
> isthread = p != NULL;
> if (!isthread) {
> - p = pr->ps_mainproc; /* XXX */
> + p = TAILQ_FIRST(&pr->ps_threads);
> tuagg_get_process(&tu, pr);
> } else
> tuagg_get_proc(&tu, p);
> Index: kern/subr_xxx.c
> ===================================================================
> RCS file: /cvs/src/sys/kern/subr_xxx.c,v
> diff -u -p -r1.17 subr_xxx.c
> --- kern/subr_xxx.c 17 May 2019 03:53:08 -0000 1.17
> +++ kern/subr_xxx.c 30 Apr 2025 13:30:16 -0000
> @@ -40,6 +40,7 @@
> #include <sys/systm.h>
> #include <sys/conf.h>
> #include <sys/smr.h>
> +#include <sys/proc.h>
>
>
> /*
> @@ -163,7 +164,8 @@ assertwaitok(void)
> SMR_ASSERT_NONCRITICAL();
> #ifdef DIAGNOSTIC
> if (curcpu()->ci_mutex_level != 0)
> - panic("assertwaitok: non-zero mutex count: %d",
> - curcpu()->ci_mutex_level);
> + panic("non-zero mutex count: %d", curcpu()->ci_mutex_level);
> + if (!cold && curproc->p_stat == SDEAD)
> + panic("deads don't sleep");
> #endif
> }
> Index: uvm/uvm_glue.c
> ===================================================================
> RCS file: /cvs/src/sys/uvm/uvm_glue.c,v
> diff -u -p -r1.88 uvm_glue.c
> --- uvm/uvm_glue.c 21 Mar 2025 13:19:33 -0000 1.88
> +++ uvm/uvm_glue.c 1 May 2025 13:12:08 -0000
> @@ -287,13 +287,25 @@ uvm_uarea_free(struct proc *p)
>
> /*
> * uvm_exit: exit a virtual address space
> + *
> + * - borrow process0's address space to free the vmspace and pmap
> + * of the dead process.
> */
> void
> uvm_exit(struct process *pr)
> {
> struct vmspace *vm = pr->ps_vmspace;
> + int s;
> +
> + KERNEL_ASSERT_UNLOCKED();
> + KASSERT(curproc->p_p == pr);
> +
> + s = intr_disable();
> + pmap_deactivate(curproc);
> + curproc->p_vmspace = pr->ps_vmspace = process0.ps_vmspace;
> + pmap_activate(curproc);
> + intr_restore(s);
>
> - pr->ps_vmspace = NULL;
> uvmspace_free(vm);
> }
>
> Index: sys/proc.h
> ===================================================================
> RCS file: /cvs/src/sys/sys/proc.h,v
> diff -u -p -r1.387 proc.h
> --- sys/proc.h 2 May 2025 05:04:38 -0000 1.387
> +++ sys/proc.h 2 May 2025 10:27:29 -0000
> @@ -154,12 +154,6 @@ struct pinsyscall {
> struct process {
> struct refcnt ps_refcnt;
>
> - /*
> - * ps_mainproc is the original thread in the process.
> - * It's only still special for the handling of
> - * some signal and ptrace behaviors that need to be fixed.
> - */
> - struct proc *ps_mainproc;
> struct ucred *ps_ucred; /* Process owner's identity. */
>
> LIST_ENTRY(process) ps_list; /* List of all processes. */
> @@ -544,7 +538,6 @@ extern struct processlist zombprocess; /
> extern struct proclist allproc; /* List of all threads. */
>
> extern struct process *initprocess; /* Process slot for init. */
> -extern struct proc *reaperproc; /* Thread slot for reaper. */
> extern struct proc *syncerproc; /* filesystem syncer daemon */
>
> extern struct pool process_pool; /* memory pool for processes */
> @@ -578,7 +571,6 @@ void setrunnable(struct proc *);
> void endtsleep(void *);
> int wakeup_proc(struct proc *);
> void unsleep(struct proc *);
> -void reaper(void *);
> __dead void exit1(struct proc *, int, int, int);
> void exit2(struct proc *);
> void cpu_fork(struct proc *_curp, struct proc *_child, void *_stack,
>
>