Index | Thread | Search

From:
Claudio Jeker <cjeker@diehard.n-r-g.com>
Subject:
Re: Another attempt to get rid of the reaper
To:
tech@openbsd.org
Date:
Mon, 13 Oct 2025 16:48:59 +0200

Download raw body.

Thread
On Fri, Oct 10, 2025 at 12:10:56PM +0200, Martin Pieuchot wrote:
> On 08/10/25(Wed) 17:07, Claudio Jeker wrote:
> > On Thu, Sep 18, 2025 at 05:45:46PM +0200, Martin Pieuchot wrote:
> > [...]  
> > > On 16/09/25(Tue) 11:34, Claudio Jeker wrote:
> > > > On Sun, Sep 14, 2025 at 10:36:51PM +0200, Christian Ludwig wrote:
> > > > > Hi,
> > > > > 
> > > > > this is another attempt to get rid of the dedicated reaper thread.
> > > > 
> > > > Why is this a goal? What problem are you trying to solve with this?
> > > 
> > > I don't know what are Christian goals.  Here are the ones, I believe,
> > > we will all benefit from:
> > > 
> > > - The first goal is to ensure userland processes pay the price for their
> > > cleanup.  When this is not possible the best option is to make parents
> > > pay for their children.
> > 
> > I think by moving most into exit1() this goal is achieved. 
> 
> No it's not.  You're just not interested or not curious.

Don't acuse me of being not interested, not curious or blind. You should
know better, I shared a lot of profiling data with you.  I run a lot of
profiling and I'm probably one of the few people that looks at lltrace
output to better understand what is actually going on in detail.

Also I did a fair amount of cleanup to get the accounting more accurate.
I just think there is no need to get accounting 100% right and introduce
crazy hoops for doing that. There are many small accounting misstakes in
the kernel and as long as the effect is small and visible it does not
really matter.

The reaper does a lot less work now that uvm_purge is in place.
We want to reduce the latency between _exit(2) and the time the parent is
notified. That is the important step. This has always been the important
bit, signal/wakeup ppid as soon as possible.
 
> >                                                            It only makes
> > sense to have parents pay the price if parent are actually around to
> > collect that penalty quickly. The problem is they are not.
> 
> You're wrong and/or blind.  It you'd want to see the problem you'd ask
> for examples.  But you're not interested.  You continue arguing without
> looking at what's happening.  Just go in /usr/src are run make includes.
> You'll see the reaper is doing non negligible work which should be charged
> to the exiting process or its parent.

I looked at my data and there the reaper is doing very little work apart
from spinning on the kernel lock. So no, I did look at what is going on.
I am very well aware of that and I did look into that a lot to understand
where the current wall is that we hit.

The problem is not the reaper the problem is the kernel lock used in the
reaper. Plus a few other locks on top of this. Since you don't trust me
have a look at the data at the end of this mail.
 
> > > - Another goal is to reduce latency at exit by removing unnecessary context
> > > switches and contention.
> > 
> > This is a dumb argument. The number one reason for context switches right
> > now are rwlocks especially uobjlk, kmmaplk and a few other uvm related rw
> > locks. Also by moving the signaling of the parent into exit1() none of
> > those context switches matter.
> 
> You're mixing everything, I'm so tired.  This has nothing do to with
> UVM, rwlocks the SCHED_LOCK() or the KERNEL_LOCK().

No, it has very much todo with this. You want to reduce context switches
in the exit path. You think there are 2 extra switches there that hurt us
badly.

Now lets look at the data:
A kernel build with -j 32 on a 32 CPU machine:
Before: kern.forkstat.forks=767
After:  kern.forkstat.forks=5780
We did 5000 forks and the same amount of exits. So you think there are
10'000 context switches that matter and need to be optimised away.

The problem I have with this is that during that build we did:
   4162656  voluntary context switches
     20299  involuntary context switches

So over 4mio context switches. So the 10k you try to save is around 0.2%.
So in my book this is not something to run after with extreme force.
 
> I already told you that the problem with uobjlk is fixed by the
> parallel diff that cannot be enabled because of sparc64.  The problem is
> sparc64 which has a broken pmap and nobody care.  Theo doesn't care enough
> to send a bug report to bugs@.  Mark doesn't care to answer my emails
> and discredit my work.
> 
> I had to ask you after weeks to get a bug report.  Since then what
> happen?  Nothing.  The bug is there.  The release will ship with it.
> I even sent a workaround for sparc64 and nobody replied.  So yeah, great
> team work.

Yes, it is shitty team work. You are part of that team.
I have enough going on right now that has higher prio for me than hunting
down this bug. I already invested a lot of time into this and right now I
have no extra spare time to throw at this issue. I know this sucks.
 
> The problem is not uobjlk, the problem is that nobody care and I'm tired
> of arguing with people that do not care and do not want improvements.
> 
> We can't enable parallel faults on sparc64 because its pmap is broken.
> That's not new, it has been there since almost a year.
> 
> This has nothing to do with the reaper.  Nor the SCHED_LOCK.  Please
> stop mixing stuff.

I only said that if you want to make a difference in the number of context
switches that you should look at rwlocks and esp. those that are very
busy. Since those generate a maginitude more context switches then those
in the exit path.
 
> > Not sure which contention you are after but taking away 1 or 2 context
> > switches at process exit will not move the needle.
> 
> Sure, computer are complicated let's go shopping.
> 
> > > - Another goal is to see the cost of ripping processes to help us pick the
> > > right algorithms and not only measure the parts we are interested in. 
> > 
> > I think right now the cost is more visible then adding it into the wait
> > system call. Ideally the cost of uarea free should be minimal (which is
> > not true right now).
> 
> It is obvious the cost is not visible enough.  Your argument proves it.

I know very well how much CPU time the reaper eats. It is clearly visible
but it seems it is better to hide this information into process like
init(8).
 
> > > > In my opinion this diff makes the current exit situation worse. Instead of
> > > > having a clear reaper process that does the cleanup of the proc and
> > > > process we now end up delegating this work to init(8) or the parent
> > > > process. Neither are really ideal to do this work.
> > > 
> > > Please note that the parent process is already doing the cleanup via
> > > process_zap().  This is what we want.  We want all the work not done 
> > > in exit1() to be done in the parent.
> > 
> > Do we really want that? Do we want to have zombies with full uarea etc
> > sitting around?
> 
> Then clean the uarea in exit.  This is a no brainer.

Is it a no brainer? How do rip out the stack your running on without
crashing?
 
> > > > You can not assume the parent will be sitting in wait(2) / dowait6() on
> > > > exit of a child.
> > > 
> > > If this is not a bug its because the parent called sigaction(2) with 
> > > SA_NOCLDWAIT in this case the child will not become a zombie and should
> > > be reaped by init(8). Then we can see how much time init(8) spent ripping
> > > non-zombie processes.
> > 
> > No, not really there are many cases where the parent is doing lots of work
> > and is happy to collect childs once at a later time.
> 
> Fine that's his responsibility.  That's what we want.
> 
> > I have a big issue with pushing work onto init(8). It is not the right
> > process to do that. Especially since we have something else that is better
> > suited to do that. Original unix did that because there was no better
> > option but it is a workaround for a problem that can now be solved by
> > kthreads or tasks.
> 
> There's nothing wrong with that.  init(8) becomes the parent of
> non-zombie process.  That's his purpose.
> 
> > > >                   Actually you can not assume that the parent will ever
> > > > call wait(2) so this change would allow the collection of many fat zombies
> > > > for no good reason.
> > > 
> > > This sentence makes no sense.  Zombies are already collected by their
> > > parents, non-zombies are currently collected by the reaper.  In this
> > > regard the suggested diff doesn't change much.  A single process context
> > > still continue to do the cleanup.
> > 
> > Zombies are only collected by the parents if they call wait(2).
> > So it is a big change, since now uarea is only deleted once that is done
> > and not like before close by the exit.
> 
> Fine, let's clean the uarea in exit1().
> 
> > > If you're worried about parents that do not collect their zombies, this
> > > diff doesn't change anything in that regard and the behavior stays the
> > > same.
> > 
> > It does. See above.
> >  
> > > > Wait(2) only needs a few bit of information (r- and tusage, signal and exit
> > > > information) to work, so things like the uarea really should be removed
> > > > early on and not linger around until the zombie is collected.
> > > 
> > > I agree with that.  I'm all for putting as much as possible in exit1().
> > > 
> > > > The reaper right now is no longer a bottle neck, it uses very CPU little
> > > > time.  I agree that moving more from the reaper into exit1() is a good
> > > > thing, like the wakeup signaling to the parent process. But I think this
> > > > goes a few steps to far and introduces complex problems for very little
> > > > benefit.
> > > 
> > > I disagree with you on both points.  The reaper is still a bottle neck
> > > for many use cases.  I'd happy to explain further if you're interested.
> > 
> > Very much. Right now I only see the reaper because it hammers locks that
> > any other process would do so as well. That needs to be fixed anyway.
> 
> And the way to fix it is to get rid of it.
> 
> > > IMHO Christian's work is awesome.  It is going in the right direction.
> > > Yes there are some points that can be improved but this is much better
> > > than what we currently have.
> > > 
> > > > For me the reaper thread by itself is not an issue, it helps to finish up
> > > > the tricky bits of cleanup on exit quickly.
> > > 
> > > The reaper is an issue because:
> > > 
> > > - it is a bottleneck in the signaling of dead processes which adds latency
> > 
> > Agreed, I already mentioned multiple times that this is the important but
> > and it should be moved into exit1() to reduce latency to a minimum.
> > 
> > > - it requires extra context switches which add extra contention and latency
> > 
> > I very much doubt this. If the signaling is done in exit1() most will
> > work.
> 
> How can you doubt a fact?  Just go read the code.

You think it is a fact, but it is not. The numbers show it clearly that
you are running after a ghost.
 
> > > - it is one of the reasons OpenBSD %sys time is so high
> > 
> > Ha ha ha ha. Moving it to dowait6() will keep the %sys at the same level.
> > This code is still run in the kernel.
> 
> No it won't.
> 
> > > - it makes us blind to the cost of reaping processes (hint rb-tree)
> > 
> > We already reap most of the heave bits (uvm_purge) in exit1().
> > Is disposing of an empty vmspace still such a heavy operation?
> 
> Of course it is.  That's why I sent you my diff to get rid of the
> rb-tree.

From my understanding all the heavy rb-tree work should be done in exit1()
and not delayed for later. Now I agree there is things happening in the
reaper that are not quite right yet.

There is a lot in uvm and pmap that need to be looked at and especially in
the boundary between the two.
 
> > > - all of that prevents tools like dpb(1) which gather data about
> > >   processes execution to do a better job
> >  
> > dpb(1) is primarily file system bound. It can not scale because it hammers
> > the disks and VFS so hard that everything spins.
> 
> Great nihilistic argument to not do anything.

No, my argument is that if you want to improve dpb build time then you
would look into VFS. Since this is the wall dpb runs into. It is not the
reaper. It is not exit1, it is not the parent signaling. It is the
filesystem access that causes the high spin.

So if you want to improve dpb time you would look into different things
first.
 
> > > If you aren't convinced after reading all of this I'm available to
> > > answer your questions.  For me there's no doubt the reaper should go
> > > away.  Now I'm well aware this is not the time to push forward.
> > 
> > I'm only partially convinced.
> 
> Too bad.
> 
> >                               As I said, lets move the signaling of the
> > parent to exit1() as a first step. Don't over do it.
> 
> Once the uarea is freed and the signaling is done there's nothing left
> in the reaper but the overhead of context switching.

In my book the reaper would just reduce the process to its minimal shell
which is the disposal of the uarea and vmspace. It does that in a rather
simple and clean way.
If those things are moved up then sure we don't need a reaper. I see no
real way to clean out the vmspace in exit1() since we need that space to
run at that point. At least I did not see an simple way that is not a
total hack.

> >                                                      Don't abuse init(8).
> 
> There is no abuse.  init is already the parent of those processes and is
> designed to do just that.
 
I disagree with you on that. I think init should not be overloaded.
It is not a dumping ground to clean up firefox's threads on pthread_exit.

-- 
:wq Claudio

Reaper samplings during a make -j 32 of GENERIC.MP
dur is duration in nanoseconds.

Stopped	496.6ms
Running	4.384ms
dur		category	arg_set_id__symbol__	arg_set_id__caller__
12417639	rwlock		kernel_map_store	vm_map_lock_ln
1751151		kernel		kernel_lock		reaper
204509		mutex		sched_lock		sleep_setup
94644		mutex		sched_lock		wakeup_n
83850		mutex		sched_lock		sleep_finish
29967		mutex		sched_lock		ptsignal_locked
24174		mutex		uvm			uvm_pmr_freepageq

Stopped	450.6ms
Running	3.236ms
dur		category	arg_set_id__symbol__	arg_set_id__caller__
12944890	rwlock		kernel_map_store	vm_map_lock_ln
169790		mutex		sched_lock		sleep_setup
155688		kernel		kernel_lock		reaper
68502		mutex		sched_lock		ptsignal_locked
59293		mutex		sched_lock		wakeup_n
54204		mutex		sched_lock		sleep_finish
31847		kernel		kernel_lock		intr_handler
28620		mutex		uvm			uvm_pmr_freepageq
1603		mutex		uvm_map_entry_kmem_pool	pool_get

Stopped	388.9ms
Running	3.868ms
dur		category	arg_set_id__symbol__	arg_set_id__caller__
2911007		kernel		kernel_lock		reaper
10603		mutex		uvm			uvm_pmr_freepageq
9404		mutex		sched_lock		sleep_setup
684		rwlock		kernel_map_store	vm_map_lock_ln

Stopped	669.4ms
Running	7.508ms
dur		category	arg_set_id__symbol__	arg_set_id__caller__
6552644		kernel		kernel_lock		reaper
21203		mutex		uvm			uvm_pmr_freepageq
3869		rwlock		kernel_map_store	vm_map_lock_ln
1475		mutex		sched_lock		wakeup_n

Stopped	310.2ms
Running	2.780ms

dur		category	arg_set_id__symbol__	arg_set_id__caller__
1725415		kernel		kernel_lock		reaper
132476		kernel		kernel_lock		intr_handler
41807		mutex		uvm			uvm_pmr_freepageq
25029		rwlock		kernel_map_store	vm_map_lock_ln
8827		mutex		sched_lock		sleep_finish
2458		mutex		sched_lock		sleep_setup