From: Claudio Jeker <cjeker@diehard.n-r-g.com>
Subject: Re: improve UVM performance on unmap
To: tech@openbsd.org
Date: Mon, 30 Mar 2026 13:36:52 +0200

On Fri, Mar 27, 2026 at 02:03:44PM +0100, Claudio Jeker wrote:
> While working on prometheus I had a version where the daemon did a
> munmap(2) and mmap(2) for every write to that memory map. This resulted in
> high spin and system time.
> 
> Out of curiosity I used dtrace kprofile and it pointed straight at a very
> inefficient use of uvm_pagelookup(), e.g.:
> 
> uvm_objtree_RBT_COMPARE+0xb
> uvm_pagelookup+0x3e
> uvn_flush+0x17e
> uvn_detach+0x7e
> uvm_unmap_detach+0xf1
> sys_munmap+0x185
> syscall+0x5f9
> Xsyscall+0x128
> kernel
> 
> So uvn_flush() is very inefficient at unmapping the file and burns most of
> its time in uvm_pagelookup().
> 
> There are a few UVM functions that roughly do:
> 
> 	for (curoff = start ; curoff < end ; curoff += PAGE_SIZE) {
> 		pp = uvm_pagelookup(uobj, curoff);
> 		if (pp == NULL)
> 			continue;
> 		...
> 	}
> 
> This is a very expensive way to write something that is almost an
> RB_FOREACH(). I went and implemented uvm_pagerangefirst() and
> uvm_pagerangenext() to build an iterator that is more efficient.
> As usual with these iterators, if the page is removed from the RB tree the
> code needs to prefetch the next element, similar to how RB_FOREACH_SAFE()
> works.
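
A minimal userland sketch of the two loop shapes described above, not the
code from the diff. A sorted linked list stands in for the RB tree, and
page_range_first(), page_range_next(), remove_by_lookup() and
remove_by_iter() are hypothetical names chosen for illustration; the real
uvm_pagerangefirst() and uvm_pagerangenext() signatures are not shown in
this mail and may differ.

```c
#include <assert.h>
#include <stdlib.h>

#define PAGE_SIZE 4096UL

/* Hypothetical stand-ins for struct vm_page and struct uvm_object. */
struct page {
	unsigned long	 offset;
	struct page	*next;		/* sorted by offset, models RB successor */
};
struct object {
	struct page	*pages;		/* sorted list standing in for the RB tree */
	unsigned long	 lookups;	/* number of page_lookup() calls */
};

/* Insert a page, keeping the list sorted by offset. */
static void
page_insert(struct object *obj, unsigned long off)
{
	struct page **pp, *pg = malloc(sizeof(*pg));

	assert(pg != NULL);
	for (pp = &obj->pages; *pp != NULL && (*pp)->offset < off;
	    pp = &(*pp)->next)
		continue;
	pg->offset = off;
	pg->next = *pp;
	*pp = pg;
}

/* Unlink and free a page that is known to be in the object. */
static void
page_remove(struct object *obj, struct page *pg)
{
	struct page **pp;

	for (pp = &obj->pages; *pp != pg; pp = &(*pp)->next)
		assert(*pp != NULL);
	*pp = pg->next;
	free(pg);
}

/* First resident page with start <= offset < end, or NULL. */
static struct page *
page_range_first(struct object *obj, unsigned long start, unsigned long end)
{
	struct page *pg;

	for (pg = obj->pages; pg != NULL && pg->offset < start; pg = pg->next)
		continue;
	return (pg != NULL && pg->offset < end) ? pg : NULL;
}

/* Successor of pg if it is still below end, else NULL. */
static struct page *
page_range_next(struct page *pg, unsigned long end)
{
	pg = pg->next;
	return (pg != NULL && pg->offset < end) ? pg : NULL;
}

/* Exact-match lookup, modelling uvm_pagelookup(). */
static struct page *
page_lookup(struct object *obj, unsigned long off)
{
	obj->lookups++;
	return page_range_first(obj, off, off + 1);
}

/* Old pattern: one lookup per page-sized offset in [start, end). */
static int
remove_by_lookup(struct object *obj, unsigned long start, unsigned long end)
{
	struct page *pg;
	unsigned long off;
	int n = 0;

	for (off = start; off < end; off += PAGE_SIZE) {
		if ((pg = page_lookup(obj, off)) == NULL)
			continue;
		page_remove(obj, pg);
		n++;
	}
	return n;
}

/* Iterator pattern: visit only resident pages, prefetching the next one. */
static int
remove_by_iter(struct object *obj, unsigned long start, unsigned long end)
{
	struct page *pg, *next;
	int n = 0;

	for (pg = page_range_first(obj, start, end); pg != NULL; pg = next) {
		next = page_range_next(pg, end);	/* before pg is freed */
		page_remove(obj, pg);
		n++;
	}
	return n;
}
```

In this model, a sparse object with three resident pages in a 1024-page
range costs remove_by_lookup() 1024 lookups, while remove_by_iter() only
touches the three resident pages.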
> 
> On top of that we have the following issue:
> If there is a sleep point (which releases the object lock), prefetching is
> no longer enough. Once we release the lock someone else can modify the
> tree and the page we cached. So after such a sleep the loop needs to be
> restarted with a new start point by calling uvm_pagerangefirst().
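
A minimal userland sketch of the restart-after-sleep rule, again not the
code from the diff. The sorted list, the busy flag, and the names
page_range_first(), page_range_next(), wait_for_page() and flush_range()
are all assumptions made for illustration. The "sleep" is simulated by a
function that frees the successor of the busy page, which is exactly the
page a prefetch would have cached.

```c
#include <assert.h>
#include <stdlib.h>

#define PAGE_SIZE 4096UL

/* Hypothetical stand-ins for struct vm_page and struct uvm_object. */
struct page {
	unsigned long	 offset;
	int		 busy;		/* models PG_BUSY */
	struct page	*next;		/* sorted by offset */
};
struct object {
	struct page	*pages;		/* sorted list standing in for the RB tree */
};

/* Insert a page, keeping the list sorted by offset. */
static void
page_insert(struct object *obj, unsigned long off, int busy)
{
	struct page **pp, *pg = malloc(sizeof(*pg));

	assert(pg != NULL);
	for (pp = &obj->pages; *pp != NULL && (*pp)->offset < off;
	    pp = &(*pp)->next)
		continue;
	pg->offset = off;
	pg->busy = busy;
	pg->next = *pp;
	*pp = pg;
}

/* Unlink and free a page that is known to be in the object. */
static void
page_remove(struct object *obj, struct page *pg)
{
	struct page **pp;

	for (pp = &obj->pages; *pp != pg; pp = &(*pp)->next)
		assert(*pp != NULL);
	*pp = pg->next;
	free(pg);
}

/* First resident page with start <= offset < end, or NULL. */
static struct page *
page_range_first(struct object *obj, unsigned long start, unsigned long end)
{
	struct page *pg;

	for (pg = obj->pages; pg != NULL && pg->offset < start; pg = pg->next)
		continue;
	return (pg != NULL && pg->offset < end) ? pg : NULL;
}

/* Successor of pg if it is still below end, else NULL. */
static struct page *
page_range_next(struct page *pg, unsigned long end)
{
	pg = pg->next;
	return (pg != NULL && pg->offset < end) ? pg : NULL;
}

/*
 * Models sleeping on a busy page. The object lock is dropped while asleep,
 * so a concurrent thread is simulated here: it frees the successor of pg
 * and then the busy flag is cleared.
 */
static void
wait_for_page(struct object *obj, struct page *pg)
{
	if (pg->next != NULL)
		page_remove(obj, pg->next);
	pg->busy = 0;
}

/* Remove all pages in [start, end), restarting after every sleep. */
static int
flush_range(struct object *obj, unsigned long start, unsigned long end)
{
	struct page *pg, *next;
	unsigned long off;
	int n = 0;

	for (pg = page_range_first(obj, start, end); pg != NULL; pg = next) {
		if (pg->busy) {
			off = pg->offset;	/* keep the offset, not pointers */
			wait_for_page(obj, pg);
			/* Any cached page may be stale now: start over. */
			next = page_range_first(obj, off, end);
			continue;
		}
		next = page_range_next(pg, end);	/* before pg is freed */
		page_remove(obj, pg);
		n++;
	}
	return n;
}
```

The design point is that only the offset survives the sleep. With four
pages where the second one is busy, flush_range() removes three pages
itself and never dereferences the page that was freed while it slept.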
> 
> This is what uvm_km_pgremove() does for the PG_BUSY case. uao_flush() is
> very similar to uvm_km_pgremove() but was written in a very strange way.
> 
> uvm_obj_unwire() is the most trivial conversion since it does not alter
> the RB tree at all.
> 
> Finally uvn_flush(): that function is a beast. The clustered pageout
> alters the tree as well, so any pageout requires a restart, just like the
> PG_BUSY sleep.
> 
> For reference here are the GENERIC.MP build times for make -j 16 and
> make -j 32 on my amd64 box:
> 
> Before:
> build with 16 jobs (smt = 0)
> run 1 98.94 real 773.49 user 468.25 sys
> run 2 98.31 real 773.19 user 462.84 sys
> run 3 98.49 real 771.10 user 466.63 sys
> run 4 98.60 real 773.76 user 464.28 sys
> run 5 98.52 real 770.54 user 468.98 sys
> avg over 5 runs: 98.572 real 772.416 user 466.196 sys
> build with 32 jobs (smt = 1)
> run 1 102.44 real 1001.73 user 1471.51 sys
> run 2 102.56 real 999.52 user 1474.85 sys
> run 3 102.23 real 999.06 user 1467.81 sys
> run 4 102.30 real 1000.40 user 1468.36 sys
> run 5 102.44 real 1001.60 user 1474.12 sys
> avg over 5 runs: 102.394 real 1000.46 user 1471.33 sys
> 
> With diff:
> build with 16 jobs (smt = 0)
> run 1 95.79 real 769.67 user 436.77 sys
> run 2 95.97 real 770.48 user 435.73 sys
> run 3 95.73 real 769.60 user 436.77 sys
> run 4 95.65 real 769.33 user 433.68 sys
> run 5 96.29 real 770.49 user 435.78 sys
> avg over 5 runs: 95.886 real 769.914 user 435.746 sys
> build with 32 jobs (smt = 1)
> run 1 93.70 real 1018.68 user 1250.32 sys
> run 2 93.94 real 1014.58 user 1261.76 sys
> run 3 93.89 real 1015.09 user 1260.52 sys
> run 4 93.65 real 1016.37 user 1253.42 sys
> run 5 94.04 real 1014.79 user 1262.75 sys
> avg over 5 runs: 93.844 real 1015.9 user 1257.75 sys
> 
> With the diff and 16 jobs the build time is about 3sec faster and around
> 30sec (~7%) of system time is saved.
> For 32 jobs on 32 CPUs the results are even better. Real time drops by
> about 9sec (~8%) and system time drops by >200sec (~14.5%).
> 
> uvn_flush() is a big abuser of the page queue lock, so making that
> loop better reduces contention and therefore helps to reduce spin time on
> one of the busiest mutexes in the system. The page queue lock usage in
> uvn_flush() has a lot of bad smell. This is something to look into in a
> future diff.

sthen@ ran this on his i386 ports build cluster and hit various problems.
It is certain that the uvm_vnode.c change is not quite right but even the
other code paths seem to have some issue when the system is swapping.

I will shelve this for now as I don't have time to properly dig into this.
If anyone wants to debug this, be my guest.

-- 
:wq Claudio