From: Claudio Jeker
Subject: Re: improve UVM performance on unmap
To: tech@openbsd.org
Date: Mon, 30 Mar 2026 13:36:52 +0200

On Fri, Mar 27, 2026 at 02:03:44PM +0100, Claudio Jeker wrote:
> While working on prometheus I had a version where the daemon did a
> munmap(2) and mmap(2) for every write to that memory map. This resulted in
> high spin and system time.
>
> Out of curiosity I used dtrace kprofile and it pointed the problem
> straight at a very inefficient use of uvm_pagelookup(), e.g.:
>
>     uvm_objtree_RBT_COMPARE+0xb
>     uvm_pagelookup+0x3e
>     uvn_flush+0x17e
>     uvn_detach+0x7e
>     uvm_unmap_detach+0xf1
>     sys_munmap+0x185
>     syscall+0x5f9
>     Xsyscall+0x128
>     kernel
>
> So uvn_flush() is very inefficient at unmapping the file and burns most of
> its time in uvm_pagelookup().
>
> There are a few UVM functions that roughly do:
>
>     for (curoff = start ; curoff < end ; curoff += PAGE_SIZE) {
>             pp = uvm_pagelookup(uobj, curoff);
>             if (pp == NULL)
>                     continue;
>             ...
>     }
>
> This is a very expensive way to write something that is almost an
> RB_FOREACH(). I went and implemented uvm_pagerangefirst() and
> uvm_pagerangenext() to build an iterator that is more efficient.
> As usual with these iterators, if the page is removed from the RB tree the
> code needs to prefetch the next element, similar to how RB_FOREACH_SAFE()
> works.
>
> On top of that we have the following issue:
> If there is a sleep point (which releases the object lock) prefetching is
> no longer enough. Once we release the lock someone else can modify the
> tree and the page we cached. So after such a sleep the loop needs to be
> restarted with a new start point by calling uvm_pagerangefirst().
>
> This is what uvm_km_pgremove() does for the PG_BUSY case. uao_flush() is
> very similar to uvm_km_pgremove() but was written in a very strange way.
>
> uvm_obj_unwire() is the most trivial conversion since it does not alter
> the RB tree at all.
>
> Finally uvn_flush(): that function is a beast, and the clustered pageout
> alters the tree as well, so any pageout requires a restart like the
> PG_BUSY sleep.
>
> For reference here are the GENERIC.MP build times for make -j 16 and
> make -j 32 on my amd64 box:
>
> Before:
> build with 16 jobs (smt = 0)
> run 1     98.94 real    773.49 user    468.25 sys
> run 2     98.31 real    773.19 user    462.84 sys
> run 3     98.49 real    771.10 user    466.63 sys
> run 4     98.60 real    773.76 user    464.28 sys
> run 5     98.52 real    770.54 user    468.98 sys
> avg over 5 runs: 98.572 real 772.416 user 466.196 sys
> build with 32 jobs (smt = 1)
> run 1    102.44 real   1001.73 user   1471.51 sys
> run 2    102.56 real    999.52 user   1474.85 sys
> run 3    102.23 real    999.06 user   1467.81 sys
> run 4    102.30 real   1000.40 user   1468.36 sys
> run 5    102.44 real   1001.60 user   1474.12 sys
> avg over 5 runs: 102.394 real 1000.46 user 1471.33 sys
>
> With diff:
> build with 16 jobs (smt = 0)
> run 1     95.79 real    769.67 user    436.77 sys
> run 2     95.97 real    770.48 user    435.73 sys
> run 3     95.73 real    769.60 user    436.77 sys
> run 4     95.65 real    769.33 user    433.68 sys
> run 5     96.29 real    770.49 user    435.78 sys
> avg over 5 runs: 95.886 real 769.914 user 435.746 sys
> build with 32 jobs (smt = 1)
> run 1     93.70 real   1018.68 user   1250.32 sys
> run 2     93.94 real   1014.58 user   1261.76 sys
> run 3     93.89 real   1015.09 user   1260.52 sys
> run 4     93.65 real   1016.37 user   1253.42 sys
> run 5     94.04 real   1014.79 user   1262.75 sys
> avg over 5 runs: 93.844 real 1015.9 user 1257.75 sys
>
> With the diff and 16 jobs the build time is about 3sec faster and around
> 30sec (~7%) of system time is saved.
> For 32 jobs on 32 CPUs the results are even better. Real time drops by
> about 9sec (~8%) and system time drops by >200sec (~15%).
>
> uvn_flush() is a big abuser of the page queue lock, so making that
> loop better reduces contention and therefore helps to reduce spin time on
> one of the busiest mutexes in the system.
> The page queue lock usage in uvn_flush() has a lot of bad smells.
> This is something to look into in a future diff.

sthen@ ran this on his i386 ports build cluster and hit various problems.
It is certain that the uvm_vnode.c change is not quite right, but even the
other code paths seem to have some issue when the system is swapping.

I will shelve this for now as I don't have time to properly dig into this.
If anyone wants to debug this, be my guest.

-- 
:wq Claudio
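
The range iterator described in the quoted mail can be sketched outside the
kernel. What follows is a minimal userland sketch under stated assumptions,
not the actual diff: a sorted linked list stands in for the per-object RB
tree, and struct page, struct object, page_range_first(), page_range_next()
and range_remove() are hypothetical names made up for illustration. It only
shows the prefetch-the-next-element step; the restart that is needed after a
sleep drops the object lock is not modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define DEMO_PAGE_SIZE 4096UL	/* assumption: illustrative page size */

/* Hypothetical page; a sorted list replaces the kernel's RB tree. */
struct page {
	unsigned long	 offset;
	struct page	*next;	/* next resident page by ascending offset */
};

struct object {
	struct page	*head;
};

/* Insert a page, keeping the list sorted by offset. Returns 0 or -1. */
static int
page_insert(struct object *obj, unsigned long off)
{
	struct page **pp, *pg;

	if ((pg = malloc(sizeof(*pg))) == NULL)
		return -1;
	pg->offset = off;
	for (pp = &obj->head; *pp != NULL && (*pp)->offset < off;
	    pp = &(*pp)->next)
		continue;
	pg->next = *pp;
	*pp = pg;
	return 0;
}

/* Unlink and free one page. */
static void
page_remove(struct object *obj, struct page *pg)
{
	struct page **pp;

	for (pp = &obj->head; *pp != NULL; pp = &(*pp)->next) {
		if (*pp == pg) {
			*pp = pg->next;
			free(pg);
			return;
		}
	}
}

/* First resident page with start <= offset < end, or NULL. */
static struct page *
page_range_first(struct object *obj, unsigned long start, unsigned long end)
{
	struct page *pg;

	assert(start <= end);
	for (pg = obj->head; pg != NULL && pg->offset < start; pg = pg->next)
		continue;
	return (pg != NULL && pg->offset < end) ? pg : NULL;
}

/* Successor of pg if it is still below end, or NULL. */
static struct page *
page_range_next(struct page *pg, unsigned long end)
{
	pg = pg->next;
	return (pg != NULL && pg->offset < end) ? pg : NULL;
}

/* Remove every resident page in [start, end); returns how many. */
static int
range_remove(struct object *obj, unsigned long start, unsigned long end)
{
	struct page *pg, *npg;
	int n = 0;

	for (pg = page_range_first(obj, start, end); pg != NULL; pg = npg) {
		/* Prefetch the successor before pg is unlinked and freed. */
		npg = page_range_next(pg, end);
		page_remove(obj, pg);
		n++;
	}
	return n;
}
```

With resident pages at offsets 0, 3, 4 and 1000 (in units of DEMO_PAGE_SIZE),
range_remove(&obj, 1 * DEMO_PAGE_SIZE, 500 * DEMO_PAGE_SIZE) touches only the
two resident pages in the range, where a per-offset lookup loop would perform
499 lookups. The list makes finding the first page linear; a tree would do
that step in logarithmic time.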