From: Tim Leslie <tleslie@protonmail.com>
Subject: Re: Change to quadratic scaling of estcpu penalty
To: Greg Schaefer <gsgs7878@proton.me>
Cc: "tech@openbsd.org" <tech@openbsd.org>
Date: Sat, 28 Feb 2026 12:51:33 +0000

> Below are some runs with the 7.8-stock/your-mod showing no material difference with my 125H four P-Core config. The logic behind your change makes sense. 

The intention behind this change, and a number of others I have in tree, is to modernize functions that are broadly correct today but carry forward 4BSD assumptions that will either constrain future work or require rewriting to reduce sched_lock scope further. The integer-only arithmetic in the decay loops is one example: sub-integer granularity for sleep durations or estcpu accumulation is not currently expressible. Some logic is also folded into macros in ways that limit legibility without adding abstraction value, and a number of comments range from dated to significantly longer than the code they describe. mi_switch is a prime example where a refactor will be critical to splitting the lock between runqueue and sleep/state elements. Some of these changes will look like marginal gains in isolation; that is expected.

As such, a kernel compile -j4 on a homogeneous four P-core machine is deliberately not the interesting workload here. The change is most visible under mixed-priority load, and most acutely for high-priority, sustained CPU hogs, which previously hit the estcpu ceiling at 36 ticks and stopped accumulating penalty. The quadratic curve removes that effective cap at the high end while being considerably gentler at the low end, which is the short-burst case that setpriority was over-penalizing.
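
To make the shape difference concrete, here is a rough sketch. It is not the diff itself: the function names and the scaling are made up for illustration, and only the 36-tick ceiling comes from the behaviour described above. The quadratic curve is scaled so that both curves meet at the ceiling, which is an assumption chosen purely to make the comparison easy to read.

```c
/*
 * Illustrative only: hypothetical penalty curves, not kernel code.
 * PENALTY_CAP mirrors the 36-tick ceiling mentioned above; everything
 * else is an assumption made for this example.
 */
#define PENALTY_CAP	36U

/* Linear penalty that saturates once the ceiling is reached. */
static unsigned int
linear_capped_penalty(unsigned int ticks)
{
	return (ticks < PENALTY_CAP) ? ticks : PENALTY_CAP;
}

/*
 * Quadratic penalty scaled so that it equals the linear one at the
 * ceiling. Below the ceiling it is smaller, above it keeps growing.
 * Integer division truncates, and ticks * ticks would overflow for
 * very large inputs; neither is handled in this sketch.
 */
static unsigned int
quadratic_penalty(unsigned int ticks)
{
	return (ticks * ticks) / PENALTY_CAP;
}
```

With these made-up constants, a short burst of 18 ticks costs 9 rather than 18, while a sustained 72 ticks costs 144 rather than saturating at 36.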

> spc->spc_curpriority is only updated by kern_sig/userret() and kern_synch/sleep_finish(). Shouldn't it also be updated by kern_sched/sched_chooseproc()? Otherwise, seems like "if (prio < spc->spc_curpriority)" is often comparing against stale priority data.

I agree that the coverage gap is real, and that the field does go stale on paths that do not pass through sleep_finish or userret. I would not suggest fixing it in isolation, but rather dealing with it in a broader context-switch refactor where the ownership and update semantics of per-CPU scheduler state can be addressed consistently. Such a refactor could reduce lock coverage as well as consolidate the sched_toidle mi_switch reimplementation and the deferred cleanup in sched_exit. A targeted one-liner in sched_chooseproc would be a narrow fix to a field that has several readers and a coherence story that deserves proper treatment.

Tim