From: Scott Cheloha
Subject: dt(4): profile: don't stagger clock interrupts
To: tech@openbsd.org
Cc: mpi@openbsd.org
Date: Fri, 9 Feb 2024 14:22:48 -0600

Now that the profile probe is separated from the hardclock() we can
start improving it.

The simplest thing we can do to reduce profiling overhead is to get
rid of clock interrupt staggering.  It's an artifact of the
hardclock().

The problem is intuitive: on average, reading N profiling events
during a single wakeup is cheaper than reading a single profiling
event across N separate wakeups.

Two "gotchas" to take note of:

1. The event buffer in btrace(8) is fixed-size.  On machines with
   lots of CPUs there may not be enough room to grab all the
   profiling events in one read(2).

2. There is a hotspot in dt_pcb_ring_consume() where every CPU on
   the system will try to enter ds_mtx simultaneously to increment
   ds_evtcnt.

Both can be fixed separately.  Plus, the overhead of the mutex
contention in (2) is minuscule compared to the overhead of the extra
wakeups under the current scheme.

This can wait a few days, just in case we need to back out the
recent dt(4) changes.

Index: dt_dev.c
===================================================================
RCS file: /cvs/src/sys/dev/dt/dt_dev.c,v
diff -u -p -r1.30 dt_dev.c
--- dt_dev.c	9 Feb 2024 17:42:18 -0000	1.30
+++ dt_dev.c	9 Feb 2024 20:06:00 -0000
@@ -497,8 +497,6 @@ dt_ioctl_record_start(struct dt_softc *s
 	if (dp->dp_nsecs != 0) {
 		clockintr_bind(&dp->dp_clockintr, dp->dp_cpu, dt_clock,
 		    dp);
-		clockintr_stagger(&dp->dp_clockintr, dp->dp_nsecs,
-		    CPU_INFO_UNIT(dp->dp_cpu), MAXCPUS);
 		clockintr_advance(&dp->dp_clockintr, dp->dp_nsecs);
 	}
 }
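
For anyone who wants a feel for what the removed stagger does, here is
a rough userland sketch.  It is illustrative only, not kernel code: it
assumes clockintr_stagger() offsets each CPU's first expiration by
count/denominator of the period, i.e. by CPU_INFO_UNIT/MAXCPUS of
dp_nsecs, and the PERIOD_NS and NCPUS values below are made up.

#include <stdio.h>

#define PERIOD_NS	10000000ULL	/* hypothetical 10ms profiling period */
#define NCPUS		8		/* hypothetical CPU count */

int
main(void)
{
	unsigned long long off;
	int cpu;

	/*
	 * Staggered: each CPU's first expiry lands at a distinct offset
	 * within the period, so events trickle in one at a time and
	 * btrace(8) takes roughly one wakeup per event.
	 */
	for (cpu = 0; cpu < NCPUS; cpu++) {
		off = PERIOD_NS * cpu / NCPUS;
		printf("staggered:   cpu%d first expiry at +%llu ns\n",
		    cpu, off);
	}

	/*
	 * Unstaggered: every CPU expires together at +PERIOD_NS, so a
	 * single wakeup can drain all NCPUS events.
	 */
	printf("unstaggered: all %d CPUs expire at +%llu ns\n",
	    NCPUS, PERIOD_NS);

	return 0;
}

With the stagger gone, the per-CPU profile interrupts expire together,
which is what lets one wakeup read N events.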