From: Mark Kettenis
Subject: Re: cpu_xcall for amd64, was: rewrite wbinvd_on_all_cpus_acked() using cpu_xcall(9)
To: David Gwynne
Cc: tech@openbsd.org
Date: Sat, 08 Nov 2025 18:18:50 +0100

> Date: Fri, 7 Nov 2025 20:20:53 +1000
> From: David Gwynne
>
> On Thu, Nov 06, 2025 at 11:02:00AM -0800, Mike Larkin wrote:
> > On Thu, Nov 06, 2025 at 07:12:36PM +0100, hshoexer wrote:
> > > Hi,
> > >
> > > On Thu, Nov 06, 2025 at 12:18:27PM -0500, Dave Voutila wrote:
> > > > hshoexer writes:
> > > > >
> > > > > Hi,
> > > > >
> > > > > I'd like to simplify the implementation of amd64's
> > > > > wbinvd_on_all_cpus_acked() by using cpu_xcall(9).
> > > > >
> > > > > The diff below rewrites wbinvd_on_all_cpus_acked() accordingly.
> > > > >
> > > > > The required xcall glue code for amd64 is from this [1] mail by
> > > > > dlg@.  If the wbinvd_on_all_cpus_acked() is ok, I'm fine with
> > > > > splitting this diff.
> > > > >
> > > > > Thoughts?
> > > > >
> > > >
> > > > I believe dlg@ has some changes he may not have shared that are not in
> > > > tree that shuffle a bit of how things split across MI/MD.
> > > >
> > > > Given we're post-release, I think it's a good time to start that
> > > > conversation again.
> > > excellent!  The wbinvd rewrite is quite simple.  I'd be happy to use
> > > it to test drive cpu_xcall.
> > >
> > yeah, the xcall stuff has to get finished first then we can look at
> > this diff (wbinvd)
>
> ok.
>
> the problem i found was that it's not safe to call the MI softintr
> code from IPL_IPI, which the in-tree version of cpu_xcall relies
> on.  to cope i've redrawn the line between the MD and MI code needed to
> implement xcalls: the arch is now responsible for calling the xcall
> dispatch code from IPL_SOFTCLOCK, rather than calling softintr
> dispatch from IPL_IPI.

A few remarks about the new MD code below.

> this means there's more MD glue, which means it'll be harder to get an
> MI xcall API into the tree in one go.  however, so far i've only found a
> use for the xcall API on amd64 and arm64, so i don't think we need to
> wait.

Yes, I don't immediately see a truly MI use case for these.

> the diff below is my latest version of the amd64 xcall glue.
>
> while i'm here, i'll include a bit of discussion i had with dv@ about
> using the xcall api:
>
> cpu_xcall_sync exists because sending an xcall and sleeping
> for it to finish is most (but not all) of what i do with xcalls, so it
> made sense to factor it out.  the fact that it exists makes it
> tempting to use as a hammer, but it's not great for every nail.
>
> the rest of the cpu_xcall uses i had were either waiting for multiple
> cpus, or firing work and forgetting about it.  that stuff was varied
> enough that i didn't see obvious common code to factor out.
>
> xcalls are supposed to feel the same as other openbsd APIs, and be
> composable with other apis.  xcalls are kind of like tasks/timeouts.
>
> it might be a bit more obvious with some examples:
>
> if you want to get a remote cpu to do something and sleep for the
> result:
>
> void
> something(void *null)
> {
> 	...
> }
>
> void
> dispatch(struct cpu_info *ci)
> {
> 	cpu_xcall_sync(ci, something, NULL);
> }
>
> if you want to get a remote cpu to do something and spin for the result:
>
> void
> something(void *var)
> {
> 	volatile u_int *c = var;
> 	...
> 	*c = 1;
> }
>
> void
> dispatch(struct cpu_info *ci)
> {
> 	u_int c = 0;
> 	struct xcall xc = XCALL_INITIALIZER(something, &c);
>
> 	cpu_xcall(ci, &xc);
>
> 	while (c == 0) {
> 		CPU_BUSY_CYCLE();
> 		/* MP_LOCKDEBUG spinout stuff */
> 	}
> }
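
The "/* MP_LOCKDEBUG spinout stuff */" placeholder in these spin loops
presumably stands for the usual ddb spinout check built around
__mp_lock_spinout.  A rough sketch only, not part of the diff below,
shown here for the loop in the second example:

#ifdef MP_LOCKDEBUG
	int nticks = __mp_lock_spinout;
#endif

	while (c == 0) {
		CPU_BUSY_CYCLE();
#ifdef MP_LOCKDEBUG
		if (--nticks <= 0) {
			db_printf("%s: spun out\n", __func__);
			db_enter();
			nticks = __mp_lock_spinout;
		}
#endif
	}
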
>
> if you want to broadcast work and sleep waiting for it:
>
> void
> something(void *arg)
> {
> 	struct refcnt *r = arg;
> 	struct cpu_info *ci = curcpu();
>
> 	start_vmm_on_cpu(ci);
>
> 	refcnt_rele_wake(r);
> }
>
> void
> dispatch(void)
> {
> 	struct cpu_info *ci;
> 	CPU_INFO_ITERATOR cii;
> 	struct refcnt r = REFCNT_INITIALIZER();
> 	struct xcall xc = XCALL_INITIALIZER(something, &r);
>
> 	CPU_INFO_FOREACH(cii, ci) {
> 		refcnt_take(&r);
> 		cpu_xcall(ci, &xc);
> 	}
>
> 	refcnt_finalize(&r, "vmmstart");
> }
>
> if you want to broadcast work and spin waiting for it:
>
> void
> something(void *arg)
> {
> 	unsigned int *r = arg;
> 	struct cpu_info *ci = curcpu();
>
> 	start_vmm_on_cpu(ci);
>
> 	atomic_dec_int(r);
> }
>
> void
> dispatch(void)
> {
> 	struct cpu_info *ci;
> 	CPU_INFO_ITERATOR cii;
> 	unsigned int r = 0;
> 	struct xcall xc = XCALL_INITIALIZER(something, &r);
>
> 	CPU_INFO_FOREACH(cii, ci) {
> 		atomic_inc_int(&r);
> 		cpu_xcall(ci, &xc);
> 	}
>
> 	while (atomic_load_int(&r) > 0) {
> 		CPU_BUSY_CYCLE();
> 		/* MP_LOCKDEBUG spinout stuff */
> 	}
> }
>
> if you want to broadcast work and spin on something each cpu/state has
> to change:
>
> void
> something(void *arg)
> {
> 	struct cpu_info *ci = curcpu();
>
> 	start_vmm_on_cpu(ci);
>
> 	atomic_setbits_int(&ci->ci_thing, CPU_FLAG);
> }
>
> void
> dispatch(void)
> {
> 	struct cpu_info *ci;
> 	CPU_INFO_ITERATOR cii;
> 	struct xcall xc = XCALL_INITIALIZER(something, NULL);
>
> 	CPU_INFO_FOREACH(cii, ci)
> 		cpu_xcall(ci, &xc);
>
> 	CPU_INFO_FOREACH(cii, ci) {
> 		while (!ISSET(atomic_load_int(&ci->ci_thing), CPU_FLAG)) {
> 			CPU_BUSY_CYCLE();
> 			/* MP_LOCKDEBUG spinout stuff */
> 		}
> 	}
> }
>
> the last one shows that you can avoid serialising and waiting for each
> cpu by firing the work at all cpus before waiting for any of them to
> finish it.  all the cpus have to finish it eventually, so you can iterate
> over them in any order to check.
>
> these also demonstrate that you can cpu_xcall against yourself
> (curcpu()) just fine.
>
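
To tie the examples back to the start of the thread: the
wbinvd_on_all_cpus_acked() rewrite hshoexer@ is proposing would
presumably follow the broadcast-and-sleep pattern above, along these
lines (a sketch only, not his actual diff; the wbinvd_xcall helper name
is made up here):

void
wbinvd_xcall(void *arg)
{
	struct refcnt *r = arg;

	wbinvd();
	refcnt_rele_wake(r);
}

void
wbinvd_on_all_cpus_acked(void)
{
	struct cpu_info *ci;
	CPU_INFO_ITERATOR cii;
	struct refcnt r = REFCNT_INITIALIZER();
	struct xcall xc = XCALL_INITIALIZER(wbinvd_xcall, &r);

	CPU_INFO_FOREACH(cii, ci) {
		refcnt_take(&r);
		cpu_xcall(ci, &xc);
	}

	refcnt_finalize(&r, "wbinvd");
}

The refcnt provides the "acked" part: refcnt_finalize() only returns
once every CPU has executed wbinvd() and dropped its reference.
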
> Index: arch/amd64/amd64/cpu.c
> ===================================================================
> RCS file: /cvs/src/sys/arch/amd64/amd64/cpu.c,v
> diff -u -p -r1.196 cpu.c
> --- arch/amd64/amd64/cpu.c	5 Jun 2025 09:29:54 -0000	1.196
> +++ arch/amd64/amd64/cpu.c	16 Sep 2025 12:07:26 -0000
> @@ -637,6 +633,8 @@ cpu_attach(struct device *parent, struct
>  #endif
>
>  #if defined(MULTIPROCESSOR)
> +	cpu_xcall_establish(ci);
> +
>  	/*
>  	 * Allocate UPAGES contiguous pages for the idle PCB and stack.
>  	 */
> Index: arch/amd64/amd64/intr.c
> ===================================================================
> RCS file: /cvs/src/sys/arch/amd64/amd64/intr.c,v
> diff -u -p -r1.63 intr.c
> --- arch/amd64/amd64/intr.c	11 Jun 2025 09:57:01 -0000	1.63
> +++ arch/amd64/amd64/intr.c	16 Sep 2025 12:07:26 -0000
> @@ -577,7 +589,9 @@ struct intrhand fake_softclock_intrhand;
>  struct intrhand fake_softnet_intrhand;
>  struct intrhand fake_softtty_intrhand;
>  struct intrhand fake_timer_intrhand;
> +#ifdef MULTIPROCESSOR
>  struct intrhand fake_ipi_intrhand;
> +#endif
>  #if NXEN > 0
>  struct intrhand fake_xen_intrhand;
>  #endif
> @@ -644,6 +658,16 @@ cpu_intr_init(struct cpu_info *ci)
>  	isp->is_handlers = &fake_ipi_intrhand;
>  	isp->is_pic = &local_pic;
>  	ci->ci_isources[LIR_IPI] = isp;
> +
> +	isp = malloc(sizeof(*isp), M_DEVBUF, M_NOWAIT|M_ZERO);
> +	if (isp == NULL)
> +		panic("can't allocate fixed interrupt source");
> +	isp->is_recurse = Xxcallintr;
> +	isp->is_resume = Xxcallintr;
> +	fake_ipi_intrhand.ih_level = IPL_SOFTCLOCK;
> +	isp->is_handlers = &fake_ipi_intrhand;
> +	isp->is_pic = &local_pic;
> +	ci->ci_isources[SIR_XCALL] = isp;
>  #endif
>  #if NXEN > 0
>  	isp = malloc(sizeof (struct intrsource), M_DEVBUF, M_NOWAIT|M_ZERO);
> Index: arch/amd64/amd64/ipifuncs.c
> ===================================================================
> RCS file: /cvs/src/sys/arch/amd64/amd64/ipifuncs.c,v
> diff -u -p -r1.39 ipifuncs.c
> --- arch/amd64/amd64/ipifuncs.c	7 Jun 2024 16:53:35 -0000	1.39
> +++ arch/amd64/amd64/ipifuncs.c	16 Sep 2025 12:07:26 -0000
> @@ -61,6 +61,7 @@
>  void x86_64_ipi_nop(struct cpu_info *);
>  void x86_64_ipi_halt(struct cpu_info *);
>  void x86_64_ipi_wbinvd(struct cpu_info *);
> +void x86_64_ipi_xcall(struct cpu_info *);
>
>  #if NVMM > 0
>  void x86_64_ipi_vmclear_vmm(struct cpu_info *);
> @@ -108,6 +109,7 @@ void (*ipifunc[X86_NIPI])(struct cpu_inf
>  	NULL,
>  #endif
>  	x86_64_ipi_wbinvd,
> +	x86_64_ipi_xcall,
>  };
>
>  void
> @@ -169,3 +171,9 @@ x86_64_ipi_wbinvd(struct cpu_info *ci)
>  {
>  	wbinvd();
>  }
> +
> +void
> +x86_64_ipi_xcall(struct cpu_info *ci)
> +{
> +	x86_atomic_setbits_u64(&ci->ci_ipending, 1UL << SIR_XCALL);
> +};
> Index: arch/amd64/amd64/vector.S
> ===================================================================
> RCS file: /cvs/src/sys/arch/amd64/amd64/vector.S,v
> diff -u -p -r1.103 vector.S
> --- arch/amd64/amd64/vector.S	11 Jul 2025 20:04:20 -0000	1.103
> +++ arch/amd64/amd64/vector.S	16 Sep 2025 12:07:26 -0000
> @@ -1408,3 +1408,19 @@ KIDTVEC(softclock)
>  	jmp	retpoline_r13
>  	CODEPATCH_END(CPTAG_RETPOLINE_R13)
>  END(Xsoftclock)
> +
> +#ifdef MULTIPROCESSOR
> +KIDTVEC(xcallintr)
> +	endbr64
> +	movl	$IPL_SOFTCLOCK, CPUVAR(ILEVEL)
> +	sti
> +	incl	CPUVAR(IDEPTH)
> +	movq	CPUVAR(SELF),%rdi
> +	call	cpu_xcall_dispatch
> +	movq	$0,-8(%rsp)
> +	decl	CPUVAR(IDEPTH)
> +	CODEPATCH_START
> +	jmp	retpoline_r13
> +	CODEPATCH_END(CPTAG_RETPOLINE_R13)
> +END(Xxcallintr)
> +#endif /* MULTIPROCESSOR */
> Index: arch/amd64/include/cpu.h
> ===================================================================
> RCS file: /cvs/src/sys/arch/amd64/include/cpu.h,v
> diff -u -p -r1.180 cpu.h
> --- arch/amd64/include/cpu.h	28 Apr 2025 16:18:25 -0000	1.180
> +++ arch/amd64/include/cpu.h	16 Sep 2025 12:07:26 -0000
> @@ -53,6 +53,7 @@
>  #include
>  #include
>  #include
> +#include
>  #include
>
>  #ifdef _KERNEL
> @@ -215,6 +216,7 @@ struct cpu_info {
>
>  #ifdef MULTIPROCESSOR
>  	struct srp_hazard	ci_srp_hazards[SRP_HAZARD_NUM];
> +	struct xcall_cpu	ci_xcall;
>  #define __HAVE_UVM_PERCPU
>  	struct uvm_pmr_cache	ci_uvm;		/* [o] page cache */
>  #endif
> Index: arch/amd64/include/intr.h
> ===================================================================
> RCS file: /cvs/src/sys/arch/amd64/include/intr.h,v
> diff -u -p -r1.36 intr.h
> --- arch/amd64/include/intr.h	11 Jun 2025 09:57:01 -0000	1.36
> +++ arch/amd64/include/intr.h	16 Sep 2025 12:07:26 -0000
> @@ -222,7 +222,10 @@ void x86_ipi_handler(void);
>  void x86_setperf_ipi(struct cpu_info *);
>
>  extern void (*ipifunc[X86_NIPI])(struct cpu_info *);
> -#endif
> +
> +#define cpu_xcall_ipi(_ci)	x86_send_ipi((_ci), X86_IPI_XCALL)
> +void Xxcallintr(void);
> +#endif /* MULTIPROCESSOR */
>
>  #endif /* !_LOCORE */
>
> Index: arch/amd64/include/intrdefs.h
> ===================================================================
> RCS file: /cvs/src/sys/arch/amd64/include/intrdefs.h,v
> diff -u -p -r1.24 intrdefs.h
> --- arch/amd64/include/intrdefs.h	26 May 2024 13:37:31 -0000	1.24
> +++ arch/amd64/include/intrdefs.h	16 Sep 2025 12:07:26 -0000
> @@ -55,9 +55,10 @@
>  #define SIR_CLOCK	61
>  #define SIR_NET		60
>  #define SIR_TTY		59
> +#define SIR_XCALL	58

I think these should be in order of decreasing priority.  So SIR_XCALL
should be immediately before or immediately after SIR_CLOCK.

This consumes an extra interrupt "slot" on each CPU.  And we're already
short on those...

> -#define LIR_XEN		58
> -#define LIR_HYPERV	57
> +#define LIR_XEN		57
> +#define LIR_HYPERV	56

But maybe we can reclaim these two slots on systems that don't need
them?  That's a different diff though.

>
>  /*
>   * Maximum # of interrupt sources per CPU. 64 to fit in one word.
> @@ -85,8 +86,9 @@
>  #define X86_IPI_START_VMM	0x00000100
>  #define X86_IPI_STOP_VMM	0x00000200
>  #define X86_IPI_WBINVD		0x00000400
> +#define X86_IPI_XCALL		0x00000800
>
> -#define X86_NIPI		12
> +#define X86_NIPI		13
>
>  #define IREENT_MAGIC	0x18041969
>
> Index: kern/kern_xcall.c
> ===================================================================
> RCS file: /cvs/src/sys/kern/kern_xcall.c,v
> diff -u -p -r1.1 kern_xcall.c
> --- kern/kern_xcall.c	13 Jul 2025 05:45:21 -0000	1.1
> +++ kern/kern_xcall.c	16 Sep 2025 12:07:26 -0000
> @@ -106,10 +106,10 @@ cpu_xcall_sync(struct cpu_info *ci, void
>  /*
>   * This is the glue between the MD IPI code and the calling of xcalls
>   */
> -static void
> -cpu_xcall_handler(void *arg)
> +void
> +cpu_xcall_dispatch(struct cpu_info *ci)
>  {
> -	struct xcall_cpu *xci = arg;
> +	struct xcall_cpu *xci = &ci->ci_xcall;
>  	struct xcall *xc;
>  	size_t i;
>
> @@ -130,17 +130,6 @@ cpu_xcall_establish(struct cpu_info *ci)
>
>  	for (i = 0; i < nitems(xci->xci_xcalls); i++)
>  		xci->xci_xcalls[i] = NULL;
> -
> -	xci->xci_softintr = softintr_establish(IPL_XCALL | IPL_MPSAFE,
> -	    cpu_xcall_handler, xci);
> -}
> -
> -void
> -cpu_xcall_dispatch(struct cpu_info *ci)
> -{
> -	struct xcall_cpu *xci = &ci->ci_xcall;
> -
> -	softintr_schedule(xci->xci_softintr);
>  }
>
>  #else /* MULTIPROCESSOR */
> Index: sys/xcall.h
> ===================================================================
> RCS file: /cvs/src/sys/sys/xcall.h,v
> diff -u -p -r1.1 xcall.h
> --- sys/xcall.h	13 Jul 2025 05:45:21 -0000	1.1
> +++ sys/xcall.h	16 Sep 2025 12:07:26 -0000
> @@ -45,7 +45,7 @@
>   *
>   * 4. cpu_xcall_ipi has to be provided by machine/intr.h.
>   *
> - * 5. The MD xcall IPI handler has to call cpu_xcall_dispatch.
> + * 5. call cpu_xcall_dispatch at IPL_SOFTCLOCK on the target CPU.
>   */
>
>  #ifndef _SYS_XCALL_H
>
>
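
For reference, the MD contract after this diff, condensed from the amd64
glue above (nothing new here, just the two pieces an arch has to supply
gathered in one place; another arch would use its own IPI and soft
interrupt machinery):

/* machine/intr.h: requirement 4, how MI code kicks the target CPU. */
#define cpu_xcall_ipi(_ci)	x86_send_ipi((_ci), X86_IPI_XCALL)

/* The IPI handler runs at IPL_IPI and only marks SIR_XCALL pending. */
void
x86_64_ipi_xcall(struct cpu_info *ci)
{
	x86_atomic_setbits_u64(&ci->ci_ipending, 1UL << SIR_XCALL);
}

/*
 * Requirement 5: once the SPL drops, the pending SIR_XCALL bit sends
 * the CPU through Xxcallintr (vector.S above), which raises to
 * IPL_SOFTCLOCK and calls cpu_xcall_dispatch() to run the queued
 * xcalls.
 */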