From: David Gwynne
Subject: cpu_xcall for amd64, was: rewrite wbinvd_on_all_cpus_acked() using cpu_xcall(9)
To: tech@openbsd.org
Date: Fri, 7 Nov 2025 20:20:53 +1000

On Thu, Nov 06, 2025 at 11:02:00AM -0800, Mike Larkin wrote:
> On Thu, Nov 06, 2025 at 07:12:36PM +0100, hshoexer wrote:
> > Hi,
> >
> > On Thu, Nov 06, 2025 at 12:18:27PM -0500, Dave Voutila wrote:
> > > hshoexer writes:
> > >
> > > > Hi,
> > > >
> > > > I'd like to simplify the implementation of amd64's
> > > > wbinvd_on_all_cpus_acked() by using cpu_xcall(9).
> > > >
> > > > The diff below rewrites wbinvd_on_all_cpus_acked() accordingly.
> > > >
> > > > The required xcall glue code for amd64 is from this [1] mail by
> > > > dlg@. If the wbinvd_on_all_cpus_acked() rewrite is ok, I'm fine
> > > > with splitting this diff.
> > > >
> > > > Thoughts?
> > > >
> > >
> > > I believe dlg@ has some changes he may not have shared that are not in
> > > tree that shuffle a bit of how things split across MI/MD.
> > >
> > > Given we're post-release, I think it's a good time to start that
> > > conversation again.
> >
> > excellent! The wbinvd rewrite is quite simple. I'd be happy to use
> > it to test drive cpu_xcall.
> >
>
> yeah, the xcall stuff has to get finished first then we can look at
> this diff (wbinvd)

ok. the problem i found was that it's not safe to call the MI softintr
code from IPL_IPI, which the in-tree version of cpu_xcall relies on.

to cope i've redrawn the line between the MD and MI code needed to
implement xcalls. the arch is now responsible for calling the xcall
dispatch code from IPL_SOFTCLOCK, rather than calling softintr
dispatch from IPL_IPI.

this means there's more MD glue, so it'll be harder to get an MI xcall
API into the tree in one go. however, so far i've only found a use for
the xcall API on amd64 and arm64, so i don't think we need to wait.

the diff below is my latest version of the amd64 xcall glue.
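to make the contract concrete before the diff: an arch has to provide
a way to kick a remote cpu, and a soft interrupt that runs the
dispatcher once the cpu drops below IPL_SOFTCLOCK. roughly like this
sketch, where the md_* names are made up for illustration (the real
amd64 hooks are in the diff below):

	/* machine/intr.h: how the MI code pokes a remote cpu */
	#define cpu_xcall_ipi(_ci)	md_send_ipi((_ci), MD_IPI_XCALL)

	/* the hard (IPL_IPI) handler only marks a soft interrupt pending */
	void
	md_ipi_xcall(struct cpu_info *ci)
	{
		md_softintr_pending(ci, SIR_XCALL);
	}

	/* later, at IPL_SOFTCLOCK on the target cpu, run the xcalls */
	void
	md_softxcall_intr(struct cpu_info *ci)
	{
		cpu_xcall_dispatch(ci);
	}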
while i'm here, i'll include a bit of discussion i had with dv@ about
using the xcall api:

cpu_xcall_sync exists because sending an xcall and sleeping for it to
finish is most (but not all) of what i do with xcalls, so it made
sense to factor it out. the fact that it exists makes it tempting to
use as a hammer, but it's not great for every nail.

the rest of the cpu_xcall uses i had were either waiting for multiple
cpus, or firing work and forgetting about it. that stuff was varied
enough that i didn't see obvious common code to factor out.

xcalls are supposed to feel the same as other openbsd apis, and be
composable with them. xcalls are kind of like tasks/timeouts.

it might be a bit more obvious with some examples:

if you want to get a remote cpu to do something and sleep for the
result:

	void
	something(void *null)
	{
		...
	}

	void
	dispatch(struct cpu_info *ci)
	{
		cpu_xcall_sync(ci, something, NULL);
	}

if you want to get a remote cpu to do something and spin for the
result:

	void
	something(void *var)
	{
		volatile u_int *c = var;

		...

		*c = 1;
	}

	void
	dispatch(struct cpu_info *ci)
	{
		u_int c = 0;
		struct xcall xc = XCALL_INITIALIZER(something, &c);

		cpu_xcall(ci, &xc);
		while (c == 0) {
			CPU_BUSY_CYCLE();
			/* MP_LOCKDEBUG spinout stuff */
		}
	}

if you want to broadcast work and sleep waiting for it:

	void
	something(void *arg)
	{
		struct refcnt *r = arg;
		struct cpu_info *ci = curcpu();

		start_vmm_on_cpu(ci);
		refcnt_rele_wake(r);
	}

	void
	dispatch(void)
	{
		struct cpu_info *ci;
		CPU_INFO_ITERATOR cii;
		struct refcnt r = REFCNT_INITIALIZER();
		struct xcall xc = XCALL_INITIALIZER(something, &r);

		CPU_INFO_FOREACH(cii, ci) {
			refcnt_take(&r);
			cpu_xcall(ci, &xc);
		}

		refcnt_finalize(&r, "vmmstart");
	}

if you want to broadcast work and spin waiting for it:

	void
	something(void *arg)
	{
		unsigned int *r = arg;
		struct cpu_info *ci = curcpu();

		start_vmm_on_cpu(ci);
		atomic_dec_int(r);
	}

	void
	dispatch(void)
	{
		struct cpu_info *ci;
		CPU_INFO_ITERATOR cii;
		unsigned int r = 0;
		struct xcall xc = XCALL_INITIALIZER(something, &r);

		CPU_INFO_FOREACH(cii, ci) {
			atomic_inc_int(&r);
			cpu_xcall(ci, &xc);
		}

		while (atomic_load_int(&r) > 0) {
			CPU_BUSY_CYCLE();
			/* MP_LOCKDEBUG spinout stuff */
		}
	}

if you want to broadcast work and spin on per-cpu state that each cpu
has to change:

	void
	something(void *arg)
	{
		struct cpu_info *ci = curcpu();

		start_vmm_on_cpu(ci);
		atomic_setbits_int(&ci->ci_thing, CPU_FLAG);
	}

	void
	dispatch(void)
	{
		struct cpu_info *ci;
		CPU_INFO_ITERATOR cii;
		struct xcall xc = XCALL_INITIALIZER(something, NULL);

		CPU_INFO_FOREACH(cii, ci)
			cpu_xcall(ci, &xc);

		CPU_INFO_FOREACH(cii, ci) {
			while (!ISSET(atomic_load_int(&ci->ci_thing),
			    CPU_FLAG)) {
				CPU_BUSY_CYCLE();
				/* MP_LOCKDEBUG spinout stuff */
			}
		}
	}

the last one shows that you can avoid serialising and waiting for each
cpu by firing the work at all cpus before waiting for any of them to
finish it. all the cpus have to finish it eventually, so you can
iterate over them in any order to check.

these also demonstrate that you can cpu_xcall against yourself
(curcpu()) just fine.
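the "/* MP_LOCKDEBUG spinout stuff */" comments above are placeholders
for the usual ddb spinout guard, along the lines of what mtx_enter
does. a rough sketch of how the broadcast-and-spin loop would grow it,
assuming the kernel's __mp_lock_spinout knob and ddb:

	void
	dispatch(void)
	{
	#ifdef MP_LOCKDEBUG
		int nticks = __mp_lock_spinout;
	#endif

		...

		while (atomic_load_int(&r) > 0) {
			CPU_BUSY_CYCLE();
	#ifdef MP_LOCKDEBUG
			if (--nticks == 0) {
				db_printf("%s: spun out\n", __func__);
				db_enter();
				nticks = __mp_lock_spinout;
			}
	#endif
		}
	}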
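and to tie this back to the start of the thread: the
broadcast-and-sleep pattern is pretty much all that
wbinvd_on_all_cpus_acked() needs. a rough sketch of how the rewrite
could look (this is an illustration, not hshoexer's actual diff; the
wbinvd_xcall name is made up):

	void
	wbinvd_xcall(void *arg)
	{
		struct refcnt *r = arg;

		wbinvd();
		refcnt_rele_wake(r);
	}

	void
	wbinvd_on_all_cpus_acked(void)
	{
		struct cpu_info *ci;
		CPU_INFO_ITERATOR cii;
		struct refcnt r = REFCNT_INITIALIZER();
		struct xcall xc = XCALL_INITIALIZER(wbinvd_xcall, &r);

		CPU_INFO_FOREACH(cii, ci) {
			refcnt_take(&r);
			cpu_xcall(ci, &xc);
		}

		refcnt_finalize(&r, "wbinvd");
	}

note refcnt_finalize sleeps, so like cpu_xcall_sync this is only
usable where the caller is allowed to sleep.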
Index: arch/amd64/amd64/cpu.c
===================================================================
RCS file: /cvs/src/sys/arch/amd64/amd64/cpu.c,v
diff -u -p -r1.196 cpu.c
--- arch/amd64/amd64/cpu.c	5 Jun 2025 09:29:54 -0000	1.196
+++ arch/amd64/amd64/cpu.c	16 Sep 2025 12:07:26 -0000
@@ -637,6 +633,8 @@ cpu_attach(struct device *parent, struct
 #endif
 
 #if defined(MULTIPROCESSOR)
+	cpu_xcall_establish(ci);
+
 	/*
 	 * Allocate UPAGES contiguous pages for the idle PCB and stack.
 	 */
Index: arch/amd64/amd64/intr.c
===================================================================
RCS file: /cvs/src/sys/arch/amd64/amd64/intr.c,v
diff -u -p -r1.63 intr.c
--- arch/amd64/amd64/intr.c	11 Jun 2025 09:57:01 -0000	1.63
+++ arch/amd64/amd64/intr.c	16 Sep 2025 12:07:26 -0000
@@ -577,7 +589,9 @@ struct intrhand fake_softclock_intrhand;
 struct intrhand fake_softnet_intrhand;
 struct intrhand fake_softtty_intrhand;
 struct intrhand fake_timer_intrhand;
+#ifdef MULTIPROCESSOR
 struct intrhand fake_ipi_intrhand;
+#endif
 #if NXEN > 0
 struct intrhand fake_xen_intrhand;
 #endif
@@ -644,6 +658,16 @@ cpu_intr_init(struct cpu_info *ci)
 	isp->is_handlers = &fake_ipi_intrhand;
 	isp->is_pic = &local_pic;
 	ci->ci_isources[LIR_IPI] = isp;
+
+	isp = malloc(sizeof(*isp), M_DEVBUF, M_NOWAIT|M_ZERO);
+	if (isp == NULL)
+		panic("can't allocate fixed interrupt source");
+	isp->is_recurse = Xxcallintr;
+	isp->is_resume = Xxcallintr;
+	fake_ipi_intrhand.ih_level = IPL_SOFTCLOCK;
+	isp->is_handlers = &fake_ipi_intrhand;
+	isp->is_pic = &local_pic;
+	ci->ci_isources[SIR_XCALL] = isp;
 #endif
 #if NXEN > 0
 	isp = malloc(sizeof (struct intrsource), M_DEVBUF, M_NOWAIT|M_ZERO);
Index: arch/amd64/amd64/ipifuncs.c
===================================================================
RCS file: /cvs/src/sys/arch/amd64/amd64/ipifuncs.c,v
diff -u -p -r1.39 ipifuncs.c
--- arch/amd64/amd64/ipifuncs.c	7 Jun 2024 16:53:35 -0000	1.39
+++ arch/amd64/amd64/ipifuncs.c	16 Sep 2025 12:07:26 -0000
@@ -61,6 +61,7 @@ void x86_64_ipi_nop(struct cpu_info *);
 void x86_64_ipi_halt(struct cpu_info *);
 void x86_64_ipi_wbinvd(struct cpu_info *);
+void x86_64_ipi_xcall(struct cpu_info *);
 
 #if NVMM > 0
 void x86_64_ipi_vmclear_vmm(struct cpu_info *);
@@ -108,6 +109,7 @@ void (*ipifunc[X86_NIPI])(struct cpu_inf
 	NULL,
 #endif
 	x86_64_ipi_wbinvd,
+	x86_64_ipi_xcall,
 };
 
 void
@@ -169,3 +171,9 @@ x86_64_ipi_wbinvd(struct cpu_info *ci)
 {
 	wbinvd();
 }
+
+void
+x86_64_ipi_xcall(struct cpu_info *ci)
+{
+	x86_atomic_setbits_u64(&ci->ci_ipending, 1UL << SIR_XCALL);
+};
Index: arch/amd64/amd64/vector.S
===================================================================
RCS file: /cvs/src/sys/arch/amd64/amd64/vector.S,v
diff -u -p -r1.103 vector.S
--- arch/amd64/amd64/vector.S	11 Jul 2025 20:04:20 -0000	1.103
+++ arch/amd64/amd64/vector.S	16 Sep 2025 12:07:26 -0000
@@ -1408,3 +1408,19 @@ KIDTVEC(softclock)
 	jmp	retpoline_r13
 	CODEPATCH_END(CPTAG_RETPOLINE_R13)
 END(Xsoftclock)
+
+#ifdef MULTIPROCESSOR
+KIDTVEC(xcallintr)
+	endbr64
+	movl	$IPL_SOFTCLOCK, CPUVAR(ILEVEL)
+	sti
+	incl	CPUVAR(IDEPTH)
+	movq	CPUVAR(SELF),%rdi
+	call	cpu_xcall_dispatch
+	movq	$0,-8(%rsp)
+	decl	CPUVAR(IDEPTH)
+	CODEPATCH_START
+	jmp	retpoline_r13
+	CODEPATCH_END(CPTAG_RETPOLINE_R13)
+END(Xxcallintr)
+#endif /* MULTIPROCESSOR */
Index: arch/amd64/include/cpu.h
===================================================================
RCS file: /cvs/src/sys/arch/amd64/include/cpu.h,v
diff -u -p -r1.180 cpu.h
--- arch/amd64/include/cpu.h	28 Apr 2025 16:18:25 -0000	1.180
+++ arch/amd64/include/cpu.h	16 Sep 2025 12:07:26 -0000
@@ -53,6 +53,7 @@
 #include <sys/sched.h>
 #include <sys/sensors.h>
 #include <sys/srp.h>
+#include <sys/xcall.h>
 #include <uvm/uvm_percpu.h>
 
 #ifdef _KERNEL
@@ -215,6 +216,7 @@ struct cpu_info {
 #ifdef MULTIPROCESSOR
 	struct srp_hazard ci_srp_hazards[SRP_HAZARD_NUM];
+	struct xcall_cpu ci_xcall;
 #define __HAVE_UVM_PERCPU
 	struct uvm_pmr_cache ci_uvm;	/* [o] page cache */
 #endif
Index: arch/amd64/include/intr.h
===================================================================
RCS file: /cvs/src/sys/arch/amd64/include/intr.h,v
diff -u -p -r1.36 intr.h
--- arch/amd64/include/intr.h	11 Jun 2025 09:57:01 -0000	1.36
+++ arch/amd64/include/intr.h	16 Sep 2025 12:07:26 -0000
@@ -222,7 +222,10 @@ void x86_ipi_handler(void);
 void x86_setperf_ipi(struct cpu_info *);
 
 extern void (*ipifunc[X86_NIPI])(struct cpu_info *);
-#endif
+
+#define cpu_xcall_ipi(_ci)	x86_send_ipi((_ci), X86_IPI_XCALL)
+void Xxcallintr(void);
+#endif /* MULTIPROCESSOR */
 
 #endif /* !_LOCORE */
Index: arch/amd64/include/intrdefs.h
===================================================================
RCS file: /cvs/src/sys/arch/amd64/include/intrdefs.h,v
diff -u -p -r1.24 intrdefs.h
--- arch/amd64/include/intrdefs.h	26 May 2024 13:37:31 -0000	1.24
+++ arch/amd64/include/intrdefs.h	16 Sep 2025 12:07:26 -0000
@@ -55,9 +55,10 @@
 #define SIR_CLOCK	61
 #define SIR_NET		60
 #define SIR_TTY		59
+#define SIR_XCALL	58
 
-#define LIR_XEN		58
-#define LIR_HYPERV	57
+#define LIR_XEN		57
+#define LIR_HYPERV	56
 
 /*
  * Maximum # of interrupt sources per CPU. 64 to fit in one word.
@@ -85,8 +86,9 @@
 #define X86_IPI_START_VMM	0x00000100
 #define X86_IPI_STOP_VMM	0x00000200
 #define X86_IPI_WBINVD		0x00000400
+#define X86_IPI_XCALL		0x00000800
 
-#define X86_NIPI	12
+#define X86_NIPI	13
 
 #define IREENT_MAGIC	0x18041969
Index: kern/kern_xcall.c
===================================================================
RCS file: /cvs/src/sys/kern/kern_xcall.c,v
diff -u -p -r1.1 kern_xcall.c
--- kern/kern_xcall.c	13 Jul 2025 05:45:21 -0000	1.1
+++ kern/kern_xcall.c	16 Sep 2025 12:07:26 -0000
@@ -106,10 +106,10 @@ cpu_xcall_sync(struct cpu_info *ci, void
 /*
  * This is the glue between the MD IPI code and the calling of xcalls
  */
-static void
-cpu_xcall_handler(void *arg)
+void
+cpu_xcall_dispatch(struct cpu_info *ci)
 {
-	struct xcall_cpu *xci = arg;
+	struct xcall_cpu *xci = &ci->ci_xcall;
 	struct xcall *xc;
 	size_t i;
@@ -130,17 +130,6 @@ cpu_xcall_establish(struct cpu_info *ci)
 
 	for (i = 0; i < nitems(xci->xci_xcalls); i++)
 		xci->xci_xcalls[i] = NULL;
-
-	xci->xci_softintr = softintr_establish(IPL_XCALL | IPL_MPSAFE,
-	    cpu_xcall_handler, xci);
-}
-
-void
-cpu_xcall_dispatch(struct cpu_info *ci)
-{
-	struct xcall_cpu *xci = &ci->ci_xcall;
-
-	softintr_schedule(xci->xci_softintr);
 }
 
 #else /* MULTIPROCESSOR */
Index: sys/xcall.h
===================================================================
RCS file: /cvs/src/sys/sys/xcall.h,v
diff -u -p -r1.1 xcall.h
--- sys/xcall.h	13 Jul 2025 05:45:21 -0000	1.1
+++ sys/xcall.h	16 Sep 2025 12:07:26 -0000
@@ -45,7 +45,7 @@
  *
  * 4. cpu_xcall_ipi has to be provided by machine/intr.h.
  *
- * 5. The MD xcall IPI handler has to call cpu_xcall_dispatch.
+ * 5. call cpu_xcall_dispatch at IPL_SOFTCLOCK on the target CPU.
  */
 
 #ifndef _SYS_XCALL_H