From: Mark Kettenis Subject: Re: SYS_pinsyscalls question To: deraadt@openbsd.org Cc: Mateusz Guzik , marc.espie.openbsd@gmail.com, tech@openbsd.org Date: Sun, 02 Mar 2025 22:31:36 +0100 > From: Mateusz Guzik > Date: Sun, 23 Feb 2025 23:54:50 +0100 > > On Sun, Feb 23, 2025 at 8:50 PM Theo de Raadt wrote: > > > > Mateusz Guzik wrote: > > > > > On Sun, Feb 23, 2025 at 5:21 PM Theo de Raadt wrote: > > > > > > > > > Then this copyin: > > > > > error = copyin(SCARG(uap, pins), pins, npins * sizeof(u_int)); > > > > > if (error) > > > > > goto err; > > > > > > > > > > will go off cpu and drop the kernel lock, allowing other threads to > > > > > reach the same point, all still observing pr->ps_libcpin.pn_start == 0. > > > > > > > > > > Then several threads will reach this assignment: > > > > > pr->ps_libcpin.pn_start = base; > > > > > > > > > > is going to override all previously set bufs, leaking them. > > > > > > > > Yes, would need very custom crt0 in those static binaries. > > > > > > > > > > Presumably someone wanting to mess with the kernel would go there. > > > > > > > This diff should allow the first thread to do pinsyscalls to win, and > > > > re-check the condition post-sleep so that others lose and free their > > > > pintable. EPERM matches the existing spec in the man page. > > > > > > > > > > Another angle is what happens when one thread is messing about with > > > installing the table, while another is issuing syscalls. Note there is > > > no synchronization between this and pin_check. It may be this does not > > > blow up in practice at the moment, but I don't think depending on it > > > is viable. > > > > That is not a concern. system calls which start before pinsyscalls > > completes setup, work as expected. > > > > Even if that's considered fine, there is a transition period where the > table is just being installed. The code needs correct memory fences on > both sides to make sure this works as expected. Well, you're right that there is a potential race between threads here. Although that can only happen if you write your own version of ld.so that creates threads before pinning down the syscalls using pinsyscalls(2). The pinsyscals(2) code does pr->ps_libcpin.pn_start = base; pr->ps_libcpin.pn_end = base + len; pr->ps_libcpin.pn_pins = pins; pr->ps_libcpin.pn_npins = npins; but another thread may observe those assignments in a different order because there are no (memory) barriers. Fortunately the code that checks the libc.so syscall range explicitly checks pn_pins, so there is no risk of a kernel null-pointer dereference. But it does mean that the address or syscall number checks in pin_check() are done in a somewhat inconsistent way. I think the worst case is when the thread sees the correct pn_end, pn_pins and pn_npins, but pn_start is still zero. That could potentially permit a syscall that shouldn't be allowed. The diff below would fix the potential races. It adds a memory barrier that makes sure pn_start, pn_end and pn_npins become visible before pn_pins. Then in pin_check() we also need a memory barrier to make sure pn_pins gets loaded before the other variable. This shouldn't have a significant impact on performance. The membar_consumer() barrier is probably the cheapest of the bunch and on amd64 it is implemented as a no-op anyway. But at the same time, this is a solution for a problem that doesn't really exist in reality. If someone really writes an ld.so replacement that starts threads beforing calling pinsyscalls(2) they deserve what they get. Index: sys/syscall_mi.h =================================================================== RCS file: /cvs/src/sys/sys/syscall_mi.h,v diff -u -p -r1.37 syscall_mi.h --- sys/syscall_mi.h 27 Dec 2024 11:57:16 -0000 1.37 +++ sys/syscall_mi.h 2 Mar 2025 21:07:00 -0000 @@ -57,6 +57,7 @@ pin_check(struct proc *p, register_t cod extern char sigcodecall[], sigcoderet[], sigcodecall[]; struct pinsyscall *pin = NULL, *ppin, *plibcpin; struct process *pr = p->p_p; + u_int *libcpins; vaddr_t addr; int error = 0; @@ -73,7 +74,9 @@ pin_check(struct proc *p, register_t cod * 2b) dynamic binary: syscalls in ld.so (in the ps_pin region) * 3) sigtramp, containing only sigreturn(2) */ - if (plibcpin->pn_pins && + libcpins = READ_ONCE(plibcpin->pn_pins); + membar_consumer(); + if (libcpins && addr >= plibcpin->pn_start && addr < plibcpin->pn_end) pin = plibcpin; else if (ppin->pn_pins && Index: uvm/uvm_mmap.c =================================================================== RCS file: /cvs/src/sys/uvm/uvm_mmap.c,v diff -u -p -r1.197 uvm_mmap.c --- uvm/uvm_mmap.c 28 Feb 2025 06:26:18 -0000 1.197 +++ uvm/uvm_mmap.c 2 Mar 2025 21:07:00 -0000 @@ -653,8 +653,9 @@ err: } pr->ps_libcpin.pn_start = base; pr->ps_libcpin.pn_end = base + len; - pr->ps_libcpin.pn_pins = pins; pr->ps_libcpin.pn_npins = npins; + membar_producer(); + WRITE_ONCE(pr->ps_libcpin.pn_pins, pins); #ifdef PMAP_CHECK_COPYIN /* Assume (and insist) on libc.so text being execute-only */