avoid lock contention in futex syscalls
On Thu, May 01, 2025 at 11:09:34AM +0200, Martin Pieuchot wrote:
> Hey David,
>
> Thanks a lot for your work. I like it very much. Comments and
> questions below.
>
> On 01/05/25(Thu) 14:07, David Gwynne wrote:
> > this is very similar to the change made to __thrsleep and __thrwakeup
> > in src/sys/kern/kern_synch.c r1.214.
> >
> > currently all futexes in the system coordinate using a single global
> > lock. if you have heavily threaded code building locks in userland
> > out of futexes, this lock gets hammered. this is true even if
> > userland thinks it's operating on separate locks; it all ends up
> > serialised in the kernel. this can reduce the throughput of these
> > heavily threaded programs.
> >
> > like the __thrsleep diff, the big change is hashing futex waiters
> > into an array of locks/lists (or buckets) based on their "id" to
> > try and avoid contending on a single lock.
>
> This is now the third sleepqueue implementation in the kernel. I'd
> appreciate if you could use the same terminology ("ident" instead of
> "key", "slpque" vs "bucket", etc) to ease understanding and hopefully
> future refactoring.
the backend to thrsleep is the other one? the plan is to get rid of
that.
> Also, why does this version use rwlocks and not mutexes to serialize
> access to the hashed sleep queues?
the length of the critical sections, futex_requeue in particular since
it can hold multiple bucket locks at once, made me lean toward rwlocks.
we don't have anything in base that uses requeue though, so i don't
have any evidence that it's actually a problem.
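the shape of the double-lock in the requeue hunk below shows why those
critical sections can get long; the fb_id ordering is what keeps two
concurrent requeues from deadlocking:

    /* take both bucket locks in fb_id order. fb_id keeps the
     * bucket index in its low bits, so ids are unique and every
     * thread agrees on the same order. */
    if (ofb->fb_id < nfb->fb_id) {
            rw_enter_write(&ofb->fb_lock);
            rw_enter_write(&nfb->fb_lock);
    } else if (ofb->fb_id > nfb->fb_id) {
            rw_enter_write(&nfb->fb_lock);
            rw_enter_write(&ofb->fb_lock);
    } else
            rw_enter_write(&ofb->fb_lock);  /* same bucket, lock once */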
> > like the __thrsleep diff, this change also tries to avoid having
> > a thread waiting in futex_wait re-take the lock when waking up.
> > futex_wake holds the bucket lock while waking up sleeping
> > threads, so having the sleeping thread try to take the bucket lock
> > again would immediately put it back to sleep. having futex_wait
> > sleep without the lock means it can return to userland sooner.
>
> This is great. Does that imply that updating `ft_proc' requires the
> SCHED_LOCK()? If so, could that be documented with a [S] comment in
> the structure definition?
no, the bucket lock is what protects ft_proc. we need to hold SCHED_LOCK
to call wakeup_proc, which i do once for all the wakeups instead of each
one.
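i.e. the wake side collects the waiters onto a local list under the
bucket lock, then takes SCHED_LOCK once for the whole batch. this is
futexen_wakeup in the diff below:

    SCHED_LOCK();
    TAILQ_FOREACH_SAFE(f, fl, ft_entry, nf) {
            p = f->ft_proc;
            f->ft_proc = NULL;      /* f may vanish after this store */
            wakeup_proc(p);
    }
    SCHED_UNLOCK();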
> This brings me to the real question, which is out of the scope of this
> diff. Since we are now bypassing the global sleep queue, could it be
> untangled from the SCHED_LOCK()? Or are we still relying on it in this
> implementation?
my understanding is that if a thread goes to sleep, it has to sleep in
the kernel sleep queues. however, we skip some overhead by using
sleep_finish which just enqueues a thread (it doesn't traverse the list)
and wakeup_proc which avoids traversing the list on the sleep queue
to find the proc because we already have it.
tl;dr: we hold SCHED_LOCK as little as possible.
> > a feature of futexes is that multiple threads can wait on the same
> > address and get woken up together. this is currently implemented by
> > allocating a struct to represent this userland address, and then queuing
> > the waiting threads on this struct. while pools aren't slow, they're
> > not free, so this diff removes this struct and queues threads directly.
> > this means the futex wakeups may have to iterate more, but in practice
> > this is amortised by having multiple lists/locks (which results in
> > shorter lists of threads), and avoiding the overhead of the pool
> > operations. my observation is that most futex ops don't share wait
> > addresses, so every futex wait would result in a pool get and put
> > anyway.
>
> Lovely.
>
> > another feature of futexes that __thrsleep doesn't have is the ability
> > to move the address threads are sleeping on, which means threads
> > can move between buckets in the hash. care must therefore be taken to
> > avoid deadlocks between the locks on each bucket, and when a waiting
> > thread wakes up after a timeout expires it has to be careful to remove
> > itself from the right bucket after such a requeue.
> >
> > most users of thrsleep have been migrated to use futexes instead,
> > with a big exception being go, because benchmarks showed that
> > __thrsleep was significantly faster. that was true before i changed
> > __thrsleep last year, and is even more true now. i'm hoping this
> > change brings futex performance into the same ballpark as __thrsleep
> > so we can keep working toward removing __thrsleep. unfortunately i
> > don't know enough go to test this myself, so i'm hoping someone
> > *cough*jsing*cough* will try and let me know.
> >
> > thanks to phessler for his patience and help testing and reporting
> > issues with this diff by running it through a ton of ports builds.
> > also, because go is such a heavy user of __thrsleep, and because
> > of the lack of issues after i made that change, i'm feeling confident
> > about this diff.
the net at home sucks at the moment, so i'll tweak the code when i have
better net.
> >
> > Index: sys/futex.h
> > ===================================================================
> > RCS file: /cvs/src/sys/sys/futex.h,v
> > diff -u -p -r1.2 futex.h
> > --- sys/futex.h 3 Jun 2018 15:09:26 -0000 1.2
> > +++ sys/futex.h 1 May 2025 03:21:42 -0000
> > @@ -28,11 +28,15 @@ int futex(volatile uint32_t *, int, int,
> > __END_DECLS
> > #endif /* ! _KERNEL */
> >
> > +#define FUTEX_OP_MASK 0x007f
> > +
> > #define FUTEX_WAIT 1
> > #define FUTEX_WAKE 2
> > #define FUTEX_REQUEUE 3
> >
> > -#define FUTEX_PRIVATE_FLAG 128
> > +#define FUTEX_FLAG_MASK 0xff80
> > +
> > +#define FUTEX_PRIVATE_FLAG 0x0080
> >
> > #define FUTEX_WAIT_PRIVATE (FUTEX_WAIT | FUTEX_PRIVATE_FLAG)
> > #define FUTEX_WAKE_PRIVATE (FUTEX_WAKE | FUTEX_PRIVATE_FLAG)
> > Index: kern/sys_futex.c
> > ===================================================================
> > RCS file: /cvs/src/sys/kern/sys_futex.c,v
> > diff -u -p -r1.22 sys_futex.c
> > --- kern/sys_futex.c 14 Aug 2023 07:42:34 -0000 1.22
> > +++ kern/sys_futex.c 1 May 2025 03:21:42 -0000
> > @@ -24,6 +24,8 @@
> > #include <sys/pool.h>
> > #include <sys/time.h>
> > #include <sys/rwlock.h>
> > +#include <sys/percpu.h> /* CACHELINESIZE */
> > +#include <sys/kernel.h> /* tick_nsec */
> > #include <sys/futex.h>
> >
> > #ifdef KTRACE
> > @@ -36,41 +38,65 @@
> > * Kernel representation of a futex.
> > */
> > struct futex {
> > - LIST_ENTRY(futex) ft_list; /* list of all futexes */
> > - TAILQ_HEAD(, proc) ft_threads; /* sleeping queue */
> > + TAILQ_ENTRY(futex) ft_entry; /* list of all futexes */
>
> This is now /* list of futexes per bucket */. Can you document the
> locking please?
ok.
> > + struct process *ft_ps;
>
> Could we add /* for private futexes */? I hope this won't uncover too
> much broken code with shared vmspace.
we've hit the code pretty hard, and this hasn't been an issue. requeue
on the other hand...
> > struct uvm_object *ft_obj; /* UVM object */
> > struct vm_amap *ft_amap; /* UVM amap */
> > - voff_t ft_off; /* UVM offset */
> > - unsigned int ft_refcnt; /* # of references */
> > + volatile voff_t ft_off; /* UVM offset */
> > +
> > + struct proc * volatile ft_proc;
> > };
> >
> > -/* Syscall helpers. */
> > -int futex_wait(uint32_t *, uint32_t, const struct timespec *, int);
> > -int futex_wake(uint32_t *, uint32_t, int);
> > -int futex_requeue(uint32_t *, uint32_t, uint32_t *, uint32_t, int);
> > -
> > -/* Flags for futex_get(). */
> > -#define FT_CREATE 0x1 /* Create a futex if it doesn't exist. */
> > -#define FT_PRIVATE 0x2 /* Futex is process-private. */
> > +static int
> > +futex_is_eq(const struct futex *a, const struct futex *b)
> > +{
> > + return (a->ft_off == b->ft_off &&
> > + a->ft_ps == b->ft_ps &&
> > + a->ft_obj == b->ft_obj &&
> > + a->ft_amap == b->ft_amap);
> > +}
> >
> > -struct futex *futex_get(uint32_t *, int);
> > -void futex_put(struct futex *);
> > +TAILQ_HEAD(futexen, futex);
>
> Could you rename "futexen" to a name containing the word 'list', like
> "futexlist" or "fblist"? I find the current name confusing, and the
> inconsistency with the variable names puzzled me.
i'll make it futex_list.
>
> > -/*
> > - * The global futex lock serializes futex(2) calls so that no wakeup
> > - * event is lost, and protects all futex lists and futex states.
> > - */
> > -struct rwlock ftlock = RWLOCK_INITIALIZER("futex");
> > -static struct futex_list ftlist_shared =
> > - LIST_HEAD_INITIALIZER(ftlist_shared);
> > -struct pool ftpool;
> > +struct futex_bucket {
> > + struct futexen fb_list;
> > + struct rwlock fb_lock;
> > + uint32_t fb_id; /* for lock ordering */
> > +} __aligned(CACHELINESIZE);
> >
> > +/* Syscall helpers. */
> > +static int futex_wait(struct proc *, uint32_t *, uint32_t,
> > + const struct timespec *, int);
> > +static int futex_wake(struct proc *, uint32_t *, uint32_t, int,
> > + register_t *);
> > +static int futex_requeue(struct proc *, uint32_t *, uint32_t,
> > + uint32_t *, uint32_t, int, register_t *);
> > +
> > +/* Flags for futex_get(). kernel private flags sit in FUTEX_OP_MASK space */
> > +#define FT_PRIVATE FUTEX_PRIVATE_FLAG /* Futex is process-private. */
> > +
> > +#define FUTEX_BUCKET_BITS 6
> > +#define FUTEX_BUCKET_SIZE (1U << FUTEX_BUCKET_BITS)
> > +#define FUTEX_BUCKET_MASK (FUTEX_BUCKET_SIZE - 1)
> > +
> > +static struct futex_bucket futex_hash[FUTEX_BUCKET_SIZE];
> >
> > void
> > futex_init(void)
> > {
> > - pool_init(&ftpool, sizeof(struct futex), 0, IPL_NONE,
> > - PR_WAITOK | PR_RWLOCK, "futexpl", NULL);
> > + struct futex_bucket *fb;
> > + unsigned int i;
> > +
> > + for (i = 0; i < nitems(futex_hash); i++) {
> > + fb = &futex_hash[i];
> > +
> > + TAILQ_INIT(&fb->fb_list);
> > + rw_init(&fb->fb_lock, "futexlk");
> > +
> > + fb->fb_id = arc4random();
> > + fb->fb_id &= ~FUTEX_BUCKET_MASK;
> > + fb->fb_id |= i;
> > + }
> > }
> >
> > int
> > @@ -88,65 +114,51 @@ sys_futex(struct proc *p, void *v, regis
> > uint32_t val = SCARG(uap, val);
> > const struct timespec *timeout = SCARG(uap, timeout);
> > void *g = SCARG(uap, g);
> > - int flags = 0;
> > + int flags = op & FUTEX_FLAG_MASK;
> > int error = 0;
> >
> > - if (op & FUTEX_PRIVATE_FLAG)
> > - flags |= FT_PRIVATE;
> > -
> > - rw_enter_write(&ftlock);
> > - switch (op) {
> > + switch (op & FUTEX_OP_MASK) {
> > case FUTEX_WAIT:
> > - case FUTEX_WAIT_PRIVATE:
> > - error = futex_wait(uaddr, val, timeout, flags);
> > + error = futex_wait(p, uaddr, val, timeout, flags);
> > break;
> > case FUTEX_WAKE:
> > - case FUTEX_WAKE_PRIVATE:
> > - *retval = futex_wake(uaddr, val, flags);
> > + error = futex_wake(p, uaddr, val, flags, retval);
> > break;
> > case FUTEX_REQUEUE:
> > - case FUTEX_REQUEUE_PRIVATE:
> > - *retval = futex_requeue(uaddr, val, g, (u_long)timeout, flags);
> > + error = futex_requeue(p, uaddr, val, g,
> > + (u_long)timeout, flags, retval);
> > break;
> > default:
> > error = ENOSYS;
> > break;
> > }
> > - rw_exit_write(&ftlock);
> >
> > return error;
> > }
> >
> > -/*
> > - * Return an existing futex matching userspace address ``uaddr''.
> > - *
> > - * If such futex does not exist and FT_CREATE is given, create it.
> > - */
> > -struct futex *
> > -futex_get(uint32_t *uaddr, int flags)
> > +static void
> > +futex_addrs(struct proc *p, struct futex *f, uint32_t *uaddr, int flags)
> > {
> > - struct proc *p = curproc;
> > vm_map_t map = &p->p_vmspace->vm_map;
> > vm_map_entry_t entry;
> > struct uvm_object *obj = NULL;
> > struct vm_amap *amap = NULL;
> > voff_t off = (vaddr_t)uaddr;
> > - struct futex *f;
> > - struct futex_list *ftlist = &p->p_p->ps_ftlist;
> > + struct process *ps;
> >
> > - rw_assert_wrlock(&ftlock);
> > + if (ISSET(flags, FT_PRIVATE))
> > + ps = p->p_p;
> > + else {
> > + ps = NULL;
> >
> > - if (!(flags & FT_PRIVATE)) {
> > vm_map_lock_read(map);
> > if (uvm_map_lookup_entry(map, (vaddr_t)uaddr, &entry) &&
> > entry->inheritance == MAP_INHERIT_SHARE) {
> > if (UVM_ET_ISOBJ(entry)) {
> > - ftlist = &ftlist_shared;
> > obj = entry->object.uvm_obj;
> > off = entry->offset +
> > ((vaddr_t)uaddr - entry->start);
> > } else if (entry->aref.ar_amap) {
> > - ftlist = &ftlist_shared;
> > amap = entry->aref.ar_amap;
> > off = ptoa(entry->aref.ar_pageoff) +
> > ((vaddr_t)uaddr - entry->start);
> > @@ -155,47 +167,47 @@ futex_get(uint32_t *uaddr, int flags)
> > vm_map_unlock_read(map);
> > }
> >
> > - LIST_FOREACH(f, ftlist, ft_list) {
> > - if (f->ft_obj == obj && f->ft_amap == amap &&
> > - f->ft_off == off) {
> > - f->ft_refcnt++;
> > - break;
> > - }
> > - }
> > + f->ft_ps = ps;
> > + f->ft_obj = obj;
> > + f->ft_amap = amap;
> > + f->ft_off = off;
> > +}
> >
> > - if ((f == NULL) && (flags & FT_CREATE)) {
> > - /*
> > - * We rely on the rwlock to ensure that no other thread
> > - * create the same futex.
> > - */
> > - f = pool_get(&ftpool, PR_WAITOK);
> > - TAILQ_INIT(&f->ft_threads);
> > - f->ft_obj = obj;
> > - f->ft_amap = amap;
> > - f->ft_off = off;
> > - f->ft_refcnt = 1;
> > - LIST_INSERT_HEAD(ftlist, f, ft_list);
> > - }
> > +static inline struct futex_bucket *
> > +futex_get_bucket(struct futex *f)
>
> I'd appreciate if this hash function (like the thrsleep one) could be
> made more similar to the LOOKUP() used by the global sleep queue.
hrm. i folded more bits in to try to spread the bucket usage out. if
objects with locks in them are allocated on aligned boundaries and we're
only looking at the low bits of their addresses, they will falsely share
a bucket. the 3 bits from the shift plus the 6 bits from
FUTEX_BUCKET_BITS is only 9 bits, or 512 bytes of alignment; folding in
6 more bits makes it 3 + 6 + 6 = 15 bits, or 32k, which gives a bit
more margin.
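a worked example with made-up offsets (illustration only, the hex is
mine, not from the diff):

    /* key = ft_off >> 3; FUTEX_BUCKET_MASK is 0x3f */

    /* without the fold, only address bits 3..8 pick the bucket,
     * so two futexes 512 bytes apart collide:
     *      off 0x1000: key 0x200, 0x200 & 0x3f == 0x00
     *      off 0x1200: key 0x240, 0x240 & 0x3f == 0x00
     */

    /* with key ^= key >> 6, address bits 3..14 all contribute,
     * and the same two offsets spread out:
     *      0x200 ^ (0x200 >> 6) == 0x208, 0x208 & 0x3f == 0x08
     *      0x240 ^ (0x240 >> 6) == 0x249, 0x249 & 0x3f == 0x09
     */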
>
> > +{
> > + uint32_t key = f->ft_off >> 3; /* watevs */
> > + key ^= key >> FUTEX_BUCKET_BITS;
> >
> > - return f;
> > + return (&futex_hash[key & FUTEX_BUCKET_MASK]);
> > }
> >
> > -/*
> > - * Release a given futex.
> > - */
> > -void
> > -futex_put(struct futex *f)
> > +static int
> > +futex_remove(struct futex_bucket *ofb, struct futex *f)
> > {
> > - rw_assert_wrlock(&ftlock);
> > + struct futex_bucket *fb;
> > + int rv;
> > +
> > + /*
> > + * REQUEUE can move a futex between buckets, so follow it if needed.
> > + */
> >
> > - KASSERT(f->ft_refcnt > 0);
> > + for (;;) {
> > + rw_enter_write(&ofb->fb_lock);
> > + fb = futex_get_bucket(f);
> > + if (ofb == fb)
> > + break;
> >
> > - --f->ft_refcnt;
> > - if (f->ft_refcnt == 0) {
> > - KASSERT(TAILQ_EMPTY(&f->ft_threads));
> > - LIST_REMOVE(f, ft_list);
> > - pool_put(&ftpool, f);
> > + rw_exit_write(&ofb->fb_lock);
> > + ofb = fb;
> > }
> > +
> > + rv = f->ft_proc != NULL;
> > + if (rv)
> > + TAILQ_REMOVE(&fb->fb_list, f, ft_entry);
> > + rw_exit_write(&fb->fb_lock);
> > +
> > + return (rv);
> > }
> >
> > /*
> > @@ -203,34 +215,19 @@ futex_put(struct futex *f)
> > * ``uaddr''. Let it sleep for the specified ``timeout'' time, or
> > * indefinitely if the argument is NULL.
> > */
> > -int
> > -futex_wait(uint32_t *uaddr, uint32_t val, const struct timespec *timeout,
> > - int flags)
> > +static int
> > +futex_wait(struct proc *p, uint32_t *uaddr, uint32_t val,
> > + const struct timespec *timeout, int flags)
> > {
> > - struct proc *p = curproc;
> > - struct futex *f;
> > - uint64_t nsecs = INFSLP;
> > + struct futex f;
> > + struct futex_bucket *fb;
> > + uint64_t to_ticks = 0;
> > uint32_t cval;
> > int error;
> >
> > - /*
> > - * After reading the value a race is still possible but
> > - * we deal with it by serializing all futex syscalls.
> > - */
> > - rw_assert_wrlock(&ftlock);
> > -
> > - /*
> > - * Read user space futex value
> > - */
> > - if ((error = copyin32(uaddr, &cval)))
> > - return error;
> > -
> > - /* If the value changed, stop here. */
> > - if (cval != val)
> > - return EAGAIN;
> > -
> > if (timeout != NULL) {
> > struct timespec ts;
> > + uint64_t nsecs;
> >
> > if ((error = copyin(timeout, &ts, sizeof(ts))))
> > return error;
> > @@ -240,32 +237,77 @@ futex_wait(uint32_t *uaddr, uint32_t val
> > #endif
> > if (ts.tv_sec < 0 || !timespecisvalid(&ts))
> > return EINVAL;
> > +
> > nsecs = MAX(1, MIN(TIMESPEC_TO_NSEC(&ts), MAXTSLP));
> > + to_ticks = (nsecs + tick_nsec - 1) / (tick_nsec + 1) + 1;
> > + if (to_ticks > INT_MAX)
> > + to_ticks = INT_MAX;
> > }
> >
> > - f = futex_get(uaddr, flags | FT_CREATE);
> > - TAILQ_INSERT_TAIL(&f->ft_threads, p, p_fut_link);
> > - p->p_futex = f;
> > -
> > - error = rwsleep_nsec(p, &ftlock, PWAIT|PCATCH, "fsleep", nsecs);
> > - if (error == ERESTART)
> > - error = ECANCELED;
> > - else if (error == EWOULDBLOCK) {
> > - /* A race occurred between a wakeup and a timeout. */
> > - if (p->p_futex == NULL)
> > - error = 0;
> > - else
> > - error = ETIMEDOUT;
> > + futex_addrs(p, &f, uaddr, flags);
> > + fb = futex_get_bucket(&f);
> > +
> > + f.ft_proc = p;
> > + rw_enter_write(&fb->fb_lock);
> > + TAILQ_INSERT_TAIL(&fb->fb_list, &f, ft_entry);
> > + rw_exit_write(&fb->fb_lock);
> > +
> > + /*
> > + * Read user space futex value
> > + */
> > + if ((error = copyin32(uaddr, &cval)) != 0)
> > + goto exit;
> > +
> > + /* If the value changed, stop here. */
> > + if (cval != val) {
> > + error = EAGAIN;
> > + goto exit;
> > }
>
> I believe the reason for using a rwlock was to be able to read the value
> while holding it...
yeah, that makes sense.
we don't have to be holding a futex lock while doing the copyin
now though, which is good because it means we won't be blocking other
threads if it needs to take a page fault.
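condensed from futex_wait in the diff (error handling elided), the
order is queue first, validate second, sleep last:

    f.ft_proc = p;
    rw_enter_write(&fb->fb_lock);
    TAILQ_INSERT_TAIL(&fb->fb_list, &f, ft_entry);  /* 1. queue */
    rw_exit_write(&fb->fb_lock);

    error = copyin32(uaddr, &cval); /* 2. read with no lock held,
                                     * so a fault blocks nobody */
    if (error != 0 || cval != val)
            goto exit;              /* dequeue via futex_remove */

    sleep_setup(&f, PWAIT|PCATCH, "fsleep");
    error = sleep_finish(to_ticks,  /* 3. don't block if a waker */
        f.ft_proc != NULL);         /* already cleared ft_proc */

any wakeup that races with step 2 has already cleared f.ft_proc, so
sleep_finish just returns instead of blocking.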
>
> >
> > + sleep_setup(&f, PWAIT|PCATCH, "fsleep");
> > + error = sleep_finish(to_ticks, f.ft_proc != NULL);
> > /* Remove ourself if we haven't been awaken. */
> > - if ((f = p->p_futex) != NULL) {
> > - p->p_futex = NULL;
> > - TAILQ_REMOVE(&f->ft_threads, p, p_fut_link);
> > - futex_put(f);
> > + if (error != 0 || f.ft_proc != NULL) {
> > + if (futex_remove(fb, &f) == 0)
> > + error = 0;
> > +
> > + switch (error) {
> > + case ERESTART:
> > + error = ECANCELED;
> > + break;
> > + case EWOULDBLOCK:
> > + error = ETIMEDOUT;
> > + break;
>
> Could you add a default here, to be clear it's not a mistake.
kind of disappointing to think that c programmers need help reading
their c, but sure.
>
> > + }
> > }
> >
> > return error;
> > +exit:
> > + if (f.ft_proc != NULL)
> > + futex_remove(fb, &f);
> > + return error;
> > +}
> > +
> > +static void
> > +futexen_wakeup(struct futexen *fl)
> > +{
> > + struct futex *f, *nf;
> > + struct proc *p;
> > +
>
> What is preventing a race with the waiting thread returning due to a
> timeout? In other words, what guarantees that we can dereference
> another thread's stack?
if you're holding a futex bucket lock, then you can reference any
of the futex structs in the futex bucket list.
the ft_proc pointer acts as a flag that indicates whether the futex
struct is queued or not. if it is not NULL it's on the list, if it is
NULL it has been removed from the list.
you can only change ft_proc and remove the futex from the list while
you're holding the futex bucket lock. this is true even for the timeout
(and other error) handling in futex_wait. futex_wait must not return
before the futex struct has been removed from the list.
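condensed, the timeout/signal side of futex_wait does exactly that
(the bucket-following loop in futex_remove is elided here):

    rw_enter_write(&fb->fb_lock);
    if (f.ft_proc != NULL) {
            /* still queued: no waker got to us, so dequeue
             * ourselves and report the timeout/signal */
            TAILQ_REMOVE(&fb->fb_list, &f, ft_entry);
    } else {
            /* a waker dequeued us before we took fb_lock;
             * the wakeup wins and the error is dropped */
            error = 0;
    }
    rw_exit_write(&fb->fb_lock);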
> > + /*
> > + * take care to avoid referencing f after we set ft_proc
> > + * to NULL (and wake the associated thread up). f is on the
> > + * stack of the thread we're trying to let out of the kernel,
> > + * so it can go away.
> > + */
> > +
> > + SCHED_LOCK();
> > + TAILQ_FOREACH_SAFE(f, fl, ft_entry, nf) {
> > + p = f->ft_proc;
> > + f->ft_proc = NULL;
> > + wakeup_proc(p);
> > + }
> > + SCHED_UNLOCK();
> > }
> >
> > /*
> > @@ -273,46 +315,135 @@ futex_wait(uint32_t *uaddr, uint32_t val
> > * ``uaddr'' and requeue at most ``m'' sibling threads on a futex at
> > * address ``uaddr2''.
> > */
> > -int
> > -futex_requeue(uint32_t *uaddr, uint32_t n, uint32_t *uaddr2, uint32_t m,
> > - int flags)
> > +static int
> > +futex_requeue(struct proc *p, uint32_t *uaddr, uint32_t n,
> > + uint32_t *uaddr2, uint32_t m, int flags, register_t *retval)
> > {
> > - struct futex *f, *g;
> > - struct proc *p;
> > + struct futexen fl = TAILQ_HEAD_INITIALIZER(fl);
> > + struct futex okey, nkey;
> > + struct futex *f, *nf, *mf = NULL;
> > + struct futex_bucket *ofb, *nfb;
> > uint32_t count = 0;
> >
> > - rw_assert_wrlock(&ftlock);
> > + if (m == 0)
> > + return futex_wake(p, uaddr, n, flags, retval);
> >
> > - f = futex_get(uaddr, flags);
> > - if (f == NULL)
> > - return 0;
> > + futex_addrs(p, &okey, uaddr, flags);
> > + ofb = futex_get_bucket(&okey);
> > + futex_addrs(p, &nkey, uaddr2, flags);
> > + nfb = futex_get_bucket(&nkey);
> > +
> > + if (ofb->fb_id < nfb->fb_id) {
> > + rw_enter_write(&ofb->fb_lock);
> > + rw_enter_write(&nfb->fb_lock);
> > + } else if (ofb->fb_id > nfb->fb_id) {
> > + rw_enter_write(&nfb->fb_lock);
> > + rw_enter_write(&ofb->fb_lock);
> > + } else
> > + rw_enter_write(&ofb->fb_lock);
> > +
> > + TAILQ_FOREACH_SAFE(f, &ofb->fb_list, ft_entry, nf) {
> > + /* __builtin_prefetch(nf, 1); */
> > + KASSERT(f->ft_proc != NULL);
> >
> > - while ((p = TAILQ_FIRST(&f->ft_threads)) != NULL && (count < (n + m))) {
> > - p->p_futex = NULL;
> > - TAILQ_REMOVE(&f->ft_threads, p, p_fut_link);
> > - futex_put(f);
> > -
> > - if (count < n) {
> > - wakeup_one(p);
> > - } else if (uaddr2 != NULL) {
> > - g = futex_get(uaddr2, FT_CREATE);
> > - TAILQ_INSERT_TAIL(&g->ft_threads, p, p_fut_link);
> > - p->p_futex = g;
> > + if (!futex_is_eq(f, &okey))
> > + continue;
> > +
> > + TAILQ_REMOVE(&ofb->fb_list, f, ft_entry);
> > + TAILQ_INSERT_TAIL(&fl, f, ft_entry);
> > +
> > + if (++count == n) {
> > + mf = nf;
> > + break;
> > }
> > - count++;
> > }
> >
> > - futex_put(f);
> > + if (!TAILQ_EMPTY(&fl))
> > + futexen_wakeup(&fl);
> > +
> > + /* update matching futexes */
> > + if (mf != NULL) {
> > + /*
> > + * only iterate from the current entry to the tail
> > + * of the list as it is now in case we're requeueing
> > + * on the end of the same list.
> > + */
> > + nf = TAILQ_LAST(&ofb->fb_list, futexen);
> > + do {
> > + f = mf;
> > + mf = TAILQ_NEXT(f, ft_entry);
> > + /* __builtin_prefetch(mf, 1); */
> > +
> > + KASSERT(f->ft_proc != NULL);
> > +
> > + if (!futex_is_eq(f, &okey))
> > + continue;
> > +
> > + TAILQ_REMOVE(&ofb->fb_list, f, ft_entry);
> > + /* it should only be ft_off that changes, but eh */
>
> Not sure about this comment. The offset is relative to the object or
> anon. Both could change depending on the address, no?
i don't actually know, so i'll take your word for it. i'll remove the
comment.
>
> > + f->ft_ps = nkey.ft_ps;
> > + f->ft_obj = nkey.ft_obj;
> > + f->ft_amap = nkey.ft_amap;
> > + f->ft_off = nkey.ft_off;
> > +
> > + TAILQ_INSERT_TAIL(&nfb->fb_list, f, ft_entry);
> > +
> > + if (--m == 0)
> > + break;
> > + } while (f != nf);
> > + }
> > +
> > + if (ofb->fb_id != nfb->fb_id)
> > + rw_exit_write(&nfb->fb_lock);
> > + rw_exit_write(&ofb->fb_lock);
> >
> > - return count;
> > + *retval = count;
> > + return 0;
> > }
> >
> > /*
> > * Wakeup at most ``n'' sibling threads sleeping on a futex at address
> > * ``uaddr''.
> > */
> > -int
> > -futex_wake(uint32_t *uaddr, uint32_t n, int flags)
> > +static int
> > +futex_wake(struct proc *p, uint32_t *uaddr, uint32_t n, int flags,
> > + register_t *retval)
> > {
> > - return futex_requeue(uaddr, n, NULL, 0, flags);
> > + struct futexen fl = TAILQ_HEAD_INITIALIZER(fl);
> > + struct futex key;
> > + struct futex *f, *nf;
> > + struct futex_bucket *fb;
> > + int count = 0;
> > +
> > + if (n == 0) {
> > + *retval = 0;
> > + return 0;
> > + }
> > +
> > + futex_addrs(p, &key, uaddr, flags);
> > + fb = futex_get_bucket(&key);
> > +
> > + rw_enter_write(&fb->fb_lock);
> > +
> > + TAILQ_FOREACH_SAFE(f, &fb->fb_list, ft_entry, nf) {
> > + /* __builtin_prefetch(nf, 1); */
> > + KASSERT(f->ft_proc != NULL);
> > +
> > + if (!futex_is_eq(f, &key))
> > + continue;
> > +
> > + TAILQ_REMOVE(&fb->fb_list, f, ft_entry);
> > + TAILQ_INSERT_TAIL(&fl, f, ft_entry);
> > +
> > + if (++count == n)
> > + break;
> > + }
>
> Can we release the sleep queue lock here? Or is it this lock that acts
> as a barrier to ensure we can dereference the stack? Could this be
> documented somehow?
yes, as described above, holding the futex lock here provides the
guarantee that the futex struct won't go away.
> > + if (!TAILQ_EMPTY(&fl))
> > + futexen_wakeup(&fl);
> > +
> > + rw_exit_write(&fb->fb_lock);
> > +
> > + *retval = count;
> > + return 0;
> > }
if i get time at work i'll try and update the code today.