From: Claudio Jeker Subject: Re: route gwroute mutex To: Alexander Bluhm Cc: Vitaliy Makkoveev , tech@openbsd.org Date: Fri, 14 Mar 2025 13:29:47 +0100 On Thu, Mar 13, 2025 at 10:58:50PM +0100, Alexander Bluhm wrote: > On Thu, Mar 06, 2025 at 05:47:59PM +0300, Vitaliy Makkoveev wrote: > > On Thu, Mar 06, 2025 at 03:24:08PM +0100, Claudio Jeker wrote: > > > On Thu, Mar 06, 2025 at 03:17:54PM +0100, Claudio Jeker wrote: > > > > On Thu, Mar 06, 2025 at 02:47:10PM +0100, Alexander Bluhm wrote: > > > > > Hi, > > > > > > > > > > I am still trying to make the leaves of the routing tree MP safe. > > > > > > > > > > A small step would be to protect rt_cachecnt and rt_gwroute writers > > > > > with a per route mutex. I am aware that this is not complete, e.g. > > > > > rt_flags also looks fishy. The READ_ONCE() in rt_getll() does not > > > > > protect from use-after-free. I want to avoid mutex in the hot path. > > > > > A previous attempt with SMR failed when running BGP. Let's make > > > > > this first step and find a smarter solition later. route_cleargateway() > > > > > is not in the hot packet path, so I added the mutex there. > > > > > > > > > > I have moved the rt_cachecnt from rt_setgwroute() to rt_putgwroute() > > > > > to have the RTF_CACHED logic in one function. > > > > > > > > > > ok? > > > > > > > > I'm begrudgingly OK with this. Right now the situation is clearly > > > > incorrect and the mutex solves this right now. > > > > Long term we should move towards making struct rtentry more RCU friendly > > > > so we can drop the mutex. But lets tackle that after 7.7 > > > > > > But please use the mutex in all cases and don't try to trick around it. > > > I know it sucks but at least then the code is correct. > > > > I agree, it's better to use mutex. I'm curious, is the performance impact > > such significant? > > I made some measurements for the different locks. > > http://bluhm.genua.de/perform/results/2025-03-07T02:19:19Z/perform.html > > The performance impact of mutex, read once, mutex + refcount, > smr + refcount is best seen in this graphic. > http://bluhm.genua.de/perform/results/2025-03-07T02:19:19Z/gnuplot/tcp.html > > The difference is very small, the variation of multiple measurements > is bigger. It is most expensive when rt_getll() returns a refcounted > route. Calling rtref() and rtfree() in the hot path is so expensive > that the mutex is irrelevant. > > The effect is better visible in these kernel stack flame graphs. > They show forwaring parallel UDP traffic. Click on ip_forward and > search for getll. > > http://bluhm.genua.de/perform/results/2025-03-07T02:19:19Z/2025-03-07T00%3A00%3A00Z/btrace/netbench.pl_-v_-B1000000000_-b1000000_-d1_-f1_-i3_-N10_-cperform%40lt13_-sperform%40lt16_-a10.3.46.60_-t10_udpbench-btrace-kstack.0.svg?s=getll&x=784.6&y=213 > http://bluhm.genua.de/perform/results/2025-03-07T02:19:19Z/patch-sys-rt-gwroute-mutex.0/btrace/netbench.pl_-v_-B1000000000_-b1000000_-d1_-f1_-i3_-N10_-cperform%40lt13_-sperform%40lt16_-a10.3.46.60_-t10_udpbench-btrace-kstack.0.svg?s=getll&x=769.9&y=197 > http://bluhm.genua.de/perform/results/2025-03-07T02:19:19Z/patch-sys-rt-gwroute-readonce.0/btrace/netbench.pl_-v_-B1000000000_-b1000000_-d1_-f1_-i3_-N10_-cperform%40lt13_-sperform%40lt16_-a10.3.46.60_-t10_udpbench-btrace-kstack.0.svg?s=getll&x=809.4&y=197 > http://bluhm.genua.de/perform/results/2025-03-07T02:19:19Z/patch-sys-rt-gwroute-refcnt.0/btrace/netbench.pl_-v_-B1000000000_-b1000000_-d1_-f1_-i3_-N10_-cperform%40lt13_-sperform%40lt16_-a10.3.46.60_-t10_udpbench-btrace-kstack.0.svg?s=getll&x=639.2&y=197 > http://bluhm.genua.de/perform/results/2025-03-07T02:19:19Z/patch-sys-rt-gwroute-smr.0/btrace/netbench.pl_-v_-B1000000000_-b1000000_-d1_-f1_-i3_-N10_-cperform%40lt13_-sperform%40lt16_-a10.3.46.60_-t10_udpbench-btrace-kstack.0.svg?s=getll&x=676.9&y=213 > > SMR does not help anything here. When playing with it, I saw > additional races due to smr_barrier(). > > Maybe putting a mutex around rt_gwroute in rt_isvalid() and rt_getll() > is best. But it does not make the code more correct if rt_getll() > does not refcount the route while holding the mutex. If I remember > mpi@ correctly, the idea of the RTF_CACHED flag is, that rt_gwroute > is not removed while in use. If that is true, READ_ONCE() should > be sufficient. > Thanks for all this work. I have to say I dislike to take sortcuts for performance. Especially those that can turn bad later on. Your SMR diff is too simplistic and so you don't really get the benefit. I think the trick here is that you want to SMR all of struct rt_entry and then you only need to take a refcnt if you want to sleep holding the rt_entry. It is interesting that refcnt_take() is so expensive. Something is off there. I wonder why the mutex is cheap but the refcnt_take() is not. Both require similar atomic instructions and I would expect similar results for the two. Maybe this is some profiling artifact. I doubt there are minimal patches to fix the issues around routes (without affecting performance). We need to sit down and rethink and reqrite a lot of this. > bluhm > > Index: net/route.c > =================================================================== > RCS file: /data/mirror/openbsd/cvs/src/sys/net/route.c,v > diff -u -p -r1.443 route.c > --- net/route.c 6 Mar 2025 23:09:02 -0000 1.443 > +++ net/route.c 7 Mar 2025 18:52:03 -0000 > @@ -324,10 +324,13 @@ rtisvalid(struct rtentry *rt) > return (0); > > if (ISSET(rt->rt_flags, RTF_GATEWAY)) { > - if (rt->rt_gwroute == NULL) > + struct rtentry *gwroute; > + > + gwroute = READ_ONCE(rt->rt_gwroute); > + if (gwroute == NULL) > return (0); > - KASSERT(!ISSET(rt->rt_gwroute->rt_flags, RTF_GATEWAY)); > - if (!ISSET(rt->rt_gwroute->rt_flags, RTF_UP)) > + KASSERT(!ISSET(gwroute->rt_flags, RTF_GATEWAY)); > + if (!ISSET(gwroute->rt_flags, RTF_UP)) > return (0); > } > > @@ -599,8 +602,10 @@ rt_putgwroute(struct rtentry *rt, struct > mtx_leave(&nhrt->rt_mtx); > } > > + mtx_enter(&rt->rt_mtx); > onhrt = rt->rt_gwroute; > rt->rt_gwroute = nhrt; > + mtx_leave(&rt->rt_mtx); > > if (onhrt != NULL) { > mtx_enter(&onhrt->rt_mtx); > @@ -1169,8 +1174,11 @@ struct rtentry * > rt_getll(struct rtentry *rt) > { > if (ISSET(rt->rt_flags, RTF_GATEWAY)) { > + struct rtentry *gwroute; > + > + gwroute = READ_ONCE(rt->rt_gwroute); > /* We may return NULL here. */ > - return (rt->rt_gwroute); > + return (gwroute); > } > > return (rt); > Index: net/route.h > =================================================================== > RCS file: /data/mirror/openbsd/cvs/src/sys/net/route.h,v > diff -u -p -r1.214 route.h > --- net/route.h 6 Mar 2025 23:09:02 -0000 1.214 > +++ net/route.h 7 Mar 2025 12:44:42 -0000 > @@ -123,7 +123,7 @@ struct rtentry { > caddr_t rt_llinfo; /* [L] pointer to link level info or > an MPLS structure */ > union { > - struct rtentry *_nh; /* [X] rtentry for rt_gateway */ > + struct rtentry *_nh; /* [r] rtentry for rt_gateway */ > unsigned int _ref; /* [r] # gateway rtentry refs */ > } RT_gw; > #define rt_gwroute RT_gw._nh > Index: net/rtsock.c > =================================================================== > RCS file: /data/mirror/openbsd/cvs/src/sys/net/rtsock.c,v > diff -u -p -r1.381 rtsock.c > --- net/rtsock.c 16 Feb 2025 11:39:28 -0000 1.381 > +++ net/rtsock.c 7 Mar 2025 12:44:34 -0000 > @@ -1343,9 +1343,11 @@ route_cleargateway(struct rtentry *rt, v > { > struct rtentry *nhrt = arg; > > + mtx_enter(&rt->rt_mtx); > if (ISSET(rt->rt_flags, RTF_GATEWAY) && rt->rt_gwroute == nhrt && > !ISSET(rt->rt_locks, RTV_MTU)) > atomic_store_int(&rt->rt_mtu, 0); > + mtx_leave(&rt->rt_mtx); > > return (0); > } > -- :wq Claudio