From: Vitaliy Makkoveev Subject: Re: route gwroute mutex To: Alexander Bluhm , tech@openbsd.org Date: Fri, 14 Mar 2025 17:32:51 +0300 On Fri, Mar 14, 2025 at 01:29:47PM +0100, Claudio Jeker wrote: > On Thu, Mar 13, 2025 at 10:58:50PM +0100, Alexander Bluhm wrote: > > On Thu, Mar 06, 2025 at 05:47:59PM +0300, Vitaliy Makkoveev wrote: > > > On Thu, Mar 06, 2025 at 03:24:08PM +0100, Claudio Jeker wrote: > > > > On Thu, Mar 06, 2025 at 03:17:54PM +0100, Claudio Jeker wrote: > > > > > On Thu, Mar 06, 2025 at 02:47:10PM +0100, Alexander Bluhm wrote: > > > > > > Hi, > > > > > > > > > > > > I am still trying to make the leaves of the routing tree MP safe. > > > > > > > > > > > > A small step would be to protect rt_cachecnt and rt_gwroute writers > > > > > > with a per route mutex. I am aware that this is not complete, e.g. > > > > > > rt_flags also looks fishy. The READ_ONCE() in rt_getll() does not > > > > > > protect from use-after-free. I want to avoid mutex in the hot path. > > > > > > A previous attempt with SMR failed when running BGP. Let's make > > > > > > this first step and find a smarter solition later. route_cleargateway() > > > > > > is not in the hot packet path, so I added the mutex there. > > > > > > > > > > > > I have moved the rt_cachecnt from rt_setgwroute() to rt_putgwroute() > > > > > > to have the RTF_CACHED logic in one function. > > > > > > > > > > > > ok? > > > > > > > > > > I'm begrudgingly OK with this. Right now the situation is clearly > > > > > incorrect and the mutex solves this right now. > > > > > Long term we should move towards making struct rtentry more RCU friendly > > > > > so we can drop the mutex. But lets tackle that after 7.7 > > > > > > > > But please use the mutex in all cases and don't try to trick around it. > > > > I know it sucks but at least then the code is correct. > > > > > > I agree, it's better to use mutex. I'm curious, is the performance impact > > > such significant? > > > > I made some measurements for the different locks. > > > > http://bluhm.genua.de/perform/results/2025-03-07T02:19:19Z/perform.html > > > > The performance impact of mutex, read once, mutex + refcount, > > smr + refcount is best seen in this graphic. > > http://bluhm.genua.de/perform/results/2025-03-07T02:19:19Z/gnuplot/tcp.html > > > > The difference is very small, the variation of multiple measurements > > is bigger. It is most expensive when rt_getll() returns a refcounted > > route. Calling rtref() and rtfree() in the hot path is so expensive > > that the mutex is irrelevant. > > > > The effect is better visible in these kernel stack flame graphs. > > They show forwaring parallel UDP traffic. Click on ip_forward and > > search for getll. > > > > http://bluhm.genua.de/perform/results/2025-03-07T02:19:19Z/2025-03-07T00%3A00%3A00Z/btrace/netbench.pl_-v_-B1000000000_-b1000000_-d1_-f1_-i3_-N10_-cperform%40lt13_-sperform%40lt16_-a10.3.46.60_-t10_udpbench-btrace-kstack.0.svg?s=getll&x=784.6&y=213 > > http://bluhm.genua.de/perform/results/2025-03-07T02:19:19Z/patch-sys-rt-gwroute-mutex.0/btrace/netbench.pl_-v_-B1000000000_-b1000000_-d1_-f1_-i3_-N10_-cperform%40lt13_-sperform%40lt16_-a10.3.46.60_-t10_udpbench-btrace-kstack.0.svg?s=getll&x=769.9&y=197 > > http://bluhm.genua.de/perform/results/2025-03-07T02:19:19Z/patch-sys-rt-gwroute-readonce.0/btrace/netbench.pl_-v_-B1000000000_-b1000000_-d1_-f1_-i3_-N10_-cperform%40lt13_-sperform%40lt16_-a10.3.46.60_-t10_udpbench-btrace-kstack.0.svg?s=getll&x=809.4&y=197 > > http://bluhm.genua.de/perform/results/2025-03-07T02:19:19Z/patch-sys-rt-gwroute-refcnt.0/btrace/netbench.pl_-v_-B1000000000_-b1000000_-d1_-f1_-i3_-N10_-cperform%40lt13_-sperform%40lt16_-a10.3.46.60_-t10_udpbench-btrace-kstack.0.svg?s=getll&x=639.2&y=197 > > http://bluhm.genua.de/perform/results/2025-03-07T02:19:19Z/patch-sys-rt-gwroute-smr.0/btrace/netbench.pl_-v_-B1000000000_-b1000000_-d1_-f1_-i3_-N10_-cperform%40lt13_-sperform%40lt16_-a10.3.46.60_-t10_udpbench-btrace-kstack.0.svg?s=getll&x=676.9&y=213 > > > > SMR does not help anything here. When playing with it, I saw > > additional races due to smr_barrier(). > > > > Maybe putting a mutex around rt_gwroute in rt_isvalid() and rt_getll() > > is best. But it does not make the code more correct if rt_getll() > > does not refcount the route while holding the mutex. If I remember > > mpi@ correctly, the idea of the RTF_CACHED flag is, that rt_gwroute > > is not removed while in use. If that is true, READ_ONCE() should > > be sufficient. > > > > Thanks for all this work. I have to say I dislike to take sortcuts for > performance. Especially those that can turn bad later on. > > Your SMR diff is too simplistic and so you don't really get the benefit. I > think the trick here is that you want to SMR all of struct rt_entry and > then you only need to take a refcnt if you want to sleep holding the > rt_entry. > > It is interesting that refcnt_take() is so expensive. Something is off > there. I wonder why the mutex is cheap but the refcnt_take() is not. > Both require similar atomic instructions and I would expect similar > results for the two. Maybe this is some profiling artifact. > dt(4) provides this refcnt_take() performance impact. > I doubt there are minimal patches to fix the issues around routes (without > affecting performance). We need to sit down and rethink and reqrite a lot > of this. > > > bluhm > > > > Index: net/route.c > > =================================================================== > > RCS file: /data/mirror/openbsd/cvs/src/sys/net/route.c,v > > diff -u -p -r1.443 route.c > > --- net/route.c 6 Mar 2025 23:09:02 -0000 1.443 > > +++ net/route.c 7 Mar 2025 18:52:03 -0000 > > @@ -324,10 +324,13 @@ rtisvalid(struct rtentry *rt) > > return (0); > > > > if (ISSET(rt->rt_flags, RTF_GATEWAY)) { > > - if (rt->rt_gwroute == NULL) > > + struct rtentry *gwroute; > > + > > + gwroute = READ_ONCE(rt->rt_gwroute); > > + if (gwroute == NULL) > > return (0); > > - KASSERT(!ISSET(rt->rt_gwroute->rt_flags, RTF_GATEWAY)); > > - if (!ISSET(rt->rt_gwroute->rt_flags, RTF_UP)) > > + KASSERT(!ISSET(gwroute->rt_flags, RTF_GATEWAY)); > > + if (!ISSET(gwroute->rt_flags, RTF_UP)) > > return (0); > > } > > > > @@ -599,8 +602,10 @@ rt_putgwroute(struct rtentry *rt, struct > > mtx_leave(&nhrt->rt_mtx); > > } > > > > + mtx_enter(&rt->rt_mtx); > > onhrt = rt->rt_gwroute; > > rt->rt_gwroute = nhrt; > > + mtx_leave(&rt->rt_mtx); > > > > if (onhrt != NULL) { > > mtx_enter(&onhrt->rt_mtx); > > @@ -1169,8 +1174,11 @@ struct rtentry * > > rt_getll(struct rtentry *rt) > > { > > if (ISSET(rt->rt_flags, RTF_GATEWAY)) { > > + struct rtentry *gwroute; > > + > > + gwroute = READ_ONCE(rt->rt_gwroute); > > /* We may return NULL here. */ > > - return (rt->rt_gwroute); > > + return (gwroute); > > } > > > > return (rt); > > Index: net/route.h > > =================================================================== > > RCS file: /data/mirror/openbsd/cvs/src/sys/net/route.h,v > > diff -u -p -r1.214 route.h > > --- net/route.h 6 Mar 2025 23:09:02 -0000 1.214 > > +++ net/route.h 7 Mar 2025 12:44:42 -0000 > > @@ -123,7 +123,7 @@ struct rtentry { > > caddr_t rt_llinfo; /* [L] pointer to link level info or > > an MPLS structure */ > > union { > > - struct rtentry *_nh; /* [X] rtentry for rt_gateway */ > > + struct rtentry *_nh; /* [r] rtentry for rt_gateway */ > > unsigned int _ref; /* [r] # gateway rtentry refs */ > > } RT_gw; > > #define rt_gwroute RT_gw._nh > > Index: net/rtsock.c > > =================================================================== > > RCS file: /data/mirror/openbsd/cvs/src/sys/net/rtsock.c,v > > diff -u -p -r1.381 rtsock.c > > --- net/rtsock.c 16 Feb 2025 11:39:28 -0000 1.381 > > +++ net/rtsock.c 7 Mar 2025 12:44:34 -0000 > > @@ -1343,9 +1343,11 @@ route_cleargateway(struct rtentry *rt, v > > { > > struct rtentry *nhrt = arg; > > > > + mtx_enter(&rt->rt_mtx); > > if (ISSET(rt->rt_flags, RTF_GATEWAY) && rt->rt_gwroute == nhrt && > > !ISSET(rt->rt_locks, RTV_MTU)) > > atomic_store_int(&rt->rt_mtu, 0); > > + mtx_leave(&rt->rt_mtx); > > > > return (0); > > } > > > > -- > :wq Claudio >