Index | Thread | Search

From:
Claudio Jeker <cjeker@diehard.n-r-g.com>
Subject:
Re: route gwroute mutex
To:
Alexander Bluhm <bluhm@openbsd.org>
Cc:
Vitaliy Makkoveev <mvs@openbsd.org>, tech@openbsd.org
Date:
Fri, 14 Mar 2025 13:29:47 +0100

Download raw body.

Thread
On Thu, Mar 13, 2025 at 10:58:50PM +0100, Alexander Bluhm wrote:
> On Thu, Mar 06, 2025 at 05:47:59PM +0300, Vitaliy Makkoveev wrote:
> > On Thu, Mar 06, 2025 at 03:24:08PM +0100, Claudio Jeker wrote:
> > > On Thu, Mar 06, 2025 at 03:17:54PM +0100, Claudio Jeker wrote:
> > > > On Thu, Mar 06, 2025 at 02:47:10PM +0100, Alexander Bluhm wrote:
> > > > > Hi,
> > > > > 
> > > > > I am still trying to make the leaves of the routing tree MP safe.
> > > > > 
> > > > > A small step would be to protect rt_cachecnt and rt_gwroute writers
> > > > > with a per route mutex.  I am aware that this is not complete, e.g.
> > > > > rt_flags also looks fishy.  The READ_ONCE() in rt_getll() does not
> > > > > protect from use-after-free.  I want to avoid mutex in the hot path.
> > > > > A previous attempt with SMR failed when running BGP.  Let's make
> > > > > this first step and find a smarter solition later.  route_cleargateway()
> > > > > is not in the hot packet path, so I added the mutex there.
> > > > > 
> > > > > I have moved the rt_cachecnt from rt_setgwroute() to rt_putgwroute()
> > > > > to have the RTF_CACHED logic in one function.
> > > > > 
> > > > > ok?
> > > > 
> > > > I'm begrudgingly OK with this. Right now the situation is clearly
> > > > incorrect and the mutex solves this right now.
> > > > Long term we should move towards making struct rtentry more RCU friendly
> > > > so we can drop the mutex. But lets tackle that after 7.7 
> > > 
> > > But please use the mutex in all cases and don't try to trick around it.
> > > I know it sucks but at least then the code is correct.
> > 
> > I agree, it's better to use mutex. I'm curious, is the performance impact
> > such significant?
> 
> I made some measurements for the different locks.
> 
> http://bluhm.genua.de/perform/results/2025-03-07T02:19:19Z/perform.html
> 
> The performance impact of mutex, read once, mutex + refcount,
> smr + refcount is best seen in this graphic.
> http://bluhm.genua.de/perform/results/2025-03-07T02:19:19Z/gnuplot/tcp.html
> 
> The difference is very small, the variation of multiple measurements
> is bigger.  It is most expensive when rt_getll() returns a refcounted
> route.  Calling rtref() and rtfree() in the hot path is so expensive
> that the mutex is irrelevant.
> 
> The effect is better visible in these kernel stack flame graphs.
> They show forwaring parallel UDP traffic.  Click on ip_forward and
> search for getll.
> 
> http://bluhm.genua.de/perform/results/2025-03-07T02:19:19Z/2025-03-07T00%3A00%3A00Z/btrace/netbench.pl_-v_-B1000000000_-b1000000_-d1_-f1_-i3_-N10_-cperform%40lt13_-sperform%40lt16_-a10.3.46.60_-t10_udpbench-btrace-kstack.0.svg?s=getll&x=784.6&y=213
> http://bluhm.genua.de/perform/results/2025-03-07T02:19:19Z/patch-sys-rt-gwroute-mutex.0/btrace/netbench.pl_-v_-B1000000000_-b1000000_-d1_-f1_-i3_-N10_-cperform%40lt13_-sperform%40lt16_-a10.3.46.60_-t10_udpbench-btrace-kstack.0.svg?s=getll&x=769.9&y=197
> http://bluhm.genua.de/perform/results/2025-03-07T02:19:19Z/patch-sys-rt-gwroute-readonce.0/btrace/netbench.pl_-v_-B1000000000_-b1000000_-d1_-f1_-i3_-N10_-cperform%40lt13_-sperform%40lt16_-a10.3.46.60_-t10_udpbench-btrace-kstack.0.svg?s=getll&x=809.4&y=197
> http://bluhm.genua.de/perform/results/2025-03-07T02:19:19Z/patch-sys-rt-gwroute-refcnt.0/btrace/netbench.pl_-v_-B1000000000_-b1000000_-d1_-f1_-i3_-N10_-cperform%40lt13_-sperform%40lt16_-a10.3.46.60_-t10_udpbench-btrace-kstack.0.svg?s=getll&x=639.2&y=197
> http://bluhm.genua.de/perform/results/2025-03-07T02:19:19Z/patch-sys-rt-gwroute-smr.0/btrace/netbench.pl_-v_-B1000000000_-b1000000_-d1_-f1_-i3_-N10_-cperform%40lt13_-sperform%40lt16_-a10.3.46.60_-t10_udpbench-btrace-kstack.0.svg?s=getll&x=676.9&y=213
> 
> SMR does not help anything here.  When playing with it, I saw
> additional races due to smr_barrier().
> 
> Maybe putting a mutex around rt_gwroute in rt_isvalid() and rt_getll()
> is best.  But it does not make the code more correct if rt_getll()
> does not refcount the route while holding the mutex.  If I remember
> mpi@ correctly, the idea of the RTF_CACHED flag is, that rt_gwroute
> is not removed while in use.  If that is true, READ_ONCE() should
> be sufficient.
> 

Thanks for all this work. I have to say I dislike to take sortcuts for
performance. Especially those that can turn bad later on.

Your SMR diff is too simplistic and so you don't really get the benefit. I
think the trick here is that you want to SMR all of struct rt_entry and
then you only need to take a refcnt if you want to sleep holding the
rt_entry.

It is interesting that refcnt_take() is so expensive. Something is off
there. I wonder why the mutex is cheap but the refcnt_take() is not.
Both require similar atomic instructions and I would expect similar
results for the two. Maybe this is some profiling artifact.

I doubt there are minimal patches to fix the issues around routes (without
affecting performance).  We need to sit down and rethink and reqrite a lot
of this.

> bluhm
> 
> Index: net/route.c
> ===================================================================
> RCS file: /data/mirror/openbsd/cvs/src/sys/net/route.c,v
> diff -u -p -r1.443 route.c
> --- net/route.c	6 Mar 2025 23:09:02 -0000	1.443
> +++ net/route.c	7 Mar 2025 18:52:03 -0000
> @@ -324,10 +324,13 @@ rtisvalid(struct rtentry *rt)
>  		return (0);
>  
>  	if (ISSET(rt->rt_flags, RTF_GATEWAY)) {
> -		if (rt->rt_gwroute == NULL)
> +		struct rtentry *gwroute;
> +
> +		gwroute = READ_ONCE(rt->rt_gwroute);
> +		if (gwroute == NULL)
>  			return (0);
> -		KASSERT(!ISSET(rt->rt_gwroute->rt_flags, RTF_GATEWAY));
> -		if (!ISSET(rt->rt_gwroute->rt_flags, RTF_UP))
> +		KASSERT(!ISSET(gwroute->rt_flags, RTF_GATEWAY));
> +		if (!ISSET(gwroute->rt_flags, RTF_UP))
>  			return (0);
>  	}
>  
> @@ -599,8 +602,10 @@ rt_putgwroute(struct rtentry *rt, struct
>  		mtx_leave(&nhrt->rt_mtx);
>  	}
>  
> +	mtx_enter(&rt->rt_mtx);
>  	onhrt = rt->rt_gwroute;
>  	rt->rt_gwroute = nhrt;
> +	mtx_leave(&rt->rt_mtx);
>  
>  	if (onhrt != NULL) {
>  		mtx_enter(&onhrt->rt_mtx);
> @@ -1169,8 +1174,11 @@ struct rtentry *
>  rt_getll(struct rtentry *rt)
>  {
>  	if (ISSET(rt->rt_flags, RTF_GATEWAY)) {
> +		struct rtentry *gwroute;
> +
> +		gwroute = READ_ONCE(rt->rt_gwroute);
>  	 	/* We may return NULL here. */
> -		return (rt->rt_gwroute);
> +		return (gwroute);
>  	}
>  
>  	return (rt);
> Index: net/route.h
> ===================================================================
> RCS file: /data/mirror/openbsd/cvs/src/sys/net/route.h,v
> diff -u -p -r1.214 route.h
> --- net/route.h	6 Mar 2025 23:09:02 -0000	1.214
> +++ net/route.h	7 Mar 2025 12:44:42 -0000
> @@ -123,7 +123,7 @@ struct rtentry {
>  	caddr_t		 rt_llinfo;	/* [L] pointer to link level info or
>  					   an MPLS structure */
>  	union {
> -		struct rtentry	*_nh;	/* [X] rtentry for rt_gateway */
> +		struct rtentry	*_nh;	/* [r] rtentry for rt_gateway */
>  		unsigned int	 _ref;	/* [r] # gateway rtentry refs */
>  	} RT_gw;
>  #define rt_gwroute	 RT_gw._nh
> Index: net/rtsock.c
> ===================================================================
> RCS file: /data/mirror/openbsd/cvs/src/sys/net/rtsock.c,v
> diff -u -p -r1.381 rtsock.c
> --- net/rtsock.c	16 Feb 2025 11:39:28 -0000	1.381
> +++ net/rtsock.c	7 Mar 2025 12:44:34 -0000
> @@ -1343,9 +1343,11 @@ route_cleargateway(struct rtentry *rt, v
>  {
>  	struct rtentry *nhrt = arg;
>  
> +	mtx_enter(&rt->rt_mtx);
>  	if (ISSET(rt->rt_flags, RTF_GATEWAY) && rt->rt_gwroute == nhrt &&
>  	    !ISSET(rt->rt_locks, RTV_MTU))
>  		atomic_store_int(&rt->rt_mtu, 0);
> +	mtx_leave(&rt->rt_mtx);
>  
>  	return (0);
>  }
> 

-- 
:wq Claudio