From: Mark Kettenis <mark.kettenis@xs4all.nl>
Subject: Re: make route rt_locks atomic
To: Claudio Jeker <cjeker@diehard.n-r-g.com>
Cc: bluhm@openbsd.org, tech@openbsd.org
Date: Mon, 20 Jan 2025 17:54:43 +0100

> Date: Mon, 20 Jan 2025 17:03:29 +0100
> From: Claudio Jeker <cjeker@diehard.n-r-g.com>
> 
> On Mon, Jan 20, 2025 at 02:51:02PM +0100, Alexander Bluhm wrote:
> > On Mon, Jan 20, 2025 at 02:09:03PM +0100, Claudio Jeker wrote:
> > > On Tue, Jan 14, 2025 at 12:16:59AM +0100, Alexander Bluhm wrote:
> > > > Hi,
> > > > 
> > > > rt_locks is accessed lockless.  Setting the bits has to be done
> > > > atomically.  Loads should be marked as atomic to avoid wrong compiler
> > > > optimizations.
> > > > 
> > > > The correlation between rt_locks and rt_mtu is very weak.  When
> > > > RTV_MTU is set, stop modifying rt_mtu.  But it is not really necessary
> > > > to stop updating rt_mtu immediately when another thread modifies
> > > > rt_locks.  The events in the network stack changing those values
> > > > are not synchronized anyway.  So I came to the conclusion that
> > > > memory barriers are not necessary.
> > > > 
> > > > In route message output, I have left setting the flags in exclusive
> > > > net lock as they are modified together with other route attributes.
> > > > 
> > > > ok?
> > > 
> > > I'm probably late to the party, but again: how can rt_locks be atomic when
> > > the value rt_locks protects is not changed in an atomic transaction with
> > > the rt_locks value?
> > > This is not truly atomic, and while what you do is probably OK, it is not
> > > correct.
> > 
> > Changes of rt_locks are atomic with this diff.  Old code is lucky
> > without atomic_setbits_int() as RTV_MTU is the only flag used in
> > rt_locks.
> 
> I know. The current code is not correct. Our routing code was declared MP
> safe without ever looking at what happens on the nodes.
> Or actually parallel softnet was enabled without solving that piece of
> the puzzle.
> 
> > Another thing is how rt_locks and rt_mtu are related.  Do they need
> > a common mutex?  I say no.  Do they need memory barriers?  I also
> > say no, as rt_mtu is only valid if RTV_MTU is clear.  If one CPU
> > misses the RTV_MTU update, there is no rt_mtu update that must be
> > handled.  And if it misses the rt_mtu update, it does not matter
> > as RTV_MTU is set.
> 
> Why is rt_mtu only valid if RTV_MTU is unset? That is not true.
> You can lock the MTU of a route to some value from userland.
> So if one CPU sets RTV_MTU and another CPU updates rt_mtu at the same time
> you may end up with the wrong value locked.
> Also, rt_mtu is not fully ignored when RTV_MTU is set.  So I think we
> need to be careful here.
> 
> What about RTV_EXPIRE? This is not only about one value of rt_locks.
> Now RTV_EXPIRE is not really used (apart from route(8)). Still it is part
> of rt_locks.
> 
> I dislike that we mark a value as [a] when in reality it is more [a*]
> with a big * comment explaining why we consider it atomic.

Right, and I suspect that if you need more than a sentence to describe
why you consider an access atomic, you're probably better off using a
mutex.
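
For comparison, a rough sketch of what I mean (purely illustrative;
there is no such per-rtentry mutex in the tree), where one mutex keeps
rt_locks and rt_mtu consistent with each other:

    /* hypothetical rt_mtx protecting both rt_locks and rt_mtu */
    mtx_enter(&rt->rt_mtx);
    if (!ISSET(rt->rt_locks, RTV_MTU))
            rt->rt_mtu = mtu;
    mtx_leave(&rt->rt_mtx);

Then the lock comment next to the struct members documents a single
lock instead of an essay on why the lockless access happens to be safe.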

> > The only problematic path may be clearing RTV_MTU.  But I left this
> > code in exclusive net lock.
> > 
> > Note that the diff only makes things better.  If path MTU is really
> > not correct, we have to back out parallel softnet.  Or provide a
> > diff which fixes it.  I cannot see a bug.
> 
> Yes, we introduced this mess with parallel softnet.  I fear that with
> unlocking tcp the situation will get worse at first, since tcp is the
> number one place that fiddles with rt_mtu and RTV_MTU.
>  
> > The reason I want the volatile around rt_locks is to avoid compiler
> > misoptimisations.  Nothing can go wrong by CPU reordering in this
> > case.
> 
> This is fine. I think it would be good to have an atomic_checkbits_int()
> function since many of the checks of such atomic bitfields scare me as
> well.
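
For concreteness, such a helper (which does not exist; this is only
what I imagine it would look like, modelled on atomic_load_int()) would
presumably be little more than:

    static inline unsigned int
    atomic_checkbits_int(volatile unsigned int *uip, unsigned int v)
    {
            return (*uip & v);
    }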

We currently don't have consensus about how we handle these.  Can we
please try to reach such consensus before we make more of our codebase
unreadable?

We now have three different ways we deal with "simple atomic access"
(a small sketch contrasting them follows the list):

1. Declare a variable or struct member as volatile.  This is how we
   check the P_* flags in struct proc and struct process for example.

2. Use atomic_{load|store}_int().  This is what all the recent changes
   to make sysctl "mpsafe" do for example.

3. Use READ_ONCE()/WRITE_ONCE().  Most examples of this pattern are in
   the socket code it seems.

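Purely as illustration (made-up struct, flag, and function, not from
the tree), the same lockless flag check written in each of the three
styles; for styles 2 and 3 the member would be a plain u_int instead
of a volatile one:

    #define FOO_DONE        0x01

    struct foo {
            volatile u_int  f_flags;        /* style 1: volatile member */
    };

    int
    foo_done(struct foo *fp)
    {
            /* style 1: the volatile qualifier makes the plain read enough */
            return ISSET(fp->f_flags, FOO_DONE);

            /*
             * style 2: explicit helper on a plain (non-volatile) u_int:
             *      return ISSET(atomic_load_int(&fp->f_flags), FOO_DONE);
             * style 3: READ_ONCE(), as mostly seen in the socket code:
             *      return ISSET(READ_ONCE(fp->f_flags), FOO_DONE);
             */
    }
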
The problem with #2 and #3 is that they introduce a lot of
distracting noise in the code.  And if you're going to say that to be
atomic all access has to use one of these functions/macros, you might
as well pick #1.  IIRC the argument of the Linux kernel developers
against #1 is that it forces the compiler to always go to main memory
to access the data, which may be sub-optimal.

As to why we have both #2 and #3, I don't really know.  I suppose
atomic_{load|store}_int() goes together with the other atomic_*
functions.  But READ_ONCE() better expresses the case where we read a
value atomically into a local variable to guarantee we use the same
value in all the functions?  Also READ_ONCE() has an additional memory
barrier for alpha.  But atomic_load_int() probably needs that barrier
as well.
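
As a contrived example of that pattern (the function is invented; only
rt_mtu and if_mtu are real members): take one snapshot so the test and
the later use are guaranteed to see the same value:

    u_int
    example_pick_mtu(struct rtentry *rt, struct ifnet *ifp)
    {
            /* one atomic snapshot; every later use sees this same value */
            u_int mtu = READ_ONCE(rt->rt_mtu);

            if (mtu == 0 || mtu > ifp->if_mtu)
                    return (ifp->if_mtu);
            return (mtu);
    }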

As to why we need to do something, mvs@ has mentioned compiler
optimizations a few times.  To understand the issues

  https://lwn.net/Articles/793253/

is a good read.  I'm not sure load tearing and store tearing are real
issues for us.  Allegedly those have been seen in the wild, but the
examples I've read about are all kind of convoluted.  And unlike Linux
we're not trying to support a wide range of compilers.  But most of
the other issues mentioned mean we have to do something.
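
As a contrived example (invented struct member) of the kind of
transformation the article worries about: without any annotation the
compiler may hoist the load out of the loop, so the busy-wait never
sees the other CPU's update:

    /* plain read: a single load may be hoisted and the loop spins forever */
    while (p->busy)
            continue;

    /* READ_ONCE() (or a volatile member) forces a fresh load each time */
    while (READ_ONCE(p->busy))
            continue;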


> > > > Index: net/route.c
> > > > ===================================================================
> > > > RCS file: /data/mirror/openbsd/cvs/src/sys/net/route.c,v
> > > > diff -u -p -r1.438 route.c
> > > > --- net/route.c	3 Jan 2025 21:27:40 -0000	1.438
> > > > +++ net/route.c	13 Jan 2025 22:50:54 -0000
> > > > @@ -557,7 +557,7 @@ rt_setgwroute(struct rtentry *rt, const 
> > > >  	 * If the MTU of next hop is 0, this will reset the MTU of the
> > > >  	 * route to run PMTUD again from scratch.
> > > >  	 */
> > > > -	if (!ISSET(rt->rt_locks, RTV_MTU)) {
> > > > +	if (!ISSET(atomic_load_int(&rt->rt_locks), RTV_MTU)) {
> > > >  		u_int mtu, nhmtu;
> > > >  
> > > >  		mtu = atomic_load_int(&rt->rt_mtu);
> > > > Index: net/route.h
> > > > ===================================================================
> > > > RCS file: /data/mirror/openbsd/cvs/src/sys/net/route.h,v
> > > > diff -u -p -r1.211 route.h
> > > > --- net/route.h	3 Jan 2025 21:27:40 -0000	1.211
> > > > +++ net/route.h	13 Jan 2025 22:50:54 -0000
> > > > @@ -38,6 +38,7 @@
> > > >  /*
> > > >   * Locks used to protect struct members in this file:
> > > >   *	I	immutable after creation
> > > > + *	a	atomic operations
> > > >   *	N	net lock
> > > >   *	X	exclusive net lock, or shared net lock + kernel lock
> > > >   *	R	art (rtable) lock
> > > > @@ -60,7 +61,7 @@
> > > >  struct rt_kmetrics {
> > > >  	u_int64_t	rmx_pksent;	/* packets sent using this route */
> > > >  	int64_t		rmx_expire;	/* lifetime for route, e.g. redirect */
> > > > -	u_int		rmx_locks;	/* Kernel must leave these values */
> > > > +	u_int		rmx_locks;	/* [a] Kernel must leave these values */
> > > >  	u_int		rmx_mtu;	/* [a] MTU for this path */
> > > >  };
> > > >  #endif
> > > > Index: net/rtsock.c
> > > > ===================================================================
> > > > RCS file: /data/mirror/openbsd/cvs/src/sys/net/rtsock.c,v
> > > > diff -u -p -r1.377 rtsock.c
> > > > --- net/rtsock.c	9 Jan 2025 18:20:29 -0000	1.377
> > > > +++ net/rtsock.c	13 Jan 2025 22:50:54 -0000
> > > > @@ -1198,8 +1198,9 @@ change:
> > > >  		}
> > > >  		if_group_routechange(info->rti_info[RTAX_DST],
> > > >  		    info->rti_info[RTAX_NETMASK]);
> > > > -		rt->rt_locks &= ~(rtm->rtm_inits);
> > > > -		rt->rt_locks |= (rtm->rtm_inits & rtm->rtm_rmx.rmx_locks);
> > > > +		atomic_clearbits_int(&rt->rt_locks, rtm->rtm_inits);
> > > > +		atomic_setbits_int(&rt->rt_locks,
> > > > +		     rtm->rtm_inits & rtm->rtm_rmx.rmx_locks);
> > > >  		NET_UNLOCK();
> > > >  		break;
> > > >  	case RTM_GET:
> > > > @@ -1344,7 +1345,7 @@ route_cleargateway(struct rtentry *rt, v
> > > >  	struct rtentry *nhrt = arg;
> > > >  
> > > >  	if (ISSET(rt->rt_flags, RTF_GATEWAY) && rt->rt_gwroute == nhrt &&
> > > > -	    !ISSET(rt->rt_locks, RTV_MTU))
> > > > +	    !ISSET(atomic_load_int(&rt->rt_locks), RTV_MTU))
> > > >  		atomic_store_int(&rt->rt_mtu, 0);
> > > >  
> > > >  	return (0);
> > > > @@ -1420,7 +1421,7 @@ rtm_getmetrics(const struct rtentry *rt,
> > > >  	}
> > > >  
> > > >  	bzero(out, sizeof(*out));
> > > > -	out->rmx_locks = in->rmx_locks;
> > > > +	out->rmx_locks = atomic_load_int(&in->rmx_locks);
> > > >  	out->rmx_mtu = atomic_load_int(&in->rmx_mtu);
> > > >  	out->rmx_expire = expire;
> > > >  	out->rmx_pksent = in->rmx_pksent;
> > > > Index: netinet/ip_icmp.c
> > > > ===================================================================
> > > > RCS file: /data/mirror/openbsd/cvs/src/sys/netinet/ip_icmp.c,v
> > > > diff -u -p -r1.198 ip_icmp.c
> > > > --- netinet/ip_icmp.c	3 Jan 2025 21:27:40 -0000	1.198
> > > > +++ netinet/ip_icmp.c	13 Jan 2025 22:50:54 -0000
> > > > @@ -1054,9 +1054,9 @@ icmp_mtudisc(struct icmp *icp, u_int rta
> > > >  	 *	  on a route.  We should be using a separate flag
> > > >  	 *	  for the kernel to indicate this.
> > > >  	 */
> > > > -	if ((rt->rt_locks & RTV_MTU) == 0) {
> > > > +	if (!ISSET(atomic_load_int(&rt->rt_locks), RTV_MTU)) {
> > > >  		if (mtu < 296 || mtu > ifp->if_mtu)
> > > > -			rt->rt_locks |= RTV_MTU;
> > > > +			atomic_setbits_int(&rt->rt_locks, RTV_MTU);
> > > >  		else if (rtmtu > mtu || rtmtu == 0)
> > > >  			atomic_cas_uint(&rt->rt_mtu, rtmtu, mtu);
> > > >  	}
> > > > @@ -1090,7 +1090,7 @@ icmp_mtudisc_timeout(struct rtentry *rt,
> > > >  			(*ctlfunc)(PRC_MTUINC, sintosa(&sin),
> > > >  			    rtableid, NULL);
> > > >  	} else {
> > > > -		if ((rt->rt_locks & RTV_MTU) == 0)
> > > > +		if (!ISSET(atomic_load_int(&rt->rt_locks), RTV_MTU))
> > > >  			atomic_store_int(&rt->rt_mtu, 0);
> > > >  	}
> > > >  
> > > > Index: netinet/ip_output.c
> > > > ===================================================================
> > > > RCS file: /data/mirror/openbsd/cvs/src/sys/netinet/ip_output.c,v
> > > > diff -u -p -r1.402 ip_output.c
> > > > --- netinet/ip_output.c	3 Jan 2025 21:27:40 -0000	1.402
> > > > +++ netinet/ip_output.c	13 Jan 2025 22:50:54 -0000
> > > > @@ -384,7 +384,7 @@ sendit:
> > > >  	 * the route's MTU is locked.
> > > >  	 */
> > > >  	if ((flags & IP_MTUDISC) && ro && ro->ro_rt &&
> > > > -	    (ro->ro_rt->rt_locks & RTV_MTU) == 0)
> > > > +	    (!ISSET(atomic_load_int(&ro->ro_rt->rt_locks), RTV_MTU)))
> > > >  		ip->ip_off |= htons(IP_DF);
> > > >  
> > > >  #ifdef IPSEC
> > > > @@ -471,7 +471,7 @@ sendit:
> > > >  		 */
> > > >  		if (rtisvalid(ro->ro_rt) &&
> > > >  		    ISSET(ro->ro_rt->rt_flags, RTF_HOST) &&
> > > > -		    !(ro->ro_rt->rt_locks & RTV_MTU)) {
> > > > +		    !ISSET(atomic_load_int(&ro->ro_rt->rt_locks), RTV_MTU)) {
> > > >  			u_int rtmtu;
> > > >  
> > > >  			rtmtu = atomic_load_int(&ro->ro_rt->rt_mtu);
> > > > Index: netinet/tcp_subr.c
> > > > ===================================================================
> > > > RCS file: /data/mirror/openbsd/cvs/src/sys/netinet/tcp_subr.c,v
> > > > diff -u -p -r1.204 tcp_subr.c
> > > > --- netinet/tcp_subr.c	3 Jan 2025 17:23:51 -0000	1.204
> > > > +++ netinet/tcp_subr.c	13 Jan 2025 22:50:54 -0000
> > > > @@ -868,7 +868,8 @@ tcp_mtudisc(struct inpcb *inp, int errno
> > > >  
> > > >  	rt = in_pcbrtentry(inp);
> > > >  	if (rt != NULL) {
> > > > -		unsigned int orig_mtulock = (rt->rt_locks & RTV_MTU);
> > > > +		unsigned int orig_mtulock =
> > > > +		    (atomic_load_int(&rt->rt_locks) & RTV_MTU);
> > > >  
> > > >  		/*
> > > >  		 * If this was not a host route, remove and realloc.
> > > > @@ -878,7 +879,7 @@ tcp_mtudisc(struct inpcb *inp, int errno
> > > >  			if ((rt = in_pcbrtentry(inp)) == NULL)
> > > >  				return;
> > > >  		}
> > > > -		if (orig_mtulock < (rt->rt_locks & RTV_MTU))
> > > > +		if (orig_mtulock < (atomic_load_int(&rt->rt_locks) & RTV_MTU))
> > > >  			change = 1;
> > > >  	}
> > > >  	tcp_mss(tp, -1);
> > > > Index: netinet/tcp_timer.c
> > > > ===================================================================
> > > > RCS file: /data/mirror/openbsd/cvs/src/sys/netinet/tcp_timer.c,v
> > > > diff -u -p -r1.80 tcp_timer.c
> > > > --- netinet/tcp_timer.c	5 Jan 2025 12:18:48 -0000	1.80
> > > > +++ netinet/tcp_timer.c	13 Jan 2025 22:50:54 -0000
> > > > @@ -282,7 +282,7 @@ tcp_timer_rexmt(void *arg)
> > > >  		rt = in_pcbrtentry(inp);
> > > >  		/* Check if path MTU discovery is disabled already */
> > > >  		if (rt && (rt->rt_flags & RTF_HOST) &&
> > > > -		    (rt->rt_locks & RTV_MTU))
> > > > +		    ISSET(atomic_load_int(&rt->rt_locks), RTV_MTU))
> > > >  			goto leave;
> > > >  
> > > >  		rt = NULL;
> > > > @@ -303,8 +303,8 @@ tcp_timer_rexmt(void *arg)
> > > >  		}
> > > >  		if (rt != NULL) {
> > > >  			/* Disable path MTU discovery */
> > > > -			if ((rt->rt_locks & RTV_MTU) == 0) {
> > > > -				rt->rt_locks |= RTV_MTU;
> > > > +			if (!ISSET(atomic_load_int(&rt->rt_locks), RTV_MTU)) {
> > > > +				atomic_setbits_int(&rt->rt_locks, RTV_MTU);
> > > >  				in_rtchange(inp, 0);
> > > >  			}
> > > >  
> > > > Index: netinet6/icmp6.c
> > > > ===================================================================
> > > > RCS file: /data/mirror/openbsd/cvs/src/sys/netinet6/icmp6.c,v
> > > > diff -u -p -r1.256 icmp6.c
> > > > --- netinet6/icmp6.c	3 Jan 2025 21:27:40 -0000	1.256
> > > > +++ netinet6/icmp6.c	13 Jan 2025 22:50:54 -0000
> > > > @@ -1016,7 +1016,7 @@ icmp6_mtudisc_update(struct ip6ctlparam 
> > > >  	rt = icmp6_mtudisc_clone(&sin6, m->m_pkthdr.ph_rtableid, 0);
> > > >  
> > > >  	if (rt != NULL && ISSET(rt->rt_flags, RTF_HOST) &&
> > > > -	    !(rt->rt_locks & RTV_MTU)) {
> > > > +	    !ISSET(atomic_load_int(&rt->rt_locks), RTV_MTU)) {
> > > >  		u_int rtmtu;
> > > >  
> > > >  		rtmtu = atomic_load_int(&rt->rt_mtu);
> > > > @@ -1851,7 +1851,7 @@ icmp6_mtudisc_timeout(struct rtentry *rt
> > > >  	if ((rt->rt_flags & (RTF_DYNAMIC|RTF_HOST)) == (RTF_DYNAMIC|RTF_HOST)) {
> > > >  		rtdeletemsg(rt, ifp, rtableid);
> > > >  	} else {
> > > > -		if (!(rt->rt_locks & RTV_MTU))
> > > > +		if (!ISSET(atomic_load_int(&rt->rt_locks), RTV_MTU))
> > > >  			atomic_store_int(&rt->rt_mtu, 0);
> > > >  	}
> > > >  
> > > > Index: netinet6/ip6_output.c
> > > > ===================================================================
> > > > RCS file: /data/mirror/openbsd/cvs/src/sys/netinet6/ip6_output.c,v
> > > > diff -u -p -r1.294 ip6_output.c
> > > > --- netinet6/ip6_output.c	3 Jan 2025 21:27:40 -0000	1.294
> > > > +++ netinet6/ip6_output.c	13 Jan 2025 22:50:54 -0000
> > > > @@ -1047,7 +1047,7 @@ ip6_getpmtu(struct rtentry *rt, struct i
> > > >  			 * field isn't locked).
> > > >  			 */
> > > >  			mtu = ifp->if_mtu;
> > > > -			if (!(rt->rt_locks & RTV_MTU))
> > > > +			if (!ISSET(atomic_load_int(&rt->rt_locks), RTV_MTU))
> > > >  				atomic_cas_uint(&rt->rt_mtu, rtmtu, mtu);
> > > >  		}
> > > >  	} else {
> > > > 
> > > 
> > > -- 
> > > :wq Claudio
> > 
> 
> -- 
> :wq Claudio
> 
>