Index | Thread | Search

From:
Claudio Jeker <cjeker@diehard.n-r-g.com>
Subject:
Re: ipv4 icmp_reflect() source address selection optimisation
To:
Alexander Bluhm <bluhm@openbsd.org>
Cc:
David Gwynne <david@gwynne.id.au>, tech@openbsd.org
Date:
Sat, 7 Jun 2025 08:12:07 +0200

Download raw body.

Thread
  • Alexander Bluhm:

    ipv4 icmp_reflect() source address selection optimisation

  • On Fri, Jun 06, 2025 at 11:02:19PM +0200, Alexander Bluhm wrote:
    > On Wed, Jun 04, 2025 at 09:09:01PM +0200, Claudio Jeker wrote:
    > > On Wed, Jun 04, 2025 at 04:24:22PM +1000, David Gwynne wrote:
    > > > On Wed, Aug 21, 2024 at 12:35:34PM +0200, Claudio Jeker wrote:
    > > > > On Wed, Aug 21, 2024 at 03:40:23PM +1000, David Gwynne wrote:
    > > > > > On Fri, Aug 09, 2024 at 11:47:14AM +0200, Claudio Jeker wrote:
    > > > > > > On Fri, Aug 09, 2024 at 11:15:34AM +0200, Alexander Bluhm wrote:
    > > > > > > > On Fri, Aug 09, 2024 at 10:40:11AM +1000, David Gwynne wrote:
    > > > > > > > > tl;dr: i believe this change would let us simplify pf_route (the
    > > > > > > > > loopback ip handling specifically), so it's worth it.
    > > > > > > > 
    > > > > > > > In contrast I have worked on all the corner cases in 15 years to
    > > > > > > > make it work.  Maybe not the best solution, but it works.  I fear
    > > > > > > > that at our customers somethig will break if we change behavior
    > > > > > > > just because we think it looks better.
    > > > > > 
    > > > > > all of them? that's a bold claim.
    > > > > > 
    > > > > > add this config the PF host in your test setup and then run the regress:
    > > > > > 
    > > > > > # cat /etc/hostname.vport0
    > > > > > inet 169.254.0.1 255.255.255.0
    > > > > > up
    > > > > > # cat /etc/hostname.pfsync0
    > > > > > syncif vport0
    > > > > > maxupd 128
    > > > > > defer
    > > > > > up
    > > > > > 
    > > > > > 
    > > > > > im hitting edge cases with pf_route and pfsync, that's why im working
    > > > > > on this. this bit is a chunk out of a much larger set of changes to fix
    > > > > > it, which ive included below. i'm starting to worry that i'll have
    > > > > > to carry it as part of my menagerie of local diffs on the firewalls cos
    > > > > > i'm the only person in the world doing multipath and pfsync?
    > > > > > 
    > > > > > my full diff for the pf_route and pfsync fixes is below. i'm not hugely
    > > > > > happy with chewing up more mbuf tags ids, but we haven't come up with a
    > > > > > less worse idea yet.
    > > > > > 
    > > > > > > This is not only about looks. ICMP source selection is a massive pain
    > > > > > > point on DFZ routers and we need to fix this. Reducing the amount of
    > > > > > > corners to cover would be very benefitial.
    > > > > > > 
    > > > > > > > In this particular case I think using the first addreess as source
    > > > > > > > address is wrong.  IPv4 source address selection should be done
    > > > > > > > with a route lookup.  Basically what in_pcbselsrc() does.  And using
    > > > > > > > the route interface address seems reasonable to me.
    > > > > > 
    > > > > > for the large part my last diff does follow what in_pcbselsrc does
    > > > > > more closely than the current code. it respects the "route sourceaddr"
    > > > > > config while the current code doesnt.
    > > > > > 
    > > > > > however, the icmp reflect situation is a little different to what
    > > > > > in_pcbselsrc usually handles though. in_pcbselsrc is used to pick
    > > > > > an address for a locally terminated connection, while icmp_reflect
    > > > > > can be used to reply to a packet going through the box. let's not
    > > > > > pretend these are completely identical situations.
    > > > > 
    > > > > Yes. This is also the big issue of icmp reflect since it needs to select a
    > > > > source IP for a situation where the system can not decide which IP is
    > > > > better. An Ip from the received interface, that of the route the
    > > > > destination points at, or an IP of the interface the ICMP will actually
    > > > > go out with.
    > > > > 
    > > > > We have 3 options and all 3 have issues. It is a game we can not win in
    > > > > code. With respecting the "route sourceaddr" we have at least a way to
    > > > > decide this via configuration.
    > > > >  
    > > > > > > To be honest I think it should not matter which IP is selected in the IPv4
    > > > > > > case. All adresses on the interface can be used to send out an ICMP error.
    > > > > > > Now I do agree that using the same logic as in in_pcbselsrc() would benefit
    > > > > > > consitancy.
    > > > > > 
    > > > > > i mostly agree, except selecting 127.0.0.1 to reply to a packet that
    > > > > > arrived off the wire seems bloody minded when there's (what i consider)
    > > > > > reasonable ways to avoid it.
    > > > > 
    > > > > Very much agree with that. I was mostly thinking of selecting an IP of an
    > > > > external interface (like the one the packet was received on).
    > > > 
    > > > while investigating a problem today i noticed an issue that this
    > > > diff fixes.
    > > > 
    > > > we have ipsec/ikev2 vpn servers for a bunch of windows boxes, which
    > > > is a fairly vanilla ipsec setup on openbsd. the box has a single public
    > > > ip facing the world, and iked allocates the clients addresses out
    > > > of a 192.168 range. the rest of the world has a route for the 192.168
    > > > address via the vpn servers public ip.
    > > > 
    > > > to avoid traffic loops when sending traffic to disconnected peers, we
    > > > add a blackhole route on the vpn servers for the 192.168 range, which
    > > > looks like this:
    > > > 
    > > > # netstat -nr -f inet
    > > > Destination        Gateway            Flags   Refs      Use   Mtu  Prio Iface
    > > > default            192.0.2.97         UGS       10 2059524192     -     8 vmx0 
    > > > 224/4              127.0.0.1          URS        0 10139877 32768     8 lo0  
    > > > 127/8              127.0.0.1          UGRS       0        0 32768     8 lo0  
    > > > 127.0.0.1          127.0.0.1          UHhl       2     8541 32768     1 lo0  
    > > > 192.0.2.96/29      192.0.2.101        UCn        2   200605     -     4 vmx0 
    > > > 192.0.2.97         00:00:5e:00:01:50  UHLch      1    42617     -     3 vmx0 
    > > > 192.0.2.99         bc:2c:55:d0:ea:d2  UHLc       0   168100     -     3 vmx0 
    > > > 192.0.2.101        00:50:56:a1:2a:d3  UHLl       0 950034255     -     1 vmx0 
    > > > 192.0.2.103        192.0.2.101        UHb        0        0     -     1 vmx0 
    > > > 192.168.76.128/25  127.0.0.1          UGSB       0 1839277300 32768     8 lo0
    > > > 
    > > > there is a problem if we want to talk to vpn clients from the vpn
    > > > server itself. the route lookup for clients in the 192.168 range
    > > > makes it look like the closest ip address is 127.0.0.1 cos the route
    > > > is on the loopback interface, but you shouldnt use 127/8 "on the
    > > > wire" for any packets leaving the box. we cope with this by using
    > > > route sourceaddr -ifp vmx0:
    > > > 
    > > > $ route -n sourceaddr     
    > > > Preferred source address set for rdomain 0
    > > > IPv4: 192.0.2.101     
    > > > IPv6: default
    > > > 
    > > > with this config we end up with the public ip as the if address that gets used as the source ip when connections are made to the vpn endpoints:
    > > > 
    > > > $ route -n get 192.168.76.222   
    > > >    route to: 192.168.76.222
    > > > destination: 192.168.76.128
    > > >        mask: 255.255.255.128
    > > >     gateway: 127.0.0.1
    > > >   interface: lo0
    > > >  if address: 192.0.2.101
    > > >    priority: 8 (static)
    > > >       flags: <UP,GATEWAY,DONE,STATIC,BLACKHOLE>
    > > >      use       mtu    expire
    > > > 1839323462     32768         0 
    > > > 
    > > > if there's an active endpoint on that IP, then we can send packets
    > > > to it with our public IP as the source address, and ipsec will steal
    > > > it before it gets dumped by the loopback interface.
    > > > 
    > > > the problem is if a vpn endpoint sends us something and we have to
    > > > generate an icmp packet and reflect it back. because icmp_reflect
    > > > wasnt updated like the source address selection on sockets was for
    > > > sourceaddr, we see stuff like this:
    > > > 
    > > > 15:48:54.350914 (authentic,confidential): SPI 0x41f838b5: 192.168.76.164.55401 > 255.255.255.255.1947: udp 40 (encap)
    > > > 15:48:54.350940 (authentic,confidential): SPI 0xfc332686: 127.0.0.1 > 192.168.76.164: icmp: 255.255.255.255 udp port 1947 unreachable (encap)
    > > > 15:48:54.735090 (authentic,confidential): SPI 0x8b05c3b7: 192.168.76.150.17500 > 255.255.255.255.17500: udp 325 (encap)
    > > > 15:48:54.735115 (authentic,confidential): SPI 0x25f1f8ce: 127.0.0.1 > 192.168.76.150: icmp: 255.255.255.255 udp port 17500 unreachable (encap)
    > > > 
    > > > icmp_reflect() is using the ip from the interface with the route,
    > > > which is 127.0.0.1 on lo0 which has the blackhole route on it.
    > > > 
    > > > with this diff, the icmp replies come from the route sourceaddr,
    > > > consistent with the pcb source address selection:
    > > > 
    > > > 16:15:26.459005 (authentic,confidential): SPI 0xe6b11fd2: 192.168.76.169.50377 > 255.255.255.255.1947: udp 40 (encap)
    > > > 16:15:26.459039 (authentic,confidential): SPI 0x37658bde: 192.0.2.101 > 192.168.76.169: icmp: 255.255.255.255 udp port 1947 unreachable (encap)
    > > > 16:15:46.200838 (authentic,confidential): SPI 0x8bfcd15f: 192.168.76.216.57621 > 255.255.255.255.8612: udp 16 (encap)
    > > > 16:15:46.200849 (authentic,confidential): SPI 0xb0047d40: 192.0.2.101 > 192.168.76.216: icmp: 255.255.255.255 udp port 8612 unreachable (encap)
    > > > 
    > > > however, i still believe icmp_reflection should prefer the IP on
    > > > the interface a forwarded packet was received on over the local ip
    > > > of the destination network. again, this helps avoid picking 127.0.0.1
    > > > for blackhole/reject routes. this is not a situation that pcb source
    > > > address selection has to deal with, so it's ok for the decisions
    > > > to be made a little differently.
    > > 
    > > I thought that route sourceaddr would also influence icmp_reflect().
    > > It seems this was not done but it should be done since on IXP routers you
    > > really don't want to reflect with the IXP lan IP.
    > > 
    > > I think the route sourceaddr selection vs directly connected host is not
    > > done correctly. At least matching on just RTF_HOST will also include
    > > PMTUD cloned routes which is for sure wrong here.
    > >     ISSET(rt->rt_flags, RTF_LLINFO|RTF_HOST)) {
    > > 
    > > I wonder if this should instead use !ISSET(..., RTF_GATEWAY) since gateway
    > > routes are by definition not directly reachable.
    > > 
    > > Anyway this affects also the in_pcb code and I think that should be fixed
    > > afterwards in tree.
    > 
    > This test passed regress.  Except ipsec where some regex on expected
    > icmp packets did not match.  But these can be fixed.
    > 
    > But I am still not convinced why we need a different algorithm for
    > ICMP packets.  We generate ICMP packets and we use the regular
    > algorithm to do source selection.  It is the address of the outgoing
    > interface.
    > 
    > All your trouble comes from the reject route to loopback.  Then the
    > ICMP packet has 127.0.0.1.  Unfortunately reject only works with
    > loopback.
    > 
    > Why not implement the route reject feature in ipv4_input()?  Then
    > you control the route that is used to generate the ICMP source
    > address.  By giving the reject route a gateway, the source IP will
    > be in its network.  What do you think about this idea?
    
    We need both in the long term. But for routers it is curcial to have a way
    to control the source address of ICMP packets. In frequent cases the IP
    address used by the regular algorithm are unusable (e.g. because the
    routes point to a interface that uses an IP which is not in the global
    routing table).  I thought that route sourceaddr already did this behaviour
    change to ICMP.
    
    The problem with reject and blackhole routes is also annoying. Again
    we just keep some bad design because it was done so in 4.4BSD. The issue
    here is that injecting blackhole routes is none obvious and we uncover
    more and more side-effects of forcing routes via loopback for this.
    We hit another issue with the egress group the other day which resulted
    in pf to use 127.0.0.1 as outgoing IP address.
     
    > bluhm
    > 
    > > > Index: ip_icmp.c
    > > > ===================================================================
    > > > RCS file: /cvs/src/sys/netinet/ip_icmp.c,v
    > > > diff -u -p -r1.199 ip_icmp.c
    > > > --- ip_icmp.c	2 Mar 2025 21:28:32 -0000	1.199
    > > > +++ ip_icmp.c	4 Jun 2025 03:02:10 -0000
    > > > @@ -691,7 +691,8 @@ icmp_reflect(struct mbuf *m, struct mbuf
    > > >  	struct ip *ip = mtod(m, struct ip *);
    > > >  	struct mbuf *opts = NULL;
    > > >  	struct sockaddr_in sin;
    > > > -	struct rtentry *rt = NULL;
    > > > +	struct rtentry *rt;
    > > > +	struct in_addr ip_src = { INADDR_ANY };
    > > >  	int optlen = (ip->ip_hl << 2) - sizeof(struct ip);
    > > >  	u_int rtableid;
    > > >  	u_int8_t pfflags;
    > > > @@ -708,10 +709,6 @@ icmp_reflect(struct mbuf *m, struct mbuf
    > > >  		return (ELOOP);
    > > >  	}
    > > >  	rtableid = m->m_pkthdr.ph_rtableid;
    > > > -	pfflags = m->m_pkthdr.pf.flags;
    > > > -	m_resethdr(m);
    > > > -	m->m_pkthdr.ph_rtableid = rtableid;
    > > > -	m->m_pkthdr.pf.flags = pfflags & PF_TAG_GENERATED;
    > > >  
    > > >  	/*
    > > >  	 * If the incoming packet was addressed directly to us,
    > > > @@ -725,19 +722,24 @@ icmp_reflect(struct mbuf *m, struct mbuf
    > > >  		sin.sin_addr = ip->ip_dst;
    > > >  
    > > >  		rt = rtalloc(sintosa(&sin), 0, rtableid);
    > > > -		if (rtisvalid(rt) &&
    > > > -		    ISSET(rt->rt_flags, RTF_LOCAL|RTF_BROADCAST))
    > > > -			ia = ifatoia(rt->rt_ifa);
    > > > -	}
    > > > +		if (rtisvalid(rt)) {
    > > > +			if (ISSET(rt->rt_flags, RTF_LOCAL))
    > > > +				ip_src = ip->ip_dst;
    > > > +			else if (ISSET(rt->rt_flags, RTF_BROADCAST)) {
    > > > +				ia = ifatoia(rt->rt_ifa);
    > > > +				ip_src = ia->ia_addr.sin_addr;
    > > > +			}
    > > > +		}
    > > > +		rtfree(rt);
    > > > +	} else
    > > > +		ip_src = ia->ia_addr.sin_addr;
    > > >  
    > > >  	/*
    > > >  	 * The following happens if the packet was not addressed to us.
    > > > -	 * Use the new source address and do a route lookup. If it fails
    > > > -	 * drop the packet as there is no path to the host.
    > > > +	 * If we're directly connected use the closest address, otherwise
    > > > +	 * try to use the sourceaddr from the routing table.
    > > >  	 */
    > > > -	if (ia == NULL) {
    > > > -		rtfree(rt);
    > > > -
    > > > +	if (ip_src.s_addr == INADDR_ANY) {
    > > >  		memset(&sin, 0, sizeof(sin));
    > > >  		sin.sin_len = sizeof(sin);
    > > >  		sin.sin_family = AF_INET;
    > > > @@ -745,21 +747,56 @@ icmp_reflect(struct mbuf *m, struct mbuf
    > > >  
    > > >  		/* keep packet in the original virtual instance */
    > > >  		rt = rtalloc(sintosa(&sin), RT_RESOLVE, rtableid);
    > > > -		if (rt == NULL) {
    > > > -			ipstat_inc(ips_noroute);
    > > > -			m_freem(m);
    > > > -			return (EHOSTUNREACH);
    > > > +		if (rtisvalid(rt) &&
    > > > +		    ISSET(rt->rt_flags, RTF_LLINFO|RTF_HOST)) {
    > > > +			ia = ifatoia(rt->rt_ifa);
    > > > +			ip_src = ia->ia_addr.sin_addr;
    > > > +		} else {
    > > > +			struct sockaddr *sourceaddr;
    > > > +			struct ifaddr *ifa;
    > > > +
    > > > +			sourceaddr = rtable_getsource(rtableid, AF_INET);
    > > > +			if (sourceaddr != NULL) {
    > > > +				ifa = ifa_ifwithaddr(sourceaddr, rtableid);
    > > > +				if (ifa != NULL &&
    > > > +				    ISSET(ifa->ifa_ifp->if_flags, IFF_UP))
    > > > +					ip_src = satosin(sourceaddr)->sin_addr;
    > > > +			}
    > > >  		}
    > > > +		rtfree(rt);
    > > > +	}
    > > >  
    > > > -		ia = ifatoia(rt->rt_ifa);
    > > > +	/*
    > > > +	 * If the above didn't find an ip_src, get the IP of the
    > > > +	 * interface the original packet was received on. If all this
    > > > +	 * comes up with nothing, ip_output() will try and fill it
    > > > +	 * in for us.
    > > > +	 */
    > > > +	if (ip_src.s_addr == INADDR_ANY) {
    > > > +		struct ifnet *ifp;
    > > > +		struct ifaddr *ifa;
    > > > +
    > > > +		ifp = if_get(m->m_pkthdr.ph_ifidx);
    > > > +		if (ifp != NULL) {
    > > > +			TAILQ_FOREACH(ifa, &ifp->if_addrlist, ifa_list) {
    > > > +				if (ifa->ifa_addr->sa_family != AF_INET)
    > > > +					continue;
    > > > +
    > > > +				ip_src = satosin(ifa->ifa_addr)->sin_addr;
    > > > +				break;
    > > > +			}
    > > > +		}
    > > > +		if_put(ifp);
    > > >  	}
    > > >  
    > > > +	pfflags = m->m_pkthdr.pf.flags;
    > > > +
    > > > +	m_resethdr(m);
    > > > +	m->m_pkthdr.ph_rtableid = rtableid;
    > > > +	m->m_pkthdr.pf.flags = pfflags & PF_TAG_GENERATED;
    > > >  	ip->ip_dst = ip->ip_src;
    > > > +	ip->ip_src = ip_src;
    > > >  	ip->ip_ttl = MAXTTL;
    > > > -
    > > > -	/* It is safe to dereference ``ia'' iff ``rt'' is valid. */
    > > > -	ip->ip_src = ia->ia_addr.sin_addr;
    > > > -	rtfree(rt);
    > > >  
    > > >  	if (optlen > 0) {
    > > >  		u_char *cp;
    > > > 
    > > 
    > > -- 
    > > :wq Claudio
    > 
    
    -- 
    :wq Claudio
    
    
  • Alexander Bluhm:

    ipv4 icmp_reflect() source address selection optimisation