Index | Thread | Search

From:
David Gwynne <david@gwynne.id.au>
Subject:
Re: ipv4 icmp_reflect() source address selection optimisation
To:
Alexander Bluhm <alexander.bluhm@gmx.net>
Cc:
tech@openbsd.org
Date:
Sat, 7 Jun 2025 13:57:45 +1000

Download raw body.

Thread
On Fri, Jun 06, 2025 at 11:02:19PM +0200, Alexander Bluhm wrote:
> On Wed, Jun 04, 2025 at 09:09:01PM +0200, Claudio Jeker wrote:
> > On Wed, Jun 04, 2025 at 04:24:22PM +1000, David Gwynne wrote:
> > > On Wed, Aug 21, 2024 at 12:35:34PM +0200, Claudio Jeker wrote:
> > > > On Wed, Aug 21, 2024 at 03:40:23PM +1000, David Gwynne wrote:
> > > > > On Fri, Aug 09, 2024 at 11:47:14AM +0200, Claudio Jeker wrote:
> > > > > > On Fri, Aug 09, 2024 at 11:15:34AM +0200, Alexander Bluhm wrote:
> > > > > > > On Fri, Aug 09, 2024 at 10:40:11AM +1000, David Gwynne wrote:
> > > > > > > > tl;dr: i believe this change would let us simplify pf_route (the
> > > > > > > > loopback ip handling specifically), so it's worth it.
> > > > > > > 
> > > > > > > In contrast I have worked on all the corner cases in 15 years to
> > > > > > > make it work.  Maybe not the best solution, but it works.  I fear
> > > > > > > that at our customers somethig will break if we change behavior
> > > > > > > just because we think it looks better.
> > > > > 
> > > > > all of them? that's a bold claim.
> > > > > 
> > > > > add this config the PF host in your test setup and then run the regress:
> > > > > 
> > > > > # cat /etc/hostname.vport0
> > > > > inet 169.254.0.1 255.255.255.0
> > > > > up
> > > > > # cat /etc/hostname.pfsync0
> > > > > syncif vport0
> > > > > maxupd 128
> > > > > defer
> > > > > up
> > > > > 
> > > > > 
> > > > > im hitting edge cases with pf_route and pfsync, that's why im working
> > > > > on this. this bit is a chunk out of a much larger set of changes to fix
> > > > > it, which ive included below. i'm starting to worry that i'll have
> > > > > to carry it as part of my menagerie of local diffs on the firewalls cos
> > > > > i'm the only person in the world doing multipath and pfsync?
> > > > > 
> > > > > my full diff for the pf_route and pfsync fixes is below. i'm not hugely
> > > > > happy with chewing up more mbuf tags ids, but we haven't come up with a
> > > > > less worse idea yet.
> > > > > 
> > > > > > This is not only about looks. ICMP source selection is a massive pain
> > > > > > point on DFZ routers and we need to fix this. Reducing the amount of
> > > > > > corners to cover would be very benefitial.
> > > > > > 
> > > > > > > In this particular case I think using the first addreess as source
> > > > > > > address is wrong.  IPv4 source address selection should be done
> > > > > > > with a route lookup.  Basically what in_pcbselsrc() does.  And using
> > > > > > > the route interface address seems reasonable to me.
> > > > > 
> > > > > for the large part my last diff does follow what in_pcbselsrc does
> > > > > more closely than the current code. it respects the "route sourceaddr"
> > > > > config while the current code doesnt.
> > > > > 
> > > > > however, the icmp reflect situation is a little different to what
> > > > > in_pcbselsrc usually handles though. in_pcbselsrc is used to pick
> > > > > an address for a locally terminated connection, while icmp_reflect
> > > > > can be used to reply to a packet going through the box. let's not
> > > > > pretend these are completely identical situations.
> > > > 
> > > > Yes. This is also the big issue of icmp reflect since it needs to select a
> > > > source IP for a situation where the system can not decide which IP is
> > > > better. An Ip from the received interface, that of the route the
> > > > destination points at, or an IP of the interface the ICMP will actually
> > > > go out with.
> > > > 
> > > > We have 3 options and all 3 have issues. It is a game we can not win in
> > > > code. With respecting the "route sourceaddr" we have at least a way to
> > > > decide this via configuration.
> > > >  
> > > > > > To be honest I think it should not matter which IP is selected in the IPv4
> > > > > > case. All adresses on the interface can be used to send out an ICMP error.
> > > > > > Now I do agree that using the same logic as in in_pcbselsrc() would benefit
> > > > > > consitancy.
> > > > > 
> > > > > i mostly agree, except selecting 127.0.0.1 to reply to a packet that
> > > > > arrived off the wire seems bloody minded when there's (what i consider)
> > > > > reasonable ways to avoid it.
> > > > 
> > > > Very much agree with that. I was mostly thinking of selecting an IP of an
> > > > external interface (like the one the packet was received on).
> > > 
> > > while investigating a problem today i noticed an issue that this
> > > diff fixes.
> > > 
> > > we have ipsec/ikev2 vpn servers for a bunch of windows boxes, which
> > > is a fairly vanilla ipsec setup on openbsd. the box has a single public
> > > ip facing the world, and iked allocates the clients addresses out
> > > of a 192.168 range. the rest of the world has a route for the 192.168
> > > address via the vpn servers public ip.
> > > 
> > > to avoid traffic loops when sending traffic to disconnected peers, we
> > > add a blackhole route on the vpn servers for the 192.168 range, which
> > > looks like this:
> > > 
> > > # netstat -nr -f inet
> > > Destination        Gateway            Flags   Refs      Use   Mtu  Prio Iface
> > > default            192.0.2.97         UGS       10 2059524192     -     8 vmx0 
> > > 224/4              127.0.0.1          URS        0 10139877 32768     8 lo0  
> > > 127/8              127.0.0.1          UGRS       0        0 32768     8 lo0  
> > > 127.0.0.1          127.0.0.1          UHhl       2     8541 32768     1 lo0  
> > > 192.0.2.96/29      192.0.2.101        UCn        2   200605     -     4 vmx0 
> > > 192.0.2.97         00:00:5e:00:01:50  UHLch      1    42617     -     3 vmx0 
> > > 192.0.2.99         bc:2c:55:d0:ea:d2  UHLc       0   168100     -     3 vmx0 
> > > 192.0.2.101        00:50:56:a1:2a:d3  UHLl       0 950034255     -     1 vmx0 
> > > 192.0.2.103        192.0.2.101        UHb        0        0     -     1 vmx0 
> > > 192.168.76.128/25  127.0.0.1          UGSB       0 1839277300 32768     8 lo0
> > > 
> > > there is a problem if we want to talk to vpn clients from the vpn
> > > server itself. the route lookup for clients in the 192.168 range
> > > makes it look like the closest ip address is 127.0.0.1 cos the route
> > > is on the loopback interface, but you shouldnt use 127/8 "on the
> > > wire" for any packets leaving the box. we cope with this by using
> > > route sourceaddr -ifp vmx0:
> > > 
> > > $ route -n sourceaddr     
> > > Preferred source address set for rdomain 0
> > > IPv4: 192.0.2.101     
> > > IPv6: default
> > > 
> > > with this config we end up with the public ip as the if address that gets used as the source ip when connections are made to the vpn endpoints:
> > > 
> > > $ route -n get 192.168.76.222   
> > >    route to: 192.168.76.222
> > > destination: 192.168.76.128
> > >        mask: 255.255.255.128
> > >     gateway: 127.0.0.1
> > >   interface: lo0
> > >  if address: 192.0.2.101
> > >    priority: 8 (static)
> > >       flags: <UP,GATEWAY,DONE,STATIC,BLACKHOLE>
> > >      use       mtu    expire
> > > 1839323462     32768         0 
> > > 
> > > if there's an active endpoint on that IP, then we can send packets
> > > to it with our public IP as the source address, and ipsec will steal
> > > it before it gets dumped by the loopback interface.
> > > 
> > > the problem is if a vpn endpoint sends us something and we have to
> > > generate an icmp packet and reflect it back. because icmp_reflect
> > > wasnt updated like the source address selection on sockets was for
> > > sourceaddr, we see stuff like this:
> > > 
> > > 15:48:54.350914 (authentic,confidential): SPI 0x41f838b5: 192.168.76.164.55401 > 255.255.255.255.1947: udp 40 (encap)
> > > 15:48:54.350940 (authentic,confidential): SPI 0xfc332686: 127.0.0.1 > 192.168.76.164: icmp: 255.255.255.255 udp port 1947 unreachable (encap)
> > > 15:48:54.735090 (authentic,confidential): SPI 0x8b05c3b7: 192.168.76.150.17500 > 255.255.255.255.17500: udp 325 (encap)
> > > 15:48:54.735115 (authentic,confidential): SPI 0x25f1f8ce: 127.0.0.1 > 192.168.76.150: icmp: 255.255.255.255 udp port 17500 unreachable (encap)
> > > 
> > > icmp_reflect() is using the ip from the interface with the route,
> > > which is 127.0.0.1 on lo0 which has the blackhole route on it.
> > > 
> > > with this diff, the icmp replies come from the route sourceaddr,
> > > consistent with the pcb source address selection:
> > > 
> > > 16:15:26.459005 (authentic,confidential): SPI 0xe6b11fd2: 192.168.76.169.50377 > 255.255.255.255.1947: udp 40 (encap)
> > > 16:15:26.459039 (authentic,confidential): SPI 0x37658bde: 192.0.2.101 > 192.168.76.169: icmp: 255.255.255.255 udp port 1947 unreachable (encap)
> > > 16:15:46.200838 (authentic,confidential): SPI 0x8bfcd15f: 192.168.76.216.57621 > 255.255.255.255.8612: udp 16 (encap)
> > > 16:15:46.200849 (authentic,confidential): SPI 0xb0047d40: 192.0.2.101 > 192.168.76.216: icmp: 255.255.255.255 udp port 8612 unreachable (encap)
> > > 
> > > however, i still believe icmp_reflection should prefer the IP on
> > > the interface a forwarded packet was received on over the local ip
> > > of the destination network. again, this helps avoid picking 127.0.0.1
> > > for blackhole/reject routes. this is not a situation that pcb source
> > > address selection has to deal with, so it's ok for the decisions
> > > to be made a little differently.
> > 
> > I thought that route sourceaddr would also influence icmp_reflect().
> > It seems this was not done but it should be done since on IXP routers you
> > really don't want to reflect with the IXP lan IP.
> > 
> > I think the route sourceaddr selection vs directly connected host is not
> > done correctly. At least matching on just RTF_HOST will also include
> > PMTUD cloned routes which is for sure wrong here.
> >     ISSET(rt->rt_flags, RTF_LLINFO|RTF_HOST)) {
> > 
> > I wonder if this should instead use !ISSET(..., RTF_GATEWAY) since gateway
> > routes are by definition not directly reachable.
> > 
> > Anyway this affects also the in_pcb code and I think that should be fixed
> > afterwards in tree.
> 
> This test passed regress.  Except ipsec where some regex on expected
> icmp packets did not match.  But these can be fixed.
> 
> But I am still not convinced why we need a different algorithm for
> ICMP packets.  We generate ICMP packets and we use the regular
> algorithm to do source selection.  It is the address of the outgoing
> interface.

that's not quite true yet. the diff changes two things. the first
problem is that icmp_reflect does not consider route sourceaddr
like in_pcbselsrc does, which this diff fixes. the second is to use
where the mbuf came from (m->m_pkthdr.ph_ifidx) to pick an ip.

i would argue we're using the same algorithm for the common set of
info this code and in_pcbselsrc can make decisions with.

i could factor the code out for the address selection and use it for
both in_pcbselsrc and icmp_reflect, but because in_pcbselsrc doesn't
know if it's picking an address for a reply or not (ie, it doesn't have
access to an m->m_pkthdr.ph_ifidx), it wont use that chunk of code.

> All your trouble comes from the reject route to loopback.  Then the
> ICMP packet has 127.0.0.1.  Unfortunately reject only works with
> loopback.
> 
> Why not implement the route reject feature in ipv4_input()?  Then
> you control the route that is used to generate the ICMP source
> address.  By giving the reject route a gateway, the source IP will
> be in its network.  What do you think about this idea?

if the stack handles RTF_REJECT|RTF_BLACKHOLE, then i can do this in my
vpn concentrator setup:

$ route -qn add 192.168.76.128/25 192.0.2.101 -blackhole

that would look like this:

$ route -n get 192.168.76.128
   route to: 192.168.76.128
destination: 192.168.76.128
       mask: 255.255.255.128
    gateway: 192.0.2.101
  interface: vmx0
 if address: 192.0.2.101
   priority: 8 (static)
      flags: <UP,GATEWAY,DONE,STATIC,BLACKHOLE>
     use       mtu    expire
       0         0         0 

is that what you're suggesting?

i like that idea of handling RTF_REJECT|RTF_BLACKHOLE in the IP
stack instead of lo(4), and proposed it around the same time i sent
the original icmp_reflect diff out. i got the strong impression at
the time that you thought it was too risky. maybe you were talking
about the other changes i made to lo(4) at the time. i will revisit
that one day, but i have enough balls in the air already right now.

how about i make this a bit easier for you and split the changes
in this diff up?

this one has icmp_reflect respect the route sourceaddr like
in_pcbselsrc, but doesn't use the rcvif to pick an address.

Index: ip_icmp.c
===================================================================
RCS file: /cvs/src/sys/netinet/ip_icmp.c,v
diff -u -p -r1.199 ip_icmp.c
--- ip_icmp.c	2 Mar 2025 21:28:32 -0000	1.199
+++ ip_icmp.c	7 Jun 2025 03:51:30 -0000
@@ -691,7 +691,8 @@ icmp_reflect(struct mbuf *m, struct mbuf
 	struct ip *ip = mtod(m, struct ip *);
 	struct mbuf *opts = NULL;
 	struct sockaddr_in sin;
-	struct rtentry *rt = NULL;
+	struct rtentry *rt;
+	struct in_addr ip_src = { INADDR_ANY };
 	int optlen = (ip->ip_hl << 2) - sizeof(struct ip);
 	u_int rtableid;
 	u_int8_t pfflags;
@@ -708,10 +709,6 @@ icmp_reflect(struct mbuf *m, struct mbuf
 		return (ELOOP);
 	}
 	rtableid = m->m_pkthdr.ph_rtableid;
-	pfflags = m->m_pkthdr.pf.flags;
-	m_resethdr(m);
-	m->m_pkthdr.ph_rtableid = rtableid;
-	m->m_pkthdr.pf.flags = pfflags & PF_TAG_GENERATED;
 
 	/*
 	 * If the incoming packet was addressed directly to us,
@@ -725,19 +722,24 @@ icmp_reflect(struct mbuf *m, struct mbuf
 		sin.sin_addr = ip->ip_dst;
 
 		rt = rtalloc(sintosa(&sin), 0, rtableid);
-		if (rtisvalid(rt) &&
-		    ISSET(rt->rt_flags, RTF_LOCAL|RTF_BROADCAST))
-			ia = ifatoia(rt->rt_ifa);
-	}
+		if (rtisvalid(rt)) {
+			if (ISSET(rt->rt_flags, RTF_LOCAL))
+				ip_src = ip->ip_dst;
+			else if (ISSET(rt->rt_flags, RTF_BROADCAST)) {
+				ia = ifatoia(rt->rt_ifa);
+				ip_src = ia->ia_addr.sin_addr;
+			}
+		}
+		rtfree(rt);
+	} else
+		ip_src = ia->ia_addr.sin_addr;
 
 	/*
 	 * The following happens if the packet was not addressed to us.
-	 * Use the new source address and do a route lookup. If it fails
-	 * drop the packet as there is no path to the host.
+	 * If we're directly connected use the closest address, otherwise
+	 * try to use the sourceaddr from the routing table.
 	 */
-	if (ia == NULL) {
-		rtfree(rt);
-
+	if (ip_src.s_addr == INADDR_ANY) {
 		memset(&sin, 0, sizeof(sin));
 		sin.sin_len = sizeof(sin);
 		sin.sin_family = AF_INET;
@@ -745,21 +747,38 @@ icmp_reflect(struct mbuf *m, struct mbuf
 
 		/* keep packet in the original virtual instance */
 		rt = rtalloc(sintosa(&sin), RT_RESOLVE, rtableid);
-		if (rt == NULL) {
-			ipstat_inc(ips_noroute);
-			m_freem(m);
-			return (EHOSTUNREACH);
+		if (rtisvalid(rt) &&
+		    ISSET(rt->rt_flags, RTF_LLINFO|RTF_HOST)) {
+			ia = ifatoia(rt->rt_ifa);
+			ip_src = ia->ia_addr.sin_addr;
+		} else {
+			struct sockaddr *sourceaddr;
+			struct ifaddr *ifa;
+
+			sourceaddr = rtable_getsource(rtableid, AF_INET);
+			if (sourceaddr != NULL) {
+				ifa = ifa_ifwithaddr(sourceaddr, rtableid);
+				if (ifa != NULL &&
+				    ISSET(ifa->ifa_ifp->if_flags, IFF_UP))
+					ip_src = satosin(sourceaddr)->sin_addr;
+			}
 		}
-
-		ia = ifatoia(rt->rt_ifa);
+		rtfree(rt);
 	}
 
+	/*
+	 * If the above didn't find an ip_src, ip_output() will try
+	 * and fill it in for us.
+	 */
+
+	pfflags = m->m_pkthdr.pf.flags;
+
+	m_resethdr(m);
+	m->m_pkthdr.ph_rtableid = rtableid;
+	m->m_pkthdr.pf.flags = pfflags & PF_TAG_GENERATED;
 	ip->ip_dst = ip->ip_src;
+	ip->ip_src = ip_src;
 	ip->ip_ttl = MAXTTL;
-
-	/* It is safe to dereference ``ia'' iff ``rt'' is valid. */
-	ip->ip_src = ia->ia_addr.sin_addr;
-	rtfree(rt);
 
 	if (optlen > 0) {
 		u_char *cp;