From: David Gwynne Subject: Re: ipv4 icmp_reflect() source address selection optimisation To: Alexander Bluhm Cc: tech@openbsd.org Date: Sat, 7 Jun 2025 13:57:45 +1000 On Fri, Jun 06, 2025 at 11:02:19PM +0200, Alexander Bluhm wrote: > On Wed, Jun 04, 2025 at 09:09:01PM +0200, Claudio Jeker wrote: > > On Wed, Jun 04, 2025 at 04:24:22PM +1000, David Gwynne wrote: > > > On Wed, Aug 21, 2024 at 12:35:34PM +0200, Claudio Jeker wrote: > > > > On Wed, Aug 21, 2024 at 03:40:23PM +1000, David Gwynne wrote: > > > > > On Fri, Aug 09, 2024 at 11:47:14AM +0200, Claudio Jeker wrote: > > > > > > On Fri, Aug 09, 2024 at 11:15:34AM +0200, Alexander Bluhm wrote: > > > > > > > On Fri, Aug 09, 2024 at 10:40:11AM +1000, David Gwynne wrote: > > > > > > > > tl;dr: i believe this change would let us simplify pf_route (the > > > > > > > > loopback ip handling specifically), so it's worth it. > > > > > > > > > > > > > > In contrast I have worked on all the corner cases in 15 years to > > > > > > > make it work. Maybe not the best solution, but it works. I fear > > > > > > > that at our customers somethig will break if we change behavior > > > > > > > just because we think it looks better. > > > > > > > > > > all of them? that's a bold claim. > > > > > > > > > > add this config the PF host in your test setup and then run the regress: > > > > > > > > > > # cat /etc/hostname.vport0 > > > > > inet 169.254.0.1 255.255.255.0 > > > > > up > > > > > # cat /etc/hostname.pfsync0 > > > > > syncif vport0 > > > > > maxupd 128 > > > > > defer > > > > > up > > > > > > > > > > > > > > > im hitting edge cases with pf_route and pfsync, that's why im working > > > > > on this. this bit is a chunk out of a much larger set of changes to fix > > > > > it, which ive included below. i'm starting to worry that i'll have > > > > > to carry it as part of my menagerie of local diffs on the firewalls cos > > > > > i'm the only person in the world doing multipath and pfsync? > > > > > > > > > > my full diff for the pf_route and pfsync fixes is below. i'm not hugely > > > > > happy with chewing up more mbuf tags ids, but we haven't come up with a > > > > > less worse idea yet. > > > > > > > > > > > This is not only about looks. ICMP source selection is a massive pain > > > > > > point on DFZ routers and we need to fix this. Reducing the amount of > > > > > > corners to cover would be very benefitial. > > > > > > > > > > > > > In this particular case I think using the first addreess as source > > > > > > > address is wrong. IPv4 source address selection should be done > > > > > > > with a route lookup. Basically what in_pcbselsrc() does. And using > > > > > > > the route interface address seems reasonable to me. > > > > > > > > > > for the large part my last diff does follow what in_pcbselsrc does > > > > > more closely than the current code. it respects the "route sourceaddr" > > > > > config while the current code doesnt. > > > > > > > > > > however, the icmp reflect situation is a little different to what > > > > > in_pcbselsrc usually handles though. in_pcbselsrc is used to pick > > > > > an address for a locally terminated connection, while icmp_reflect > > > > > can be used to reply to a packet going through the box. let's not > > > > > pretend these are completely identical situations. > > > > > > > > Yes. This is also the big issue of icmp reflect since it needs to select a > > > > source IP for a situation where the system can not decide which IP is > > > > better. An Ip from the received interface, that of the route the > > > > destination points at, or an IP of the interface the ICMP will actually > > > > go out with. > > > > > > > > We have 3 options and all 3 have issues. It is a game we can not win in > > > > code. With respecting the "route sourceaddr" we have at least a way to > > > > decide this via configuration. > > > > > > > > > > To be honest I think it should not matter which IP is selected in the IPv4 > > > > > > case. All adresses on the interface can be used to send out an ICMP error. > > > > > > Now I do agree that using the same logic as in in_pcbselsrc() would benefit > > > > > > consitancy. > > > > > > > > > > i mostly agree, except selecting 127.0.0.1 to reply to a packet that > > > > > arrived off the wire seems bloody minded when there's (what i consider) > > > > > reasonable ways to avoid it. > > > > > > > > Very much agree with that. I was mostly thinking of selecting an IP of an > > > > external interface (like the one the packet was received on). > > > > > > while investigating a problem today i noticed an issue that this > > > diff fixes. > > > > > > we have ipsec/ikev2 vpn servers for a bunch of windows boxes, which > > > is a fairly vanilla ipsec setup on openbsd. the box has a single public > > > ip facing the world, and iked allocates the clients addresses out > > > of a 192.168 range. the rest of the world has a route for the 192.168 > > > address via the vpn servers public ip. > > > > > > to avoid traffic loops when sending traffic to disconnected peers, we > > > add a blackhole route on the vpn servers for the 192.168 range, which > > > looks like this: > > > > > > # netstat -nr -f inet > > > Destination Gateway Flags Refs Use Mtu Prio Iface > > > default 192.0.2.97 UGS 10 2059524192 - 8 vmx0 > > > 224/4 127.0.0.1 URS 0 10139877 32768 8 lo0 > > > 127/8 127.0.0.1 UGRS 0 0 32768 8 lo0 > > > 127.0.0.1 127.0.0.1 UHhl 2 8541 32768 1 lo0 > > > 192.0.2.96/29 192.0.2.101 UCn 2 200605 - 4 vmx0 > > > 192.0.2.97 00:00:5e:00:01:50 UHLch 1 42617 - 3 vmx0 > > > 192.0.2.99 bc:2c:55:d0:ea:d2 UHLc 0 168100 - 3 vmx0 > > > 192.0.2.101 00:50:56:a1:2a:d3 UHLl 0 950034255 - 1 vmx0 > > > 192.0.2.103 192.0.2.101 UHb 0 0 - 1 vmx0 > > > 192.168.76.128/25 127.0.0.1 UGSB 0 1839277300 32768 8 lo0 > > > > > > there is a problem if we want to talk to vpn clients from the vpn > > > server itself. the route lookup for clients in the 192.168 range > > > makes it look like the closest ip address is 127.0.0.1 cos the route > > > is on the loopback interface, but you shouldnt use 127/8 "on the > > > wire" for any packets leaving the box. we cope with this by using > > > route sourceaddr -ifp vmx0: > > > > > > $ route -n sourceaddr > > > Preferred source address set for rdomain 0 > > > IPv4: 192.0.2.101 > > > IPv6: default > > > > > > with this config we end up with the public ip as the if address that gets used as the source ip when connections are made to the vpn endpoints: > > > > > > $ route -n get 192.168.76.222 > > > route to: 192.168.76.222 > > > destination: 192.168.76.128 > > > mask: 255.255.255.128 > > > gateway: 127.0.0.1 > > > interface: lo0 > > > if address: 192.0.2.101 > > > priority: 8 (static) > > > flags: > > > use mtu expire > > > 1839323462 32768 0 > > > > > > if there's an active endpoint on that IP, then we can send packets > > > to it with our public IP as the source address, and ipsec will steal > > > it before it gets dumped by the loopback interface. > > > > > > the problem is if a vpn endpoint sends us something and we have to > > > generate an icmp packet and reflect it back. because icmp_reflect > > > wasnt updated like the source address selection on sockets was for > > > sourceaddr, we see stuff like this: > > > > > > 15:48:54.350914 (authentic,confidential): SPI 0x41f838b5: 192.168.76.164.55401 > 255.255.255.255.1947: udp 40 (encap) > > > 15:48:54.350940 (authentic,confidential): SPI 0xfc332686: 127.0.0.1 > 192.168.76.164: icmp: 255.255.255.255 udp port 1947 unreachable (encap) > > > 15:48:54.735090 (authentic,confidential): SPI 0x8b05c3b7: 192.168.76.150.17500 > 255.255.255.255.17500: udp 325 (encap) > > > 15:48:54.735115 (authentic,confidential): SPI 0x25f1f8ce: 127.0.0.1 > 192.168.76.150: icmp: 255.255.255.255 udp port 17500 unreachable (encap) > > > > > > icmp_reflect() is using the ip from the interface with the route, > > > which is 127.0.0.1 on lo0 which has the blackhole route on it. > > > > > > with this diff, the icmp replies come from the route sourceaddr, > > > consistent with the pcb source address selection: > > > > > > 16:15:26.459005 (authentic,confidential): SPI 0xe6b11fd2: 192.168.76.169.50377 > 255.255.255.255.1947: udp 40 (encap) > > > 16:15:26.459039 (authentic,confidential): SPI 0x37658bde: 192.0.2.101 > 192.168.76.169: icmp: 255.255.255.255 udp port 1947 unreachable (encap) > > > 16:15:46.200838 (authentic,confidential): SPI 0x8bfcd15f: 192.168.76.216.57621 > 255.255.255.255.8612: udp 16 (encap) > > > 16:15:46.200849 (authentic,confidential): SPI 0xb0047d40: 192.0.2.101 > 192.168.76.216: icmp: 255.255.255.255 udp port 8612 unreachable (encap) > > > > > > however, i still believe icmp_reflection should prefer the IP on > > > the interface a forwarded packet was received on over the local ip > > > of the destination network. again, this helps avoid picking 127.0.0.1 > > > for blackhole/reject routes. this is not a situation that pcb source > > > address selection has to deal with, so it's ok for the decisions > > > to be made a little differently. > > > > I thought that route sourceaddr would also influence icmp_reflect(). > > It seems this was not done but it should be done since on IXP routers you > > really don't want to reflect with the IXP lan IP. > > > > I think the route sourceaddr selection vs directly connected host is not > > done correctly. At least matching on just RTF_HOST will also include > > PMTUD cloned routes which is for sure wrong here. > > ISSET(rt->rt_flags, RTF_LLINFO|RTF_HOST)) { > > > > I wonder if this should instead use !ISSET(..., RTF_GATEWAY) since gateway > > routes are by definition not directly reachable. > > > > Anyway this affects also the in_pcb code and I think that should be fixed > > afterwards in tree. > > This test passed regress. Except ipsec where some regex on expected > icmp packets did not match. But these can be fixed. > > But I am still not convinced why we need a different algorithm for > ICMP packets. We generate ICMP packets and we use the regular > algorithm to do source selection. It is the address of the outgoing > interface. that's not quite true yet. the diff changes two things. the first problem is that icmp_reflect does not consider route sourceaddr like in_pcbselsrc does, which this diff fixes. the second is to use where the mbuf came from (m->m_pkthdr.ph_ifidx) to pick an ip. i would argue we're using the same algorithm for the common set of info this code and in_pcbselsrc can make decisions with. i could factor the code out for the address selection and use it for both in_pcbselsrc and icmp_reflect, but because in_pcbselsrc doesn't know if it's picking an address for a reply or not (ie, it doesn't have access to an m->m_pkthdr.ph_ifidx), it wont use that chunk of code. > All your trouble comes from the reject route to loopback. Then the > ICMP packet has 127.0.0.1. Unfortunately reject only works with > loopback. > > Why not implement the route reject feature in ipv4_input()? Then > you control the route that is used to generate the ICMP source > address. By giving the reject route a gateway, the source IP will > be in its network. What do you think about this idea? if the stack handles RTF_REJECT|RTF_BLACKHOLE, then i can do this in my vpn concentrator setup: $ route -qn add 192.168.76.128/25 192.0.2.101 -blackhole that would look like this: $ route -n get 192.168.76.128 route to: 192.168.76.128 destination: 192.168.76.128 mask: 255.255.255.128 gateway: 192.0.2.101 interface: vmx0 if address: 192.0.2.101 priority: 8 (static) flags: use mtu expire 0 0 0 is that what you're suggesting? i like that idea of handling RTF_REJECT|RTF_BLACKHOLE in the IP stack instead of lo(4), and proposed it around the same time i sent the original icmp_reflect diff out. i got the strong impression at the time that you thought it was too risky. maybe you were talking about the other changes i made to lo(4) at the time. i will revisit that one day, but i have enough balls in the air already right now. how about i make this a bit easier for you and split the changes in this diff up? this one has icmp_reflect respect the route sourceaddr like in_pcbselsrc, but doesn't use the rcvif to pick an address. Index: ip_icmp.c =================================================================== RCS file: /cvs/src/sys/netinet/ip_icmp.c,v diff -u -p -r1.199 ip_icmp.c --- ip_icmp.c 2 Mar 2025 21:28:32 -0000 1.199 +++ ip_icmp.c 7 Jun 2025 03:51:30 -0000 @@ -691,7 +691,8 @@ icmp_reflect(struct mbuf *m, struct mbuf struct ip *ip = mtod(m, struct ip *); struct mbuf *opts = NULL; struct sockaddr_in sin; - struct rtentry *rt = NULL; + struct rtentry *rt; + struct in_addr ip_src = { INADDR_ANY }; int optlen = (ip->ip_hl << 2) - sizeof(struct ip); u_int rtableid; u_int8_t pfflags; @@ -708,10 +709,6 @@ icmp_reflect(struct mbuf *m, struct mbuf return (ELOOP); } rtableid = m->m_pkthdr.ph_rtableid; - pfflags = m->m_pkthdr.pf.flags; - m_resethdr(m); - m->m_pkthdr.ph_rtableid = rtableid; - m->m_pkthdr.pf.flags = pfflags & PF_TAG_GENERATED; /* * If the incoming packet was addressed directly to us, @@ -725,19 +722,24 @@ icmp_reflect(struct mbuf *m, struct mbuf sin.sin_addr = ip->ip_dst; rt = rtalloc(sintosa(&sin), 0, rtableid); - if (rtisvalid(rt) && - ISSET(rt->rt_flags, RTF_LOCAL|RTF_BROADCAST)) - ia = ifatoia(rt->rt_ifa); - } + if (rtisvalid(rt)) { + if (ISSET(rt->rt_flags, RTF_LOCAL)) + ip_src = ip->ip_dst; + else if (ISSET(rt->rt_flags, RTF_BROADCAST)) { + ia = ifatoia(rt->rt_ifa); + ip_src = ia->ia_addr.sin_addr; + } + } + rtfree(rt); + } else + ip_src = ia->ia_addr.sin_addr; /* * The following happens if the packet was not addressed to us. - * Use the new source address and do a route lookup. If it fails - * drop the packet as there is no path to the host. + * If we're directly connected use the closest address, otherwise + * try to use the sourceaddr from the routing table. */ - if (ia == NULL) { - rtfree(rt); - + if (ip_src.s_addr == INADDR_ANY) { memset(&sin, 0, sizeof(sin)); sin.sin_len = sizeof(sin); sin.sin_family = AF_INET; @@ -745,21 +747,38 @@ icmp_reflect(struct mbuf *m, struct mbuf /* keep packet in the original virtual instance */ rt = rtalloc(sintosa(&sin), RT_RESOLVE, rtableid); - if (rt == NULL) { - ipstat_inc(ips_noroute); - m_freem(m); - return (EHOSTUNREACH); + if (rtisvalid(rt) && + ISSET(rt->rt_flags, RTF_LLINFO|RTF_HOST)) { + ia = ifatoia(rt->rt_ifa); + ip_src = ia->ia_addr.sin_addr; + } else { + struct sockaddr *sourceaddr; + struct ifaddr *ifa; + + sourceaddr = rtable_getsource(rtableid, AF_INET); + if (sourceaddr != NULL) { + ifa = ifa_ifwithaddr(sourceaddr, rtableid); + if (ifa != NULL && + ISSET(ifa->ifa_ifp->if_flags, IFF_UP)) + ip_src = satosin(sourceaddr)->sin_addr; + } } - - ia = ifatoia(rt->rt_ifa); + rtfree(rt); } + /* + * If the above didn't find an ip_src, ip_output() will try + * and fill it in for us. + */ + + pfflags = m->m_pkthdr.pf.flags; + + m_resethdr(m); + m->m_pkthdr.ph_rtableid = rtableid; + m->m_pkthdr.pf.flags = pfflags & PF_TAG_GENERATED; ip->ip_dst = ip->ip_src; + ip->ip_src = ip_src; ip->ip_ttl = MAXTTL; - - /* It is safe to dereference ``ia'' iff ``rt'' is valid. */ - ip->ip_src = ia->ia_addr.sin_addr; - rtfree(rt); if (optlen > 0) { u_char *cp;