From: David Gwynne Subject: Re: ipv4 icmp_reflect() source address selection optimisation To: tech@openbsd.org Date: Wed, 4 Jun 2025 16:24:22 +1000 On Wed, Aug 21, 2024 at 12:35:34PM +0200, Claudio Jeker wrote: > On Wed, Aug 21, 2024 at 03:40:23PM +1000, David Gwynne wrote: > > On Fri, Aug 09, 2024 at 11:47:14AM +0200, Claudio Jeker wrote: > > > On Fri, Aug 09, 2024 at 11:15:34AM +0200, Alexander Bluhm wrote: > > > > On Fri, Aug 09, 2024 at 10:40:11AM +1000, David Gwynne wrote: > > > > > tl;dr: i believe this change would let us simplify pf_route (the > > > > > loopback ip handling specifically), so it's worth it. > > > > > > > > In contrast I have worked on all the corner cases in 15 years to > > > > make it work. Maybe not the best solution, but it works. I fear > > > > that at our customers somethig will break if we change behavior > > > > just because we think it looks better. > > > > all of them? that's a bold claim. > > > > add this config the PF host in your test setup and then run the regress: > > > > # cat /etc/hostname.vport0 > > inet 169.254.0.1 255.255.255.0 > > up > > # cat /etc/hostname.pfsync0 > > syncif vport0 > > maxupd 128 > > defer > > up > > > > > > im hitting edge cases with pf_route and pfsync, that's why im working > > on this. this bit is a chunk out of a much larger set of changes to fix > > it, which ive included below. i'm starting to worry that i'll have > > to carry it as part of my menagerie of local diffs on the firewalls cos > > i'm the only person in the world doing multipath and pfsync? > > > > my full diff for the pf_route and pfsync fixes is below. i'm not hugely > > happy with chewing up more mbuf tags ids, but we haven't come up with a > > less worse idea yet. > > > > > This is not only about looks. ICMP source selection is a massive pain > > > point on DFZ routers and we need to fix this. Reducing the amount of > > > corners to cover would be very benefitial. > > > > > > > In this particular case I think using the first addreess as source > > > > address is wrong. IPv4 source address selection should be done > > > > with a route lookup. Basically what in_pcbselsrc() does. And using > > > > the route interface address seems reasonable to me. > > > > for the large part my last diff does follow what in_pcbselsrc does > > more closely than the current code. it respects the "route sourceaddr" > > config while the current code doesnt. > > > > however, the icmp reflect situation is a little different to what > > in_pcbselsrc usually handles though. in_pcbselsrc is used to pick > > an address for a locally terminated connection, while icmp_reflect > > can be used to reply to a packet going through the box. let's not > > pretend these are completely identical situations. > > Yes. This is also the big issue of icmp reflect since it needs to select a > source IP for a situation where the system can not decide which IP is > better. An Ip from the received interface, that of the route the > destination points at, or an IP of the interface the ICMP will actually > go out with. > > We have 3 options and all 3 have issues. It is a game we can not win in > code. With respecting the "route sourceaddr" we have at least a way to > decide this via configuration. > > > > To be honest I think it should not matter which IP is selected in the IPv4 > > > case. All adresses on the interface can be used to send out an ICMP error. > > > Now I do agree that using the same logic as in in_pcbselsrc() would benefit > > > consitancy. > > > > i mostly agree, except selecting 127.0.0.1 to reply to a packet that > > arrived off the wire seems bloody minded when there's (what i consider) > > reasonable ways to avoid it. > > Very much agree with that. I was mostly thinking of selecting an IP of an > external interface (like the one the packet was received on). while investigating a problem today i noticed an issue that this diff fixes. we have ipsec/ikev2 vpn servers for a bunch of windows boxes, which is a fairly vanilla ipsec setup on openbsd. the box has a single public ip facing the world, and iked allocates the clients addresses out of a 192.168 range. the rest of the world has a route for the 192.168 address via the vpn servers public ip. to avoid traffic loops when sending traffic to disconnected peers, we add a blackhole route on the vpn servers for the 192.168 range, which looks like this: # netstat -nr -f inet Destination Gateway Flags Refs Use Mtu Prio Iface default 192.0.2.97 UGS 10 2059524192 - 8 vmx0 224/4 127.0.0.1 URS 0 10139877 32768 8 lo0 127/8 127.0.0.1 UGRS 0 0 32768 8 lo0 127.0.0.1 127.0.0.1 UHhl 2 8541 32768 1 lo0 192.0.2.96/29 192.0.2.101 UCn 2 200605 - 4 vmx0 192.0.2.97 00:00:5e:00:01:50 UHLch 1 42617 - 3 vmx0 192.0.2.99 bc:2c:55:d0:ea:d2 UHLc 0 168100 - 3 vmx0 192.0.2.101 00:50:56:a1:2a:d3 UHLl 0 950034255 - 1 vmx0 192.0.2.103 192.0.2.101 UHb 0 0 - 1 vmx0 192.168.76.128/25 127.0.0.1 UGSB 0 1839277300 32768 8 lo0 there is a problem if we want to talk to vpn clients from the vpn server itself. the route lookup for clients in the 192.168 range makes it look like the closest ip address is 127.0.0.1 cos the route is on the loopback interface, but you shouldnt use 127/8 "on the wire" for any packets leaving the box. we cope with this by using route sourceaddr -ifp vmx0: $ route -n sourceaddr Preferred source address set for rdomain 0 IPv4: 192.0.2.101 IPv6: default with this config we end up with the public ip as the if address that gets used as the source ip when connections are made to the vpn endpoints: $ route -n get 192.168.76.222 route to: 192.168.76.222 destination: 192.168.76.128 mask: 255.255.255.128 gateway: 127.0.0.1 interface: lo0 if address: 192.0.2.101 priority: 8 (static) flags: use mtu expire 1839323462 32768 0 if there's an active endpoint on that IP, then we can send packets to it with our public IP as the source address, and ipsec will steal it before it gets dumped by the loopback interface. the problem is if a vpn endpoint sends us something and we have to generate an icmp packet and reflect it back. because icmp_reflect wasnt updated like the source address selection on sockets was for sourceaddr, we see stuff like this: 15:48:54.350914 (authentic,confidential): SPI 0x41f838b5: 192.168.76.164.55401 > 255.255.255.255.1947: udp 40 (encap) 15:48:54.350940 (authentic,confidential): SPI 0xfc332686: 127.0.0.1 > 192.168.76.164: icmp: 255.255.255.255 udp port 1947 unreachable (encap) 15:48:54.735090 (authentic,confidential): SPI 0x8b05c3b7: 192.168.76.150.17500 > 255.255.255.255.17500: udp 325 (encap) 15:48:54.735115 (authentic,confidential): SPI 0x25f1f8ce: 127.0.0.1 > 192.168.76.150: icmp: 255.255.255.255 udp port 17500 unreachable (encap) icmp_reflect() is using the ip from the interface with the route, which is 127.0.0.1 on lo0 which has the blackhole route on it. with this diff, the icmp replies come from the route sourceaddr, consistent with the pcb source address selection: 16:15:26.459005 (authentic,confidential): SPI 0xe6b11fd2: 192.168.76.169.50377 > 255.255.255.255.1947: udp 40 (encap) 16:15:26.459039 (authentic,confidential): SPI 0x37658bde: 192.0.2.101 > 192.168.76.169: icmp: 255.255.255.255 udp port 1947 unreachable (encap) 16:15:46.200838 (authentic,confidential): SPI 0x8bfcd15f: 192.168.76.216.57621 > 255.255.255.255.8612: udp 16 (encap) 16:15:46.200849 (authentic,confidential): SPI 0xb0047d40: 192.0.2.101 > 192.168.76.216: icmp: 255.255.255.255 udp port 8612 unreachable (encap) however, i still believe icmp_reflection should prefer the IP on the interface a forwarded packet was received on over the local ip of the destination network. again, this helps avoid picking 127.0.0.1 for blackhole/reject routes. this is not a situation that pcb source address selection has to deal with, so it's ok for the decisions to be made a little differently. Index: ip_icmp.c =================================================================== RCS file: /cvs/src/sys/netinet/ip_icmp.c,v diff -u -p -r1.199 ip_icmp.c --- ip_icmp.c 2 Mar 2025 21:28:32 -0000 1.199 +++ ip_icmp.c 4 Jun 2025 03:02:10 -0000 @@ -691,7 +691,8 @@ icmp_reflect(struct mbuf *m, struct mbuf struct ip *ip = mtod(m, struct ip *); struct mbuf *opts = NULL; struct sockaddr_in sin; - struct rtentry *rt = NULL; + struct rtentry *rt; + struct in_addr ip_src = { INADDR_ANY }; int optlen = (ip->ip_hl << 2) - sizeof(struct ip); u_int rtableid; u_int8_t pfflags; @@ -708,10 +709,6 @@ icmp_reflect(struct mbuf *m, struct mbuf return (ELOOP); } rtableid = m->m_pkthdr.ph_rtableid; - pfflags = m->m_pkthdr.pf.flags; - m_resethdr(m); - m->m_pkthdr.ph_rtableid = rtableid; - m->m_pkthdr.pf.flags = pfflags & PF_TAG_GENERATED; /* * If the incoming packet was addressed directly to us, @@ -725,19 +722,24 @@ icmp_reflect(struct mbuf *m, struct mbuf sin.sin_addr = ip->ip_dst; rt = rtalloc(sintosa(&sin), 0, rtableid); - if (rtisvalid(rt) && - ISSET(rt->rt_flags, RTF_LOCAL|RTF_BROADCAST)) - ia = ifatoia(rt->rt_ifa); - } + if (rtisvalid(rt)) { + if (ISSET(rt->rt_flags, RTF_LOCAL)) + ip_src = ip->ip_dst; + else if (ISSET(rt->rt_flags, RTF_BROADCAST)) { + ia = ifatoia(rt->rt_ifa); + ip_src = ia->ia_addr.sin_addr; + } + } + rtfree(rt); + } else + ip_src = ia->ia_addr.sin_addr; /* * The following happens if the packet was not addressed to us. - * Use the new source address and do a route lookup. If it fails - * drop the packet as there is no path to the host. + * If we're directly connected use the closest address, otherwise + * try to use the sourceaddr from the routing table. */ - if (ia == NULL) { - rtfree(rt); - + if (ip_src.s_addr == INADDR_ANY) { memset(&sin, 0, sizeof(sin)); sin.sin_len = sizeof(sin); sin.sin_family = AF_INET; @@ -745,21 +747,56 @@ icmp_reflect(struct mbuf *m, struct mbuf /* keep packet in the original virtual instance */ rt = rtalloc(sintosa(&sin), RT_RESOLVE, rtableid); - if (rt == NULL) { - ipstat_inc(ips_noroute); - m_freem(m); - return (EHOSTUNREACH); + if (rtisvalid(rt) && + ISSET(rt->rt_flags, RTF_LLINFO|RTF_HOST)) { + ia = ifatoia(rt->rt_ifa); + ip_src = ia->ia_addr.sin_addr; + } else { + struct sockaddr *sourceaddr; + struct ifaddr *ifa; + + sourceaddr = rtable_getsource(rtableid, AF_INET); + if (sourceaddr != NULL) { + ifa = ifa_ifwithaddr(sourceaddr, rtableid); + if (ifa != NULL && + ISSET(ifa->ifa_ifp->if_flags, IFF_UP)) + ip_src = satosin(sourceaddr)->sin_addr; + } } + rtfree(rt); + } - ia = ifatoia(rt->rt_ifa); + /* + * If the above didn't find an ip_src, get the IP of the + * interface the original packet was received on. If all this + * comes up with nothing, ip_output() will try and fill it + * in for us. + */ + if (ip_src.s_addr == INADDR_ANY) { + struct ifnet *ifp; + struct ifaddr *ifa; + + ifp = if_get(m->m_pkthdr.ph_ifidx); + if (ifp != NULL) { + TAILQ_FOREACH(ifa, &ifp->if_addrlist, ifa_list) { + if (ifa->ifa_addr->sa_family != AF_INET) + continue; + + ip_src = satosin(ifa->ifa_addr)->sin_addr; + break; + } + } + if_put(ifp); } + pfflags = m->m_pkthdr.pf.flags; + + m_resethdr(m); + m->m_pkthdr.ph_rtableid = rtableid; + m->m_pkthdr.pf.flags = pfflags & PF_TAG_GENERATED; ip->ip_dst = ip->ip_src; + ip->ip_src = ip_src; ip->ip_ttl = MAXTTL; - - /* It is safe to dereference ``ia'' iff ``rt'' is valid. */ - ip->ip_src = ia->ia_addr.sin_addr; - rtfree(rt); if (optlen > 0) { u_char *cp;