From: Claudio Jeker Subject: Re: ipv4 icmp_reflect() source address selection optimisation To: David Gwynne Cc: tech@openbsd.org Date: Wed, 4 Jun 2025 21:09:01 +0200 On Wed, Jun 04, 2025 at 04:24:22PM +1000, David Gwynne wrote: > On Wed, Aug 21, 2024 at 12:35:34PM +0200, Claudio Jeker wrote: > > On Wed, Aug 21, 2024 at 03:40:23PM +1000, David Gwynne wrote: > > > On Fri, Aug 09, 2024 at 11:47:14AM +0200, Claudio Jeker wrote: > > > > On Fri, Aug 09, 2024 at 11:15:34AM +0200, Alexander Bluhm wrote: > > > > > On Fri, Aug 09, 2024 at 10:40:11AM +1000, David Gwynne wrote: > > > > > > tl;dr: i believe this change would let us simplify pf_route (the > > > > > > loopback ip handling specifically), so it's worth it. > > > > > > > > > > In contrast I have worked on all the corner cases in 15 years to > > > > > make it work. Maybe not the best solution, but it works. I fear > > > > > that at our customers somethig will break if we change behavior > > > > > just because we think it looks better. > > > > > > all of them? that's a bold claim. > > > > > > add this config the PF host in your test setup and then run the regress: > > > > > > # cat /etc/hostname.vport0 > > > inet 169.254.0.1 255.255.255.0 > > > up > > > # cat /etc/hostname.pfsync0 > > > syncif vport0 > > > maxupd 128 > > > defer > > > up > > > > > > > > > im hitting edge cases with pf_route and pfsync, that's why im working > > > on this. this bit is a chunk out of a much larger set of changes to fix > > > it, which ive included below. i'm starting to worry that i'll have > > > to carry it as part of my menagerie of local diffs on the firewalls cos > > > i'm the only person in the world doing multipath and pfsync? > > > > > > my full diff for the pf_route and pfsync fixes is below. i'm not hugely > > > happy with chewing up more mbuf tags ids, but we haven't come up with a > > > less worse idea yet. > > > > > > > This is not only about looks. ICMP source selection is a massive pain > > > > point on DFZ routers and we need to fix this. Reducing the amount of > > > > corners to cover would be very benefitial. > > > > > > > > > In this particular case I think using the first addreess as source > > > > > address is wrong. IPv4 source address selection should be done > > > > > with a route lookup. Basically what in_pcbselsrc() does. And using > > > > > the route interface address seems reasonable to me. > > > > > > for the large part my last diff does follow what in_pcbselsrc does > > > more closely than the current code. it respects the "route sourceaddr" > > > config while the current code doesnt. > > > > > > however, the icmp reflect situation is a little different to what > > > in_pcbselsrc usually handles though. in_pcbselsrc is used to pick > > > an address for a locally terminated connection, while icmp_reflect > > > can be used to reply to a packet going through the box. let's not > > > pretend these are completely identical situations. > > > > Yes. This is also the big issue of icmp reflect since it needs to select a > > source IP for a situation where the system can not decide which IP is > > better. An Ip from the received interface, that of the route the > > destination points at, or an IP of the interface the ICMP will actually > > go out with. > > > > We have 3 options and all 3 have issues. It is a game we can not win in > > code. With respecting the "route sourceaddr" we have at least a way to > > decide this via configuration. > > > > > > To be honest I think it should not matter which IP is selected in the IPv4 > > > > case. All adresses on the interface can be used to send out an ICMP error. > > > > Now I do agree that using the same logic as in in_pcbselsrc() would benefit > > > > consitancy. > > > > > > i mostly agree, except selecting 127.0.0.1 to reply to a packet that > > > arrived off the wire seems bloody minded when there's (what i consider) > > > reasonable ways to avoid it. > > > > Very much agree with that. I was mostly thinking of selecting an IP of an > > external interface (like the one the packet was received on). > > while investigating a problem today i noticed an issue that this > diff fixes. > > we have ipsec/ikev2 vpn servers for a bunch of windows boxes, which > is a fairly vanilla ipsec setup on openbsd. the box has a single public > ip facing the world, and iked allocates the clients addresses out > of a 192.168 range. the rest of the world has a route for the 192.168 > address via the vpn servers public ip. > > to avoid traffic loops when sending traffic to disconnected peers, we > add a blackhole route on the vpn servers for the 192.168 range, which > looks like this: > > # netstat -nr -f inet > Destination Gateway Flags Refs Use Mtu Prio Iface > default 192.0.2.97 UGS 10 2059524192 - 8 vmx0 > 224/4 127.0.0.1 URS 0 10139877 32768 8 lo0 > 127/8 127.0.0.1 UGRS 0 0 32768 8 lo0 > 127.0.0.1 127.0.0.1 UHhl 2 8541 32768 1 lo0 > 192.0.2.96/29 192.0.2.101 UCn 2 200605 - 4 vmx0 > 192.0.2.97 00:00:5e:00:01:50 UHLch 1 42617 - 3 vmx0 > 192.0.2.99 bc:2c:55:d0:ea:d2 UHLc 0 168100 - 3 vmx0 > 192.0.2.101 00:50:56:a1:2a:d3 UHLl 0 950034255 - 1 vmx0 > 192.0.2.103 192.0.2.101 UHb 0 0 - 1 vmx0 > 192.168.76.128/25 127.0.0.1 UGSB 0 1839277300 32768 8 lo0 > > there is a problem if we want to talk to vpn clients from the vpn > server itself. the route lookup for clients in the 192.168 range > makes it look like the closest ip address is 127.0.0.1 cos the route > is on the loopback interface, but you shouldnt use 127/8 "on the > wire" for any packets leaving the box. we cope with this by using > route sourceaddr -ifp vmx0: > > $ route -n sourceaddr > Preferred source address set for rdomain 0 > IPv4: 192.0.2.101 > IPv6: default > > with this config we end up with the public ip as the if address that gets used as the source ip when connections are made to the vpn endpoints: > > $ route -n get 192.168.76.222 > route to: 192.168.76.222 > destination: 192.168.76.128 > mask: 255.255.255.128 > gateway: 127.0.0.1 > interface: lo0 > if address: 192.0.2.101 > priority: 8 (static) > flags: > use mtu expire > 1839323462 32768 0 > > if there's an active endpoint on that IP, then we can send packets > to it with our public IP as the source address, and ipsec will steal > it before it gets dumped by the loopback interface. > > the problem is if a vpn endpoint sends us something and we have to > generate an icmp packet and reflect it back. because icmp_reflect > wasnt updated like the source address selection on sockets was for > sourceaddr, we see stuff like this: > > 15:48:54.350914 (authentic,confidential): SPI 0x41f838b5: 192.168.76.164.55401 > 255.255.255.255.1947: udp 40 (encap) > 15:48:54.350940 (authentic,confidential): SPI 0xfc332686: 127.0.0.1 > 192.168.76.164: icmp: 255.255.255.255 udp port 1947 unreachable (encap) > 15:48:54.735090 (authentic,confidential): SPI 0x8b05c3b7: 192.168.76.150.17500 > 255.255.255.255.17500: udp 325 (encap) > 15:48:54.735115 (authentic,confidential): SPI 0x25f1f8ce: 127.0.0.1 > 192.168.76.150: icmp: 255.255.255.255 udp port 17500 unreachable (encap) > > icmp_reflect() is using the ip from the interface with the route, > which is 127.0.0.1 on lo0 which has the blackhole route on it. > > with this diff, the icmp replies come from the route sourceaddr, > consistent with the pcb source address selection: > > 16:15:26.459005 (authentic,confidential): SPI 0xe6b11fd2: 192.168.76.169.50377 > 255.255.255.255.1947: udp 40 (encap) > 16:15:26.459039 (authentic,confidential): SPI 0x37658bde: 192.0.2.101 > 192.168.76.169: icmp: 255.255.255.255 udp port 1947 unreachable (encap) > 16:15:46.200838 (authentic,confidential): SPI 0x8bfcd15f: 192.168.76.216.57621 > 255.255.255.255.8612: udp 16 (encap) > 16:15:46.200849 (authentic,confidential): SPI 0xb0047d40: 192.0.2.101 > 192.168.76.216: icmp: 255.255.255.255 udp port 8612 unreachable (encap) > > however, i still believe icmp_reflection should prefer the IP on > the interface a forwarded packet was received on over the local ip > of the destination network. again, this helps avoid picking 127.0.0.1 > for blackhole/reject routes. this is not a situation that pcb source > address selection has to deal with, so it's ok for the decisions > to be made a little differently. I thought that route sourceaddr would also influence icmp_reflect(). It seems this was not done but it should be done since on IXP routers you really don't want to reflect with the IXP lan IP. I think the route sourceaddr selection vs directly connected host is not done correctly. At least matching on just RTF_HOST will also include PMTUD cloned routes which is for sure wrong here. ISSET(rt->rt_flags, RTF_LLINFO|RTF_HOST)) { I wonder if this should instead use !ISSET(..., RTF_GATEWAY) since gateway routes are by definition not directly reachable. Anyway this affects also the in_pcb code and I think that should be fixed afterwards in tree. > Index: ip_icmp.c > =================================================================== > RCS file: /cvs/src/sys/netinet/ip_icmp.c,v > diff -u -p -r1.199 ip_icmp.c > --- ip_icmp.c 2 Mar 2025 21:28:32 -0000 1.199 > +++ ip_icmp.c 4 Jun 2025 03:02:10 -0000 > @@ -691,7 +691,8 @@ icmp_reflect(struct mbuf *m, struct mbuf > struct ip *ip = mtod(m, struct ip *); > struct mbuf *opts = NULL; > struct sockaddr_in sin; > - struct rtentry *rt = NULL; > + struct rtentry *rt; > + struct in_addr ip_src = { INADDR_ANY }; > int optlen = (ip->ip_hl << 2) - sizeof(struct ip); > u_int rtableid; > u_int8_t pfflags; > @@ -708,10 +709,6 @@ icmp_reflect(struct mbuf *m, struct mbuf > return (ELOOP); > } > rtableid = m->m_pkthdr.ph_rtableid; > - pfflags = m->m_pkthdr.pf.flags; > - m_resethdr(m); > - m->m_pkthdr.ph_rtableid = rtableid; > - m->m_pkthdr.pf.flags = pfflags & PF_TAG_GENERATED; > > /* > * If the incoming packet was addressed directly to us, > @@ -725,19 +722,24 @@ icmp_reflect(struct mbuf *m, struct mbuf > sin.sin_addr = ip->ip_dst; > > rt = rtalloc(sintosa(&sin), 0, rtableid); > - if (rtisvalid(rt) && > - ISSET(rt->rt_flags, RTF_LOCAL|RTF_BROADCAST)) > - ia = ifatoia(rt->rt_ifa); > - } > + if (rtisvalid(rt)) { > + if (ISSET(rt->rt_flags, RTF_LOCAL)) > + ip_src = ip->ip_dst; > + else if (ISSET(rt->rt_flags, RTF_BROADCAST)) { > + ia = ifatoia(rt->rt_ifa); > + ip_src = ia->ia_addr.sin_addr; > + } > + } > + rtfree(rt); > + } else > + ip_src = ia->ia_addr.sin_addr; > > /* > * The following happens if the packet was not addressed to us. > - * Use the new source address and do a route lookup. If it fails > - * drop the packet as there is no path to the host. > + * If we're directly connected use the closest address, otherwise > + * try to use the sourceaddr from the routing table. > */ > - if (ia == NULL) { > - rtfree(rt); > - > + if (ip_src.s_addr == INADDR_ANY) { > memset(&sin, 0, sizeof(sin)); > sin.sin_len = sizeof(sin); > sin.sin_family = AF_INET; > @@ -745,21 +747,56 @@ icmp_reflect(struct mbuf *m, struct mbuf > > /* keep packet in the original virtual instance */ > rt = rtalloc(sintosa(&sin), RT_RESOLVE, rtableid); > - if (rt == NULL) { > - ipstat_inc(ips_noroute); > - m_freem(m); > - return (EHOSTUNREACH); > + if (rtisvalid(rt) && > + ISSET(rt->rt_flags, RTF_LLINFO|RTF_HOST)) { > + ia = ifatoia(rt->rt_ifa); > + ip_src = ia->ia_addr.sin_addr; > + } else { > + struct sockaddr *sourceaddr; > + struct ifaddr *ifa; > + > + sourceaddr = rtable_getsource(rtableid, AF_INET); > + if (sourceaddr != NULL) { > + ifa = ifa_ifwithaddr(sourceaddr, rtableid); > + if (ifa != NULL && > + ISSET(ifa->ifa_ifp->if_flags, IFF_UP)) > + ip_src = satosin(sourceaddr)->sin_addr; > + } > } > + rtfree(rt); > + } > > - ia = ifatoia(rt->rt_ifa); > + /* > + * If the above didn't find an ip_src, get the IP of the > + * interface the original packet was received on. If all this > + * comes up with nothing, ip_output() will try and fill it > + * in for us. > + */ > + if (ip_src.s_addr == INADDR_ANY) { > + struct ifnet *ifp; > + struct ifaddr *ifa; > + > + ifp = if_get(m->m_pkthdr.ph_ifidx); > + if (ifp != NULL) { > + TAILQ_FOREACH(ifa, &ifp->if_addrlist, ifa_list) { > + if (ifa->ifa_addr->sa_family != AF_INET) > + continue; > + > + ip_src = satosin(ifa->ifa_addr)->sin_addr; > + break; > + } > + } > + if_put(ifp); > } > > + pfflags = m->m_pkthdr.pf.flags; > + > + m_resethdr(m); > + m->m_pkthdr.ph_rtableid = rtableid; > + m->m_pkthdr.pf.flags = pfflags & PF_TAG_GENERATED; > ip->ip_dst = ip->ip_src; > + ip->ip_src = ip_src; > ip->ip_ttl = MAXTTL; > - > - /* It is safe to dereference ``ia'' iff ``rt'' is valid. */ > - ip->ip_src = ia->ia_addr.sin_addr; > - rtfree(rt); > > if (optlen > 0) { > u_char *cp; > -- :wq Claudio