Download raw body.
ipv4 icmp_reflect() source address selection optimisation
On Wed, Jun 04, 2025 at 04:24:22PM +1000, David Gwynne wrote:
> On Wed, Aug 21, 2024 at 12:35:34PM +0200, Claudio Jeker wrote:
> > On Wed, Aug 21, 2024 at 03:40:23PM +1000, David Gwynne wrote:
> > > On Fri, Aug 09, 2024 at 11:47:14AM +0200, Claudio Jeker wrote:
> > > > On Fri, Aug 09, 2024 at 11:15:34AM +0200, Alexander Bluhm wrote:
> > > > > On Fri, Aug 09, 2024 at 10:40:11AM +1000, David Gwynne wrote:
> > > > > > tl;dr: i believe this change would let us simplify pf_route (the
> > > > > > loopback ip handling specifically), so it's worth it.
> > > > >
> > > > > In contrast I have worked on all the corner cases in 15 years to
> > > > > make it work. Maybe not the best solution, but it works. I fear
> > > > > that at our customers somethig will break if we change behavior
> > > > > just because we think it looks better.
> > >
> > > all of them? that's a bold claim.
> > >
> > > add this config the PF host in your test setup and then run the regress:
> > >
> > > # cat /etc/hostname.vport0
> > > inet 169.254.0.1 255.255.255.0
> > > up
> > > # cat /etc/hostname.pfsync0
> > > syncif vport0
> > > maxupd 128
> > > defer
> > > up
> > >
> > >
> > > im hitting edge cases with pf_route and pfsync, that's why im working
> > > on this. this bit is a chunk out of a much larger set of changes to fix
> > > it, which ive included below. i'm starting to worry that i'll have
> > > to carry it as part of my menagerie of local diffs on the firewalls cos
> > > i'm the only person in the world doing multipath and pfsync?
> > >
> > > my full diff for the pf_route and pfsync fixes is below. i'm not hugely
> > > happy with chewing up more mbuf tags ids, but we haven't come up with a
> > > less worse idea yet.
> > >
> > > > This is not only about looks. ICMP source selection is a massive pain
> > > > point on DFZ routers and we need to fix this. Reducing the amount of
> > > > corners to cover would be very benefitial.
> > > >
> > > > > In this particular case I think using the first addreess as source
> > > > > address is wrong. IPv4 source address selection should be done
> > > > > with a route lookup. Basically what in_pcbselsrc() does. And using
> > > > > the route interface address seems reasonable to me.
> > >
> > > for the large part my last diff does follow what in_pcbselsrc does
> > > more closely than the current code. it respects the "route sourceaddr"
> > > config while the current code doesnt.
> > >
> > > however, the icmp reflect situation is a little different to what
> > > in_pcbselsrc usually handles though. in_pcbselsrc is used to pick
> > > an address for a locally terminated connection, while icmp_reflect
> > > can be used to reply to a packet going through the box. let's not
> > > pretend these are completely identical situations.
> >
> > Yes. This is also the big issue of icmp reflect since it needs to select a
> > source IP for a situation where the system can not decide which IP is
> > better. An Ip from the received interface, that of the route the
> > destination points at, or an IP of the interface the ICMP will actually
> > go out with.
> >
> > We have 3 options and all 3 have issues. It is a game we can not win in
> > code. With respecting the "route sourceaddr" we have at least a way to
> > decide this via configuration.
> >
> > > > To be honest I think it should not matter which IP is selected in the IPv4
> > > > case. All adresses on the interface can be used to send out an ICMP error.
> > > > Now I do agree that using the same logic as in in_pcbselsrc() would benefit
> > > > consitancy.
> > >
> > > i mostly agree, except selecting 127.0.0.1 to reply to a packet that
> > > arrived off the wire seems bloody minded when there's (what i consider)
> > > reasonable ways to avoid it.
> >
> > Very much agree with that. I was mostly thinking of selecting an IP of an
> > external interface (like the one the packet was received on).
>
> while investigating a problem today i noticed an issue that this
> diff fixes.
>
> we have ipsec/ikev2 vpn servers for a bunch of windows boxes, which
> is a fairly vanilla ipsec setup on openbsd. the box has a single public
> ip facing the world, and iked allocates the clients addresses out
> of a 192.168 range. the rest of the world has a route for the 192.168
> address via the vpn servers public ip.
>
> to avoid traffic loops when sending traffic to disconnected peers, we
> add a blackhole route on the vpn servers for the 192.168 range, which
> looks like this:
>
> # netstat -nr -f inet
> Destination Gateway Flags Refs Use Mtu Prio Iface
> default 192.0.2.97 UGS 10 2059524192 - 8 vmx0
> 224/4 127.0.0.1 URS 0 10139877 32768 8 lo0
> 127/8 127.0.0.1 UGRS 0 0 32768 8 lo0
> 127.0.0.1 127.0.0.1 UHhl 2 8541 32768 1 lo0
> 192.0.2.96/29 192.0.2.101 UCn 2 200605 - 4 vmx0
> 192.0.2.97 00:00:5e:00:01:50 UHLch 1 42617 - 3 vmx0
> 192.0.2.99 bc:2c:55:d0:ea:d2 UHLc 0 168100 - 3 vmx0
> 192.0.2.101 00:50:56:a1:2a:d3 UHLl 0 950034255 - 1 vmx0
> 192.0.2.103 192.0.2.101 UHb 0 0 - 1 vmx0
> 192.168.76.128/25 127.0.0.1 UGSB 0 1839277300 32768 8 lo0
>
> there is a problem if we want to talk to vpn clients from the vpn
> server itself. the route lookup for clients in the 192.168 range
> makes it look like the closest ip address is 127.0.0.1 cos the route
> is on the loopback interface, but you shouldnt use 127/8 "on the
> wire" for any packets leaving the box. we cope with this by using
> route sourceaddr -ifp vmx0:
>
> $ route -n sourceaddr
> Preferred source address set for rdomain 0
> IPv4: 192.0.2.101
> IPv6: default
>
> with this config we end up with the public ip as the if address that gets used as the source ip when connections are made to the vpn endpoints:
>
> $ route -n get 192.168.76.222
> route to: 192.168.76.222
> destination: 192.168.76.128
> mask: 255.255.255.128
> gateway: 127.0.0.1
> interface: lo0
> if address: 192.0.2.101
> priority: 8 (static)
> flags: <UP,GATEWAY,DONE,STATIC,BLACKHOLE>
> use mtu expire
> 1839323462 32768 0
>
> if there's an active endpoint on that IP, then we can send packets
> to it with our public IP as the source address, and ipsec will steal
> it before it gets dumped by the loopback interface.
>
> the problem is if a vpn endpoint sends us something and we have to
> generate an icmp packet and reflect it back. because icmp_reflect
> wasnt updated like the source address selection on sockets was for
> sourceaddr, we see stuff like this:
>
> 15:48:54.350914 (authentic,confidential): SPI 0x41f838b5: 192.168.76.164.55401 > 255.255.255.255.1947: udp 40 (encap)
> 15:48:54.350940 (authentic,confidential): SPI 0xfc332686: 127.0.0.1 > 192.168.76.164: icmp: 255.255.255.255 udp port 1947 unreachable (encap)
> 15:48:54.735090 (authentic,confidential): SPI 0x8b05c3b7: 192.168.76.150.17500 > 255.255.255.255.17500: udp 325 (encap)
> 15:48:54.735115 (authentic,confidential): SPI 0x25f1f8ce: 127.0.0.1 > 192.168.76.150: icmp: 255.255.255.255 udp port 17500 unreachable (encap)
>
> icmp_reflect() is using the ip from the interface with the route,
> which is 127.0.0.1 on lo0 which has the blackhole route on it.
>
> with this diff, the icmp replies come from the route sourceaddr,
> consistent with the pcb source address selection:
>
> 16:15:26.459005 (authentic,confidential): SPI 0xe6b11fd2: 192.168.76.169.50377 > 255.255.255.255.1947: udp 40 (encap)
> 16:15:26.459039 (authentic,confidential): SPI 0x37658bde: 192.0.2.101 > 192.168.76.169: icmp: 255.255.255.255 udp port 1947 unreachable (encap)
> 16:15:46.200838 (authentic,confidential): SPI 0x8bfcd15f: 192.168.76.216.57621 > 255.255.255.255.8612: udp 16 (encap)
> 16:15:46.200849 (authentic,confidential): SPI 0xb0047d40: 192.0.2.101 > 192.168.76.216: icmp: 255.255.255.255 udp port 8612 unreachable (encap)
>
> however, i still believe icmp_reflection should prefer the IP on
> the interface a forwarded packet was received on over the local ip
> of the destination network. again, this helps avoid picking 127.0.0.1
> for blackhole/reject routes. this is not a situation that pcb source
> address selection has to deal with, so it's ok for the decisions
> to be made a little differently.
I thought that route sourceaddr would also influence icmp_reflect().
It seems this was not done but it should be done since on IXP routers you
really don't want to reflect with the IXP lan IP.
I think the route sourceaddr selection vs directly connected host is not
done correctly. At least matching on just RTF_HOST will also include
PMTUD cloned routes which is for sure wrong here.
ISSET(rt->rt_flags, RTF_LLINFO|RTF_HOST)) {
I wonder if this should instead use !ISSET(..., RTF_GATEWAY) since gateway
routes are by definition not directly reachable.
Anyway this affects also the in_pcb code and I think that should be fixed
afterwards in tree.
> Index: ip_icmp.c
> ===================================================================
> RCS file: /cvs/src/sys/netinet/ip_icmp.c,v
> diff -u -p -r1.199 ip_icmp.c
> --- ip_icmp.c 2 Mar 2025 21:28:32 -0000 1.199
> +++ ip_icmp.c 4 Jun 2025 03:02:10 -0000
> @@ -691,7 +691,8 @@ icmp_reflect(struct mbuf *m, struct mbuf
> struct ip *ip = mtod(m, struct ip *);
> struct mbuf *opts = NULL;
> struct sockaddr_in sin;
> - struct rtentry *rt = NULL;
> + struct rtentry *rt;
> + struct in_addr ip_src = { INADDR_ANY };
> int optlen = (ip->ip_hl << 2) - sizeof(struct ip);
> u_int rtableid;
> u_int8_t pfflags;
> @@ -708,10 +709,6 @@ icmp_reflect(struct mbuf *m, struct mbuf
> return (ELOOP);
> }
> rtableid = m->m_pkthdr.ph_rtableid;
> - pfflags = m->m_pkthdr.pf.flags;
> - m_resethdr(m);
> - m->m_pkthdr.ph_rtableid = rtableid;
> - m->m_pkthdr.pf.flags = pfflags & PF_TAG_GENERATED;
>
> /*
> * If the incoming packet was addressed directly to us,
> @@ -725,19 +722,24 @@ icmp_reflect(struct mbuf *m, struct mbuf
> sin.sin_addr = ip->ip_dst;
>
> rt = rtalloc(sintosa(&sin), 0, rtableid);
> - if (rtisvalid(rt) &&
> - ISSET(rt->rt_flags, RTF_LOCAL|RTF_BROADCAST))
> - ia = ifatoia(rt->rt_ifa);
> - }
> + if (rtisvalid(rt)) {
> + if (ISSET(rt->rt_flags, RTF_LOCAL))
> + ip_src = ip->ip_dst;
> + else if (ISSET(rt->rt_flags, RTF_BROADCAST)) {
> + ia = ifatoia(rt->rt_ifa);
> + ip_src = ia->ia_addr.sin_addr;
> + }
> + }
> + rtfree(rt);
> + } else
> + ip_src = ia->ia_addr.sin_addr;
>
> /*
> * The following happens if the packet was not addressed to us.
> - * Use the new source address and do a route lookup. If it fails
> - * drop the packet as there is no path to the host.
> + * If we're directly connected use the closest address, otherwise
> + * try to use the sourceaddr from the routing table.
> */
> - if (ia == NULL) {
> - rtfree(rt);
> -
> + if (ip_src.s_addr == INADDR_ANY) {
> memset(&sin, 0, sizeof(sin));
> sin.sin_len = sizeof(sin);
> sin.sin_family = AF_INET;
> @@ -745,21 +747,56 @@ icmp_reflect(struct mbuf *m, struct mbuf
>
> /* keep packet in the original virtual instance */
> rt = rtalloc(sintosa(&sin), RT_RESOLVE, rtableid);
> - if (rt == NULL) {
> - ipstat_inc(ips_noroute);
> - m_freem(m);
> - return (EHOSTUNREACH);
> + if (rtisvalid(rt) &&
> + ISSET(rt->rt_flags, RTF_LLINFO|RTF_HOST)) {
> + ia = ifatoia(rt->rt_ifa);
> + ip_src = ia->ia_addr.sin_addr;
> + } else {
> + struct sockaddr *sourceaddr;
> + struct ifaddr *ifa;
> +
> + sourceaddr = rtable_getsource(rtableid, AF_INET);
> + if (sourceaddr != NULL) {
> + ifa = ifa_ifwithaddr(sourceaddr, rtableid);
> + if (ifa != NULL &&
> + ISSET(ifa->ifa_ifp->if_flags, IFF_UP))
> + ip_src = satosin(sourceaddr)->sin_addr;
> + }
> }
> + rtfree(rt);
> + }
>
> - ia = ifatoia(rt->rt_ifa);
> + /*
> + * If the above didn't find an ip_src, get the IP of the
> + * interface the original packet was received on. If all this
> + * comes up with nothing, ip_output() will try and fill it
> + * in for us.
> + */
> + if (ip_src.s_addr == INADDR_ANY) {
> + struct ifnet *ifp;
> + struct ifaddr *ifa;
> +
> + ifp = if_get(m->m_pkthdr.ph_ifidx);
> + if (ifp != NULL) {
> + TAILQ_FOREACH(ifa, &ifp->if_addrlist, ifa_list) {
> + if (ifa->ifa_addr->sa_family != AF_INET)
> + continue;
> +
> + ip_src = satosin(ifa->ifa_addr)->sin_addr;
> + break;
> + }
> + }
> + if_put(ifp);
> }
>
> + pfflags = m->m_pkthdr.pf.flags;
> +
> + m_resethdr(m);
> + m->m_pkthdr.ph_rtableid = rtableid;
> + m->m_pkthdr.pf.flags = pfflags & PF_TAG_GENERATED;
> ip->ip_dst = ip->ip_src;
> + ip->ip_src = ip_src;
> ip->ip_ttl = MAXTTL;
> -
> - /* It is safe to dereference ``ia'' iff ``rt'' is valid. */
> - ip->ip_src = ia->ia_addr.sin_addr;
> - rtfree(rt);
>
> if (optlen > 0) {
> u_char *cp;
>
--
:wq Claudio
ipv4 icmp_reflect() source address selection optimisation