Download raw body.
ipv4 icmp_reflect() source address selection optimisation
On Fri, Jun 06, 2025 at 11:02:19PM +0200, Alexander Bluhm wrote:
> On Wed, Jun 04, 2025 at 09:09:01PM +0200, Claudio Jeker wrote:
> > On Wed, Jun 04, 2025 at 04:24:22PM +1000, David Gwynne wrote:
> > > On Wed, Aug 21, 2024 at 12:35:34PM +0200, Claudio Jeker wrote:
> > > > On Wed, Aug 21, 2024 at 03:40:23PM +1000, David Gwynne wrote:
> > > > > On Fri, Aug 09, 2024 at 11:47:14AM +0200, Claudio Jeker wrote:
> > > > > > On Fri, Aug 09, 2024 at 11:15:34AM +0200, Alexander Bluhm wrote:
> > > > > > > On Fri, Aug 09, 2024 at 10:40:11AM +1000, David Gwynne wrote:
> > > > > > > > tl;dr: i believe this change would let us simplify pf_route (the
> > > > > > > > loopback ip handling specifically), so it's worth it.
> > > > > > >
> > > > > > > In contrast I have worked on all the corner cases in 15 years to
> > > > > > > make it work. Maybe not the best solution, but it works. I fear
> > > > > > > that at our customers somethig will break if we change behavior
> > > > > > > just because we think it looks better.
> > > > >
> > > > > all of them? that's a bold claim.
> > > > >
> > > > > add this config the PF host in your test setup and then run the regress:
> > > > >
> > > > > # cat /etc/hostname.vport0
> > > > > inet 169.254.0.1 255.255.255.0
> > > > > up
> > > > > # cat /etc/hostname.pfsync0
> > > > > syncif vport0
> > > > > maxupd 128
> > > > > defer
> > > > > up
> > > > >
> > > > >
> > > > > im hitting edge cases with pf_route and pfsync, that's why im working
> > > > > on this. this bit is a chunk out of a much larger set of changes to fix
> > > > > it, which ive included below. i'm starting to worry that i'll have
> > > > > to carry it as part of my menagerie of local diffs on the firewalls cos
> > > > > i'm the only person in the world doing multipath and pfsync?
> > > > >
> > > > > my full diff for the pf_route and pfsync fixes is below. i'm not hugely
> > > > > happy with chewing up more mbuf tags ids, but we haven't come up with a
> > > > > less worse idea yet.
> > > > >
> > > > > > This is not only about looks. ICMP source selection is a massive pain
> > > > > > point on DFZ routers and we need to fix this. Reducing the amount of
> > > > > > corners to cover would be very benefitial.
> > > > > >
> > > > > > > In this particular case I think using the first addreess as source
> > > > > > > address is wrong. IPv4 source address selection should be done
> > > > > > > with a route lookup. Basically what in_pcbselsrc() does. And using
> > > > > > > the route interface address seems reasonable to me.
> > > > >
> > > > > for the large part my last diff does follow what in_pcbselsrc does
> > > > > more closely than the current code. it respects the "route sourceaddr"
> > > > > config while the current code doesnt.
> > > > >
> > > > > however, the icmp reflect situation is a little different to what
> > > > > in_pcbselsrc usually handles though. in_pcbselsrc is used to pick
> > > > > an address for a locally terminated connection, while icmp_reflect
> > > > > can be used to reply to a packet going through the box. let's not
> > > > > pretend these are completely identical situations.
> > > >
> > > > Yes. This is also the big issue of icmp reflect since it needs to select a
> > > > source IP for a situation where the system can not decide which IP is
> > > > better. An Ip from the received interface, that of the route the
> > > > destination points at, or an IP of the interface the ICMP will actually
> > > > go out with.
> > > >
> > > > We have 3 options and all 3 have issues. It is a game we can not win in
> > > > code. With respecting the "route sourceaddr" we have at least a way to
> > > > decide this via configuration.
> > > >
> > > > > > To be honest I think it should not matter which IP is selected in the IPv4
> > > > > > case. All adresses on the interface can be used to send out an ICMP error.
> > > > > > Now I do agree that using the same logic as in in_pcbselsrc() would benefit
> > > > > > consitancy.
> > > > >
> > > > > i mostly agree, except selecting 127.0.0.1 to reply to a packet that
> > > > > arrived off the wire seems bloody minded when there's (what i consider)
> > > > > reasonable ways to avoid it.
> > > >
> > > > Very much agree with that. I was mostly thinking of selecting an IP of an
> > > > external interface (like the one the packet was received on).
> > >
> > > while investigating a problem today i noticed an issue that this
> > > diff fixes.
> > >
> > > we have ipsec/ikev2 vpn servers for a bunch of windows boxes, which
> > > is a fairly vanilla ipsec setup on openbsd. the box has a single public
> > > ip facing the world, and iked allocates the clients addresses out
> > > of a 192.168 range. the rest of the world has a route for the 192.168
> > > address via the vpn servers public ip.
> > >
> > > to avoid traffic loops when sending traffic to disconnected peers, we
> > > add a blackhole route on the vpn servers for the 192.168 range, which
> > > looks like this:
> > >
> > > # netstat -nr -f inet
> > > Destination Gateway Flags Refs Use Mtu Prio Iface
> > > default 192.0.2.97 UGS 10 2059524192 - 8 vmx0
> > > 224/4 127.0.0.1 URS 0 10139877 32768 8 lo0
> > > 127/8 127.0.0.1 UGRS 0 0 32768 8 lo0
> > > 127.0.0.1 127.0.0.1 UHhl 2 8541 32768 1 lo0
> > > 192.0.2.96/29 192.0.2.101 UCn 2 200605 - 4 vmx0
> > > 192.0.2.97 00:00:5e:00:01:50 UHLch 1 42617 - 3 vmx0
> > > 192.0.2.99 bc:2c:55:d0:ea:d2 UHLc 0 168100 - 3 vmx0
> > > 192.0.2.101 00:50:56:a1:2a:d3 UHLl 0 950034255 - 1 vmx0
> > > 192.0.2.103 192.0.2.101 UHb 0 0 - 1 vmx0
> > > 192.168.76.128/25 127.0.0.1 UGSB 0 1839277300 32768 8 lo0
> > >
> > > there is a problem if we want to talk to vpn clients from the vpn
> > > server itself. the route lookup for clients in the 192.168 range
> > > makes it look like the closest ip address is 127.0.0.1 cos the route
> > > is on the loopback interface, but you shouldnt use 127/8 "on the
> > > wire" for any packets leaving the box. we cope with this by using
> > > route sourceaddr -ifp vmx0:
> > >
> > > $ route -n sourceaddr
> > > Preferred source address set for rdomain 0
> > > IPv4: 192.0.2.101
> > > IPv6: default
> > >
> > > with this config we end up with the public ip as the if address that gets used as the source ip when connections are made to the vpn endpoints:
> > >
> > > $ route -n get 192.168.76.222
> > > route to: 192.168.76.222
> > > destination: 192.168.76.128
> > > mask: 255.255.255.128
> > > gateway: 127.0.0.1
> > > interface: lo0
> > > if address: 192.0.2.101
> > > priority: 8 (static)
> > > flags: <UP,GATEWAY,DONE,STATIC,BLACKHOLE>
> > > use mtu expire
> > > 1839323462 32768 0
> > >
> > > if there's an active endpoint on that IP, then we can send packets
> > > to it with our public IP as the source address, and ipsec will steal
> > > it before it gets dumped by the loopback interface.
> > >
> > > the problem is if a vpn endpoint sends us something and we have to
> > > generate an icmp packet and reflect it back. because icmp_reflect
> > > wasnt updated like the source address selection on sockets was for
> > > sourceaddr, we see stuff like this:
> > >
> > > 15:48:54.350914 (authentic,confidential): SPI 0x41f838b5: 192.168.76.164.55401 > 255.255.255.255.1947: udp 40 (encap)
> > > 15:48:54.350940 (authentic,confidential): SPI 0xfc332686: 127.0.0.1 > 192.168.76.164: icmp: 255.255.255.255 udp port 1947 unreachable (encap)
> > > 15:48:54.735090 (authentic,confidential): SPI 0x8b05c3b7: 192.168.76.150.17500 > 255.255.255.255.17500: udp 325 (encap)
> > > 15:48:54.735115 (authentic,confidential): SPI 0x25f1f8ce: 127.0.0.1 > 192.168.76.150: icmp: 255.255.255.255 udp port 17500 unreachable (encap)
> > >
> > > icmp_reflect() is using the ip from the interface with the route,
> > > which is 127.0.0.1 on lo0 which has the blackhole route on it.
> > >
> > > with this diff, the icmp replies come from the route sourceaddr,
> > > consistent with the pcb source address selection:
> > >
> > > 16:15:26.459005 (authentic,confidential): SPI 0xe6b11fd2: 192.168.76.169.50377 > 255.255.255.255.1947: udp 40 (encap)
> > > 16:15:26.459039 (authentic,confidential): SPI 0x37658bde: 192.0.2.101 > 192.168.76.169: icmp: 255.255.255.255 udp port 1947 unreachable (encap)
> > > 16:15:46.200838 (authentic,confidential): SPI 0x8bfcd15f: 192.168.76.216.57621 > 255.255.255.255.8612: udp 16 (encap)
> > > 16:15:46.200849 (authentic,confidential): SPI 0xb0047d40: 192.0.2.101 > 192.168.76.216: icmp: 255.255.255.255 udp port 8612 unreachable (encap)
> > >
> > > however, i still believe icmp_reflection should prefer the IP on
> > > the interface a forwarded packet was received on over the local ip
> > > of the destination network. again, this helps avoid picking 127.0.0.1
> > > for blackhole/reject routes. this is not a situation that pcb source
> > > address selection has to deal with, so it's ok for the decisions
> > > to be made a little differently.
> >
> > I thought that route sourceaddr would also influence icmp_reflect().
> > It seems this was not done but it should be done since on IXP routers you
> > really don't want to reflect with the IXP lan IP.
> >
> > I think the route sourceaddr selection vs directly connected host is not
> > done correctly. At least matching on just RTF_HOST will also include
> > PMTUD cloned routes which is for sure wrong here.
> > ISSET(rt->rt_flags, RTF_LLINFO|RTF_HOST)) {
> >
> > I wonder if this should instead use !ISSET(..., RTF_GATEWAY) since gateway
> > routes are by definition not directly reachable.
> >
> > Anyway this affects also the in_pcb code and I think that should be fixed
> > afterwards in tree.
>
> This test passed regress. Except ipsec where some regex on expected
> icmp packets did not match. But these can be fixed.
>
> But I am still not convinced why we need a different algorithm for
> ICMP packets. We generate ICMP packets and we use the regular
> algorithm to do source selection. It is the address of the outgoing
> interface.
that's not quite true yet. the diff changes two things. the first
problem is that icmp_reflect does not consider route sourceaddr
like in_pcbselsrc does, which this diff fixes. the second is to use
where the mbuf came from (m->m_pkthdr.ph_ifidx) to pick an ip.
i would argue we're using the same algorithm for the common set of
info this code and in_pcbselsrc can make decisions with.
i could factor the code out for the address selection and use it for
both in_pcbselsrc and icmp_reflect, but because in_pcbselsrc doesn't
know if it's picking an address for a reply or not (ie, it doesn't have
access to an m->m_pkthdr.ph_ifidx), it wont use that chunk of code.
> All your trouble comes from the reject route to loopback. Then the
> ICMP packet has 127.0.0.1. Unfortunately reject only works with
> loopback.
>
> Why not implement the route reject feature in ipv4_input()? Then
> you control the route that is used to generate the ICMP source
> address. By giving the reject route a gateway, the source IP will
> be in its network. What do you think about this idea?
if the stack handles RTF_REJECT|RTF_BLACKHOLE, then i can do this in my
vpn concentrator setup:
$ route -qn add 192.168.76.128/25 192.0.2.101 -blackhole
that would look like this:
$ route -n get 192.168.76.128
route to: 192.168.76.128
destination: 192.168.76.128
mask: 255.255.255.128
gateway: 192.0.2.101
interface: vmx0
if address: 192.0.2.101
priority: 8 (static)
flags: <UP,GATEWAY,DONE,STATIC,BLACKHOLE>
use mtu expire
0 0 0
is that what you're suggesting?
i like that idea of handling RTF_REJECT|RTF_BLACKHOLE in the IP
stack instead of lo(4), and proposed it around the same time i sent
the original icmp_reflect diff out. i got the strong impression at
the time that you thought it was too risky. maybe you were talking
about the other changes i made to lo(4) at the time. i will revisit
that one day, but i have enough balls in the air already right now.
how about i make this a bit easier for you and split the changes
in this diff up?
this one has icmp_reflect respect the route sourceaddr like
in_pcbselsrc, but doesn't use the rcvif to pick an address.
Index: ip_icmp.c
===================================================================
RCS file: /cvs/src/sys/netinet/ip_icmp.c,v
diff -u -p -r1.199 ip_icmp.c
--- ip_icmp.c 2 Mar 2025 21:28:32 -0000 1.199
+++ ip_icmp.c 7 Jun 2025 03:51:30 -0000
@@ -691,7 +691,8 @@ icmp_reflect(struct mbuf *m, struct mbuf
struct ip *ip = mtod(m, struct ip *);
struct mbuf *opts = NULL;
struct sockaddr_in sin;
- struct rtentry *rt = NULL;
+ struct rtentry *rt;
+ struct in_addr ip_src = { INADDR_ANY };
int optlen = (ip->ip_hl << 2) - sizeof(struct ip);
u_int rtableid;
u_int8_t pfflags;
@@ -708,10 +709,6 @@ icmp_reflect(struct mbuf *m, struct mbuf
return (ELOOP);
}
rtableid = m->m_pkthdr.ph_rtableid;
- pfflags = m->m_pkthdr.pf.flags;
- m_resethdr(m);
- m->m_pkthdr.ph_rtableid = rtableid;
- m->m_pkthdr.pf.flags = pfflags & PF_TAG_GENERATED;
/*
* If the incoming packet was addressed directly to us,
@@ -725,19 +722,24 @@ icmp_reflect(struct mbuf *m, struct mbuf
sin.sin_addr = ip->ip_dst;
rt = rtalloc(sintosa(&sin), 0, rtableid);
- if (rtisvalid(rt) &&
- ISSET(rt->rt_flags, RTF_LOCAL|RTF_BROADCAST))
- ia = ifatoia(rt->rt_ifa);
- }
+ if (rtisvalid(rt)) {
+ if (ISSET(rt->rt_flags, RTF_LOCAL))
+ ip_src = ip->ip_dst;
+ else if (ISSET(rt->rt_flags, RTF_BROADCAST)) {
+ ia = ifatoia(rt->rt_ifa);
+ ip_src = ia->ia_addr.sin_addr;
+ }
+ }
+ rtfree(rt);
+ } else
+ ip_src = ia->ia_addr.sin_addr;
/*
* The following happens if the packet was not addressed to us.
- * Use the new source address and do a route lookup. If it fails
- * drop the packet as there is no path to the host.
+ * If we're directly connected use the closest address, otherwise
+ * try to use the sourceaddr from the routing table.
*/
- if (ia == NULL) {
- rtfree(rt);
-
+ if (ip_src.s_addr == INADDR_ANY) {
memset(&sin, 0, sizeof(sin));
sin.sin_len = sizeof(sin);
sin.sin_family = AF_INET;
@@ -745,21 +747,38 @@ icmp_reflect(struct mbuf *m, struct mbuf
/* keep packet in the original virtual instance */
rt = rtalloc(sintosa(&sin), RT_RESOLVE, rtableid);
- if (rt == NULL) {
- ipstat_inc(ips_noroute);
- m_freem(m);
- return (EHOSTUNREACH);
+ if (rtisvalid(rt) &&
+ ISSET(rt->rt_flags, RTF_LLINFO|RTF_HOST)) {
+ ia = ifatoia(rt->rt_ifa);
+ ip_src = ia->ia_addr.sin_addr;
+ } else {
+ struct sockaddr *sourceaddr;
+ struct ifaddr *ifa;
+
+ sourceaddr = rtable_getsource(rtableid, AF_INET);
+ if (sourceaddr != NULL) {
+ ifa = ifa_ifwithaddr(sourceaddr, rtableid);
+ if (ifa != NULL &&
+ ISSET(ifa->ifa_ifp->if_flags, IFF_UP))
+ ip_src = satosin(sourceaddr)->sin_addr;
+ }
}
-
- ia = ifatoia(rt->rt_ifa);
+ rtfree(rt);
}
+ /*
+ * If the above didn't find an ip_src, ip_output() will try
+ * and fill it in for us.
+ */
+
+ pfflags = m->m_pkthdr.pf.flags;
+
+ m_resethdr(m);
+ m->m_pkthdr.ph_rtableid = rtableid;
+ m->m_pkthdr.pf.flags = pfflags & PF_TAG_GENERATED;
ip->ip_dst = ip->ip_src;
+ ip->ip_src = ip_src;
ip->ip_ttl = MAXTTL;
-
- /* It is safe to dereference ``ia'' iff ``rt'' is valid. */
- ip->ip_src = ia->ia_addr.sin_addr;
- rtfree(rt);
if (optlen > 0) {
u_char *cp;
ipv4 icmp_reflect() source address selection optimisation