SoftLRO for ixl(4), bnxt(4) and em(4)
On Fri, Apr 04, 2025 at 07:13:41PM GMT, Jan Klemkow wrote:
> On Thu, Mar 20, 2025 at 05:37:30PM +0100, Alexander Bluhm wrote:
> > On Thu, Mar 20, 2025 at 03:20:25PM +0100, Alexander Bluhm wrote:
> > > On Tue, Mar 04, 2025 at 08:02:31PM +0100, Jan Klemkow wrote:
> > > > On Fri, Nov 15, 2024 at 11:30:08AM GMT, Jan Klemkow wrote:
> > > > > On Thu, Nov 07, 2024 at 11:30:26AM GMT, David Gwynne wrote:
> > > > > > On Thu, Nov 07, 2024 at 01:10:10AM +0100, Jan Klemkow wrote:
> > > > > > > This diff introduces a software solution for TCP Large Receive Offload
> > > > > > > (SoftLRO) for network interfaces that don't have hardware support for it.
> > > > > > > This is needed at least for newer Intel interfaces, as their
> > > > > > > documentation says that LRO a.k.a. Receive Side Coalescing (RSC) has to
> > > > > > > be done by software.
> > > > > > > This diff coalesces TCP segments during the receive interrupt before
> > > > > > > queueing them. Thus, our TCP/IP stack has to process fewer packet
> > > > > > > headers per amount of received data.
> > > > > > >
> > > > > > > I measured receiving performance with Intel XXV710 25 GbE interfaces.
> > > > > > > It increased from 6 Gbit/s to 23 Gbit/s.
> > > > > > >
> > > > > > > Even though we saturate em(4) without any of these techniques, it's
> > > > > > > also part of this diff. I'm interested if this diff helps to reach
> > > > > > > 1 Gbit/s on old or slow hardware.
> > > > > > >
> > > > > > > I also added bnxt(4) to this diff to increase test coverage. If you want
> > > > > > > to test this implementation with your favorite interface, just replace
> > > > > > > the ml_enqueue() call with the new tcp_softlro_enqueue() (as seen
> > > > > > > below). It should work with all kinds of network interfaces.
> > > > > > >
> > > > > > > Any comments and test reports are welcome.
> > > > > >
> > > > > > nice.
> > > > > >
> > > > > > i would argue this should be ether_softlro_enqueue and put in
> > > > > > if_ethersubr.c because it's specific to ethernet interfaces. we don't
> > > > > > really have any other type of interface that bundles reception of
> > > > > > packets that we can take advantage of like this, and internally it
> > > > > > assumes it's pulling ethernet packets apart.
> > > > > >
> > > > > > aside from that, just a few comments on the code.
> > > > >
> > > > > I incorporated your comments into the diff below.
> > > >
> > > > I refactored the SoftLRO diff. You just need to add the flags IFXF_LRO
> > > > / IFCAP_LRO, and replace ml_enqueue() with tcp_softlro_enqueue() to
> > > > enable this on your favorite network device.
> > > >
> > > > Janne: I adjusted your diff with correct headers. But, I'm unable to
> > > > test this part of the diff below, due to lack of hardware. Could you
> > > > test it again?
> > > >
> > > > Yuichiro: Could you also retest your UDP/TCP forwarding test? I added a
> > > > short path for non-TCP packets in the ixl(4) driver. Maybe it's better
> > > > now.
> > > >
> > > > Further tests and comments are welcome.
> > >
> > > As release lock will come soon, we should be careful. I think we
> > > can add the code if we ensure that default behavior does not change
> > > anything.
> > >
> > > We should start with ixl(4), with the ifconfig tcplro flag disabled by
> > > default. Then we can improve the TCP code in tree and add other
> > > drivers later. I found some issues like ethernet padding and
> > > TCP option parsing that should be improved. But doing this
> > > on top of an initial commit is easier.
> > >
> > > With some minor changes I am OK with committing the diff below.
> > >
> > > - remove all drivers except ixl(4) from the diff
> > > - turn off ifconfig tcplro by default
> > > - make sure that ixl(4) logic does not change, if tcplro is off
> > > - I have fixed the livelocked logic in ixl(4)
> > > - I renamed the function to tcp_enqueue_lro(), it is more consistent
> > > with the TSO functions.
> > >
> > > As said before, tcp_softlro() needs more love, but I want to
> > > do this in tree.
> > >
> > > With my comments, jan@'s diff looks like this and would be OK bluhm@
> >
> > Oops, I got a uvm fault in ixl_txeof() with this diff. Test was
> > sending single stream TCP from Linux to Linux machine while OpenBSD
> > was forwarding. ixl(4) LRO was activated. So it looks like it was
> > sending a packet that was previously received with LRO.
> >
> > uvm_fault(0xffffffff82ab4180, 0xfffffd90d6fd4ff8, 0, 1) -> e
> > kernel: page fault trap, code=0
> > Stopped at ixl_txeof+0x197: movl 0x8(%rdx,%rcx,1),%ecx
> > TID PID UID PRFLAGS PFLAGS CPU COMMAND
> > 314157 74541 0 0x14000 0x200 2 softnet3
> > 455485 27135 0 0x14000 0x200 1 softnet2
> > 432414 17394 0 0x14000 0x200 3 softnet1
> > ixl_txeof(ffff800000188000,ffff800000e1cf00) at ixl_txeof+0x197
> > ixl_intr_vector(ffff800000185300) at ixl_intr_vector+0x57
> > intr_handler(ffff800032c770c0,ffff800000175000) at intr_handler+0x91
> > Xintr_ioapic_edge28_untramp() at Xintr_ioapic_edge28_untramp+0x18f
> > acpicpu_idle() at acpicpu_idle+0x239
> > sched_idle(ffffffff829f2ff0) at sched_idle+0x298
> > end trace frame: 0x0, count: 9
> > https://www.openbsd.org/ddb.html describes the minimum info required in bug
> > reports. Insufficient info makes it difficult to find and fix bugs.
> >
> > ddb{0}> show panic
> > *cpu0: uvm_fault(0xffffffff82ab4180, 0xfffffd90d6fd4ff8, 0, 1) -> e
> >
> > ddb{0}> trace
> > ixl_txeof(ffff800000188000,ffff800000e1cf00) at ixl_txeof+0x197
> > ixl_intr_vector(ffff800000185300) at ixl_intr_vector+0x57
> > intr_handler(ffff800032c770c0,ffff800000175000) at intr_handler+0x91
> > Xintr_ioapic_edge28_untramp() at Xintr_ioapic_edge28_untramp+0x18f
> > acpicpu_idle() at acpicpu_idle+0x239
> > sched_idle(ffffffff829f2ff0) at sched_idle+0x298
> > end trace frame: 0x0, count: -6
> >
> > ddb{0}> show register
> > rdi 0x4
> > rsi 0xffff800001f6e5c0
> > rbp 0xffff800032c77010
> > rbx 0xffff800001f6c000
> > rdx 0xfffffd80d6fd5000
> > rcx 0xffffffff0
> > rax 0x192
> > r8 0x8
> > r9 0
> > r10 0x2096f2a745fa867e
> > r11 0x26076eed88feaba6
> > r12 0x196
> > r13 0x1
> > r14 0xffffffff
> > r15 0x192
> > rip 0xffffffff81d8a047 ixl_txeof+0x197
> > cs 0x8
> > rflags 0x10206 __ALIGN_SIZE+0xf206
> > rsp 0xffff800032c76fa0
> > ss 0x10
> > ixl_txeof+0x197: movl 0x8(%rdx,%rcx,1),%ecx
> >
> > ddb{0}> ps
> > PID TID PPID UID S FLAGS WAIT COMMAND
> > 38518 412339 93610 0 3 0x10008a kqread ssh
> > 47141 497841 69680 1000 3 0x100082 kqread tcpbench
> > 69680 482468 41677 1000 3 0x10008a sigsusp sh
> > 41677 137967 1 0 3 0x100088 sigsusp ksh
> > 93610 431262 70721 0 3 0x82 piperd perl
> > 70721 340494 84931 0 3 0x10008a sigsusp ksh
> > 84931 281543 76631 0 3 0x98 kqread sshd-session
> > 76631 384998 40558 0 3 0x92 kqread sshd-session
> > 2806 202482 1 0 3 0x100083 ttyin getty
> > 99278 67342 1 0 3 0x100098 kqread cron
> > 14460 108200 1 99 3 0x1100090 kqread sndiod
> > 49145 492411 1 110 3 0x100090 kqread sndiod
> > 88648 284214 17435 95 3 0x1100092 kqread smtpd
> > 85068 231497 17435 103 3 0x1100092 kqread smtpd
> > 48744 446192 17435 95 3 0x1100092 kqread smtpd
> > 81038 44933 17435 95 3 0x100092 kqread smtpd
> > 67852 435382 17435 95 3 0x1100092 kqread smtpd
> > 95510 326790 17435 95 3 0x1100092 kqread smtpd
> > 17435 136031 1 0 3 0x100080 kqread smtpd
> > 66494 299815 44745 91 3 0x92 kqread snmpd_metrics
> > 2427 408672 44745 91 3 0x1100092 kqread snmpd
> > 44745 351156 1 0 3 0x100080 kqread snmpd
> > 40558 490400 1 0 3 0x88 kqread sshd
> > 33095 460368 0 0 3 0x14200 acct acct
> > 86928 490746 0 0 3 0x14280 nfsidl nfsio
> > 40308 501410 0 0 3 0x14280 nfsidl nfsio
> > 14683 369864 0 0 3 0x14280 nfsidl nfsio
> > 86115 363991 0 0 3 0x14280 nfsidl nfsio
> > 80631 145485 1 0 3 0x100080 kqread ntpd
> > 62683 483071 70890 83 3 0x100092 kqread ntpd
> > 70890 270623 1 83 3 0x1100092 kqread ntpd
> > 59526 317744 53327 74 3 0x1100092 bpf pflogd
> > 53327 166856 1 0 3 0x80 sbwait pflogd
> > 46009 192808 60738 73 3 0x1100090 kqread syslogd
> > 60738 257468 1 0 3 0x100082 sbwait syslogd
> > 27089 12018 1 0 3 0x100080 kqread resolvd
> > 34559 118884 58086 77 3 0x100092 kqread dhcpleased
> > 50650 31850 58086 77 3 0x100092 kqread dhcpleased
> > 58086 260708 1 0 3 0x80 kqread dhcpleased
> > 57483 42427 93971 115 3 0x100092 kqread slaacd
> > 37908 42219 93971 115 3 0x100092 kqread slaacd
> > 93971 13019 1 0 3 0x100080 kqread slaacd
> > 82672 132677 0 0 3 0x14200 bored wsdisplay1
> > 85742 374184 0 0 3 0x14200 bored i915_flip
> > 72724 322745 0 0 3 0x14200 bored i915_modeset
> > 36100 66877 0 0 3 0x14200 bored card0-crtc2
> > 71449 154528 0 0 3 0x14200 bored card0-crtc1
> > 4004 449792 0 0 3 0x14200 bored card0-crtc0
> > 66389 398290 0 0 3 0x14200 bored ttm
> > 25629 195671 0 0 3 0x14200 bored i915-unordered
> > 30993 3026 0 0 3 0x14200 bored i915-dp
> > 64044 17892 0 0 3 0x14200 bored i915
> > 50213 284154 0 0 3 0x14200 bored smr
> > 40695 150715 0 0 3 0x14200 pgzero zerothread
> > 33090 235644 0 0 3 0x14200 aiodoned aiodoned
> > 89464 61329 0 0 3 0x14200 syncer update
> > 90507 444068 0 0 3 0x14200 cleaner cleaner
> > 60811 26863 0 0 3 0x14200 reaper reaper
> > 23643 261108 0 0 3 0x14200 pgdaemon pagedaemon
> > 48598 436260 0 0 3 0x14200 usbtsk usbtask
> > 44951 160405 0 0 3 0x14200 usbatsk usbatsk
> > 9085 470607 0 0 3 0x14200 bored drmtskl
> > 51075 14158 0 0 3 0x14200 bored drmlwq
> > 57977 332542 0 0 3 0x14200 bored drmlwq
> > 31093 34074 0 0 3 0x14200 bored drmlwq
> > 10444 303766 0 0 3 0x14200 bored drmlwq
> > 63278 505564 0 0 3 0x14200 bored drmubwq
> > 87295 374846 0 0 3 0x14200 bored drmubwq
> > 13798 142371 0 0 3 0x14200 bored drmubwq
> > 25664 191160 0 0 3 0x14200 bored drmubwq
> > 1920 14491 0 0 3 0x14200 bored drmhpwq
> > 32074 194715 0 0 3 0x14200 bored drmhpwq
> > 87299 408596 0 0 3 0x14200 bored drmhpwq
> > 84700 196185 0 0 3 0x14200 bored drmhpwq
> > 49492 134485 0 0 3 0x14200 bored drmwq
> > 85317 16007 0 0 3 0x14200 bored drmwq
> > 69861 222376 0 0 3 0x14200 bored drmwq
> > 58837 418320 0 0 3 0x14200 bored drmwq
> > 92915 102135 0 0 3 0x40014200 acpi0 acpi0
> > 62796 101256 0 0 3 0x40014200 idle3
> > 98829 504539 0 0 3 0x40014200 idle2
> > 74240 7414 0 0 3 0x40014200 idle1
> > 52393 30619 0 0 3 0x14200 bored sensors
> > 74541 314157 0 0 7 0x14200 softnet3
> > 27135 455485 0 0 7 0x14200 softnet2
> > 17394 432414 0 0 7 0x14200 softnet1
> > 67812 482110 0 0 3 0x14200 bored softnet0
> > 69613 421762 0 0 3 0x14200 bored systqmp
> > 69474 244487 0 0 3 0x14200 bored systq
> > 6630 315988 0 0 3 0x14200 tmoslp softclockmp
> > 4182 461805 0 0 3 0x40014200 tmoslp softclock
> > *98424 232380 0 0 7 0x40014200 idle0
> > 1 288021 0 0 3 0x82 wait init
> > 0 0 -1 0 3 0x10200 scheduler swapper
> >
> > ddb{0}> x/s version
> > version: OpenBSD 7.7-beta (GENERIC.MP) #cvs : D2025.03.19.00.00.00: Thu Mar 20 17:06:11 CET 2025\012 root@ot41.obsd-lab.genua.de:/usr/src/sys/arch/amd64/compile/GENERIC.MP\012
> >
> > Crash is here:
> > /home/bluhm/openbsd/cvs/src/sys/dev/pci/if_ixl.c:3032
> > 7a3c: 4c 89 f1 mov %r14,%rcx
> > 7a3f: 48 c1 e1 04 shl $0x4,%rcx
> > 7a43: 48 8b 55 98 mov 0xffffffffffffff98(%rbp),%rdx
> > 7a47: 8b 4c 0a 08 mov 0x8(%rdx,%rcx,1),%ecx
> > /home/bluhm/openbsd/cvs/src/sys/dev/pci/if_ixl.c:3033
> >
> > 3027 do {
> > 3028 txm = &txr->txr_maps[cons];
> > 3029 last = txm->txm_eop;
> > 3030 txd = &ring[last];
> > 3031
> > * 3032 dtype = txd->cmd & htole64(IXL_TX_DESC_DTYPE_MASK);
> > 3033 if (dtype != htole64(IXL_TX_DESC_DTYPE_DONE))
> > 3034 break;
> >
> > Variable last is -1.
> >
> > My guess is concatenating the TCP packet with LRO has some bugs.
> > Then an illegal packet causes the TSO send path to crash.
> >
> > As we disable LRO by default for now, I don't consider this crash
> > a show stopper.
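The register dump fits the observation that last is -1. As a rough cross-check, the sketch below recomputes the effective address of the faulting "movl 0x8(%rdx,%rcx,1),%ecx". fault_addr() is a hypothetical helper written only for this illustration; the stride (shl $0x4) and the displacement (0x8) are read off the disassembly above.

```c
#include <stdint.h>

/*
 * Recompute the effective address of "movl 0x8(%rdx,%rcx,1),%ecx".
 * fault_addr() is a hypothetical helper, not part of the driver.
 */
uint64_t
fault_addr(uint64_t ring_base, uint32_t last)
{
	/* shl $0x4,%rcx: the index is scaled by a 16 byte stride */
	uint64_t off = (uint64_t)last << 4;

	/* 0x8 is the displacement of the load in the disassembly */
	return ring_base + off + 0x8;
}
```

With ring_base = 0xfffffd80d6fd5000 (rdx) and last = 0xffffffff (r14), the offset is 0xffffffff0 (rcx) and the result is 0xfffffd90d6fd4ff8, which is the address reported by uvm_fault().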
>
> I reproduced and debugged these panics. We get 2 to 3 panics on
> different CPUs from ixl_txeof() as bluhm showed above and from
> ixl_rxeof() in ixl_rxfill() while trying to get a new mbuf.
>
> While debugging I noticed a lot of m_defrag() calls. We get them
> due to too long mbuf chains generated via SoftLRO. ixl(4) NICs are
> limited to a max. of 8 memory chunks per packet. SoftLRO can combine
> 40 or more 1.5k chunks into one chain. The outgoing ixl(4)
> interface has to call m_defrag() to reorganize these chains into fewer
> mbufs of bigger size.
>
> When I reduced the max. no. of mbufs per packet to 8, m_defrag() was
> never called and we didn't run into the panics above.
>
> I guess there is some kind of race and locking problem in the ring and
> mbuf handling of txeof/rxeof. Maybe in combination with the m_defrag()
> function. Or m_defrag() just changed the timing enough to trigger this bug.
>
> The diff below contains most of bluhm's improvements and code to limit
> the created mbuf chain to 8.
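For readers who want to play with the chain limit outside the kernel, here is a minimal userland sketch of the idea. struct seg and chain_fits() are hypothetical stand-ins for struct mbuf and the counting loops in tcp_softlro(); only the limit of 8 is taken from the text above.

```c
#include <stddef.h>

#define MAX_SEGS 8	/* chunks per packet ixl(4) accepts, per the text */

/* Hypothetical stand-in for struct mbuf; only the chain link is modeled. */
struct seg {
	struct seg *next;
};

/*
 * Return 1 if head and tail together stay within MAX_SEGS segments,
 * 0 if concatenating them would produce a longer chain.
 */
int
chain_fits(const struct seg *head, const struct seg *tail)
{
	const struct seg *s;
	unsigned int cnt = 0;

	for (s = head; s != NULL; s = s->next)
		if (++cnt > MAX_SEGS)
			return 0;
	for (s = tail; s != NULL; s = s->next)
		if (++cnt > MAX_SEGS)
			return 0;
	return 1;
}
```

A head chain of 7 segments plus a single-segment tail still fits; 8 plus 1 does not, so the tail would be queued as a packet of its own instead of being merged.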
The following diff adds checks for VLAN priorities.
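The idea of the VLAN check, as a standalone sketch: an 802.1Q tag carries the priority in the upper 3 bits and the VLAN ID in the lower 12 bits, and two packets are only merged if both match. TCI_VID(), TCI_PRIO() and vlan_mergeable() are hypothetical names for this illustration, modeled on the EVL_VLANOFTAG() and EVL_PRIOFTAG() macros used in the diff. The sketch assumes tags in host byte order.

```c
#include <stdint.h>

/* 802.1Q tag: priority in bits 15-13, VLAN ID in bits 11-0. */
#define TCI_VID(tci)	((uint16_t)((tci) & 0x0fff))
#define TCI_PRIO(tci)	((uint16_t)(((tci) >> 13) & 0x7))

/*
 * Return 1 if two tags (host byte order) allow merging, i.e. they
 * carry the same VLAN ID and the same priority. Hypothetical helper.
 */
int
vlan_mergeable(uint16_t head_tci, uint16_t tail_tci)
{
	/* Don't merge packets of different VLANs. */
	if (TCI_VID(head_tci) != TCI_VID(tail_tci))
		return 0;

	/* Don't merge packets of different priorities. */
	if (TCI_PRIO(head_tci) != TCI_PRIO(tail_tci))
		return 0;

	return 1;
}
```

For example, 0x2064 (priority 1, VLAN 100) and 0x4064 (priority 2, VLAN 100) are not mergeable, although they belong to the same VLAN.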
Thanks,
Jan
Index: dev/pci/if_ixl.c
===================================================================
RCS file: /cvs/src/sys/dev/pci/if_ixl.c,v
diff -u -p -r1.102 if_ixl.c
--- dev/pci/if_ixl.c 30 Oct 2024 18:02:45 -0000 1.102
+++ dev/pci/if_ixl.c 3 Apr 2025 15:45:43 -0000
@@ -883,6 +883,8 @@ struct ixl_rx_wb_desc_16 {
#define IXL_RX_DESC_PTYPE_SHIFT 30
#define IXL_RX_DESC_PTYPE_MASK (0xffULL << IXL_RX_DESC_PTYPE_SHIFT)
+#define IXL_RX_DESC_PTYPE_MAC_IPV4_TCP 26
+#define IXL_RX_DESC_PTYPE_MAC_IPV6_TCP 92
#define IXL_RX_DESC_PLEN_SHIFT 38
#define IXL_RX_DESC_PLEN_MASK (0x3fffULL << IXL_RX_DESC_PLEN_SHIFT)
@@ -1976,6 +1978,12 @@ ixl_attach(struct device *parent, struct
IFCAP_CSUM_TCPv6 | IFCAP_CSUM_UDPv6;
ifp->if_capabilities |= IFCAP_TSOv4 | IFCAP_TSOv6;
+ ifp->if_capabilities |= IFCAP_LRO;
+#if notyet
+ /* for now tcplro at ixl(4) is default off */
+ ifp->if_xflags |= IFXF_LRO;
+#endif
+
ifmedia_init(&sc->sc_media, 0, ixl_media_change, ixl_media_status);
ixl_media_add(sc, phy_types);
@@ -3255,9 +3263,11 @@ ixl_rxeof(struct ixl_softc *sc, struct i
struct ixl_rx_map *rxm;
bus_dmamap_t map;
unsigned int cons, prod;
+ struct mbuf_list mltcp = MBUF_LIST_INITIALIZER();
struct mbuf_list ml = MBUF_LIST_INITIALIZER();
struct mbuf *m;
uint64_t word;
+ unsigned int ptype;
unsigned int len;
unsigned int mask;
int done = 0;
@@ -3294,6 +3304,8 @@ ixl_rxeof(struct ixl_softc *sc, struct i
m = rxm->rxm_m;
rxm->rxm_m = NULL;
+ ptype = (word & IXL_RX_DESC_PTYPE_MASK)
+ >> IXL_RX_DESC_PTYPE_SHIFT;
len = (word & IXL_RX_DESC_PLEN_MASK) >> IXL_RX_DESC_PLEN_SHIFT;
m->m_len = len;
m->m_pkthdr.len = 0;
@@ -3324,7 +3336,13 @@ ixl_rxeof(struct ixl_softc *sc, struct i
#endif
ixl_rx_checksum(m, word);
- ml_enqueue(&ml, m);
+
+ if (ISSET(ifp->if_xflags, IFXF_LRO) &&
+ (ptype == IXL_RX_DESC_PTYPE_MAC_IPV4_TCP ||
+ ptype == IXL_RX_DESC_PTYPE_MAC_IPV6_TCP))
+ tcp_softlro_enqueue(ifp, &mltcp, m);
+ else
+ ml_enqueue(&ml, m);
} else {
ifp->if_ierrors++; /* XXX */
m_freem(m);
@@ -3341,8 +3359,14 @@ ixl_rxeof(struct ixl_softc *sc, struct i
} while (cons != prod);
if (done) {
+ int livelocked = 0;
+
rxr->rxr_cons = cons;
+ if (ifiq_input(ifiq, &mltcp))
+ livelocked = 1;
if (ifiq_input(ifiq, &ml))
+ livelocked = 1;
+ if (livelocked)
if_rxr_livelocked(&rxr->rxr_acct);
ixl_rxfill(sc, rxr);
}
Index: netinet/tcp_input.c
===================================================================
RCS file: /cvs/src/sys/netinet/tcp_input.c,v
diff -u -p -r1.434 tcp_input.c
--- netinet/tcp_input.c 10 Mar 2025 15:11:46 -0000 1.434
+++ netinet/tcp_input.c 5 Apr 2025 11:46:19 -0000
@@ -84,6 +84,7 @@
#include <net/if_var.h>
#include <net/route.h>
+#include <netinet/if_ether.h>
#include <netinet/in.h>
#include <netinet/ip.h>
#include <netinet/in_pcb.h>
@@ -4229,4 +4230,256 @@ syn_cache_respond(struct syn_cache *sc,
}
in_pcbunref(inp);
return (error);
+}
+
+int
+tcp_softlro(struct mbuf *mhead, struct mbuf *mtail)
+{
+ struct ether_extracted head;
+ struct ether_extracted tail;
+ struct mbuf *m;
+ unsigned int hdrlen;
+ unsigned int cnt = 0;
+
+ /*
+ * Check if head and tail are mergeable
+ */
+
+ ether_extract_headers(mhead, &head);
+ ether_extract_headers(mtail, &tail);
+
+ /* Don't merge packets inside and outside of VLANs */
+ if (head.evh && tail.evh) {
+ /* Don't merge packets of different VLANs */
+ if (EVL_VLANOFTAG(head.evh->evl_tag) !=
+ EVL_VLANOFTAG(tail.evh->evl_tag))
+ return 0;
+
+ /* Don't merge packets of different priorities */
+ if (EVL_PRIOFTAG(head.evh->evl_tag) !=
+ EVL_PRIOFTAG(tail.evh->evl_tag))
+ return 0;
+ } else if (head.evh || tail.evh)
+ return 0;
+
+ /* Check IP header. */
+ if (head.ip4 && tail.ip4) {
+ /* Don't merge packets with invalid header checksum. */
+ if (!ISSET(mhead->m_pkthdr.csum_flags, M_IPV4_CSUM_IN_OK) ||
+ !ISSET(mtail->m_pkthdr.csum_flags, M_IPV4_CSUM_IN_OK))
+ return 0;
+
+ /* Check IPv4 addresses. */
+ if (head.ip4->ip_src.s_addr != tail.ip4->ip_src.s_addr ||
+ head.ip4->ip_dst.s_addr != tail.ip4->ip_dst.s_addr)
+ return 0;
+
+ /* Don't merge IPv4 fragments. */
+ if (ISSET(head.ip4->ip_off, htons(IP_OFFMASK | IP_MF)) ||
+ ISSET(tail.ip4->ip_off, htons(IP_OFFMASK | IP_MF)))
+ return 0;
+
+ /* Check max. IPv4 length. */
+ if (head.iplen + tail.iplen > IP_MAXPACKET)
+ return 0;
+
+ /* Don't merge IPv4 packets with option headers. */
+ if (head.iphlen != sizeof(struct ip) ||
+ tail.iphlen != sizeof(struct ip))
+ return 0;
+
+ /* Don't non-TCP packets. */
+ if (head.ip4->ip_p != IPPROTO_TCP ||
+ tail.ip4->ip_p != IPPROTO_TCP)
+ return 0;
+ } else if (head.ip6 && tail.ip6) {
+ /* Check IPv6 addresses. */
+ if (!IN6_ARE_ADDR_EQUAL(&head.ip6->ip6_src, &tail.ip6->ip6_src) ||
+ !IN6_ARE_ADDR_EQUAL(&head.ip6->ip6_dst, &tail.ip6->ip6_dst))
+ return 0;
+
+ /* Check max. IPv6 length. */
+ if ((head.iplen - head.iphlen) +
+ (tail.iplen - tail.iphlen) > IPV6_MAXPACKET)
+ return 0;
+
+ /* Don't merge IPv6 packets with option headers nor non-TCP. */
+ if (head.ip6->ip6_nxt != IPPROTO_TCP ||
+ tail.ip6->ip6_nxt != IPPROTO_TCP)
+ return 0;
+ } else {
+ return 0;
+ }
+
+ /* Check TCP header. */
+ if (!head.tcp || !tail.tcp)
+ return 0;
+
+ /* Check TCP ports. */
+ if (head.tcp->th_sport != tail.tcp->th_sport ||
+ head.tcp->th_dport != tail.tcp->th_dport)
+ return 0;
+
+ /* Don't merge empty segments. */
+ if (head.paylen == 0 || tail.paylen == 0)
+ return 0;
+
+ /* Check for continues segments. */
+ if (ntohl(head.tcp->th_seq) + head.paylen != ntohl(tail.tcp->th_seq))
+ return 0;
+
+ /* Just ACK and PUSH TCP flags are allowed. */
+ if (ISSET(head.tcp->th_flags, ~(TH_ACK|TH_PUSH)) ||
+ ISSET(tail.tcp->th_flags, ~(TH_ACK|TH_PUSH)))
+ return 0;
+
+ /* TCP ACK flag has to be set. */
+ if (!ISSET(head.tcp->th_flags, TH_ACK) ||
+ !ISSET(tail.tcp->th_flags, TH_ACK))
+ return 0;
+
+ /* Ignore segments with different TCP options. */
+ if (head.tcphlen - sizeof(struct tcphdr) !=
+ tail.tcphlen - sizeof(struct tcphdr))
+ return 0;
+
+ /* Check for TCP options */
+ if (head.tcphlen > sizeof(struct tcphdr)) {
+ char *hopt = (char *)(head.tcp) + sizeof(struct tcphdr);
+ char *topt = (char *)(tail.tcp) + sizeof(struct tcphdr);
+ int optsize = head.tcphlen - sizeof(struct tcphdr);
+ int optlen;
+
+ for (; optsize > 0; optsize -= optlen) {
+ /* Ignore segments with different TCP options. */
+ if (hopt[0] != topt[0] || hopt[1] != topt[1])
+ return 0;
+
+ /* Get option length */
+ optlen = hopt[1];
+ if (hopt[0] == TCPOPT_NOP)
+ optlen = 1;
+ else if (optlen < 2 || optlen > optsize)
+ return 0; /* Illegal length */
+
+ if (hopt[0] != TCPOPT_NOP &&
+ hopt[0] != TCPOPT_TIMESTAMP)
+ return 0; /* Unsupported TCP option */
+
+ hopt += optlen;
+ topt += optlen;
+ }
+ }
+
+ /* Limit mbuf chain len to avoid m_defrag calls on forwarding. */
+ for (m = mhead; m != NULL; m = m->m_next)
+ if (cnt++ >= 8)
+ return 0;
+ for (m = mtail; m != NULL; m = m->m_next)
+ if (cnt++ >= 8)
+ return 0;
+
+ /*
+ * Prepare concatenation of head and tail.
+ */
+
+ /* Adjust IP header. */
+ if (head.ip4) {
+ head.ip4->ip_len = htons(head.iplen + tail.paylen);
+ } else if (head.ip6) {
+ head.ip6->ip6_plen =
+ htons(head.iplen - head.iphlen + tail.paylen);
+ }
+
+ /* Combine TCP flags from head and tail. */
+ if (ISSET(tail.tcp->th_flags, TH_PUSH))
+ SET(head.tcp->th_flags, TH_PUSH);
+
+ /* Adjust TCP header. */
+ head.tcp->th_win = tail.tcp->th_win;
+ head.tcp->th_ack = tail.tcp->th_ack;
+
+ /* Calculate header length of tail packet. */
+ hdrlen = sizeof(*tail.eh);
+ if (tail.evh)
+ hdrlen = sizeof(*tail.evh);
+ hdrlen += tail.iphlen;
+ hdrlen += tail.tcphlen;
+
+ /* Skip protocol headers in tail. */
+ m_adj(mtail, hdrlen);
+ CLR(mtail->m_flags, M_PKTHDR);
+
+ /* Concatenate */
+ for (m = mhead; m->m_next;)
+ m = m->m_next;
+ m->m_next = mtail;
+ mhead->m_pkthdr.len += tail.paylen;
+
+ /* Flag mbuf as TSO packet with MSS. */
+ if (!ISSET(mhead->m_pkthdr.csum_flags, M_TCP_TSO)) {
+ /* Set CSUM_OUT flags in case of forwarding. */
+ SET(mhead->m_pkthdr.csum_flags, M_TCP_CSUM_OUT);
+ head.tcp->th_sum = 0;
+ if (head.ip4) {
+ SET(mhead->m_pkthdr.csum_flags, M_IPV4_CSUM_OUT);
+ head.ip4->ip_sum = 0;
+ }
+
+ SET(mhead->m_pkthdr.csum_flags, M_TCP_TSO);
+ mhead->m_pkthdr.ph_mss = head.paylen;
+ tcpstat_inc(tcps_inswlro);
+ tcpstat_inc(tcps_inpktlro); /* count head */
+ }
+ mhead->m_pkthdr.ph_mss = MAX(mhead->m_pkthdr.ph_mss, tail.paylen);
+ tcpstat_inc(tcps_inpktlro); /* count tail */
+
+ return 1;
+}
+
+void
+tcp_softlro_enqueue(struct ifnet *ifp, struct mbuf_list *ml, struct mbuf *mtail)
+{
+ struct mbuf *mhead;
+
+ if (!ISSET(ifp->if_xflags, IFXF_LRO))
+ goto out;
+
+ /* Don't merge packets with invalid header checksum. */
+ if (!ISSET(mtail->m_pkthdr.csum_flags, M_TCP_CSUM_IN_OK))
+ goto out;
+
+ for (mhead = ml->ml_head; mhead != NULL; mhead = mhead->m_nextpkt) {
+ /* Don't merge packets with invalid header checksum. */
+ if (!ISSET(mhead->m_pkthdr.csum_flags, M_TCP_CSUM_IN_OK))
+ continue;
+
+ /* Use RSS hash to skip packets of different connections. */
+ if (ISSET(mhead->m_pkthdr.csum_flags, M_FLOWID) &&
+ ISSET(mtail->m_pkthdr.csum_flags, M_FLOWID) &&
+ mhead->m_pkthdr.ph_flowid != mtail->m_pkthdr.ph_flowid)
+ continue;
+
+ /* Don't merge packets inside and outside of VLANs */
+ if (ISSET(mhead->m_flags, M_VLANTAG) !=
+ ISSET(mtail->m_flags, M_VLANTAG))
+ continue;
+
+ if (ISSET(mhead->m_flags, M_VLANTAG)) {
+ /* Don't merge packets of different VLANs */
+ if (EVL_VLANOFTAG(mhead->m_pkthdr.ether_vtag) !=
+ EVL_VLANOFTAG(mtail->m_pkthdr.ether_vtag))
+ continue;
+
+ /* Don't merge packets of different priorities */
+ if (EVL_PRIOFTAG(mhead->m_pkthdr.ether_vtag) !=
+ EVL_PRIOFTAG(mtail->m_pkthdr.ether_vtag))
+ continue;
+ }
+
+ if (tcp_softlro(mhead, mtail))
+ return;
+ }
+ out:
+ ml_enqueue(ml, mtail);
}
Index: netinet/tcp_var.h
===================================================================
RCS file: /cvs/src/sys/netinet/tcp_var.h,v
diff -u -p -r1.186 tcp_var.h
--- netinet/tcp_var.h 2 Mar 2025 21:28:32 -0000 1.186
+++ netinet/tcp_var.h 3 Apr 2025 15:43:55 -0000
@@ -720,6 +720,7 @@ void tcp_init(void);
int tcp_input(struct mbuf **, int *, int, int, struct netstack *);
int tcp_mss(struct tcpcb *, int);
void tcp_mss_update(struct tcpcb *);
+void tcp_softlro_enqueue(struct ifnet *, struct mbuf_list *, struct mbuf *);
u_int tcp_hdrsz(struct tcpcb *);
void tcp_mtudisc(struct inpcb *, int);
void tcp_mtudisc_increase(struct inpcb *, int);