
From: Jan Klemkow <jan@openbsd.org>
Subject: Re: SoftLRO for ixl(4), bnxt(4) and em(4)
To: Alexander Bluhm <bluhm@openbsd.org>
Cc: tech@openbsd.org, Janne Johansson <icepic.dz@gmail.com>, Yuichiro NAITO <naito.yuichiro@gmail.com>
Date: Fri, 4 Apr 2025 19:13:41 +0200

On Thu, Mar 20, 2025 at 05:37:30PM +0100, Alexander Bluhm wrote:
> On Thu, Mar 20, 2025 at 03:20:25PM +0100, Alexander Bluhm wrote:
> > On Tue, Mar 04, 2025 at 08:02:31PM +0100, Jan Klemkow wrote:
> > > On Fri, Nov 15, 2024 at 11:30:08AM GMT, Jan Klemkow wrote:
> > > > On Thu, Nov 07, 2024 at 11:30:26AM GMT, David Gwynne wrote:
> > > > > On Thu, Nov 07, 2024 at 01:10:10AM +0100, Jan Klemkow wrote:
> > > > > > > This diff introduces a software solution for TCP Large Receive Offload
> > > > > > > (SoftLRO) for network interfaces that don't have hardware support for it.
> > > > > > > This is needed at least for newer Intel interfaces, as their
> > > > > > > documentation says that LRO a.k.a. Receive Side Coalescing (RSC) has to
> > > > > > > be done by software.
> > > > > > > This diff coalesces TCP segments during the receive interrupt before
> > > > > > > queueing them.  Thus, our TCP/IP stack has to process fewer packet
> > > > > > > headers per amount of received data.
> > > > > >
> > > > > > I measured receiving performance with Intel XXV710 25 GbE interfaces.
> > > > > > It increased from 6 Gbit/s to 23 Gbit/s.
> > > > > >
> > > > > > > Even if we saturate em(4) without any of these techniques, it's also
> > > > > > > part of this diff.  I'm interested in whether this diff helps to reach
> > > > > > > 1 Gbit/s on old or slow hardware.
> > > > > >
> > > > > > > I also added bnxt(4) to this diff to increase test coverage.  If you want
> > > > > > > to test this implementation with your favorite interface, just replace
> > > > > > > the ml_enqueue() call with the new tcp_softlro_enqueue() (as seen
> > > > > > > below).  It should work with all kinds of network interfaces.
> > > > > > >
> > > > > > > Any comments and test reports are welcome.
> > > > >
> > > > > nice.
> > > > >
> > > > > i would argue this should be ether_softlro_enqueue and put in
> > > > > if_ethersubr.c because it's specific to ethernet interfaces. we don't
> > > > > really have any other type of interface that bundles reception of
> > > > > packets that we can take advantage of like this, and internally it
> > > > > assumes it's pulling ethernet packets apart.
> > > > >
> > > > > aside from that, just a few comments on the code.
> > > >
> > > > I adapted your comments in the diff below.
> > >
> > > I refactored the SoftLRO diff.  You just need to add the flags IFXF_LRO
> > > / IFCAP_LRO, and replace ml_enqueue() with tcp_softlro_enqueue() to
> > > enable this on your favorite network device.
> > >
> > > Janne: I adjusted your diff with correct headers.  But, I'm unable to
> > > test this part of the diff below, due to lack of hardware.  Could you
> > > test it again?
> > >
> > > Yuichiro: Could you also retest your UDP/TCP forwarding test?  I added a
> > > short path for non-TCP packets in the ixl(4) driver.  Maybe it's better
> > > now.
> > >
> > > Further tests and comments are welcome.
> > 
> > As release lock will come soon, we should be careful.  I think we
> > can add the code if we ensure that default behavior does not change
> > anything.
> > 
> > We should start with ixl(4), ifconfig tcplro flag disabled per
> > default.  Then we can improve the TCP code in tree and add other
> > drivers later.  I found some issues like ethernet padding and
> > TCP option parsing that should be improved.  But doing this
> > on top of an initial commit is easier.
> > 
> > With some minor changes I am OK with committing the diff below.
> > 
> > - remove all drivers except ixl(4) from the diff
> > - turn off ifconfig tcplro per default
> > - make sure that ixl(4) logic does not change, if tcplro is off
> > - I have fixed the livelocked logic in ixl(4)
> > - I renamed the function to tcp_enqueue_lro(), it is more consistent
> >   with the TSO functions.
> > 
> > As said before, tcp_softlro() needs more love, but I want to
> > do this in tree.
> > 
> > With my comments, jan@'s diff looks like this and would be OK bluhm@
> 
> Oops, I got a uvm fault in ixl_txeof() with this diff.  Test was
> sending single stream TCP from Linux to Linux machine while OpenBSD
> was forwarding.  ixl(4) LRO was activated.  So it looks like it was
> sending a packet that was previously received with LRO.
> 
> uvm_fault(0xffffffff82ab4180, 0xfffffd90d6fd4ff8, 0, 1) -> e
> kernel: page fault trap, code=0
> Stopped at      ixl_txeof+0x197:        movl    0x8(%rdx,%rcx,1),%ecx
>     TID    PID    UID     PRFLAGS     PFLAGS  CPU  COMMAND
>  314157  74541      0     0x14000      0x200    2  softnet3
>  455485  27135      0     0x14000      0x200    1  softnet2
>  432414  17394      0     0x14000      0x200    3  softnet1
> ixl_txeof(ffff800000188000,ffff800000e1cf00) at ixl_txeof+0x197
> ixl_intr_vector(ffff800000185300) at ixl_intr_vector+0x57
> intr_handler(ffff800032c770c0,ffff800000175000) at intr_handler+0x91
> Xintr_ioapic_edge28_untramp() at Xintr_ioapic_edge28_untramp+0x18f
> acpicpu_idle() at acpicpu_idle+0x239
> sched_idle(ffffffff829f2ff0) at sched_idle+0x298
> end trace frame: 0x0, count: 9
> https://www.openbsd.org/ddb.html describes the minimum info required in bug
> reports.  Insufficient info makes it difficult to find and fix bugs.
> 
> ddb{0}> show panic
> *cpu0: uvm_fault(0xffffffff82ab4180, 0xfffffd90d6fd4ff8, 0, 1) -> e
> 
> ddb{0}> trace
> ixl_txeof(ffff800000188000,ffff800000e1cf00) at ixl_txeof+0x197
> ixl_intr_vector(ffff800000185300) at ixl_intr_vector+0x57
> intr_handler(ffff800032c770c0,ffff800000175000) at intr_handler+0x91
> Xintr_ioapic_edge28_untramp() at Xintr_ioapic_edge28_untramp+0x18f
> acpicpu_idle() at acpicpu_idle+0x239
> sched_idle(ffffffff829f2ff0) at sched_idle+0x298
> end trace frame: 0x0, count: -6
> 
> ddb{0}> show register
> rdi                              0x4
> rsi               0xffff800001f6e5c0
> rbp               0xffff800032c77010
> rbx               0xffff800001f6c000
> rdx               0xfffffd80d6fd5000
> rcx                      0xffffffff0
> rax                            0x192
> r8                               0x8
> r9                                 0
> r10               0x2096f2a745fa867e
> r11               0x26076eed88feaba6
> r12                            0x196
> r13                              0x1
> r14                       0xffffffff
> r15                            0x192
> rip               0xffffffff81d8a047    ixl_txeof+0x197
> cs                               0x8
> rflags                       0x10206    __ALIGN_SIZE+0xf206
> rsp               0xffff800032c76fa0
> ss                              0x10
> ixl_txeof+0x197:        movl    0x8(%rdx,%rcx,1),%ecx
> 
> ddb{0}> ps
>    PID     TID   PPID    UID  S       FLAGS  WAIT          COMMAND
>  38518  412339  93610      0  3    0x10008a  kqread        ssh
>  47141  497841  69680   1000  3    0x100082  kqread        tcpbench
>  69680  482468  41677   1000  3    0x10008a  sigsusp       sh
>  41677  137967      1      0  3    0x100088  sigsusp       ksh
>  93610  431262  70721      0  3        0x82  piperd        perl
>  70721  340494  84931      0  3    0x10008a  sigsusp       ksh
>  84931  281543  76631      0  3        0x98  kqread        sshd-session
>  76631  384998  40558      0  3        0x92  kqread        sshd-session
>   2806  202482      1      0  3    0x100083  ttyin         getty
>  99278   67342      1      0  3    0x100098  kqread        cron
>  14460  108200      1     99  3   0x1100090  kqread        sndiod
>  49145  492411      1    110  3    0x100090  kqread        sndiod
>  88648  284214  17435     95  3   0x1100092  kqread        smtpd
>  85068  231497  17435    103  3   0x1100092  kqread        smtpd
>  48744  446192  17435     95  3   0x1100092  kqread        smtpd
>  81038   44933  17435     95  3    0x100092  kqread        smtpd
>  67852  435382  17435     95  3   0x1100092  kqread        smtpd
>  95510  326790  17435     95  3   0x1100092  kqread        smtpd
>  17435  136031      1      0  3    0x100080  kqread        smtpd
>  66494  299815  44745     91  3        0x92  kqread        snmpd_metrics
>   2427  408672  44745     91  3   0x1100092  kqread        snmpd
>  44745  351156      1      0  3    0x100080  kqread        snmpd
>  40558  490400      1      0  3        0x88  kqread        sshd
>  33095  460368      0      0  3     0x14200  acct          acct
>  86928  490746      0      0  3     0x14280  nfsidl        nfsio
>  40308  501410      0      0  3     0x14280  nfsidl        nfsio
>  14683  369864      0      0  3     0x14280  nfsidl        nfsio
>  86115  363991      0      0  3     0x14280  nfsidl        nfsio
>  80631  145485      1      0  3    0x100080  kqread        ntpd
>  62683  483071  70890     83  3    0x100092  kqread        ntpd
>  70890  270623      1     83  3   0x1100092  kqread        ntpd
>  59526  317744  53327     74  3   0x1100092  bpf           pflogd
>  53327  166856      1      0  3        0x80  sbwait        pflogd
>  46009  192808  60738     73  3   0x1100090  kqread        syslogd
>  60738  257468      1      0  3    0x100082  sbwait        syslogd
>  27089   12018      1      0  3    0x100080  kqread        resolvd
>  34559  118884  58086     77  3    0x100092  kqread        dhcpleased
>  50650   31850  58086     77  3    0x100092  kqread        dhcpleased
>  58086  260708      1      0  3        0x80  kqread        dhcpleased
>  57483   42427  93971    115  3    0x100092  kqread        slaacd
>  37908   42219  93971    115  3    0x100092  kqread        slaacd
>  93971   13019      1      0  3    0x100080  kqread        slaacd
>  82672  132677      0      0  3     0x14200  bored         wsdisplay1
>  85742  374184      0      0  3     0x14200  bored         i915_flip
>  72724  322745      0      0  3     0x14200  bored         i915_modeset
>  36100   66877      0      0  3     0x14200  bored         card0-crtc2
>  71449  154528      0      0  3     0x14200  bored         card0-crtc1
>   4004  449792      0      0  3     0x14200  bored         card0-crtc0
>  66389  398290      0      0  3     0x14200  bored         ttm
>  25629  195671      0      0  3     0x14200  bored         i915-unordered
>  30993    3026      0      0  3     0x14200  bored         i915-dp
>  64044   17892      0      0  3     0x14200  bored         i915
>  50213  284154      0      0  3     0x14200  bored         smr
>  40695  150715      0      0  3     0x14200  pgzero        zerothread
>  33090  235644      0      0  3     0x14200  aiodoned      aiodoned
>  89464   61329      0      0  3     0x14200  syncer        update
>  90507  444068      0      0  3     0x14200  cleaner       cleaner
>  60811   26863      0      0  3     0x14200  reaper        reaper
>  23643  261108      0      0  3     0x14200  pgdaemon      pagedaemon
>  48598  436260      0      0  3     0x14200  usbtsk        usbtask
>  44951  160405      0      0  3     0x14200  usbatsk       usbatsk
>   9085  470607      0      0  3     0x14200  bored         drmtskl
>  51075   14158      0      0  3     0x14200  bored         drmlwq
>  57977  332542      0      0  3     0x14200  bored         drmlwq
>  31093   34074      0      0  3     0x14200  bored         drmlwq
>  10444  303766      0      0  3     0x14200  bored         drmlwq
>  63278  505564      0      0  3     0x14200  bored         drmubwq
>  87295  374846      0      0  3     0x14200  bored         drmubwq
>  13798  142371      0      0  3     0x14200  bored         drmubwq
>  25664  191160      0      0  3     0x14200  bored         drmubwq
>   1920   14491      0      0  3     0x14200  bored         drmhpwq
>  32074  194715      0      0  3     0x14200  bored         drmhpwq
>  87299  408596      0      0  3     0x14200  bored         drmhpwq
>  84700  196185      0      0  3     0x14200  bored         drmhpwq
>  49492  134485      0      0  3     0x14200  bored         drmwq
>  85317   16007      0      0  3     0x14200  bored         drmwq
>  69861  222376      0      0  3     0x14200  bored         drmwq
>  58837  418320      0      0  3     0x14200  bored         drmwq
>  92915  102135      0      0  3  0x40014200  acpi0         acpi0
>  62796  101256      0      0  3  0x40014200                idle3
>  98829  504539      0      0  3  0x40014200                idle2
>  74240    7414      0      0  3  0x40014200                idle1
>  52393   30619      0      0  3     0x14200  bored         sensors
>  74541  314157      0      0  7     0x14200                softnet3
>  27135  455485      0      0  7     0x14200                softnet2
>  17394  432414      0      0  7     0x14200                softnet1
>  67812  482110      0      0  3     0x14200  bored         softnet0
>  69613  421762      0      0  3     0x14200  bored         systqmp
>  69474  244487      0      0  3     0x14200  bored         systq
>   6630  315988      0      0  3     0x14200  tmoslp        softclockmp
>   4182  461805      0      0  3  0x40014200  tmoslp        softclock
> *98424  232380      0      0  7  0x40014200                idle0
>      1  288021      0      0  3        0x82  wait          init
>      0       0     -1      0  3     0x10200  scheduler     swapper
> 
> ddb{0}> x/s version
> version:        OpenBSD 7.7-beta (GENERIC.MP) #cvs : D2025.03.19.00.00.00: Thu Mar 20 17:06:11 CET 2025\012    root@ot41.obsd-lab.genua.de:/usr/src/sys/arch/amd64/compile/GENERIC.MP\012
> 
> Crash is here:
> /home/bluhm/openbsd/cvs/src/sys/dev/pci/if_ixl.c:3032
>     7a3c:       4c 89 f1                mov    %r14,%rcx
>     7a3f:       48 c1 e1 04             shl    $0x4,%rcx
>     7a43:       48 8b 55 98             mov    0xffffffffffffff98(%rbp),%rdx
>     7a47:       8b 4c 0a 08             mov    0x8(%rdx,%rcx,1),%ecx
> /home/bluhm/openbsd/cvs/src/sys/dev/pci/if_ixl.c:3033
> 
>   3027          do {
>   3028                  txm = &txr->txr_maps[cons];
>   3029                  last = txm->txm_eop;
>   3030                  txd = &ring[last];
>   3031
> * 3032                  dtype = txd->cmd & htole64(IXL_TX_DESC_DTYPE_MASK);
>   3033                  if (dtype != htole64(IXL_TX_DESC_DTYPE_DONE))
>   3034                          break;
> 
> Variable last is -1.
> 
> My guess is concatenating the TCP packet with LRO has some bugs.
> Then an illegal packet causes the TSO send path to crash.
> 
> As we disable LRO per default for now, I don't consider this crash
> as a show stopper.

I reproduced and debugged these panics.  We get 2 to 3 panics on
different CPUs: from ixl_txeof(), as bluhm showed above, and from
ixl_rxeof() in ixl_rxfill() while trying to get a new mbuf.
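
As a sanity check on the quoted ddb output, the fault address matches
an index of -1 into the TX ring.  The sketch below is a user-space
model, not driver code: struct tx_desc and txd_cmd_addr() are
hypothetical names, and it assumes a 16-byte descriptor with cmd at
offset 8 (as the shl $0x4 and the 0x8 displacement suggest) and that
last is an unsigned 32-bit index (r14 is zero-extended).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the 16-byte TX descriptor (addr + cmd). */
struct tx_desc {
	uint64_t addr;
	uint64_t cmd;
};

static_assert(sizeof(struct tx_desc) == 16, "descriptor must be 16 bytes");

/*
 * Address the faulting movl reads from: ring base plus the zero-extended
 * 32-bit index scaled by the descriptor size, plus the offset of cmd.
 */
static uint64_t
txd_cmd_addr(uint64_t ring, unsigned int last)
{
	return ring + (uint64_t)last * sizeof(struct tx_desc) +
	    offsetof(struct tx_desc, cmd);
}
```

With ring = 0xfffffd80d6fd5000 (rdx) and last = (unsigned int)-1 (r14)
this yields 0xfffffd90d6fd4ff8, the address reported by uvm_fault().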

While debugging I noticed a lot of m_defrag() calls.  We get them
because the mbuf chains generated by SoftLRO are too long.  ixl(4) NICs
are limited to a maximum of 8 memory chunks per packet.  SoftLRO can
combine 40 or more 1.5k chunks into one chain.  The outgoing ixl(4)
interface has to call m_defrag() to reorganize these chains into fewer,
bigger mbufs.
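
A rough estimate of the chunk count, as a user-space sketch.  It
assumes every segment has the same header and payload size and arrives
in exactly one mbuf, and it models only the IPv4 length check from the
diff (head.iplen + tail.iplen > IP_MAXPACKET).  max_merged_segments()
is a hypothetical helper, not part of the diff.

```c
#include <assert.h>

#ifndef IP_MAXPACKET
#define IP_MAXPACKET	65535	/* maximum IPv4 packet size */
#endif

/*
 * Count how many equally sized segments can be merged into one packet
 * before the IPv4 length check refuses the next one.
 * hdrlen is IP plus TCP header length, paylen is the TCP payload length.
 */
static unsigned int
max_merged_segments(unsigned int hdrlen, unsigned int paylen)
{
	unsigned int iplen = hdrlen + paylen;	/* head holds one segment */
	unsigned int n = 1;

	assert(paylen > 0);
	/* Merge as long as head.iplen + tail.iplen fits. */
	while (iplen + (hdrlen + paylen) <= IP_MAXPACKET) {
		iplen += paylen;	/* only the payload is appended */
		n++;
	}
	return n;
}
```

For 1500 byte IPv4 packets with TCP timestamps (52 bytes of headers,
1448 bytes of payload) this gives 45 segments, and 44 without options
(40 + 1460).  Either way that is far more than 8 chunks.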

When I reduce the maximum number of mbufs per packet to 8, m_defrag()
is never called and we don't run into the panics above.
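
The limit can be modelled in user space like this.  It is a sketch
only: struct mbuf here is a minimal stand-in with just the m_next link,
and chain_fits() is a hypothetical name.  The loops have the same shape
as the chain length check in tcp_softlro() in the diff below.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for the kernel's struct mbuf; only the chain link. */
struct mbuf {
	struct mbuf *m_next;
};

/* Return 1 if head and tail together use at most limit mbufs, else 0. */
static int
chain_fits(const struct mbuf *mhead, const struct mbuf *mtail,
    unsigned int limit)
{
	const struct mbuf *m;
	unsigned int cnt = 0;

	assert(limit > 0);
	/* One shared counter, so the limit applies to the merged chain. */
	for (m = mhead; m != NULL; m = m->m_next)
		if (cnt++ >= limit)
			return 0;
	for (m = mtail; m != NULL; m = m->m_next)
		if (cnt++ >= limit)
			return 0;
	return 1;
}
```

A head chain of 7 mbufs plus a tail of 1 still merges with a limit of
8.  A tail of 2 is refused and gets enqueued as a separate packet.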

I guess there is some kind of race or locking problem in the ring and
mbuf handling of txeof/rxeof, maybe in combination with m_defrag().
Or m_defrag() just changes the timing enough to trigger this bug.

The diff below contains most of bluhm's improvements and code to limit
the created mbuf chain to 8 mbufs.

Thanks,
Jan

Index: dev/pci/if_ixl.c
===================================================================
RCS file: /cvs/src/sys/dev/pci/if_ixl.c,v
diff -u -p -r1.102 if_ixl.c
--- dev/pci/if_ixl.c	30 Oct 2024 18:02:45 -0000	1.102
+++ dev/pci/if_ixl.c	3 Apr 2025 15:45:43 -0000
@@ -883,6 +883,8 @@ struct ixl_rx_wb_desc_16 {
 
 #define IXL_RX_DESC_PTYPE_SHIFT		30
 #define IXL_RX_DESC_PTYPE_MASK		(0xffULL << IXL_RX_DESC_PTYPE_SHIFT)
+#define IXL_RX_DESC_PTYPE_MAC_IPV4_TCP	26
+#define IXL_RX_DESC_PTYPE_MAC_IPV6_TCP	92
 
 #define IXL_RX_DESC_PLEN_SHIFT		38
 #define IXL_RX_DESC_PLEN_MASK		(0x3fffULL << IXL_RX_DESC_PLEN_SHIFT)
@@ -1976,6 +1978,12 @@ ixl_attach(struct device *parent, struct
 	    IFCAP_CSUM_TCPv6 | IFCAP_CSUM_UDPv6;
 	ifp->if_capabilities |= IFCAP_TSOv4 | IFCAP_TSOv6;
 
+	ifp->if_capabilities |= IFCAP_LRO;
+#if notyet
+	/* for now tcplro at ixl(4) is default off */
+	ifp->if_xflags |= IFXF_LRO;
+#endif
+
 	ifmedia_init(&sc->sc_media, 0, ixl_media_change, ixl_media_status);
 
 	ixl_media_add(sc, phy_types);
@@ -3255,9 +3263,11 @@ ixl_rxeof(struct ixl_softc *sc, struct i
 	struct ixl_rx_map *rxm;
 	bus_dmamap_t map;
 	unsigned int cons, prod;
+	struct mbuf_list mltcp = MBUF_LIST_INITIALIZER();
 	struct mbuf_list ml = MBUF_LIST_INITIALIZER();
 	struct mbuf *m;
 	uint64_t word;
+	unsigned int ptype;
 	unsigned int len;
 	unsigned int mask;
 	int done = 0;
@@ -3294,6 +3304,8 @@ ixl_rxeof(struct ixl_softc *sc, struct i
 		m = rxm->rxm_m;
 		rxm->rxm_m = NULL;
 
+		ptype = (word & IXL_RX_DESC_PTYPE_MASK)
+		    >> IXL_RX_DESC_PTYPE_SHIFT;
 		len = (word & IXL_RX_DESC_PLEN_MASK) >> IXL_RX_DESC_PLEN_SHIFT;
 		m->m_len = len;
 		m->m_pkthdr.len = 0;
@@ -3324,7 +3336,13 @@ ixl_rxeof(struct ixl_softc *sc, struct i
 #endif
 
 				ixl_rx_checksum(m, word);
-				ml_enqueue(&ml, m);
+
+				if (ISSET(ifp->if_xflags, IFXF_LRO) &&
+				    (ptype == IXL_RX_DESC_PTYPE_MAC_IPV4_TCP ||
+				     ptype == IXL_RX_DESC_PTYPE_MAC_IPV6_TCP))
+					tcp_softlro_enqueue(ifp, &mltcp, m);
+				else
+					ml_enqueue(&ml, m);
 			} else {
 				ifp->if_ierrors++; /* XXX */
 				m_freem(m);
@@ -3341,8 +3359,14 @@ ixl_rxeof(struct ixl_softc *sc, struct i
 	} while (cons != prod);
 
 	if (done) {
+		int livelocked = 0;
+
 		rxr->rxr_cons = cons;
+		if (ifiq_input(ifiq, &mltcp))
+			livelocked = 1;
 		if (ifiq_input(ifiq, &ml))
+			livelocked = 1;
+		if (livelocked)
 			if_rxr_livelocked(&rxr->rxr_acct);
 		ixl_rxfill(sc, rxr);
 	}
Index: netinet/tcp_input.c
===================================================================
RCS file: /cvs/src/sys/netinet/tcp_input.c,v
diff -u -p -r1.434 tcp_input.c
--- netinet/tcp_input.c	10 Mar 2025 15:11:46 -0000	1.434
+++ netinet/tcp_input.c	4 Apr 2025 07:59:09 -0000
@@ -84,6 +84,7 @@
 #include <net/if_var.h>
 #include <net/route.h>
 
+#include <netinet/if_ether.h>
 #include <netinet/in.h>
 #include <netinet/ip.h>
 #include <netinet/in_pcb.h>
@@ -4229,4 +4230,242 @@ syn_cache_respond(struct syn_cache *sc, 
 	}
 	in_pcbunref(inp);
 	return (error);
+}
+
+int
+tcp_softlro(struct mbuf *mhead, struct mbuf *mtail)
+{
+	struct ether_extracted	 head;
+	struct ether_extracted	 tail;
+	struct mbuf		*m;
+	unsigned int		 hdrlen;
+	unsigned int		 cnt = 0;
+
+	/*
+	 * Check if head and tail are mergeable
+	 */
+
+	ether_extract_headers(mhead, &head);
+	ether_extract_headers(mtail, &tail);
+
+	/* Don't merge packets of different VLANs */
+	if (head.evh && tail.evh) {
+		if (head.evh->evl_tag != tail.evh->evl_tag)
+			return 0;
+	} else if (head.evh || tail.evh)
+		return 0;
+
+	/* Check IP header. */
+	if (head.ip4 && tail.ip4) {
+		/* Don't merge packets with invalid header checksum. */
+		if (!ISSET(mhead->m_pkthdr.csum_flags, M_IPV4_CSUM_IN_OK) ||
+		    !ISSET(mtail->m_pkthdr.csum_flags, M_IPV4_CSUM_IN_OK))
+			return 0;
+
+		/* Check IPv4 addresses. */
+		if (head.ip4->ip_src.s_addr != tail.ip4->ip_src.s_addr ||
+		    head.ip4->ip_dst.s_addr != tail.ip4->ip_dst.s_addr)
+			return 0;
+
+		/* Don't merge IPv4 fragments. */
+		if (ISSET(head.ip4->ip_off, htons(IP_OFFMASK | IP_MF)) ||
+		    ISSET(tail.ip4->ip_off, htons(IP_OFFMASK | IP_MF)))
+			return 0;
+
+		/* Check max. IPv4 length. */
+		if (head.iplen + tail.iplen > IP_MAXPACKET)
+			return 0;
+
+		/* Don't merge IPv4 packets with option headers. */
+		if (head.iphlen != sizeof(struct ip) ||
+		    tail.iphlen != sizeof(struct ip))
+			return 0;
+
+		/* Don't merge non-TCP packets. */
+		if (head.ip4->ip_p != IPPROTO_TCP ||
+		    tail.ip4->ip_p != IPPROTO_TCP)
+			return 0;
+	} else if (head.ip6 && tail.ip6) {
+		/* Check IPv6 addresses. */
+		if (!IN6_ARE_ADDR_EQUAL(&head.ip6->ip6_src, &tail.ip6->ip6_src) ||
+		    !IN6_ARE_ADDR_EQUAL(&head.ip6->ip6_dst, &tail.ip6->ip6_dst))
+			return 0;
+
+		/* Check max. IPv6 length. */
+		if ((head.iplen - head.iphlen) +
+		    (tail.iplen - tail.iphlen) > IPV6_MAXPACKET)
+			return 0;
+
+		/* Don't merge IPv6 packets with option headers nor non-TCP. */
+		if (head.ip6->ip6_nxt != IPPROTO_TCP ||
+		    tail.ip6->ip6_nxt != IPPROTO_TCP)
+			return 0;
+	} else {
+		return 0;
+	}
+
+	/* Check TCP header. */
+	if (!head.tcp || !tail.tcp)
+		return 0;
+
+	/* Check TCP ports. */
+	if (head.tcp->th_sport != tail.tcp->th_sport ||
+	    head.tcp->th_dport != tail.tcp->th_dport)
+		return 0;
+
+	/* Don't merge empty segments. */
+	if (head.paylen == 0 || tail.paylen == 0)
+		return 0;
+
+	/* Check for contiguous segments. */
+	if (ntohl(head.tcp->th_seq) + head.paylen != ntohl(tail.tcp->th_seq))
+		return 0;
+
+	/* Just ACK and PUSH TCP flags are allowed. */
+	if (ISSET(head.tcp->th_flags, ~(TH_ACK|TH_PUSH)) ||
+	    ISSET(tail.tcp->th_flags, ~(TH_ACK|TH_PUSH)))
+		return 0;
+
+	/* TCP ACK flag has to be set. */
+	if (!ISSET(head.tcp->th_flags, TH_ACK) ||
+	    !ISSET(tail.tcp->th_flags, TH_ACK))
+		return 0;
+
+	/* Ignore segments with different TCP options. */
+	if (head.tcphlen - sizeof(struct tcphdr) !=
+	    tail.tcphlen - sizeof(struct tcphdr))
+		return 0;
+
+	/* Check for TCP options */
+	if (head.tcphlen > sizeof(struct tcphdr)) {
+		char *hopt = (char *)(head.tcp) + sizeof(struct tcphdr);
+		char *topt = (char *)(tail.tcp) + sizeof(struct tcphdr);
+		int optsize = head.tcphlen - sizeof(struct tcphdr);
+		int optlen;
+
+		for (; optsize > 0; optsize -= optlen) {
+			/* Ignore segments with different TCP options. */
+			if (hopt[0] != topt[0] || hopt[1] != topt[1])
+				return 0;
+
+			/* Get option length */
+			optlen = hopt[1];
+			if (hopt[0] == TCPOPT_NOP)
+				optlen = 1;
+			else if (optlen < 2 || optlen > optsize)
+				return 0;	/* Illegal length */
+
+			if (hopt[0] != TCPOPT_NOP &&
+			    hopt[0] != TCPOPT_TIMESTAMP)
+				return 0;	/* Unsupported TCP option */
+
+			hopt += optlen;
+			topt += optlen;
+		}
+	}
+
+	/* Limit mbuf chain len to avoid m_defrag calls on forwarding. */
+	for (m = mhead; m != NULL; m = m->m_next)
+		if (cnt++ >= 8)
+			return 0;
+	for (m = mtail; m != NULL; m = m->m_next)
+		if (cnt++ >= 8)
+			return 0;
+
+	/*
+	 * Prepare concatenation of head and tail.
+	 */
+
+	/* Adjust IP header. */
+	if (head.ip4) {
+		head.ip4->ip_len = htons(head.iplen + tail.paylen);
+	} else if (head.ip6) {
+		head.ip6->ip6_plen =
+		    htons(head.iplen - head.iphlen + tail.paylen);
+	}
+
+	/* Combine TCP flags from head and tail. */
+	if (ISSET(tail.tcp->th_flags, TH_PUSH))
+		SET(head.tcp->th_flags, TH_PUSH);
+
+	/* Adjust TCP header. */
+	head.tcp->th_win = tail.tcp->th_win;
+	head.tcp->th_ack = tail.tcp->th_ack;
+
+	/* Calculate header length of tail packet. */
+	hdrlen = sizeof(*tail.eh);
+	if (tail.evh)
+		hdrlen = sizeof(*tail.evh);
+	hdrlen += tail.iphlen;
+	hdrlen += tail.tcphlen;
+
+	/* Skip protocol headers in tail. */
+	m_adj(mtail, hdrlen);
+	CLR(mtail->m_flags, M_PKTHDR);
+
+	/* Concatenate */
+	for (m = mhead; m->m_next;)
+		m = m->m_next;
+	m->m_next = mtail;
+	mhead->m_pkthdr.len += tail.paylen;
+
+	/* Flag mbuf as TSO packet with MSS. */
+	if (!ISSET(mhead->m_pkthdr.csum_flags, M_TCP_TSO)) {
+		/* Set CSUM_OUT flags in case of forwarding. */
+		SET(mhead->m_pkthdr.csum_flags, M_TCP_CSUM_OUT);
+		head.tcp->th_sum = 0;
+		if (head.ip4) {
+			SET(mhead->m_pkthdr.csum_flags, M_IPV4_CSUM_OUT);
+			head.ip4->ip_sum = 0;
+		}
+
+		SET(mhead->m_pkthdr.csum_flags, M_TCP_TSO);
+		mhead->m_pkthdr.ph_mss = head.paylen;
+		tcpstat_inc(tcps_inswlro);
+		tcpstat_inc(tcps_inpktlro);	/* count head */
+	}
+	mhead->m_pkthdr.ph_mss = MAX(mhead->m_pkthdr.ph_mss, tail.paylen);
+	tcpstat_inc(tcps_inpktlro);	/* count tail */
+
+	return 1;
+}
+
+void
+tcp_softlro_enqueue(struct ifnet *ifp, struct mbuf_list *ml, struct mbuf *mtail)
+{
+	struct mbuf *mhead;
+
+	if (!ISSET(ifp->if_xflags, IFXF_LRO))
+		goto out;
+
+	/* Don't merge packets with invalid header checksum. */
+	if (!ISSET(mtail->m_pkthdr.csum_flags, M_TCP_CSUM_IN_OK))
+		goto out;
+
+	for (mhead = ml->ml_head; mhead != NULL; mhead = mhead->m_nextpkt) {
+		/* Don't merge packets with invalid header checksum. */
+		if (!ISSET(mhead->m_pkthdr.csum_flags, M_TCP_CSUM_IN_OK))
+			continue;
+
+		/* Use RSS hash to skip packets of different connections. */
+		if (ISSET(mhead->m_pkthdr.csum_flags, M_FLOWID) &&
+		    ISSET(mtail->m_pkthdr.csum_flags, M_FLOWID) &&
+		    mhead->m_pkthdr.ph_flowid != mtail->m_pkthdr.ph_flowid)
+			continue;
+
+		/* Don't merge packets of different VLANs */
+		if (ISSET(mhead->m_flags, M_VLANTAG) !=
+		    ISSET(mtail->m_flags, M_VLANTAG))
+			continue;
+
+		if (ISSET(mhead->m_flags, M_VLANTAG) &&
+		    EVL_VLANOFTAG(mhead->m_pkthdr.ether_vtag) !=
+		    EVL_VLANOFTAG(mtail->m_pkthdr.ether_vtag))
+			continue;
+
+		if (tcp_softlro(mhead, mtail))
+			return;
+	}
+ out:
+	ml_enqueue(ml, mtail);
 }
Index: netinet/tcp_var.h
===================================================================
RCS file: /cvs/src/sys/netinet/tcp_var.h,v
diff -u -p -r1.186 tcp_var.h
--- netinet/tcp_var.h	2 Mar 2025 21:28:32 -0000	1.186
+++ netinet/tcp_var.h	3 Apr 2025 15:43:55 -0000
@@ -720,6 +720,7 @@ void	 tcp_init(void);
 int	 tcp_input(struct mbuf **, int *, int, int, struct netstack *);
 int	 tcp_mss(struct tcpcb *, int);
 void	 tcp_mss_update(struct tcpcb *);
+void	 tcp_softlro_enqueue(struct ifnet *, struct mbuf_list *, struct mbuf *);
 u_int	 tcp_hdrsz(struct tcpcb *);
 void	 tcp_mtudisc(struct inpcb *, int);
 void	 tcp_mtudisc_increase(struct inpcb *, int);