
From: Jan Klemkow <jan@openbsd.org>
Subject: Re: SoftLRO for ixl(4), bnxt(4) and em(4)
To: Alexander Bluhm <bluhm@openbsd.org>
Cc: tech@openbsd.org, Janne Johansson <icepic.dz@gmail.com>, Yuichiro NAITO <naito.yuichiro@gmail.com>
Date: Sat, 5 Apr 2025 14:22:13 +0200

On Fri, Apr 04, 2025 at 07:13:41PM GMT, Jan Klemkow wrote:
> On Thu, Mar 20, 2025 at 05:37:30PM +0100, Alexander Bluhm wrote:
> > On Thu, Mar 20, 2025 at 03:20:25PM +0100, Alexander Bluhm wrote:
> > > On Tue, Mar 04, 2025 at 08:02:31PM +0100, Jan Klemkow wrote:
> > > > On Fri, Nov 15, 2024 at 11:30:08AM GMT, Jan Klemkow wrote:
> > > > > On Thu, Nov 07, 2024 at 11:30:26AM GMT, David Gwynne wrote:
> > > > > > On Thu, Nov 07, 2024 at 01:10:10AM +0100, Jan Klemkow wrote:
> > > > > > > This diff introduces a software solution for TCP Large Receive Offload
> > > > > > > > (SoftLRO) for network interfaces that don't have hardware support for it.
> > > > > > > > This is needed at least for newer Intel interfaces, as their
> > > > > > > > documentation says that LRO a.k.a. Receive Side Coalescing (RSC) has to
> > > > > > > > be done by software.
> > > > > > > > This diff coalesces TCP segments during the receive interrupt before
> > > > > > > > queueing them.  Thus, our TCP/IP stack has to process fewer packet
> > > > > > > headers per amount of received data.
> > > > > > >
> > > > > > > I measured receiving performance with Intel XXV710 25 GbE interfaces.
> > > > > > > It increased from 6 Gbit/s to 23 Gbit/s.
> > > > > > >
> > > > > > > > Even though we saturate em(4) without any of these techniques, it is
> > > > > > > > also part of this diff.  I'm interested if this diff helps to reach
> > > > > > > > 1 Gbit/s on old or slow hardware.
> > > > > > >
> > > > > > > > I also added bnxt(4) to this diff to increase test coverage.  If you want
> > > > > > > > to test this implementation with your favorite interface, just replace
> > > > > > > > the ml_enqueue() call with the new tcp_softlro_enqueue() (as seen
> > > > > > > > below).  It should work with all kinds of network interfaces.
> > > > > > > >
> > > > > > > > Any comments and test reports are welcome.
> > > > > >
> > > > > > nice.
> > > > > >
> > > > > > i would argue this should be ether_softlro_enqueue and put in
> > > > > > if_ethersubr.c because it's specific to ethernet interfaces. we don't
> > > > > > really have any other type of interface that bundles reception of
> > > > > > packets that we can take advantage of like this, and internally it
> > > > > > assumes it's pulling ethernet packets apart.
> > > > > >
> > > > > > aside from that, just a few comments on the code.
> > > > >
> > > > > > I adopted your comments in the diff below.
> > > >
> > > > > I refactored the SoftLRO diff.  You just need to add the flags IFXF_LRO
> > > > > / IFCAP_LRO, and replace ml_enqueue() with tcp_softlro_enqueue() to
> > > > > enable this on your favorite network device.
> > > >
> > > > Janne: I adjusted your diff with correct headers.  But, I'm unable to
> > > > test this part of the diff below, due to lack of hardware.  Could you
> > > > test it again?
> > > >
> > > > Yuichiro: Could you also retest your UDP/TCP forwarding test?  I added a
> > > > > short path for non-TCP packets in the ixl(4) driver.  Maybe it's better
> > > > > now.
> > > >
> > > > Further tests and comments are welcome.
> > > 
> > > As release lock will come soon, we should be careful.  I think we
> > > can add the code if we ensure that default behavior does not change
> > > anything.
> > > 
> > > We should start with ixl(4), ifconfig tcplro flag disabled per
> > > default.  Then we can improve the TCP code in tree and add other
> > > drivers later.  I found some issues like ethernet padding and
> > > TCP option parsing that should be improved.  But doing this
> > > on top of an initial commit is easier.
> > > 
> > > > With some minor changes I am OK with committing the diff below.
> > > 
> > > - remove all drivers except ixl(4) from the diff
> > > - turn off ifconfig tcplro per default
> > > - make sure that ixl(4) logic does not change, if tcplro is off
> > > - I have fixed the livelocked logic in ixl(4)
> > > - I renamed the function to tcp_enqueue_lro(), it is more consistent
> > >   with the TSO functions.
> > > 
> > > > As said before, tcp_softlro() needs more love, but I want to
> > > > do this in tree.
> > > 
> > > With my comments, jan@'s diff looks like this and would be OK bluhm@
> > 
> > Oops, I got a uvm fault in ixl_txeof() with this diff.  Test was
> > sending single stream TCP from Linux to Linux machine while OpenBSD
> > was forwarding.  ixl(4) LRO was activated.  So it looks like it was
> > sending a packet that was previously received with LRO.
> > 
> > uvm_fault(0xffffffff82ab4180, 0xfffffd90d6fd4ff8, 0, 1) -> e
> > kernel: page fault trap, code=0
> > Stopped at      ixl_txeof+0x197:        movl    0x8(%rdx,%rcx,1),%ecx
> >     TID    PID    UID     PRFLAGS     PFLAGS  CPU  COMMAND
> >  314157  74541      0     0x14000      0x200    2  softnet3
> >  455485  27135      0     0x14000      0x200    1  softnet2
> >  432414  17394      0     0x14000      0x200    3  softnet1
> > ixl_txeof(ffff800000188000,ffff800000e1cf00) at ixl_txeof+0x197
> > ixl_intr_vector(ffff800000185300) at ixl_intr_vector+0x57
> > intr_handler(ffff800032c770c0,ffff800000175000) at intr_handler+0x91
> > Xintr_ioapic_edge28_untramp() at Xintr_ioapic_edge28_untramp+0x18f
> > acpicpu_idle() at acpicpu_idle+0x239
> > sched_idle(ffffffff829f2ff0) at sched_idle+0x298
> > end trace frame: 0x0, count: 9
> > https://www.openbsd.org/ddb.html describes the minimum info required in bug
> > reports.  Insufficient info makes it difficult to find and fix bugs.
> > 
> > ddb{0}> show panic
> > *cpu0: uvm_fault(0xffffffff82ab4180, 0xfffffd90d6fd4ff8, 0, 1) -> e
> > 
> > ddb{0}> trace
> > ixl_txeof(ffff800000188000,ffff800000e1cf00) at ixl_txeof+0x197
> > ixl_intr_vector(ffff800000185300) at ixl_intr_vector+0x57
> > intr_handler(ffff800032c770c0,ffff800000175000) at intr_handler+0x91
> > Xintr_ioapic_edge28_untramp() at Xintr_ioapic_edge28_untramp+0x18f
> > acpicpu_idle() at acpicpu_idle+0x239
> > sched_idle(ffffffff829f2ff0) at sched_idle+0x298
> > end trace frame: 0x0, count: -6
> > 
> > ddb{0}> show register
> > rdi                              0x4
> > rsi               0xffff800001f6e5c0
> > rbp               0xffff800032c77010
> > rbx               0xffff800001f6c000
> > rdx               0xfffffd80d6fd5000
> > rcx                      0xffffffff0
> > rax                            0x192
> > r8                               0x8
> > r9                                 0
> > r10               0x2096f2a745fa867e
> > r11               0x26076eed88feaba6
> > r12                            0x196
> > r13                              0x1
> > r14                       0xffffffff
> > r15                            0x192
> > rip               0xffffffff81d8a047    ixl_txeof+0x197
> > cs                               0x8
> > rflags                       0x10206    __ALIGN_SIZE+0xf206
> > rsp               0xffff800032c76fa0
> > ss                              0x10
> > ixl_txeof+0x197:        movl    0x8(%rdx,%rcx,1),%ecx
> > 
> > ddb{0}> ps
> >    PID     TID   PPID    UID  S       FLAGS  WAIT          COMMAND
> >  38518  412339  93610      0  3    0x10008a  kqread        ssh
> >  47141  497841  69680   1000  3    0x100082  kqread        tcpbench
> >  69680  482468  41677   1000  3    0x10008a  sigsusp       sh
> >  41677  137967      1      0  3    0x100088  sigsusp       ksh
> >  93610  431262  70721      0  3        0x82  piperd        perl
> >  70721  340494  84931      0  3    0x10008a  sigsusp       ksh
> >  84931  281543  76631      0  3        0x98  kqread        sshd-session
> >  76631  384998  40558      0  3        0x92  kqread        sshd-session
> >   2806  202482      1      0  3    0x100083  ttyin         getty
> >  99278   67342      1      0  3    0x100098  kqread        cron
> >  14460  108200      1     99  3   0x1100090  kqread        sndiod
> >  49145  492411      1    110  3    0x100090  kqread        sndiod
> >  88648  284214  17435     95  3   0x1100092  kqread        smtpd
> >  85068  231497  17435    103  3   0x1100092  kqread        smtpd
> >  48744  446192  17435     95  3   0x1100092  kqread        smtpd
> >  81038   44933  17435     95  3    0x100092  kqread        smtpd
> >  67852  435382  17435     95  3   0x1100092  kqread        smtpd
> >  95510  326790  17435     95  3   0x1100092  kqread        smtpd
> >  17435  136031      1      0  3    0x100080  kqread        smtpd
> >  66494  299815  44745     91  3        0x92  kqread        snmpd_metrics
> >   2427  408672  44745     91  3   0x1100092  kqread        snmpd
> >  44745  351156      1      0  3    0x100080  kqread        snmpd
> >  40558  490400      1      0  3        0x88  kqread        sshd
> >  33095  460368      0      0  3     0x14200  acct          acct
> >  86928  490746      0      0  3     0x14280  nfsidl        nfsio
> >  40308  501410      0      0  3     0x14280  nfsidl        nfsio
> >  14683  369864      0      0  3     0x14280  nfsidl        nfsio
> >  86115  363991      0      0  3     0x14280  nfsidl        nfsio
> >  80631  145485      1      0  3    0x100080  kqread        ntpd
> >  62683  483071  70890     83  3    0x100092  kqread        ntpd
> >  70890  270623      1     83  3   0x1100092  kqread        ntpd
> >  59526  317744  53327     74  3   0x1100092  bpf           pflogd
> >  53327  166856      1      0  3        0x80  sbwait        pflogd
> >  46009  192808  60738     73  3   0x1100090  kqread        syslogd
> >  60738  257468      1      0  3    0x100082  sbwait        syslogd
> >  27089   12018      1      0  3    0x100080  kqread        resolvd
> >  34559  118884  58086     77  3    0x100092  kqread        dhcpleased
> >  50650   31850  58086     77  3    0x100092  kqread        dhcpleased
> >  58086  260708      1      0  3        0x80  kqread        dhcpleased
> >  57483   42427  93971    115  3    0x100092  kqread        slaacd
> >  37908   42219  93971    115  3    0x100092  kqread        slaacd
> >  93971   13019      1      0  3    0x100080  kqread        slaacd
> >  82672  132677      0      0  3     0x14200  bored         wsdisplay1
> >  85742  374184      0      0  3     0x14200  bored         i915_flip
> >  72724  322745      0      0  3     0x14200  bored         i915_modeset
> >  36100   66877      0      0  3     0x14200  bored         card0-crtc2
> >  71449  154528      0      0  3     0x14200  bored         card0-crtc1
> >   4004  449792      0      0  3     0x14200  bored         card0-crtc0
> >  66389  398290      0      0  3     0x14200  bored         ttm
> >  25629  195671      0      0  3     0x14200  bored         i915-unordered
> >  30993    3026      0      0  3     0x14200  bored         i915-dp
> >  64044   17892      0      0  3     0x14200  bored         i915
> >  50213  284154      0      0  3     0x14200  bored         smr
> >  40695  150715      0      0  3     0x14200  pgzero        zerothread
> >  33090  235644      0      0  3     0x14200  aiodoned      aiodoned
> >  89464   61329      0      0  3     0x14200  syncer        update
> >  90507  444068      0      0  3     0x14200  cleaner       cleaner
> >  60811   26863      0      0  3     0x14200  reaper        reaper
> >  23643  261108      0      0  3     0x14200  pgdaemon      pagedaemon
> >  48598  436260      0      0  3     0x14200  usbtsk        usbtask
> >  44951  160405      0      0  3     0x14200  usbatsk       usbatsk
> >   9085  470607      0      0  3     0x14200  bored         drmtskl
> >  51075   14158      0      0  3     0x14200  bored         drmlwq
> >  57977  332542      0      0  3     0x14200  bored         drmlwq
> >  31093   34074      0      0  3     0x14200  bored         drmlwq
> >  10444  303766      0      0  3     0x14200  bored         drmlwq
> >  63278  505564      0      0  3     0x14200  bored         drmubwq
> >  87295  374846      0      0  3     0x14200  bored         drmubwq
> >  13798  142371      0      0  3     0x14200  bored         drmubwq
> >  25664  191160      0      0  3     0x14200  bored         drmubwq
> >   1920   14491      0      0  3     0x14200  bored         drmhpwq
> >  32074  194715      0      0  3     0x14200  bored         drmhpwq
> >  87299  408596      0      0  3     0x14200  bored         drmhpwq
> >  84700  196185      0      0  3     0x14200  bored         drmhpwq
> >  49492  134485      0      0  3     0x14200  bored         drmwq
> >  85317   16007      0      0  3     0x14200  bored         drmwq
> >  69861  222376      0      0  3     0x14200  bored         drmwq
> >  58837  418320      0      0  3     0x14200  bored         drmwq
> >  92915  102135      0      0  3  0x40014200  acpi0         acpi0
> >  62796  101256      0      0  3  0x40014200                idle3
> >  98829  504539      0      0  3  0x40014200                idle2
> >  74240    7414      0      0  3  0x40014200                idle1
> >  52393   30619      0      0  3     0x14200  bored         sensors
> >  74541  314157      0      0  7     0x14200                softnet3
> >  27135  455485      0      0  7     0x14200                softnet2
> >  17394  432414      0      0  7     0x14200                softnet1
> >  67812  482110      0      0  3     0x14200  bored         softnet0
> >  69613  421762      0      0  3     0x14200  bored         systqmp
> >  69474  244487      0      0  3     0x14200  bored         systq
> >   6630  315988      0      0  3     0x14200  tmoslp        softclockmp
> >   4182  461805      0      0  3  0x40014200  tmoslp        softclock
> > *98424  232380      0      0  7  0x40014200                idle0
> >      1  288021      0      0  3        0x82  wait          init
> >      0       0     -1      0  3     0x10200  scheduler     swapper
> > 
> > ddb{0}> x/s version
> > version:        OpenBSD 7.7-beta (GENERIC.MP) #cvs : D2025.03.19.00.00.00: Thu Mar 20 17:06:11 CET 2025\012    root@ot41.obsd-lab.genua.de:/usr/src/sys/arch/amd64/compile/GENERIC.MP\012
> > 
> > Crash is here:
> > /home/bluhm/openbsd/cvs/src/sys/dev/pci/if_ixl.c:3032
> >     7a3c:       4c 89 f1                mov    %r14,%rcx
> >     7a3f:       48 c1 e1 04             shl    $0x4,%rcx
> >     7a43:       48 8b 55 98             mov    0xffffffffffffff98(%rbp),%rdx
> >     7a47:       8b 4c 0a 08             mov    0x8(%rdx,%rcx,1),%ecx
> > /home/bluhm/openbsd/cvs/src/sys/dev/pci/if_ixl.c:3033
> > 
> >   3027          do {
> >   3028                  txm = &txr->txr_maps[cons];
> >   3029                  last = txm->txm_eop;
> >   3030                  txd = &ring[last];
> >   3031
> > * 3032                  dtype = txd->cmd & htole64(IXL_TX_DESC_DTYPE_MASK);
> >   3033                  if (dtype != htole64(IXL_TX_DESC_DTYPE_DONE))
> >   3034                          break;
> > 
> > Variable last is -1.
> > 
> > My guess is concatenating the TCP packet with LRO has some bugs.
> > Then an illegal packet causes the TSO send path to crash.
> > 
> > As we disable LRO per default for now, I don't consider this crash
> > as a show stopper.
> 
> > I reproduced and debugged these panics.  We get 2 to 3 panics on
> > different CPUs from ixl_txeof() as bluhm showed above and from
> > ixl_rxeof() in ixl_rxfill() while trying to get a new mbuf.
> > 
> > While debugging I noticed a lot of m_defrag() calls.  We get them
> > due to overly long mbuf chains generated via SoftLRO.  ixl(4) NICs are
> > limited to a max. of 8 memory chunks per packet.  SoftLRO can combine
> > 40 or more 1.5k chunks into one chain.  The outgoing ixl(4)
> > interface has to call m_defrag() to reorganize these chains into fewer
> > mbufs of bigger size.
> > 
> > When I reduced the max. no. of mbufs per packet to 8, m_defrag() is
> > never called and we don't run into the panics above.
> > 
> > I guess there is some kind of race and locking problem in the ring and
> > mbuf handling of txeof/rxeof.  Maybe in combination with the m_defrag()
> > function.  Or m_defrag() just changed the timing enough to trigger this bug.
> > 
> > The diff below contains most of bluhm's improvements and code to limit
> > the created mbuf chain to 8.

The following diff adds checks for VLAN priorities.
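As a rough userland sketch of the tag comparison (the SKETCH_* macros and
sketch_vlan_mergeable() are made up for illustration; they only assume the
usual 802.1Q layout of 3 priority bits, 1 DEI bit and a 12 bit VLAN ID, and
are not the kernel's definitions):

```c
#include <stdint.h>

/*
 * Hypothetical stand-ins for the kernel's EVL_VLANOFTAG() and
 * EVL_PRIOFTAG(), assuming the 802.1Q TCI layout:
 * bits 15-13 priority, bit 12 DEI, bits 11-0 VLAN ID.
 */
#define SKETCH_VLANOFTAG(tag)	((tag) & 0x0fff)
#define SKETCH_PRIOFTAG(tag)	(((tag) >> 13) & 0x7)

/* Return 1 if the VLAN state of head and tail allows merging, else 0. */
static int
sketch_vlan_mergeable(int head_tagged, uint16_t head_tag,
    int tail_tagged, uint16_t tail_tag)
{
	/* One packet inside a VLAN, the other outside: don't merge. */
	if (!head_tagged != !tail_tagged)
		return 0;

	/* Both untagged: nothing else to compare. */
	if (!head_tagged)
		return 1;

	/* Don't merge packets of different VLANs. */
	if (SKETCH_VLANOFTAG(head_tag) != SKETCH_VLANOFTAG(tail_tag))
		return 0;

	/* Don't merge packets of different priorities. */
	if (SKETCH_PRIOFTAG(head_tag) != SKETCH_PRIOFTAG(tail_tag))
		return 0;

	return 1;
}
```

Like the diff, this sketch ignores the DEI bit, so two tags that differ
only in that bit still count as mergeable.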

Thanks,
Jan

Index: dev/pci/if_ixl.c
===================================================================
RCS file: /cvs/src/sys/dev/pci/if_ixl.c,v
diff -u -p -r1.102 if_ixl.c
--- dev/pci/if_ixl.c	30 Oct 2024 18:02:45 -0000	1.102
+++ dev/pci/if_ixl.c	3 Apr 2025 15:45:43 -0000
@@ -883,6 +883,8 @@ struct ixl_rx_wb_desc_16 {
 
 #define IXL_RX_DESC_PTYPE_SHIFT		30
 #define IXL_RX_DESC_PTYPE_MASK		(0xffULL << IXL_RX_DESC_PTYPE_SHIFT)
+#define IXL_RX_DESC_PTYPE_MAC_IPV4_TCP	26
+#define IXL_RX_DESC_PTYPE_MAC_IPV6_TCP	92
 
 #define IXL_RX_DESC_PLEN_SHIFT		38
 #define IXL_RX_DESC_PLEN_MASK		(0x3fffULL << IXL_RX_DESC_PLEN_SHIFT)
@@ -1976,6 +1978,12 @@ ixl_attach(struct device *parent, struct
 	    IFCAP_CSUM_TCPv6 | IFCAP_CSUM_UDPv6;
 	ifp->if_capabilities |= IFCAP_TSOv4 | IFCAP_TSOv6;
 
+	ifp->if_capabilities |= IFCAP_LRO;
+#if notyet
+	/* for now tcplro at ixl(4) is default off */
+	ifp->if_xflags |= IFXF_LRO;
+#endif
+
 	ifmedia_init(&sc->sc_media, 0, ixl_media_change, ixl_media_status);
 
 	ixl_media_add(sc, phy_types);
@@ -3255,9 +3263,11 @@ ixl_rxeof(struct ixl_softc *sc, struct i
 	struct ixl_rx_map *rxm;
 	bus_dmamap_t map;
 	unsigned int cons, prod;
+	struct mbuf_list mltcp = MBUF_LIST_INITIALIZER();
 	struct mbuf_list ml = MBUF_LIST_INITIALIZER();
 	struct mbuf *m;
 	uint64_t word;
+	unsigned int ptype;
 	unsigned int len;
 	unsigned int mask;
 	int done = 0;
@@ -3294,6 +3304,8 @@ ixl_rxeof(struct ixl_softc *sc, struct i
 		m = rxm->rxm_m;
 		rxm->rxm_m = NULL;
 
+		ptype = (word & IXL_RX_DESC_PTYPE_MASK)
+		    >> IXL_RX_DESC_PTYPE_SHIFT;
 		len = (word & IXL_RX_DESC_PLEN_MASK) >> IXL_RX_DESC_PLEN_SHIFT;
 		m->m_len = len;
 		m->m_pkthdr.len = 0;
@@ -3324,7 +3336,13 @@ ixl_rxeof(struct ixl_softc *sc, struct i
 #endif
 
 				ixl_rx_checksum(m, word);
-				ml_enqueue(&ml, m);
+
+				if (ISSET(ifp->if_xflags, IFXF_LRO) &&
+				    (ptype == IXL_RX_DESC_PTYPE_MAC_IPV4_TCP ||
+				     ptype == IXL_RX_DESC_PTYPE_MAC_IPV6_TCP))
+					tcp_softlro_enqueue(ifp, &mltcp, m);
+				else
+					ml_enqueue(&ml, m);
 			} else {
 				ifp->if_ierrors++; /* XXX */
 				m_freem(m);
@@ -3341,8 +3359,14 @@ ixl_rxeof(struct ixl_softc *sc, struct i
 	} while (cons != prod);
 
 	if (done) {
+		int livelocked = 0;
+
 		rxr->rxr_cons = cons;
+		if (ifiq_input(ifiq, &mltcp))
+			livelocked = 1;
 		if (ifiq_input(ifiq, &ml))
+			livelocked = 1;
+		if (livelocked)
 			if_rxr_livelocked(&rxr->rxr_acct);
 		ixl_rxfill(sc, rxr);
 	}
Index: netinet/tcp_input.c
===================================================================
RCS file: /cvs/src/sys/netinet/tcp_input.c,v
diff -u -p -r1.434 tcp_input.c
--- netinet/tcp_input.c	10 Mar 2025 15:11:46 -0000	1.434
+++ netinet/tcp_input.c	5 Apr 2025 11:46:19 -0000
@@ -84,6 +84,7 @@
 #include <net/if_var.h>
 #include <net/route.h>
 
+#include <netinet/if_ether.h>
 #include <netinet/in.h>
 #include <netinet/ip.h>
 #include <netinet/in_pcb.h>
@@ -4229,4 +4230,256 @@ syn_cache_respond(struct syn_cache *sc, 
 	}
 	in_pcbunref(inp);
 	return (error);
+}
+
+int
+tcp_softlro(struct mbuf *mhead, struct mbuf *mtail)
+{
+	struct ether_extracted	 head;
+	struct ether_extracted	 tail;
+	struct mbuf		*m;
+	unsigned int		 hdrlen;
+	unsigned int		 cnt = 0;
+
+	/*
+	 * Check if head and tail are mergeable
+	 */
+
+	ether_extract_headers(mhead, &head);
+	ether_extract_headers(mtail, &tail);
+
+	/* Don't merge packets inside and outside of VLANs */
+	if (head.evh && tail.evh) {
+		/* Don't merge packets of different VLANs */
+		if (EVL_VLANOFTAG(head.evh->evl_tag) !=
+		    EVL_VLANOFTAG(tail.evh->evl_tag))
+			return 0;
+
+		/* Don't merge packets of different priorities */
+		if (EVL_PRIOFTAG(head.evh->evl_tag) !=
+		    EVL_PRIOFTAG(tail.evh->evl_tag))
+			return 0;
+	} else if (head.evh || tail.evh)
+		return 0;
+
+	/* Check IP header. */
+	if (head.ip4 && tail.ip4) {
+		/* Don't merge packets with invalid header checksum. */
+		if (!ISSET(mhead->m_pkthdr.csum_flags, M_IPV4_CSUM_IN_OK) ||
+		    !ISSET(mtail->m_pkthdr.csum_flags, M_IPV4_CSUM_IN_OK))
+			return 0;
+
+		/* Check IPv4 addresses. */
+		if (head.ip4->ip_src.s_addr != tail.ip4->ip_src.s_addr ||
+		    head.ip4->ip_dst.s_addr != tail.ip4->ip_dst.s_addr)
+			return 0;
+
+		/* Don't merge IPv4 fragments. */
+		if (ISSET(head.ip4->ip_off, htons(IP_OFFMASK | IP_MF)) ||
+		    ISSET(tail.ip4->ip_off, htons(IP_OFFMASK | IP_MF)))
+			return 0;
+
+		/* Check max. IPv4 length. */
+		if (head.iplen + tail.iplen > IP_MAXPACKET)
+			return 0;
+
+		/* Don't merge IPv4 packets with option headers. */
+		if (head.iphlen != sizeof(struct ip) ||
+		    tail.iphlen != sizeof(struct ip))
+			return 0;
+
+		/* Don't merge non-TCP packets. */
+		if (head.ip4->ip_p != IPPROTO_TCP ||
+		    tail.ip4->ip_p != IPPROTO_TCP)
+			return 0;
+	} else if (head.ip6 && tail.ip6) {
+		/* Check IPv6 addresses. */
+		if (!IN6_ARE_ADDR_EQUAL(&head.ip6->ip6_src, &tail.ip6->ip6_src) ||
+		    !IN6_ARE_ADDR_EQUAL(&head.ip6->ip6_dst, &tail.ip6->ip6_dst))
+			return 0;
+
+		/* Check max. IPv6 length. */
+		if ((head.iplen - head.iphlen) +
+		    (tail.iplen - tail.iphlen) > IPV6_MAXPACKET)
+			return 0;
+
+		/* Don't merge IPv6 packets with option headers nor non-TCP. */
+		if (head.ip6->ip6_nxt != IPPROTO_TCP ||
+		    tail.ip6->ip6_nxt != IPPROTO_TCP)
+			return 0;
+	} else {
+		return 0;
+	}
+
+	/* Check TCP header. */
+	if (!head.tcp || !tail.tcp)
+		return 0;
+
+	/* Check TCP ports. */
+	if (head.tcp->th_sport != tail.tcp->th_sport ||
+	    head.tcp->th_dport != tail.tcp->th_dport)
+		return 0;
+
+	/* Don't merge empty segments. */
+	if (head.paylen == 0 || tail.paylen == 0)
+		return 0;
+
+	/* Check for contiguous segments. */
+	if (ntohl(head.tcp->th_seq) + head.paylen != ntohl(tail.tcp->th_seq))
+		return 0;
+
+	/* Just ACK and PUSH TCP flags are allowed. */
+	if (ISSET(head.tcp->th_flags, ~(TH_ACK|TH_PUSH)) ||
+	    ISSET(tail.tcp->th_flags, ~(TH_ACK|TH_PUSH)))
+		return 0;
+
+	/* TCP ACK flag has to be set. */
+	if (!ISSET(head.tcp->th_flags, TH_ACK) ||
+	    !ISSET(tail.tcp->th_flags, TH_ACK))
+		return 0;
+
+	/* Ignore segments with different TCP options. */
+	if (head.tcphlen - sizeof(struct tcphdr) !=
+	    tail.tcphlen - sizeof(struct tcphdr))
+		return 0;
+
+	/* Check for TCP options */
+	if (head.tcphlen > sizeof(struct tcphdr)) {
+		char *hopt = (char *)(head.tcp) + sizeof(struct tcphdr);
+		char *topt = (char *)(tail.tcp) + sizeof(struct tcphdr);
+		int optsize = head.tcphlen - sizeof(struct tcphdr);
+		int optlen;
+
+		for (; optsize > 0; optsize -= optlen) {
+			/* Ignore segments with different TCP options. */
+			if (hopt[0] != topt[0] || hopt[1] != topt[1])
+				return 0;
+
+			/* Get option length */
+			optlen = hopt[1];
+			if (hopt[0] == TCPOPT_NOP)
+				optlen = 1;
+			else if (optlen < 2 || optlen > optsize)
+				return 0;	/* Illegal length */
+
+			if (hopt[0] != TCPOPT_NOP &&
+			    hopt[0] != TCPOPT_TIMESTAMP)
+				return 0;	/* Unsupported TCP option */
+
+			hopt += optlen;
+			topt += optlen;
+		}
+	}
+
+	/* Limit mbuf chain len to avoid m_defrag calls on forwarding. */
+	for (m = mhead; m != NULL; m = m->m_next)
+		if (cnt++ >= 8)
+			return 0;
+	for (m = mtail; m != NULL; m = m->m_next)
+		if (cnt++ >= 8)
+			return 0;
+
+	/*
+	 * Prepare concatenation of head and tail.
+	 */
+
+	/* Adjust IP header. */
+	if (head.ip4) {
+		head.ip4->ip_len = htons(head.iplen + tail.paylen);
+	} else if (head.ip6) {
+		head.ip6->ip6_plen =
+		    htons(head.iplen - head.iphlen + tail.paylen);
+	}
+
+	/* Combine TCP flags from head and tail. */
+	if (ISSET(tail.tcp->th_flags, TH_PUSH))
+		SET(head.tcp->th_flags, TH_PUSH);
+
+	/* Adjust TCP header. */
+	head.tcp->th_win = tail.tcp->th_win;
+	head.tcp->th_ack = tail.tcp->th_ack;
+
+	/* Calculate header length of tail packet. */
+	hdrlen = sizeof(*tail.eh);
+	if (tail.evh)
+		hdrlen = sizeof(*tail.evh);
+	hdrlen += tail.iphlen;
+	hdrlen += tail.tcphlen;
+
+	/* Skip protocol headers in tail. */
+	m_adj(mtail, hdrlen);
+	CLR(mtail->m_flags, M_PKTHDR);
+
+	/* Concatenate */
+	for (m = mhead; m->m_next;)
+		m = m->m_next;
+	m->m_next = mtail;
+	mhead->m_pkthdr.len += tail.paylen;
+
+	/* Flag mbuf as TSO packet with MSS. */
+	if (!ISSET(mhead->m_pkthdr.csum_flags, M_TCP_TSO)) {
+		/* Set CSUM_OUT flags in case of forwarding. */
+		SET(mhead->m_pkthdr.csum_flags, M_TCP_CSUM_OUT);
+		head.tcp->th_sum = 0;
+		if (head.ip4) {
+			SET(mhead->m_pkthdr.csum_flags, M_IPV4_CSUM_OUT);
+			head.ip4->ip_sum = 0;
+		}
+
+		SET(mhead->m_pkthdr.csum_flags, M_TCP_TSO);
+		mhead->m_pkthdr.ph_mss = head.paylen;
+		tcpstat_inc(tcps_inswlro);
+		tcpstat_inc(tcps_inpktlro);	/* count head */
+	}
+	mhead->m_pkthdr.ph_mss = MAX(mhead->m_pkthdr.ph_mss, tail.paylen);
+	tcpstat_inc(tcps_inpktlro);	/* count tail */
+
+	return 1;
+}
+
+void
+tcp_softlro_enqueue(struct ifnet *ifp, struct mbuf_list *ml, struct mbuf *mtail)
+{
+	struct mbuf *mhead;
+
+	if (!ISSET(ifp->if_xflags, IFXF_LRO))
+		goto out;
+
+	/* Don't merge packets with invalid header checksum. */
+	if (!ISSET(mtail->m_pkthdr.csum_flags, M_TCP_CSUM_IN_OK))
+		goto out;
+
+	for (mhead = ml->ml_head; mhead != NULL; mhead = mhead->m_nextpkt) {
+		/* Don't merge packets with invalid header checksum. */
+		if (!ISSET(mhead->m_pkthdr.csum_flags, M_TCP_CSUM_IN_OK))
+			continue;
+
+		/* Use RSS hash to skip packets of different connections. */
+		if (ISSET(mhead->m_pkthdr.csum_flags, M_FLOWID) &&
+		    ISSET(mtail->m_pkthdr.csum_flags, M_FLOWID) &&
+		    mhead->m_pkthdr.ph_flowid != mtail->m_pkthdr.ph_flowid)
+			continue;
+
+		/* Don't merge packets inside and outside of VLANs */
+		if (ISSET(mhead->m_flags, M_VLANTAG) !=
+		    ISSET(mtail->m_flags, M_VLANTAG))
+			continue;
+
+		if (ISSET(mhead->m_flags, M_VLANTAG)) {
+			/* Don't merge packets of different VLANs */
+			if (EVL_VLANOFTAG(mhead->m_pkthdr.ether_vtag) !=
+			    EVL_VLANOFTAG(mtail->m_pkthdr.ether_vtag))
+				continue;
+
+			/* Don't merge packets of different priorities */
+			if (EVL_PRIOFTAG(mhead->m_pkthdr.ether_vtag) !=
+			    EVL_PRIOFTAG(mtail->m_pkthdr.ether_vtag))
+				continue;
+		}
+
+		if (tcp_softlro(mhead, mtail))
+			return;
+	}
+ out:
+	ml_enqueue(ml, mtail);
 }
Index: netinet/tcp_var.h
===================================================================
RCS file: /cvs/src/sys/netinet/tcp_var.h,v
diff -u -p -r1.186 tcp_var.h
--- netinet/tcp_var.h	2 Mar 2025 21:28:32 -0000	1.186
+++ netinet/tcp_var.h	3 Apr 2025 15:43:55 -0000
@@ -720,6 +720,7 @@ void	 tcp_init(void);
 int	 tcp_input(struct mbuf **, int *, int, int, struct netstack *);
 int	 tcp_mss(struct tcpcb *, int);
 void	 tcp_mss_update(struct tcpcb *);
+void	 tcp_softlro_enqueue(struct ifnet *, struct mbuf_list *, struct mbuf *);
 u_int	 tcp_hdrsz(struct tcpcb *);
 void	 tcp_mtudisc(struct inpcb *, int);
 void	 tcp_mtudisc_increase(struct inpcb *, int);
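For anyone who wants to play with the chain-length limit outside the
kernel, here is a minimal userland sketch of the counting loop from
tcp_softlro().  struct sketch_buf and sketch_chain_fits() are hypothetical
names used only for this example; the struct stands in for struct mbuf and
keeps nothing but the m_next style link.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct mbuf; only the chain link matters. */
struct sketch_buf {
	struct sketch_buf *next;
};

/*
 * Return 1 if the head and tail chains together consist of at most
 * max buffers, 0 otherwise.  The counter runs across both chains, as
 * in the two loops of tcp_softlro().
 */
static int
sketch_chain_fits(const struct sketch_buf *head,
    const struct sketch_buf *tail, unsigned int max)
{
	const struct sketch_buf *b;
	unsigned int cnt = 0;

	for (b = head; b != NULL; b = b->next)
		if (cnt++ >= max)
			return 0;
	for (b = tail; b != NULL; b = b->next)
		if (cnt++ >= max)
			return 0;

	return 1;
}
```

With max set to 8, a head chain of 6 buffers and a tail chain of 2 still
fits, while a tail chain of 3 is rejected and the tail would be enqueued
as a separate packet.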