From: Jan Klemkow Subject: Re: SoftLRO for ixl(4), bnxt(4) and em(4) To: Alexander Bluhm Cc: tech@openbsd.org, Janne Johansson , Yuichiro NAITO Date: Fri, 4 Apr 2025 19:13:41 +0200 On Thu, Mar 20, 2025 at 05:37:30PM +0100, Alexander Bluhm wrote: > On Thu, Mar 20, 2025 at 03:20:25PM +0100, Alexander Bluhm wrote: > > On Tue, Mar 04, 2025 at 08:02:31PM +0100, Jan Klemkow wrote: > > > On Fri, Nov 15, 2024 at 11:30:08AM GMT, Jan Klemkow wrote: > > > > On Thu, Nov 07, 2024 at 11:30:26AM GMT, David Gwynne wrote: > > > > > On Thu, Nov 07, 2024 at 01:10:10AM +0100, Jan Klemkow wrote: > > > > > > This diff introduces a software solution for TCP Large Receive Offload > > > > > > (SoftLRO) for network interfaces don't hat hardware support for it. > > > > > > This is needes at least for newer Intel interfaces as their > > > > > > documentation said that LRO a.k.a. Receive Side Coalescing (RSC) has to > > > > > > be done by software. > > > > > > This diff coalesces TCP segments during the receive interrupt before > > > > > > queueing them. Thus, our TCP/IP stack has to process less packet > > > > > > headers per amount of received data. > > > > > > > > > > > > I measured receiving performance with Intel XXV710 25 GbE interfaces. > > > > > > It increased from 6 Gbit/s to 23 Gbit/s. > > > > > > > > > > > > Even if we saturate em(4) without any of these technique its also part > > > > > > this diff. I'm interested if this diff helps to reach 1 Gbit/s on old > > > > > > or slow hardware. > > > > > > > > > > > > I also add bnxt(4) to this diff to increase test coverage. If you want > > > > > > to tests this implementation with your favorite interface, just replace > > > > > > the ml_enqueue() call with the new tcp_softlro_enqueue() (as seen > > > > > > below). It should work with all kind network interfaces. > > > > > > > > > > > > Any comments and tests reports are welcome. > > > > > > > > > > nice. 
> > > > > > > > > > i would argue this should be ether_softlro_enqueue and put in > > > > > if_ethersubr.c because it's specific to ethernet interfaces. we don't > > > > > really have any other type of interface that bundles reception of > > > > > packets that we can take advantage of like this, and internally it > > > > > assumes it's pulling ethernet packets apart. > > > > > > > > > > aside from that, just a few comments on the code. > > > > > > > > I adapted your comments in the diff below. > > > > > > I refactored the SoftLRO diff. You just need to add the flags IFXF_LRO > > > / IFCAP_LRO, and repalce ml_enqueue() with tcp_softlro_enqueue() to > > > enable this on you favorit network device. > > > > > > Janne: I adjusted your diff with correct headers. But, I'm unable to > > > test this part of the diff below, due to lack of hardware. Could you > > > test it again? > > > > > > Yuichiro: Could you also retest your UDP/TCP forwarding test? I added a > > > short path for non-TCP packets in the ixl(4) driver. Maybe its better > > > now. > > > > > > Further tests and comments are welcome. > > > > As release lock will come soon, we should be careful. I think we > > can add the code if we ensure that default behavior does not change > > anything. > > > > We should start with ixl(4), ifconfig tcplro flag disabled per > > default. Then we can improve the TCP code in tree and add other > > drivers later. I found some issues like ethernet padding and > > TCP option parsing that should be improved. But doing this > > on top of an initial commit is easier. > > > > With some minor changes I am OK with commiting the diff below. > > > > - remove all drivers except ixl(4) from the diff > > - turn off ifconfig tcplro per default > > - make sure that ixl(4) logic does not change, if tcplro is off > > - I have fixed the livelocked logic in ixl(4) > > - I renamed the function to tcp_enqueue_lro(), it is more consistent > > with the TSO functions. 
> > > > As said before, tcp_softlro() needs more love, but it want to > > do this in tree. > > > > With my comments, jan@'s diff looks like this and would be OK bluhm@ > > Oops, I got a uvm fault in ixl_txeof() with this diff. Test was > sending single stream TCP from Linux to Linux machine while OpenBSD > was forwarding. ixl(4) LRO was activated. So it looks like it was > sending a packet that was previously received with LRO. > > uvm_fault(0xffffffff82ab4180, 0xfffffd90d6fd4ff8, 0, 1) -> e > kernel: page fault trap, code=0 > Stopped at ixl_txeof+0x197: movl 0x8(%rdx,%rcx,1),%ecx > TID PID UID PRFLAGS PFLAGS CPU COMMAND > 314157 74541 0 0x14000 0x200 2 softnet3 > 455485 27135 0 0x14000 0x200 1 softnet2 > 432414 17394 0 0x14000 0x200 3 softnet1 > ixl_txeof(ffff800000188000,ffff800000e1cf00) at ixl_txeof+0x197 > ixl_intr_vector(ffff800000185300) at ixl_intr_vector+0x57 > intr_handler(ffff800032c770c0,ffff800000175000) at intr_handler+0x91 > Xintr_ioapic_edge28_untramp() at Xintr_ioapic_edge28_untramp+0x18f > acpicpu_idle() at acpicpu_idle+0x239 > sched_idle(ffffffff829f2ff0) at sched_idle+0x298 > end trace frame: 0x0, count: 9 > https://www.openbsd.org/ddb.html describes the minimum info required in bug > reports. Insufficient info makes it difficult to find and fix bugs. 
> > ddb{0}> show panic > *cpu0: uvm_fault(0xffffffff82ab4180, 0xfffffd90d6fd4ff8, 0, 1) -> e > > ddb{0}> trace > ixl_txeof(ffff800000188000,ffff800000e1cf00) at ixl_txeof+0x197 > ixl_intr_vector(ffff800000185300) at ixl_intr_vector+0x57 > intr_handler(ffff800032c770c0,ffff800000175000) at intr_handler+0x91 > Xintr_ioapic_edge28_untramp() at Xintr_ioapic_edge28_untramp+0x18f > acpicpu_idle() at acpicpu_idle+0x239 > sched_idle(ffffffff829f2ff0) at sched_idle+0x298 > end trace frame: 0x0, count: -6 > > ddb{0}> show register > rdi 0x4 > rsi 0xffff800001f6e5c0 > rbp 0xffff800032c77010 > rbx 0xffff800001f6c000 > rdx 0xfffffd80d6fd5000 > rcx 0xffffffff0 > rax 0x192 > r8 0x8 > r9 0 > r10 0x2096f2a745fa867e > r11 0x26076eed88feaba6 > r12 0x196 > r13 0x1 > r14 0xffffffff > r15 0x192 > rip 0xffffffff81d8a047 ixl_txeof+0x197 > cs 0x8 > rflags 0x10206 __ALIGN_SIZE+0xf206 > rsp 0xffff800032c76fa0 > ss 0x10 > ixl_txeof+0x197: movl 0x8(%rdx,%rcx,1),%ecx > > ddb{0}> ps > PID TID PPID UID S FLAGS WAIT COMMAND > 38518 412339 93610 0 3 0x10008a kqread ssh > 47141 497841 69680 1000 3 0x100082 kqread tcpbench > 69680 482468 41677 1000 3 0x10008a sigsusp sh > 41677 137967 1 0 3 0x100088 sigsusp ksh > 93610 431262 70721 0 3 0x82 piperd perl > 70721 340494 84931 0 3 0x10008a sigsusp ksh > 84931 281543 76631 0 3 0x98 kqread sshd-session > 76631 384998 40558 0 3 0x92 kqread sshd-session > 2806 202482 1 0 3 0x100083 ttyin getty > 99278 67342 1 0 3 0x100098 kqread cron > 14460 108200 1 99 3 0x1100090 kqread sndiod > 49145 492411 1 110 3 0x100090 kqread sndiod > 88648 284214 17435 95 3 0x1100092 kqread smtpd > 85068 231497 17435 103 3 0x1100092 kqread smtpd > 48744 446192 17435 95 3 0x1100092 kqread smtpd > 81038 44933 17435 95 3 0x100092 kqread smtpd > 67852 435382 17435 95 3 0x1100092 kqread smtpd > 95510 326790 17435 95 3 0x1100092 kqread smtpd > 17435 136031 1 0 3 0x100080 kqread smtpd > 66494 299815 44745 91 3 0x92 kqread snmpd_metrics > 2427 408672 44745 91 3 0x1100092 kqread snmpd > 
44745 351156 1 0 3 0x100080 kqread snmpd > 40558 490400 1 0 3 0x88 kqread sshd > 33095 460368 0 0 3 0x14200 acct acct > 86928 490746 0 0 3 0x14280 nfsidl nfsio > 40308 501410 0 0 3 0x14280 nfsidl nfsio > 14683 369864 0 0 3 0x14280 nfsidl nfsio > 86115 363991 0 0 3 0x14280 nfsidl nfsio > 80631 145485 1 0 3 0x100080 kqread ntpd > 62683 483071 70890 83 3 0x100092 kqread ntpd > 70890 270623 1 83 3 0x1100092 kqread ntpd > 59526 317744 53327 74 3 0x1100092 bpf pflogd > 53327 166856 1 0 3 0x80 sbwait pflogd > 46009 192808 60738 73 3 0x1100090 kqread syslogd > 60738 257468 1 0 3 0x100082 sbwait syslogd > 27089 12018 1 0 3 0x100080 kqread resolvd > 34559 118884 58086 77 3 0x100092 kqread dhcpleased > 50650 31850 58086 77 3 0x100092 kqread dhcpleased > 58086 260708 1 0 3 0x80 kqread dhcpleased > 57483 42427 93971 115 3 0x100092 kqread slaacd > 37908 42219 93971 115 3 0x100092 kqread slaacd > 93971 13019 1 0 3 0x100080 kqread slaacd > 82672 132677 0 0 3 0x14200 bored wsdisplay1 > 85742 374184 0 0 3 0x14200 bored i915_flip > 72724 322745 0 0 3 0x14200 bored i915_modeset > 36100 66877 0 0 3 0x14200 bored card0-crtc2 > 71449 154528 0 0 3 0x14200 bored card0-crtc1 > 4004 449792 0 0 3 0x14200 bored card0-crtc0 > 66389 398290 0 0 3 0x14200 bored ttm > 25629 195671 0 0 3 0x14200 bored i915-unordered > 30993 3026 0 0 3 0x14200 bored i915-dp > 64044 17892 0 0 3 0x14200 bored i915 > 50213 284154 0 0 3 0x14200 bored smr > 40695 150715 0 0 3 0x14200 pgzero zerothread > 33090 235644 0 0 3 0x14200 aiodoned aiodoned > 89464 61329 0 0 3 0x14200 syncer update > 90507 444068 0 0 3 0x14200 cleaner cleaner > 60811 26863 0 0 3 0x14200 reaper reaper > 23643 261108 0 0 3 0x14200 pgdaemon pagedaemon > 48598 436260 0 0 3 0x14200 usbtsk usbtask > 44951 160405 0 0 3 0x14200 usbatsk usbatsk > 9085 470607 0 0 3 0x14200 bored drmtskl > 51075 14158 0 0 3 0x14200 bored drmlwq > 57977 332542 0 0 3 0x14200 bored drmlwq > 31093 34074 0 0 3 0x14200 bored drmlwq > 10444 303766 0 0 3 0x14200 bored drmlwq > 63278 
505564 0 0 3 0x14200 bored drmubwq > 87295 374846 0 0 3 0x14200 bored drmubwq > 13798 142371 0 0 3 0x14200 bored drmubwq > 25664 191160 0 0 3 0x14200 bored drmubwq > 1920 14491 0 0 3 0x14200 bored drmhpwq > 32074 194715 0 0 3 0x14200 bored drmhpwq > 87299 408596 0 0 3 0x14200 bored drmhpwq > 84700 196185 0 0 3 0x14200 bored drmhpwq > 49492 134485 0 0 3 0x14200 bored drmwq > 85317 16007 0 0 3 0x14200 bored drmwq > 69861 222376 0 0 3 0x14200 bored drmwq > 58837 418320 0 0 3 0x14200 bored drmwq > 92915 102135 0 0 3 0x40014200 acpi0 acpi0 > 62796 101256 0 0 3 0x40014200 idle3 > 98829 504539 0 0 3 0x40014200 idle2 > 74240 7414 0 0 3 0x40014200 idle1 > 52393 30619 0 0 3 0x14200 bored sensors > 74541 314157 0 0 7 0x14200 softnet3 > 27135 455485 0 0 7 0x14200 softnet2 > 17394 432414 0 0 7 0x14200 softnet1 > 67812 482110 0 0 3 0x14200 bored softnet0 > 69613 421762 0 0 3 0x14200 bored systqmp > 69474 244487 0 0 3 0x14200 bored systq > 6630 315988 0 0 3 0x14200 tmoslp softclockmp > 4182 461805 0 0 3 0x40014200 tmoslp softclock > *98424 232380 0 0 7 0x40014200 idle0 > 1 288021 0 0 3 0x82 wait init > 0 0 -1 0 3 0x10200 scheduler swapper > > ddb{0}> x/s version > version: OpenBSD 7.7-beta (GENERIC.MP) #cvs : D2025.03.19.00.00.00: Thu Mar 20 17:06:11 CET 2025\012 root@ot41.obsd-lab.genua.de:/usr/src/sys/arch/amd64/compile/GENERIC.MP\012 > > Crash is here: > /home/bluhm/openbsd/cvs/src/sys/dev/pci/if_ixl.c:3032 > 7a3c: 4c 89 f1 mov %r14,%rcx > 7a3f: 48 c1 e1 04 shl $0x4,%rcx > 7a43: 48 8b 55 98 mov 0xffffffffffffff98(%rbp),%rdx > 7a47: 8b 4c 0a 08 mov 0x8(%rdx,%rcx,1),%ecx > /home/bluhm/openbsd/cvs/src/sys/dev/pci/if_ixl.c:3033 > > 3027 do { > 3028 txm = &txr->txr_maps[cons]; > 3029 last = txm->txm_eop; > 3030 txd = &ring[last]; > 3031 > * 3032 dtype = txd->cmd & htole64(IXL_TX_DESC_DTYPE_MASK); > 3033 if (dtype != htole64(IXL_TX_DESC_DTYPE_DONE)) > 3034 break; > > Variable last is -1. > > My guess is concatenating the TCP packet with LRO has some bugs. 
> Then an illegal packet causes the TSO send path to crash.
>
> As we disable LRO per default for now, I don't consider this crash
> as a show stopper.

I reproduced and debugged these panics. We get 2 to 3 panics on
different CPUs: from ixl_txeof() as bluhm showed above, and from
ixl_rxeof() in ixl_rxfill() while trying to get a new mbuf.

While debugging I noticed a lot of m_defrag() calls. We get them
because SoftLRO generates mbuf chains that are too long. ixl(4) NICs
are limited to a maximum of 8 memory chunks per packet. SoftLRO can
combine 40 or more 1.5k chunks into one chain. The outgoing ixl(4)
interface has to call m_defrag() to reorganize these chains into fewer,
bigger mbufs.

When I reduce the maximum number of mbufs per packet to 8, m_defrag()
is never called and we don't run into the panics above. I guess there
is some kind of race or locking problem in the ring and mbuf handling
of txeof/rxeof, maybe in combination with the m_defrag() function. Or
m_defrag() just changed the timing enough to trigger this bug.

The diff below contains most of bluhm's improvements and code to limit
the created mbuf chain to 8.

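To illustrate the chain limit, here is a minimal userland sketch of the
counting logic. It is not the kernel code: struct mbuf_sketch is a
hypothetical stand-in for struct mbuf that only carries m_next, and
SOFTLRO_MAX_CHUNKS and chain_fits() are names made up for this example
(the diff hard-codes 8 inside tcp_softlro()).

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct mbuf; only m_next is used. */
struct mbuf_sketch {
	struct mbuf_sketch *m_next;
};

/* Assumed name for illustration; the diff uses the literal 8. */
#define SOFTLRO_MAX_CHUNKS 8

/*
 * Return 1 if the mbufs of head and tail together do not exceed
 * SOFTLRO_MAX_CHUNKS, otherwise 0.  One counter runs across both
 * chains, so the limit applies to the merged packet.
 */
static int
chain_fits(const struct mbuf_sketch *mhead, const struct mbuf_sketch *mtail)
{
	const struct mbuf_sketch *m;
	unsigned int cnt = 0;

	for (m = mhead; m != NULL; m = m->m_next)
		if (cnt++ >= SOFTLRO_MAX_CHUNKS)
			return 0;
	for (m = mtail; m != NULL; m = m->m_next)
		if (cnt++ >= SOFTLRO_MAX_CHUNKS)
			return 0;
	return 1;
}
```

With this check a head of 7 mbufs still accepts a single-mbuf tail, while
a head that already has 8 mbufs refuses any further merge, so the
transmit side should never see more than 8 chunks from SoftLRO.
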
Thanks, Jan Index: dev/pci/if_ixl.c =================================================================== RCS file: /cvs/src/sys/dev/pci/if_ixl.c,v diff -u -p -r1.102 if_ixl.c --- dev/pci/if_ixl.c 30 Oct 2024 18:02:45 -0000 1.102 +++ dev/pci/if_ixl.c 3 Apr 2025 15:45:43 -0000 @@ -883,6 +883,8 @@ struct ixl_rx_wb_desc_16 { #define IXL_RX_DESC_PTYPE_SHIFT 30 #define IXL_RX_DESC_PTYPE_MASK (0xffULL << IXL_RX_DESC_PTYPE_SHIFT) +#define IXL_RX_DESC_PTYPE_MAC_IPV4_TCP 26 +#define IXL_RX_DESC_PTYPE_MAC_IPV6_TCP 92 #define IXL_RX_DESC_PLEN_SHIFT 38 #define IXL_RX_DESC_PLEN_MASK (0x3fffULL << IXL_RX_DESC_PLEN_SHIFT) @@ -1976,6 +1978,12 @@ ixl_attach(struct device *parent, struct IFCAP_CSUM_TCPv6 | IFCAP_CSUM_UDPv6; ifp->if_capabilities |= IFCAP_TSOv4 | IFCAP_TSOv6; + ifp->if_capabilities |= IFCAP_LRO; +#if notyet + /* for now tcplro at ixl(4) is default off */ + ifp->if_xflags |= IFXF_LRO; +#endif + ifmedia_init(&sc->sc_media, 0, ixl_media_change, ixl_media_status); ixl_media_add(sc, phy_types); @@ -3255,9 +3263,11 @@ ixl_rxeof(struct ixl_softc *sc, struct i struct ixl_rx_map *rxm; bus_dmamap_t map; unsigned int cons, prod; + struct mbuf_list mltcp = MBUF_LIST_INITIALIZER(); struct mbuf_list ml = MBUF_LIST_INITIALIZER(); struct mbuf *m; uint64_t word; + unsigned int ptype; unsigned int len; unsigned int mask; int done = 0; @@ -3294,6 +3304,8 @@ ixl_rxeof(struct ixl_softc *sc, struct i m = rxm->rxm_m; rxm->rxm_m = NULL; + ptype = (word & IXL_RX_DESC_PTYPE_MASK) + >> IXL_RX_DESC_PTYPE_SHIFT; len = (word & IXL_RX_DESC_PLEN_MASK) >> IXL_RX_DESC_PLEN_SHIFT; m->m_len = len; m->m_pkthdr.len = 0; @@ -3324,7 +3336,13 @@ ixl_rxeof(struct ixl_softc *sc, struct i #endif ixl_rx_checksum(m, word); - ml_enqueue(&ml, m); + + if (ISSET(ifp->if_xflags, IFXF_LRO) && + (ptype == IXL_RX_DESC_PTYPE_MAC_IPV4_TCP || + ptype == IXL_RX_DESC_PTYPE_MAC_IPV6_TCP)) + tcp_softlro_enqueue(ifp, &mltcp, m); + else + ml_enqueue(&ml, m); } else { ifp->if_ierrors++; /* XXX */ m_freem(m); @@ -3341,8 +3359,14 @@ 
ixl_rxeof(struct ixl_softc *sc, struct i } while (cons != prod); if (done) { + int livelocked = 0; + rxr->rxr_cons = cons; + if (ifiq_input(ifiq, &mltcp)) + livelocked = 1; if (ifiq_input(ifiq, &ml)) + livelocked = 1; + if (livelocked) if_rxr_livelocked(&rxr->rxr_acct); ixl_rxfill(sc, rxr); } Index: netinet/tcp_input.c =================================================================== RCS file: /cvs/src/sys/netinet/tcp_input.c,v diff -u -p -r1.434 tcp_input.c --- netinet/tcp_input.c 10 Mar 2025 15:11:46 -0000 1.434 +++ netinet/tcp_input.c 4 Apr 2025 07:59:09 -0000 @@ -84,6 +84,7 @@ #include #include +#include #include #include #include @@ -4229,4 +4230,242 @@ syn_cache_respond(struct syn_cache *sc, } in_pcbunref(inp); return (error); +} + +int +tcp_softlro(struct mbuf *mhead, struct mbuf *mtail) +{ + struct ether_extracted head; + struct ether_extracted tail; + struct mbuf *m; + unsigned int hdrlen; + unsigned int cnt = 0; + + /* + * Check if head and tail are mergeable + */ + + ether_extract_headers(mhead, &head); + ether_extract_headers(mtail, &tail); + + /* Don't merge packets of different VLANs */ + if (head.evh && tail.evh) { + if (head.evh->evl_tag != tail.evh->evl_tag) + return 0; + } else if (head.evh || tail.evh) + return 0; + + /* Check IP header. */ + if (head.ip4 && tail.ip4) { + /* Don't merge packets with invalid header checksum. */ + if (!ISSET(mhead->m_pkthdr.csum_flags, M_IPV4_CSUM_IN_OK) || + !ISSET(mtail->m_pkthdr.csum_flags, M_IPV4_CSUM_IN_OK)) + return 0; + + /* Check IPv4 addresses. */ + if (head.ip4->ip_src.s_addr != tail.ip4->ip_src.s_addr || + head.ip4->ip_dst.s_addr != tail.ip4->ip_dst.s_addr) + return 0; + + /* Don't merge IPv4 fragments. */ + if (ISSET(head.ip4->ip_off, htons(IP_OFFMASK | IP_MF)) || + ISSET(tail.ip4->ip_off, htons(IP_OFFMASK | IP_MF))) + return 0; + + /* Check max. IPv4 length. */ + if (head.iplen + tail.iplen > IP_MAXPACKET) + return 0; + + /* Don't merge IPv4 packets with option headers. 
*/ + if (head.iphlen != sizeof(struct ip) || + tail.iphlen != sizeof(struct ip)) + return 0; + + /* Don't merge non-TCP packets. */ + if (head.ip4->ip_p != IPPROTO_TCP || + tail.ip4->ip_p != IPPROTO_TCP) + return 0; + } else if (head.ip6 && tail.ip6) { + /* Check IPv6 addresses. */ + if (!IN6_ARE_ADDR_EQUAL(&head.ip6->ip6_src, &tail.ip6->ip6_src) || + !IN6_ARE_ADDR_EQUAL(&head.ip6->ip6_dst, &tail.ip6->ip6_dst)) + return 0; + + /* Check max. IPv6 length. */ + if ((head.iplen - head.iphlen) + + (tail.iplen - tail.iphlen) > IPV6_MAXPACKET) + return 0; + + /* Don't merge IPv6 packets with option headers nor non-TCP. */ + if (head.ip6->ip6_nxt != IPPROTO_TCP || + tail.ip6->ip6_nxt != IPPROTO_TCP) + return 0; + } else { + return 0; + } + + /* Check TCP header. */ + if (!head.tcp || !tail.tcp) + return 0; + + /* Check TCP ports. */ + if (head.tcp->th_sport != tail.tcp->th_sport || + head.tcp->th_dport != tail.tcp->th_dport) + return 0; + + /* Don't merge empty segments. */ + if (head.paylen == 0 || tail.paylen == 0) + return 0; + + /* Check for contiguous segments. */ + if (ntohl(head.tcp->th_seq) + head.paylen != ntohl(tail.tcp->th_seq)) + return 0; + + /* Just ACK and PUSH TCP flags are allowed. */ + if (ISSET(head.tcp->th_flags, ~(TH_ACK|TH_PUSH)) || + ISSET(tail.tcp->th_flags, ~(TH_ACK|TH_PUSH))) + return 0; + + /* TCP ACK flag has to be set. */ + if (!ISSET(head.tcp->th_flags, TH_ACK) || + !ISSET(tail.tcp->th_flags, TH_ACK)) + return 0; + + /* Ignore segments with different TCP options. */ + if (head.tcphlen - sizeof(struct tcphdr) != + tail.tcphlen - sizeof(struct tcphdr)) + return 0; + + /* Check for TCP options */ + if (head.tcphlen > sizeof(struct tcphdr)) { + char *hopt = (char *)(head.tcp) + sizeof(struct tcphdr); + char *topt = (char *)(tail.tcp) + sizeof(struct tcphdr); + int optsize = head.tcphlen - sizeof(struct tcphdr); + int optlen; + + for (; optsize > 0; optsize -= optlen) { + /* Ignore segments with different TCP options. 
*/ + if (hopt[0] != topt[0] || hopt[1] != topt[1]) + return 0; + + /* Get option length */ + optlen = hopt[1]; + if (hopt[0] == TCPOPT_NOP) + optlen = 1; + else if (optlen < 2 || optlen > optsize) + return 0; /* Illegal length */ + + if (hopt[0] != TCPOPT_NOP && + hopt[0] != TCPOPT_TIMESTAMP) + return 0; /* Unsupported TCP option */ + + hopt += optlen; + topt += optlen; + } + } + + /* Limit mbuf chain len to avoid m_defrag calls on forwarding. */ + for (m = mhead; m != NULL; m = m->m_next) + if (cnt++ >= 8) + return 0; + for (m = mtail; m != NULL; m = m->m_next) + if (cnt++ >= 8) + return 0; + + /* + * Prepare concatenation of head and tail. + */ + + /* Adjust IP header. */ + if (head.ip4) { + head.ip4->ip_len = htons(head.iplen + tail.paylen); + } else if (head.ip6) { + head.ip6->ip6_plen = + htons(head.iplen - head.iphlen + tail.paylen); + } + + /* Combine TCP flags from head and tail. */ + if (ISSET(tail.tcp->th_flags, TH_PUSH)) + SET(head.tcp->th_flags, TH_PUSH); + + /* Adjust TCP header. */ + head.tcp->th_win = tail.tcp->th_win; + head.tcp->th_ack = tail.tcp->th_ack; + + /* Calculate header length of tail packet. */ + hdrlen = sizeof(*tail.eh); + if (tail.evh) + hdrlen = sizeof(*tail.evh); + hdrlen += tail.iphlen; + hdrlen += tail.tcphlen; + + /* Skip protocol headers in tail. */ + m_adj(mtail, hdrlen); + CLR(mtail->m_flags, M_PKTHDR); + + /* Concatenate */ + for (m = mhead; m->m_next;) + m = m->m_next; + m->m_next = mtail; + mhead->m_pkthdr.len += tail.paylen; + + /* Flag mbuf as TSO packet with MSS. */ + if (!ISSET(mhead->m_pkthdr.csum_flags, M_TCP_TSO)) { + /* Set CSUM_OUT flags in case of forwarding. 
*/ + SET(mhead->m_pkthdr.csum_flags, M_TCP_CSUM_OUT); + head.tcp->th_sum = 0; + if (head.ip4) { + SET(mhead->m_pkthdr.csum_flags, M_IPV4_CSUM_OUT); + head.ip4->ip_sum = 0; + } + + SET(mhead->m_pkthdr.csum_flags, M_TCP_TSO); + mhead->m_pkthdr.ph_mss = head.paylen; + tcpstat_inc(tcps_inswlro); + tcpstat_inc(tcps_inpktlro); /* count head */ + } + mhead->m_pkthdr.ph_mss = MAX(mhead->m_pkthdr.ph_mss, tail.paylen); + tcpstat_inc(tcps_inpktlro); /* count tail */ + + return 1; +} + +void +tcp_softlro_enqueue(struct ifnet *ifp, struct mbuf_list *ml, struct mbuf *mtail) +{ + struct mbuf *mhead; + + if (!ISSET(ifp->if_xflags, IFXF_LRO)) + goto out; + + /* Don't merge packets with invalid header checksum. */ + if (!ISSET(mtail->m_pkthdr.csum_flags, M_TCP_CSUM_IN_OK)) + goto out; + + for (mhead = ml->ml_head; mhead != NULL; mhead = mhead->m_nextpkt) { + /* Don't merge packets with invalid header checksum. */ + if (!ISSET(mhead->m_pkthdr.csum_flags, M_TCP_CSUM_IN_OK)) + continue; + + /* Use RSS hash to skip packets of different connections. 
*/ + if (ISSET(mhead->m_pkthdr.csum_flags, M_FLOWID) && + ISSET(mtail->m_pkthdr.csum_flags, M_FLOWID) && + mhead->m_pkthdr.ph_flowid != mtail->m_pkthdr.ph_flowid) + continue; + + /* Don't merge packets of different VLANs */ + if (ISSET(mhead->m_flags, M_VLANTAG) != + ISSET(mtail->m_flags, M_VLANTAG)) + continue; + + if (ISSET(mhead->m_flags, M_VLANTAG) && + EVL_VLANOFTAG(mhead->m_pkthdr.ether_vtag) != + EVL_VLANOFTAG(mtail->m_pkthdr.ether_vtag)) + continue; + + if (tcp_softlro(mhead, mtail)) + return; + } + out: + ml_enqueue(ml, mtail); } Index: netinet/tcp_var.h =================================================================== RCS file: /cvs/src/sys/netinet/tcp_var.h,v diff -u -p -r1.186 tcp_var.h --- netinet/tcp_var.h 2 Mar 2025 21:28:32 -0000 1.186 +++ netinet/tcp_var.h 3 Apr 2025 15:43:55 -0000 @@ -720,6 +720,7 @@ void tcp_init(void); int tcp_input(struct mbuf **, int *, int, int, struct netstack *); int tcp_mss(struct tcpcb *, int); void tcp_mss_update(struct tcpcb *); +void tcp_softlro_enqueue(struct ifnet *, struct mbuf_list *, struct mbuf *); u_int tcp_hdrsz(struct tcpcb *); void tcp_mtudisc(struct inpcb *, int); void tcp_mtudisc_increase(struct inpcb *, int);
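
For anyone reading the merge test in tcp_softlro(), the sequence number
check can be tried in isolation. The following is only a sketch under
the assumption that all values are already in host byte order and 32 bit
wide; seq_contiguous() is a made-up helper name, not part of the diff.

```c
#include <stdint.h>

/*
 * Return 1 if the tail segment starts exactly where the payload of the
 * head segment ends.  The addition is done in 32 bit unsigned
 * arithmetic, so a sequence number wrap-around is handled implicitly.
 */
static int
seq_contiguous(uint32_t head_seq, uint32_t head_paylen, uint32_t tail_seq)
{
	return (uint32_t)(head_seq + head_paylen) == tail_seq;
}
```

A head with sequence number 1000 and 1448 bytes of payload is followed
by a tail at 2448; any gap or overlap makes the check fail and the tail
is queued as a separate packet.
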