From: Yuichiro NAITO
Subject: Re: SoftLRO for ixl(4), bnxt(4) and em(4)
To: j.klemkow@wemelug.de
Cc: tech@openbsd.org, icepic.dz@gmail.com
Date: Fri, 14 Mar 2025 12:49:51 +0900

From: Yuichiro NAITO
Subject: Re: SoftLRO for ixl(4), bnxt(4) and em(4)
Date: Fri, 07 Mar 2025 17:34:22 +0900 (JST)

> From: Jan Klemkow
> Subject: Re: SoftLRO for ixl(4), bnxt(4) and em(4)
> Date: Tue, 4 Mar 2025 20:02:31 +0100
>
>> On Fri, Nov 15, 2024 at 11:30:08AM GMT, Jan Klemkow wrote:
>>> On Thu, Nov 07, 2024 at 11:30:26AM GMT, David Gwynne wrote:
>>> > On Thu, Nov 07, 2024 at 01:10:10AM +0100, Jan Klemkow wrote:
>>> > > This diff introduces a software solution for TCP Large Receive
>>> > > Offload (SoftLRO) for network interfaces that don't have hardware
>>> > > support for it.  This is needed at least for newer Intel
>>> > > interfaces, as their documentation says that LRO, a.k.a. Receive
>>> > > Side Coalescing (RSC), has to be done in software.  This diff
>>> > > coalesces TCP segments during the receive interrupt before
>>> > > queueing them.  Thus, our TCP/IP stack has to process fewer
>>> > > packet headers per amount of received data.
>>> > >
>>> > > I measured receive performance with Intel XXV710 25 GbE
>>> > > interfaces.  It increased from 6 Gbit/s to 23 Gbit/s.
>>> > >
>>> > > Even though we can saturate em(4) without any of these
>>> > > techniques, it is also part of this diff.  I'm interested in
>>> > > whether this diff helps to reach 1 Gbit/s on old or slow
>>> > > hardware.
>>> > >
>>> > > I also added bnxt(4) to this diff to increase test coverage.  If
>>> > > you want to test this implementation with your favorite
>>> > > interface, just replace the ml_enqueue() call with the new
>>> > > tcp_softlro_enqueue() (as seen below).  It should work with all
>>> > > kinds of network interfaces.
>>> > >
>>> > > Any comments and test reports are welcome.
>>> >
>>> > nice.
>>> >
>>> > i would argue this should be ether_softlro_enqueue and put in
>>> > if_ethersubr.c because it's specific to ethernet interfaces.  we
>>> > don't really have any other type of interface that bundles
>>> > reception of packets that we can take advantage of like this, and
>>> > internally it assumes it's pulling ethernet packets apart.
>>> >
>>> > aside from that, just a few comments on the code.
>>>
>>> I adapted your comments in the diff below.
>>
>> I refactored the SoftLRO diff.  You just need to add the flags
>> IFXF_LRO / IFCAP_LRO, and replace ml_enqueue() with
>> tcp_softlro_enqueue() to enable this on your favorite network device.
>>
>> Janne: I adjusted your diff with correct headers.  But I'm unable to
>> test this part of the diff below, due to lack of hardware.  Could you
>> test it again?
>>
>> Yuichiro: Could you also retest your UDP/TCP forwarding test?  I
>> added a short path for non-TCP packets in the ixl(4) driver.  Maybe
>> it's better now.
>
> Thanks for your latest patch.  It improves the ipgen test with 2
> ixl(4) NICs.  It shows very good performance, as seen below.
>
> ```
> rfc2544 tolerable error rate: 0.0000%
> rfc2544 trial duration: 10 sec
> rfc2544 pps resolution: 0.0000%
> rfc2544 interval: 0 sec
> rfc2544 warming duration: 1 sec
>
> framesize|0G 1G 2G 3G 4G 5G 6G 7G 8G 9G 10Gbps
> ---------+----+----+----+----+----+----+----+----+----+----+
> 64   |#### 623.96Mbps, 1218672/14880952pps, 8.19%
> 128  |####### 1249.54Mbps, 1220252/ 8445945pps, 14.45%
> 512  |######################### 4813.20Mbps, 1175098/ 2349624pps, 50.01%
> 1024 |################################################# 9771.27Mbps, 1192782/ 1197318pps, 99.62%
> 1280 |################################################## 9845.54Mbps, 961479/ 961538pps, 99.99%
> 1408 |################################################## 9859.94Mbps, 875350/ 875350pps, 100.00%
> 1518 |################################################## 9869.91Mbps, 812740/ 812743pps, 100.00%
>
> framesize|0 |1m |2m |3m |4m |5m |6m |7m |8m |9m |10m |11m |12m |13m |14m |15m pps
> ---------+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
> 64   |###### 1218672/14880952pps, 8.19%
> 128  |###### 1220252/ 8445945pps, 14.45%
> 512  |##### 1175098/ 2349624pps, 50.01%
> 1024 |##### 1192782/ 1197318pps, 99.62%
> 1280 |#### 961479/ 961538pps, 99.99%
> 1408 |#### 875350/ 875350pps, 100.00%
> 1518 |#### 812740/ 812743pps, 100.00%
> ```
>
> However, TCP forwarding performance looks the same as with the first
> patch.
>
> ```
> # iperf3 -c 192.168.200.10
> Connecting to host 192.168.200.10, port 5201
> [  5] local 192.168.100.10 port 19800 connected to 192.168.200.10 port 5201
> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> [  5]   0.00-1.00   sec   343 MBytes  2.87 Gbits/sec   75   1.04 MBytes
> [  5]   1.00-2.00   sec   600 MBytes  5.03 Gbits/sec    0   1.61 MBytes
> [  5]   2.00-3.01   sec   582 MBytes  4.85 Gbits/sec    0   1.61 MBytes
> [  5]   3.01-4.01   sec   507 MBytes  4.26 Gbits/sec    0   1.61 MBytes
> [  5]   4.01-5.01   sec   420 MBytes  3.53 Gbits/sec    0   1.61 MBytes
> [  5]   5.01-6.01   sec   361 MBytes  3.03 Gbits/sec    0   1.61 MBytes
> [  5]   6.01-7.00   sec   349 MBytes  2.94 Gbits/sec    0   1.61 MBytes
> [  5]   7.00-8.00   sec   346 MBytes  2.91 Gbits/sec    0   1.61 MBytes
> [  5]   8.00-9.00   sec   354 MBytes  2.96 Gbits/sec    0   1.61 MBytes
> [  5]   9.00-10.00  sec   345 MBytes  2.90 Gbits/sec    0   1.61 MBytes
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-10.00  sec  4.11 GBytes  3.53 Gbits/sec   75            sender
> [  5]   0.00-10.00  sec  4.11 GBytes  3.53 Gbits/sec                 receiver
>
> iperf Done.
>
> # iperf3 -c 192.168.200.10 -R
> Connecting to host 192.168.200.10, port 5201
> Reverse mode, remote host 192.168.200.10 is sending
> [  5] local 192.168.100.10 port 10129 connected to 192.168.200.10 port 5201
> [ ID] Interval           Transfer     Bitrate
> [  5]   0.00-1.01   sec   461 MBytes  3.84 Gbits/sec
> [  5]   1.01-2.00   sec   453 MBytes  3.83 Gbits/sec
> [  5]   2.00-3.00   sec   458 MBytes  3.83 Gbits/sec
> [  5]   3.00-4.00   sec   451 MBytes  3.79 Gbits/sec
> [  5]   4.00-5.00   sec   445 MBytes  3.74 Gbits/sec
> [  5]   5.00-6.06   sec   477 MBytes  3.77 Gbits/sec
> [  5]   6.06-7.00   sec   427 MBytes  3.81 Gbits/sec
> [  5]   7.00-8.00   sec   449 MBytes  3.78 Gbits/sec
> [  5]   8.00-9.04   sec   467 MBytes  3.78 Gbits/sec
> [  5]   9.04-10.00  sec   436 MBytes  3.80 Gbits/sec
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-10.00  sec  4.42 GBytes  3.80 Gbits/sec    0            sender
> [  5]   0.00-10.00  sec  4.42 GBytes  3.80 Gbits/sec                 receiver
>
> iperf Done.
> ```
>
> I will look more deeply into what's happening on my OpenBSD kernel.

I found that the number of ACKs returned slows down the TCP bulk
transfer.  I'm using FreeBSD hosts for the iperf3 client and server.
The FreeBSD kernel returns an ACK for every two packets received,
which amounts to a large number of ACKs.

While the OpenBSD kernel is forwarding the bulk TCP packets, it
receives a lot of ACK packets.  The source and destination addresses
and the port numbers of the ACK packets are swapped but otherwise the
same values, so the RSS hash value is the same.  The bulk TCP input
interface and the ACK input interface are different, but RSS
interrupts the same queue number on both, so the same softnet
taskqueue is used.

This means that while the bulk TCP packets are being transferred, the
ACK packets are delayed, and vice versa.  The delay of the ACK packets
slows down the sliding window of the bulk TCP stream, which therefore
doesn't reach 10 Gbps.

One solution to this issue is reducing the number of ACK packets from
the FreeBSD kernel.
I set the LRO option on the iperf3 server interface (I'm using X520
NICs for the FreeBSD host).  The LRO feature merges received TCP
packets of the same stream, so the number of packets the FreeBSD
kernel receives is reduced, and the number of ACKs it returns is
reduced accordingly.

I got the following transfer performance with the LRO option.  The
OpenBSD kernel forwards a 10 Gbps TCP stream.

```
# iperf3 -c 192.168.200.10
Connecting to host 192.168.200.10, port 5201
[  5] local 192.168.100.10 port 19855 connected to 192.168.200.10 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.06   sec   381 MBytes  3.00 Gbits/sec   26   1.07 MBytes
[  5]   1.06-2.06   sec   912 MBytes  7.68 Gbits/sec   37   1.44 MBytes
[  5]   2.06-3.06   sec  1.10 GBytes  9.41 Gbits/sec    0   1.60 MBytes
[  5]   3.06-4.00   sec  1.03 GBytes  9.41 Gbits/sec    0   1.60 MBytes
[  5]   4.00-5.06   sec  1.16 GBytes  9.41 Gbits/sec    0   1.60 MBytes
[  5]   5.06-6.03   sec  1.05 GBytes  9.41 Gbits/sec    0   1.61 MBytes
[  5]   6.03-7.06   sec  1.14 GBytes  9.41 Gbits/sec    0   1.61 MBytes
[  5]   7.06-8.00   sec  1.03 GBytes  9.42 Gbits/sec    0   1.61 MBytes
[  5]   8.00-9.01   sec  1.10 GBytes  9.41 Gbits/sec    0   1.61 MBytes
[  5]   9.01-10.00  sec  1.09 GBytes  9.41 Gbits/sec    0   1.61 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  9.97 GBytes  8.56 Gbits/sec   63            sender
[  5]   0.00-10.00  sec  9.97 GBytes  8.56 Gbits/sec                 receiver

iperf Done.
```

I captured the bulk TCP packets in the LRO and non-LRO cases and drew
TCP sequence number graphs.  'without_lro.png' shows that the TCP
sequence number increment sometimes stalls.  It happens on receiving a
lot of ACK packets.  'with_lro.png' doesn't show that kind of delay at
the same zoom level.

Conclusion: this slow TCP transfer issue is not related to the SoftLRO
patch.  The SoftLRO patch works as intended and also works nicely for
me.

>>
>> Further tests and comments are welcome.
>> >> Thanks, >> Jan >> >> Index: arch/octeon/dev/if_cnmac.c >> =================================================================== >> RCS file: /cvs/src/sys/arch/octeon/dev/if_cnmac.c,v >> diff -u -p -r1.86 if_cnmac.c >> --- arch/octeon/dev/if_cnmac.c 20 May 2024 23:13:33 -0000 1.86 >> +++ arch/octeon/dev/if_cnmac.c 4 Mar 2025 14:53:49 -0000 >> @@ -55,6 +55,9 @@ >> #include >> #include >> #include >> +#include >> +#include >> +#include >> >> #if NBPFILTER > 0 >> #include >> @@ -306,7 +309,7 @@ cnmac_attach(struct device *parent, stru >> strncpy(ifp->if_xname, sc->sc_dev.dv_xname, sizeof(ifp->if_xname)); >> ifp->if_softc = sc; >> ifp->if_flags = IFF_BROADCAST | IFF_SIMPLEX | IFF_MULTICAST; >> - ifp->if_xflags = IFXF_MPSAFE; >> + ifp->if_xflags = IFXF_MPSAFE | IFXF_LRO; >> ifp->if_ioctl = cnmac_ioctl; >> ifp->if_qstart = cnmac_start; >> ifp->if_watchdog = cnmac_watchdog; >> @@ -314,7 +317,7 @@ cnmac_attach(struct device *parent, stru >> ifq_init_maxlen(&ifp->if_snd, max(GATHER_QUEUE_SIZE, IFQ_MAXLEN)); >> >> ifp->if_capabilities = IFCAP_VLAN_MTU | IFCAP_CSUM_TCPv4 | >> - IFCAP_CSUM_UDPv4 | IFCAP_CSUM_TCPv6 | IFCAP_CSUM_UDPv6; >> + IFCAP_CSUM_UDPv4 | IFCAP_CSUM_TCPv6 | IFCAP_CSUM_UDPv6 | IFCAP_LRO; >> >> cn30xxgmx_set_filter(sc->sc_gmx_port); >> >> @@ -1246,7 +1249,7 @@ cnmac_recv(struct cnmac_softc *sc, uint6 >> M_TCP_CSUM_IN_OK | M_UDP_CSUM_IN_OK; >> } >> >> - ml_enqueue(ml, m); >> + tcp_softlro_enqueue(ifp, ml, m); >> >> return nmbuf; >> >> Index: dev/pci/if_bnxt.c >> =================================================================== >> RCS file: /cvs/src/sys/dev/pci/if_bnxt.c,v >> diff -u -p -r1.52 if_bnxt.c >> --- dev/pci/if_bnxt.c 6 Oct 2024 23:43:18 -0000 1.52 >> +++ dev/pci/if_bnxt.c 4 Mar 2025 14:53:48 -0000 >> @@ -646,6 +646,8 @@ bnxt_attach(struct device *parent, struc >> IFCAP_CSUM_UDPv4 | IFCAP_CSUM_TCPv4 | IFCAP_CSUM_UDPv6 | >> IFCAP_CSUM_TCPv6; >> ifp->if_capabilities |= IFCAP_TSOv4 | IFCAP_TSOv6; >> + ifp->if_xflags |= IFXF_LRO; >> + 
ifp->if_capabilities |= IFCAP_LRO; >> #if NVLAN > 0 >> ifp->if_capabilities |= IFCAP_VLAN_HWTAGGING; >> #endif >> @@ -2283,6 +2285,7 @@ bnxt_rx(struct bnxt_softc *sc, struct bn >> struct bnxt_cp_ring *cpr, struct mbuf_list *ml, int *slots, int *agslots, >> struct cmpl_base *cmpl) >> { >> + struct ifnet *ifp = &sc->sc_ac.ac_if; >> struct mbuf *m, *am; >> struct bnxt_slot *bs; >> struct rx_pkt_cmpl *rxlo = (struct rx_pkt_cmpl *)cmpl; >> @@ -2355,7 +2358,7 @@ bnxt_rx(struct bnxt_softc *sc, struct bn >> (*agslots)++; >> } >> >> - ml_enqueue(ml, m); >> + tcp_softlro_enqueue(ifp, ml, m); >> return (0); >> } >> >> Index: dev/pci/if_em.c >> =================================================================== >> RCS file: /cvs/src/sys/dev/pci/if_em.c,v >> diff -u -p -r1.378 if_em.c >> --- dev/pci/if_em.c 31 Aug 2024 16:23:09 -0000 1.378 >> +++ dev/pci/if_em.c 4 Mar 2025 14:53:48 -0000 >> @@ -2013,6 +2013,9 @@ em_setup_interface(struct em_softc *sc) >> ifp->if_capabilities |= IFCAP_TSOv4 | IFCAP_TSOv6; >> } >> >> + ifp->if_xflags |= IFXF_LRO; >> + ifp->if_capabilities |= IFCAP_LRO; >> + >> /* >> * Specify the media types supported by this adapter and register >> * callbacks to update media and link information >> @@ -3185,7 +3188,8 @@ em_rxeof(struct em_queue *que) >> m->m_flags |= M_VLANTAG; >> } >> #endif >> - ml_enqueue(&ml, m); >> + >> + tcp_softlro_enqueue(ifp, &ml, m); >> >> que->rx.fmp = NULL; >> que->rx.lmp = NULL; >> Index: dev/pci/if_ixl.c >> =================================================================== >> RCS file: /cvs/src/sys/dev/pci/if_ixl.c,v >> diff -u -p -r1.102 if_ixl.c >> --- dev/pci/if_ixl.c 30 Oct 2024 18:02:45 -0000 1.102 >> +++ dev/pci/if_ixl.c 4 Mar 2025 14:53:49 -0000 >> @@ -883,6 +883,8 @@ struct ixl_rx_wb_desc_16 { >> >> #define IXL_RX_DESC_PTYPE_SHIFT 30 >> #define IXL_RX_DESC_PTYPE_MASK (0xffULL << IXL_RX_DESC_PTYPE_SHIFT) >> +#define IXL_RX_DESC_PTYPE_MAC_IPV4_TCP 26 >> +#define IXL_RX_DESC_PTYPE_MAC_IPV6_TCP 92 >> >> #define 
IXL_RX_DESC_PLEN_SHIFT 38 >> #define IXL_RX_DESC_PLEN_MASK (0x3fffULL << IXL_RX_DESC_PLEN_SHIFT) >> @@ -1976,6 +1978,9 @@ ixl_attach(struct device *parent, struct >> IFCAP_CSUM_TCPv6 | IFCAP_CSUM_UDPv6; >> ifp->if_capabilities |= IFCAP_TSOv4 | IFCAP_TSOv6; >> >> + ifp->if_xflags |= IFXF_LRO; >> + ifp->if_capabilities |= IFCAP_LRO; >> + >> ifmedia_init(&sc->sc_media, 0, ixl_media_change, ixl_media_status); >> >> ixl_media_add(sc, phy_types); >> @@ -3255,9 +3260,11 @@ ixl_rxeof(struct ixl_softc *sc, struct i >> struct ixl_rx_map *rxm; >> bus_dmamap_t map; >> unsigned int cons, prod; >> - struct mbuf_list ml = MBUF_LIST_INITIALIZER(); >> + struct mbuf_list mlt = MBUF_LIST_INITIALIZER(); >> + struct mbuf_list mlo = MBUF_LIST_INITIALIZER(); >> struct mbuf *m; >> uint64_t word; >> + unsigned int ptype; >> unsigned int len; >> unsigned int mask; >> int done = 0; >> @@ -3294,6 +3301,8 @@ ixl_rxeof(struct ixl_softc *sc, struct i >> m = rxm->rxm_m; >> rxm->rxm_m = NULL; >> >> + ptype = (word & IXL_RX_DESC_PTYPE_MASK) >> + >> IXL_RX_DESC_PTYPE_SHIFT; >> len = (word & IXL_RX_DESC_PLEN_MASK) >> IXL_RX_DESC_PLEN_SHIFT; >> m->m_len = len; >> m->m_pkthdr.len = 0; >> @@ -3324,7 +3333,12 @@ ixl_rxeof(struct ixl_softc *sc, struct i >> #endif >> >> ixl_rx_checksum(m, word); >> - ml_enqueue(&ml, m); >> + >> + if (ptype == IXL_RX_DESC_PTYPE_MAC_IPV4_TCP || >> + ptype == IXL_RX_DESC_PTYPE_MAC_IPV6_TCP) >> + tcp_softlro_enqueue(ifp, &mlt, m); >> + else >> + ml_enqueue(&mlo, m); >> } else { >> ifp->if_ierrors++; /* XXX */ >> m_freem(m); >> @@ -3342,7 +3356,7 @@ ixl_rxeof(struct ixl_softc *sc, struct i >> >> if (done) { >> rxr->rxr_cons = cons; >> - if (ifiq_input(ifiq, &ml)) >> + if (ifiq_input(ifiq, &mlt) || ifiq_input(ifiq, &mlo)) >> if_rxr_livelocked(&rxr->rxr_acct); >> ixl_rxfill(sc, rxr); >> } >> Index: net/if.c >> =================================================================== >> RCS file: /cvs/src/sys/net/if.c,v >> diff -u -p -r1.728 if.c >> --- net/if.c 2 Mar 2025 21:28:31 
-0000 1.728 >> +++ net/if.c 4 Mar 2025 14:53:49 -0000 >> @@ -3388,6 +3388,7 @@ ifsetlro(struct ifnet *ifp, int on) >> } else if (!on && ISSET(ifp->if_xflags, IFXF_LRO)) >> CLR(ifp->if_xflags, IFXF_LRO); >> >> + error = 0; >> out: >> splx(s); >> >> Index: netinet/tcp_input.c >> =================================================================== >> RCS file: /cvs/src/sys/netinet/tcp_input.c,v >> diff -u -p -r1.433 tcp_input.c >> --- netinet/tcp_input.c 2 Mar 2025 21:28:32 -0000 1.433 >> +++ netinet/tcp_input.c 4 Mar 2025 15:07:04 -0000 >> @@ -84,6 +84,7 @@ >> #include >> #include >> >> +#include >> #include >> #include >> #include >> @@ -4229,4 +4230,233 @@ syn_cache_respond(struct syn_cache *sc, >> } >> in_pcbunref(inp); >> return (error); >> +} >> + >> +int >> +tcp_softlro(struct mbuf *mhead, struct mbuf *mtail) >> +{ >> + struct ether_extracted head; >> + struct ether_extracted tail; >> + struct mbuf *m; >> + unsigned int hdrlen; >> + >> + /* >> + * Check if head and tail are mergeable >> + */ >> + >> + ether_extract_headers(mhead, &head); >> + ether_extract_headers(mtail, &tail); >> + >> + /* Don't merge packets of different VLANs */ >> + if (head.evh && tail.evh) { >> + if (head.evh->evl_tag != tail.evh->evl_tag) >> + return 0; >> + } else if (head.evh || tail.evh) >> + return 0; >> + >> + /* Check IP header. */ >> + if (head.ip4 && tail.ip4) { >> + /* Don't merge packets with invalid header checksum. */ >> + if (!ISSET(mhead->m_pkthdr.csum_flags, M_IPV4_CSUM_IN_OK) || >> + !ISSET(mtail->m_pkthdr.csum_flags, M_IPV4_CSUM_IN_OK)) >> + return 0; >> + >> + /* Check IPv4 addresses. */ >> + if (head.ip4->ip_src.s_addr != tail.ip4->ip_src.s_addr || >> + head.ip4->ip_dst.s_addr != tail.ip4->ip_dst.s_addr) >> + return 0; >> + >> + /* Don't merge IPv4 fragments. */ >> + if (ISSET(head.ip4->ip_off, htons(IP_OFFMASK | IP_MF)) || >> + ISSET(tail.ip4->ip_off, htons(IP_OFFMASK | IP_MF))) >> + return 0; >> + >> + /* Check max. IPv4 length. 
*/ >> + if (head.iplen + tail.iplen > IP_MAXPACKET) >> + return 0; >> + >> + /* Don't merge IPv4 packets with option headers. */ >> + if (head.iphlen != sizeof(struct ip) || >> + tail.iphlen != sizeof(struct ip)) >> + return 0; >> + >> + /* Don't merge non-TCP packets. */ >> + if (head.ip4->ip_p != IPPROTO_TCP || >> + tail.ip4->ip_p != IPPROTO_TCP) >> + return 0; >> + } else if (head.ip6 && tail.ip6) { >> + /* Check IPv6 addresses. */ >> + if (!IN6_ARE_ADDR_EQUAL(&head.ip6->ip6_src, &tail.ip6->ip6_src) || >> + !IN6_ARE_ADDR_EQUAL(&head.ip6->ip6_dst, &tail.ip6->ip6_dst)) >> + return 0; >> + >> + /* Check max. IPv6 length. */ >> + if ((head.iplen - head.iphlen) + >> + (tail.iplen - tail.iphlen) > IPV6_MAXPACKET) >> + return 0; >> + >> + /* Don't merge IPv6 packets with option headers nor non-TCP. */ >> + if (head.ip6->ip6_nxt != IPPROTO_TCP || >> + tail.ip6->ip6_nxt != IPPROTO_TCP) >> + return 0; >> + } else { >> + return 0; >> + } >> + >> + /* Check TCP header. */ >> + if (!head.tcp || !tail.tcp) >> + return 0; >> + >> + /* Check TCP ports. */ >> + if (head.tcp->th_sport != tail.tcp->th_sport || >> + head.tcp->th_dport != tail.tcp->th_dport) >> + return 0; >> + >> + /* Don't merge empty segments. */ >> + if (head.paylen == 0 || tail.paylen == 0) >> + return 0; >> + >> + /* Check for consecutive segments. */ >> + if (ntohl(head.tcp->th_seq) + head.paylen != ntohl(tail.tcp->th_seq)) >> + return 0; >> + >> + /* Just ACK and PUSH TCP flags are allowed. */ >> + if (ISSET(head.tcp->th_flags, ~(TH_ACK|TH_PUSH)) || >> + ISSET(tail.tcp->th_flags, ~(TH_ACK|TH_PUSH))) >> + return 0; >> + >> + /* TCP ACK flag has to be set. */ >> + if (!ISSET(head.tcp->th_flags, TH_ACK) || >> + !ISSET(tail.tcp->th_flags, TH_ACK)) >> + return 0; >> + >> + /* Ignore segments with different TCP options. 
*/ >> + if (head.tcphlen - sizeof(struct tcphdr) != >> + tail.tcphlen - sizeof(struct tcphdr)) >> + return 0; >> + >> + /* Check for TCP options */ >> + if (head.tcphlen > sizeof(struct tcphdr)) { >> + char *hopt = (char *)(head.tcp) + sizeof(struct tcphdr); >> + char *topt = (char *)(tail.tcp) + sizeof(struct tcphdr); >> + int optsize = head.tcphlen - sizeof(struct tcphdr); >> + int optlen; >> + >> + for (; optsize > 0; optsize -= optlen) { >> + /* Ignore segments with different TCP options. */ >> + if (hopt[0] != topt[0] || hopt[1] != topt[1]) >> + return 0; >> + >> + /* Get option length */ >> + optlen = hopt[1]; >> + if (hopt[0] == TCPOPT_NOP) >> + optlen = 1; >> + else if (optlen < 2 || optlen > optsize) >> + return 0; /* Illegal length */ >> + >> + if (hopt[0] != TCPOPT_NOP && >> + hopt[0] != TCPOPT_TIMESTAMP) >> + return 0; /* Unsupported TCP option */ >> + >> + hopt += optlen; >> + topt += optlen; >> + } >> + } >> + >> + /* >> + * Prepare concatenation of head and tail. >> + */ >> + >> + /* Adjust IP header. */ >> + if (head.ip4) { >> + head.ip4->ip_len = htons(head.iplen + tail.paylen); >> + } else if (head.ip6) { >> + head.ip6->ip6_plen = >> + htons(head.iplen - head.iphlen + tail.paylen); >> + } >> + >> + /* Combine TCP flags from head and tail. */ >> + if (ISSET(tail.tcp->th_flags, TH_PUSH)) >> + SET(head.tcp->th_flags, TH_PUSH); >> + >> + /* Adjust TCP header. */ >> + head.tcp->th_win = tail.tcp->th_win; >> + head.tcp->th_ack = tail.tcp->th_ack; >> + >> + /* Calculate header length of tail packet. */ >> + hdrlen = sizeof(*tail.eh); >> + if (tail.evh) >> + hdrlen = sizeof(*tail.evh); >> + hdrlen += tail.iphlen; >> + hdrlen += tail.tcphlen; >> + >> + /* Skip protocol headers in tail. */ >> + m_adj(mtail, hdrlen); >> + CLR(mtail->m_flags, M_PKTHDR); >> + >> + /* Concatenate */ >> + for (m = mhead; m->m_next;) >> + m = m->m_next; >> + m->m_next = mtail; >> + mhead->m_pkthdr.len += tail.paylen; >> + >> + /* Flag mbuf as TSO packet with MSS. 
*/ >> + if (!ISSET(mhead->m_pkthdr.csum_flags, M_TCP_TSO)) { >> + /* Set CSUM_OUT flags in case of forwarding. */ >> + SET(mhead->m_pkthdr.csum_flags, M_TCP_CSUM_OUT); >> + head.tcp->th_sum = 0; >> + if (head.ip4) { >> + SET(mhead->m_pkthdr.csum_flags, M_IPV4_CSUM_OUT); >> + head.ip4->ip_sum = 0; >> + } >> + >> + SET(mhead->m_pkthdr.csum_flags, M_TCP_TSO); >> + mhead->m_pkthdr.ph_mss = head.paylen; >> + tcpstat_inc(tcps_inhwlro); >> + tcpstat_inc(tcps_inpktlro); /* count head */ >> + } >> + mhead->m_pkthdr.ph_mss = MAX(mhead->m_pkthdr.ph_mss, tail.paylen); >> + tcpstat_inc(tcps_inpktlro); /* count tail */ >> + >> + return 1; >> +} >> + >> +void >> +tcp_softlro_enqueue(struct ifnet *ifp, struct mbuf_list *ml, struct mbuf *mtail) >> +{ >> + struct mbuf *mhead; >> + >> + if (!ISSET(ifp->if_xflags, IFXF_LRO)) >> + goto out; >> + >> + /* Don't merge packets with invalid header checksum. */ >> + if (!ISSET(mtail->m_pkthdr.csum_flags, M_TCP_CSUM_IN_OK)) >> + goto out; >> + >> + for (mhead = ml->ml_head; mhead != NULL; mhead = mhead->m_nextpkt) { >> + /* Don't merge packets with invalid header checksum. */ >> + if (!ISSET(mhead->m_pkthdr.csum_flags, M_TCP_CSUM_IN_OK)) >> + continue; >> + >> + /* Use RSS hash to skip packets of different connections. 
*/ >> + if (ISSET(mhead->m_pkthdr.csum_flags, M_FLOWID) && >> + ISSET(mtail->m_pkthdr.csum_flags, M_FLOWID) && >> + mhead->m_pkthdr.ph_flowid != mtail->m_pkthdr.ph_flowid) >> + continue; >> + >> + /* Don't merge packets of different VLANs */ >> + if (ISSET(mhead->m_flags, M_VLANTAG) != >> + ISSET(mtail->m_flags, M_VLANTAG)) >> + continue; >> + >> + if (ISSET(mhead->m_flags, M_VLANTAG) && >> + EVL_VLANOFTAG(mhead->m_pkthdr.ether_vtag) != >> + EVL_VLANOFTAG(mtail->m_pkthdr.ether_vtag)) >> + continue; >> + >> + if (tcp_softlro(mhead, mtail)) >> + return; >> + } >> + out: >> + ml_enqueue(ml, mtail); >> } >> Index: netinet/tcp_var.h >> =================================================================== >> RCS file: /cvs/src/sys/netinet/tcp_var.h,v >> diff -u -p -r1.186 tcp_var.h >> --- netinet/tcp_var.h 2 Mar 2025 21:28:32 -0000 1.186 >> +++ netinet/tcp_var.h 4 Mar 2025 14:53:48 -0000 >> @@ -720,6 +720,7 @@ void tcp_init(void); >> int tcp_input(struct mbuf **, int *, int, int, struct netstack *); >> int tcp_mss(struct tcpcb *, int); >> void tcp_mss_update(struct tcpcb *); >> +void tcp_softlro_enqueue(struct ifnet *, struct mbuf_list *, struct mbuf *); >> u_int tcp_hdrsz(struct tcpcb *); >> void tcp_mtudisc(struct inpcb *, int); >> void tcp_mtudisc_increase(struct inpcb *, int); > > -- > Yuichiro NAITO (naito.yuichiro@gmail.com) -- Yuichiro NAITO (naito.yuichiro@gmail.com)