From: Marcus Glocker
Subject: Re: TSO with small packets
To: Alexander Bluhm, Claudio Jeker
Cc: tech@openbsd.org
Date: Sat, 10 Feb 2024 21:15:37 +0100

On Fri, Feb 09, 2024 at 02:30:21PM +0100, Alexander Bluhm wrote:
> On Fri, Feb 09, 2024 at 01:50:53PM +0100, Claudio Jeker wrote:
> > On Fri, Feb 09, 2024 at 01:37:45PM +0100, Alexander Bluhm wrote:
> > > On Fri, Feb 09, 2024 at 12:11:52PM +0100, Claudio Jeker wrote:
> > > > On Fri, Feb 09, 2024 at 11:37:32AM +0100, Alexander Bluhm wrote:
> > > > > Hi,
> > > > >
> > > > > After a lot of debugging and testing with Hrvoje and mglocker@, we
> > > > > have found the cause for occasional watchdog timeouts with the em
> > > > > TSO diff.
> > > > >
> > > > > In this setup TCP packets are routed from ix(4) with LRO to em(4)
> > > > > with TSO.  It happens that ix hardware coalesces packets that are
> > > > > in total smaller than the MTU.  In real life this is rare, as you
> > > > > either have large TCP packets in a row, or small TCP packets only
> > > > > from time to time.
> > > > >
> > > > > Small packets that went through LRO have the TSO bit set.  But TSO
> > > > > is not necessary, as only a few small TCP packets have been
> > > > > reassembled.  This confuses the em hardware, so clear the TSO flag
> > > > > during the interface output path in that case.
> > > > >
> > > > > This is only relevant when forwarding to hardware TSO.  Our stack
> > > > > does not generate such packets and can handle them.  The diff below
> > > > > prevents them from reaching hardware TSO.
> > > > >
> > > > > ok?
> > > >
> > > > But shouldn't the system send out those small packets split up again?
> > > > At least I think the idea is that the effect of LRO is reversed while
> > > > sending out, so that the same number of packets is sent as without
> > > > LRO/TSO.
> > >
> > > The small packets in our case have 2 bytes of payload after LRO.  We
> > > instruct em to split them into 1 byte packets, and that fails with
> > > watchdog timeouts.  It works if I clear the TSO bit.
> > >
> > > But you are right, my diff may break path MTU discovery.  So
> > > m_pkthdr.len <= mtu is not ideal.  Is there a minimum TSO size
> > > hardware can handle?  We might use that instead of the MTU.
> >
> > Probably the minimum IPv6 packet size (576 IIRC) would be a good minimum.
>
> RFC 791: All hosts must be prepared to accept datagrams of up to
> 576 octets (whether they arrive whole or in fragments).
> RFC 791: Every internet module must be able to forward a datagram
> of 68 octets without further fragmentation.
> RFC 2460: IPv6 requires that every link in the internet have an MTU
> of 1280 octets or greater.
>
> I think 576 is a sane choice.  We don't care about fragments when
> doing path MTU.  The world is not IPv6.
>
> Hopefully all hardware can handle TSO to split packets of this size.
> We see the problems in em with around 60 byte size, depending on
> TCP header length.
>
> The diff should fix the packets that mglocker@ has seen in Hrvoje's
> setup.  For packets between 600 and 1400 bytes we don't know what
> hardware will do, but it cannot be worse than the current code.
>
> ok?
>
> bluhm
>
> Index: netinet/tcp_output.c
> ===================================================================
> RCS file: /data/mirror/openbsd/cvs/src/sys/netinet/tcp_output.c,v
> diff -u -p -r1.141 tcp_output.c
> --- netinet/tcp_output.c	26 Nov 2023 22:08:10 -0000	1.141
> +++ netinet/tcp_output.c	9 Feb 2024 13:11:58 -0000
> @@ -1360,6 +1360,12 @@ tcp_if_output_tso(struct ifnet *ifp, str
>  	/* caller must fail later or fragment */
>  	if (!ISSET((*mp)->m_pkthdr.csum_flags, M_TCP_TSO))
>  		return 0;
> +	/* send without hardware TSO if within minimum IPv4 packet size */
> +	if ((*mp)->m_pkthdr.len <= 576) {
> +		CLR((*mp)->m_pkthdr.csum_flags, M_TCP_TSO);
> +		return 0;
> +	}
> +	/* fragment or fail if interface cannot handle size after chopping */
>  	if ((*mp)->m_pkthdr.ph_mss > mtu) {
>  		CLR((*mp)->m_pkthdr.csum_flags, M_TCP_TSO);
>  		return 0;

The watchdog timeouts seen with em(4) TSO enabled [1] only happen when
em_tso_setup() receives an empty TCP frame, i.e. the payload is zero.
I can understand that the hardware fails when we ask it to TSO zero
bytes of payload -- this shouldn't happen.

Those empty TCP frames only arrive in em_tso_setup() when ix(4)
forwards packets.  I suspect the issue is somewhere in

http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/sys/dev/pci/if_ix.c.diff?r1=1.196&r2=1.197

but I don't understand this LSO code well enough to spot what might go
wrong there.

I can see that when this happens, ix(4) forwards a packet with an
m_pkthdr.len of 60 bytes (vlan header + 2 bytes of payload?) and an
m_pkthdr.ph_mss of 1 byte.  Why this results in em_tso_setup()
receiving a packet of 54 bytes (no payload), I don't understand.

The updated diff skips empty TCP frames in em_tso_setup().  With that,
the watchdog timeouts reported by Hrvoje are gone.  But obviously it's
not the solution for the issue in ix(4).

[1] We are only able to trigger those watchdog timeouts with a setup
    from Hrvoje which involves vlan(4) + ix(4), generating load with
    the TRex traffic generator.
Index: sys/dev/pci/if_em.c
===================================================================
RCS file: /cvs/src/sys/dev/pci/if_em.c,v
diff -u -p -u -p -r1.371 if_em.c
--- sys/dev/pci/if_em.c	28 Jan 2024 18:42:58 -0000	1.371
+++ sys/dev/pci/if_em.c	5 Feb 2024 21:24:04 -0000
@@ -291,6 +291,8 @@ void em_receive_checksum(struct em_softc
 	    struct mbuf *);
 u_int	em_transmit_checksum_setup(struct em_queue *, struct mbuf *, u_int,
 	    u_int32_t *, u_int32_t *);
+u_int	em_tso_setup(struct em_queue *, struct mbuf *, u_int, u_int32_t *,
+	    u_int32_t *);
 u_int	em_tx_ctx_setup(struct em_queue *, struct mbuf *, u_int, u_int32_t *,
 	    u_int32_t *);
 void	em_iff(struct em_softc *);
@@ -1188,7 +1190,7 @@ em_flowstatus(struct em_softc *sc)
  *
  *  This routine maps the mbufs to tx descriptors.
  *
- *  return 0 on success, positive on failure
+ *  return 0 on failure, positive on success
  **********************************************************************/
 u_int
 em_encap(struct em_queue *que, struct mbuf *m)
@@ -1236,7 +1238,15 @@ em_encap(struct em_queue *que, struct mb
 	}
 
 	if (sc->hw.mac_type >= em_82575 && sc->hw.mac_type <= em_i210) {
-		used += em_tx_ctx_setup(que, m, head, &txd_upper, &txd_lower);
+		if (ISSET(m->m_pkthdr.csum_flags, M_TCP_TSO)) {
+			used += em_tso_setup(que, m, head, &txd_upper,
+			    &txd_lower);
+			if (!used)
+				return (used);
+		} else {
+			used += em_tx_ctx_setup(que, m, head, &txd_upper,
+			    &txd_lower);
+		}
 	} else if (sc->hw.mac_type >= em_82543) {
 		used += em_transmit_checksum_setup(que, m, head,
 		    &txd_upper, &txd_lower);
@@ -1569,6 +1579,21 @@ em_update_link_status(struct em_softc *s
 		ifp->if_link_state = link_state;
 		if_link_state_change(ifp);
 	}
+
+	/* Disable TSO for 10/100 speeds to avoid some hardware issues */
+	switch (sc->link_speed) {
+	case SPEED_10:
+	case SPEED_100:
+		if (sc->hw.mac_type >= em_82575 && sc->hw.mac_type <= em_i210) {
+			ifp->if_capabilities &= ~IFCAP_TSOv4;
+			ifp->if_capabilities &= ~IFCAP_TSOv6;
+		}
+		break;
+	case SPEED_1000:
+		if (sc->hw.mac_type >= em_82575 && sc->hw.mac_type <= em_i210)
+			ifp->if_capabilities |= IFCAP_TSOv4 | IFCAP_TSOv6;
+		break;
+	}
 }
 
 /*********************************************************************
@@ -1988,6 +2013,7 @@ em_setup_interface(struct em_softc *sc)
 	if (sc->hw.mac_type >= em_82575 && sc->hw.mac_type <= em_i210) {
 		ifp->if_capabilities |= IFCAP_CSUM_IPv4;
 		ifp->if_capabilities |= IFCAP_CSUM_TCPv6 | IFCAP_CSUM_UDPv6;
+		ifp->if_capabilities |= IFCAP_TSOv4 | IFCAP_TSOv6;
 	}
 
 	/*
@@ -2231,9 +2257,9 @@ em_setup_transmit_structures(struct em_s
 	for (i = 0; i < sc->sc_tx_slots; i++) {
 		pkt = &que->tx.sc_tx_pkts_ring[i];
-		error = bus_dmamap_create(sc->sc_dmat, MAX_JUMBO_FRAME_SIZE,
+		error = bus_dmamap_create(sc->sc_dmat, EM_TSO_SIZE,
 		    EM_MAX_SCATTER / (sc->pcix_82544 ? 2 : 1),
-		    MAX_JUMBO_FRAME_SIZE, 0, BUS_DMA_NOWAIT, &pkt->pkt_map);
+		    EM_TSO_SEG_SIZE, 0, BUS_DMA_NOWAIT, &pkt->pkt_map);
 		if (error != 0) {
 			printf("%s: Unable to create TX DMA map\n",
 			    DEVNAME(sc));
@@ -2403,6 +2429,85 @@ em_free_transmit_structures(struct em_so
 		    0, que->tx.sc_tx_dma.dma_map->dm_mapsize,
 		    BUS_DMASYNC_POSTREAD | BUS_DMASYNC_POSTWRITE);
 	}
+}
+
+u_int
+em_tso_setup(struct em_queue *que, struct mbuf *mp, u_int head,
+    u_int32_t *olinfo_status, u_int32_t *cmd_type_len)
+{
+	struct ether_extracted ext;
+	struct e1000_adv_tx_context_desc *TD;
+	uint32_t vlan_macip_lens = 0, type_tucmd_mlhl = 0, mss_l4len_idx = 0;
+	uint32_t paylen = 0;
+	uint8_t iphlen = 0;
+
+	*olinfo_status = 0;
+	*cmd_type_len = 0;
+	TD = (struct e1000_adv_tx_context_desc *)&que->tx.sc_tx_desc_ring[head];
+
+#if NVLAN > 0
+	if (ISSET(mp->m_flags, M_VLANTAG)) {
+		uint32_t vtag = mp->m_pkthdr.ether_vtag;
+		vlan_macip_lens |= vtag << E1000_ADVTXD_VLAN_SHIFT;
+		*cmd_type_len |= E1000_ADVTXD_DCMD_VLE;
+	}
+#endif
+
+	ether_extract_headers(mp, &ext);
+	if (ext.tcp == NULL)
+		goto out;
+
+	vlan_macip_lens |= (sizeof(*ext.eh) << E1000_ADVTXD_MACLEN_SHIFT);
+
+	if (ext.ip4) {
+		iphlen = ext.ip4->ip_hl << 2;
+
+		type_tucmd_mlhl |= E1000_ADVTXD_TUCMD_IPV4;
+		*olinfo_status |= E1000_TXD_POPTS_IXSM << 8;
+#ifdef INET6
+	} else if (ext.ip6) {
+		iphlen = sizeof(*ext.ip6);
+
+		type_tucmd_mlhl |= E1000_ADVTXD_TUCMD_IPV6;
+#endif
+	} else {
+		goto out;
+	}
+
+	*cmd_type_len |= E1000_ADVTXD_DTYP_DATA | E1000_ADVTXD_DCMD_IFCS;
+	*cmd_type_len |= E1000_ADVTXD_DCMD_DEXT | E1000_ADVTXD_DCMD_TSE;
+	paylen = mp->m_pkthdr.len - sizeof(*ext.eh) - iphlen -
+	    (ext.tcp->th_off << 2);
+	if (paylen == 0) {
+		/* Skip TCP frame with no payload to avoid watchdog timeout */
+		return 0;
+	}
+	*olinfo_status |= paylen << E1000_ADVTXD_PAYLEN_SHIFT;
+	vlan_macip_lens |= iphlen;
+	type_tucmd_mlhl |= E1000_ADVTXD_DCMD_DEXT | E1000_ADVTXD_DTYP_CTXT;
+
+	type_tucmd_mlhl |= E1000_ADVTXD_TUCMD_L4T_TCP;
+	*olinfo_status |= E1000_TXD_POPTS_TXSM << 8;
+
+	mss_l4len_idx |= mp->m_pkthdr.ph_mss << E1000_ADVTXD_MSS_SHIFT;
+	mss_l4len_idx |= (ext.tcp->th_off << 2) << E1000_ADVTXD_L4LEN_SHIFT;
+	/* 82575 needs the queue index added */
+	if (que->sc->hw.mac_type == em_82575)
+		mss_l4len_idx |= (que->me & 0xff) << 4;
+
+	htolem32(&TD->vlan_macip_lens, vlan_macip_lens);
+	htolem32(&TD->type_tucmd_mlhl, type_tucmd_mlhl);
+	htolem32(&TD->u.seqnum_seed, 0);
+	htolem32(&TD->mss_l4len_idx, mss_l4len_idx);
+
+	tcpstat_add(tcps_outpkttso, (paylen + mp->m_pkthdr.ph_mss - 1) /
+	    mp->m_pkthdr.ph_mss);
+
+	return 1;
+
+out:
+	tcpstat_inc(tcps_outbadtso);
+	return 0;
 }
 
 u_int
Index: sys/dev/pci/if_em.h
===================================================================
RCS file: /cvs/src/sys/dev/pci/if_em.h,v
diff -u -p -u -p -r1.82 if_em.h
--- sys/dev/pci/if_em.h	28 Jan 2024 18:42:58 -0000	1.82
+++ sys/dev/pci/if_em.h	5 Feb 2024 21:24:04 -0000
@@ -55,11 +55,14 @@ POSSIBILITY OF SUCH DAMAGE.
 #include
 #include
+#include
 #include
 #include
 #include
 #include
+#include
+#include
 #include
 
 #if NBPFILTER > 0
@@ -269,6 +272,7 @@ typedef int boolean_t;
 
 #define EM_MAX_SCATTER		64
 #define EM_TSO_SIZE		65535
+#define EM_TSO_SEG_SIZE		4096	/* Max dma segment size */
 
 struct em_packet {
 	int	 pkt_eop;	/* Index of the desc to watch */
Index: sys/dev/pci/if_em_hw.h
===================================================================
RCS file: /cvs/src/sys/dev/pci/if_em_hw.h,v
diff -u -p -u -p -r1.92 if_em_hw.h
--- sys/dev/pci/if_em_hw.h	28 Jan 2024 18:42:58 -0000	1.92
+++ sys/dev/pci/if_em_hw.h	5 Feb 2024 21:24:04 -0000
@@ -2150,6 +2150,7 @@ struct e1000_adv_tx_context_desc {
 #define E1000_ADVTXD_DCMD_IFCS	0x02000000 /* Insert FCS (Ethernet CRC) */
 #define E1000_ADVTXD_DCMD_DEXT	0x20000000 /* Descriptor extension (1=Adv) */
 #define E1000_ADVTXD_DCMD_VLE	0x40000000 /* VLAN pkt enable */
+#define E1000_ADVTXD_DCMD_TSE	0x80000000 /* TCP Seg enable */
 #define E1000_ADVTXD_PAYLEN_SHIFT	14 /* Adv desc PAYLEN shift */
 
 /* Adv Transmit Descriptor Config Masks */
@@ -2159,6 +2160,10 @@ struct e1000_adv_tx_context_desc {
 #define E1000_ADVTXD_TUCMD_IPV6	0x00000000 /* IP Packet Type: 0=IPv6 */
 #define E1000_ADVTXD_TUCMD_L4T_UDP	0x00000000 /* L4 Packet TYPE of UDP */
 #define E1000_ADVTXD_TUCMD_L4T_TCP	0x00000800 /* L4 Packet TYPE of TCP */
+
+/* Req requires Markers and CRC */
+#define E1000_ADVTXD_L4LEN_SHIFT	8  /* Adv ctxt L4LEN shift */
+#define E1000_ADVTXD_MSS_SHIFT	16 /* Adv ctxt MSS shift */
 
 /* Multiple Receive Queue Control */
 #define E1000_MRQC_ENABLE_MASK	0x00000003