From: Claudio Jeker Subject: Re: TSO with small packets To: Marcus Glocker Cc: Alexander Bluhm , tech@openbsd.org Date: Sat, 10 Feb 2024 22:22:03 +0100 On Sat, Feb 10, 2024 at 09:15:37PM +0100, Marcus Glocker wrote: > On Fri, Feb 09, 2024 at 02:30:21PM +0100, Alexander Bluhm wrote: > > > On Fri, Feb 09, 2024 at 01:50:53PM +0100, Claudio Jeker wrote: > > > On Fri, Feb 09, 2024 at 01:37:45PM +0100, Alexander Bluhm wrote: > > > > On Fri, Feb 09, 2024 at 12:11:52PM +0100, Claudio Jeker wrote: > > > > > On Fri, Feb 09, 2024 at 11:37:32AM +0100, Alexander Bluhm wrote: > > > > > > Hi, > > > > > > > > > > > > After a lot of debugging and testing with Hrvoje and mglocker@, we > > > > > > have found the cause for occasional watchdog timeouts with em TSO > > > > > > diff. > > > > > > > > > > > > In this setup TCP packets are routed from ix(4) with LRO to em(4) > > > > > > with TSO. It happens that ix hardware coalesces packets that are > > > > > > in total smaller than MTU. In real live this is rare as either you > > > > > > have large TCP packets in a row, or small TCP packets from time to > > > > > > time. > > > > > > > > > > > > Small packets that went though LRO have the TSO bit set. But TSO > > > > > > is not necessary as only few small TCP packets have been reassembled. > > > > > > This confuses em hardware. So clear TSO flag during interface > > > > > > output path in that case. > > > > > > > > > > > > This is only relevant when forwarding to hardware TSO. Our stack > > > > > > does not generate such packets and can handle them. Diff below > > > > > > prevents them to reach hardware TSO. > > > > > > > > > > > > ok? > > > > > > > > > > But shouldn't the system send out those small packets split up again? > > > > > At least I think the idea is that the effect of LRO is reversed while > > > > > sending out so that the same amount of packets are sent as without LRO/TSO. > > > > > > > > The small packets in our case have 2 bytes payload after LRO. We > > > > instruct em to split them in 1 byte packets and that fails with > > > > watchdog timeouts. It works if I clear the TSO bit. > > > > > > > > But you are right, my diff may break path MTU discovery. So > > > > m_pkthdr.len <= mtu is not ideal. Is there a minmum TSO size > > > > hardware can handle? We might use that instead of MTU. > > > > > > Probably the minumum IPv6 packet size (576 IIRC) would be a good minimum. > > > > RFC 791: All hosts must be prepared to accept datagrams of up to > > 576 octets (whether they arrive whole or in fragments). > > RFC 791: Every internet module must be able to forward a datagram > > of 68 octets without further fragmentation. > > RFC 1280: IPv6 requires that every link in the internet have an MTU > > of 1280 octets or greater. > > > > I think 576 is a sane choice. We don't care about fragments when > > doing path MTU. The world is not IPv6. > > > > Hopefully all hardware can handle TSO to split packets of this size. > > We see the problems in em with around 60 byte size, depending on > > TCP header length. > > > > The diff should fix the packets that Mglocker has seen in Hrvoje's > > setup. For packets between 600 and 1400 bytes we don't know what > > hardware will do, but it cannot be worse than the current code. > > > > ok? > > > > bluhm > > > > Index: netinet/tcp_output.c > > =================================================================== > > RCS file: /data/mirror/openbsd/cvs/src/sys/netinet/tcp_output.c,v > > diff -u -p -r1.141 tcp_output.c > > --- netinet/tcp_output.c 26 Nov 2023 22:08:10 -0000 1.141 > > +++ netinet/tcp_output.c 9 Feb 2024 13:11:58 -0000 > > @@ -1360,6 +1360,12 @@ tcp_if_output_tso(struct ifnet *ifp, str > > /* caller must fail later or fragment */ > > if (!ISSET((*mp)->m_pkthdr.csum_flags, M_TCP_TSO)) > > return 0; > > + /* send without hardware TSO if within minimum IPv4 packet size */ > > + if ((*mp)->m_pkthdr.len <= 576) { > > + CLR((*mp)->m_pkthdr.csum_flags, M_TCP_TSO); > > + return 0; > > + } > > + /* fragment or fail if interface cannot handle size after chopping */ > > if ((*mp)->m_pkthdr.ph_mss > mtu) { > > CLR((*mp)->m_pkthdr.csum_flags, M_TCP_TSO); > > return 0; > > > > The watchdog timeouts seen with em(4) tso enabled [1] do only happen when > em_tso_setup() receives an empty tcp frame, so the payload is zero. > I can understand that the hardware fails when we ask to tso zero bytes > of payload -- This shouldn't happen. > > Those empty tcp frames are only arriving in em_tso_setup() when ix(4) > forwards packets. I suspect the issue somewhere in > > http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/sys/dev/pci/if_ix.c.diff?r1=1.196&r2=1.197 > > But I don't understand this LSO code good enough to spot what might go > wrong there. I can see that when this happens, ix(4) forwards a packet > with 60 bytes (vlan header + 2 bytes payload?) m_pkthdr.len, and 1 byte > m_pkthdr.ph_mss. Why this results in em_tso_setup() receiving an > packet with 54 bytes (no payload), I don't understand. This is because the minimal Ethernet packet is 64 bytes but since vlan header is used it is reduced to 60. There is a trailer in this packet since the payload (IP len) is less than 60 / 64 bytes. 54 bytes is probably 12 (ETH HDR) + 20 (IP) + 20 (TCP) + 2 Payload > The updated diff skips empty tcp frames em_tso_setup(). With that the > watchdog timeouts reported by Hrvoje are gone. But obviously it's not > the solution for the issue in ix(4). Yes, this smells like a ix(4) LRO bug that may need a software workaround in ix(4) itself. > [1] We are only able to trigger those watchdog timeouts with an setup > from Hrvoje which involves, vlan(4) + ix(4), and generate load with > TRex traffic generator. > > > Index: sys/dev/pci/if_em.c > =================================================================== > RCS file: /cvs/src/sys/dev/pci/if_em.c,v > diff -u -p -u -p -r1.371 if_em.c > --- sys/dev/pci/if_em.c 28 Jan 2024 18:42:58 -0000 1.371 > +++ sys/dev/pci/if_em.c 5 Feb 2024 21:24:04 -0000 > @@ -291,6 +291,8 @@ void em_receive_checksum(struct em_softc > struct mbuf *); > u_int em_transmit_checksum_setup(struct em_queue *, struct mbuf *, u_int, > u_int32_t *, u_int32_t *); > +u_int em_tso_setup(struct em_queue *, struct mbuf *, u_int, u_int32_t *, > + u_int32_t *); > u_int em_tx_ctx_setup(struct em_queue *, struct mbuf *, u_int, u_int32_t *, > u_int32_t *); > void em_iff(struct em_softc *); > @@ -1188,7 +1190,7 @@ em_flowstatus(struct em_softc *sc) > * > * This routine maps the mbufs to tx descriptors. > * > - * return 0 on success, positive on failure > + * return 0 on failure, positive on success > **********************************************************************/ > u_int > em_encap(struct em_queue *que, struct mbuf *m) > @@ -1236,7 +1238,15 @@ em_encap(struct em_queue *que, struct mb > } > > if (sc->hw.mac_type >= em_82575 && sc->hw.mac_type <= em_i210) { > - used += em_tx_ctx_setup(que, m, head, &txd_upper, &txd_lower); > + if (ISSET(m->m_pkthdr.csum_flags, M_TCP_TSO)) { > + used += em_tso_setup(que, m, head, &txd_upper, > + &txd_lower); > + if (!used) > + return (used); > + } else { > + used += em_tx_ctx_setup(que, m, head, &txd_upper, > + &txd_lower); > + } > } else if (sc->hw.mac_type >= em_82543) { > used += em_transmit_checksum_setup(que, m, head, > &txd_upper, &txd_lower); > @@ -1569,6 +1579,21 @@ em_update_link_status(struct em_softc *s > ifp->if_link_state = link_state; > if_link_state_change(ifp); > } > + > + /* Disable TSO for 10/100 speeds to avoid some hardware issues */ > + switch (sc->link_speed) { > + case SPEED_10: > + case SPEED_100: > + if (sc->hw.mac_type >= em_82575 && sc->hw.mac_type <= em_i210) { > + ifp->if_capabilities &= ~IFCAP_TSOv4; > + ifp->if_capabilities &= ~IFCAP_TSOv6; > + } > + break; > + case SPEED_1000: > + if (sc->hw.mac_type >= em_82575 && sc->hw.mac_type <= em_i210) > + ifp->if_capabilities |= IFCAP_TSOv4 | IFCAP_TSOv6; > + break; > + } > } > > /********************************************************************* > @@ -1988,6 +2013,7 @@ em_setup_interface(struct em_softc *sc) > if (sc->hw.mac_type >= em_82575 && sc->hw.mac_type <= em_i210) { > ifp->if_capabilities |= IFCAP_CSUM_IPv4; > ifp->if_capabilities |= IFCAP_CSUM_TCPv6 | IFCAP_CSUM_UDPv6; > + ifp->if_capabilities |= IFCAP_TSOv4 | IFCAP_TSOv6; > } > > /* > @@ -2231,9 +2257,9 @@ em_setup_transmit_structures(struct em_s > > for (i = 0; i < sc->sc_tx_slots; i++) { > pkt = &que->tx.sc_tx_pkts_ring[i]; > - error = bus_dmamap_create(sc->sc_dmat, MAX_JUMBO_FRAME_SIZE, > + error = bus_dmamap_create(sc->sc_dmat, EM_TSO_SIZE, > EM_MAX_SCATTER / (sc->pcix_82544 ? 2 : 1), > - MAX_JUMBO_FRAME_SIZE, 0, BUS_DMA_NOWAIT, &pkt->pkt_map); > + EM_TSO_SEG_SIZE, 0, BUS_DMA_NOWAIT, &pkt->pkt_map); > if (error != 0) { > printf("%s: Unable to create TX DMA map\n", > DEVNAME(sc)); > @@ -2403,6 +2429,85 @@ em_free_transmit_structures(struct em_so > 0, que->tx.sc_tx_dma.dma_map->dm_mapsize, > BUS_DMASYNC_POSTREAD | BUS_DMASYNC_POSTWRITE); > } > +} > + > +u_int > +em_tso_setup(struct em_queue *que, struct mbuf *mp, u_int head, > + u_int32_t *olinfo_status, u_int32_t *cmd_type_len) > +{ > + struct ether_extracted ext; > + struct e1000_adv_tx_context_desc *TD; > + uint32_t vlan_macip_lens = 0, type_tucmd_mlhl = 0, mss_l4len_idx = 0; > + uint32_t paylen = 0; > + uint8_t iphlen = 0; > + > + *olinfo_status = 0; > + *cmd_type_len = 0; > + TD = (struct e1000_adv_tx_context_desc *)&que->tx.sc_tx_desc_ring[head]; > + > +#if NVLAN > 0 > + if (ISSET(mp->m_flags, M_VLANTAG)) { > + uint32_t vtag = mp->m_pkthdr.ether_vtag; > + vlan_macip_lens |= vtag << E1000_ADVTXD_VLAN_SHIFT; > + *cmd_type_len |= E1000_ADVTXD_DCMD_VLE; > + } > +#endif > + > + ether_extract_headers(mp, &ext); > + if (ext.tcp == NULL) > + goto out; > + > + vlan_macip_lens |= (sizeof(*ext.eh) << E1000_ADVTXD_MACLEN_SHIFT); > + > + if (ext.ip4) { > + iphlen = ext.ip4->ip_hl << 2; > + > + type_tucmd_mlhl |= E1000_ADVTXD_TUCMD_IPV4; > + *olinfo_status |= E1000_TXD_POPTS_IXSM << 8; > +#ifdef INET6 > + } else if (ext.ip6) { > + iphlen = sizeof(*ext.ip6); > + > + type_tucmd_mlhl |= E1000_ADVTXD_TUCMD_IPV6; > +#endif > + } else { > + goto out; > + } > + > + *cmd_type_len |= E1000_ADVTXD_DTYP_DATA | E1000_ADVTXD_DCMD_IFCS; > + *cmd_type_len |= E1000_ADVTXD_DCMD_DEXT | E1000_ADVTXD_DCMD_TSE; > + paylen = mp->m_pkthdr.len - sizeof(*ext.eh) - iphlen - > + (ext.tcp->th_off << 2); > + if (paylen == 0) { > + /* Skip TCP frame with no payload to avoid watchdog timeout */ > + return 0; > + } > + *olinfo_status |= paylen << E1000_ADVTXD_PAYLEN_SHIFT; > + vlan_macip_lens |= iphlen; > + type_tucmd_mlhl |= E1000_ADVTXD_DCMD_DEXT | E1000_ADVTXD_DTYP_CTXT; > + > + type_tucmd_mlhl |= E1000_ADVTXD_TUCMD_L4T_TCP; > + *olinfo_status |= E1000_TXD_POPTS_TXSM << 8; > + > + mss_l4len_idx |= mp->m_pkthdr.ph_mss << E1000_ADVTXD_MSS_SHIFT; > + mss_l4len_idx |= (ext.tcp->th_off << 2) << E1000_ADVTXD_L4LEN_SHIFT; > + /* 82575 needs the queue index added */ > + if (que->sc->hw.mac_type == em_82575) > + mss_l4len_idx |= (que->me & 0xff) << 4; > + > + htolem32(&TD->vlan_macip_lens, vlan_macip_lens); > + htolem32(&TD->type_tucmd_mlhl, type_tucmd_mlhl); > + htolem32(&TD->u.seqnum_seed, 0); > + htolem32(&TD->mss_l4len_idx, mss_l4len_idx); > + > + tcpstat_add(tcps_outpkttso, (paylen + mp->m_pkthdr.ph_mss - 1) / > + mp->m_pkthdr.ph_mss); > + > + return 1; > + > +out: > + tcpstat_inc(tcps_outbadtso); > + return 0; > } > > u_int > Index: sys/dev/pci/if_em.h > =================================================================== > RCS file: /cvs/src/sys/dev/pci/if_em.h,v > diff -u -p -u -p -r1.82 if_em.h > --- sys/dev/pci/if_em.h 28 Jan 2024 18:42:58 -0000 1.82 > +++ sys/dev/pci/if_em.h 5 Feb 2024 21:24:04 -0000 > @@ -55,11 +55,14 @@ POSSIBILITY OF SUCH DAMAGE. > > #include > #include > +#include > > #include > #include > #include > #include > +#include > +#include > #include > > #if NBPFILTER > 0 > @@ -269,6 +272,7 @@ typedef int boolean_t; > > #define EM_MAX_SCATTER 64 > #define EM_TSO_SIZE 65535 > +#define EM_TSO_SEG_SIZE 4096 /* Max dma segment size */ > > struct em_packet { > int pkt_eop; /* Index of the desc to watch */ > Index: sys/dev/pci/if_em_hw.h > =================================================================== > RCS file: /cvs/src/sys/dev/pci/if_em_hw.h,v > diff -u -p -u -p -r1.92 if_em_hw.h > --- sys/dev/pci/if_em_hw.h 28 Jan 2024 18:42:58 -0000 1.92 > +++ sys/dev/pci/if_em_hw.h 5 Feb 2024 21:24:04 -0000 > @@ -2150,6 +2150,7 @@ struct e1000_adv_tx_context_desc { > #define E1000_ADVTXD_DCMD_IFCS 0x02000000 /* Insert FCS (Ethernet CRC) */ > #define E1000_ADVTXD_DCMD_DEXT 0x20000000 /* Descriptor extension (1=Adv) */ > #define E1000_ADVTXD_DCMD_VLE 0x40000000 /* VLAN pkt enable */ > +#define E1000_ADVTXD_DCMD_TSE 0x80000000 /* TCP Seg enable */ > #define E1000_ADVTXD_PAYLEN_SHIFT 14 /* Adv desc PAYLEN shift */ > > /* Adv Transmit Descriptor Config Masks */ > @@ -2159,6 +2160,10 @@ struct e1000_adv_tx_context_desc { > #define E1000_ADVTXD_TUCMD_IPV6 0x00000000 /* IP Packet Type: 0=IPv6 */ > #define E1000_ADVTXD_TUCMD_L4T_UDP 0x00000000 /* L4 Packet TYPE of UDP */ > #define E1000_ADVTXD_TUCMD_L4T_TCP 0x00000800 /* L4 Packet TYPE of TCP */ > + > +/* Req requires Markers and CRC */ > +#define E1000_ADVTXD_L4LEN_SHIFT 8 /* Adv ctxt L4LEN shift */ > +#define E1000_ADVTXD_MSS_SHIFT 16 /* Adv ctxt MSS shift */ > > /* Multiple Receive Queue Control */ > #define E1000_MRQC_ENABLE_MASK 0x00000003 > -- :wq Claudio