TSO with small packets
On Fri, Feb 09, 2024 at 02:30:21PM +0100, Alexander Bluhm wrote:
> On Fri, Feb 09, 2024 at 01:50:53PM +0100, Claudio Jeker wrote:
> > On Fri, Feb 09, 2024 at 01:37:45PM +0100, Alexander Bluhm wrote:
> > > On Fri, Feb 09, 2024 at 12:11:52PM +0100, Claudio Jeker wrote:
> > > > On Fri, Feb 09, 2024 at 11:37:32AM +0100, Alexander Bluhm wrote:
> > > > > Hi,
> > > > >
> > > > > After a lot of debugging and testing with Hrvoje and mglocker@, we
> > > > > have found the cause of occasional watchdog timeouts with the em(4)
> > > > > TSO diff.
> > > > >
> > > > > In this setup TCP packets are routed from ix(4) with LRO to em(4)
> > > > > with TSO. It happens that ix hardware coalesces packets that are
> > > > > in total smaller than the MTU. In real life this is rare, as you
> > > > > either get large TCP packets in a row or small TCP packets from time
> > > > > to time.
> > > > >
> > > > > Small packets that went through LRO have the TSO bit set. But TSO
> > > > > is not necessary, as only a few small TCP packets have been
> > > > > reassembled. This confuses the em hardware. So clear the TSO flag
> > > > > in the interface output path in that case.
> > > > >
> > > > > This is only relevant when forwarding to an interface with hardware
> > > > > TSO. Our stack does not generate such packets and can handle them.
> > > > > The diff below prevents them from reaching hardware TSO.
> > > > >
> > > > > ok?
> > > >
> > > > But shouldn't the system send out those small packets split up again?
> > > > At least I think the idea is that the effect of LRO is reversed while
> > > > sending out, so that the same number of packets is sent as without
> > > > LRO/TSO.
> > >
> > > The small packets in our case have 2 bytes of payload after LRO. We
> > > instruct em to split them into 1-byte packets, and that fails with
> > > watchdog timeouts. It works if I clear the TSO bit.
> > >
> > > But you are right, my diff may break path MTU discovery. So
> > > m_pkthdr.len <= mtu is not ideal. Is there a minimum TSO size the
> > > hardware can handle? We might use that instead of the MTU.
> >
> > Probably the minimum IPv6 packet size (576 IIRC) would be a good minimum.
>
> RFC 791: All hosts must be prepared to accept datagrams of up to
> 576 octets (whether they arrive whole or in fragments).
> RFC 791: Every internet module must be able to forward a datagram
> of 68 octets without further fragmentation.
> RFC 2460: IPv6 requires that every link in the internet have an MTU
> of 1280 octets or greater.
>
> I think 576 is a sane choice. We don't care about fragments when
> doing path MTU. The world is not IPv6.
>
> Hopefully all hardware can handle splitting packets of this size with
> TSO. We see the problems in em at around 60 bytes, depending on the
> TCP header length.
>
> The diff should fix the problem with the packets that mglocker@ has
> seen in Hrvoje's setup. For packets between 600 and 1400 bytes we
> don't know what the hardware will do, but it cannot be worse than the
> current code.
>
> ok?
>
> bluhm
>
> Index: netinet/tcp_output.c
> ===================================================================
> RCS file: /data/mirror/openbsd/cvs/src/sys/netinet/tcp_output.c,v
> diff -u -p -r1.141 tcp_output.c
> --- netinet/tcp_output.c 26 Nov 2023 22:08:10 -0000 1.141
> +++ netinet/tcp_output.c 9 Feb 2024 13:11:58 -0000
> @@ -1360,6 +1360,12 @@ tcp_if_output_tso(struct ifnet *ifp, str
> /* caller must fail later or fragment */
> if (!ISSET((*mp)->m_pkthdr.csum_flags, M_TCP_TSO))
> return 0;
> + /* send without hardware TSO if within minimum IPv4 packet size */
> + if ((*mp)->m_pkthdr.len <= 576) {
> + CLR((*mp)->m_pkthdr.csum_flags, M_TCP_TSO);
> + return 0;
> + }
> + /* fragment or fail if interface cannot handle size after chopping */
> if ((*mp)->m_pkthdr.ph_mss > mtu) {
> CLR((*mp)->m_pkthdr.csum_flags, M_TCP_TSO);
> return 0;
>
The watchdog timeouts seen with em(4) TSO enabled [1] only happen when
em_tso_setup() receives an empty TCP frame, i.e. the payload is zero.
I can understand that the hardware fails when we ask it to TSO zero
bytes of payload; this shouldn't happen.

Those empty TCP frames only arrive in em_tso_setup() when ix(4)
forwards packets. I suspect the issue is somewhere in
http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/sys/dev/pci/if_ix.c.diff?r1=1.196&r2=1.197
But I don't understand this LSO code well enough to spot what might go
wrong there. I can see that when this happens, ix(4) forwards a packet
with an m_pkthdr.len of 60 bytes (vlan header + 2 bytes of payload?)
and an m_pkthdr.ph_mss of 1 byte. Why this results in em_tso_setup()
receiving a packet with 54 bytes (no payload), I don't understand.
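
(For reference, assuming no IP or TCP options: if the 60-byte length
seen at ix(4) includes the Ethernet and VLAN headers, it accounts for
14 + 4 + 20 + 20 + 2 = 60, which matches the guess of a VLAN tag plus
2 bytes of payload; 54 bytes is exactly 14 + 20 + 20, i.e. headers
only.)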

The updated diff skips empty TCP frames in em_tso_setup(). With that
the watchdog timeouts reported by Hrvoje are gone. But obviously it's
not the solution for the issue in ix(4).

[1] We are only able to trigger those watchdog timeouts with a setup
from Hrvoje which involves vlan(4) + ix(4), generating load with the
TRex traffic generator.
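
To make the zero-payload case easier to follow, here is a minimal
standalone sketch (editor's illustration, not driver code) of the
payload accounting that em_tso_setup() in the diff below performs.
The header sizes are assumptions for IPv4 and TCP without options,
and the VLAN tag is carried out of band in m_pkthdr.ether_vtag:

/*
 * Sketch of the paylen calculation in em_tso_setup(); the constants
 * are assumptions for the no-options case.
 */
#include <stdio.h>

int
main(void)
{
        unsigned int pktlen = 54;       /* m_pkthdr.len handed to em(4) */
        unsigned int ehlen = 14;        /* Ethernet header */
        unsigned int iphlen = 20;       /* ip_hl << 2, no IP options */
        unsigned int thlen = 20;        /* th_off << 2, no TCP options */
        unsigned int paylen;

        paylen = pktlen - ehlen - iphlen - thlen;
        if (paylen == 0) {
                /* what the updated diff does: do not request TSO */
                printf("empty TCP frame, skip hardware TSO\n");
                return 0;
        }
        printf("paylen %u\n", paylen);
        return 0;
}
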
Index: sys/dev/pci/if_em.c
===================================================================
RCS file: /cvs/src/sys/dev/pci/if_em.c,v
diff -u -p -u -p -r1.371 if_em.c
--- sys/dev/pci/if_em.c 28 Jan 2024 18:42:58 -0000 1.371
+++ sys/dev/pci/if_em.c 5 Feb 2024 21:24:04 -0000
@@ -291,6 +291,8 @@ void em_receive_checksum(struct em_softc
struct mbuf *);
u_int em_transmit_checksum_setup(struct em_queue *, struct mbuf *, u_int,
u_int32_t *, u_int32_t *);
+u_int em_tso_setup(struct em_queue *, struct mbuf *, u_int, u_int32_t *,
+ u_int32_t *);
u_int em_tx_ctx_setup(struct em_queue *, struct mbuf *, u_int, u_int32_t *,
u_int32_t *);
void em_iff(struct em_softc *);
@@ -1188,7 +1190,7 @@ em_flowstatus(struct em_softc *sc)
*
* This routine maps the mbufs to tx descriptors.
*
- * return 0 on success, positive on failure
+ * return 0 on failure, positive on success
**********************************************************************/
u_int
em_encap(struct em_queue *que, struct mbuf *m)
@@ -1236,7 +1238,15 @@ em_encap(struct em_queue *que, struct mb
}
if (sc->hw.mac_type >= em_82575 && sc->hw.mac_type <= em_i210) {
- used += em_tx_ctx_setup(que, m, head, &txd_upper, &txd_lower);
+ if (ISSET(m->m_pkthdr.csum_flags, M_TCP_TSO)) {
+ used += em_tso_setup(que, m, head, &txd_upper,
+ &txd_lower);
+ if (!used)
+ return (used);
+ } else {
+ used += em_tx_ctx_setup(que, m, head, &txd_upper,
+ &txd_lower);
+ }
} else if (sc->hw.mac_type >= em_82543) {
used += em_transmit_checksum_setup(que, m, head,
&txd_upper, &txd_lower);
@@ -1569,6 +1579,21 @@ em_update_link_status(struct em_softc *s
ifp->if_link_state = link_state;
if_link_state_change(ifp);
}
+
+ /* Disable TSO for 10/100 speeds to avoid some hardware issues */
+ switch (sc->link_speed) {
+ case SPEED_10:
+ case SPEED_100:
+ if (sc->hw.mac_type >= em_82575 && sc->hw.mac_type <= em_i210) {
+ ifp->if_capabilities &= ~IFCAP_TSOv4;
+ ifp->if_capabilities &= ~IFCAP_TSOv6;
+ }
+ break;
+ case SPEED_1000:
+ if (sc->hw.mac_type >= em_82575 && sc->hw.mac_type <= em_i210)
+ ifp->if_capabilities |= IFCAP_TSOv4 | IFCAP_TSOv6;
+ break;
+ }
}
/*********************************************************************
@@ -1988,6 +2013,7 @@ em_setup_interface(struct em_softc *sc)
if (sc->hw.mac_type >= em_82575 && sc->hw.mac_type <= em_i210) {
ifp->if_capabilities |= IFCAP_CSUM_IPv4;
ifp->if_capabilities |= IFCAP_CSUM_TCPv6 | IFCAP_CSUM_UDPv6;
+ ifp->if_capabilities |= IFCAP_TSOv4 | IFCAP_TSOv6;
}
/*
@@ -2231,9 +2257,9 @@ em_setup_transmit_structures(struct em_s
for (i = 0; i < sc->sc_tx_slots; i++) {
pkt = &que->tx.sc_tx_pkts_ring[i];
- error = bus_dmamap_create(sc->sc_dmat, MAX_JUMBO_FRAME_SIZE,
+ error = bus_dmamap_create(sc->sc_dmat, EM_TSO_SIZE,
EM_MAX_SCATTER / (sc->pcix_82544 ? 2 : 1),
- MAX_JUMBO_FRAME_SIZE, 0, BUS_DMA_NOWAIT, &pkt->pkt_map);
+ EM_TSO_SEG_SIZE, 0, BUS_DMA_NOWAIT, &pkt->pkt_map);
if (error != 0) {
printf("%s: Unable to create TX DMA map\n",
DEVNAME(sc));
@@ -2403,6 +2429,85 @@ em_free_transmit_structures(struct em_so
0, que->tx.sc_tx_dma.dma_map->dm_mapsize,
BUS_DMASYNC_POSTREAD | BUS_DMASYNC_POSTWRITE);
}
+}
+
+u_int
+em_tso_setup(struct em_queue *que, struct mbuf *mp, u_int head,
+ u_int32_t *olinfo_status, u_int32_t *cmd_type_len)
+{
+ struct ether_extracted ext;
+ struct e1000_adv_tx_context_desc *TD;
+ uint32_t vlan_macip_lens = 0, type_tucmd_mlhl = 0, mss_l4len_idx = 0;
+ uint32_t paylen = 0;
+ uint8_t iphlen = 0;
+
+ *olinfo_status = 0;
+ *cmd_type_len = 0;
+ TD = (struct e1000_adv_tx_context_desc *)&que->tx.sc_tx_desc_ring[head];
+
+#if NVLAN > 0
+ if (ISSET(mp->m_flags, M_VLANTAG)) {
+ uint32_t vtag = mp->m_pkthdr.ether_vtag;
+ vlan_macip_lens |= vtag << E1000_ADVTXD_VLAN_SHIFT;
+ *cmd_type_len |= E1000_ADVTXD_DCMD_VLE;
+ }
+#endif
+
+ ether_extract_headers(mp, &ext);
+ if (ext.tcp == NULL)
+ goto out;
+
+ vlan_macip_lens |= (sizeof(*ext.eh) << E1000_ADVTXD_MACLEN_SHIFT);
+
+ if (ext.ip4) {
+ iphlen = ext.ip4->ip_hl << 2;
+
+ type_tucmd_mlhl |= E1000_ADVTXD_TUCMD_IPV4;
+ *olinfo_status |= E1000_TXD_POPTS_IXSM << 8;
+#ifdef INET6
+ } else if (ext.ip6) {
+ iphlen = sizeof(*ext.ip6);
+
+ type_tucmd_mlhl |= E1000_ADVTXD_TUCMD_IPV6;
+#endif
+ } else {
+ goto out;
+ }
+
+ *cmd_type_len |= E1000_ADVTXD_DTYP_DATA | E1000_ADVTXD_DCMD_IFCS;
+ *cmd_type_len |= E1000_ADVTXD_DCMD_DEXT | E1000_ADVTXD_DCMD_TSE;
+ paylen = mp->m_pkthdr.len - sizeof(*ext.eh) - iphlen -
+ (ext.tcp->th_off << 2);
+ if (paylen == 0) {
+ /* Skip TCP frame with no payload to avoid watchdog timeout */
+ return 0;
+ }
+ *olinfo_status |= paylen << E1000_ADVTXD_PAYLEN_SHIFT;
+ vlan_macip_lens |= iphlen;
+ type_tucmd_mlhl |= E1000_ADVTXD_DCMD_DEXT | E1000_ADVTXD_DTYP_CTXT;
+
+ type_tucmd_mlhl |= E1000_ADVTXD_TUCMD_L4T_TCP;
+ *olinfo_status |= E1000_TXD_POPTS_TXSM << 8;
+
+ mss_l4len_idx |= mp->m_pkthdr.ph_mss << E1000_ADVTXD_MSS_SHIFT;
+ mss_l4len_idx |= (ext.tcp->th_off << 2) << E1000_ADVTXD_L4LEN_SHIFT;
+ /* 82575 needs the queue index added */
+ if (que->sc->hw.mac_type == em_82575)
+ mss_l4len_idx |= (que->me & 0xff) << 4;
+
+ htolem32(&TD->vlan_macip_lens, vlan_macip_lens);
+ htolem32(&TD->type_tucmd_mlhl, type_tucmd_mlhl);
+ htolem32(&TD->u.seqnum_seed, 0);
+ htolem32(&TD->mss_l4len_idx, mss_l4len_idx);
+
+ tcpstat_add(tcps_outpkttso, (paylen + mp->m_pkthdr.ph_mss - 1) /
+ mp->m_pkthdr.ph_mss);
+
+ return 1;
+
+out:
+ tcpstat_inc(tcps_outbadtso);
+ return 0;
}
u_int
Index: sys/dev/pci/if_em.h
===================================================================
RCS file: /cvs/src/sys/dev/pci/if_em.h,v
diff -u -p -u -p -r1.82 if_em.h
--- sys/dev/pci/if_em.h 28 Jan 2024 18:42:58 -0000 1.82
+++ sys/dev/pci/if_em.h 5 Feb 2024 21:24:04 -0000
@@ -55,11 +55,14 @@ POSSIBILITY OF SUCH DAMAGE.
#include <net/if.h>
#include <net/if_media.h>
+#include <net/route.h>
#include <netinet/in.h>
#include <netinet/ip.h>
#include <netinet/if_ether.h>
#include <netinet/tcp.h>
+#include <netinet/tcp_timer.h>
+#include <netinet/tcp_var.h>
#include <netinet/udp.h>
#if NBPFILTER > 0
@@ -269,6 +272,7 @@ typedef int boolean_t;
#define EM_MAX_SCATTER 64
#define EM_TSO_SIZE 65535
+#define EM_TSO_SEG_SIZE 4096 /* Max dma segment size */
struct em_packet {
int pkt_eop; /* Index of the desc to watch */
Index: sys/dev/pci/if_em_hw.h
===================================================================
RCS file: /cvs/src/sys/dev/pci/if_em_hw.h,v
diff -u -p -u -p -r1.92 if_em_hw.h
--- sys/dev/pci/if_em_hw.h 28 Jan 2024 18:42:58 -0000 1.92
+++ sys/dev/pci/if_em_hw.h 5 Feb 2024 21:24:04 -0000
@@ -2150,6 +2150,7 @@ struct e1000_adv_tx_context_desc {
#define E1000_ADVTXD_DCMD_IFCS 0x02000000 /* Insert FCS (Ethernet CRC) */
#define E1000_ADVTXD_DCMD_DEXT 0x20000000 /* Descriptor extension (1=Adv) */
#define E1000_ADVTXD_DCMD_VLE 0x40000000 /* VLAN pkt enable */
+#define E1000_ADVTXD_DCMD_TSE 0x80000000 /* TCP Seg enable */
#define E1000_ADVTXD_PAYLEN_SHIFT 14 /* Adv desc PAYLEN shift */
/* Adv Transmit Descriptor Config Masks */
@@ -2159,6 +2160,10 @@ struct e1000_adv_tx_context_desc {
#define E1000_ADVTXD_TUCMD_IPV6 0x00000000 /* IP Packet Type: 0=IPv6 */
#define E1000_ADVTXD_TUCMD_L4T_UDP 0x00000000 /* L4 Packet TYPE of UDP */
#define E1000_ADVTXD_TUCMD_L4T_TCP 0x00000800 /* L4 Packet TYPE of TCP */
+
+/* Req requires Markers and CRC */
+#define E1000_ADVTXD_L4LEN_SHIFT 8 /* Adv ctxt L4LEN shift */
+#define E1000_ADVTXD_MSS_SHIFT 16 /* Adv ctxt MSS shift */
/* Multiple Receive Queue Control */
#define E1000_MRQC_ENABLE_MASK 0x00000003