From: David Gwynne
Subject: Re: igc(4): fix recv. jumbo frames
To: Jan Klemkow
Cc: Alexander Bluhm, tech@openbsd.org, Theo de Raadt
Date: Fri, 9 Aug 2024 20:19:30 +1000

On Fri, Aug 09, 2024 at 11:13:19AM +0200, Jan Klemkow wrote:
> On Fri, Aug 09, 2024 at 11:55:35AM +1000, David Gwynne wrote:
> > On Thu, Aug 08, 2024 at 12:51:18PM +0200, Jan Klemkow wrote:
> > > On Wed, Aug 07, 2024 at 11:38:53PM +0200, Alexander Bluhm wrote:
> > > > On Thu, Aug 01, 2024 at 06:48:17PM +0200, Jan Klemkow wrote:
> > > > > On Wed, Jul 31, 2024 at 10:39:25AM -0600, Theo de Raadt wrote:
> > > > > > Alexander Bluhm wrote:
> > > > > > > On Tue, Jul 30, 2024 at 08:12:08AM +0200, Jan Klemkow wrote:
> > > > > > > > The DMA mapping and the allocation of mbufs for jumbo frames
> > > > > > > > use different sizes.  Thus, we end up with corrupt mbufs,
> > > > > > > > which leads to panics in later parts of the TCP/IP stack:
> > > > > > > >
> > > > > > > > panic: kernel diagnostic assertion "M_DATABUF(m) + M_SIZE(m) >= (m->m_data + m->m_len)"
> > > > > > > > failed: file "/usr/src/sys/kern/uipc_mbuf.c", line 1364
> > > > > > > >
> > > > > > > > With the following diff, we use the same size for mapping
> > > > > > > > and allocation.
> > > > > > >
> > > > > > > I have tested this diff successfully.  Before, jumbo frames
> > > > > > > with igc(4) panicked; now TCP is faster.
> > > > > > >
> > > > > > > But are we allocating too much in the non-jumbo case?
> > > > > > > 	sc->rx_mbuf_sz = MAX_JUMBO_FRAME_SIZE + ETHER_ALIGN;
> > > > > > > 	m = MCLGETL(NULL, M_DONTWAIT, sc->rx_mbuf_sz)
> > > > > > >
> > > > > > > This means for each receive packet we allocate a 16k mbuf
> > > > > > > cluster.  Usually only 1500 bytes are needed.
> > > > > > >
> > > > > > > ix(4) has this code and manages to chain the mbufs correctly.
> > > > > > > 	/* Use 2k clusters, even for jumbo frames */
> > > > > > > 	sc->rx_mbuf_sz = MCLBYTES + ETHER_ALIGN;
> > > > > >
> > > > > > This hardware chains fragments in the ring?
> > > > > >
> > > > > > It's a tradeoff.  Jumbos are rare, so why reserve so much
> > > > > > memory for them on hardware which can do better.
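to put numbers on that tradeoff, here is the arithmetic as i read it
(a sketch, assuming MCLGETL() rounds the request up to the next mbuf
cluster pool size, which is where the 16k figure above comes from):

	/* jumbo-sized rx allocation, per rx descriptor: */
	sc->rx_mbuf_sz = MAX_JUMBO_FRAME_SIZE + ETHER_ALIGN; /* 9216 + 2 */
	m = MCLGETL(NULL, M_DONTWAIT, sc->rx_mbuf_sz);	/* -> 16k cluster */

	/* a plain 1500 byte packet leaves ~14.5k of that cluster unused */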
> > > > > I oriented my diff at the FreeBSD code.  Also, the Linux code
> > > > > looks like they don't use scatter/gather DMA.
> > > > >
> > > > > But it works with the following diff below.
> > > > >
> > > > > The extra size of ETHER_ALIGN seems to be wrong.  It also leads
> > > > > to panics caused by corrupted mbufs.  FreeBSD just uses MCLBYTES
> > > > > for the rx mbuf size.  NetBSD also removed ETHER_ALIGN from
> > > > > their copy of our driver [1].
> > > > >
> > > > > [1]: https://github.com/NetBSD/src/commit/fb38d839b48b9b6204dbbee1672454d6e719ba01
> > > > >
> > > > > Index: dev/pci/if_igc.c
> > > > > ===================================================================
> > > > > RCS file: /mount/openbsd/cvs/src/sys/dev/pci/if_igc.c,v
> > > > > diff -u -p -r1.25 if_igc.c
> > > > > --- dev/pci/if_igc.c	24 May 2024 06:02:53 -0000	1.25
> > > > > +++ dev/pci/if_igc.c	1 Aug 2024 15:45:58 -0000
> > > > > @@ -881,7 +881,7 @@ igc_init(void *arg)
> > > > >  	}
> > > > >  	igc_initialize_transmit_unit(sc);
> > > > >
> > > > > -	sc->rx_mbuf_sz = MCLBYTES + ETHER_ALIGN;
> > > > > +	sc->rx_mbuf_sz = MCLBYTES;
> > > > >  	/* Prepare receive descriptors and buffers. */
> > > > >  	if (igc_setup_receive_structures(sc)) {
> > > > >  		printf("%s: Could not setup receive structures\n",
> > > > > @@ -2159,7 +2159,7 @@ igc_allocate_receive_buffers(struct igc_
> > > > >  	rxbuf = rxr->rx_buffers;
> > > > >  	for (i = 0; i < sc->num_rx_desc; i++, rxbuf++) {
> > > > >  		error = bus_dmamap_create(rxr->rxdma.dma_tag,
> > > > > -		    MAX_JUMBO_FRAME_SIZE, 1, MAX_JUMBO_FRAME_SIZE, 0,
> > > > > +		    MAX_JUMBO_FRAME_SIZE, IGC_MAX_SCATTER, MCLBYTES, 0,
> > > > >  		    BUS_DMA_NOWAIT, &rxbuf->map);
> > > > >  		if (error) {
> > > > >  			printf("%s: Unable to create RX DMA map\n",
> > > >
> > > > The diff above survives my tests.  But the #define IGC_MAX_SCATTER
> > > > 40 seems like a magic value.  It is nowhere mentioned in the spec.
> > > > With MAX_JUMBO_FRAME_SIZE == 9216 and MCLBYTES == 2048 there can
> > > > be at most 5 segments.  The diff below also passes my tests.
> > > >
> > > > Is the more specific value IGC_MAX_GATHER == 5 better?
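for reference, the 5 falls out of the usual round-up arithmetic, e.g.
with the howmany() macro from <sys/param.h>:

	/* 2k clusters needed to cover one jumbo frame: */
	howmany(MAX_JUMBO_FRAME_SIZE, MCLBYTES)
	    == (9216 + 2048 - 1) / 2048
	    == 5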
> > >
> > > I get your point to change the name from SCATTER to something else,
> > > but GATHER isn't used in any other driver.  In some cases the value
> > > is just hardcoded; all other cases use something like MAX_SEGMENTS
> > > or MAX_SEGS.  This also better matches the nsegments parameter of
> > > bus_dmamap_create().
> > >
> > > > Or could the hardware try to create larger frames?
> > >
> > > bnx(4) uses 8 instead of 5 for busdma mappings, for instance.  And
> > > many other drivers just map and allocate jumbo size or hardmtu in
> > > one segment for their receive buffers.
> > >
> > > Thus I would suggest increasing it a bit.  Better safe than sorry,
> > > till we know an accurate or official value for this limit.
> >
> > i dont think you can compare bnx and igc here. igc (like other intel
> > nics) can chain multiple rx descriptors into a single packet, while
> > bnx and most other nics rx a packet into a single rx descriptor.
> >
> > we (openbsd) also go to a lot of effort to make mbuf clusters
> > physically contiguous. you basically only ever need a single
> > descriptor for rx packets unless a chip has some stupid dma boundary
> > or segment size limit. if it's not a stupid chip like that, what a
> > chip can do and accommodate is often unnecessary because of how our
> > kernel is set up.
> >
> > you also have to be careful not to confuse the dmamaps created for rx
> > and tx descriptors. this is the actual bus_dmamap_create calls for
> > bnx rx descriptors:
> >
> > 	/*
> > 	 * Create DMA maps for the Rx buffer mbufs.
> > 	 */
> > 	for (i = 0; i < TOTAL_RX_BD; i++) {
> > 		if (bus_dmamap_create(sc->bnx_dmatag, BNX_MAX_JUMBO_MRU,
> > 		    1, BNX_MAX_JUMBO_MRU, 0, BUS_DMA_NOWAIT,
> > 		    &sc->rx_mbuf_map[i])) {
> > 			printf(": Could not create Rx mbuf %d DMA map!\n", i);
> > 			rc = ENOMEM;
> > 			goto bnx_dma_alloc_exit;
> > 		}
> > 	}
> >
> > > Index: dev/pci/if_igc.c
> > > ===================================================================
> > > RCS file: /data/mirror/openbsd/cvs/src/sys/dev/pci/if_igc.c,v
> > > diff -u -p -r1.25 if_igc.c
> > > --- dev/pci/if_igc.c	24 May 2024 06:02:53 -0000	1.25
> > > +++ dev/pci/if_igc.c	6 Aug 2024 08:36:31 -0000
> > > @@ -881,7 +881,7 @@ igc_init(void *arg)
> > >  	}
> > >  	igc_initialize_transmit_unit(sc);
> > >
> > > -	sc->rx_mbuf_sz = MCLBYTES + ETHER_ALIGN;
> > > +	sc->rx_mbuf_sz = MCLBYTES;
> >
> > i think this is a mistake.
> >
> > igc, like every other intel nic, only lets you specify the size of the
> > rx buffers using a very limited set of values:
> >
> > #define IGC_RCTL_SZ_2048	0x00000000 /* Rx buffer size 2048 */
> > #define IGC_RCTL_SZ_1024	0x00010000 /* Rx buffer size 1024 */
> > #define IGC_RCTL_SZ_512	0x00020000 /* Rx buffer size 512 */
> > #define IGC_RCTL_SZ_256	0x00030000 /* Rx buffer size 256 */
> >
> > this is configured at the chip level, which is why the rx ring
> > descriptors dont have a length field. you just slap the address of
> > memory in the descriptor and get on with it.
> >
> > it is convenient that we can tell igc to use the same sized memory
> > buffer that our mbuf clusters provide by default.
> >
> > however, igc, also like every other intel nic, writes the ethernet
> > packet into the rx buffer starting at the address provided. this
> > means if we use the vanilla 2k mbuf clusters for the rx ring, then
> > the IP header inside the received packets will not be aligned to a
> > 4 byte boundary, which is a problem on strict alignment archs. igc
> > is enabled on sparc64 and powerpc64, which will break or are now
> > broken with this diff.
> >
> > because of the limited set of rx buffer sizes on intel nics, and cos
> > of the ip header alignment requirements, and to avoid wasting even
> > more memory by using much larger clusters, we have the mcl2k2
> > clusters. i like to call it the intel mbuf cluster pool. the other
> > intel drivers seem to be able to use it just fine so im surprised
> > igc has unique problems with this.
> >
> > >  	/* Prepare receive descriptors and buffers. */
> > >  	if (igc_setup_receive_structures(sc)) {
> > >  		printf("%s: Could not setup receive structures\n",
> > > @@ -2159,7 +2159,7 @@ igc_allocate_receive_buffers(struct igc_
> > >  	rxbuf = rxr->rx_buffers;
> > >  	for (i = 0; i < sc->num_rx_desc; i++, rxbuf++) {
> > >  		error = bus_dmamap_create(rxr->rxdma.dma_tag,
> > > -		    MAX_JUMBO_FRAME_SIZE, 1, MAX_JUMBO_FRAME_SIZE, 0,
> > > +		    MAX_JUMBO_FRAME_SIZE, IGC_MAX_SEGS, MCLBYTES, 0,
> >
> > igc_get_buf only uses the first segment in the rx desc dmamap and
> > only populates a single rx descriptor. allowing more segments here
> > feels like a workaround for having changed the maximum segment size
> > to MCLBYTES, but before lowering rx_mbuf_sz.
> >
> > fwiw, em(4), ix(4), ixl(4), iavf(4) and even fxp(4) all have a
> > bus_dmamap_create for rx descs that looks like this one before this
> > diff.
>
> Hi David,
>
> thank you for the explanation and details.  What should I do now?
> Just back out my last diff, or apply the first diff I suggested here:
> https://marc.info/?l=openbsd-tech&m=172231995913881&w=2
>
> Without any diff, using jumbo frames leads to corrupt mbuf lengths or
> modified mbufs in the freelist.
>
> > i dont think i have an igc at the moment, otherwise id try and
> > fiddle with this.
>
> If you give me an ssh key, you can use our test machines:
>
> We have an amd64 with igc(4): http://obsd-lab.genua.de/hw/ot42.html
> As well as a sparc64 with igc(4): http://obsd-lab.genua.de/hw/ot25.html
>
> Thanks,
> Jan

you could try this. it compiles at least...

the important change is that it uses MCLBYTES via sc->rx_mbuf_sz for
all rx buffer sizes, except when it actually allocates a cluster. for
cluster allocation we add ETHER_ALIGN bytes to the requested
allocation size and offset m_data by that amount.
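to spell out what the 2 byte offset buys us: ETHER_ALIGN is 2 and an
ethernet header is 14 bytes, so shifting the start of the frame by 2
lands the ip header on a 4 byte boundary. roughly:

	m->m_data + 0		/* cluster start, at least 4 byte aligned */
	m->m_data + 2		/* rx frame starts here after the offset */
	m->m_data + 2 + 14	/* ip header at offset 16, 4 byte aligned */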
my bet is the actual problem with the previous code hangs off this bit:

-	m->m_data += (m->m_ext.ext_size - sc->rx_mbuf_sz);

the idea here was to put the mbuf data as close to the end of the
cluster as possible to be nice to things doing m_prepend in the
future. the problem is it's too smart, and assumes knowledge about
mbufs. the mcl2k2 is 2048 + 2 bytes on paper, but is actually
2048 + 64 bytes in size because of the 64 byte alignment we ask the
pool to provide. igc tried to be clever and use that space rather than
only assume the 2048 + 2 bytes it asked for.

i dont know if this will help. i dont know why igc triggers the assert
in your original email and none of the other intel chips hit it.

Index: if_igc.c
===================================================================
RCS file: /cvs/src/sys/dev/pci/if_igc.c,v
diff -u -p -r1.26 if_igc.c
--- if_igc.c	8 Aug 2024 14:58:49 -0000	1.26
+++ if_igc.c	9 Aug 2024 10:08:29 -0000
@@ -202,6 +202,7 @@ igc_attach(struct device *parent, struct
 	/* Determine hardware and mac info */
 	igc_identify_hardware(sc);
 
+	sc->rx_mbuf_sz = MCLBYTES;
 	sc->num_tx_desc = IGC_DEFAULT_TXD;
 	sc->num_rx_desc = IGC_DEFAULT_RXD;
 
@@ -881,7 +882,6 @@ igc_init(void *arg)
 	}
 	igc_initialize_transmit_unit(sc);
 
-	sc->rx_mbuf_sz = MCLBYTES;
 	/* Prepare receive descriptors and buffers. */
 	if (igc_setup_receive_structures(sc)) {
 		printf("%s: Could not setup receive structures\n",
@@ -1232,7 +1232,7 @@ igc_rxrinfo(struct igc_softc *sc, struct
 
 	for (i = 0; i < sc->sc_nqueues; i++) {
 		rxr = &sc->rx_rings[i];
-		ifr[n].ifr_size = MCLBYTES;
+		ifr[n].ifr_size = sc->rx_mbuf_sz;
 		snprintf(ifr[n].ifr_name, sizeof(ifr[n].ifr_name), "%d", i);
 		ifr[n].ifr_info = rxr->rx_ring;
 		n++;
@@ -1673,11 +1673,11 @@ igc_get_buf(struct igc_rxring *rxr, int 
 		return ENOBUFS;
 	}
 
-	m = MCLGETL(NULL, M_DONTWAIT, sc->rx_mbuf_sz);
+	m = MCLGETL(NULL, M_DONTWAIT, sc->rx_mbuf_sz + ETHER_ALIGN);
 	if (!m)
 		return ENOBUFS;
 
-	m->m_data += (m->m_ext.ext_size - sc->rx_mbuf_sz);
+	m->m_data += ETHER_ALIGN;
 	m->m_len = m->m_pkthdr.len = sc->rx_mbuf_sz;
 
 	error = bus_dmamap_load_mbuf(rxr->rxdma.dma_tag, rxbuf->map, m,
@@ -2159,7 +2159,7 @@ igc_allocate_receive_buffers(struct igc_
 	rxbuf = rxr->rx_buffers;
 	for (i = 0; i < sc->num_rx_desc; i++, rxbuf++) {
 		error = bus_dmamap_create(rxr->rxdma.dma_tag,
-		    MAX_JUMBO_FRAME_SIZE, IGC_MAX_SCATTER, MCLBYTES, 0,
+		    sc->rx_mbuf_sz, 1, sc->rx_mbuf_sz, 0,
 		    BUS_DMA_NOWAIT, &rxbuf->map);
 		if (error) {
 			printf("%s: Unable to create RX DMA map\n",
@@ -2223,7 +2223,8 @@ igc_setup_receive_ring(struct igc_rxring
 	rxr->next_to_check = 0;
 	rxr->last_desc_filled = sc->num_rx_desc - 1;
 
-	if_rxr_init(&rxr->rx_ring, 2 * ((ifp->if_hardmtu / MCLBYTES) + 1),
+	if_rxr_init(&rxr->rx_ring,
+	    2 * howmany(ifp->if_hardmtu, sc->rx_mbuf_sz) + 1,
 	    sc->num_rx_desc - 1);
 
 	return 0;
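fwiw, if my arithmetic is right, this layout also keeps the
uipc_mbuf.c assertion from your original mail happy with the 2k2
clusters. a sketch with the numbers filled in, not actual driver code:

	m = MCLGETL(NULL, M_DONTWAIT, 2048 + 2); /* mcl2k2: ext_size is 2048 + 64 */
	m->m_data += ETHER_ALIGN;		/* data starts at cluster + 2 */
	m->m_len = 2048;			/* sc->rx_mbuf_sz */

	/*
	 * M_DATABUF(m) + M_SIZE(m) == cluster + 2112
	 * m->m_data + m->m_len     == cluster + 2050
	 * 2112 >= 2050, so the KASSERT holds.
	 */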