From: Jan Klemkow <jan@openbsd.org>
Subject: Re: igc(4): fix recv. jumbo frames
To: David Gwynne <david@gwynne.id.au>
Cc: Alexander Bluhm <bluhm@openbsd.org>, tech@openbsd.org, Theo de Raadt <deraadt@openbsd.org>
Date: Fri, 9 Aug 2024 11:13:19 +0200

On Fri, Aug 09, 2024 at 11:55:35AM +1000, David Gwynne wrote:
> On Thu, Aug 08, 2024 at 12:51:18PM +0200, Jan Klemkow wrote:
> > On Wed, Aug 07, 2024 at 11:38:53PM +0200, Alexander Bluhm wrote:
> > > On Thu, Aug 01, 2024 at 06:48:17PM +0200, Jan Klemkow wrote:
> > > > On Wed, Jul 31, 2024 at 10:39:25AM -0600, Theo de Raadt wrote:
> > > > > Alexander Bluhm <bluhm@openbsd.org> wrote:
> > > > > > On Tue, Jul 30, 2024 at 08:12:08AM +0200, Jan Klemkow wrote:
> > > > > > > The DMA mapping and the allocation of mbufs for jumbo frames use
> > > > > > > different sizes.  Thus, we end up with corrupt mbufs, which leads
> > > > > > > to panics later in the TCP/IP stack:
> > > > > > > 
> > > > > > > panic: kernel diagnostic assertion "M_DATABUF(m) + M_SIZE(m) >= (m->m_data + m->m_len)"
> > > > > > > failed: file "/usr/src/sys/kern/uipc_mbuf.c", line 1364
> > > > > > > 
> > > > > > > With the following diff, we use the same size for mapping and
> > > > > > > allocation.
> > > > > > 
> > > > > > I have tested this diff successfully.  Before, jumbo frames with
> > > > > > igc(4) panicked; now TCP is faster.
> > > > > > 
> > > > > > But are we allocating too much in the non-jumbo case?
> > > > > >     sc->rx_mbuf_sz = MAX_JUMBO_FRAME_SIZE + ETHER_ALIGN;
> > > > > >     m = MCLGETL(NULL, M_DONTWAIT, sc->rx_mbuf_sz)
> > > > > > 
> > > > > > This means for each receive packet we allocate a 16k mbuf cluster.
> > > > > > Usually only 1500 bytes are needed.
> > > > > > 
> > > > > > ix(4) has this code and manages to chain the mbufs correctly.
> > > > > >     /* Use 2k clusters, even for jumbo frames */
> > > > > >     sc->rx_mbuf_sz = MCLBYTES + ETHER_ALIGN;
> > > > > 
> > > > > This hardware chains fragments in the ring?
> > > > > 
> > > > > It's a tradeoff.  Jumbos are rare, so why reserve so much memory for them
> > > > > on hardware which can do better.
> > > > 
> > > > I based my diff on the FreeBSD code.  Also, the Linux code looks as if
> > > > they don't use scatter/gather DMA.
> > > > 
> > > > But it works with the diff below.
> > > > 
> > > > The extra ETHER_ALIGN seems to be wrong.  It also leads to panics
> > > > caused by corrupted mbufs.  FreeBSD just uses MCLBYTES as the rx mbuf
> > > > size.  NetBSD also removed ETHER_ALIGN from their copy of our driver [1].
> > > > 
> > > > [1]: https://github.com/NetBSD/src/commit/fb38d839b48b9b6204dbbee1672454d6e719ba01
> > > > 
> > > > Index: dev/pci/if_igc.c
> > > > ===================================================================
> > > > RCS file: /mount/openbsd/cvs/src/sys/dev/pci/if_igc.c,v
> > > > diff -u -p -r1.25 if_igc.c
> > > > --- dev/pci/if_igc.c	24 May 2024 06:02:53 -0000	1.25
> > > > +++ dev/pci/if_igc.c	1 Aug 2024 15:45:58 -0000
> > > > @@ -881,7 +881,7 @@ igc_init(void *arg)
> > > >  	}
> > > >  	igc_initialize_transmit_unit(sc);
> > > >  
> > > > -	sc->rx_mbuf_sz = MCLBYTES + ETHER_ALIGN;
> > > > +	sc->rx_mbuf_sz = MCLBYTES;
> > > >  	/* Prepare receive descriptors and buffers. */
> > > >  	if (igc_setup_receive_structures(sc)) {
> > > >  		printf("%s: Could not setup receive structures\n",
> > > > @@ -2159,7 +2159,7 @@ igc_allocate_receive_buffers(struct igc_
> > > >  	rxbuf = rxr->rx_buffers;
> > > >  	for (i = 0; i < sc->num_rx_desc; i++, rxbuf++) {
> > > >  		error = bus_dmamap_create(rxr->rxdma.dma_tag,
> > > > -		    MAX_JUMBO_FRAME_SIZE, 1, MAX_JUMBO_FRAME_SIZE, 0,
> > > > +		    MAX_JUMBO_FRAME_SIZE, IGC_MAX_SCATTER, MCLBYTES, 0,
> > > >  		    BUS_DMA_NOWAIT, &rxbuf->map);
> > > >  		if (error) {
> > > >  			printf("%s: Unable to create RX DMA map\n",
> > > 
> > > The diff above survives my tests.  But the #define IGC_MAX_SCATTER
> > > 40 seems like a magic value.  It is nowhere mentioned in the spec.
> > > With MAX_JUMBO_FRAME_SIZE == 9216 and MCLBYTES == 2048 there can
> > > be at most 5 segments.  The diff below also passes my tests.
> > > 
> > > Is the more specific value IGC_MAX_GATHER == 5 better?
> > 
> > I get your point about changing the name from SCATTER to something else,
> > but GATHER isn't used in any other driver.  In some cases the value is
> > just hardcoded; all other cases use something like MAX_SEGMENTS or
> > MAX_SEGS.  This also matches the nsegments parameter of
> > bus_dmamap_create() better.
> > 
> > > Or could the hardware try to create larger frames?
> > 
> > bnx(4) uses 8 instead of 5 for busdma mappings, for instance.  And many
> > other drivers just map and allocate the jumbo size or the hardmtu in one
> > segment for their receive buffers.
> > 
> > Thus I would suggest increasing it a bit.  Better safe than sorry, until
> > we know an accurate or official value for this limit.
> 
> i dont think you can compare bnx and igc here. igc (like other intel
> nics) can chain multiple rx descriptors into a single packet, while
> bnx and most other nics rx a packet into a single rx descriptor.
> 
> we (openbsd) also go to a lot of effort to make mbuf clusters
> physically contiguous. you basically only ever need a single
> descriptor for rx packets unless a chip has some stupid dma boundary
> or segment size limit. if it's not a stupid chip like that, what a
> chip can do and accommodate is often unnecessary because of how our
> kernel is set up.
> 
> you also have to be careful not to confuse the dmamaps created for rx
> and tx descriptors. this is the actual bus_dmamap_create calls for
> bnx rx descriptors:
> 
> 	/*
> 	 * Create DMA maps for the Rx buffer mbufs.
> 	 */
> 	for (i = 0; i < TOTAL_RX_BD; i++) {
> 		if (bus_dmamap_create(sc->bnx_dmatag, BNX_MAX_JUMBO_MRU,
> 		    1, BNX_MAX_JUMBO_MRU, 0, BUS_DMA_NOWAIT,
> 		    &sc->rx_mbuf_map[i])) {
> 			printf(": Could not create Rx mbuf %d DMA map!\n", i);
> 			rc = ENOMEM;
> 			goto bnx_dma_alloc_exit;
> 		}
> 	}
> 
> > Index: dev/pci/if_igc.c
> > ===================================================================
> > RCS file: /data/mirror/openbsd/cvs/src/sys/dev/pci/if_igc.c,v
> > diff -u -p -r1.25 if_igc.c
> > --- dev/pci/if_igc.c	24 May 2024 06:02:53 -0000	1.25
> > +++ dev/pci/if_igc.c	6 Aug 2024 08:36:31 -0000
> > @@ -881,7 +881,7 @@ igc_init(void *arg)
> >  	}
> >  	igc_initialize_transmit_unit(sc);
> >  
> > -	sc->rx_mbuf_sz = MCLBYTES + ETHER_ALIGN;
> > +	sc->rx_mbuf_sz = MCLBYTES;
> 
> i think this is a mistake.
> 
> igc, like every other intel nic, only lets you specify the size of the
> rx buffers using a very limited set of values:
> 
> #define IGC_RCTL_SZ_2048        0x00000000 /* Rx buffer size 2048 */
> #define IGC_RCTL_SZ_1024        0x00010000 /* Rx buffer size 1024 */
> #define IGC_RCTL_SZ_512         0x00020000 /* Rx buffer size 512 */
> #define IGC_RCTL_SZ_256         0x00030000 /* Rx buffer size 256 */
> 
> this is configured at the chip level, which is why the rx ring
> descriptors dont have a length field. you just slap the address of
> memory in the descriptor and get on with it.
> 
> it is convenient that we can tell igc to use the same sized memory
> buffer that our mbuf clusters provide by default.
> 
> however, igc, also like every other intel nic, writes the ethernet
> packet into the rx buffer starting at the address provided. this
> means if we use the vanilla 2k mbuf clusters for the rx ring, then
> the IP header inside the received packets will not be aligned to a
> 4 byte boundary, which is a problem on strict alignment archs. igc
> is enabled on sparc64 and powerpc64, which will break or are now
> broken with this diff.
> 
> because of the limited set of rx buffer sizes on intel nics, and cos of
> the ip header alignment requirements, and to avoid wasting even more
> memory by using much larger clusters, we have the mcl2k2 clusters. i
> like to call it the intel mbuf cluster pool. the other intel drivers
> seem to be able to use it just fine so im surprised igc has unique
> problems with this.
> 
> >  	/* Prepare receive descriptors and buffers. */
> >  	if (igc_setup_receive_structures(sc)) {
> >  		printf("%s: Could not setup receive structures\n",
> > @@ -2159,7 +2159,7 @@ igc_allocate_receive_buffers(struct igc_
> >  	rxbuf = rxr->rx_buffers;
> >  	for (i = 0; i < sc->num_rx_desc; i++, rxbuf++) {
> >  		error = bus_dmamap_create(rxr->rxdma.dma_tag,
> > -		    MAX_JUMBO_FRAME_SIZE, 1, MAX_JUMBO_FRAME_SIZE, 0,
> > +		    MAX_JUMBO_FRAME_SIZE, IGC_MAX_SEGS, MCLBYTES, 0,
> 
> igc_get_buf only uses the first segment in the rx desc dmamap and only
> populates a single rx descriptor. allowing more segments here feels
> like a workaround for having changed the maximum segment size to
> MCLBYTES, but before lowering rx_mbuf_sz.
> 
> fwiw, em(4), ix(4), ixl(4), iavf(4) and even fxp(4) all have a
> bus_dmamap_create for rx descs that looks like this one before this
> diff.

Hi David,

Thank you for the explanation and details.  What should I do now?  Just
back out my last diff, or apply the first diff I suggested here:
https://marc.info/?l=openbsd-tech&m=172231995913881&w=2
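
For context, the essence of that first diff was to allocate the rx
clusters with the same size the dmamap is created with, roughly along
these lines (a sketch of the idea, not the exact diff):

	/* one cluster per rx descriptor, allocation matches the map size */
	sc->rx_mbuf_sz = MAX_JUMBO_FRAME_SIZE + ETHER_ALIGN;
	...
	error = bus_dmamap_create(rxr->rxdma.dma_tag, sc->rx_mbuf_sz, 1,
	    sc->rx_mbuf_sz, 0, BUS_DMA_NOWAIT, &rxbuf->map);
	...
	m = MCLGETL(NULL, M_DONTWAIT, sc->rx_mbuf_sz);

That trades memory for simplicity: every rx descriptor gets a 16k
cluster even when only 1500 bytes are needed, which is what bluhm
pointed out above.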

Without any diff, using jumbo frames leads to a corrupt mbuf length
or modified mbufs in the freelist.
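
If we instead go with the mcl2k2 approach you describe, I assume the rx
buffer fill would end up looking roughly like em(4)/ix(4): allocate a
2k + ETHER_ALIGN cluster and trim the head, so the chip still sees a
2048 byte buffer but the IP header lands on a 4 byte boundary (untested
sketch):

	/* 2k + 2 cluster from the mcl2k2 pool */
	m = MCLGETL(NULL, M_DONTWAIT, MCLBYTES + ETHER_ALIGN);
	if (m == NULL)
		return (ENOBUFS);
	m->m_len = m->m_pkthdr.len = MCLBYTES + ETHER_ALIGN;
	m_adj(m, ETHER_ALIGN);	/* buffer handed to the chip is now 2048 bytes */
	error = bus_dmamap_load_mbuf(rxr->rxdma.dma_tag, rxbuf->map, m,
	    BUS_DMA_NOWAIT);

Jumbo frames would then span multiple rx descriptors and have to be
chained back together on the rx completion path.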

> i dont think i have an igc at the moment, otherwise id try and
> fiddle with this.

If you give me an ssh key, you can use our test machines:

We have an amd64 with igc(4): http://obsd-lab.genua.de/hw/ot42.html
As well as a sparc64 with igc(4): http://obsd-lab.genua.de/hw/ot25.html

Thanks,
Jan