igc(4): fix recv. jumbo frames
On Thu, Aug 08, 2024 at 12:51:18PM +0200, Jan Klemkow wrote:
> On Wed, Aug 07, 2024 at 11:38:53PM +0200, Alexander Bluhm wrote:
> > On Thu, Aug 01, 2024 at 06:48:17PM +0200, Jan Klemkow wrote:
> > > On Wed, Jul 31, 2024 at 10:39:25AM -0600, Theo de Raadt wrote:
> > > > Alexander Bluhm <bluhm@openbsd.org> wrote:
> > > > > On Tue, Jul 30, 2024 at 08:12:08AM +0200, Jan Klemkow wrote:
> > > > > > The DMA mapping and allocation of mbufs for jumbo frames use
> > > > > > different sizes. Thus, we end up with corrupt mbufs, which leads
> > > > > > to panics in later parts of the TCP/IP stack:
> > > > > >
> > > > > > panic: kernel diagnostic assertion "M_DATABUF(m) + M_SIZE(m) >= (m->m_data + m->m_len)"
> > > > > > failed: file "/usr/src/sys/kern/uipc_mbuf.c", line 1364
> > > > > >
> > > > > > With the following diff, we use the same size for mapping and
> > > > > > allocation.
> > > > > >
> > > > > > ok?
> > > > >
> > > > > I have tested this diff successfully. Before, jumbo frames with
> > > > > igc(4) panicked; now TCP is faster.
> > > > >
> > > > > But are we allocating too much in the non-jumbo case?
> > > > > sc->rx_mbuf_sz = MAX_JUMBO_FRAME_SIZE + ETHER_ALIGN;
> > > > > m = MCLGETL(NULL, M_DONTWAIT, sc->rx_mbuf_sz)
> > > > >
> > > > > This means for each receive packet we allocate a 16k mbuf cluster.
> > > > > Usually only 1500 bytes are needed.
> > > > >
> > > > > ix(4) has this code and manages to chain the mbufs correctly.
> > > > > /* Use 2k clusters, even for jumbo frames */
> > > > > sc->rx_mbuf_sz = MCLBYTES + ETHER_ALIGN;
> > > >
> > > > This hardware chains fragments in the ring?
> > > >
> > > > It's a tradeoff. Jumbos are rare, so why reserve so much memory for
> > > > them on hardware which can do better.
> > >
> > > I oriented my diff on the FreeBSD code. Also, the Linux code looks
> > > like they don't use scatter/gather DMA.
> > >
> > > But it works with the diff below.
> > >
> > > The extra size of ETHER_ALIGN seems to be wrong. It also leads to
> > > panics caused by corrupted mbufs. FreeBSD just uses MCLBYTES for the
> > > rx mbuf size.
> > > Also NetBSD removed ETHER_ALIGN from their copy of our driver [1].
> > >
> > > [1]: https://github.com/NetBSD/src/commit/fb38d839b48b9b6204dbbee1672454d6e719ba01
> > >
> > > Index: dev/pci/if_igc.c
> > > ===================================================================
> > > RCS file: /mount/openbsd/cvs/src/sys/dev/pci/if_igc.c,v
> > > diff -u -p -r1.25 if_igc.c
> > > --- dev/pci/if_igc.c 24 May 2024 06:02:53 -0000 1.25
> > > +++ dev/pci/if_igc.c 1 Aug 2024 15:45:58 -0000
> > > @@ -881,7 +881,7 @@ igc_init(void *arg)
> > > }
> > > igc_initialize_transmit_unit(sc);
> > >
> > > - sc->rx_mbuf_sz = MCLBYTES + ETHER_ALIGN;
> > > + sc->rx_mbuf_sz = MCLBYTES;
> > > /* Prepare receive descriptors and buffers. */
> > > if (igc_setup_receive_structures(sc)) {
> > > printf("%s: Could not setup receive structures\n",
> > > @@ -2159,7 +2159,7 @@ igc_allocate_receive_buffers(struct igc_
> > > rxbuf = rxr->rx_buffers;
> > > for (i = 0; i < sc->num_rx_desc; i++, rxbuf++) {
> > > error = bus_dmamap_create(rxr->rxdma.dma_tag,
> > > - MAX_JUMBO_FRAME_SIZE, 1, MAX_JUMBO_FRAME_SIZE, 0,
> > > + MAX_JUMBO_FRAME_SIZE, IGC_MAX_SCATTER, MCLBYTES, 0,
> > > BUS_DMA_NOWAIT, &rxbuf->map);
> > > if (error) {
> > > printf("%s: Unable to create RX DMA map\n",
> >
> > The diff above survives my tests. But the #define IGC_MAX_SCATTER
> > 40 seems like a magic value. It is nowhere mentioned in the spec.
> > With MAX_JUMBO_FRAME_SIZE == 9216 and MCLBYTES == 2048 there can
> > be at most 5 segments. The diff below also passes my tests.
> >
> > Is the more specific value IGC_MAX_GATHER == 5 better?
>
> I get your point about changing the name from SCATTER to something
> else, but GATHER isn't used in any other driver. In some cases the
> value is just hardcoded; all other cases use something like
> MAX_SEGMENTS or MAX_SEGS. This also matches the nsegments parameter
> of bus_dmamap_create() better.
>
> > Or could the hardware try to create larger frames?
>
> bnx(4), for instance, uses 8 instead of 5 for its busdma mappings.
> And many other drivers just map and allocate jumbo size or hardmtu in
> one segment for their receive buffers.
>
> Thus I would suggest increasing it a bit. Better safe than sorry,
> until we know an accurate or official value for this limit.
i dont think you can compare bnx and igc here. igc (like other intel
nics) can chain multiple rx descriptors into a single packet, while
bnx and most other nics rx a packet into a single rx descriptor.
we (openbsd) also go to a lot of effort to make mbuf clusters
physically contiguous. you basically only ever need a single
descriptor for rx packets unless a chip has some stupid dma boundary
or segment size limit. if it's not a stupid chip like that, what a
chip can do and accommodate is often unnecessary because of how our
kernel is set up.
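roughly, a single-segment rx refill on hardware like this boils down to
the sketch below. this is for illustration only, not the actual
igc_get_buf(); sc, rxr, rxbuf and rxdesc are stand-ins for the driver's
own structures, and rx_mbuf_sz is whatever buffer size the chip was told
about.

	struct mbuf *m;

	/* one physically contiguous cluster per packet */
	m = MCLGETL(NULL, M_DONTWAIT, sc->rx_mbuf_sz);
	if (m == NULL)
		return (ENOBUFS);
	m->m_len = m->m_pkthdr.len = sc->rx_mbuf_sz;

	/* a contiguous cluster means the load produces exactly one segment */
	if (bus_dmamap_load_mbuf(rxr->rxdma.dma_tag, rxbuf->map, m,
	    BUS_DMA_READ | BUS_DMA_NOWAIT) != 0) {
		m_freem(m);
		return (ENOBUFS);
	}

	/* hand that single segment to the chip: one rx descriptor per packet */
	rxdesc->read.pkt_addr = htole64(rxbuf->map->dm_segs[0].ds_addr);
	rxbuf->buf = m;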
you also have to be careful not to confuse the dmamaps created for rx
and tx descriptors. these are the actual bus_dmamap_create calls for
the bnx rx buffer mbufs:
	/*
	 * Create DMA maps for the Rx buffer mbufs.
	 */
	for (i = 0; i < TOTAL_RX_BD; i++) {
		if (bus_dmamap_create(sc->bnx_dmatag, BNX_MAX_JUMBO_MRU,
		    1, BNX_MAX_JUMBO_MRU, 0, BUS_DMA_NOWAIT,
		    &sc->rx_mbuf_map[i])) {
			printf(": Could not create Rx mbuf %d DMA map!\n", i);
			rc = ENOMEM;
			goto bnx_dma_alloc_exit;
		}
	}
> ok?
>
> bye,
> Jan
>
> Index: dev/pci/if_igc.c
> ===================================================================
> RCS file: /data/mirror/openbsd/cvs/src/sys/dev/pci/if_igc.c,v
> diff -u -p -r1.25 if_igc.c
> --- dev/pci/if_igc.c 24 May 2024 06:02:53 -0000 1.25
> +++ dev/pci/if_igc.c 6 Aug 2024 08:36:31 -0000
> @@ -881,7 +881,7 @@ igc_init(void *arg)
> }
> igc_initialize_transmit_unit(sc);
>
> - sc->rx_mbuf_sz = MCLBYTES + ETHER_ALIGN;
> + sc->rx_mbuf_sz = MCLBYTES;
i think this is a mistake.
igc, like every other intel nic, only lets you specify the size of the
rx buffers using a very limited set of values:
#define IGC_RCTL_SZ_2048	0x00000000 /* Rx buffer size 2048 */
#define IGC_RCTL_SZ_1024	0x00010000 /* Rx buffer size 1024 */
#define IGC_RCTL_SZ_512		0x00020000 /* Rx buffer size 512 */
#define IGC_RCTL_SZ_256		0x00030000 /* Rx buffer size 256 */
this is configured at the chip level, which is why the rx ring
descriptors dont have a length field. you just slap the address of
memory in the descriptor and get on with it.
it is convenient that we can tell igc to use the same sized memory
buffer that our mbuf clusters provide by default.
however, igc, also like every other intel nic, writes the ethernet
packet into the rx buffer starting at the address provided. this
means if we use the vanilla 2k mbuf clusters for the rx ring, then
the IP header inside the received packets will not be aligned to a
4 byte boundary, which is a problem on strict alignment archs. igc
is enabled on sparc64 and powerpc64, which will break or are now
broken with this diff.
because of the limited set of rx buffer sizes on intel nics, and cos of
the ip header alignment requirements, and to avoid wasting even more
memory by using much larger clusters, we have the mcl2k2 clusters. i
like to call it the intel mbuf cluster pool. the other intel drivers
seem to be able to use it just fine so im surprised igc has unique
problems with this.
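to make that concrete, the usual pattern looks roughly like the sketch
below. this is a simplified illustration, not code lifted from em(4) or
ix(4); MCLBYTES, ETHER_ALIGN, MCLGETL and the mbuf fields are real, the
surrounding names are just for the example.

	/*
	 * allocate from the 2048+2 ("mcl2k2") pool and start the dma
	 * 2 bytes into the cluster, so the 14-byte ethernet header ends
	 * at offset 16 and the IP header is 4-byte aligned.
	 */
	sc->rx_mbuf_sz = MCLBYTES + ETHER_ALIGN;	/* 2048 + 2 = 2050 */

	m = MCLGETL(NULL, M_DONTWAIT, sc->rx_mbuf_sz);
	if (m == NULL)
		return (ENOBUFS);
	m->m_data += ETHER_ALIGN;		/* dma address is offset by 2 */
	m->m_len = m->m_pkthdr.len = MCLBYTES;	/* chip still writes at most 2048 */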
> /* Prepare receive descriptors and buffers. */
> if (igc_setup_receive_structures(sc)) {
> printf("%s: Could not setup receive structures\n",
> @@ -2159,7 +2159,7 @@ igc_allocate_receive_buffers(struct igc_
> rxbuf = rxr->rx_buffers;
> for (i = 0; i < sc->num_rx_desc; i++, rxbuf++) {
> error = bus_dmamap_create(rxr->rxdma.dma_tag,
> - MAX_JUMBO_FRAME_SIZE, 1, MAX_JUMBO_FRAME_SIZE, 0,
> + MAX_JUMBO_FRAME_SIZE, IGC_MAX_SEGS, MCLBYTES, 0,
igc_get_buf only uses the first segment in the rx desc dmamap and only
populates a single rx descriptor. allowing more segments here feels
like a workaround for having changed the maximum segment size to
MCLBYTES before rx_mbuf_sz was lowered.
fwiw, em(4), ix(4), ixl(4), iavf(4) and even fxp(4) all have a
bus_dmamap_create for rx descs that looks like this one before this
diff.
i dont think i have an igc at the moment, otherwise id try and
fiddle with this.
> BUS_DMA_NOWAIT, &rxbuf->map);
> if (error) {
> printf("%s: Unable to create RX DMA map\n",
> Index: dev/pci/if_igc.h
> ===================================================================
> RCS file: /data/mirror/openbsd/cvs/src/sys/dev/pci/if_igc.h,v
> diff -u -p -r1.4 if_igc.h
> --- dev/pci/if_igc.h 21 May 2024 11:19:39 -0000 1.4
> +++ dev/pci/if_igc.h 6 Aug 2024 08:34:42 -0000
> @@ -191,6 +191,7 @@
> #define roundup2(size, unit) (((size) + (unit) - 1) & ~((unit) - 1))
> #define msec_delay(x) DELAY(1000 * (x))
>
> +#define IGC_MAX_SEGS 8
> #define IGC_MAX_SCATTER 40
> #define IGC_TSO_SIZE 65535
>