
From:
Claudio Jeker <cjeker@diehard.n-r-g.com>
Subject:
Re: revert m_defrag header align
To:
Alexander Bluhm <alexander.bluhm@gmx.net>
Cc:
tech@openbsd.org
Date:
Fri, 23 Feb 2024 17:23:24 +0100

On Fri, Feb 23, 2024 at 04:54:59PM +0100, Alexander Bluhm wrote:
> On Fri, Feb 23, 2024 at 12:32:13PM +0100, Claudio Jeker wrote:
> > On Fri, Feb 23, 2024 at 11:55:43AM +0100, Alexander Bluhm wrote:
> > > Hi,
> > > 
> > > I think we should revert the diff that aligns the mbuf to IP header
> > > in m_defrag().  The cause of the change was that ether_extract_headers()
> > > crashed in em(4) when it was reading TCP header for offloading on
> > > sparc64.  This has been fixed by using memcpy().
> > > 
> > > Now viommu_dvmamap_load_raw() on sparc64 calls iommu_iomap_insert_page()
> > > twice as often with overlapping pages.  m_defrag() is intended as
> > > last resort to make DMA transfers to the hardware.  Therefore page
> > > alignment is more important than IP header alignment.
> > > 
> > > ok?
> > 
> > This makes little sense. We offset a 1500-byte packet by 2 bytes inside an
> > 8k page. This should not result in extra page use. Something else is off
> > here.
> 
> My printf debugging tells a different story.  Debugging diff
> attached, so you can see what I was looking at.

Thanks for sharing the diff.
 
> Note that I had to add a i > map->_dm_segcnt check to avoid that
> too many segments are loaded into DMA.  This prevents the "iomap
> insert error: 12 for pa 0xded4e6000" errors I was seeing before.
> 
> ixl supports only 8 segments.  The driver first tries to load the
> 1500 byte packets combined in 2k clusters and fails.  This mbuf
> chain in the TCP output buffer is a result of socket splicing small
> packets.  TSO creates a 31922-byte packet consisting of about 22
> clusters.  The first attempt to load it into DMA fails.
> 
> _bus_dmamap_load_mbuf: j 9, i 9, _dm_segcnt 8, pktlen 31922
> 
> Then m_defrag creates one big cluster that contains the 31922 byte
> packet.  ixl supports up to 8 DMA segments each 16k, at least this
> is what the driver says.  The cluster is split into 3 segments.
> 
> _bus_dmamap_load_mbuf: ds_addr 0x004eba0002, ds_len 8192, j 1, i 3, _dm_segcnt 8, pktlen 31922

This looks like a bug in the _bus_dmamap_load_mbuf implementation for
sun4v. If ds_addr is 0x004eba0002, then ds_len needs to be 8190 bytes.
The DMA descriptor should not cross page boundaries.

I will dig into this.

> _bus_dmamap_load_mbuf: ds_addr 0x004eba2002, ds_len 8192, j 1, i 3, _dm_segcnt 8, pktlen 31922
> _bus_dmamap_load_mbuf: ds_addr 0x004eba4002, ds_len 15538, j 1, i 3, _dm_segcnt 8, pktlen 31922
> 
> Note that the segments start 2 byte after the page due to header
> alignment of the mbuf.  That means that each segments covers two
> pages, the page for the remaining 2 byte is overlapping and loaded
> twice.
> 
> viommu_dvmamap_load_raw: a 0x004eba0000, aend 0x004eba4000
> viommu_dvmamap_load_raw: a 0x004eba2000, aend 0x004eba4000
> viommu_dvmamap_load_raw: a 0x004eba2000, aend 0x004eba6000
> viommu_dvmamap_load_raw: a 0x004eba4000, aend 0x004eba6000
> viommu_dvmamap_load_raw: a 0x004eba4000, aend 0x004eba8000
> viommu_dvmamap_load_raw: a 0x004eba6000, aend 0x004eba8000
> 
> With m_defrag header alignment reverted, we are back to 4 pages
> instead of 6.  As the restrictions of ixl DMA for 64k TSO are tight,
> we should not waste resources here.  Header alignment after m_defrag
> does not make sense for DMA.
> 
> Unfortunately after fixing the two bugs, large TSO on ixl still has
> problems.  Memory layout after socket splicing is the worst test
> case.  Jan has tested ixl on a different sparc64 that uses iommu
> instead of viommu.  There everything works.  em and ix TSO works
> also on viommu.  Your m_defrag alignment diff had no regressions,
> as it works for em and ix, and ixl on viommu was broken before.
> 
> The critical corner case is TSO + ixl + viommu sparc64. 

I don't think I have an ixl at hand, but I will look into the
_bus_dmamap_load_mbuf issue. We need to fix this properly.
 
> To make progress in debugging I have to
> 1. revert m_defrag header align
> 2. limit _bus_dmamap_load_mbuf to map->_dm_segcnt
> 3. find the remaining bugs
> 
> bluhm
> 
> Index: arch/sparc64/dev/viommu.c
> ===================================================================
> RCS file: /mount/openbsd/cvs/src/sys/arch/sparc64/dev/viommu.c,v
> diff -u -p -r1.20 viommu.c
> --- arch/sparc64/dev/viommu.c	16 May 2021 15:10:19 -0000	1.20
> +++ arch/sparc64/dev/viommu.c	22 Feb 2024 22:40:01 -0000
> @@ -223,6 +223,7 @@ viommu_dvmamap_create(bus_dma_tag_t t, b
>  	int ret;
>  	bus_dmamap_t map;
>  	struct iommu_map_state *ims;
> +	u_long pages;
>  
>  	BUS_DMA_FIND_PARENT(t, _dmamap_create);
>  	ret = (*t->_dmamap_create)(t, t0, size, nsegments, maxsegsz, boundary,
> @@ -231,7 +232,8 @@ viommu_dvmamap_create(bus_dma_tag_t t, b
>  	if (ret)
>  		return (ret);
>  
> -	ims = viommu_iomap_create(atop(round_page(size)));
> +	pages = atop(round_page(size));
> +	ims = viommu_iomap_create(4 * pages);
>  
>  	if (ims == NULL) {
>  		bus_dmamap_destroy(t0, map);
> @@ -535,6 +537,10 @@ viommu_dvmamap_load_raw(bus_dma_tag_t t,
>  					    "pa 0x%lx\n", err, a);
>  					iommu_iomap_clear_pages(ims);
>  					return (EFBIG);
> +				}
> +				if (size > 5000) {
> +					printf("%s: a 0x%010lx, aend 0x%010lx\n",
> +					    __func__, a, aend);
>  				}
>  			}
>  
> Index: arch/sparc64/sparc64/machdep.c
> ===================================================================
> RCS file: /mount/openbsd/cvs/src/sys/arch/sparc64/sparc64/machdep.c,v
> diff -u -p -r1.208 machdep.c
> --- arch/sparc64/sparc64/machdep.c	10 Feb 2024 07:10:13 -0000	1.208
> +++ arch/sparc64/sparc64/machdep.c	23 Feb 2024 12:18:36 -0000
> @@ -987,7 +987,7 @@ _bus_dmamap_load_mbuf(bus_dma_tag_t t, b
>      struct mbuf *m, int flags)
>  {
>  	bus_dma_segment_t segs[MAX_DMA_SEGS];
> -	int i;
> +	int i, j, k, pktlen;
>  	size_t len;
>  
>  	/*
> @@ -998,12 +998,13 @@ _bus_dmamap_load_mbuf(bus_dma_tag_t t, b
>  
>  	if (m->m_pkthdr.len > map->_dm_size)
>  		return (EINVAL);
> +	pktlen = m->m_pkthdr.len;
>  
>  	/* Record mbuf for *_unload */
>  	map->_dm_type = _DM_TYPE_MBUF;
>  	map->_dm_source = m;
>  
> -	i = 0;
> +	j = i = 0;
>  	len = 0;
>  	while (m) {
>  		vaddr_t vaddr = mtod(m, vaddr_t);
> @@ -1044,11 +1045,21 @@ _bus_dmamap_load_mbuf(bus_dma_tag_t t, b
>  			i++;
>  		}
>  		m = m->m_next;
> -		if (m && i >= MAX_DMA_SEGS) {
> +		j++;
> +		if ((m && i >= MAX_DMA_SEGS) || i > map->_dm_segcnt) {
> +			printf("%s: j %d, i %d, _dm_segcnt %d, pktlen %d\n",
> +			    __func__, j, i, map->_dm_segcnt, pktlen);
>  			/* Exceeded the size of our dmamap */
>  			map->_dm_type = 0;
>  			map->_dm_source = NULL;
>  			return (EFBIG);
> +		}
> +	}
> +
> +	if (pktlen > 5000) {
> +		for (k = 0; k < i; k++) {
> +			printf("%s: ds_addr 0x%010lx, ds_len %lu, j %d, i %d, _dm_segcnt %d, pktlen %d\n",
> +			    __func__, segs[k].ds_addr, segs[k].ds_len, j, i, map->_dm_segcnt, pktlen);
>  		}
>  	}
>  
> Index: dev/pci/if_ixl.c
> ===================================================================
> RCS file: /mount/openbsd/cvs/src/sys/dev/pci/if_ixl.c,v
> diff -u -p -r1.97 if_ixl.c
> --- dev/pci/if_ixl.c	14 Feb 2024 22:41:48 -0000	1.97
> +++ dev/pci/if_ixl.c	22 Feb 2024 20:49:26 -0000
> @@ -2897,6 +2897,7 @@ ixl_start(struct ifqueue *ifq)
>  #if NBPFILTER > 0
>  	caddr_t if_bpf;
>  #endif
> +	int error;
>  
>  	if (!LINK_STATE_IS_UP(ifp->if_link_state)) {
>  		ifq_purge(ifq);
> @@ -2937,7 +2938,8 @@ ixl_start(struct ifqueue *ifq)
>  			free--;
>  		}
>  
> -		if (ixl_load_mbuf(sc->sc_dmat, map, m) != 0) {
> +		if ((error = ixl_load_mbuf(sc->sc_dmat, map, m)) != 0) {
> +			printf("%s: ixl_load_mbuf error %d\n", __func__, error);
>  			ifq->ifq_errors++;
>  			m_freem(m);
>  			continue;
> 

-- 
:wq Claudio