Index | Thread | Search

From:
Claudio Jeker <cjeker@diehard.n-r-g.com>
Subject:
Re: revert m_defrag header align
To:
Alexander Bluhm <alexander.bluhm@gmx.net>
Cc:
tech@openbsd.org
Date:
Mon, 26 Feb 2024 17:02:18 +0100

Download raw body.

Thread
On Mon, Feb 26, 2024 at 04:17:10PM +0100, Alexander Bluhm wrote:
> On Mon, Feb 26, 2024 at 03:24:17PM +0100, Claudio Jeker wrote:
> > On Fri, Feb 23, 2024 at 04:54:59PM +0100, Alexander Bluhm wrote:
> > > On Fri, Feb 23, 2024 at 12:32:13PM +0100, Claudio Jeker wrote:
> > > > On Fri, Feb 23, 2024 at 11:55:43AM +0100, Alexander Bluhm wrote:
> > > > > Hi,
> > > > > 
> > > > > I think we should revert the diff that aligns the mbuf to IP header
> > > > > in m_defrag().  The cause of the change was that ether_extract_headers()
> > > > > crashed in em(4) when it was reading TCP header for offloading on
> > > > > sparc64.  This has been fixed by using memcpy().
> > > > > 
> > > > > Now viommu_dvmamap_load_raw() on sparc64 calls iommu_iomap_insert_page()
> > > > > twice as often with overlapping pages.  m_defrag() is intended as
> > > > > last resort to make DMA transfers to the hardware.  Therefore page
> > > > > alingment is more important than IP header alignment.
> > > > > 
> > > > > ok?
> > > > 
> > > > This makes little sense. We offset a 1500 byte packet by 2 bytes inside a
> > > > 8k page. This should not result in extra page use. Something else is off
> > > > here.
> > > 
> > > My printf debugging tells a different story.  Debugging diff
> > > attached, so you can see what I was looking at.
> > > 
> > > Note that I had to add a i > map->_dm_segcnt check to avoid that
> > > too many segments are loaded into DMA.  This prevents the "iomap
> > > insert error: 12 for pa 0xded4e6000" errors I was seeing before.
> > > 
> > > ixl supports only 8 segments.  The driver first tries to load the
> > > 1500 byte packets combined in 2k clusters and fails.  This mbuf
> > > chain in the TCP output buffer is a result of socket splicing small
> > > packets.  TSO creates a 31922 bytes packet consisting of about 22
> > > clusters.  The first atempt to load it into DMA fails.
> > > 
> > > _bus_dmamap_load_mbuf: j 9, i 9, _dm_segcnt 8, pktlen 31922
> > > 
> > > Then m_defrag creates one big cluster that contains the 31922 byte
> > > packet.  ixl supports up to 8 DMA segments each 16k, at least this
> > > is what the driver says.  The cluster is split into 3 segments.
> > > 
> > > _bus_dmamap_load_mbuf: ds_addr 0x004eba0002, ds_len 8192, j 1, i 3, _dm_segcnt 8, pktlen 31922
> > > _bus_dmamap_load_mbuf: ds_addr 0x004eba2002, ds_len 8192, j 1, i 3, _dm_segcnt 8, pktlen 31922
> > > _bus_dmamap_load_mbuf: ds_addr 0x004eba4002, ds_len 15538, j 1, i 3, _dm_segcnt 8, pktlen 31922
> > > 
> > > Note that the segments start 2 byte after the page due to header
> > > alignment of the mbuf.  That means that each segments covers two
> > > pages, the page for the remaining 2 byte is overlapping and loaded
> > > twice.
> > > 
> > > viommu_dvmamap_load_raw: a 0x004eba0000, aend 0x004eba4000
> > > viommu_dvmamap_load_raw: a 0x004eba2000, aend 0x004eba4000
> > > viommu_dvmamap_load_raw: a 0x004eba2000, aend 0x004eba6000
> > > viommu_dvmamap_load_raw: a 0x004eba4000, aend 0x004eba6000
> > > viommu_dvmamap_load_raw: a 0x004eba4000, aend 0x004eba8000
> > > viommu_dvmamap_load_raw: a 0x004eba6000, aend 0x004eba8000
> > > 
> > > With m_defrag header alignment reverted, we are back to 4 pages
> > > instead of 6.  As the restrictions of ixl DMA for 64k TSO are tight,
> > > we should not waste resources here.  Header alignment after m_defrag
> > > does not make sense for DMA.
> > > 
> > > Unfortunately after fixing the two bugs, large TSO on ixl still has
> > > problems.  Memory layout after socket splcing is the worst test
> > > case.  Jan has tested ixl on a different sparc64 that uses iommu
> > > instead of viommu.  There everything works.  em and ix TSO works
> > > also on viommu.  Your m_defrag alignment diff had no regressions,
> > > as it works for em and ix, and ixl on viommu was broken before.
> > > 
> > > The critical corner case is TSO + ixl + viommu sparc64. 
> > > 
> > > To make progress in debugging I have to
> > > 1. revert m_defrag header align
> > > 2. limit _bus_dmamap_load_mbuf to map->_dm_segcnt
> > > 3. find the remaining bugs
> > > 
> > 
> > The problem is _bus_dmamap_load_mbuf() that has a bug.
> > 
> >         while (m) {
> >                 vaddr_t vaddr = mtod(m, vaddr_t);
> >                 long buflen = (long)m->m_len;
> > 
> >                 len += buflen;
> >                 while (buflen > 0 && i < MAX_DMA_SEGS) {
> >                         paddr_t pa;
> >                         long incr;
> > 
> >                         incr = min(buflen, NBPG);
> > 
> > In our case this is a 64k mcluster with 31922. So buflen is bigger than
> > NBPG (8k) and so incr is set to 8192.
> 
> I did put i > map->_dm_segcnt in the condition after the loop to
> detect too many segment use.  But of course this is not the only
> bug.
> 
> > Later on the code does:
> > 
> >                         buflen -= incr;
> >                         vaddr += incr;
> > 
> > This does not work if vaddr is not at the start of a page.
> > Changing the min() to
> > 	incr = min(buflen, PAGE_SIZE - ((u_long)vaddr & PGOFSET));
> > may fix this.
> 
> I still think we should switch m_defrag() back to aligned mbuf
> creation.  Aligning to IP header does make sense when it comes to
> DMA transfers.  The problem arises when calling ether_extract_headers()
> after m_defrag().  We have a memcpy workaround for that.  Currently
> the m_defrag() align makes the situation worse, although
> _bus_dmamap_load_mbuf() does not work anyway.

I do not agree. m_defrag() should keep the alignment since it may also be
called in other parts of the tree. The DMA engine needs to be able to
start at any byte offset else a single mbuf (e.g. TCP ACK) could not be
sent out at all. m_defrag() currently breaks a contract we all depend
on -- in our network stack the outermost IP header is correctly alignend.
 
> It seems noone ever did large DMA transfers with this code.

At least not like this since these are super jumbo frames.

> > On the other hand most other archs implemented
> > _bus_dmamap_load_mbuf() as a wrapper around _dmamap_load_buffer()
> > so I wonder if sparc64 should do the same.
> > 
> > Also the uio function in sparc64 has the same logic bug and same issue.
> > 
> > I hate that we have so many versions of _bus_dmamap_load_mbuf() in our
> > tree since it makes it so much harder to adjust that code.
> > 
> > -- 
> > :wq Claudio
> 

Here the minimal diff. Can't test right now since my test system is down.

-- 
:wq Claudio

Index: sparc64/machdep.c
===================================================================
RCS file: /cvs/src/sys/arch/sparc64/sparc64/machdep.c,v
diff -u -p -r1.208 machdep.c
--- sparc64/machdep.c	10 Feb 2024 07:10:13 -0000	1.208
+++ sparc64/machdep.c	26 Feb 2024 16:01:12 -0000
@@ -1014,7 +1014,8 @@ _bus_dmamap_load_mbuf(bus_dma_tag_t t, b
 			paddr_t pa;
 			long incr;
 
-			incr = min(buflen, NBPG);
+			incr = min(buflen,
+			    PAGE_SIZE - ((u_long)vaddr & PGOFSET));
 
 			if (pmap_extract(pmap_kernel(), vaddr, &pa) == FALSE) {
 #ifdef DIAGNOSTIC
@@ -1100,7 +1101,9 @@ _bus_dmamap_load_uio(bus_dma_tag_t t, bu
 			paddr_t pa;
 			long incr;
 
-			incr = min(buflen, NBPG);
+			incr = min(buflen,
+			    PAGE_SIZE - ((u_long)vaddr & PGOFSET));
+
 			(void) pmap_extract(pmap_kernel(), vaddr, &pa);
 			buflen -= incr;
 			vaddr += incr;