From: Claudio Jeker
Subject: Re: revert m_defrag header align
To: Alexander Bluhm
Cc: tech@openbsd.org
Date: Mon, 26 Feb 2024 17:02:18 +0100

On Mon, Feb 26, 2024 at 04:17:10PM +0100, Alexander Bluhm wrote:
> On Mon, Feb 26, 2024 at 03:24:17PM +0100, Claudio Jeker wrote:
> > On Fri, Feb 23, 2024 at 04:54:59PM +0100, Alexander Bluhm wrote:
> > > On Fri, Feb 23, 2024 at 12:32:13PM +0100, Claudio Jeker wrote:
> > > > On Fri, Feb 23, 2024 at 11:55:43AM +0100, Alexander Bluhm wrote:
> > > > > Hi,
> > > > >
> > > > > I think we should revert the diff that aligns the mbuf to the
> > > > > IP header in m_defrag().  The cause of the change was that
> > > > > ether_extract_headers() crashed in em(4) when it was reading
> > > > > the TCP header for offloading on sparc64.  This has been
> > > > > fixed by using memcpy().
> > > > >
> > > > > Now viommu_dvmamap_load_raw() on sparc64 calls
> > > > > iommu_iomap_insert_page() twice as often with overlapping
> > > > > pages.  m_defrag() is intended as a last resort for making
> > > > > DMA transfers to the hardware.  Therefore page alignment is
> > > > > more important than IP header alignment.
> > > > >
> > > > > ok?
> > > >
> > > > This makes little sense.  We offset a 1500 byte packet by 2
> > > > bytes inside an 8k page.  This should not result in extra page
> > > > use.  Something else is off here.
> > >
> > > My printf debugging tells a different story.  Debugging diff
> > > attached, so you can see what I was looking at.
> > >
> > > Note that I had to add an i > map->_dm_segcnt check to avoid
> > > loading too many segments into DMA.  This prevents the "iomap
> > > insert error: 12 for pa 0xded4e6000" errors I was seeing before.
> > >
> > > ixl supports only 8 segments.  The driver first tries to load
> > > the 1500 byte packets combined in 2k clusters and fails.  This
> > > mbuf chain in the TCP output buffer is a result of socket
> > > splicing small packets.  TSO creates a 31922 byte packet
> > > consisting of about 22 clusters.  The first attempt to load it
> > > into DMA fails.
> > >
> > > _bus_dmamap_load_mbuf: j 9, i 9, _dm_segcnt 8, pktlen 31922
> > >
> > > Then m_defrag() creates one big cluster that contains the 31922
> > > byte packet.  ixl supports up to 8 DMA segments of 16k each, at
> > > least this is what the driver says.  The cluster is split into 3
> > > segments.
> > >
> > > _bus_dmamap_load_mbuf: ds_addr 0x004eba0002, ds_len 8192, j 1, i 3, _dm_segcnt 8, pktlen 31922
> > > _bus_dmamap_load_mbuf: ds_addr 0x004eba2002, ds_len 8192, j 1, i 3, _dm_segcnt 8, pktlen 31922
> > > _bus_dmamap_load_mbuf: ds_addr 0x004eba4002, ds_len 15538, j 1, i 3, _dm_segcnt 8, pktlen 31922
> > >
> > > Note that the segments start 2 bytes past the page boundary due
> > > to the header alignment of the mbuf.  That means each segment
> > > covers two pages; the page holding the trailing 2 bytes overlaps
> > > with the next segment and is loaded twice.
> > >
> > > viommu_dvmamap_load_raw: a 0x004eba0000, aend 0x004eba4000
> > > viommu_dvmamap_load_raw: a 0x004eba2000, aend 0x004eba4000
> > > viommu_dvmamap_load_raw: a 0x004eba2000, aend 0x004eba6000
> > > viommu_dvmamap_load_raw: a 0x004eba4000, aend 0x004eba6000
> > > viommu_dvmamap_load_raw: a 0x004eba4000, aend 0x004eba8000
> > > viommu_dvmamap_load_raw: a 0x004eba6000, aend 0x004eba8000
> > >
> > > With m_defrag header alignment reverted, we are back to 4 pages
> > > instead of 6.  As the restrictions of ixl DMA for 64k TSO are
> > > tight, we should not waste resources here.  Header alignment
> > > after m_defrag does not make sense for DMA.
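To make the overlap concrete: a segment of length 8192 that starts 2
bytes into an 8k page necessarily touches two pages, while the same
segment page aligned touches one.  A standalone userland sketch, not
kernel code; page size and addresses are taken from the log above:

	#include <stdio.h>

	#define PGSHIFT		13		/* sparc64: 8k pages */
	#define PAGE_SIZE	(1UL << PGSHIFT)
	#define PGOFSET		(PAGE_SIZE - 1)

	/* count the pages a segment [addr, addr+len) touches */
	static unsigned long
	pages_touched(unsigned long addr, unsigned long len)
	{
		unsigned long first = addr & ~PGOFSET;
		unsigned long last = (addr + len - 1) & ~PGOFSET;

		return ((last - first) >> PGSHIFT) + 1;
	}

	int
	main(void)
	{
		/* first segment from the ds_addr log lines */
		printf("%lu\n", pages_touched(0x004eba0002UL, 8192)); /* 2 */
		printf("%lu\n", pages_touched(0x004eba0000UL, 8192)); /* 1 */
		return 0;
	}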
> > > Unfortunately after fixing the two bugs, large TSO on ixl still
> > > has problems.  Memory layout after socket splicing is the worst
> > > test case.  Jan has tested ixl on a different sparc64 that uses
> > > iommu instead of viommu.  There everything works.  em and ix TSO
> > > also work on viommu.  Your m_defrag alignment diff had no
> > > regressions, as it works for em and ix, and ixl on viommu was
> > > broken before.
> > >
> > > The critical corner case is TSO + ixl + viommu sparc64.
> > >
> > > To make progress in debugging I have to
> > > 1. revert m_defrag header align
> > > 2. limit _bus_dmamap_load_mbuf to map->_dm_segcnt
> > > 3. find the remaining bugs
> >
> > The problem is that _bus_dmamap_load_mbuf() has a bug.
> >
> > 	while (m) {
> > 		vaddr_t vaddr = mtod(m, vaddr_t);
> > 		long buflen = (long)m->m_len;
> >
> > 		len += buflen;
> > 		while (buflen > 0 && i < MAX_DMA_SEGS) {
> > 			paddr_t pa;
> > 			long incr;
> >
> > 			incr = min(buflen, NBPG);
> >
> > In our case this is a 64k mbuf cluster holding 31922 bytes.  So
> > buflen is bigger than NBPG (8k) and incr is set to 8192.
>
> I did put i > map->_dm_segcnt in the condition after the loop to
> detect when too many segments are used.  But of course this is not
> the only bug.
>
> > Later on the code does:
> >
> > 	buflen -= incr;
> > 	vaddr += incr;
> >
> > This does not work if vaddr is not at the start of a page.
> > Changing the min() to
> >
> > 	incr = min(buflen, PAGE_SIZE - ((u_long)vaddr & PGOFSET));
> >
> > may fix this.
>
> I still think we should switch m_defrag() back to aligned mbuf
> creation.  Aligning to the IP header does not make sense when it
> comes to DMA transfers.  The problem arises when calling
> ether_extract_headers() after m_defrag().  We have a memcpy
> workaround for that.  Currently the m_defrag() align makes the
> situation worse, although _bus_dmamap_load_mbuf() does not work
> anyway.

I do not agree.  m_defrag() should keep the alignment since it may
also be called in other parts of the tree.  The DMA engine needs to be
able to start at any byte offset, else a single mbuf (e.g. a TCP ACK)
could not be sent out at all.  m_defrag() currently breaks a contract
we all depend on -- in our network stack the outermost IP header is
correctly aligned.

> It seems no one ever did large DMA transfers with this code.

At least not like this, since these are super jumbo frames.

> > On the other hand most other archs implemented
> > _bus_dmamap_load_mbuf() as a wrapper around _dmamap_load_buffer(),
> > so I wonder if sparc64 should do the same.
> >
> > Also the uio function in sparc64 has the same logic bug and the
> > same issue.
> >
> > I hate that we have so many versions of _bus_dmamap_load_mbuf() in
> > our tree since it makes it so much harder to adjust that code.
> >
> > --
> > :wq Claudio
>

Here is the minimal diff.  I can't test it right now since my test
system is down.
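As a quick sanity check of the new min() clamp, here is how the fixed
inner loop would walk the defragmented 31922 byte packet.  This is a
standalone userland sketch, not the kernel code itself; the start
address and packet length are taken from the log above:

	#include <stdio.h>

	#define PGSHIFT		13		/* sparc64: 8k pages */
	#define PAGE_SIZE	(1UL << PGSHIFT)
	#define PGOFSET		(PAGE_SIZE - 1)
	#define MIN(a, b)	((a) < (b) ? (a) : (b))

	int
	main(void)
	{
		unsigned long vaddr = 0x004eba0002UL;	/* 2 bytes into a page */
		long buflen = 31922;			/* pktlen from the log */
		int chunks = 0;

		while (buflen > 0) {
			/*
			 * The old code used MIN(buflen, PAGE_SIZE),
			 * which crosses a page boundary whenever vaddr
			 * is not page aligned.
			 */
			long incr = MIN(buflen,
			    (long)(PAGE_SIZE - (vaddr & PGOFSET)));

			printf("vaddr 0x%010lx incr %ld\n", vaddr, incr);
			buflen -= incr;
			vaddr += incr;
			chunks++;
		}
		printf("%d chunks\n", chunks);	/* 4 */
		return 0;
	}

The first increment is shortened to 8190 bytes to reach the page
boundary; every following vaddr is page aligned and the walk covers 4
pages, matching the reverted case above.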
--
:wq Claudio

Index: sparc64/machdep.c
===================================================================
RCS file: /cvs/src/sys/arch/sparc64/sparc64/machdep.c,v
diff -u -p -r1.208 machdep.c
--- sparc64/machdep.c	10 Feb 2024 07:10:13 -0000	1.208
+++ sparc64/machdep.c	26 Feb 2024 16:01:12 -0000
@@ -1014,7 +1014,8 @@ _bus_dmamap_load_mbuf(bus_dma_tag_t t, b
 			paddr_t pa;
 			long incr;
 
-			incr = min(buflen, NBPG);
+			incr = min(buflen,
+			    PAGE_SIZE - ((u_long)vaddr & PGOFSET));
 
 			if (pmap_extract(pmap_kernel(), vaddr, &pa) == FALSE) {
 #ifdef DIAGNOSTIC
@@ -1100,7 +1101,9 @@ _bus_dmamap_load_uio(bus_dma_tag_t t, bu
 			paddr_t pa;
 			long incr;
 
-			incr = min(buflen, NBPG);
+			incr = min(buflen,
+			    PAGE_SIZE - ((u_long)vaddr & PGOFSET));
+
 			(void) pmap_extract(pmap_kernel(), vaddr, &pa);
 			buflen -= incr;
 			vaddr += incr;
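One property of the replacement expression worth noting (my reading of
the diff): for a page aligned vaddr the clamp degenerates to the old
NBPG step, so loads that start on a page boundary behave exactly as
before.  A trivial standalone check, again a sketch rather than kernel
code:

	#include <assert.h>

	#define PGSHIFT		13		/* sparc64: 8k pages */
	#define PAGE_SIZE	(1UL << PGSHIFT)
	#define PGOFSET		(PAGE_SIZE - 1)
	#define NBPG		PAGE_SIZE

	int
	main(void)
	{
		unsigned long vaddr = 0x004eba0000UL;	/* page aligned */

		/* aligned start: the new clamp equals the old NBPG step */
		assert(PAGE_SIZE - (vaddr & PGOFSET) == NBPG);
		return 0;
	}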