revert m_defrag header align
> Date: Mon, 26 Feb 2024 15:24:17 +0100
> From: Claudio Jeker <cjeker@diehard.n-r-g.com>
>
> On Fri, Feb 23, 2024 at 04:54:59PM +0100, Alexander Bluhm wrote:
> > On Fri, Feb 23, 2024 at 12:32:13PM +0100, Claudio Jeker wrote:
> > > On Fri, Feb 23, 2024 at 11:55:43AM +0100, Alexander Bluhm wrote:
> > > > Hi,
> > > >
> > > > I think we should revert the diff that aligns the mbuf to IP header
> > > > in m_defrag(). The cause of the change was that ether_extract_headers()
> > > > crashed in em(4) when it was reading the TCP header for offloading on
> > > > sparc64. This has been fixed by using memcpy().
> > > >
> > > > Now viommu_dvmamap_load_raw() on sparc64 calls iommu_iomap_insert_page()
> > > > twice as often with overlapping pages. m_defrag() is intended as a
> > > > last resort to make DMA transfers to the hardware possible. Therefore
> > > > page alignment is more important than IP header alignment.
> > > >
> > > > ok?
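
The memcpy() fix mentioned above presumably has the following shape; this
is a minimal standalone sketch of the strict-alignment problem, not the
actual ether_extract_headers() code (offsets and storage are illustrative):

	#include <stdint.h>
	#include <stdio.h>
	#include <string.h>

	int
	main(void)
	{
		static uint32_t buf[16];	/* 4-byte aligned storage */
		const unsigned char *pkt = (const unsigned char *)buf;
		/*
		 * The TCP header starts at offset 14 (Ethernet) + 20 (IP)
		 * = 34, so its 4-byte sequence number field sits at offset
		 * 38, which is not 4-byte aligned.
		 */
		const unsigned char *th = pkt + 34;
		uint32_t seq;

		/* Traps on sparc64: seq = *(const uint32_t *)(th + 4); */
		memcpy(&seq, th + 4, sizeof(seq));	/* safe on any alignment */
		printf("seq %u\n", seq);
		return (0);
	}
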
> > >
> > > This makes little sense. We offset a 1500-byte packet by 2 bytes inside
> > > an 8k page. This should not result in extra page use. Something else is
> > > off here.
> >
> > My printf debugging tells a different story. Debugging diff
> > attached, so you can see what I was looking at.
> >
> > Note that I had to add an i > map->_dm_segcnt check to prevent too
> > many segments from being loaded into DMA. This prevents the "iomap
> > insert error: 12 for pa 0xded4e6000" errors I was seeing before.
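
A sketch of where such a guard presumably sits (assumed shape and error
handling, not the actual debugging diff; _dm_segcnt is the per-map segment
limit, 8 for ixl):

	/*
	 * Inside the segment-building loop of _bus_dmamap_load_mbuf():
	 * give up once more segments would be produced than the map was
	 * created with, so the caller can fall back to m_defrag().
	 */
	if (i > map->_dm_segcnt)
		return (EFBIG);
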
> >
> > ixl supports only 8 segments. The driver first tries to load the
> > 1500-byte packets combined in 2k clusters and fails. This mbuf
> > chain in the TCP output buffer is a result of socket splicing small
> > packets. TSO creates a 31922-byte packet consisting of about 22
> > clusters. The first attempt to load it into DMA fails.
> >
> > _bus_dmamap_load_mbuf: j 9, i 9, _dm_segcnt 8, pktlen 31922
> >
> > Then m_defrag creates one big cluster that contains the 31922-byte
> > packet. ixl supports up to 8 DMA segments of 16k each, at least that
> > is what the driver says. The cluster is split into 3 segments.
> >
> > _bus_dmamap_load_mbuf: ds_addr 0x004eba0002, ds_len 8192, j 1, i 3, _dm_segcnt 8, pktlen 31922
> > _bus_dmamap_load_mbuf: ds_addr 0x004eba2002, ds_len 8192, j 1, i 3, _dm_segcnt 8, pktlen 31922
> > _bus_dmamap_load_mbuf: ds_addr 0x004eba4002, ds_len 15538, j 1, i 3, _dm_segcnt 8, pktlen 31922
> >
> > Note that the segments start 2 bytes after the page boundary due to
> > header alignment of the mbuf. That means each segment covers two
> > pages; the page holding the trailing 2 bytes overlaps the next
> > segment and is loaded twice.
> >
> > viommu_dvmamap_load_raw: a 0x004eba0000, aend 0x004eba4000
> > viommu_dvmamap_load_raw: a 0x004eba2000, aend 0x004eba4000
> > viommu_dvmamap_load_raw: a 0x004eba2000, aend 0x004eba6000
> > viommu_dvmamap_load_raw: a 0x004eba4000, aend 0x004eba6000
> > viommu_dvmamap_load_raw: a 0x004eba4000, aend 0x004eba8000
> > viommu_dvmamap_load_raw: a 0x004eba6000, aend 0x004eba8000
> >
> > With m_defrag header alignment reverted, we are back to 4 pages
> > instead of 6. As the restrictions of ixl DMA for 64k TSO are tight,
> > we should not waste resources here. Header alignment after m_defrag
> > does not make sense for DMA.
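
The page arithmetic is easy to check outside the kernel. A minimal
standalone sketch (assuming sparc64's 8k pages; addresses taken from the
logs above) that reproduces the 6-versus-4 page counts:

	#include <stdio.h>

	#define PAGE_SIZE	8192UL
	#define PGOFSET		(PAGE_SIZE - 1)

	/* Number of pages a segment of len bytes starting at addr touches. */
	static unsigned long
	pages_touched(unsigned long addr, unsigned long len)
	{
		unsigned long first = addr & ~PGOFSET;
		unsigned long last = (addr + len - 1) & ~PGOFSET;

		return ((last - first) / PAGE_SIZE + 1);
	}

	int
	main(void)
	{
		/* The three segments from the _bus_dmamap_load_mbuf output. */
		unsigned long n = pages_touched(0x004eba0002UL, 8192) +
		    pages_touched(0x004eba2002UL, 8192) +
		    pages_touched(0x004eba4002UL, 15538);

		printf("with 2-byte offset: %lu pages\n", n);	/* 6 */
		printf("page aligned: %lu pages\n",
		    pages_touched(0x004eba0000UL, 31922));	/* 4 */
		return (0);
	}
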
> >
> > Unfortunately, after fixing the two bugs, large TSO on ixl still has
> > problems. Memory layout after socket splicing is the worst test
> > case. Jan has tested ixl on a different sparc64 that uses iommu
> > instead of viommu. There everything works. TSO with em and ix also
> > works on viommu. Your m_defrag alignment diff had no regressions,
> > as it works for em and ix, and ixl on viommu was broken before.
> >
> > The critical corner case is TSO + ixl + viommu sparc64.
> >
> > To make progress in debugging I have to:
> > 1. revert m_defrag header align
> > 2. limit _bus_dmamap_load_mbuf to map->_dm_segcnt
> > 3. find the remaining bugs
> >
>
> The problem is a bug in _bus_dmamap_load_mbuf().
>
> 	while (m) {
> 		vaddr_t vaddr = mtod(m, vaddr_t);
> 		long buflen = (long)m->m_len;
> 
> 		len += buflen;
> 		while (buflen > 0 && i < MAX_DMA_SEGS) {
> 			paddr_t pa;
> 			long incr;
> 
> 			incr = min(buflen, NBPG);
>
> In our case this is a 64k mbuf cluster holding 31922 bytes. So buflen
> is bigger than NBPG (8k) and incr is set to 8192.
>
> Later on the code does:
>
> 		buflen -= incr;
> 		vaddr += incr;
>
> This does not work if vaddr is not at the start of a page.
> Changing the min() to
> 	incr = min(buflen, PAGE_SIZE - ((u_long)vaddr & PGOFSET));
> may fix this.
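
A standalone sketch of that fix (not the kernel code itself; NBPG is
hardcoded to sparc64's 8k): with the page-offset-aware min(), every step
ends exactly at a page boundary, so each physical-address lookup covers a
single page:

	#include <stdio.h>

	#define NBPG	8192L
	#define PGOFSET	(NBPG - 1)

	static long
	lmin(long a, long b)
	{
		return (a < b ? a : b);
	}

	int
	main(void)
	{
		unsigned long vaddr = 0x004eba0002UL; /* 2 bytes into a page */
		long buflen = 31922;

		while (buflen > 0) {
			/* Fixed: never step past the current page boundary. */
			long incr = lmin(buflen,
			    NBPG - (long)(vaddr & PGOFSET));

			printf("vaddr 0x%010lx incr %ld\n", vaddr, incr);
			buflen -= incr;
			vaddr += incr;
		}
		return (0);
	}

With the original incr = min(buflen, NBPG), the walk advances in flat 8k
strides from the unaligned start, so every full stride straddles a page
boundary even though only the page containing vaddr was looked up.
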
>
> On the other hand, most other archs implement
> _bus_dmamap_load_mbuf() as a wrapper around _dmamap_load_buffer(),
> so I wonder whether sparc64 should do the same.
That would make sense, but the layering of the iommu/viommu code
makes this somewhat tricky. The iommu allows us to map mbufs that are
virtually contiguous but not physically contiguous as DVA-contiguous,
and the code takes advantage of this. I suppose that is somewhat
beneficial for network cards that don't support using multiple
descriptors for a single packet. Not sure if that really makes a
difference, but untangling it may be tricky.
So for now, I'd suggest going with the minimal fix you propose above.
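
A conceptual sketch of the property described above (illustrative only,
not the actual iommu/viommu code; the DVA window address is made up):
physically scattered pages get consecutive IOMMU mappings, so the device
sees one contiguous DVA range:

	#include <stdio.h>

	#define PAGE_SIZE	8192UL

	int
	main(void)
	{
		/* Physically scattered pages of one virtually contig mbuf. */
		unsigned long pa[] = {
			0x10000000UL, 0x4a000000UL, 0x33000000UL
		};
		unsigned long dva = 0xfe000000UL; /* hypothetical DVA window */
		size_t i;

		/* One mapping per page, consecutive in DVA space. */
		for (i = 0; i < sizeof(pa) / sizeof(pa[0]); i++)
			printf("map dva 0x%lx -> pa 0x%lx\n",
			    dva + i * PAGE_SIZE, pa[i]);
		return (0);
	}
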
> Also the uio function in sparc64 has the same logic bug and hence the
> same issue.
>
> I hate that we have so many versions of _bus_dmamap_load_mbuf() in our
> tree, since it makes it so much harder to adjust that code.
>
> --
> :wq Claudio
>