From: Claudio Jeker
Subject: Re: revert m_defrag header align
To: Alexander Bluhm
Cc: tech@openbsd.org
Date: Fri, 23 Feb 2024 17:23:24 +0100

On Fri, Feb 23, 2024 at 04:54:59PM +0100, Alexander Bluhm wrote:
> On Fri, Feb 23, 2024 at 12:32:13PM +0100, Claudio Jeker wrote:
> > On Fri, Feb 23, 2024 at 11:55:43AM +0100, Alexander Bluhm wrote:
> > > Hi,
> > >
> > > I think we should revert the diff that aligns the mbuf to the IP
> > > header in m_defrag().  The cause of the change was that
> > > ether_extract_headers() crashed in em(4) when it was reading the
> > > TCP header for offloading on sparc64.  This has been fixed by
> > > using memcpy().
> > >
> > > Now viommu_dvmamap_load_raw() on sparc64 calls
> > > iommu_iomap_insert_page() twice as often with overlapping pages.
> > > m_defrag() is intended as a last resort to make DMA transfers to
> > > the hardware.  Therefore page alignment is more important than IP
> > > header alignment.
> > >
> > > ok?
> >
> > This makes little sense.  We offset a 1500 byte packet by 2 bytes
> > inside an 8k page.  This should not result in extra page use.
> > Something else is off here.
>
> My printf debugging tells a different story.  Debugging diff
> attached, so you can see what I was looking at.

Thanks for sharing the diff.

> Note that I had to add an i > map->_dm_segcnt check to avoid loading
> too many segments into DMA.  This prevents the "iomap insert error:
> 12 for pa 0xded4e6000" errors I was seeing before.
>
> ixl supports only 8 segments.  The driver first tries to load the
> 1500 byte packets combined in 2k clusters and fails.  This mbuf
> chain in the TCP output buffer is a result of socket splicing small
> packets.  TSO creates a 31922 byte packet consisting of about 22
> clusters.  The first attempt to load it into DMA fails.
>
> _bus_dmamap_load_mbuf: j 9, i 9, _dm_segcnt 8, pktlen 31922
>
> Then m_defrag() creates one big cluster that contains the 31922 byte
> packet.
> ixl supports up to 8 DMA segments of 16k each; at least that is
> what the driver says.  The cluster is split into 3 segments.
>
> _bus_dmamap_load_mbuf: ds_addr 0x004eba0002, ds_len 8192, j 1, i 3, _dm_segcnt 8, pktlen 31922

This looks like a bug in the _bus_dmamap_load_mbuf implementation for
sun4v.  If ds_addr is 0x004eba0002, then ds_len needs to be 8190
bytes.  The DMA descriptor should not cross page boundaries.  I will
dig into this.

> _bus_dmamap_load_mbuf: ds_addr 0x004eba2002, ds_len 8192, j 1, i 3, _dm_segcnt 8, pktlen 31922
> _bus_dmamap_load_mbuf: ds_addr 0x004eba4002, ds_len 15538, j 1, i 3, _dm_segcnt 8, pktlen 31922
>
> Note that the segments start 2 bytes after the page boundary due to
> the header alignment of the mbuf.  That means that each segment
> covers two pages; the page holding the remaining 2 bytes overlaps
> and is loaded twice.
>
> viommu_dvmamap_load_raw: a 0x004eba0000, aend 0x004eba4000
> viommu_dvmamap_load_raw: a 0x004eba2000, aend 0x004eba4000
> viommu_dvmamap_load_raw: a 0x004eba2000, aend 0x004eba6000
> viommu_dvmamap_load_raw: a 0x004eba4000, aend 0x004eba6000
> viommu_dvmamap_load_raw: a 0x004eba4000, aend 0x004eba8000
> viommu_dvmamap_load_raw: a 0x004eba6000, aend 0x004eba8000
>
> With the m_defrag header alignment reverted, we are back to 4 pages
> instead of 6.  As the ixl DMA restrictions for 64k TSO are tight,
> we should not waste resources here.  Header alignment after
> m_defrag() does not make sense for DMA.
>
> Unfortunately, after fixing the two bugs, large TSO on ixl still has
> problems.  Memory layout after socket splicing is the worst test
> case.  Jan has tested ixl on a different sparc64 that uses iommu
> instead of viommu.  There everything works.  em and ix TSO also
> work on viommu.  Your m_defrag alignment diff had no regressions,
> as it works for em and ix, and ixl on viommu was broken before.
>
> The critical corner case is TSO + ixl + viommu sparc64.

I don't think I have an ixl at hand, but I will look into the
_bus_dmamap_load_mbuf issue.
We need to fix this properly.

> To make progress in debugging I have to
> 1. revert m_defrag header align
> 2. limit _bus_dmamap_load_mbuf to map->_dm_segcnt
> 3. find the remaining bugs
>
> bluhm
>
> Index: arch/sparc64/dev/viommu.c
> ===================================================================
> RCS file: /mount/openbsd/cvs/src/sys/arch/sparc64/dev/viommu.c,v
> diff -u -p -r1.20 viommu.c
> --- arch/sparc64/dev/viommu.c	16 May 2021 15:10:19 -0000	1.20
> +++ arch/sparc64/dev/viommu.c	22 Feb 2024 22:40:01 -0000
> @@ -223,6 +223,7 @@ viommu_dvmamap_create(bus_dma_tag_t t, b
>  	int ret;
>  	bus_dmamap_t map;
>  	struct iommu_map_state *ims;
> +	u_long pages;
>  
>  	BUS_DMA_FIND_PARENT(t, _dmamap_create);
>  	ret = (*t->_dmamap_create)(t, t0, size, nsegments, maxsegsz, boundary,
> @@ -231,7 +232,8 @@ viommu_dvmamap_create(bus_dma_tag_t t, b
>  	if (ret)
>  		return (ret);
>  
> -	ims = viommu_iomap_create(atop(round_page(size)));
> +	pages = atop(round_page(size));
> +	ims = viommu_iomap_create(4 * pages);
>  
>  	if (ims == NULL) {
>  		bus_dmamap_destroy(t0, map);
> @@ -535,6 +537,10 @@ viommu_dvmamap_load_raw(bus_dma_tag_t t,
>  			    "pa 0x%lx\n", err, a);
>  			iommu_iomap_clear_pages(ims);
>  			return (EFBIG);
> +		}
> +		if (size > 5000) {
> +			printf("%s: a 0x%010lx, aend 0x%010lx\n",
> +			    __func__, a, aend);
>  		}
>  	}
>  
> Index: arch/sparc64/sparc64/machdep.c
> ===================================================================
> RCS file: /mount/openbsd/cvs/src/sys/arch/sparc64/sparc64/machdep.c,v
> diff -u -p -r1.208 machdep.c
> --- arch/sparc64/sparc64/machdep.c	10 Feb 2024 07:10:13 -0000	1.208
> +++ arch/sparc64/sparc64/machdep.c	23 Feb 2024 12:18:36 -0000
> @@ -987,7 +987,7 @@ _bus_dmamap_load_mbuf(bus_dma_tag_t t, b
>      struct mbuf *m, int flags)
>  {
>  	bus_dma_segment_t segs[MAX_DMA_SEGS];
> -	int i;
> +	int i, j, k, pktlen;
>  	size_t len;
>  
>  	/*
> @@ -998,12 +998,13 @@ _bus_dmamap_load_mbuf(bus_dma_tag_t t, b
>  
>  	if (m->m_pkthdr.len > map->_dm_size)
>  		return (EINVAL);
> +	pktlen = m->m_pkthdr.len;
>  
>  	/* Record mbuf for *_unload */
>  	map->_dm_type = _DM_TYPE_MBUF;
>  	map->_dm_source = m;
>  
> -	i = 0;
> +	j = i = 0;
>  	len = 0;
>  	while (m) {
>  		vaddr_t vaddr = mtod(m, vaddr_t);
> @@ -1044,11 +1045,21 @@ _bus_dmamap_load_mbuf(bus_dma_tag_t t, b
>  			i++;
>  		}
>  		m = m->m_next;
> -		if (m && i >= MAX_DMA_SEGS) {
> +		j++;
> +		if ((m && i >= MAX_DMA_SEGS) || i > map->_dm_segcnt) {
> +			printf("%s: j %d, i %d, _dm_segcnt %d, pktlen %d\n",
> +			    __func__, j, i, map->_dm_segcnt, pktlen);
>  			/* Exceeded the size of our dmamap */
>  			map->_dm_type = 0;
>  			map->_dm_source = NULL;
>  			return (EFBIG);
> +		}
> +	}
> +
> +	if (pktlen > 5000) {
> +		for (k = 0; k < i; k++) {
> +			printf("%s: ds_addr 0x%010lx, ds_len %lu, j %d, i %d, _dm_segcnt %d, pktlen %d\n",
> +			    __func__, segs[k].ds_addr, segs[k].ds_len, j, i, map->_dm_segcnt, pktlen);
>  		}
>  	}
>  
> Index: dev/pci/if_ixl.c
> ===================================================================
> RCS file: /mount/openbsd/cvs/src/sys/dev/pci/if_ixl.c,v
> diff -u -p -r1.97 if_ixl.c
> --- dev/pci/if_ixl.c	14 Feb 2024 22:41:48 -0000	1.97
> +++ dev/pci/if_ixl.c	22 Feb 2024 20:49:26 -0000
> @@ -2897,6 +2897,7 @@ ixl_start(struct ifqueue *ifq)
>  #if NBPFILTER > 0
>  	caddr_t if_bpf;
>  #endif
> +	int error;
>  
>  	if (!LINK_STATE_IS_UP(ifp->if_link_state)) {
>  		ifq_purge(ifq);
> @@ -2937,7 +2938,8 @@ ixl_start(struct ifqueue *ifq)
>  		free--;
>  	}
>  
> -	if (ixl_load_mbuf(sc->sc_dmat, map, m) != 0) {
> +	if ((error = ixl_load_mbuf(sc->sc_dmat, map, m)) != 0) {
> +		printf("%s: ixl_load_mbuf error %d\n", __func__, error);
>  		ifq->ifq_errors++;
>  		m_freem(m);
>  		continue;

-- 
:wq Claudio