From: Alexander Bluhm
Subject: Re: revert m_defrag header align
To: tech@openbsd.org
Date: Fri, 23 Feb 2024 16:54:59 +0100

On Fri, Feb 23, 2024 at 12:32:13PM +0100, Claudio Jeker wrote:
> On Fri, Feb 23, 2024 at 11:55:43AM +0100, Alexander Bluhm wrote:
> > Hi,
> >
> > I think we should revert the diff that aligns the mbuf to the IP header
> > in m_defrag().  The cause of the change was that ether_extract_headers()
> > crashed in em(4) when it was reading the TCP header for offloading on
> > sparc64.  This has been fixed by using memcpy().
> >
> > Now viommu_dvmamap_load_raw() on sparc64 calls iommu_iomap_insert_page()
> > twice as often with overlapping pages.  m_defrag() is intended as a
> > last resort to make DMA transfers to the hardware.  Therefore page
> > alignment is more important than IP header alignment.
> >
> > ok?
>
> This makes little sense.  We offset a 1500 byte packet by 2 bytes inside
> an 8k page.  This should not result in extra page use.  Something else
> is off here.

My printf debugging tells a different story.  Debugging diff attached,
so you can see what I was looking at.

Note that I had to add an i > map->_dm_segcnt check to avoid loading
too many segments into DMA.  This prevents the "iomap insert error: 12
for pa 0xded4e6000" errors I was seeing before.  ixl supports only 8
segments.  The driver first tries to load the 1500 byte packets
combined in 2k clusters and fails.  This mbuf chain in the TCP output
buffer is a result of socket splicing small packets.  TSO creates a
31922 byte packet consisting of about 22 clusters.  The first attempt
to load it into DMA fails.

_bus_dmamap_load_mbuf: j 9, i 9, _dm_segcnt 8, pktlen 31922

Then m_defrag() creates one big cluster containing the 31922 byte
packet.  ixl supports up to 8 DMA segments of 16k each, at least that
is what the driver says.  The cluster is split into 3 segments.
_bus_dmamap_load_mbuf: ds_addr 0x004eba0002, ds_len 8192, j 1, i 3, _dm_segcnt 8, pktlen 31922
_bus_dmamap_load_mbuf: ds_addr 0x004eba2002, ds_len 8192, j 1, i 3, _dm_segcnt 8, pktlen 31922
_bus_dmamap_load_mbuf: ds_addr 0x004eba4002, ds_len 15538, j 1, i 3, _dm_segcnt 8, pktlen 31922

Note that the segments start 2 bytes after the page boundary due to the
header alignment of the mbuf.  That means each segment covers two pages;
the page holding the remaining 2 bytes overlaps and is loaded twice.

viommu_dvmamap_load_raw: a 0x004eba0000, aend 0x004eba4000
viommu_dvmamap_load_raw: a 0x004eba2000, aend 0x004eba4000
viommu_dvmamap_load_raw: a 0x004eba2000, aend 0x004eba6000
viommu_dvmamap_load_raw: a 0x004eba4000, aend 0x004eba6000
viommu_dvmamap_load_raw: a 0x004eba4000, aend 0x004eba8000
viommu_dvmamap_load_raw: a 0x004eba6000, aend 0x004eba8000

With the m_defrag() header alignment reverted, we are back to 4 pages
instead of 6.  As the DMA restrictions of ixl for 64k TSO are tight,
we should not waste resources here.  Header alignment after m_defrag()
does not make sense for DMA.

Unfortunately, after fixing the two bugs, large TSO on ixl still has
problems.  The memory layout after socket splicing is the worst test
case.  Jan has tested ixl on a different sparc64 that uses iommu
instead of viommu.  There everything works.  em and ix TSO also works
on viommu.  Your m_defrag alignment diff had no regressions, as it
works for em and ix, and ixl on viommu was broken before.  The
critical corner case is TSO + ixl + viommu sparc64.

To make progress in debugging I have to
1. revert the m_defrag header align
2. limit _bus_dmamap_load_mbuf to map->_dm_segcnt
3.
find the remaining bugs

bluhm

Index: arch/sparc64/dev/viommu.c
===================================================================
RCS file: /mount/openbsd/cvs/src/sys/arch/sparc64/dev/viommu.c,v
diff -u -p -r1.20 viommu.c
--- arch/sparc64/dev/viommu.c	16 May 2021 15:10:19 -0000	1.20
+++ arch/sparc64/dev/viommu.c	22 Feb 2024 22:40:01 -0000
@@ -223,6 +223,7 @@ viommu_dvmamap_create(bus_dma_tag_t t, b
 	int ret;
 	bus_dmamap_t map;
 	struct iommu_map_state *ims;
+	u_long pages;
 
 	BUS_DMA_FIND_PARENT(t, _dmamap_create);
 	ret = (*t->_dmamap_create)(t, t0, size, nsegments, maxsegsz, boundary,
@@ -231,7 +232,8 @@ viommu_dvmamap_create(bus_dma_tag_t t, b
 	if (ret)
 		return (ret);
 
-	ims = viommu_iomap_create(atop(round_page(size)));
+	pages = atop(round_page(size));
+	ims = viommu_iomap_create(4 * pages);
 
 	if (ims == NULL) {
 		bus_dmamap_destroy(t0, map);
@@ -535,6 +537,10 @@ viommu_dvmamap_load_raw(bus_dma_tag_t t,
 			    "pa 0x%lx\n", err, a);
 			iommu_iomap_clear_pages(ims);
 			return (EFBIG);
+		}
+		if (size > 5000) {
+			printf("%s: a 0x%010lx, aend 0x%010lx\n",
+			    __func__, a, aend);
 		}
 	}
 
Index: arch/sparc64/sparc64/machdep.c
===================================================================
RCS file: /mount/openbsd/cvs/src/sys/arch/sparc64/sparc64/machdep.c,v
diff -u -p -r1.208 machdep.c
--- arch/sparc64/sparc64/machdep.c	10 Feb 2024 07:10:13 -0000	1.208
+++ arch/sparc64/sparc64/machdep.c	23 Feb 2024 12:18:36 -0000
@@ -987,7 +987,7 @@ _bus_dmamap_load_mbuf(bus_dma_tag_t t, b
     struct mbuf *m, int flags)
 {
 	bus_dma_segment_t segs[MAX_DMA_SEGS];
-	int i;
+	int i, j, k, pktlen;
 	size_t len;
 
 	/*
@@ -998,12 +998,13 @@ _bus_dmamap_load_mbuf(bus_dma_tag_t t, b
 	if (m->m_pkthdr.len > map->_dm_size)
 		return (EINVAL);
 
+	pktlen = m->m_pkthdr.len;
 	/* Record mbuf for *_unload */
 	map->_dm_type = _DM_TYPE_MBUF;
 	map->_dm_source = m;
-	i = 0;
+	j = i = 0;
 	len = 0;
 	while (m) {
 		vaddr_t vaddr = mtod(m, vaddr_t);
@@ -1044,11 +1045,21 @@ _bus_dmamap_load_mbuf(bus_dma_tag_t t, b
 			i++;
 		}
 		m = m->m_next;
-		if (m && i >= MAX_DMA_SEGS) {
+		j++;
+		if ((m && i >= MAX_DMA_SEGS) || i > map->_dm_segcnt) {
+			printf("%s: j %d, i %d, _dm_segcnt %d, pktlen %d\n",
+			    __func__, j, i, map->_dm_segcnt, pktlen);
 			/* Exceeded the size of our dmamap */
 			map->_dm_type = 0;
 			map->_dm_source = NULL;
 			return (EFBIG);
+		}
+	}
+
+	if (pktlen > 5000) {
+		for (k = 0; k < i; k++) {
+			printf("%s: ds_addr 0x%010lx, ds_len %lu, j %d, i %d, _dm_segcnt %d, pktlen %d\n",
+			    __func__, segs[k].ds_addr, segs[k].ds_len, j, i, map->_dm_segcnt, pktlen);
 		}
 	}
 
Index: dev/pci/if_ixl.c
===================================================================
RCS file: /mount/openbsd/cvs/src/sys/dev/pci/if_ixl.c,v
diff -u -p -r1.97 if_ixl.c
--- dev/pci/if_ixl.c	14 Feb 2024 22:41:48 -0000	1.97
+++ dev/pci/if_ixl.c	22 Feb 2024 20:49:26 -0000
@@ -2897,6 +2897,7 @@ ixl_start(struct ifqueue *ifq)
 #if NBPFILTER > 0
 	caddr_t if_bpf;
 #endif
+	int error;
 
 	if (!LINK_STATE_IS_UP(ifp->if_link_state)) {
 		ifq_purge(ifq);
@@ -2937,7 +2938,8 @@ ixl_start(struct ifqueue *ifq)
 			free--;
 		}
 
-		if (ixl_load_mbuf(sc->sc_dmat, map, m) != 0) {
+		if ((error = ixl_load_mbuf(sc->sc_dmat, map, m)) != 0) {
+			printf("%s: ixl_load_mbuf error %d\n", __func__, error);
 			ifq->ifq_errors++;
 			m_freem(m);
 			continue;