Index | Thread | Search

From:
Patrick Wildt <patrick@blueri.se>
Subject:
Re: smmu(4) on QC Laptops
To:
Mark Kettenis <mark.kettenis@xs4all.nl>
Cc:
tech@openbsd.org, tobhe@openbsd.org, mlarkin@openbsd.org
Date:
Fri, 1 Aug 2025 21:21:37 +0200

Download raw body.

Thread
On Sat, May 24, 2025 at 07:21:59AM +0200, Patrick Wildt wrote:
> On Fri, May 23, 2025 at 07:58:29PM +0200, Mark Kettenis wrote:
> > > Date: Fri, 23 May 2025 08:46:16 +0200
> > > From: Patrick Wildt <patrick@blueri.se>
> > > Cc: kettenis@openbsd.org, tobhe@openbsd.org, mlarkin@openbsd.org
> > > Content-Type: text/plain; charset=us-ascii
> > > Content-Disposition: inline
> > > 
> > > Hi,
> > > 
> > > Nearly three years ago we disabled smmu(4) for QC laptops because they
> > > just rebooted instantly.  I've finally gone to the bottom of why this
> > > happens.
> > > 
> > > On servers we usually boot up with most peripherals not doing any kind
> > > of DMA.  A serial console is kind of the only thing we depend on, which
> > > usually doesn't use DMA for simplicity.  Everything else, like NVMe or
> > > USB will get turned on later.  For those use cases it's been perfectly
> > > fine to just turn the SMMU on with strict filtering, because only once
> > > the driver configures puts in a job DMA transactions will occur.
> > > 
> > > On the QC laptops the framebuffer goes through an SMMU.  That's one of
> > > the reasons we need to keep the streams alive and actively put them into
> > > a bypass context, which we do!  As soon as a driver attaches, we move
> > > the stream over into a strict translation context.
> > > 
> > > Unfortunately the changes I did were not completely sufficient.
> > > 
> > > (1) The IOMMU enforcement happens before the driver gets a chance to
> > >     stop current work.
> > > 
> > > simplebus(4) creates an IOMMU-bound DMA tag before it attempts to attach
> > > the children.  This leads do two interesting things:
> > > 
> > > (a) If a simplebus(4) has an IOMMU assigned, it is passed an IOMMU-bound
> > >     DMA tag.  All children will get subjected to it.
> > > 
> > > smmu1: establishing sid 0x423
> > > simplebus1 at simplebus0: "geniqup"
> > > "serial" at simplebus1 not configured
> > > smmu1: establishing sid 0x123
> > > simplebus2 at simplebus0: "geniqup"
> > > 
> > > (b) If there's no driver to be found, the IOMMU-bound DMA tag is created
> > >     anyway.
> > > 
> > > smmu1: establishing sid 0x1c00
> > > smmu1: took over 8/1c00/2 for sid 1c00
> > > "display-subsystem" at simplebus0 not configured
> > > 
> > > The good thing is that (a) is not a problem in this case as we do not
> > > have any active transactions such mappings, as far as I can see on the
> > > X1E.  But the framebuffer is behind the display-subsystem which is now
> > > subjected to strict IOMMU configuration, making us need to fix (b).
> > > 
> > > One idea I have is that we keep all the setup code where it is, at
> > > IOMMU-bound DMA tag creation time, but only move the stream to the
> > > strict context block once someone(TM) creates a DMA map using this
> > > tag.
> > > 
> > > Is that a reasonable thing to do?  I don't think anyone would be doing
> > > DMA without creating a DMA map first, right?
> > 
> > Correct.
> > 
> > I'm not sure how this is going to work if we want to write a driver
> > for the display controller.  Presumably we'd create IOMMU mappings for
> > the initial framebuffer.  But to do so, we'd need to create a DMA map
> > first, and when we do that...  game over?
> > 
> > So maybe we need to do the takeover when we do the first DMA map load,
> > after we've entered the mappings into the page table?  But that only
> > works if we have a single mapping for the framebuffer.
> 
> I was hoping that we'd have code that first turns off the transactions,
> making us "go blind" until we re-enable it with the new address.  But
> yes, that would require the disable sequence to happen before the map
> is created.  We could do it on load, which is kind of a hot-path.  I
> was hoping for now we put it into map create and tackle this once or
> if we get around to writing this driver.
> 
> > On the Apple systems, the framebuffer is also behind an IOMMU.  There
> > we have some code that creates some initial mappings based on
> > information from the device tree.
> > 
> > > (2) Re-use of SMRs doesn't take mask in account.
> > > 
> > > In (1b) you can see that the display subsystem is using sid 0x1c00 with
> > > mask 0x2, matching to the following SMR:
> > > 
> > > smmu1: SMR[8] = 0x1c00/0x2
> > > 
> > > We'd like to re-use that if possible, but because of the strict mask
> > > checking that wasn't possible.  So I'd like to take the mask into
> > > consideration when "taking over" streams.  This is only done on QC!
> > 
> > Sorry, you'll have to explain to me what what the SMRs do...
> 
> Think work slightly different depending on SMMU version/feature flags,
> but for this machine it works an IOMMU setup like this:
> 
> You configure a context block.  The context block is basically the IOMMU
> group that owns the pagetable.  To assign devices to a context block,
> you setup a "Stream Mapping Register", e.g.:
> 
>  * Allocate + setup context block 3 (e.g. IOMMU group 3)
>    * This is the actual pagetable and translation setup
>  * Allocate a stream map entry (can be any, e.g. 5)
>    * The array has a fixed length, we need to use one slot to
>      point device to context block.
>  * Write(SMR(5), DeviceId + Mask + Valid)
>    * For each DMA transaction, the SMMUs stream mapping function will
>      iterate the stream map registers to find an enty where:
>      StreamId & Mask == DeviceId & Mask
>    * In this example, after writing SMR(5), slot 5 should match
>  * Write(S2CR(5) = 3)
>    * This points stream map 5 to context block 3
>    * In this example, after the HW's lookup function returns 5, it will
>      look into S2CR(5) to find which context block it belongs to
> 
> On this laptop, the SMRs basically tell you what devices are currently
> enabled to pass through the IOMMU.  We update the SMRs to point all
> currenetly allowed devices to a "bypass domain", so that existing
> traffic goes into a 1:1 translation.
> 
> One we want attach a driver and want to make a device use stricter IOMMU
> mappings, we setup a context block and then switch the device from the
> bypass IOMMU group/context block to the strict IOMMU group/context
> block.
> 
> In reality we're not actually changing the SMR (that's just the filter
> rule), we're updating the S2CR to point the stream to a different group.
> 
> But if we don't "take over" the existing SMR mapping, the code will
> allocate a new SMR register, and then the IOMMU will complain about a
> match conflict because an incoming transaction matches two SMR entries.
> 
> Let me know if I worded this too confusingly!
> 
> > > Please give this a go on machines where `dmesg | grep ^smmu' gives
> > > output, and especially on QC machines with X1E, or SC8280XP/ThinkPad
> > > x13s.  Please send me some dmesgs, as I have added some debug code
> > > that I'd like to see run on other machines. :)
> > 
> > Here's the Ampere eMAG machine.  ACPI so some of your changes don't
> > matter, but it still works with your diff.
> > 
> 
> Thanks! :)
> 
> Cheers,
> Patrick

Updated diff that sets the SMR register not just to the stream id but
also the mask we kept saved.

diff --git a/sys/arch/arm64/conf/GENERIC b/sys/arch/arm64/conf/GENERIC
index adee39bb278..ac593d042ab 100644
--- a/sys/arch/arm64/conf/GENERIC
+++ b/sys/arch/arm64/conf/GENERIC
@@ -93,7 +93,7 @@ pci*		at pciecam?
 sdhc*		at fdt?
 sdmmc*		at sdhc?
 bwfm*		at sdmmc?	# Broadcom FullMAC
-smmu*		at fdt?
+smmu*		at fdt? early 1
 xhci*		at fdt?
 ccp*		at fdt?		# AMD Cryptographic Co-processor
 ipmi*		at fdt?
diff --git a/sys/arch/arm64/dev/smmu.c b/sys/arch/arm64/dev/smmu.c
index edf5dd26de8..76ef21ae365 100644
--- a/sys/arch/arm64/dev/smmu.c
+++ b/sys/arch/arm64/dev/smmu.c
@@ -614,11 +614,11 @@ smmu_domain_create(struct smmu_softc *sc, uint32_t sid)
 			/* Take over QCOM SMRs */
 			if (sc->sc_is_qcom && sc->sc_smr[i] != NULL &&
 			    sc->sc_smr[i]->ss_dom == NULL &&
-			    sc->sc_smr[i]->ss_id == sid &&
-			    sc->sc_smr[i]->ss_mask == 0) {
-				free(sc->sc_smr[i], M_DEVBUF,
-				    sizeof(struct smmu_smr));
-				sc->sc_smr[i] = NULL;
+			    !((sc->sc_smr[i]->ss_id ^ sid) &
+			    ~sc->sc_smr[i]->ss_mask)) {
+				sc->sc_smr[i]->ss_dom = dom;
+				dom->sd_smr_idx = i;
+				break;
 			}
 			if (sc->sc_smr[i] != NULL)
 				continue;
@@ -771,6 +771,28 @@ smmu_domain_create(struct smmu_softc *sc, uint32_t sid)
 		reg |= SMMU_CB_SCTLR_ASIDPNE;
 	smmu_cb_write_4(sc, dom->sd_cb_idx, SMMU_CB_SCTLR, reg);
 
+	snprintf(dom->sd_exname, sizeof(dom->sd_exname), "%s:%x",
+	    sc->sc_dev.dv_xname, sid);
+	dom->sd_iovamap = extent_create(dom->sd_exname, 0,
+	    (1LL << iovabits) - 1, M_DEVBUF, NULL, 0, EX_WAITOK |
+	    EX_NOCOALESCE);
+
+	/* Reserve first page (to catch NULL access) */
+	extent_alloc_region(dom->sd_iovamap, 0, PAGE_SIZE, EX_WAITOK);
+
+	SIMPLEQ_INSERT_TAIL(&sc->sc_domains, dom, sd_list);
+	return dom;
+}
+
+static void
+smmu_domain_enable(struct smmu_domain *dom)
+{
+	struct smmu_softc *sc = dom->sd_sc;
+	uint32_t reg;
+
+	if (dom->sd_enabled)
+		return;
+
 	/* Point stream to context block */
 	reg = SMMU_S2CR_TYPE_TRANS | dom->sd_cb_idx;
 	if (sc->sc_has_exids && sc->sc_smr)
@@ -779,23 +801,14 @@ smmu_domain_create(struct smmu_softc *sc, uint32_t sid)
 
 	/* Map stream idx to S2CR idx */
 	if (sc->sc_smr) {
-		reg = sid;
+		reg = sc->sc_smr[dom->sd_smr_idx]->ss_id << SMMU_SMR_ID_SHIFT |
+		    sc->sc_smr[dom->sd_smr_idx]->ss_mask << SMMU_SMR_MASK_SHIFT;
 		if (!sc->sc_has_exids)
 			reg |= SMMU_SMR_VALID;
 		smmu_gr0_write_4(sc, SMMU_SMR(dom->sd_smr_idx), reg);
 	}
 
-	snprintf(dom->sd_exname, sizeof(dom->sd_exname), "%s:%x",
-	    sc->sc_dev.dv_xname, sid);
-	dom->sd_iovamap = extent_create(dom->sd_exname, 0,
-	    (1LL << iovabits) - 1, M_DEVBUF, NULL, 0, EX_WAITOK |
-	    EX_NOCOALESCE);
-
-	/* Reserve first page (to catch NULL access) */
-	extent_alloc_region(dom->sd_iovamap, 0, PAGE_SIZE, EX_WAITOK);
-
-	SIMPLEQ_INSERT_TAIL(&sc->sc_domains, dom, sd_list);
-	return dom;
+	dom->sd_enabled = 1;
 }
 
 void
@@ -1299,6 +1312,8 @@ smmu_dmamap_create(bus_dma_tag_t t, bus_size_t size, int nsegments,
 	u_long dva, len;
 	int error;
 
+	smmu_domain_enable(dom);
+
 	error = sc->sc_dmat->_dmamap_create(sc->sc_dmat, size,
 	    nsegments, maxsegsz, boundary, flags, &map);
 	if (error)
diff --git a/sys/arch/arm64/dev/smmu_fdt.c b/sys/arch/arm64/dev/smmu_fdt.c
index 2c645da3f37..ac9ad9a9089 100644
--- a/sys/arch/arm64/dev/smmu_fdt.c
+++ b/sys/arch/arm64/dev/smmu_fdt.c
@@ -89,11 +89,6 @@ smmu_fdt_attach(struct device *parent, struct device *self, void *aux)
 	if (OF_getproplen(faa->fa_node, "dma-coherent") == 0)
 		sc->sc_coherent = 1;
 
-	if (sc->sc_is_qcom) {
-		printf(": disabled\n");
-		return;
-	}
-
 	if (smmu_attach(sc) != 0)
 		return;
 
diff --git a/sys/arch/arm64/dev/smmuvar.h b/sys/arch/arm64/dev/smmuvar.h
index c01e10518ee..03373167089 100644
--- a/sys/arch/arm64/dev/smmuvar.h
+++ b/sys/arch/arm64/dev/smmuvar.h
@@ -18,6 +18,7 @@
 struct smmu_softc;
 struct smmu_domain {
 	struct smmu_softc		*sd_sc;
+	uint32_t			 sd_enabled;
 	uint32_t			 sd_sid;
 	bus_dma_tag_t			 sd_dmat;
 	int				 sd_cb_idx;