From: Patrick Wildt <patrick@blueri.se>
Subject: Re: sdhc(4): fix intr status race
To: Mark Kettenis <mark.kettenis@xs4all.nl>
Cc: tech@openbsd.org, dlg@openbsd.org
Date: Mon, 5 Aug 2024 20:00:47 +0200

Am Mon, Aug 05, 2024 at 11:07:58AM +0200 schrieb Mark Kettenis:
> > Date: Mon, 5 Aug 2024 00:31:01 +0200
> > From: Patrick Wildt <patrick@blueri.se>
> > 
> > Hi,
> > 
> > on my RK3588 the eMMC only worked when running with GENERIC.MP, but with
> > GENERIC or RAMDISK it failed to read command responses unless I compiled
> > it with MULTIPROCESSOR, even if I only run it with one core.
> > 
> > As it turns out, this led to a timing change in the command sequence:
> > 
> > As one of the initialization steps, the eMMC is "enumerated".  It is
> > supposed to timeout/error if there's no second "card".  That's normal.
> > After enumeration, it tries to read more information, which is where
> > it then failed.
> > 
> > > sdhc_exec_command: start fresh command (ENUMERATION)
> > > dwmshc0: interrupt status=8000<ERROR>
> > > dwmshc0: interrupt status=1<CMD>
> > > sdhc_exec_command: start fresh command (READ)
> > > dwmshc0: interrupt status=1<CMD>
> 
> I wonder what other SDHC controllers do.  Do they set ERROR and CMD
> simultaneously?  Or don't they set CMD at all?

On the Armada 8040 I *only* see ERROR, no CMD.  Sometimes I wonder if
the CMD is caused by the host reset we execute.

> > As can be seen, the error-ing command seems to throw two separate
> > interrupts.
> > 
> > This first interrupt will:
> >  * hp->intr_status |= ERROR;
> >  * wakeup() the tsleep-ing sdhc_wait_intr(), which will return the
> >    information that the command failed
> > 
> > Before the next command is started, the second interrupt will:
> >  * hp->intr_status |= CMD_COMPLETE;
> > 
> > When the next command is executed, CMD_COMPLETE remains set, hence
> > sdhc_wait_intr() will assume it's already complete, even though it
> > isn't.  Hence invalid data is read.
> > 
> > I'm not sure if there's a more elegant solution, if we should block
> > CMD completion interrupts if we're not starting a new command or
> > something like that, but the easiest thing seems to be to just reset
> > the state before we start the next command.
> > 
> > Opinions?
> 
> As far as I can tell there is no guarantee that the CMD interrupt
> happens before you clear the variable.  But maybe that is just a
> theoretical problem.

On this machine I definitely get the IRQ immediately after the splx()
in sdhc_wait_intr().  I would hope/expect that the host reset should
ensure there are no other ones incoming later.

> I wonder how other OSes hadle this.  I quickly looked at NetBSD this
> morning but didn't immediately spot something that would fix this
> issue in their version of the code.
> 
> > diff --git a/sys/dev/sdmmc/sdhc.c b/sys/dev/sdmmc/sdhc.c
> > index eb0a8e5374b..637c99f485d 100644
> > --- a/sys/dev/sdmmc/sdhc.c
> > +++ b/sys/dev/sdmmc/sdhc.c
> > @@ -1057,6 +1057,9 @@ sdhc_start_command(struct sdhc_host *hp, struct sdmmc_command *cmd)
> >  	DPRINTF(1,("%s: cmd=%#x mode=%#x blksize=%d blkcount=%d\n",
> >  	    DEVNAME(hp->sc), command, mode, blksize, blkcount));
> >  
> > +	/* We're starting a new command, reset state. */
> > +	hp->intr_status = 0;
> > +
> >  	/*
> >  	 * Start a CPU data transfer.  Writing to the high order byte
> >  	 * of the SDHC_COMMAND register triggers the SD command. (1.5)
> > 
> >