From: Mark Kettenis <mark.kettenis@xs4all.nl>
Subject: Re: sdhc(4): fix intr status race
To: Patrick Wildt <patrick@blueri.se>
Cc: tech@openbsd.org, kettenis@openbsd.org, dlg@openbsd.org
Date: Mon, 05 Aug 2024 11:07:58 +0200

> Date: Mon, 5 Aug 2024 00:31:01 +0200
> From: Patrick Wildt <patrick@blueri.se>
> 
> Hi,
> 
> on my RK3588 the eMMC only worked when running with GENERIC.MP, but with
> GENERIC or RAMDISK it failed to read command responses unless I compiled
> it with MULTIPROCESSOR, even if I only run it with one core.
> 
> As it turns out, this led to a timing change in the command sequence:
> 
> As one of the initialization steps, the eMMC is "enumerated".  It is
> supposed to timeout/error if there's no second "card".  That's normal.
> After enumeration, it tries to read more information, which is where
> it then failed.
> 
> > sdhc_exec_command: start fresh command (ENUMERATION)
> > dwmshc0: interrupt status=8000<ERROR>
> > dwmshc0: interrupt status=1<CMD>
> > sdhc_exec_command: start fresh command (READ)
> > dwmshc0: interrupt status=1<CMD>

I wonder what other SDHC controllers do.  Do they set ERROR and CMD
simultaneously?  Or don't they set CMD at all?

> As can be seen, the error-ing command seems to throw two separate
> interrupts.
> 
> This first interrupt will:
>  * hp->intr_status |= ERROR;
>  * wakeup() the tsleep-ing sdhc_wait_intr(), which will return the
>    information that the command failed
> 
> Before the next command is started, the second interrupt will:
>  * hp->intr_status |= CMD_COMPLETE;
> 
> When the next command is executed, CMD_COMPLETE remains set, hence
> sdhc_wait_intr() will assume it's already complete, even though it
> isn't.  Hence invalid data is read.
> 
> I'm not sure if there's a more elegant solution, if we should block
> CMD completion interrupts if we're not starting a new command or
> something like that, but the easiest thing seems to be to just reset
> the state before we start the next command.
> 
> Opinions?

As far as I can tell there is no guarantee that the CMD interrupt
happens before you clear the variable.  But maybe that is just a
theoretical problem.

I wonder how other OSes hadle this.  I quickly looked at NetBSD this
morning but didn't immediately spot something that would fix this
issue in their version of the code.

> diff --git a/sys/dev/sdmmc/sdhc.c b/sys/dev/sdmmc/sdhc.c
> index eb0a8e5374b..637c99f485d 100644
> --- a/sys/dev/sdmmc/sdhc.c
> +++ b/sys/dev/sdmmc/sdhc.c
> @@ -1057,6 +1057,9 @@ sdhc_start_command(struct sdhc_host *hp, struct sdmmc_command *cmd)
>  	DPRINTF(1,("%s: cmd=%#x mode=%#x blksize=%d blkcount=%d\n",
>  	    DEVNAME(hp->sc), command, mode, blksize, blkcount));
>  
> +	/* We're starting a new command, reset state. */
> +	hp->intr_status = 0;
> +
>  	/*
>  	 * Start a CPU data transfer.  Writing to the high order byte
>  	 * of the SDHC_COMMAND register triggers the SD command. (1.5)
> 
>