From: Stefan Sperling
Subject: Re: bge/bnx/iavf/igc/ix/ixl/ngbe/pcn: ifq_restart() fix
To: YASUOKA Masahiko
Cc: tech@openbsd.org
Date: Wed, 25 Jun 2025 09:53:19 +0200

On Wed, Jun 25, 2025 at 10:33:52AM +0900, YASUOKA Masahiko wrote:
> I'd like to know the problem clearly for future debugging.
>
> On vmx(4), if_vmx.c,v 1.90 moved the chunk
>
>	if (free <= NTXSEGS) {
>		ifq_set_oactive(ifq);
>
> to the middle of the loop in vmxnet3_start().  So it was stuck without
> OACTIVE.  tx could stop under the following conditions:
>
>  - start() isn't called because the queue length >= the max queue length
>  - tx ring is empty
>
> (I put the whole scenario at the end of the message)
>
> I could trigger the same bug on ice(4) with iperf as follows:
>
>   for i in `seq 5`; do (iperf -l0 -t 0 -c 2001:db8::1 -u &) ; done
>
> This needs another machine which runs iperf -u -s (and -V if using IPv6).
>
> ice was stuck with OACTIVE.  Kevin reported ix was stuck with OACTIVE.
>
> These drivers must be stuck by a different scenario than vmx because
> they set OACTIVE at the beginning of the loop in drv_start().
>
> A scenario I imagine is:
>
> 1. tx full in drv_start()
>    - set oactive
> 2. interrupt happens, drv_txeof() is called
>    - no tx is consumed
>    - ifq_restart(), but 1. is not finished so the task is queued in
>      ifq_serialize()
> 3. returned from 1., the ifq_restart() queued in 2. is called
>    - clear oactive
>    - drv_start() is called
>    - set oactive again
>    - (do nothing for tx ring)
> 4. interrupt happens, drv_txeof() is called
>    - all tx are consumed
>    - ifq_restart(), but if 3. is not finished, the task is *not*
>      queued (since 3. is also ifq_restart())
>
>    -> tx stuck
>       - the tx ring is empty
>       - oactive is set
>       - so drv_start() is not called (tx stays empty)
>
> If you know another or the real scenario, I'd like to know that.

What we know for certain about the problematic case is that:

1) When a TX-EOF interrupt handler is entered, the oactive flag is clear
   even though the actual ring state, based on the values of txq_prod and
   txq_cons on function entry, is "full".  My earlier workaround for
   ice(4) was based on this observation.

2) The interface eventually gets stuck with the oactive flag set and no
   longer processes any packets.

3) This problem was observed during operation with a single Tx queue,
   with interrupts on cpu0, and with userland constantly sending frames
   to the queue, presumably from any CPU (UDP send runs without the
   kernel lock nowadays).

4) We could fix the issue by avoiding scheduling of ifq_restart_task()
   if no new space has been made on the Tx ring.

I believe the fix works because ifq_restart_task() assumes that clearing
OACTIVE is always safe, without checking the current driver ring state.
If ifq_restart_task() runs while the ring is in fact full (filled by
userland via another CPU), we can end up in an inconsistent state: the
ring is full but OACTIVE is clear.  Once this inconsistent state occurs,
some as yet unknown race condition can trigger the problem.
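
For illustration, here is a minimal sketch of the idea in 4).  This is
not a diff against any of the drivers above: the drv_* names, the ring
layout and DRV_NTXDESC are made-up placeholders, and it assumes the
usual kernel headers; only ifq_is_oactive() and ifq_restart() are the
actual ifq(9) interface.  The point is that the txeof path only asks
for a restart when it has actually reclaimed descriptors:

	/*
	 * Sketch only; field names and helpers are hypothetical.
	 */
	#define DRV_NTXDESC	256

	struct drv_txring {
		struct ifqueue	*txr_ifq;	/* queue we transmit from */
		unsigned int	 txr_prod;	/* next slot drv_start() fills */
		unsigned int	 txr_cons;	/* next slot to reclaim */
	};

	int	drv_desc_done(struct drv_txring *, unsigned int);
	void	drv_unload_desc(struct drv_txring *, unsigned int);

	void
	drv_txeof(struct drv_txring *txr)
	{
		struct ifqueue *ifq = txr->txr_ifq;
		unsigned int cons = txr->txr_cons;
		unsigned int freed = 0;

		/* Reclaim descriptors the hardware has completed. */
		while (cons != txr->txr_prod && drv_desc_done(txr, cons)) {
			drv_unload_desc(txr, cons);
			cons = (cons + 1) % DRV_NTXDESC;
			freed++;
		}
		txr->txr_cons = cons;

		/*
		 * Only schedule ifq_restart_task() if this call actually
		 * made new space on the ring.  An unconditional restart
		 * lets the task clear OACTIVE while the ring may still be
		 * full, which is the inconsistent state described above.
		 */
		if (freed > 0 && ifq_is_oactive(ifq))
			ifq_restart(ifq);
	}

With the freed > 0 check the restart task can only run after the ring
has gained space, so clearing OACTIVE in ifq_restart_task() no longer
races against a ring that is still full.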