From: Jonathan Matthew <jonathan@d14n.org>
Subject: Re: interface multiqueue timout race
To: Alexander Bluhm <bluhm@openbsd.org>
Cc: David Gwynne <david@gwynne.id.au>, Openbsd Tech <tech@openbsd.org>
Date: Wed, 2 Oct 2024 17:01:39 +1000

On Mon, Sep 30, 2024 at 08:38:39PM +0200, Alexander Bluhm wrote:
> On Sun, Sep 29, 2024 at 09:57:38PM +1000, David Gwynne wrote:
> > On Sat, Sep 28, 2024 at 11:07:48AM +0200, Alexander Bluhm wrote:
> > > On Sat, Sep 28, 2024 at 05:08:05PM +1000, David Gwynne wrote:
> > > >
> > > >
> > > > > On 27 Sep 2024, at 23:11, Alexander Bluhm <bluhm@openbsd.org> wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > From time to time I see strange hangs or crashes in my network test
> > > > > lab.  Like this one
> > > > >
> > > > > ddb{0}> trace
> > > > > db_enter() at db_enter+0x19
> > > > > intr_handler(ffff80005bfab210,ffff80000009e780) at intr_handler+0x91
> > > > > Xintr_ioapic_edge16_untramp() at Xintr_ioapic_edge16_untramp+0x18f
> > > > > Xspllower() at Xspllower+0x1d
> > > > > pool_multi_alloc(ffffffff827a2a48,2,ffff80005bfab504) at pool_multi_alloc+0xcb
> > > > > m_pool_alloc(ffffffff827a2a48,2,ffff80005bfab504) at m_pool_alloc+0x4b
> > > > > pool_p_alloc(ffffffff827a2a48,2,ffff80005bfab504) at pool_p_alloc+0x68
> > > > > pool_do_get(ffffffff827a2a48,2,ffff80005bfab504) at pool_do_get+0xe5
> > > > > pool_get(ffffffff827a2a48,2) at pool_get+0xad
> > > > > m_clget(0,2,802) at m_clget+0x1cf
> > > > > igc_get_buf(ffff8000004ff8e8,109) at igc_get_buf+0xb8
> > > > > igc_rxfill(ffff8000004ff8e8) at igc_rxfill+0xad
> > > > > igc_rxrefill(ffff8000004ff8e8) at igc_rxrefill+0x27
> > > > > softclock_process_tick_timeout(ffff8000004ff970,1) at softclock_process_tick_timeout+0x103
> > > > > softclock(0) at softclock+0x11e
> > > > > softintr_dispatch(0) at softintr_dispatch+0xe6
> > > > > Xsoftclock() at Xsoftclock+0x27
> > > > > acpicpu_idle() at acpicpu_idle+0x131
> > > > > sched_idle(ffffffff8277aff0) at sched_idle+0x298
> > > > > end trace frame: 0x0, count: -19
> > > > >
> > > > > igc_rxrefill() may be called from both, timeout or receive interrupt.
> > > > > As interrupts are per CPU and timeout can be on any CPU there should
> > > > > be some lock.  Easy fix is to put a mutex around igc_rxrefill().
> > > > >
> > > > > Note that I have fixed similar problem for em(4) a while ago.  There
> > > > > splnet() was enough as it is not multi threaded.
> > > > >
> > > > > I have added receive mutex for bnxt, igc, ix, ixl as I have these
> > > > > interfaces in my lab.
> > > >
> > > > fwiw, the timeout and rx interrupt processing are supposed to be exclusive because you can't rx packets when the ring is empty, and the timeout is only supposed to be scheduled when the ring is empty.
> > > 
> > > Thanks for the explanation.  Now I understand how this is supposed
> > > to work.  I am not sure that my diff fixes it as the problem is
> > > very rare.
> > 
> > well, it's a subtle semantic that's not very well expressed anywhere,
> > and code needs to be careful about implementing it and that becomes less
> > true over time.
> > 
> > if rxfill isn't careful about how it updates state, in particular the
> > producer indexes relative to descriptors, it's believable an interrupt
> > running the rxeof code on another cpu will get confused.
> > 
> > > I see sporadic hangs in igc.  And I saw a trap in bnxt_rx.
> > 
> > by hang do you mean panic? or weird traffic stalls?
> 
> The igc problems are not so clear.  It is something between permanent
> and temporary traffic stalls.  Let's fix bnxt first, then I can
> collect more data about igc.  They are on the same machiene, tests
> driven by cron.
> 
> > >   2325          if (ag != NULL) {
> > >   2326                  bs = &rx->rx_ag_slots[ag->opaque];
> > >   2327                  bus_dmamap_sync(sc->sc_dmat, bs->bs_map, 0,
> > >   2328                      bs->bs_map->dm_mapsize, BUS_DMASYNC_POSTREAD);
> > >   2329                  bus_dmamap_unload(sc->sc_dmat, bs->bs_map);
> > >   2330
> > >   2331                  am = bs->bs_m;
> > >   2332                  bs->bs_m = NULL;
> > > * 2333                  am->m_len = letoh16(ag->len);
> > >   2334                  m->m_next = am;
> > >   2335                  m->m_pkthdr.len += am->m_len;
> > >   2336                  (*agslots)++;
> > >   2337          }
> > > 
> > > It faults when am->m_len is dereferenced.  So bs->bs_m must be
> > > bogous.  It happened only once and I dont know whether am is NULL
> > > or just invalid.
> > > 
> > > A timeout on an other CPU would be a nice expanation.  But you are
> > > right, this should not happen.  Does this "or" condition violate
> > > your assumption?
> > > 
> > >   1661                  bnxt_rx_fill(q);
> > >   1662                  if ((rx->rx_cons == rx->rx_prod) ||
> > >   1663                      (rx->rx_ag_cons == rx->rx_ag_prod))
> > >   1664                          timeout_add(&rx->rx_refill, 0);
> > 
> > yes...
> > 
> > i have tried very hard to avoid serialising nic intr handling with
> > other parts of the system. it didnt make sense to take a lock that
> > will be uncontended for the vast majority of the time, and it does
> > makes me feel twitchy holding a mutex while calling into mbuf and
> > dmamap code. however, if it is uncontended then the overhead of the
> > extra locking isn't going to be noticable anyway, and i can't see
> > how we'd reverse those lock orders and deadlock.
> > 
> > unless we get MI xcalls, moving the timeout handling to relevant
> > cpu and masking the interrupt seems like too much effort.
> > 
> > another option here is we flag rx rings as being "under maintenance"
> > before scheduling the timeout, avoid calling rxeof again while the
> > flag is set, and clear it at the end of the timeout handler (but
> > maybe before we post a producer update to the chip).
> > 
> > a lock is probably simpler to understand and implement though.
> 
> A mutex per rx queue should almost never be contended as timeouts
> happen rareley.  Then it is just an atomic operation an a memory
> barrier.  That does not sound worse than the other alternatives.
> 
> > while we're on this topic, i recently discovered (the hard way)
> > that some chips have a minimum number of descriptors they read from
> > the ring at a time. that means you can have a small number of
> > descriptors filled on the ring, but the chip doesnt care until you
> > get over the threshold. it might make sense to schedule the timeout
> > if the rxr lwm hasnt been reached rather than when the ring is
> > empty.
> 
> I saw em uses if_rxr_needrefill() to add timeout.  But that does
> not work with the idea to either fire rx interrupt or timeout.
> 
> The bnxt paniced again very nicely.  No fixes applied yet.
> Note that it happens when I test with jumbo frames.
> 
> ddb{2}> show panic
> *cpu2: uvm_fault(0xffffffff8283e0c0, 0x18, 0, 2) -> e
> 
> ddb{2}> x/s version
> version:        OpenBSD 7.6-current (GENERIC.MP) #cvs : D2024.09.28.00.00.00: Sat Sep 28 05:42:06 CEST 2024\012    root@ot42.obsd-lab.genua.de:/usr/src/sys/arch/amd64/compile/GENERIC.MP\012
> 
> ddb{2}> trace
> bnxt_rx(ffff8000004d0000,ffff8000004d11c0,ffff8000004d1168,ffff80005bfd5720,ffff80005bfd5778,ffff80005bfd577c,190133b67da4f769) at bnxt_rx+0x235
> bnxt_intr(ffff8000004d1148) at bnxt_intr+0x2ad
> intr_handler(ffff80005bfd5820,ffff800000099f80) at intr_handler+0x91
> Xintr_ioapic_edge0_untramp() at Xintr_ioapic_edge0_untramp+0x18f
> ether_output(ffff8000004d6048,ffff8000004d6048,10,ffff80005bfd58d8) at ether_output+0x14
> if_output_tso(ffff8000004d6048,ffff80005bfd5a28,ffff80005bfd5c00,fffffd9a85faa270,2328) at if_output_tso+0xf6
> ip_output(fffffd80ab5a5800,0,ffff80005bfd5be8,1,0,0,6b545d87789ed8aa) at ip_output+0x88a
> ip_forward(fffffd80ab5a5800,ffff8000004d0048,ffff80005bfd5be8,1) at ip_forward+0x1e1
> ip_input_if(ffff80005bfd5cc8,ffff80005bfd5cd4,7f2,0,ffff8000004d0048) at ip_input_if+0x3fa
> ipv4_input(ffff8000004d0048,fffffd80ab5a5800) at ipv4_input+0x38
> ether_input(ffff8000004d0048,fffffd80ab5a5800) at ether_input+0x3df
> if_input_process(ffff8000004d0048,ffff80005bfd5db8) at if_input_process+0x78
> ifiq_process(ffff8000000c1500) at ifiq_process+0x90
> taskq_thread(ffff80000002f100) at taskq_thread+0x129
> end trace frame: 0x0, count: -14
> 
> ddb{2}> show register
> rdi               0xffffffff82771348    pci_bus_dma_tag
> rsi               0xffff800005db2100
> rbp               0xffff80005bfd5700
> rbx               0xffff8000004d0000
> rdx                                0
> rcx                            0x3b2
> rax                                0
> r8                               0x2
> r9                0xffff80005bfd577c
> r10                            0x37f
> r11               0x9dcafb87a4a45308
> r12               0xffff800002a0d000
> r13                             0xb0
> r14               0xfffffd800c7cd7f0
> r15               0xfffffd80a69c8f00
> rip               0xffffffff81c5fe15    bnxt_rx+0x235
> cs                               0x8
> rflags                       0x10246    __ALIGN_SIZE+0xf246
> rsp               0xffff80005bfd5670
> ss                              0x10
> bnxt_rx+0x235:  movl    %ecx,0x18(%rax)
> 
> Here bs->bs_m is NULL
> 
> Diff below protects all bs->bs_m in the rx path with a mutex.  Note
> that I missed the bnxt_rx() chunk in my previous diff.

I'd rather fix the driver so it doesn't need this.
The problem here seems to be that bnxt has two rings to refill, and
the timeout tries to refill them both, even if one of them is not empty.
Can you try this diff instead?


diff --git a/sys/dev/pci/if_bnxt.c b/sys/dev/pci/if_bnxt.c
index 52165d95f2c..5a40908cb1e 100644
--- a/sys/dev/pci/if_bnxt.c
+++ b/sys/dev/pci/if_bnxt.c
@@ -315,7 +315,7 @@ void		bnxt_write_rx_doorbell(struct bnxt_softc *, struct bnxt_ring *,
 void		bnxt_write_tx_doorbell(struct bnxt_softc *, struct bnxt_ring *,
 		    int);
 
-int		bnxt_rx_fill(struct bnxt_queue *);
+int		bnxt_rx_fill(struct bnxt_queue *, int, int);
 u_int		bnxt_rx_fill_slots(struct bnxt_softc *, struct bnxt_ring *, void *,
 		    struct bnxt_slot *, uint *, int, uint16_t, u_int);
 void		bnxt_refill(void *);
@@ -954,7 +954,7 @@ bnxt_queue_up(struct bnxt_softc *sc, struct bnxt_queue *bq)
 	rx->rx_cons = 0;
 	rx->rx_ag_prod = 0;
 	rx->rx_ag_cons = 0;
-	bnxt_rx_fill(bq);
+	bnxt_rx_fill(bq, 1, 1);
 
 	tx->tx_cons = 0;
 	tx->tx_prod = 0;
@@ -1658,7 +1658,7 @@ bnxt_intr(void *xq)
 			if_rxr_livelocked(&rx->rxr[1]);
 		}
 
-		bnxt_rx_fill(q);
+		bnxt_rx_fill(q, 1, 1);
 		if ((rx->rx_cons == rx->rx_prod) ||
 		    (rx->rx_ag_cons == rx->rx_ag_prod))
 			timeout_add(&rx->rx_refill, 0);
@@ -2216,33 +2216,37 @@ bnxt_rx_fill_slots(struct bnxt_softc *sc, struct bnxt_ring *ring, void *ring_mem
 }
 
 int
-bnxt_rx_fill(struct bnxt_queue *q)
+bnxt_rx_fill(struct bnxt_queue *q, int rxring, int agring)
 {
 	struct bnxt_rx_queue *rx = &q->q_rx;
 	struct bnxt_softc *sc = q->q_sc;
 	u_int slots;
 	int rv = 0;
 
-	slots = if_rxr_get(&rx->rxr[0], rx->rx_ring.ring_size);
-	if (slots > 0) {
-		slots = bnxt_rx_fill_slots(sc, &rx->rx_ring,
-		    BNXT_DMA_KVA(rx->rx_ring_mem), rx->rx_slots,
-		    &rx->rx_prod, MCLBYTES,
-		    RX_PROD_PKT_BD_TYPE_RX_PROD_PKT, slots);
-		if_rxr_put(&rx->rxr[0], slots);
-	} else
-		rv = 1;
-
-	slots = if_rxr_get(&rx->rxr[1],  rx->rx_ag_ring.ring_size);
-	if (slots > 0) {
-		slots = bnxt_rx_fill_slots(sc, &rx->rx_ag_ring,
-		    BNXT_DMA_KVA(rx->rx_ring_mem) + PAGE_SIZE,
-		    rx->rx_ag_slots, &rx->rx_ag_prod,
-		    BNXT_AG_BUFFER_SIZE,
-		    RX_PROD_AGG_BD_TYPE_RX_PROD_AGG, slots);
-		if_rxr_put(&rx->rxr[1], slots);
-	} else
-		rv = 1;
+	if (rxring != 0) {
+		slots = if_rxr_get(&rx->rxr[0], rx->rx_ring.ring_size);
+		if (slots > 0) {
+			slots = bnxt_rx_fill_slots(sc, &rx->rx_ring,
+			    BNXT_DMA_KVA(rx->rx_ring_mem), rx->rx_slots,
+			    &rx->rx_prod, MCLBYTES,
+			    RX_PROD_PKT_BD_TYPE_RX_PROD_PKT, slots);
+			if_rxr_put(&rx->rxr[0], slots);
+		} else
+			rv = 1;
+	}
+
+	if (agring != 0) {
+		slots = if_rxr_get(&rx->rxr[1],  rx->rx_ag_ring.ring_size);
+		if (slots > 0) {
+			slots = bnxt_rx_fill_slots(sc, &rx->rx_ag_ring,
+			    BNXT_DMA_KVA(rx->rx_ring_mem) + PAGE_SIZE,
+			    rx->rx_ag_slots, &rx->rx_ag_prod,
+			    BNXT_AG_BUFFER_SIZE,
+			    RX_PROD_AGG_BD_TYPE_RX_PROD_AGG, slots);
+			if_rxr_put(&rx->rxr[1], slots);
+		} else
+			rv = 1;
+	}
 
 	return (rv);
 }
@@ -2253,9 +2257,10 @@ bnxt_refill(void *xq)
 	struct bnxt_queue *q = xq;
 	struct bnxt_rx_queue *rx = &q->q_rx;
 
-	bnxt_rx_fill(q);
+	bnxt_rx_fill(q, rx->rx_cons == rx->rx_prod, rx->rx_ag_cons == rx->rx_ag_prod);
 
-	if (rx->rx_cons == rx->rx_prod)
+	if ((rx->rx_cons == rx->rx_prod) ||
+	    (rx->rx_ag_cons == rx->rx_ag_prod))
 		timeout_add(&rx->rx_refill, 1);
 }