interface multiqueue timeout race
On 2024-09-30 2:38 p.m., Alexander Bluhm wrote:
> On Sun, Sep 29, 2024 at 09:57:38PM +1000, David Gwynne wrote:
>> On Sat, Sep 28, 2024 at 11:07:48AM +0200, Alexander Bluhm wrote:
>>> On Sat, Sep 28, 2024 at 05:08:05PM +1000, David Gwynne wrote:
>>>>
>>>>> On 27 Sep 2024, at 23:11, Alexander Bluhm <bluhm@openbsd.org> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> From time to time I see strange hangs or crashes in my network test
>>>>> lab. Like this one
>>>>>
>>>>> ddb{0}> trace
>>>>> db_enter() at db_enter+0x19
>>>>> intr_handler(ffff80005bfab210,ffff80000009e780) at intr_handler+0x91
>>>>> Xintr_ioapic_edge16_untramp() at Xintr_ioapic_edge16_untramp+0x18f
>>>>> Xspllower() at Xspllower+0x1d
>>>>> pool_multi_alloc(ffffffff827a2a48,2,ffff80005bfab504) at pool_multi_alloc+0xcb
>>>>> m_pool_alloc(ffffffff827a2a48,2,ffff80005bfab504) at m_pool_alloc+0x4b
>>>>> pool_p_alloc(ffffffff827a2a48,2,ffff80005bfab504) at pool_p_alloc+0x68
>>>>> pool_do_get(ffffffff827a2a48,2,ffff80005bfab504) at pool_do_get+0xe5
>>>>> pool_get(ffffffff827a2a48,2) at pool_get+0xad
>>>>> m_clget(0,2,802) at m_clget+0x1cf
>>>>> igc_get_buf(ffff8000004ff8e8,109) at igc_get_buf+0xb8
>>>>> igc_rxfill(ffff8000004ff8e8) at igc_rxfill+0xad
>>>>> igc_rxrefill(ffff8000004ff8e8) at igc_rxrefill+0x27
>>>>> softclock_process_tick_timeout(ffff8000004ff970,1) at softclock_process_tick_timeout+0x103
>>>>> softclock(0) at softclock+0x11e
>>>>> softintr_dispatch(0) at softintr_dispatch+0xe6
>>>>> Xsoftclock() at Xsoftclock+0x27
>>>>> acpicpu_idle() at acpicpu_idle+0x131
>>>>> sched_idle(ffffffff8277aff0) at sched_idle+0x298
>>>>> end trace frame: 0x0, count: -19
>>>>>
>>>>> igc_rxrefill() may be called from both the timeout and the receive
>>>>> interrupt. As interrupts are per-CPU and the timeout can run on any
>>>>> CPU, there should be some lock. An easy fix is to put a mutex around
>>>>> igc_rxrefill().
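[The locking pattern proposed above can be sketched as a minimal user-space analogue. The names `rx_mtx`, `rxrefill_locked()`, and `refill_from_two_cpus()` are hypothetical stand-ins, not the actual diff; pthreads play the role of the kernel mutex, and the two threads model the timeout and interrupt paths running on different CPUs.]

```c
#include <pthread.h>

/* hypothetical stand-in for per-interface rx state; rx_mtx plays the
 * role of the receive mutex the diff adds around igc_rxrefill() */
static pthread_mutex_t rx_mtx = PTHREAD_MUTEX_INITIALIZER;
static long refill_calls;

static void rxrefill_locked(void) {
	pthread_mutex_lock(&rx_mtx);
	refill_calls++;		/* stands in for ring manipulation */
	pthread_mutex_unlock(&rx_mtx);
}

static void *caller(void *arg) {
	(void)arg;
	for (int i = 0; i < 10000; i++)
		rxrefill_locked();
	return NULL;
}

/* simulate the timeout and the rx interrupt refilling concurrently */
long refill_from_two_cpus(void) {
	pthread_t t1, t2;

	pthread_create(&t1, NULL, caller, NULL);
	pthread_create(&t2, NULL, caller, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	return refill_calls;
}
```

With the mutex, the two callers serialize and the shared state stays consistent; without it, the unsynchronized ring updates are exactly the race the backtrace above shows.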
>>>>>
>>>>> Note that I fixed a similar problem for em(4) a while ago. There
>>>>> splnet() was enough, as that driver is not multi-threaded.
>>>>>
>>>>> I have added a receive mutex for bnxt, igc, ix, and ixl, as I have
>>>>> these interfaces in my lab.
>>>> fwiw, the timeout and rx interrupt processing are supposed to be exclusive because you can't rx packets when the ring is empty, and the timeout is only supposed to be scheduled when the ring is empty.
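[The invariant described above can be sketched in a few lines of C. This is a hypothetical analogue, not driver code: `struct ring`, `rxrefill()`, and the `timeout_armed` flag stand in for the real rx ring, igc_rxrefill(), and the refill timeout.]

```c
#define RING_SIZE 8

struct ring {
	int inuse;		/* rx buffers currently posted */
	int timeout_armed;	/* refill timeout scheduled? */
};

/* try to post buffers; (re)arm the refill timeout only if the ring
 * is still empty, so the timeout and rx interrupt stay exclusive:
 * an empty ring cannot receive packets, hence no rx interrupt runs */
void rxrefill(struct ring *r, int bufs_available) {
	while (r->inuse < RING_SIZE && bufs_available > 0) {
		r->inuse++;	/* post one rx buffer */
		bufs_available--;
	}
	r->timeout_armed = (r->inuse == 0);
}
```

When buffer allocation fails completely the timeout stays armed and retries later; once any buffer is posted, further refills come from the interrupt path instead.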
>>> Thanks for the explanation. Now I understand how this is supposed
>>> to work. I am not sure that my diff fixes it, as the problem is
>>> very rare.
>> well, it's a subtle semantic that's not very well expressed anywhere;
>> code needs to be careful about implementing it, and that becomes less
>> true over time.
>>
>> if rxfill isn't careful about how it updates state, in particular the
>> producer index relative to the descriptors, it's believable that an
>> interrupt running the rxeof code on another cpu will get confused.
>>
>>> I see sporadic hangs in igc. And I saw a trap in bnxt_rx.
>> by hang do you mean panic? or weird traffic stalls?
> The igc problems are not so clear. It is something between permanent
> and temporary traffic stalls. Let's fix bnxt first, then I can
> collect more data about igc. They are on the same machine, with tests
> driven by cron.
FYI..
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=279245