From: Mike Larkin <mlarkin@nested.page>
Subject: Re: Sysupgrade on i386 vmm to current snapshot hangs on boot into upgrade kernel
To: tech@openbsd.org
Date: Tue, 31 Mar 2026 09:36:13 -0700

On Tue, Mar 31, 2026 at 09:30:30AM -0700, Mike Larkin wrote:
> On Tue, Mar 31, 2026 at 06:01:19PM +0200, hshoexer wrote:
> > Hi,
> >
> > I've done some debugging on this:  I've hacked vmd/vmm to inject a
> > NMI into the guest.  On OpenBSD we drop into ddb when receiving a
> > NMI.  I've built a i386 bsd.rd with ddb enabled.
> >
> > For me i386 bsd.rd hangs with a chance of more then 50% when booting
> > through bios.  In that case the vcpu thread goes up to 100% CPU
> > usage.  Breaking into ddb by NMI injection I see traces like this:
> >
> > ...
> > scsibus2 at softraid0: 256 targets
> > root on rd0a swap on rd0b dump on rd0b
> > WARNING: CHECK AND RESET THE DATE!
> > NMI ... going to debugger
> > Stopped at      schedclock+0x3e:        leal    0(%eax,%ecx,2),%ecx
> > ddb> bt
> > schedclock(d72574f0) at schedclock+0x3e
> > statclock(d0acd498,e4abdd98,0) at statclock+0x213
> > clockintr_dispatch(e4abdd98) at clockintr_dispatch+0x1d2
> > clockintr(e4abdd98) at clockintr+0x3c
> > intr_handler(e4abdd98,d727bd40) at intr_handler+0x13
> > Xintr_legacy0_untramp() at Xintr_legacy0_untramp+0xfb
> > bzero() at bzero+0x23
> > pmap_zero_page(d5247088) at pmap_zero_page+0xf
> > uvm_pagezero_thread(d72574f0) at uvm_pagezero_thread+0x92
> > ddb>
> > schedclock(d72574f0) at schedclock+0x3e
> > ddb> c
> > NMI ... going to debugger
> > Stopped at      statclock+0x1f4:        cmpl    %esi,%ebx
> > ddb> bt
> > statclock(d0acd498,e4abdd98,0) at statclock+0x1f4
> > clockintr_dispatch(e4abdd98) at clockintr_dispatch+0x1d2
> > clockintr(e4abdd98) at clockintr+0x3c
> > intr_handler(e4abdd98,d727bd40) at intr_handler+0x13
> > Xintr_legacy0_untramp() at Xintr_legacy0_untramp+0xfb
> > bzero() at bzero+0x23
> > pmap_zero_page(d5247088) at pmap_zero_page+0xf
> > uvm_pagezero_thread(d72574f0) at uvm_pagezero_thread+0x92
> > ddb> c
> > NMI ... going to debugger
> > Stopped at      statclock+0x21b:        addl    $0x1,%eax
> > ddb> bt
> > statclock(d0acd498,e4abdd98,0) at statclock+0x21b
> > clockintr_dispatch(e4abdd98) at clockintr_dispatch+0x1d2
> > clockintr(e4abdd98) at clockintr+0x3c
> > intr_handler(e4abdd98,d727bd40) at intr_handler+0x13
> > Xintr_legacy0_untramp() at Xintr_legacy0_untramp+0xfb
> > bzero() at bzero+0x23
> > pmap_zero_page(d5247088) at pmap_zero_page+0xf
> > uvm_pagezero_thread(d72574f0) at uvm_pagezero_thread+0x92
> > ddb> c
> > NMI ... going to debugger
> > Stopped at      statclock+0x1fa:        jnb     statclock+0x225
> > ddb> bt
> > statclock(d0acd498,e4abdd98,0) at statclock+0x1fa
> > clockintr_dispatch(e4abdd98) at clockintr_dispatch+0x1d2
> > clockintr(e4abdd98) at clockintr+0x3c
> > intr_handler(e4abdd98,d727bd40) at intr_handler+0x13
> > Xintr_legacy0_untramp() at Xintr_legacy0_untramp+0xfb
> > bzero() at bzero+0x23
> > pmap_zero_page(d5247088) at pmap_zero_page+0xf
> > uvm_pagezero_thread(d72574f0) at uvm_pagezero_thread+0x92
> > ddb> c boot reboot
> > rebooting...
> >
> > So bsd.rd is busy handling clock interrupts.  However, I've verified
> > that vmd has only injecting a single clock interrupt when the hang
> > is kicked off.
> >
> > With a "if count > 100 printf" in clockintr_hardclock() I see this:
> >
> > clockintr_hardclock: count 359957
> > clockintr_hardclock: count 32604107809
> >
> > So it's not flodding the guest with interrupts.  It seems to be
> > nsec_advance() (used to calculate count) coming up with an high
> > value for "elapsed".  Causing not an infinite but very long loop,
> > repeatedly (first 359957, then 32604107809 iterations).
> >
> > On i386 bsd.rd have kern.timecounter.hardware=i8254.  On i386 GENERIC
> > we have kern.timecounter.hardware=pvclock.
> >
> > bluhm@ came up with this experiment:  On i386 GENERIC switching
> > kern.timecounter.hardware from pvclock to i8254 is likely to cause
> > an instant hang.  In that case ddb shows a trace similar to the
> > hang on i386 bsd.rd:
> >
> > <root@i386:1># sysctl kern.timecounter.hardware
> > kern.timecounter.hardware=pvclock0
> > <root@i386:2># sysctl kern.timecounter.hardware=i8254
> > kern.timecounter.hardware: pvclock0 ->clockintr_hardclock: cound 359956
> > clockintr_hardclock: cound 9992852590
> > NMI ... going to debugger
> > Stopped at      i8254_get_timecount+0xd:        movzbl  %al,%edx
> > ddb>
> > ddb> bt
> > i8254_get_timecount(d0f57528) at i8254_get_timecount+0xd
> > nanouptime(f6608668) at nanouptime+0x23
> > timeout_hardclock_update() at timeout_hardclock_update+0x1f9
> > clockintr_hardclock(d0f5349c,f6608714,0) at clockintr_hardclock+0xea
> > clockintr_dispatch(f6608714) at clockintr_dispatch+0x1ed
> > clockintr(f6608714) at clockintr+0x3f
> > intr_handler(f6608714,d78e5dc0) at intr_handler+0x18
> > Xintr_legacy0_untramp() at Xintr_legacy0_untramp+0xfb
> > cpu_idle_cycle() at cpu_idle_cycle+0xf
> > ddb> c
> > NMI ... going to debugger
> > Stopped at      i8254_get_timecount+0xb:        inb     $0x40,%al
> > ddb> bt
> > i8254_get_timecount(d0f57528) at i8254_get_timecount+0xb
> > tc_windup(0,0,0) at tc_windup+0x74
> > tc_ticktock() at tc_ticktock+0x3c
> > hardclock(f6608714) at hardclock+0x8
> > clockintr_hardclock(d0f5349c,f6608714,0) at clockintr_hardclock+0xea
> > clockintr_dispatch(f6608714) at clockintr_dispatch+0x1ed
> > clockintr(f6608714) at clockintr+0x3f
> > intr_handler(f6608714,d78e5dc0) at intr_handler+0x18
> > Xintr_legacy0_untramp() at Xintr_legacy0_untramp+0xfb
> > cpu_idle_cycle() at cpu_idle_cycle+0xf
> > ddb> c
> > NMI ... going to debugger
> > Stopped at      i8254_get_timecount+0xb:        inb     $0x40,%al
> > ddb> reboot
> > rebooting...
> >
> > And the printf indicates we are looping due to high counter values.
> >
> > Doing similar experiments with qemu and kvm we don't see hangs. So
> > this might be related to vmd i8254 emulation?
> >
> > Take care,
> > Hans-Joerg
> >
>
> maybe
>
> the problem with i8253 has always been its periodicity; I think the counter
> wraps in like 35ms or something as it's only 16 bit and has a fixed frequency.
>
> we can handle one wrap (eg, "count went negative") but not two.
>
> last time I looked at this, there was a timeout problem here due to HZ=100 on
> the host causing up to 20ms delay between timeouts firing (this was years ago
> though, and I know there was some work done since then).
>
> while the problem may not be 100% related to HZ, various things in the scheduler
> have changed over the years (and changes elsewhere that lead to different
> run profiles now) that maybe we are victims of excessive latency on the i8253
> latching and missing two wraps.
>
> > For me i386 bsd.rd hangs with a chance of more then 50% when booting
> > through bios.
>
> here is an experiment you can do:
>
> rebuild the host kernel with HZ=1000 and see if running that makes the problem
> go away. if that's the case then the only real option to "fix" this is to move
> i8253 into vmm(4) so we don't have to make multiple round trips to vmd(8) ; one

* another option would be to finish the tickless work cheloha started but I think
that's a long ways away from being complete.

> set of round trips to write the counter selector, then another to latch the
> counter value (and then a couple more to read the values but since it's latched
> at that point, those last ones aren't sensitive to the latency).
>
> N.B. - rebuilding with new HZ I think means a make clean then full build.
>
> -ml
>