Download raw body.
fstat(1) caused a weird hang on 7.7
On Wed, May 28, 2025 at 04:55:35PM +0900, Alexander Bluhm wrote:
> On Tue, May 27, 2025 at 01:59:25PM +0000, H. Hartzer wrote:
> > Please let me know if this isn't the best list for this, and I'll send
> > similiar reports elsewhere in the future.
>
> bugs@openbsd.org would be slightly better. Does not matter much,
> the same people read both bugs@ and tech@.
>
> > I had the most bizarre issue. I ran `fstat | grep 8080` and fstat hang.
> > This caused a bunch of other things to not work, but the system was
> > partly usable.
>
> This is a deadlock in the sysctl(2) system call that is used by
> fstat(1). We tried to improve this while the network stack was
> unlocked, maybe we missed something. At a first glance, I don't
> see relevant fixes that were commited afer 7.7 release. You might
> consider using openbsd current snapshot to see it the bug is still
> present.
>
> > What did not:
> > ps
>
> That is bad, as `ps axl` would show the wait queue where it hangs.
> We need the wait queue to debug it further.
>
> > I have no idea where to begin on this. I couldn't reproduce it after
> > rebooting.
>
> Try to stress test to reproduce. Run fstat in a loop. There must
> be some other activity to trigger it.
>
> As ps axl does not work, we need the kernel debugger output. For
> that you have to activate ddb. Write ddb.console=1 into /etc/sysctl.conf
> and reboot. With serial console break brings you to ddb, glass
> console has the hot key Ctrl-Alt-Esc. To get out of ddb, type cont
> at ddb> prompt.
>
> When it happens, drop to ddb. Then type ps at the ddb> prompt. On
> serial copy all the output, with monior take a picture. It scrols
> down page by page, we need all pages. Send the pictures to this
> list, reducing resolution first. We want to read the text, but not
> fill our mailbox with your gigapixel camara images.
>
> Maybe stack traces of the hanging processes would be iteresting,
> but that is too hard to explain in advance.
>
> As the machine is unusable anyway, type boot reboot at ddb> instead
> of continue.
>
> > Never had any issues with fstat before.
>
> It was much more broken before, it could crash the kernel.
>
> > Maybe I need more open file descriptors?
>
> No.
>
> > The last thing I see on fstat from the lists is that it was unlocked:
> > https://marc.info/?l=openbsd-tech&m=173607167913668&w=2
>
> This is about fstat(2) system call. Except the name it has nothing
> in common with fstat(1) userland tool.
>
> bluhm
>
I suspect it stuck in malloc(M_WAITOK) allocation of `kf'. This sleeping
malloc() happens with `sysctl_lock' held, so the whole sysctl layer
became unusable. Fortunately, we expect context switch in `allprocess'
foreach loops, so we could keep kernel lock only around them and avoid
sleeping memory allocation under `sysctl_lock'.
Actually, we don't need to wire memory in the most sysctl() cases, so we
could avoid `sysctl_lock' and rely on kernel lock only.
Note, we could have memory leak somewhere, so I'm curious the memory
usage stats in this machine.
Index: sys/kern/kern_sysctl.c
===================================================================
RCS file: /cvs/src/sys/kern/kern_sysctl.c,v
diff -u -p -r1.468 kern_sysctl.c
--- sys/kern/kern_sysctl.c 9 May 2025 14:53:22 -0000 1.468
+++ sys/kern/kern_sysctl.c 29 May 2025 10:25:00 -0000
@@ -401,6 +401,8 @@ kern_sysctl_dirs(int top_name, int *name
int error;
switch (top_name) {
+ case KERN_FILE:
+ return (sysctl_file(name, namelen, oldp, oldlenp, p));
case KERN_MALLOCSTATS:
return (sysctl_malloc(name, namelen, oldp, oldlenp,
newp, newlen, p));
@@ -447,8 +449,6 @@ kern_sysctl_dirs_locked(int top_name, in
newp, newlen, oldp, oldlenp, p));
case KERN_PROC_VMMAP:
return (sysctl_proc_vmmap(name, namelen, oldp, oldlenp, p));
- case KERN_FILE:
- return (sysctl_file(name, namelen, oldp, oldlenp, p));
#endif
#if defined(GPROF) || defined(DDBPROF)
case KERN_PROF:
@@ -1776,8 +1776,11 @@ do { \
if (af == AF_INET || af == AF_INET6)
skip = 1;
}
- if (!skip)
+ if (!skip) {
+ KERNEL_LOCK();
FILLIT(fp, NULL, 0, NULL, NULL);
+ KERNEL_UNLOCK();
+ }
}
}
break;
@@ -1788,6 +1791,7 @@ do { \
break;
}
matched = 0;
+ KERNEL_LOCK();
LIST_FOREACH(pr, &allprocess, ps_list) {
/*
* skip system, exiting, embryonic and undead
@@ -1825,10 +1829,12 @@ do { \
if (arg >= 0)
break;
}
+ KERNEL_UNLOCK();
if (!matched)
error = ESRCH;
break;
case KERN_FILE_BYUID:
+ KERNEL_LOCK();
LIST_FOREACH(pr, &allprocess, ps_list) {
/*
* skip system, exiting, embryonic and undead
@@ -1859,6 +1865,7 @@ do { \
refcnt_rele_wake(&pr->ps_refcnt);
}
+ KERNEL_UNLOCK();
break;
default:
error = EINVAL;
fstat(1) caused a weird hang on 7.7