From: Alexander Bluhm <bluhm@openbsd.org>
Subject: Re: fstat sysctl crashes kernel
To: Vitaliy Makkoveev <otto@bsdbox.dev>
Cc: tech@openbsd.org
Date: Thu, 8 Aug 2024 13:21:54 +0200

On Thu, Aug 08, 2024 at 02:03:19AM +0300, Vitaliy Makkoveev wrote:
> > On 8 Aug 2024, at 01:49, Alexander Bluhm <bluhm@openbsd.org> wrote:
> > 
> > Hi,
> > 
> > There is this long lasting problem that fstat(1) in a loop during
> > heavy network load can crash the kernel.  Bug is that while traversing
> > allprocess in sysctl KERN_FILE_BYPID the list may change if fill_file()
> > is sleeping.  Conclusion at previous hackaton was that the single
> > sysctl is doing too much.
> > 
> > So I changed fstat to first call kvm_getprocs() getting all pids,
> > and then kvm_getfiles(KERN_FILE_BYPID) for each pid.  This allows
> > to abort the list traversal after a matching process has been found.
> > So we don't have to deal with dead next pointer.  Of course fstat
> > gets slower as it does multiple sysctl with linear process search,
> > but slow is better than crash.
> > 
> > It is necessary to lock the current process, otherwise fd_getfile(fdp)
> > may use a freed struct filedesc.  I am not happy with using
> > &pr->ps_lock in exit1() as it adds additional sleeping points.  With
> > the diff below fork+exec, tcpbench, fstat in loops is stable.  To
> > avoid sleeps in allprocess loop I use RW_SLEEPFAIL.
> > 
> > Note that only KERN_FILE_BYPID with given pid is safe.  After we
> > remove all other sysctl in userland, we can reject them in the
> > kernel.
> > 
> > Any idea how we can lock the process in a nicer way?
> > 
> 
> As I remember, when I tried to introduce allprocess rwlock(9), the
> only problem was false (?) witness(4) report. I like to follow this
> way. I don???t mean various allprocess list iterators, as it was exposed
> this was bad idea.

We tried to fix this for several years.  Interlocking of process
list and single process lock with all the other locks is complicated.
Discussion in Prague concluded, that too much logic was moved into
the kernel when reading /dev/kmem was converted to sysctl.  My diff
moves complexity back to userland which is good.

Your approach has to solve two problems, keeping the list consistent
and locking a single process.  Can you show your diff again, as
consistent view of one process is a common problem.  Maybe we can
combine our ideas.

Independent of that, can we commit the chunk below?  When searching
for a specific process, there is not need to traverse the list to
the end.  Also I check for arg >= 0 as this is cosistent with the
arg < -1 check before.  Then you don't have to think about what is
special about process 0.  There is no functional difference as
process 0 has PS_SYSTEM set and is skipped anyway.

ok?

bluhm

Index: kern/kern_sysctl.c
===================================================================
RCS file: /data/mirror/openbsd/cvs/src/sys/kern/kern_sysctl.c,v
diff -u -p -r1.434 kern_sysctl.c
--- kern/kern_sysctl.c	6 Aug 2024 12:36:54 -0000	1.434
+++ kern/kern_sysctl.c	8 Aug 2024 09:47:45 -0000
@@ -1679,7 +1679,7 @@ sysctl_file(int *name, u_int namelen, ch
 			 */
 			if (pr->ps_flags & (PS_SYSTEM | PS_EMBRYO | PS_EXITING))
 				continue;
-			if (arg > 0 && pr->ps_pid != (pid_t)arg) {
+			if (arg >= 0 && pr->ps_pid != (pid_t)arg) {
 				/* not the pid we are looking for */
 				continue;
 			}
@@ -1699,6 +1699,9 @@ sysctl_file(int *name, u_int namelen, ch
 				FILLIT(fp, fdp, i, NULL, pr);
 				FRELE(fp, p);
 			}
+			/* pid is unique, stop searching */
+			if (arg >= 0)
+				break;
 		}
 		if (!matched)
 			error = ESRCH;