From: Vitaliy Makkoveev <mvs@openbsd.org>
Subject: Re: fstat sysctl crashes kernel
To: Alexander Bluhm <bluhm@openbsd.org>
Cc: Vitaliy Makkoveev <otto@bsdbox.dev>, tech@openbsd.org
Date: Thu, 8 Aug 2024 17:15:11 +0300

On Thu, Aug 08, 2024 at 01:21:54PM +0200, Alexander Bluhm wrote:
> On Thu, Aug 08, 2024 at 02:03:19AM +0300, Vitaliy Makkoveev wrote:
> > > On 8 Aug 2024, at 01:49, Alexander Bluhm <bluhm@openbsd.org> wrote:
> > > 
> > > Hi,
> > > 
> > > There is this long lasting problem that fstat(1) in a loop during
> > > heavy network load can crash the kernel.  Bug is that while traversing
> > > allprocess in sysctl KERN_FILE_BYPID the list may change if fill_file()
> > > is sleeping.  Conclusion at previous hackaton was that the single
> > > sysctl is doing too much.
> > > 
> > > So I changed fstat to first call kvm_getprocs() getting all pids,
> > > and then kvm_getfiles(KERN_FILE_BYPID) for each pid.  This allows
> > > to abort the list traversal after a matching process has been found.
> > > So we don't have to deal with dead next pointer.  Of course fstat
> > > gets slower as it does multiple sysctl with linear process search,
> > > but slow is better than crash.
> > > 
> > > It is necessary to lock the current process, otherwise fd_getfile(fdp)
> > > may use a freed struct filedesc.  I am not happy with using
> > > &pr->ps_lock in exit1() as it adds additional sleeping points.  With
> > > the diff below fork+exec, tcpbench, fstat in loops is stable.  To
> > > avoid sleeps in allprocess loop I use RW_SLEEPFAIL.
> > > 
> > > Note that only KERN_FILE_BYPID with given pid is safe.  After we
> > > remove all other sysctl in userland, we can reject them in the
> > > kernel.
> > > 
> > > Any idea how we can lock the process in a nicer way?
> > > 
> > 
> > As I remember, when I tried to introduce allprocess rwlock(9), the
> > only problem was false (?) witness(4) report. I like to follow this
> > way. I don???t mean various allprocess list iterators, as it was exposed
> > this was bad idea.
> 
> We tried to fix this for several years.  Interlocking of process
> list and single process lock with all the other locks is complicated.
> Discussion in Prague concluded, that too much logic was moved into
> the kernel when reading /dev/kmem was converted to sysctl.  My diff
> moves complexity back to userland which is good.
> 
> Your approach has to solve two problems, keeping the list consistent
> and locking a single process.  Can you show your diff again, as
> consistent view of one process is a common problem.  Maybe we can
> combine our ideas.
> 
> Independent of that, can we commit the chunk below?  When searching
> for a specific process, there is not need to traverse the list to
> the end.  Also I check for arg >= 0 as this is cosistent with the
> arg < -1 check before.  Then you don't have to think about what is
> special about process 0.  There is no functional difference as
> process 0 has PS_SYSTEM set and is skipped anyway.
> 
> ok?
> 

ok mvs

> bluhm
> 
> Index: kern/kern_sysctl.c
> ===================================================================
> RCS file: /data/mirror/openbsd/cvs/src/sys/kern/kern_sysctl.c,v
> diff -u -p -r1.434 kern_sysctl.c
> --- kern/kern_sysctl.c	6 Aug 2024 12:36:54 -0000	1.434
> +++ kern/kern_sysctl.c	8 Aug 2024 09:47:45 -0000
> @@ -1679,7 +1679,7 @@ sysctl_file(int *name, u_int namelen, ch
>  			 */
>  			if (pr->ps_flags & (PS_SYSTEM | PS_EMBRYO | PS_EXITING))
>  				continue;
> -			if (arg > 0 && pr->ps_pid != (pid_t)arg) {
> +			if (arg >= 0 && pr->ps_pid != (pid_t)arg) {
>  				/* not the pid we are looking for */
>  				continue;
>  			}
> @@ -1699,6 +1699,9 @@ sysctl_file(int *name, u_int namelen, ch
>  				FILLIT(fp, fdp, i, NULL, pr);
>  				FRELE(fp, p);
>  			}
> +			/* pid is unique, stop searching */
> +			if (arg >= 0)
> +				break;
>  		}
>  		if (!matched)
>  			error = ESRCH;
>