From: Mark Kettenis
Subject: Re: patch: relax ni_pledge panic
To: Claudio Jeker
Cc: deraadt@openbsd.org, semarie@kapouay.eu.org, tech@openbsd.org
Date: Tue, 18 Feb 2025 15:16:37 +0100

> Date: Tue, 18 Feb 2025 14:00:39 +0100
> From: Claudio Jeker
> 
> On Mon, Feb 17, 2025 at 03:40:28PM +0100, Martin Pieuchot wrote:
> > On 10/02/25(Mon) 13:02, Claudio Jeker wrote:
> > > On Mon, Feb 10, 2025 at 12:32:52PM +0100, Martin Pieuchot wrote:
> > > > On 07/02/25(Fri) 17:47, Claudio Jeker wrote:
> > > > > On Thu, Feb 06, 2025 at 06:46:52PM +0100, Martin Pieuchot wrote:
> > > > > > On 06/02/25(Thu) 09:43, Theo de Raadt wrote:
> > > > > > > Mark Kettenis wrote:
> > > > > > > > 
> > > > > > > > > From: "Theo de Raadt"
> > > > > > > > > Date: Thu, 06 Feb 2025 09:17:52 -0700
> > > > > > > > > 
> > > > > > > > > > [2] in another thread, pledge("stdio rpath wpath"), and returns.
> > > > > > > > > > the process is now pledged.
> > > > > > > > > 
> > > > > > > > > How can this be allowed?
> > > > > > > > > 
> > > > > > > > > I am pretty sure sys_pledge should single-thread the process, which
> > > > > > > > > means it will wait until other threads complete their in-kernel sleeps.
> > > > > > > > 
> > > > > > > > I'm not sure claudio@ will agree with you ;)
> > > > > > > 
> > > > > > > He just agreed with me privately.
> > > > > > 
> > > > > > I'd rather see a rwlock be used to serialize access to the per-process
> > > > > > data structures. I don't see any reason to use the single thread API
> > > > > > for this and I'd rather not spread its usage. It is already a pain to
> > > > > > work with.
> > > > > 
> > > > > We need to fix that and this is what I'm slowly working at. The single
> > > > > thread API should be simple and allow a few syscalls to know that no other
> > > > > thread will disturb them. exec and exit are such cases but I think unveil
> > > > > and pledge fall into a similar category. They alter the behaviour of the
> > > > > process and should therefore change the state in a safe fashion.
> > > > 
> > > > I disagree with this approach.
> > > 
> > > I disagree with you on this one. The single thread API is one of the tools
> > > we have to make MP/MT code safe. In this case this tool fits well and solves
> > > an immediate problem. Hammering an rwlock around all of unveil is not
> > > practicable on the other hand.
> > 
> > The single thread API is IMHO not yet a tool. It is an extremely fragile
> > subsystem which relies on a complex mutex & reference counting of
> > threads plus internal state inspection that relies on the SCHED_LOCK() and
> > that we do not want to spread.
> 
> I very much disagree. The API is a very important tool. We need it for
> exec and exit so it is heavily used. It is not fragile, at least not
> anymore, and the fragile bits were exclusive to SINGLE_SUSPEND which is
> not used there for a good reason.
> 
> I spent a lot of time on fixing this.

That is very much appreciated. But I agree with mpi@ that this isn't a
generic MP API like mutex(9), rwlock(9), etc. It was designed to be
used in the exec and exit cases, where we have to stop all threads.
And it was extended to the ptrace(2) case where a similar requirement
exists. IMHO, in the case of unveil(2) it's just a bit of plaster we
stuck on out of fear of the vnode layer. But at some point we'll have
to rip off that plaster.

> > From a performance point of view it is by far the worst.
> 
> Depends on your viewpoint. It is expensive to take a multithreaded process
> to single thread execution but only if the threads are all running.
> For a single threaded process the API is free.
> 
> Unveil(2) is a system call that is best used early in a program when
> threads are not yet spawned or are sleeping.
> Proper use of unveil(2) requires locking down the process with
> unveil(NULL, NULL); at that point the unveil data becomes immutable.
> 
> So in the unveil(2) case a lot speaks for it.
> 
> > It also interacts with signals and states that are (ab)used by
> > ptrace. Plus it is a subsystem that is under heavy development
> > with multiple bugs and locking dependencies that we haven't solved
> > yet. On top of that it is very complicated and only barely understood
> > by a very few over-busy kernel developers.
> 
> Again this is no longer true. I spent the last months making this work
> in all cases. Also the API will become a lot simpler since SINGLE_SUSPEND
> will die soon.
> 
> > Finally it has no integration in WITNESS nor a way to assert that data
> > accesses are properly "protected".
> 
> Yes, there is no WITNESS for this.
> 
> > > Hammering an rwlock around all of unveil is not
> > > practicable on the other hand.
> > 
> > This is completely false. unveil is a relatively small subsystem
> > composed of 3 fields shared among sibling threads in a process. It is
> > not only practicable, it is doable starting with the write path
> > unveil_add().
> > 
> > Plus the subsystem already uses rwlocks...
> > 
> > I also believe this is a great place to start building knowledge about
> > the limits of the VFS in terms of data owners.
> > 
> > I also believe the complexity of unveil is small enough that a kernel
> > developer not experienced in locking can tackle it with some help.
> 
> I was maybe a bit unclear. I said once that using the single_thread
> API here is a viable solution until something better is developed.

I'm glad we agree ;)

> Right now I think preventing the race via the single_thread API is a
> viable solution until somebody does the work to clean up unveil and
> properly unlock it. It is easy to say the data structures are easy
> to lock, but we did not look at the call sites and consider whether an
> extra sleep point there would result in other subsystem failures.

As mpi@ noted, the subsystem already uses rwlocks. And there are
additional sleeping points as well. So I'm not worried about extra
sleep points. In fact, I'm wondering whether the current locking was
done at too granular a level. If we can replace that with a single
per-process rwlock we'd be better off, I think.

> I think if you want to fix unveil you also need to look at namei and
> understand vnodes.

Some understanding is necessary. But we don't need to go full on
"mpsafe" and run things without holding the kernel lock.

> I'm not ready for that. I just have no time to start yet another huge
> project like this but I know I will be pulled into this. So maybe I'm
> selfish here, but right now I think the pragmatic solution is good
> enough.
> 
> > > Why can't we use all tools in the box? Why can we only use rwlocks and
> > > mutexes for our solutions?
> > 
> > Theo can do what he wants, it's his project. I'm just sharing my
> > experience of 12 years working on parallelizing OpenBSD and what I
> > believe is better for the project, its developers and users.
> > 
> > > > > Since both pledge and unveil are almost never called it is better to
> > > > > optimise for the readers (which happen on all or many syscalls).
> > > > 
> > > > Indeed, a read/write lock is fine.
> > > 
> > > No it is not. rwlocks are not a golden ticket that makes all
> > > pain go away.
> > > They are massive bottlenecks even now that the most obvious
> > > problem has been fixed.
> > 
> > They are no golden tickets, they are rwlocks. We know how to use them
> > and clearly document which fields are protected by them. This allows us
> > to work as a team re-using the knowledge of a previous developer and
> > push the line of the lock a bit further. They are well integrated in
> > WITNESS as well as in the %spin analysis tools. We know how to split
> > contended locks or use alternate locking primitives when necessary.
> 
> rwlocks actually evade a lot of the analysis tools and contention is
> hard to detect. Until recently rwlocks were horribly inefficient. The fact
> that you can sleep with rwlocks held is nice but also a big problem, and
> the way the lock adds extra sleeping points can introduce subtle bugs for
> people that are not as well versed as you in their use.

Yes, it is the sleeping while holding an rwlock that I'm afraid of and
why I didn't follow up my analysis with a diff yet.

BTW, the only genuine problem I saw with unveil(2) is a race with
unveil_destroy(). I think that means the single_thread_set() call in
sys_unveil() isn't necessary.

> Writing thread safe code is hard and it is extra hard if performance
> matters.

But in this case there should be no significant lock contention.
Unless you're going to hammer the kernel with pointless unveil(2) calls
at least. And at that point you deserve what you get.
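
For reference, the usage pattern Claudio describes above looks roughly
like this from userland; the paths and promises below are made up for
illustration only:

#include <err.h>
#include <unistd.h>

int
main(void)
{
	/* expose only the paths we need, before any threads are spawned */
	if (unveil("/etc/example.conf", "r") == -1)	/* illustrative path */
		err(1, "unveil");
	if (unveil("/var/db/example", "rwc") == -1)	/* illustrative path */
		err(1, "unveil");

	/* lock it down: from here on the unveil list is immutable */
	if (unveil(NULL, NULL) == -1)
		err(1, "unveil");

	/* and drop the syscalls we no longer need */
	if (pledge("stdio rpath wpath cpath", NULL) == -1)
		err(1, "pledge");

	/* only now create threads and do the real work */
	return 0;
}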
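
And a very rough sketch of the single per-process rwlock idea using the
rwlock(9) API; the structure and function names below (uv_state,
uv_covered, uv_insert, ...) are hypothetical and only illustrate the
read-mostly pattern, they are not the actual kernel interfaces:

#include <sys/rwlock.h>

struct unveil;				/* the real unveil entries live elsewhere */

/* hypothetical container for the per-process unveil state */
struct uv_state {
	struct rwlock	 uv_lock;	/* protects the fields below */
	struct unveil	*uv_paths;	/* shared among sibling threads */
	int		 uv_count;
};

/* lookup path: many readers, cheap when there is no contention */
int
uv_covered(struct uv_state *uv, const char *path)
{
	int ok;

	rw_enter_read(&uv->uv_lock);
	ok = 0;			/* real code would match path against uv->uv_paths */
	(void)path;
	rw_exit_read(&uv->uv_lock);
	return (ok);
}

/* unveil(2)/pledge(2): the rare writer, serialized against siblings */
void
uv_insert(struct uv_state *uv, struct unveil *entry)
{
	rw_enter_write(&uv->uv_lock);
	/* link entry into uv->uv_paths and bump uv->uv_count */
	rw_exit_write(&uv->uv_lock);
}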