Index | Thread | Search

From:
Mark Kettenis <mark.kettenis@xs4all.nl>
Subject:
Re: ip sysctl atomic
To:
Alexander Bluhm <bluhm@openbsd.org>
Cc:
tech@openbsd.org, claudio@openbsd.org
Date:
Sat, 04 May 2024 22:22:54 +0200

Download raw body.

Thread
> Date: Thu, 2 May 2024 13:21:31 +0200
> From: Alexander Bluhm <bluhm@openbsd.org>
> 
> On Wed, May 01, 2024 at 01:16:37PM +0200, Mark Kettenis wrote:
> > Well, you did exactly what I tried to discourage you from doing.  I
> > don't think this is "mpsafe" at all.  Let me explain...
> 
> I think we have a different understanding how lockless access should
> work.
> 
> > > +	/* copyin() may sleep, call it first */
> > 
> > This comment will become rather misleading once we drop the vslock()
> > for certain sysctls like mvs@ is suggesting; copyout(9) may sleep as
> > well at that point.
> 
> I doubt that sleeping copyout(9) in sysctl(2) is a step in the right
> direction.  Getting consistent information will be harder, you will
> fall into various NFS traps, and you cannot hold mutex when copying
> kernel data to userland.

Well, the vslock code has its problems as well.  And it is a very
heavy-handed mechanism when you're only copying a single int into
userland.

> > What does "mpsafe" mean here?  The READ_ONCE() and WRITE_ONCE() here
> > don't really achieve anything and from a user standpoint this could
> > still yield inconsistent values, where two threads changing the sysctl
> > value concurrently could return any "old" value.  And by that I mean
> > that if the value of the sysctl is initailly A, thread 1 changes it to
> > B and thread 2 changes it to C, the final value may end up as C, but
> > thread 2 might return either A or B as the old value.  A truly
> > "mpsafe" version should probably use atomic_swap_uint(9).  Otherwise
> > you might just as wel just remove the locking around the existing
> > sysctl_bounded_arr() call.
> 
> "mpsafe" with READ_ONCE() and WRITE_ONCE() means the compiler does
> exactly one memory access.  When accessing one integer this guarantee
> may be enough.  If there are more than one variable or multiple
> memory locations involved, the surrounding code needs barriers or
> locks.
> 
> Basically the diff checks that the IP code has no such variables
> depending on each other.  Otherwise it reads them once and passes
> the value along in register or stack.

But there is no need to use READ_ONCE() here.  You're simply copying
out the value to userland.  Even if the compiler for some weird reason
decides to do multiple memory accesses, the result of only one of
those will be what gets copied out to userland.  I' simply saying that
you're over-using READ_ONCE() and WRITE_ONCE() here.  This is not what
they're inteded for as far as I understand.

> Multiple sysctl reads in thead A B C are serialized by the rwlock
> sysctl_lock.  In my previous diff a added a comment there.  Actually
> I was concerned about it, tried fix to it with a mutex, until I
> discovered it cannon happen due to sysctl_lock.

Well, there have been efforts to remove/reduce sysctl_lock.  And
please look at the comment for sysctl_lock at the top of the file.
That lock isn't about serializing sysctl reads.  It is all about
reducing the damage that vslock can do to the system.

> Just removing the net lock around sysctl_bounded_arr() call feels
> wrong.  The bare minimum for integer access without lock is the
> READ_ONCE() macro.  I use it to mark lockless reads where another
> thread may modify the value.

And that's where we disagree.  

> Of course when using it, you have to
> be careful that the surrounding code actually uses it only once and
> not in correlation with some other value.
> 
> Currently we have the big hammer net lock.  A mutex for changing a
> single integer is overkill.  Atomic read and store could be
> implemented, but is not necessary.  Reading values that another CPU
> may modify without volatile is wrong.  Otherwise the compile may
> read multiple times or cache the value.  So READ_ONCE() and
> WRITE_ONCE() seems the way to go.
> 
> How would you implement this?

I would probably using atomic_swap_uint() for the sysctls that can be
changed and not worry about multiple access for the read-only ones.