Index | Thread | Search

From:
Mark Kettenis <mark.kettenis@xs4all.nl>
Subject:
Re: ip sysctl atomic
To:
Theo de Raadt <deraadt@cvs.openbsd.org>
Cc:
alexander.bluhm@gmx.net, claudio@openbsd.org, tech@openbsd.org
Date:
Sat, 18 May 2024 20:39:30 +0200

Download raw body.

Thread
> From: Theo de Raadt <deraadt@cvs.openbsd.org>
> Date: Fri, 17 May 2024 13:33:14 -0600 (MDT)
> 
> I think this is going off the rails.
> 
> I think some fictional "everyone must see the same value" is being
> pushed here.
> 
> Can someone please provide an actual real example where this matters?
> 
> I asked before, let me ask again.  The man page says:
> 
>      Unless explicitly noted below, sysctl() returns a consistent snapshot of
>      the data requested.  Consistency is obtained by locking the destination
>      buffer into memory so that the data may be copied out without blocking.
>      Calls to sysctl() are serialized to avoid deadlock.
> 
> I think the last sentence is incompatible with MP, and we should remove it.
> Or add a 100 new biglocks, right?

That last sentence is a weird implementation detial to mention in a
man page.  Someone was really proud of what they did it seems.

The idea of wiring pages to avoid page faults while copying out large
amounts of data isn't necessarily incompatible with MP, but the
current implementation does make it rather easy to introduce lock
order reversals.  It is also entirely unnecessary for sysctls that are
a single int or a small struct.

> Next sentence
> 
>      Consistency is obtained by locking the destination
>      buffer into memory so that the data may be copied out without blocking.
> 
> That has always been untrue.  vslock is an incredibly huge block.

It was true back in the days when MP kernel didn't exist.  These days
you also need a lock to guarantee consistency, like the kernel lock or
the net lock.

> So really it comes down to:
> 
>      Unless explicitly noted below, sysctl() returns a consistent snapshot of
>      the data requested.
> 
> bluhm is trying to fix requests of one int, or one long.

Right, and as long as those are naturally aligned and don't cross a
page boundary.  The values will always be correct...

> Consistant with respect to what??

...since a single integer is always consistent with itself.

> The codepaths he's changing don't sleep.

Not 100% sure about that.  The sysctl codepaths themselves will only
sleep at edges where we grab locks.  But the code paths in the network
stack may sleep at other points. Anyway, with the locks removed there
may be problems in code that usesthe global variables, even without
sleep points.

> Can someone PLEASE tell me of an int/long sysctl value that will be
> read by multiple threads, *while* another thread is changing it (and I mean
> one change) -- and where it is OBSERVABLE in the multiple readers that
> something wasn't consistant?

If we were to remove the sysctl serialization lock (sysctl_lock) I
believe it is possible to observe an inconsistency where if the
variables starts with A, thread 1 changes it to B and thread 2
simultaniously changes it to C, thread 2 may see "A" as the old value,
but the final value may end up being "B".  This may not be a problem,
but it is inconsistent when one assumes that sysctl does a simple swap
operation.  But it is relatively easy to fix by using an atomic swap
operation.

> Some will return a value *before* the write was started (OR after it was
> started but did not finish yet), and others will return a value *after*
> the write was started and finished.
> 
> What's the problem?
> 
> Please demonstrate a specific sysctl request where this is a real issue,
> and if it is, let's go the other way and just reimpose the damn biglock
> and tell people OpenBSD is always going to be slow.

The real issue with the diff is that by dropping the net lock from the
sysctl code paths the global variables that are used to make decisions
in the network stack can now change while running code in the network
stack that holds the net lock.  We must guarantee consistency in those
codepaths.  If net.inet.ip.forwarding=1 and we go down the code path
to forward the packet and further down that code path we check
net.inet.ip.forwaring again and the value is now 0, the code may do
the wrong thing.