[EXT] Re: SEV: busdma bounce buffer
Hi,
On Tue, 13 Aug 2024, Hans-Jörg Höxer wrote:
> Hi Mark,
>
> On Mon, Aug 12, 2024 at 12:26:35PM +0200, Mark Kettenis wrote:
> > ...
> > > the diff below introduces bounce buffers to busdma(9) when the
> > > kernel is running as a SEV-guest. This is determined by checking
> > > cpu_sev_guestmode (set in locore0).
> > >
> > > The goal is to make virtio(4) work for SEV guests. With the recent
> > > fixes done by sf@ -- thanks! -- the next step is DMA bounce buffering
> > > for SEV-enabled guests.
> >
> > Is this just for virtio(4)? Did you consider PCI passthrough?
>
> for now, this is just for virtio(4). This is what we can do with vmd(8)
> right now.
Other drivers have lots of problems, too. For example, on qemu/kvm with
xhci enabled, OpenBSD with bounce buffers panics during attach. But that
can be fixed later on, if necessary.
>
> > > _bus_dmamap_create(): On map creation we allocate an array of
> > > PAGE_SIZEd bounce buffers. These pages are mapped with PMAP_NOCRYPT.
> > > This means they are unencrypted and shared with the hypervisor.
> > > Note that we allocate at least one page for each dma segment. This
> > > allows us to not have to worry about a single shared page holding
> > > clear text data from different dma segments. On the other hand, for
> > > segments of size less than a page (e.g. only a few bytes), we always
> > > use a full page.
> >
> > That is an interesting design choice. I have a slight worry that this
> > may lead to excessive allocation of bounce buffers if a driver creates
> > a map with many more segments than what is actually needed. The
> > bus_dma API does provide a BUS_DMA_ALLOCNOW flag and maybe you should
> > respect that flag and delay allocation of the bounce buffers until
> > bus_dmamap_load() gets called. That said it seems all the virtio
> > drivers, except viomb(4), set that flag.
>
> I see. For now my approach is to make sure shared (clear text) and
> unshared (encrypted) data is clearly separated, at the cost of using
> more memory than might actually be needed and -- as you point out
> below -- more bouncing than is actually needed.
>
> But when we have something that is working well enough to be committed
> we can optimize and improve.
>
> I will check out BUS_DMA_ALLOCNOW, thanks for pointing this out.
If the hypervisor offers the virtio indirect descriptors feature, a lot
more bus_dmamaps are allocated and one needs quite a lot of memory to boot
successfully with bounce buffers enabled. A workaround is to disable
indirect descriptors in qemu.
>
> > Speaking about viomb(4), how does the ballooning work if you have
> > encrypted memory?
>
> uh, I have not spent thought on this for now.
No, it cannot work. If the hypervisor re-uses memory from the guest and
later tries to give it back to the guest, the guest would need to act on
that and talk to the security processor again. But I think it is the
responsibility of the hypervisor to not offer viomb if SEV is in use and
we don't need to do anything special, here. Also, probably viogpu won't
work either because the framebuffer does not do enough bus_dmamap_sync.
But that's ok for now.
> > > _bus_dmamap_sync(): On sync, data is now copied from the original
> > > encrypted pages to the unencrypted bounce buffer; and vice versa.
> > >
> > > _bus_dmamap_load_buffer(): When loading actual pages (these are
> > > always encrypted) into the map, we replace the ds_addr in the
> > > descriptor with the (guest physical) address of the bounce buffer.
> > > The original address is stored in the new member ds_addr2.
> >
> > So this means that you also encrypt the rings. Is there any point in
> > doing this? I mean you're sharing the contents of the ring with the
> > host, so you're not really hiding anything. So it might be worth
> > changing bus_dmamem_alloc() to allocate non-encrypted memory instead.
> > Or rather have bus_dmamem_map() map it with the PMAP_NOCRYPT flag and
> > set a flag in the segments such that bus_dmamap_load() can't skip the
> > bouncing.
>
> good idea.
I agree that that is where we should ultimately be going. Then we need to
make the virtio drivers more robust against inconsistent data in the
rings, too.
>
> > Anyway, it's entirely reasonable to postpone that as a future
> > optimization.
>
> yes!
Agreed.
Cheers,
Stefan
From owner-tech@openbsd.org Tue Aug 13 05:24:41 2024
Date: Tue, 13 Aug 2024 14:24:34 +0300
From: Vitaliy Makkoveev <mvs@openbsd.org>
To: tech@openbsd.org
Subject: sysctl(2): make sysctl_int{,_lower}() mp-safe and unlock KERN_HOSTID
Message-ID: <ZrtCcsU7kBnExrh6@openbsd.org>
Make sysctl_int() mp-safe in the sysctl_int_bounded() way and unlock
KERN_HOSTID. Except for the bootstrap-time clockattach() call on sparc64,
`hostid' is never used outside kern_sysctl().
An mp-safe sysctl_int() alone doesn't help sysctl_int_lower(), so rework
it too. For now all affected paths are kernel locked, but this doesn't
make sysctl_int_lower() any worse.
ok?
Index: sys/kern/kern_sysctl.c
===================================================================
RCS file: /cvs/src/sys/kern/kern_sysctl.c,v
diff -u -p -r1.437 kern_sysctl.c
--- sys/kern/kern_sysctl.c 11 Aug 2024 15:10:53 -0000 1.437
+++ sys/kern/kern_sysctl.c 13 Aug 2024 11:03:24 -0000
@@ -507,6 +507,10 @@ kern_sysctl(int *name, u_int namelen, vo
return (sysctl_rdstring(oldp, oldlenp, newp, version));
case KERN_NUMVNODES: /* XXX numvnodes is a long */
return (sysctl_rdint(oldp, oldlenp, newp, numvnodes));
+ case KERN_HOSTID:
+ /* XXX assumes sizeof long >= sizeof int */
+ return (sysctl_int(oldp, oldlenp, newp, newlen,
+ (int *)&hostid));
case KERN_CLOCKRATE:
return (sysctl_clockrate(oldp, oldlenp, newp));
case KERN_BOOTTIME: {
@@ -585,7 +589,7 @@ int
kern_sysctl_locked(int *name, u_int namelen, void *oldp, size_t *oldlenp,
void *newp, size_t newlen, struct proc *p)
{
- int error, level, inthostid, stackgap;
+ int error, level, stackgap;
dev_t dev;
extern int pool_debug;
@@ -623,11 +627,6 @@ kern_sysctl_locked(int *name, u_int name
if (newp && !error)
domainnamelen = newlen;
return (error);
- case KERN_HOSTID:
- inthostid = hostid; /* XXX assumes sizeof long <= sizeof int */
- error = sysctl_int(oldp, oldlenp, newp, newlen, &inthostid);
- hostid = inthostid;
- return (error);
case KERN_CONSBUF:
if ((error = suser(p)))
return (error);
@@ -1055,17 +1054,36 @@ int
sysctl_int_lower(void *oldp, size_t *oldlenp, void *newp, size_t newlen,
int *valp)
{
- unsigned int oval = *valp, val = *valp;
+ unsigned int oldval, newval;
int error;
- if (newp == NULL)
- return (sysctl_rdint(oldp, oldlenp, newp, val));
+ if (oldp && *oldlenp < sizeof(int))
+ return (ENOMEM);
+ if (newp && newlen != sizeof(int))
+ return (EINVAL);
+ *oldlenp = sizeof(int);
+
+ if (newp) {
+ if ((error = copyin(newp, &newval, sizeof(int))))
+ return (error);
+ do {
+ oldval = atomic_load_int(valp);
+ if (oldval < (unsigned int)newval)
+ return (EPERM); /* do not allow raising */
+ } while (atomic_cas_uint(valp, oldval, newval) != oldval);
+
+ if (oldp) {
+ /* new value has been set although user gets error */
+ if ((error = copyout(&oldval, oldp, sizeof(int))))
+ return (error);
+ }
+ } else if (oldp) {
+ oldval = atomic_load_int(valp);
+
+ if ((error = copyout(&oldval, oldp, sizeof(int))))
+ return (error);
+ }
- if ((error = sysctl_int(oldp, oldlenp, newp, newlen, &val)))
- return (error);
- if (val > oval)
- return (EPERM); /* do not allow raising */
- *(unsigned int *)valp = val;
return (0);
}
@@ -1076,6 +1094,7 @@ sysctl_int_lower(void *oldp, size_t *old
int
sysctl_int(void *oldp, size_t *oldlenp, void *newp, size_t newlen, int *valp)
{
+ int oldval, newval;
int error = 0;
if (oldp && *oldlenp < sizeof(int))
@@ -1083,11 +1102,25 @@ sysctl_int(void *oldp, size_t *oldlenp,
if (newp && newlen != sizeof(int))
return (EINVAL);
*oldlenp = sizeof(int);
- if (oldp)
- error = copyout(valp, oldp, sizeof(int));
- if (error == 0 && newp)
- error = copyin(newp, valp, sizeof(int));
- return (error);
+
+ /* copyin() may sleep, call it first */
+ if (newp) {
+ if ((error = copyin(newp, &newval, sizeof(int))))
+ return (error);
+ }
+ if (oldp) {
+ if (newp)
+ oldval = atomic_swap_uint(valp, newval);
+ else
+ oldval = atomic_load_int(valp);
+ if ((error = copyout(&oldval, oldp, sizeof(int)))) {
+ /* new value has been set although user gets error */
+ return (error);
+ }
+ } else if (newp)
+ atomic_store_int(valp, newval);
+
+ return (0);
}
/*