
From:
Stefan Fritsch <sf@sfritsch.de>
Subject:
Re: [EXT] Re: SEV: busdma bounce buffer
To:
Hans-Jörg Höxer <Hans-Joerg_Hoexer@genua.de>
Cc:
Mark Kettenis <mark.kettenis@xs4all.nl>, tech@openbsd.org
Date:
Tue, 13 Aug 2024 11:34:30 +0200


Thread
  • Hans-Jörg Höxer:

    [EXT] Re: SEV: busdma bounce buffer

  • Hi,
    
    On Tue, 13 Aug 2024, Hans-Jörg Höxer wrote:
    
    > Hi Mark,
    > 
    > On Mon, Aug 12, 2024 at 12:26:35PM +0200, Mark Kettenis wrote:
    > > ...
    > > > the diff below introduces bounce buffers to busdma(9) when the
    > > > kernel is running as a SEV-guest.  This is determined by checking
    > > > cpu_sev_guestmode (set in locore0).
    > > > 
    > > > The goal is to make virtio(4) work for SEV guests.  With the recent
    > > > fixes done by sf@ -- thanks! -- the next step is dma bounce buffering
    > > > for SEV-enabled guests.
    > > 
    > > Is this just for virtio(4)?  Did you consider PCI passthrough?
    > 
    > for now, this is just for virtio(4).  This is what we can do with vmd(8)
    > right now.
    
    Other drivers have lots of problems, too. For example, on qemu/kvm with 
    xhci enabled, OpenBSD with bounce buffers panics during attach. But that 
    can be fixed later, if necessary.
    
    > 
    > > > _bus_dmamap_create():  On map creation we allocate an array of
    > > > PAGE_SIZEd bounce buffers.  These pages are mapped with PMAP_NOCRYPT.
    > > > This means, they are unencrypted and shared with the hypervisor.
    > > > Note that we allocate at least one page for each dma segment.  This
    > > > allows us to not have to worry about a single shared page holding
    > > > clear text data from different dma segments.  On the other hand, for
    > > > segments of size less than a page (e.g. only a few bytes), we always
    > > > use a full page.
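The one-page-per-segment scheme described above can be sketched in plain C. This is an illustrative userspace model only: the struct and function names are hypothetical, and the PMAP_NOCRYPT (shared-with-host) mapping is merely noted in comments, since that happens in the pmap, not here.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SIZE 4096

/* Hypothetical, simplified stand-in for a busdma map. */
struct bounce_map {
	int	  nsegs;	/* max number of DMA segments */
	void	**bounce;	/* one PAGE_SIZEd bounce buffer per segment */
};

/*
 * Allocate one full page of (notionally PMAP_NOCRYPT, i.e. host-shared)
 * bounce memory per segment, so that plaintext from two different DMA
 * segments never ends up in the same shared page.
 */
static int
bounce_map_create(struct bounce_map *map, int nsegs)
{
	int i;

	map->nsegs = nsegs;
	map->bounce = calloc(nsegs, sizeof(void *));
	if (map->bounce == NULL)
		return -1;
	for (i = 0; i < nsegs; i++) {
		/* In the kernel this page would be mapped PMAP_NOCRYPT. */
		map->bounce[i] = aligned_alloc(PAGE_SIZE, PAGE_SIZE);
		if (map->bounce[i] == NULL)
			return -1;	/* sketch: caller frees on error */
	}
	return 0;
}
```

The trade-off the mail describes is visible here: a map created with many segments commits a full page per segment up front, even if a segment only ever carries a few bytes.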
    > > 
    > > That is an interesting design choice.  I have a slight worry that this
    > > may lead to excessive allocation of bounce buffers if a driver creates
    > > a map with many more segments than what is actually needed.  The
    > > bus_dma API does provide a BUS_DMA_ALLOCNOW flag and maybe you should
    > > respect that flag and delay allocation of the bounce buffers until
    > > bus_dmamap_load() gets called.  That said it seems all the virtio
    > > drivers, except viomb(4), set that flag.
    > 
    > I see.  For now my approach is to make sure shared (clear text) and
    > unshared (encrypted) data is clearly separated.  With the cost of using
    > more memory than actually might be needed.  And -- as you point out
    > below -- with more bouncing than might be actually needed.
    > 
    > But when we have something that works well enough to be committed,
    > we can optimize and improve.
    > 
    > I will check out BUS_DMA_ALLOCNOW, thanks for pointing this out.
    
    If the hypervisor offers the virtio indirect descriptors feature, a lot 
    more bus_dmamaps are allocated and one needs quite a lot of memory to boot 
    successfully with bounce buffers enabled. A workaround is to disable 
    indirect descriptors in qemu.
    
    > 
    > > Speaking about viomb(4), how does the ballooning work if you have
    > > encrypted memory?
    > 
    > uh, I have not spent thought on this for now.
    
    No, it cannot work. If the hypervisor re-uses memory from the guest and 
    later tries to give it back, the guest would need to act on that and 
    talk to the security processor again. But I think it is the 
    responsibility of the hypervisor not to offer viomb if SEV is in use, 
    and we don't need to do anything special here. Also, viogpu probably 
    won't work either, because the framebuffer does not do enough 
    bus_dmamap_sync. But that's ok for now.
    
    > > > _bus_dmamap_sync():  On sync data gets now copied from the original
    > > > encrypted pages to the unencrypted bounce buffer; and vice versa.
    > > > 
    > > > _bus_dmamap_load_buffer():  When loading actual pages (these are
    > > > always encrypted) into the map, we replace the ds_addr in the
    > > > descriptor with the (guest physical) address of the bounce buffer.
    > > > The original address is stored in the new member ds_addr2.
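The copy direction in _bus_dmamap_sync() follows the usual bus_dmamap_sync(9) semantics; a minimal userspace sketch of just the bounce copy (the flag names and function are illustrative, not the real busdma internals):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flags mirroring bus_dmamap_sync(9) operations. */
#define SYNC_PREWRITE	1	/* CPU done writing; device will read  */
#define SYNC_POSTREAD	2	/* device done writing; CPU will read  */

/*
 * Copy between the encrypted original buffer and the shared
 * (unencrypted) bounce buffer, in the direction the operation implies.
 */
static void
bounce_sync(void *orig, void *bounce, size_t len, int op)
{
	if (op & SYNC_PREWRITE)
		memcpy(bounce, orig, len);	/* expose data to the host */
	if (op & SYNC_POSTREAD)
		memcpy(orig, bounce, len);	/* pull device data back in */
}
```

A driver that skips the sync (as the viogpu framebuffer case mentioned above) would leave stale data on one side of this copy, which is why the bounce scheme makes missing bus_dmamap_sync calls immediately visible.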
    > > 
    > > So this means that you also encrypt the rings.  Is there any point in
    > > doing this?  I mean you're sharing the contents of the ring with the
    > > host, so you're not really hiding anything.  So it might be worth
    > > changing bus_dmamem_alloc() to allocate non-encrypted memory instead.
    > > Or rather have bus_dmamem_map() map it with the PMAP_NOCRYPT flag and
    > > set a flag in the segments such that bus_dmamap_load() can't skip the
    > > bouncing.
    > 
    > good idea.
    
    I agree that that is where we should ultimately be going. Then we also 
    need to make the virtio drivers more robust against inconsistent data 
    in the rings.
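The robustness point above is the classic shared-memory hardening pattern: read each host-controlled ring field exactly once into a local copy, validate the copy, and never re-read the shared location. A hedged sketch (the names and ring layout are invented for illustration; virtio's real ring structures differ):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define RING_SIZE 8	/* illustrative ring size, power of two */

/*
 * Read a descriptor index from host-shared ring memory exactly once
 * and bounds-check the local copy before using it.  Re-reading the
 * shared slot would let the host change the value between the check
 * and the use (a TOCTOU race).
 */
static int
ring_read_index(_Atomic uint16_t *shared_slot, uint16_t *out)
{
	uint16_t idx = atomic_load(shared_slot);	/* single read */

	if (idx >= RING_SIZE)
		return -1;	/* host handed us garbage; fail closed */
	*out = idx;
	return 0;
}
```

With unencrypted rings the hypervisor can write arbitrary values at any time, so every index, length, and flag coming out of the ring needs this copy-then-validate treatment.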
    
    > 
    > > Anyway, it's entirely reasonable to postpone that as a future
    > > optimization.
    > 
    > yes!
    
    Agreed.
    
    Cheers,
    Stefan
    Date: Tue, 13 Aug 2024 14:24:34 +0300
    From: Vitaliy Makkoveev <mvs@openbsd.org>
    To: tech@openbsd.org
    Subject: sysctl(2): make sysctl_int{,_lower}() mp-safe and unlock KERN_HOSTID
    
    Make sysctl_int() mp-safe in the sysctl_int_bounded() way and unlock
    KERN_HOSTID. Except for the bootstrap-time clockattach() call on
    sparc64, `hostid' is never used outside kern_sysctl().
    
    An mp-safe sysctl_int() is of no use to sysctl_int_lower(), so rework
    it too. This time all affected paths are kernel locked, but this
    doesn't make sysctl_int_lower() any worse.
    
    ok?
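The lower-only update in the diff below relies on a compare-and-swap loop so that a value lowered concurrently by another thread can never be raised back. A minimal userspace sketch of the same pattern, using C11 atomics in place of the kernel's atomic_cas_uint (the function name is illustrative, and copyin/copyout are elided):

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

/*
 * Lower-only update: retry until the CAS succeeds, refusing any
 * attempt to raise the value.  If another thread lowers the value
 * below newval between our load and the CAS, the retry re-checks
 * and refuses, so the invariant "never raised" holds under races.
 */
static int
int_lower(_Atomic unsigned int *valp, unsigned int newval)
{
	unsigned int oldval;

	do {
		oldval = atomic_load(valp);
		if (oldval < newval)
			return EPERM;	/* do not allow raising */
	} while (!atomic_compare_exchange_weak(valp, &oldval, newval));

	return 0;
}
```

This mirrors the do/while loop in the patched sysctl_int_lower(): the check happens inside the loop, so it is re-evaluated against the freshly observed value on every CAS failure.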
    
    Index: sys/kern/kern_sysctl.c
    ===================================================================
    RCS file: /cvs/src/sys/kern/kern_sysctl.c,v
    diff -u -p -r1.437 kern_sysctl.c
    --- sys/kern/kern_sysctl.c	11 Aug 2024 15:10:53 -0000	1.437
    +++ sys/kern/kern_sysctl.c	13 Aug 2024 11:03:24 -0000
    @@ -507,6 +507,10 @@ kern_sysctl(int *name, u_int namelen, vo
     		return (sysctl_rdstring(oldp, oldlenp, newp, version));
     	case KERN_NUMVNODES:  /* XXX numvnodes is a long */
     		return (sysctl_rdint(oldp, oldlenp, newp, numvnodes));
    +	case KERN_HOSTID:
    +		/* XXX assumes sizeof long >= sizeof int */
    +		return (sysctl_int(oldp, oldlenp, newp, newlen,
    +		    (int *)&hostid));
     	case KERN_CLOCKRATE:
     		return (sysctl_clockrate(oldp, oldlenp, newp));
     	case KERN_BOOTTIME: {
    @@ -585,7 +589,7 @@ int
     kern_sysctl_locked(int *name, u_int namelen, void *oldp, size_t *oldlenp,
         void *newp, size_t newlen, struct proc *p)
     {
    -	int error, level, inthostid, stackgap;
    +	int error, level, stackgap;
     	dev_t dev;
     	extern int pool_debug;
     
    @@ -623,11 +627,6 @@ kern_sysctl_locked(int *name, u_int name
     		if (newp && !error)
     			domainnamelen = newlen;
     		return (error);
    -	case KERN_HOSTID:
    -		inthostid = hostid;  /* XXX assumes sizeof long <= sizeof int */
    -		error =  sysctl_int(oldp, oldlenp, newp, newlen, &inthostid);
    -		hostid = inthostid;
    -		return (error);
     	case KERN_CONSBUF:
     		if ((error = suser(p)))
     			return (error);
    @@ -1055,17 +1054,36 @@ int
     sysctl_int_lower(void *oldp, size_t *oldlenp, void *newp, size_t newlen,
         int *valp)
     {
    -	unsigned int oval = *valp, val = *valp;
    +	unsigned int oldval, newval;
     	int error;
     
    -	if (newp == NULL)
    -		return (sysctl_rdint(oldp, oldlenp, newp, val));
    +	if (oldp && *oldlenp < sizeof(int))
    +		return (ENOMEM);
    +	if (newp && newlen != sizeof(int))
    +		return (EINVAL);
    +	*oldlenp = sizeof(int);
    +
    +	if (newp) {
    +		if ((error = copyin(newp, &newval, sizeof(int))))
    +			return (error);
    +		do {
    +			oldval = atomic_load_int(valp);
    +			if (oldval < (unsigned int)newval)
    +				return (EPERM);	/* do not allow raising */
    +		} while (atomic_cas_uint(valp, oldval, newval) != oldval);
    +
    +		if (oldp) {
    +			/* new value has been set although user gets error */
    +			if ((error = copyout(&oldval, oldp, sizeof(int))))
    +				return (error);
    +		}
    +	} else if (oldp) {
    +		oldval = atomic_load_int(valp);
    +
    +		if ((error = copyout(&oldval, oldp, sizeof(int))))
    +			return (error);	
    +	}
     
    -	if ((error = sysctl_int(oldp, oldlenp, newp, newlen, &val)))
    -		return (error);
    -	if (val > oval)
    -		return (EPERM);		/* do not allow raising */
    -	*(unsigned int *)valp = val;
     	return (0);
     }
     
    @@ -1076,6 +1094,7 @@ sysctl_int_lower(void *oldp, size_t *old
     int
     sysctl_int(void *oldp, size_t *oldlenp, void *newp, size_t newlen, int *valp)
     {
    +	int oldval, newval;
     	int error = 0;
     
     	if (oldp && *oldlenp < sizeof(int))
    @@ -1083,11 +1102,25 @@ sysctl_int(void *oldp, size_t *oldlenp, 
     	if (newp && newlen != sizeof(int))
     		return (EINVAL);
     	*oldlenp = sizeof(int);
    -	if (oldp)
    -		error = copyout(valp, oldp, sizeof(int));
    -	if (error == 0 && newp)
    -		error = copyin(newp, valp, sizeof(int));
    -	return (error);
    +
    +	/* copyin() may sleep, call it first */
    +	if (newp) {
    +		if ((error = copyin(newp, &newval, sizeof(int))))
    +			return (error);
    +	}
    +	if (oldp) {
    +		if (newp)
    +			oldval = atomic_swap_uint(valp, newval);
    +		else
    +			oldval = atomic_load_int(valp);
    +		if ((error = copyout(&oldval, oldp, sizeof(int)))) {
    +			/* new value has been set although user gets error */
    +			return (error);
    +		}
    +	} else if (newp)
    +		atomic_store_int(valp, newval);
    +
    +	return (0);
     }
     
     /*
    
    
    