From: Hans-Jörg Höxer <hshoexer@genua.de>
Subject: Re: [EXT] Re: SEV-ES guest: locore #VC trap handling
To: <tech@openbsd.org>
Date: Tue, 24 Jun 2025 14:12:22 +0200

Hi Mike,

thanks for going through the diff!  The diff at hand is the large diff
I sent some time ago.  It includes more than the actual locore0 #VC diff
(which is committed now).

You raise some good questions and the diff at hand provides more context.
So let me try to answer your questions inline, see below.

Take care,
Hans-Joerg

On Fri, Jun 20, 2025 at 12:10:19AM -0700, Mike Larkin wrote:
> ...
> > Index: sys/arch/amd64/amd64/locore0.S
> > ===================================================================
> > RCS file: /data/mirror/openbsd/cvs/src/sys/arch/amd64/amd64/locore0.S,v
> > diff -u -p -r1.27 locore0.S
> > --- sys/arch/amd64/amd64/locore0.S	5 May 2025 23:02:39 -0000	1.27
> > +++ sys/arch/amd64/amd64/locore0.S	15 Jun 2025 11:03:01 -0000
> > @@ -111,6 +111,7 @@
> >  #include <machine/param.h>
> >  #include <machine/segments.h>
> >  #include <machine/specialreg.h>
> > +#include <machine/trap.h>
> >
> >  /*
> >   * override user-land alignment before including asm.h
> > @@ -193,6 +194,58 @@ bi_size_ok:
> >  	pushl	$PSL_MBO
> >  	popfl
> >
> > +	/*
> > +	 * Setup temporary #VC trap handler, in case we are running
> > +	 * on an AMD CPU in SEV-ES guest mode.  Will be reset by
> > +	 * init_x86_64().
> > +	 * We are setting up two handlers:
> > +	 *
> > +	 * 1) locore_vc_trap32:  Triggered when we are running in
> > +	 *    32-bit legacy mode.
> > +	 *
> > +	 * 2) locore_vc_trap64:  Triggered when we are running in
> > +	 *    32-bit compatibility mode.
> > +	 *
> > +	 * The latter one is used by vmd(8).
> 
> Please clarify; *when* is this used by vmd? I believe you mean when
> we do a direct kernel launch? If not, then why do we need the 32 bit
> one?

Yes.  To make initial work on SEV-* easier I started with just using
direct kernel launch, so no firmware or bootloader is involved.  In that
case, vmd(8) sets up 32-bit compatibility mode.  This means exception
handlers are required to run in a long mode segment, so in that
situation the handler needs to be 64-bit code.

On Linux/KVM with the tianocore EFI firmware we actually run BOOTX64.EFI.
As far as I can see, this runs locore0 in 32-bit legacy mode, so in that
situation the trap handler also runs in 32-bit legacy mode and we need a
32-bit handler, too.
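For illustration, a minimal sketch of the difference (hypothetical struct
names, not the actual locore0 data structures): legacy protected mode uses
8-byte IDT gates with a 32-bit handler offset, while long mode (which also
covers 32-bit compatibility mode) uses 16-byte gates with a 64-bit offset,
and the handler always executes as 64-bit code:

	#include <stdint.h>

	/* Legacy protected mode: 8-byte gate, 32-bit handler. */
	struct idt_gate32 {
		uint16_t	offset_lo;
		uint16_t	selector;	/* 32-bit code segment */
		uint16_t	flags;		/* type, DPL, present */
		uint16_t	offset_hi;
	};

	/* Long/compatibility mode: 16-byte gate, 64-bit handler. */
	struct idt_gate64 {
		uint16_t	offset_lo;
		uint16_t	selector;	/* long mode code segment */
		uint16_t	flags;		/* IST, type, DPL, present */
		uint16_t	offset_mid;
		uint32_t	offset_hi;
		uint32_t	reserved;
	};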

I will update the comment.

Why is vmd(8) configuring compatibility mode for direct kernel launch?
Legacy mode should also work, no?

In the long term we need to be able to use legacy or EFI boot to run SEV-*
guests anyway.  So this code might be simplified in the future.

> ...
> > @@ -384,7 +436,9 @@ cont:
> >  #define PROC0_DMP2_OFF	(PROC0_DMP3_OFF + NDML3_ENTRIES * NBPG)
> >  #define TABLESIZE \
> >      ((NKL4_KIMG_ENTRIES + TABLE_L3_ENTRIES + TABLE_L2_ENTRIES + 1 + UPAGES + \
> > -	NDML3_ENTRIES + NDML2_ENTRIES + 3) * NBPG)
> > +	NDML3_ENTRIES + NDML2_ENTRIES + 2 + 3) * NBPG)
> 
> Can this just be '5' ?

yes.

> ...
> > @@ -514,6 +579,16 @@ map_tables:
> >  	shrl	$PGSHIFT,%ecx
> >  	fillkpt_nx
> >
> > +	/* Re-Map GHCB shared (ie. unencrypted) */
> > +	/* XXX hshoexer:  Only in SEV-ES guestmode. */
> 
> Can we fix the XXXs in this diff before committing please?

yes.

> ...
> > @@ -739,12 +819,131 @@ longmode_hi:
> >  	movw	%ax,%fs
> >
> >  	leaq	TABLESIZE(%rsi),%rdi
> > +	subq	$(NBPG*2), %rdi
> >  	subq	$(NBPG*3), %rdi
> 
> Can these be combined to NBPG * 5 ?

yes.

> 
> >  	/* XXX merge these */
> >  	call	init_x86_64
> >  	call	main
> >
> > +	/* MSR Protocol Request Codes */
> > +#define MSRPROTO_CPUID_REQ	0x4
> > +#define MSRPROTO_TERM_REQ	0x100
> > +
> > +vc_cpuid64:
> > +	shll	$30, %eax		/* requested register */
> > +	orl	$MSRPROTO_CPUID_REQ, %eax
> > +	movl	%ebx, %edx		/* CPUID function */
> > +	movl	$MSR_SEV_GHCB, %ecx
> > +	wrmsr
> > +	rep vmmcall
> 
> Out of curiosity, why is the rep prefix needed here?

In the APM vol3 vmmcall is specified as "0x0f 0x01 0xd9" and vmgexit as
"0xf2/0xf3 0x0f 0x01 0xd9".  As we do not have the vmgexit mnemonic in
llvm yet, using "rep vmmcall" will produce the right byte sequence.
Another option would be to use ".byte".  And of course we could add the
vmgexit mnemonic to llvm.

I think I saw the use of "rep vmmcall" in Linux kernel code.

The difference between vmmcall and vmgexit seems to be that if the
hypervisor does not intercept vmmcall, the CPU will raise #UD and the
guest will not exit.  vmgexit, however, will always exit the guest.
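For illustration, the two equivalent spellings could look like this
(hypothetical helper name, assuming clang/gas-style inline asm):

	/* Emit VMGEXIT (f3 0f 01 d9): REP prefix plus VMMCALL. */
	static inline void
	vmgexit(void)
	{
		__asm volatile("rep; vmmcall" ::: "memory");
		/*
		 * Equivalent, if the mnemonic is unavailable:
		 * __asm volatile(".byte 0xf3, 0x0f, 0x01, 0xd9" ::: "memory");
		 */
	}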

> ...
> > @@ -1314,6 +1317,38 @@ cpu_init_idt(void)
> >  	lidt(&region);
> >  }
> >
> > +uint64_t early_gdt[GDT_SIZE / 8];
> > +
> > +void
> > +cpu_init_early_vctrap(paddr_t addr)
> > +{
> > +	struct region_descriptor region;
> > +
> > +	extern struct region_descriptor gdt64;
> > +	extern struct gate_descriptor early_idt[NIDT];
> > +	extern  void Xvctrap_early(void);
> > +
> > +	/* Setup temporary "early" longmode GDT, will be reset soon */
> > +	memset(early_gdt, 0, sizeof(early_gdt));
> > +	set_mem_segment(GDT_ADDR_MEM(early_gdt, GCODE_SEL), 0, 0xfffff,
> > +	    SDT_MEMERA, SEL_KPL, 1, 0, 1);
> > +	set_mem_segment(GDT_ADDR_MEM(early_gdt, GDATA_SEL), 0, 0xfffff,
> > +	    SDT_MEMRWA, SEL_KPL, 1, 0, 1);
> > +	setregion(&region, early_gdt, GDT_SIZE - 1);
> > +	lgdt(&region);
> > +
> > +	/* Setup temporary "early" longmode #VC entry, will be reset soon */
> > +	setgate(&early_idt[T_VC], Xvctrap_early, 0, SDT_SYS386IGT,
> > +	    SEL_KPL, GSEL(GCODE_SEL, SEL_KPL));
> > +	setregion(&region, early_idt, NIDT * sizeof(idt[0]) - 1);
> > +	lidt(&region);
> > +
> > +	/* Tell vmm(4) about our GHCB. */
> > +	ghcb_paddr = addr;
> > +	memset((void *)ghcb_vaddr, 0, 2 * PAGE_SIZE);
> > +	wrmsr(MSR_SEV_GHCB, ghcb_paddr);
> > +}
> > +
> >  void
> >  cpu_init_extents(void)
> >  {
> > @@ -1433,6 +1468,13 @@ init_x86_64(paddr_t first_avail)
> >  	bios_memmap_t *bmp;
> >  	int x, ist;
> >  	uint64_t max_dm_size = ((uint64_t)512 * NUM_L4_SLOT_DIRECT) << 30;
> > +
> > +	/*
> > +	 * locore0 mapped 2 pages for use as GHCB before pmap is initialized.
> > +	 */
> > +	if (ISSET(cpu_sev_guestmode, SEV_STAT_ES_ENABLED))
> > +		cpu_init_early_vctrap(first_avail);
> > +	first_avail += 2 * NBPG;
> 
> Wasn't 'first_avail' already biased by the subq instructions in locore0 though?
> 
> Why do we need to adjust again?

Good point.  Maybe I misunderstood the existing code regarding
early_pte_pages:  for the "early pte" pages locore0 allocates/reserves 3
pages, and I add another 2 for the "early vc trap".  So first_avail is
the address of the lowest of these 5 pages.  When I actually "consume"
the GHCB pages by passing first_avail to cpu_init_early_vctrap(), I
advance first_avail by 2*NBPG.

Similar to:

+       /*
+        * locore0 mapped 3 pages for use before the pmap is initialized
+        * starting at first_avail. These pages are currently used by
+        * efifb to create early-use VAs for the framebuffer before efifb
+        * is attached.
+        */
+       early_pte_pages = first_avail;
+       first_avail += 3 * NBPG;

So subsequent users of first_avail will not touch these 5 pages.  Does
this make sense?
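
Put together, the intended consumption order would look roughly like
this (a sketch combining the two hunks above, not the literal diff):

	/* first_avail points at the lowest of the 5 reserved pages. */
	if (ISSET(cpu_sev_guestmode, SEV_STAT_ES_ENABLED))
		cpu_init_early_vctrap(first_avail);	/* 2 GHCB pages */
	first_avail += 2 * NBPG;

	early_pte_pages = first_avail;			/* 3 early pte pages */
	first_avail += 3 * NBPG;

	/* Later users of first_avail start above all 5 reserved pages. */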

> >
> >  	/*
> >  	 * locore0 mapped 3 pages for use before the pmap is initialized
>...
> > +int
> > +vctrap(struct trapframe *frame, int user)
> > +{
> > +	uint64_t	 sw_exitcode, sw_exitinfo1, sw_exitinfo2;
> > +	uint8_t		*rip = (uint8_t *)(frame->tf_rip);
> > +	uint16_t	 port;
> > +	struct ghcb_sync syncout, syncin;
> > +	struct ghcb_sa	*ghcb;
> > +
> > +	intr_disable();
> > +
> > +	memset(&syncout, 0, sizeof(syncout));
> > +	memset(&syncin, 0, sizeof(syncin));
> > +
> > +	sw_exitcode = frame->tf_err;
> > +	sw_exitinfo1 = 0;
> > +	sw_exitinfo2 = 0;
> > +
> > +	switch (sw_exitcode) {
> > +	case SVM_VMEXIT_CPUID:
> > +		ghcb_sync_val(GHCB_RAX, GHCB_SZ32, &syncout);
> > +		ghcb_sync_val(GHCB_RCX, GHCB_SZ32, &syncout);
> > +		ghcb_sync_val(GHCB_RAX, GHCB_SZ32, &syncin);
> > +		ghcb_sync_val(GHCB_RBX, GHCB_SZ32, &syncin);
> > +		ghcb_sync_val(GHCB_RCX, GHCB_SZ32, &syncin);
> > +		ghcb_sync_val(GHCB_RDX, GHCB_SZ32, &syncin);
> > +		frame->tf_rip += 2;
> > +		break;
> > +	case SVM_VMEXIT_MSR: {
> > +		if (user)
> > +			return 0;	/* not allowed from userspace */
> > +		if (*rip == 0x0f && *(rip + 1) == 0x30) {
> 
> Doesn't this break with XO kernels?

Good point!  I have not considered XO yet.  Mh, worst case this would
mean we cannot use SEV-(ES|SNP) together with PKU support?  Or would
there be a way to temporarily remap the code readable and then remap it
XO again?

>...
> > +	case SVM_VMEXIT_IOIO: {
> > +		if (user)
> > +			return 0;	/* not allowed from userspace */
> > +		switch (*rip) {
> > +		case 0x66: {
> > +			switch (*(rip + 1)) {
> > +			case 0xef:	/* out %ax,(%dx) */
> > +				ghcb_sync_val(GHCB_RAX, GHCB_SZ16, &syncout);
> > +				port = (uint16_t)frame->tf_rdx;
> > +				sw_exitinfo1 = (port << 16) |
> > +				    (1ULL << 5);
> > +				frame->tf_rip += 2;
> > +				break;
> > +			case 0xed:	/* in (%dx),%ax */
> > +				ghcb_sync_val(GHCB_RAX, GHCB_SZ16, &syncin);
> > +				port = (uint16_t)frame->tf_rdx;
> > +				sw_exitinfo1 = (port << 16) |
> > +				    (1ULL << 5) | (1ULL << 0);
> > +				frame->tf_rip += 2;
> > +				break;
> > +			default:
> > +				panic("failed to decode prefixed IOIO");
> > +			}
> > +			break;
> > +		    }
> > +		case 0xe4:	/* in $0x71,%al */
> 
> I don't understand this assumption. next byte 0xe4 does not imply
> "in $0x71, %al"; what is meant by this comment? (also the next one).

I am looking at the current byte at RIP, not the next one.  Code like this:

ffffffff81903a76:       e4 71                   in     $0x71,%al

Does this make sense?

> > +			ghcb_sync_val(GHCB_RAX, GHCB_SZ8, &syncin);
> > +			port = *(rip + 1);
> > +			sw_exitinfo1 = (port << 16) | (1ULL << 4) |
> > +			    (1ULL << 0);
> > +			frame->tf_rip += 2;
> > +			break;
> ...
> > Index: sys/arch/amd64/amd64/vmm_machdep.c
> > ===================================================================
> > RCS file: /data/mirror/openbsd/cvs/src/sys/arch/amd64/amd64/vmm_machdep.c,v
> > diff -u -p -r1.57 vmm_machdep.c
> > --- sys/arch/amd64/amd64/vmm_machdep.c	3 Jun 2025 19:15:29 -0000	1.57
> > +++ sys/arch/amd64/amd64/vmm_machdep.c	15 Jun 2025 11:03:02 -0000
> > @@ -1588,15 +1588,15 @@ vcpu_reset_regs_svm(struct vcpu *vcpu, s
> >  	    SVM_INTERCEPT_MWAIT_UNCOND | SVM_INTERCEPT_MONITOR |
> >  	    SVM_INTERCEPT_MWAIT_COND | SVM_INTERCEPT_RDTSCP;
> >
> > -	/* With SEV-ES we cannot force access XCR0, thus no intercept */
> > -	if (xsave_mask && !vcpu->vc_seves)
> > +	if (xsave_mask && !vcpu->vc_seves)	/* XXX hshoexer */
> 
> What does this XXX mean, and can we fix before commit?

The XXX is already gone.  I think at that time I wanted to remind myself
to check whether my assumption about XCR0 is right.

> ...
> > @@ -1676,13 +1670,23 @@ vcpu_reset_regs_svm(struct vcpu *vcpu, s
> >
> >  		/* Set VMSA. */
> >  		vmcb->v_vmsa_pa = vcpu->vc_svm_vmsa_pa;
> > +
> > +		/* XXX hshoexer: LBR: guest_state_protected flag? */
> 
> Can we fix this XXX before commit please?

yes, already done.

> > @@ -4608,6 +4623,8 @@ svm_handle_efercr(struct vcpu *vcpu, uin
> >  	case SVM_VMEXIT_CR4_WRITE_TRAP:
> >  		vmcb->v_cr4 = vmcb->v_exitinfo1;
> >  		break;
> > +		/* XXX hshoexer: no state for CR8? */
> 
> Can we fix this XXX please before commit?

yes, already done.

> > +		break;
> >  	default:
> >  		return (EINVAL);
> >  	}
> > @@ -6767,6 +6784,8 @@ vcpu_run_svm(struct vcpu *vcpu, struct v
> >  		 * On exit, interrupts are disabled, and we are running with
> >  		 * the guest FPU state still possibly on the CPU. Save the FPU
> >  		 * state before re-enabling interrupts.
> > +		 *
> > +		 * XXX hshoexer:  With SEV-ES we should be able to skip this.
> 
> Can we fix this XXX please before commit?

yes, already done.

> ...