From:
Mike Larkin <mlarkin@nested.page>
Subject:
Re: [EXT] Re: SEV-ES guest: locore #VC trap handling
To:
tech@openbsd.org
Date:
Thu, 26 Jun 2025 02:02:51 -0700

Thread
  • Hans-Jörg Höxer:

    [EXT] Re: SEV-ES guest: locore #VC trap handling

  • Alexander Bluhm:

    SEV-ES guest: locore #VC trap handling

  • On Tue, Jun 24, 2025 at 02:12:22PM +0200, Hans-Jörg Höxer wrote:
    > Hi Mike,
    >
    > thanks for going through the diff!  The diff at hand is the large diff
    > I sent some time ago.  It includes more than the actual locore0 #VC diff
    > (that is committed now).
    >
    > You raise some good questions and the diff at hand provides more context.
    > So let me try to answer your questions inline, see below.
    >
    > Take care,
    > Hans-Joerg
    >
    
    Thanks for clarifying. A couple small points inline...
    
    -ml
    
    > On Fri, Jun 20, 2025 at 12:10:19AM -0700, Mike Larkin wrote:
    > > ...
    > > > Index: sys/arch/amd64/amd64/locore0.S
    > > > ===================================================================
    > > > RCS file: /data/mirror/openbsd/cvs/src/sys/arch/amd64/amd64/locore0.S,v
    > > > diff -u -p -r1.27 locore0.S
    > > > --- sys/arch/amd64/amd64/locore0.S	5 May 2025 23:02:39 -0000	1.27
    > > > +++ sys/arch/amd64/amd64/locore0.S	15 Jun 2025 11:03:01 -0000
    > > > @@ -111,6 +111,7 @@
    > > >  #include <machine/param.h>
    > > >  #include <machine/segments.h>
    > > >  #include <machine/specialreg.h>
    > > > +#include <machine/trap.h>
    > > >
    > > >  /*
    > > >   * override user-land alignment before including asm.h
    > > > @@ -193,6 +194,58 @@ bi_size_ok:
    > > >  	pushl	$PSL_MBO
    > > >  	popfl
    > > >
    > > > +	/*
    > > > +	 * Setup temporary #VC trap handler, in case we are running
    > > > +	 * on an AMD CPU in SEV-ES guest mode.  Will be reset by
    > > > +	 * init_x86_64().
    > > > +	 * We are setting up two handlers:
    > > > +	 *
    > > > +	 * 1) locore_vc_trap32:  Triggered when we are running in
    > > > +	 *    32-bit legacy mode.
    > > > +	 *
    > > > +	 * 2) locore_vc_trap64:  Triggered when we are running in
    > > > +	 *    32-bit compatibility mode.
    > > > +	 *
    > > > +	 * The latter one is used by vmd(8).
    > >
    > > Please clarify; *when* is this used by vmd? I believe you mean when
    > > we do a direct kernel launch? If not, then why do we need the 32 bit
    > > one?
    >
    > Yes.  To make initial work on SEV-* easier I started with just using
    > direct kernel launch.  So no firmware and no bootloader is involved.  In
    > that case, vmd(8) sets up 32-bit compatibility mode.  This means exception
    > handlers are required to run in a long mode segment.  Thus in that
    > situation the handler needs to be 64-bit code.
    >
    > On Linux/KVM with the tianocore EFI firmware we actually run BOOTX64.EFI.
    > As far as I can see, this will run locore0 in 32-bit legacy mode.
    > So in that situation the trap handler runs in 32-bit legacy mode.
    > Thus we need a 32-bit handler, too.
    >
    > I will update the comment.
    >
    > Why is vmd(8) configuring compatibility mode for direct kernel launch?
    > Legacy mode should also work, no?
    >
    > In the long term we need to be able to use legacy or EFI boot to run SEV-*
    > guests anyway.  So this code might be simplified in the future.
    >
    > > ...
    > > > @@ -384,7 +436,9 @@ cont:
    > > >  #define PROC0_DMP2_OFF	(PROC0_DMP3_OFF + NDML3_ENTRIES * NBPG)
    > > >  #define TABLESIZE \
    > > >      ((NKL4_KIMG_ENTRIES + TABLE_L3_ENTRIES + TABLE_L2_ENTRIES + 1 + UPAGES + \
    > > > -	NDML3_ENTRIES + NDML2_ENTRIES + 3) * NBPG)
    > > > +	NDML3_ENTRIES + NDML2_ENTRIES + 2 + 3) * NBPG)
    > >
    > > Can this just be '5' ?
    >
    > yes.
    >
    > > ...
    > > > @@ -514,6 +579,16 @@ map_tables:
    > > >  	shrl	$PGSHIFT,%ecx
    > > >  	fillkpt_nx
    > > >
    > > > +	/* Re-Map GHCB shared (ie. unencrypted) */
    > > > +	/* XXX hshoexer:  Only in SEV-ES guestmode. */
    > >
    > > Can we fix the XXXs in this diff before committing please?
    >
    > yes.
    >
    > > ...
    > > > @@ -739,12 +819,131 @@ longmode_hi:
    > > >  	movw	%ax,%fs
    > > >
    > > >  	leaq	TABLESIZE(%rsi),%rdi
    > > > +	subq	$(NBPG*2), %rdi
    > > >  	subq	$(NBPG*3), %rdi
    > >
    > > Can these be combined to NBPG * 5 ?
    >
    > yes.
    >
    > >
    > > >  	/* XXX merge these */
    > > >  	call	init_x86_64
    > > >  	call	main
    > > >
    > > > +	/* MSR Protocol Request Codes */
    > > > +#define MSRPROTO_CPUID_REQ	0x4
    > > > +#define MSRPROTO_TERM_REQ	0x100
    > > > +
    > > > +vc_cpuid64:
    > > > +	shll	$30, %eax		/* requested register */
    > > > +	orl	$MSRPROTO_CPUID_REQ, %eax
    > > > +	movl	%ebx, %edx		/* CPUID function */
    > > > +	movl	$MSR_SEV_GHCB, %ecx
    > > > +	wrmsr
    > > > +	rep vmmcall
    > >
    > > Out of curiosity, why is the rep prefix needed here?
    >
    > In the APM vol3 vmmcall is specified as "0x0f 0x01 0xd9" and vmgexit as
    > "0xf2/0xf3 0x0f 0x01 0xd9".  As we do not have the vmgexit mnemonic in
    > llvm yet, using "rep vmmcall" will produce the right byte sequence.
    > Another option would be to use ".byte".  And of course adding the vmgexit
    > mnemonic to llvm.
    >
    > I think I saw the use of "rep vmmcall" in linux kernel code.
    >
    > The difference between vmmcall and vmgexit seems to be that if the
    > hypervisor does not intercept vmmcall, the CPU will raise #UD and the
    > guest will not exit.  However, vmgexit will always exit the guest.
    >
    > > ...
    > > > @@ -1314,6 +1317,38 @@ cpu_init_idt(void)
    > > >  	lidt(&region);
    > > >  }
    > > >
    > > > +uint64_t early_gdt[GDT_SIZE / 8];
    > > > +
    > > > +void
    > > > +cpu_init_early_vctrap(paddr_t addr)
    > > > +{
    > > > +	struct region_descriptor region;
    > > > +
    > > > +	extern struct region_descriptor gdt64;
    > > > +	extern struct gate_descriptor early_idt[NIDT];
    > > > +	extern  void Xvctrap_early(void);
    > > > +
    > > > +	/* Setup temporary "early" longmode GDT, will be reset soon */
    > > > +	memset(early_gdt, 0, sizeof(early_gdt));
    > > > +	set_mem_segment(GDT_ADDR_MEM(early_gdt, GCODE_SEL), 0, 0xfffff,
    > > > +	    SDT_MEMERA, SEL_KPL, 1, 0, 1);
    > > > +	set_mem_segment(GDT_ADDR_MEM(early_gdt, GDATA_SEL), 0, 0xfffff,
    > > > +	    SDT_MEMRWA, SEL_KPL, 1, 0, 1);
    > > > +	setregion(&region, early_gdt, GDT_SIZE - 1);
    > > > +	lgdt(&region);
    > > > +
    > > > +	/* Setup temporary "early" longmode #VC entry, will be reset soon */
    > > > +	setgate(&early_idt[T_VC], Xvctrap_early, 0, SDT_SYS386IGT,
    > > > +	    SEL_KPL, GSEL(GCODE_SEL, SEL_KPL));
    > > > +	setregion(&region, early_idt, NIDT * sizeof(idt[0]) - 1);
    > > > +	lidt(&region);
    > > > +
    > > > +	/* Tell vmm(4) about our GHCB. */
    > > > +	ghcb_paddr = addr;
    > > > +	memset((void *)ghcb_vaddr, 0, 2 * PAGE_SIZE);
    > > > +	wrmsr(MSR_SEV_GHCB, ghcb_paddr);
    > > > +}
    > > > +
    > > >  void
    > > >  cpu_init_extents(void)
    > > >  {
    > > > @@ -1433,6 +1468,13 @@ init_x86_64(paddr_t first_avail)
    > > >  	bios_memmap_t *bmp;
    > > >  	int x, ist;
    > > >  	uint64_t max_dm_size = ((uint64_t)512 * NUM_L4_SLOT_DIRECT) << 30;
    > > > +
    > > > +	/*
    > > > +	 * locore0 mapped 2 pages for use as GHCB before pmap is initialized.
    > > > +	 */
    > > > +	if (ISSET(cpu_sev_guestmode, SEV_STAT_ES_ENABLED))
    > > > +		cpu_init_early_vctrap(first_avail);
    > > > +	first_avail += 2 * NBPG;
    > >
    > > Wasn't 'first_avail' already biased by the subq instructions in locore0 though?
    > >
    > > Why do we need to adjust again?
    >
    > Good point.  Maybe I misunderstood the existing code regarding
    > early_pte_pages:  For "early pte" locore0 allocates/reserves 3 pages, I
    > add another 2 for "early vc trap".  So first_avail is the address of the
    > lowest of these 5 pages.  And when actually "consuming" them by providing
    > them to cpu_init_early_vctrap() with first_avail I adjust by 2*NBPG.
    >
    > Similar to:
    >
    > +       /*
    > +        * locore0 mapped 3 pages for use before the pmap is initialized
    > +        * starting at first_avail. These pages are currently used by
    > +        * efifb to create early-use VAs for the framebuffer before efifb
    > +        * is attached.
    > +        */
    > +       early_pte_pages = first_avail;
    > +       first_avail += 3 * NBPG;
    >
    > So following users of first_avail will not touch these 5 pages.  Does this
    > make sense?
    >
    
    ok
    
    > > >
    > > >  	/*
    > > >  	 * locore0 mapped 3 pages for use before the pmap is initialized
    > >...
    > > > +int
    > > > +vctrap(struct trapframe *frame, int user)
    > > > +{
    > > > +	uint64_t	 sw_exitcode, sw_exitinfo1, sw_exitinfo2;
    > > > +	uint8_t		*rip = (uint8_t *)(frame->tf_rip);
    > > > +	uint16_t	 port;
    > > > +	struct ghcb_sync syncout, syncin;
    > > > +	struct ghcb_sa	*ghcb;
    > > > +
    > > > +	intr_disable();
    > > > +
    > > > +	memset(&syncout, 0, sizeof(syncout));
    > > > +	memset(&syncin, 0, sizeof(syncin));
    > > > +
    > > > +	sw_exitcode = frame->tf_err;
    > > > +	sw_exitinfo1 = 0;
    > > > +	sw_exitinfo2 = 0;
    > > > +
    > > > +	switch (sw_exitcode) {
    > > > +	case SVM_VMEXIT_CPUID:
    > > > +		ghcb_sync_val(GHCB_RAX, GHCB_SZ32, &syncout);
    > > > +		ghcb_sync_val(GHCB_RCX, GHCB_SZ32, &syncout);
    > > > +		ghcb_sync_val(GHCB_RAX, GHCB_SZ32, &syncin);
    > > > +		ghcb_sync_val(GHCB_RBX, GHCB_SZ32, &syncin);
    > > > +		ghcb_sync_val(GHCB_RCX, GHCB_SZ32, &syncin);
    > > > +		ghcb_sync_val(GHCB_RDX, GHCB_SZ32, &syncin);
    > > > +		frame->tf_rip += 2;
    > > > +		break;
    > > > +	case SVM_VMEXIT_MSR: {
    > > > +		if (user)
    > > > +			return 0;	/* not allowed from userspace */
    > > > +		if (*rip == 0x0f && *(rip + 1) == 0x30) {
    > >
    > > Doesn't this break with XO kernels?
    >
    > Good point!  I have not considered XO yet.  Hmm, worst case this would
    > mean we cannot use SEV-(ES|SNP) together with PKU support?  Or would
    > there be a way to temporarily remap the code readable and then remap
    > it XO again?
    >
    
    It was just a thought.  An XO kernel will require much more work; this
    would just add one more thing to the list.  No action needed.
    
    > >...
    > > > +	case SVM_VMEXIT_IOIO: {
    > > > +		if (user)
    > > > +			return 0;	/* not allowed from userspace */
    > > > +		switch (*rip) {
    > > > +		case 0x66: {
    > > > +			switch (*(rip + 1)) {
    > > > +			case 0xef:	/* out %ax,(%dx) */
    > > > +				ghcb_sync_val(GHCB_RAX, GHCB_SZ16, &syncout);
    > > > +				port = (uint16_t)frame->tf_rdx;
    > > > +				sw_exitinfo1 = (port << 16) |
    > > > +				    (1ULL << 5);
    > > > +				frame->tf_rip += 2;
    > > > +				break;
    > > > +			case 0xed:	/* in (%dx),%ax */
    > > > +				ghcb_sync_val(GHCB_RAX, GHCB_SZ16, &syncin);
    > > > +				port = (uint16_t)frame->tf_rdx;
    > > > +				sw_exitinfo1 = (port << 16) |
    > > > +				    (1ULL << 5) | (1ULL << 0);
    > > > +				frame->tf_rip += 2;
    > > > +				break;
    > > > +			default:
    > > > +				panic("failed to decode prefixed IOIO");
    > > > +			}
    > > > +			break;
    > > > +		    }
    > > > +		case 0xe4:	/* in $0x71,%al */
    > >
    > > I don't understand this assumption. next byte 0xe4 does not imply
    > > "in $0x71, %al"; what is meant by this comment? (also the next one).
    >
    > I am looking at the current byte at RIP, not the next one.  Code like this:
    >
    > ffffffff81903a76:       e4 71                   in     $0x71,%al
    >
    > Does this make sense?
    
    My point was "0xe4" just means "in  $xxx, %al", it doesn't mean "$0x71".
    Look at the previous examples (0xef, 0xed); I'd recommend making the
    comment in this case be more generic.
    
    >
    > > > +			ghcb_sync_val(GHCB_RAX, GHCB_SZ8, &syncin);
    > > > +			port = *(rip + 1);
    > > > +			sw_exitinfo1 = (port << 16) | (1ULL << 4) |
    > > > +			    (1ULL << 0);
    > > > +			frame->tf_rip += 2;
    > > > +			break;
    > > ...
    > > > Index: sys/arch/amd64/amd64/vmm_machdep.c
    > > > ===================================================================
    > > > RCS file: /data/mirror/openbsd/cvs/src/sys/arch/amd64/amd64/vmm_machdep.c,v
    > > > diff -u -p -r1.57 vmm_machdep.c
    > > > --- sys/arch/amd64/amd64/vmm_machdep.c	3 Jun 2025 19:15:29 -0000	1.57
    > > > +++ sys/arch/amd64/amd64/vmm_machdep.c	15 Jun 2025 11:03:02 -0000
    > > > @@ -1588,15 +1588,15 @@ vcpu_reset_regs_svm(struct vcpu *vcpu, s
    > > >  	    SVM_INTERCEPT_MWAIT_UNCOND | SVM_INTERCEPT_MONITOR |
    > > >  	    SVM_INTERCEPT_MWAIT_COND | SVM_INTERCEPT_RDTSCP;
    > > >
    > > > -	/* With SEV-ES we cannot force access XCR0, thus no intercept */
    > > > -	if (xsave_mask && !vcpu->vc_seves)
    > > > +	if (xsave_mask && !vcpu->vc_seves)	/* XXX hshoexer */
    > >
    > > What does this XXX mean, and can we fix before commit?
    >
    > the XXX is already gone.  I think at that time I wanted to remind myself
    > to check if my assumption about XCR0 is right.
    >
    > > ...
    > > > @@ -1676,13 +1670,23 @@ vcpu_reset_regs_svm(struct vcpu *vcpu, s
    > > >
    > > >  		/* Set VMSA. */
    > > >  		vmcb->v_vmsa_pa = vcpu->vc_svm_vmsa_pa;
    > > > +
    > > > +		/* XXX hshoexer: LBR: guest_state_protected flag? */
    > >
    > > Can we fix this XXX before commit please?
    >
    > yes, already done.
    >
    > > > @@ -4608,6 +4623,8 @@ svm_handle_efercr(struct vcpu *vcpu, uin
    > > >  	case SVM_VMEXIT_CR4_WRITE_TRAP:
    > > >  		vmcb->v_cr4 = vmcb->v_exitinfo1;
    > > >  		break;
    > > > +		/* XXX hshoexer: no state for CR8? */
    > >
    > > Can we fix this XXX please before commit?
    >
    > yes, already done.
    >
    > > > +		break;
    > > >  	default:
    > > >  		return (EINVAL);
    > > >  	}
    > > > @@ -6767,6 +6784,8 @@ vcpu_run_svm(struct vcpu *vcpu, struct v
    > > >  		 * On exit, interrupts are disabled, and we are running with
    > > >  		 * the guest FPU state still possibly on the CPU. Save the FPU
    > > >  		 * state before re-enabling interrupts.
    > > > +		 *
    > > > +		 * XXX hshoexer:  With SEV-ES we should be able to skip this.
    > >
    > > Can we fix this XXX please before commit?
    >
    > yes, already done.
    >
    > > ...
    
    
    