[EXT] Re: SEV-ES guest: locore #VC trap handling
On Tue, Jun 24, 2025 at 02:12:22PM +0200, Hans-Jörg Höxer wrote:
> Hi Mike,
>
> thanks for going through the diff! The diff at hand is the large diff
> I sent some time ago. It includes more than the actual locore0 #VC diff
> (that is committed now).
>
> You raise some good questions and the diff at hand provides more context.
> So let me try to answer your questions inline, see below.
>
> Take care,
> Hans-Joerg
>
Thanks for clarifying. A couple small points inline...
-ml
> On Fri, Jun 20, 2025 at 12:10:19AM -0700, Mike Larkin wrote:
> > ...
> > > Index: sys/arch/amd64/amd64/locore0.S
> > > ===================================================================
> > > RCS file: /data/mirror/openbsd/cvs/src/sys/arch/amd64/amd64/locore0.S,v
> > > diff -u -p -r1.27 locore0.S
> > > --- sys/arch/amd64/amd64/locore0.S 5 May 2025 23:02:39 -0000 1.27
> > > +++ sys/arch/amd64/amd64/locore0.S 15 Jun 2025 11:03:01 -0000
> > > @@ -111,6 +111,7 @@
> > > #include <machine/param.h>
> > > #include <machine/segments.h>
> > > #include <machine/specialreg.h>
> > > +#include <machine/trap.h>
> > >
> > > /*
> > > * override user-land alignment before including asm.h
> > > @@ -193,6 +194,58 @@ bi_size_ok:
> > > pushl $PSL_MBO
> > > popfl
> > >
> > > + /*
> > > + * Setup temporary #VC trap handler, in case we are running
> > > + * on an AMD CPU in SEV-ES guest mode. Will be reset by
> > > + * init_x86_64().
> > > + * We are setting up two handlers:
> > > + *
> > > + * 1) locore_vc_trap32: Triggered when we are running in
> > > + * 32-bit legacy mode.
> > > + *
> > > + * 2) locore_vc_trap64: Triggered when we are running in
> > > + * 32-bit compatibility mode.
> > > + *
> > > + * The latter one is used by vmd(8).
> >
> > Please clarify; *when* is this used by vmd? I believe you mean when
> > we do a direct kernel launch? If not, then why do we need the 32-bit
> > one?
>
> Yes. To make initial work on SEV-* easier I started with just using
> direct kernel launch. So no firmware and no bootloader is involved. In
> that case, vmd(8) sets up 32-bit compatibility mode. This means exception
> handlers are required to run in a long mode segment. Thus in that
> situation the handler needs to be 64-bit code.
>
> On Linux/KVM with the tianocore EFI firmware we actually run BOOTX64.EFI.
> As far as I can see, this will run locore0 in 32-bit legacy mode.
> So in that situation the trap handler runs in 32-bit legacy mode.
> Thus we need a 32-bit handler, too.
>
> I will update the comment.
>
> Why is vmd(8) configuring compatibility mode for direct kernel launch?
> Legacy mode should also work, no?
>
> In the long term we need to be able to use legacy or EFI boot to run SEV-*
> guests anyway. So this code might be simplified in the future.
>
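As an aside, for readers following along: the structural reason the two
modes need different handlers can be sketched in C (illustrative only,
not the locore0 code; descriptor layouts per the AMD APM):

	#include <stdint.h>

	/* 32-bit legacy mode: 8-byte IDT entry, 32-bit handler offset */
	struct idt_gate32 {
		uint16_t offset_lo;	/* handler bits 15:0 */
		uint16_t selector;	/* 32-bit code segment selector */
		uint8_t  reserved;
		uint8_t  type_attr;	/* 0x8e: present, 32-bit intr gate */
		uint16_t offset_hi;	/* handler bits 31:16 */
	};

	/* long/compatibility mode: 16-byte IDT entry, 64-bit offset */
	struct idt_gate64 {
		uint16_t offset_lo;	/* handler bits 15:0 */
		uint16_t selector;	/* long mode code segment selector */
		uint8_t  ist;		/* IST index, 0 if unused */
		uint8_t  type_attr;	/* 0x8e: present, 64-bit intr gate */
		uint16_t offset_mid;	/* handler bits 31:16 */
		uint32_t offset_hi;	/* handler bits 63:32 */
		uint32_t reserved;
	};

Once the CPU is in long mode it always uses the 16-byte format, even for
code executing in 32-bit compatibility mode, and the handler itself runs
as 64-bit code; in pure legacy mode it uses the 8-byte format, hence the
separate 32-bit handler.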
> > ...
> > > @@ -384,7 +436,9 @@ cont:
> > > #define PROC0_DMP2_OFF (PROC0_DMP3_OFF + NDML3_ENTRIES * NBPG)
> > > #define TABLESIZE \
> > > ((NKL4_KIMG_ENTRIES + TABLE_L3_ENTRIES + TABLE_L2_ENTRIES + 1 + UPAGES + \
> > > - NDML3_ENTRIES + NDML2_ENTRIES + 3) * NBPG)
> > > + NDML3_ENTRIES + NDML2_ENTRIES + 2 + 3) * NBPG)
> >
> > Can this just be '5' ?
>
> yes.
>
> > ...
> > > @@ -514,6 +579,16 @@ map_tables:
> > > shrl $PGSHIFT,%ecx
> > > fillkpt_nx
> > >
> > > + /* Re-Map GHCB shared (ie. unencrypted) */
> > > + /* XXX hshoexer: Only in SEV-ES guestmode. */
> >
> > Can we fix the XXXs in this diff before committing please?
>
> yes.
>
> > ...
> > > @@ -739,12 +819,131 @@ longmode_hi:
> > > movw %ax,%fs
> > >
> > > leaq TABLESIZE(%rsi),%rdi
> > > + subq $(NBPG*2), %rdi
> > > subq $(NBPG*3), %rdi
> >
> > Can these be combined to NBPG * 5 ?
>
> yes.
>
> >
> > > /* XXX merge these */
> > > call init_x86_64
> > > call main
> > >
> > > + /* MSR Protocol Request Codes */
> > > +#define MSRPROTO_CPUID_REQ 0x4
> > > +#define MSRPROTO_TERM_REQ 0x100
> > > +
> > > +vc_cpuid64:
> > > + shll $30, %eax /* requested register */
> > > + orl $MSRPROTO_CPUID_REQ, %eax
> > > + movl %ebx, %edx /* CPUID function */
> > > + movl $MSR_SEV_GHCB, %ecx
> > > + wrmsr
> > > + rep vmmcall
> >
> > Out of curiosity, why is the rep prefix needed here?
>
> In the APM vol. 3, vmmcall is specified as "0x0f 0x01 0xd9" and vmgexit
> as "0xf2/0xf3 0x0f 0x01 0xd9". As we do not have the vmgexit mnemonic in
> llvm yet, using "rep vmmcall" will produce the right byte sequence.
> Another option would be to use ".byte", or of course to add the vmgexit
> mnemonic to llvm.
>
> I think I saw the use of "rep vmmcall" in Linux kernel code.
>
> The difference between vmmcall and vmgexit seems to be that if the
> hypervisor does not intercept vmmcall, the CPU will raise #UD and the
> guest will not exit. However, vmgexit will always exit the guest.
>
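Makes sense. For the archives, the same encoding could be produced from
C inline asm; a minimal sketch (hypothetical helper, not part of the
diff):

	/* "rep; vmmcall" assembles to f3 0f 01 d9, which is the
	 * VMGEXIT encoding (APM vol. 3) */
	static inline void
	vmgexit(void)
	{
		__asm volatile("rep; vmmcall" ::: "memory");
	}

	/* usage sketch, mirroring vc_cpuid64 above: ask the hypervisor
	 * to emulate CPUID via the GHCB MSR protocol; reg (0..3)
	 * selects EAX..EDX, leaf is the CPUID function */
	uint64_t reg = 0, leaf = 0x8000001f, result;

	wrmsr(MSR_SEV_GHCB, MSRPROTO_CPUID_REQ |
	    (reg << 30) | (leaf << 32));
	vmgexit();
	result = rdmsr(MSR_SEV_GHCB) >> 32;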
> > ...
> > > @@ -1314,6 +1317,38 @@ cpu_init_idt(void)
> > > lidt(&region);
> > > }
> > >
> > > +uint64_t early_gdt[GDT_SIZE / 8];
> > > +
> > > +void
> > > +cpu_init_early_vctrap(paddr_t addr)
> > > +{
> > > + struct region_descriptor region;
> > > +
> > > + extern struct region_descriptor gdt64;
> > > + extern struct gate_descriptor early_idt[NIDT];
> > > + extern void Xvctrap_early(void);
> > > +
> > > + /* Setup temporary "early" longmode GDT, will be reset soon */
> > > + memset(early_gdt, 0, sizeof(early_gdt));
> > > + set_mem_segment(GDT_ADDR_MEM(early_gdt, GCODE_SEL), 0, 0xfffff,
> > > + SDT_MEMERA, SEL_KPL, 1, 0, 1);
> > > + set_mem_segment(GDT_ADDR_MEM(early_gdt, GDATA_SEL), 0, 0xfffff,
> > > + SDT_MEMRWA, SEL_KPL, 1, 0, 1);
> > > + setregion(&region, early_gdt, GDT_SIZE - 1);
> > > + lgdt(&region);
> > > +
> > > + /* Setup temporary "early" longmode #VC entry, will be reset soon */
> > > + setgate(&early_idt[T_VC], Xvctrap_early, 0, SDT_SYS386IGT,
> > > + SEL_KPL, GSEL(GCODE_SEL, SEL_KPL));
> > > + setregion(&region, early_idt, NIDT * sizeof(idt[0]) - 1);
> > > + lidt(&region);
> > > +
> > > + /* Tell vmm(4) about our GHCB. */
> > > + ghcb_paddr = addr;
> > > + memset((void *)ghcb_vaddr, 0, 2 * PAGE_SIZE);
> > > + wrmsr(MSR_SEV_GHCB, ghcb_paddr);
> > > +}
> > > +
> > > void
> > > cpu_init_extents(void)
> > > {
> > > @@ -1433,6 +1468,13 @@ init_x86_64(paddr_t first_avail)
> > > bios_memmap_t *bmp;
> > > int x, ist;
> > > uint64_t max_dm_size = ((uint64_t)512 * NUM_L4_SLOT_DIRECT) << 30;
> > > +
> > > + /*
> > > + * locore0 mapped 2 pages for use as GHCB before pmap is initialized.
> > > + */
> > > + if (ISSET(cpu_sev_guestmode, SEV_STAT_ES_ENABLED))
> > > + cpu_init_early_vctrap(first_avail);
> > > + first_avail += 2 * NBPG;
> >
> > Wasn't 'first_avail' already biased by the subq instructions in locore0 though?
> >
> > Why do we need to adjust again?
>
> Good point. Maybe I misunderstood the existing code regarding
> early_pte_pages: for the "early pte" pages locore0 allocates/reserves 3,
> and I add another 2 for the early #VC trap. So first_avail is the address
> of the lowest of these 5 pages. When actually "consuming" them by passing
> first_avail to cpu_init_early_vctrap(), I adjust by 2*NBPG.
>
> Similar to:
>
> + /*
> + * locore0 mapped 3 pages for use before the pmap is initialized
> + * starting at first_avail. These pages are currently used by
> + * efifb to create early-use VAs for the framebuffer before efifb
> + * is attached.
> + */
> + early_pte_pages = first_avail;
> + first_avail += 3 * NBPG;
>
> So subsequent users of first_avail will not touch these 5 pages. Does
> this make sense?
>
ok
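For the archives, the resulting accounting in init_x86_64() as I now
read it (a sketch using the names from the diff, with the 5 pages
locore0 reserved starting at first_avail):

	/* 2 pages for the GHCB, consumed by cpu_init_early_vctrap() */
	if (ISSET(cpu_sev_guestmode, SEV_STAT_ES_ENABLED))
		cpu_init_early_vctrap(first_avail);
	first_avail += 2 * NBPG;

	/* 3 pages for early PTEs, as before */
	early_pte_pages = first_avail;
	first_avail += 3 * NBPG;

	/* anything allocated from first_avail now skips all 5 pages */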
> > >
> > > /*
> > > * locore0 mapped 3 pages for use before the pmap is initialized
> >...
> > > +int
> > > +vctrap(struct trapframe *frame, int user)
> > > +{
> > > + uint64_t sw_exitcode, sw_exitinfo1, sw_exitinfo2;
> > > + uint8_t *rip = (uint8_t *)(frame->tf_rip);
> > > + uint16_t port;
> > > + struct ghcb_sync syncout, syncin;
> > > + struct ghcb_sa *ghcb;
> > > +
> > > + intr_disable();
> > > +
> > > + memset(&syncout, 0, sizeof(syncout));
> > > + memset(&syncin, 0, sizeof(syncin));
> > > +
> > > + sw_exitcode = frame->tf_err;
> > > + sw_exitinfo1 = 0;
> > > + sw_exitinfo2 = 0;
> > > +
> > > + switch (sw_exitcode) {
> > > + case SVM_VMEXIT_CPUID:
> > > + ghcb_sync_val(GHCB_RAX, GHCB_SZ32, &syncout);
> > > + ghcb_sync_val(GHCB_RCX, GHCB_SZ32, &syncout);
> > > + ghcb_sync_val(GHCB_RAX, GHCB_SZ32, &syncin);
> > > + ghcb_sync_val(GHCB_RBX, GHCB_SZ32, &syncin);
> > > + ghcb_sync_val(GHCB_RCX, GHCB_SZ32, &syncin);
> > > + ghcb_sync_val(GHCB_RDX, GHCB_SZ32, &syncin);
> > > + frame->tf_rip += 2;
> > > + break;
> > > + case SVM_VMEXIT_MSR: {
> > > + if (user)
> > > + return 0; /* not allowed from userspace */
> > > + if (*rip == 0x0f && *(rip + 1) == 0x30) {
> >
> > Doesn't this break with XO kernels?
>
> Good point! I have not considered XO yet. Hm, worst case this would
> mean we cannot use SEV-(ES|SNP) together with PKU support? Or would
> there be a way to temporarily remap the code readable and then remap
> it XO?
>
It was just a thought. An XO kernel will require much more to be done;
this would just add one more thing to the list. No action needed.
> >...
> > > + case SVM_VMEXIT_IOIO: {
> > > + if (user)
> > > + return 0; /* not allowed from userspace */
> > > + switch (*rip) {
> > > + case 0x66: {
> > > + switch (*(rip + 1)) {
> > > + case 0xef: /* out %ax,(%dx) */
> > > + ghcb_sync_val(GHCB_RAX, GHCB_SZ16, &syncout);
> > > + port = (uint16_t)frame->tf_rdx;
> > > + sw_exitinfo1 = (port << 16) |
> > > + (1ULL << 5);
> > > + frame->tf_rip += 2;
> > > + break;
> > > + case 0xed: /* in (%dx),%ax */
> > > + ghcb_sync_val(GHCB_RAX, GHCB_SZ16, &syncin);
> > > + port = (uint16_t)frame->tf_rdx;
> > > + sw_exitinfo1 = (port << 16) |
> > > + (1ULL << 5) | (1ULL << 0);
> > > + frame->tf_rip += 2;
> > > + break;
> > > + default:
> > > + panic("failed to decode prefixed IOIO");
> > > + }
> > > + break;
> > > + }
> > > + case 0xe4: /* in $0x71,%al */
> >
> > I don't understand this assumption. next byte 0xe4 does not imply
> > "in $0x71, %al"; what is meant by this comment? (also the next one).
>
> I am looking at the current byte at RIP, not the next one. Code like this:
>
> ffffffff81903a76: e4 71 in $0x71,%al
>
> Does this make sense?
My point was that "0xe4" just means "in $imm8,%al"; it doesn't imply
"$0x71". Look at the previous examples (0xef, 0xed); I'd recommend
making the comment in this case more generic, e.g. as sketched below.
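Something like this, say (same code as in the diff, only the comments
made generic):

	case 0xe4:	/* in $imm8,%al */
		ghcb_sync_val(GHCB_RAX, GHCB_SZ8, &syncin);
		port = *(rip + 1);	/* port is the immediate byte */
		sw_exitinfo1 = (port << 16) | (1ULL << 4) | (1ULL << 0);
		frame->tf_rip += 2;
		break;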
>
> > > + ghcb_sync_val(GHCB_RAX, GHCB_SZ8, &syncin);
> > > + port = *(rip + 1);
> > > + sw_exitinfo1 = (port << 16) | (1ULL << 4) |
> > > + (1ULL << 0);
> > > + frame->tf_rip += 2;
> > > + break;
> > ...
> > > Index: sys/arch/amd64/amd64/vmm_machdep.c
> > > ===================================================================
> > > RCS file: /data/mirror/openbsd/cvs/src/sys/arch/amd64/amd64/vmm_machdep.c,v
> > > diff -u -p -r1.57 vmm_machdep.c
> > > --- sys/arch/amd64/amd64/vmm_machdep.c 3 Jun 2025 19:15:29 -0000 1.57
> > > +++ sys/arch/amd64/amd64/vmm_machdep.c 15 Jun 2025 11:03:02 -0000
> > > @@ -1588,15 +1588,15 @@ vcpu_reset_regs_svm(struct vcpu *vcpu, s
> > > SVM_INTERCEPT_MWAIT_UNCOND | SVM_INTERCEPT_MONITOR |
> > > SVM_INTERCEPT_MWAIT_COND | SVM_INTERCEPT_RDTSCP;
> > >
> > > - /* With SEV-ES we cannot force access XCR0, thus no intercept */
> > > - if (xsave_mask && !vcpu->vc_seves)
> > > + if (xsave_mask && !vcpu->vc_seves) /* XXX hshoexer */
> >
> > What does this XXX mean, and can we fix before commit?
>
> the XXX is already gone. I think at that time I wanted to remind myself
> to check if my assumption about XCR0 is right.
>
> > ...
> > > @@ -1676,13 +1670,23 @@ vcpu_reset_regs_svm(struct vcpu *vcpu, s
> > >
> > > /* Set VMSA. */
> > > vmcb->v_vmsa_pa = vcpu->vc_svm_vmsa_pa;
> > > +
> > > + /* XXX hshoexer: LBR: guest_state_protected flag? */
> >
> > Can we fix this XXX before commit please?
>
> yes, already done.
>
> > > @@ -4608,6 +4623,8 @@ svm_handle_efercr(struct vcpu *vcpu, uin
> > > case SVM_VMEXIT_CR4_WRITE_TRAP:
> > > vmcb->v_cr4 = vmcb->v_exitinfo1;
> > > break;
> > > + /* XXX hshoexer: no state for CR8? */
> >
> > Can we fix this XXX please before commit?
>
> yes, already done.
>
> > > + break;
> > > default:
> > > return (EINVAL);
> > > }
> > > @@ -6767,6 +6784,8 @@ vcpu_run_svm(struct vcpu *vcpu, struct v
> > > * On exit, interrupts are disabled, and we are running with
> > > * the guest FPU state still possibly on the CPU. Save the FPU
> > > * state before re-enabling interrupts.
> > > + *
> > > + * XXX hshoexer: With SEV-ES we should be able to skip this.
> >
> > Can we fix this XXX please before commit?
>
> yes, already done.
>
> > ...