From: Mike Larkin
Subject: Re: [EXT] Re: SEV-ES guest: locore #VC trap handling
To: tech@openbsd.org
Date: Thu, 26 Jun 2025 02:02:51 -0700

On Tue, Jun 24, 2025 at 02:12:22PM +0200, Hans-Jörg Höxer wrote:
> Hi Mike,
>
> thanks for going through the diff!  The diff at hand is the large diff
> I sent some time ago.  It includes more than the actual locore0 #VC
> diff (that is committed now).
>
> You raise some good questions and the diff at hand provides more
> context.  So let me try to answer your questions inline, see below.
>
> Take care,
> Hans-Joerg

Thanks for clarifying. A couple small points inline...

-ml

> On Fri, Jun 20, 2025 at 12:10:19AM -0700, Mike Larkin wrote:
> > ...
> > > Index: sys/arch/amd64/amd64/locore0.S
> > > ===================================================================
> > > RCS file: /data/mirror/openbsd/cvs/src/sys/arch/amd64/amd64/locore0.S,v
> > > diff -u -p -r1.27 locore0.S
> > > --- sys/arch/amd64/amd64/locore0.S	5 May 2025 23:02:39 -0000	1.27
> > > +++ sys/arch/amd64/amd64/locore0.S	15 Jun 2025 11:03:01 -0000
> > > @@ -111,6 +111,7 @@
> > >  #include
> > >  #include
> > >  #include
> > > +#include
> > >
> > >  /*
> > >   * override user-land alignment before including asm.h
> > > @@ -193,6 +194,58 @@ bi_size_ok:
> > >  	pushl	$PSL_MBO
> > >  	popfl
> > >
> > > +	/*
> > > +	 * Setup temporary #VC trap handler, in case we are running
> > > +	 * on an AMD CPU in SEV-ES guest mode.  Will be reset by
> > > +	 * init_x86_64().
> > > +	 * We are setting up two handlers:
> > > +	 *
> > > +	 * 1) locore_vc_trap32: Triggered when we are running in
> > > +	 *    32-bit legacy mode.
> > > +	 *
> > > +	 * 2) locore_vc_trap64: Triggered when we are running in
> > > +	 *    32-bit compatibility mode.
> > > +	 *
> > > +	 * The latter one is used by vmd(8).
> >
> > Please clarify; *when* is this used by vmd? I believe you mean when
> > we do a direct kernel launch? If not, then why do we need the 32 bit
> > one?
>
> Yes.  To make initial work on SEV-* easier I started with just using
> direct kernel launch.  So no firmware and no bootloader is involved.
> In that case, vmd(8) sets up 32-bit compatibility mode.  This means
> exception handlers are required to run in a long mode segment.  Thus
> in that situation the handler needs to be 64-bit code.
>
> On Linux/KVM with the tianocore EFI firmware we actually run
> BOOTX64.EFI.  As far as I can see, this will run locore0 in 32-bit
> legacy mode.  So in that situation the trap handler runs in 32-bit
> legacy mode.  Thus we need a 32-bit handler, too.
>
> I will update the comment.
>
> Why is vmd(8) configuring compatibility mode for direct kernel launch?
> Legacy mode should also work, no?
>
> In the long term we need to be able to use legacy or EFI boot to run
> SEV-* guests anyway.  So this code might be simplified in the future.
>
> > ...
> > > @@ -384,7 +436,9 @@ cont:
> > >  #define PROC0_DMP2_OFF	(PROC0_DMP3_OFF + NDML3_ENTRIES * NBPG)
> > >  #define TABLESIZE \
> > >      ((NKL4_KIMG_ENTRIES + TABLE_L3_ENTRIES + TABLE_L2_ENTRIES + 1 + UPAGES + \
> > > -	NDML3_ENTRIES + NDML2_ENTRIES + 3) * NBPG)
> > > +	NDML3_ENTRIES + NDML2_ENTRIES + 2 + 3) * NBPG)
> >
> > Can this just be '5' ?
>
> yes.
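
For reference, with the constants folded the define would then read
something like:

        #define TABLESIZE \
            ((NKL4_KIMG_ENTRIES + TABLE_L3_ENTRIES + TABLE_L2_ENTRIES + 1 + UPAGES + \
            NDML3_ENTRIES + NDML2_ENTRIES + 5) * NBPG)

i.e. the three "early pte" pages and the two GHCB pages reserved in one
constant.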

> > ...
> > > @@ -514,6 +579,16 @@ map_tables:
> > >  	shrl	$PGSHIFT,%ecx
> > >  	fillkpt_nx
> > >
> > > +	/* Re-Map GHCB shared (ie. unencrypted) */
> > > +	/* XXX hshoexer: Only in SEV-ES guestmode. */
> >
> > Can we fix the XXXs in this diff before committing please?
>
> yes.
>
> > ...
> > > @@ -739,12 +819,131 @@ longmode_hi:
> > >  	movw	%ax,%fs
> > >
> > >  	leaq	TABLESIZE(%rsi),%rdi
> > > +	subq	$(NBPG*2), %rdi
> > >  	subq	$(NBPG*3), %rdi
> >
> > Can these be combined to NBPG * 5 ?
>
> yes.
>
> > >
> > >  	/* XXX merge these */
> > >  	call	init_x86_64
> > >  	call	main
> > >
> > > +	/* MSR Protocol Request Codes */
> > > +#define MSRPROTO_CPUID_REQ	0x4
> > > +#define MSRPROTO_TERM_REQ	0x100
> > > +
> > > +vc_cpuid64:
> > > +	shll	$30, %eax		/* requested register */
> > > +	orl	$MSRPROTO_CPUID_REQ, %eax
> > > +	movl	%ebx, %edx		/* CPUID function */
> > > +	movl	$MSR_SEV_GHCB, %ecx
> > > +	wrmsr
> > > +	rep vmmcall
> >
> > Out of curiosity, why is the rep prefix needed here?
>
> In the APM vol3 vmmcall is specified as "0x0f 0x01 0xd9" and vmgexit
> as "0xf2/0xf3 0x0f 0x01 0xd9".  As we do not have the vmgexit mnemonic
> in llvm yet, using "rep vmmcall" will produce the right byte sequence.
> Another option would be to use ".byte".  And of course adding the
> vmgexit mnemonic to llvm.
>
> I think I saw the use of "rep vmmcall" in Linux kernel code.
>
> The difference between vmmcall and vmgexit seems to be that if the
> hypervisor does not intercept vmmcall, the CPU will raise #UD and the
> guest will not exit.  However, vmgexit will always exit the guest.
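
Makes sense, thanks. For the archive, here is a rough C rendering of
what vc_cpuid64 above does; this is only a sketch, with the field
layout taken from the asm (request code in the low bits, register
index 0..3 in bits 31:30, CPUID function in bits 63:32) and
illustrative helper names:

        #include <stdint.h>

        #define MSR_SEV_GHCB            0xc0010130      /* GHCB MSR */
        #define MSRPROTO_CPUID_REQ      0x4

        static inline void
        wrmsr(uint32_t msr, uint64_t v)
        {
                __asm volatile("wrmsr" : : "c"(msr), "a"((uint32_t)v),
                    "d"((uint32_t)(v >> 32)));
        }

        static inline void
        vmgexit(void)
        {
                /* "rep; vmmcall" assembles to f3 0f 01 d9, i.e. vmgexit */
                __asm volatile("rep; vmmcall" : : : "memory");
        }

        /* Request one register (0=eax, 1=ebx, 2=ecx, 3=edx) of a CPUID leaf. */
        static void
        msrproto_cpuid_req(uint32_t leaf, uint32_t reg)
        {
                uint64_t req;

                req = ((uint64_t)leaf << 32) |          /* CPUID function */
                    ((uint64_t)(reg & 0x3) << 30) |     /* requested register */
                    MSRPROTO_CPUID_REQ;
                wrmsr(MSR_SEV_GHCB, req);
                vmgexit();
                /* the response is then read back with rdmsr(MSR_SEV_GHCB) */
        }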

> > ...
> > > @@ -1314,6 +1317,38 @@ cpu_init_idt(void)
> > >  	lidt(&region);
> > >  }
> > >
> > > +uint64_t early_gdt[GDT_SIZE / 8];
> > > +
> > > +void
> > > +cpu_init_early_vctrap(paddr_t addr)
> > > +{
> > > +	struct region_descriptor region;
> > > +
> > > +	extern struct region_descriptor gdt64;
> > > +	extern struct gate_descriptor early_idt[NIDT];
> > > +	extern void Xvctrap_early(void);
> > > +
> > > +	/* Setup temporary "early" longmode GDT, will be reset soon */
> > > +	memset(early_gdt, 0, sizeof(early_gdt));
> > > +	set_mem_segment(GDT_ADDR_MEM(early_gdt, GCODE_SEL), 0, 0xfffff,
> > > +	    SDT_MEMERA, SEL_KPL, 1, 0, 1);
> > > +	set_mem_segment(GDT_ADDR_MEM(early_gdt, GDATA_SEL), 0, 0xfffff,
> > > +	    SDT_MEMRWA, SEL_KPL, 1, 0, 1);
> > > +	setregion(&region, early_gdt, GDT_SIZE - 1);
> > > +	lgdt(&region);
> > > +
> > > +	/* Setup temporary "early" longmode #VC entry, will be reset soon */
> > > +	setgate(&early_idt[T_VC], Xvctrap_early, 0, SDT_SYS386IGT,
> > > +	    SEL_KPL, GSEL(GCODE_SEL, SEL_KPL));
> > > +	setregion(&region, early_idt, NIDT * sizeof(idt[0]) - 1);
> > > +	lidt(&region);
> > > +
> > > +	/* Tell vmm(4) about our GHCB. */
> > > +	ghcb_paddr = addr;
> > > +	memset((void *)ghcb_vaddr, 0, 2 * PAGE_SIZE);
> > > +	wrmsr(MSR_SEV_GHCB, ghcb_paddr);
> > > +}
> > > +
> > >  void
> > >  cpu_init_extents(void)
> > >  {
> > > @@ -1433,6 +1468,13 @@ init_x86_64(paddr_t first_avail)
> > >  	bios_memmap_t *bmp;
> > >  	int x, ist;
> > >  	uint64_t max_dm_size = ((uint64_t)512 * NUM_L4_SLOT_DIRECT) << 30;
> > > +
> > > +	/*
> > > +	 * locore0 mapped 2 pages for use as GHCB before pmap is initialized.
> > > +	 */
> > > +	if (ISSET(cpu_sev_guestmode, SEV_STAT_ES_ENABLED))
> > > +		cpu_init_early_vctrap(first_avail);
> > > +	first_avail += 2 * NBPG;
> >
> > Wasn't 'first_avail' already biased by the subq instructions in
> > locore0 though?
> >
> > Why do we need to adjust again?
>
> Good point.  Maybe I misunderstood the existing code regarding
> early_pte_pages:  For "early pte" locore0 allocates/reserves 3 pages,
> I add another 2 for "early vc trap".  So first_avail is the address of
> the lowest of these 5 pages.  And when actually "consuming" them by
> providing them to cpu_init_early_vctrap() with first_avail I adjust by
> 2*NBPG.
>
> Similar to:
>
> +	/*
> +	 * locore0 mapped 3 pages for use before the pmap is initialized
> +	 * starting at first_avail.  These pages are currently used by
> +	 * efifb to create early-use VAs for the framebuffer before efifb
> +	 * is attached.
> +	 */
> +	early_pte_pages = first_avail;
> +	first_avail += 3 * NBPG;
>
> So following users of first_avail will not touch these 5 pages.  Does
> this make sense?

ok

> >
> > >  	/*
> > >  	 * locore0 mapped 3 pages for use before the pmap is initialized
> > ...
> > > +int
> > > +vctrap(struct trapframe *frame, int user)
> > > +{
> > > +	uint64_t sw_exitcode, sw_exitinfo1, sw_exitinfo2;
> > > +	uint8_t *rip = (uint8_t *)(frame->tf_rip);
> > > +	uint16_t port;
> > > +	struct ghcb_sync syncout, syncin;
> > > +	struct ghcb_sa *ghcb;
> > > +
> > > +	intr_disable();
> > > +
> > > +	memset(&syncout, 0, sizeof(syncout));
> > > +	memset(&syncin, 0, sizeof(syncin));
> > > +
> > > +	sw_exitcode = frame->tf_err;
> > > +	sw_exitinfo1 = 0;
> > > +	sw_exitinfo2 = 0;
> > > +
> > > +	switch (sw_exitcode) {
> > > +	case SVM_VMEXIT_CPUID:
> > > +		ghcb_sync_val(GHCB_RAX, GHCB_SZ32, &syncout);
> > > +		ghcb_sync_val(GHCB_RCX, GHCB_SZ32, &syncout);
> > > +		ghcb_sync_val(GHCB_RAX, GHCB_SZ32, &syncin);
> > > +		ghcb_sync_val(GHCB_RBX, GHCB_SZ32, &syncin);
> > > +		ghcb_sync_val(GHCB_RCX, GHCB_SZ32, &syncin);
> > > +		ghcb_sync_val(GHCB_RDX, GHCB_SZ32, &syncin);
> > > +		frame->tf_rip += 2;
> > > +		break;
> > > +	case SVM_VMEXIT_MSR: {
> > > +		if (user)
> > > +			return 0;	/* not allowed from userspace */
> > > +		if (*rip == 0x0f && *(rip + 1) == 0x30) {
> >
> > Doesn't this break with XO kernels?
>
> Good point!  I have not considered XO, yet.  Mh, worst case this would
> mean we can not use SEV-(ES|SNP) together with PKU support?  Or would
> there be a way to temporarily remap code readable and then remap XO?

It was just a thought. XO kernel will require much more to be done;
this would just add one more thing to the list. No action needed.

> > ...
> > > +	case SVM_VMEXIT_IOIO: {
> > > +		if (user)
> > > +			return 0;	/* not allowed from userspace */
> > > +		switch (*rip) {
> > > +		case 0x66: {
> > > +			switch (*(rip + 1)) {
> > > +			case 0xef:	/* out %ax,(%dx) */
> > > +				ghcb_sync_val(GHCB_RAX, GHCB_SZ16, &syncout);
> > > +				port = (uint16_t)frame->tf_rdx;
> > > +				sw_exitinfo1 = (port << 16) |
> > > +				    (1ULL << 5);
> > > +				frame->tf_rip += 2;
> > > +				break;
> > > +			case 0xed:	/* in (%dx),%ax */
> > > +				ghcb_sync_val(GHCB_RAX, GHCB_SZ16, &syncin);
> > > +				port = (uint16_t)frame->tf_rdx;
> > > +				sw_exitinfo1 = (port << 16) |
> > > +				    (1ULL << 5) | (1ULL << 0);
> > > +				frame->tf_rip += 2;
> > > +				break;
> > > +			default:
> > > +				panic("failed to decode prefixed IOIO");
> > > +			}
> > > +			break;
> > > +		}
> > > +		case 0xe4:	/* in $0x71,%al */
> >
> > I don't understand this assumption. next byte 0xe4 does not imply
> > "in $0x71, %al"; what is meant by this comment? (also the next one).
>
> I am looking at the current byte at RIP, not the next one.  Code like
> this:
>
>     ffffffff81903a76:	e4 71	in     $0x71,%al
>
> Does this make sense?

My point was that "0xe4" just means "in $xxx,%al"; it doesn't mean
"$0x71". Look at the previous examples (0xef, 0xed); I'd recommend
making the comment in this case more generic, something like the
sketch after the quoted code below.

> > > +		ghcb_sync_val(GHCB_RAX, GHCB_SZ8, &syncin);
> > > +		port = *(rip + 1);
> > > +		sw_exitinfo1 = (port << 16) | (1ULL << 4) |
> > > +		    (1ULL << 0);
> > > +		frame->tf_rip += 2;
> > > +		break;
> > ...
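
Something like this, i.e. only the comment changes, making clear the
port comes from the imm8 operand rather than being a fixed value:

        case 0xe4:      /* in $imm8,%al */
                ghcb_sync_val(GHCB_RAX, GHCB_SZ8, &syncin);
                port = *(rip + 1);      /* imm8 operand */
                sw_exitinfo1 = (port << 16) | (1ULL << 4) |
                    (1ULL << 0);
                frame->tf_rip += 2;
                break;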

> > ...
> > > Index: sys/arch/amd64/amd64/vmm_machdep.c
> > > ===================================================================
> > > RCS file: /data/mirror/openbsd/cvs/src/sys/arch/amd64/amd64/vmm_machdep.c,v
> > > diff -u -p -r1.57 vmm_machdep.c
> > > --- sys/arch/amd64/amd64/vmm_machdep.c	3 Jun 2025 19:15:29 -0000	1.57
> > > +++ sys/arch/amd64/amd64/vmm_machdep.c	15 Jun 2025 11:03:02 -0000
> > > @@ -1588,15 +1588,15 @@ vcpu_reset_regs_svm(struct vcpu *vcpu, s
> > >  	    SVM_INTERCEPT_MWAIT_UNCOND | SVM_INTERCEPT_MONITOR |
> > >  	    SVM_INTERCEPT_MWAIT_COND | SVM_INTERCEPT_RDTSCP;
> > >
> > > -	/* With SEV-ES we cannot force access XCR0, thus no intercept */
> > > -	if (xsave_mask && !vcpu->vc_seves)
> > > +	if (xsave_mask && !vcpu->vc_seves)	/* XXX hshoexer */
> >
> > What does this XXX mean, and can we fix before commit?
>
> the XXX is already gone.  I think at that time I wanted to remind
> myself to check if my assumption about XCR0 is right.
>
> > ...
> > > @@ -1676,13 +1670,23 @@ vcpu_reset_regs_svm(struct vcpu *vcpu, s
> > >
> > >  	/* Set VMSA. */
> > >  	vmcb->v_vmsa_pa = vcpu->vc_svm_vmsa_pa;
> > > +
> > > +	/* XXX hshoexer: LBR: guest_state_protected flag? */
> >
> > Can we fix this XXX before commit please?
>
> yes, already done.
>
> > > @@ -4608,6 +4623,8 @@ svm_handle_efercr(struct vcpu *vcpu, uin
> > >  	case SVM_VMEXIT_CR4_WRITE_TRAP:
> > >  		vmcb->v_cr4 = vmcb->v_exitinfo1;
> > >  		break;
> > > +	/* XXX hshoexer: no state for CR8? */
> >
> > Can we fix this XXX please before commit?
>
> yes, already done.
>
> > > +		break;
> > >  	default:
> > >  		return (EINVAL);
> > >  	}
> > > @@ -6767,6 +6784,8 @@ vcpu_run_svm(struct vcpu *vcpu, struct v
> > >  	 * On exit, interrupts are disabled, and we are running with
> > >  	 * the guest FPU state still possibly on the CPU.  Save the FPU
> > >  	 * state before re-enabling interrupts.
> > > +	 *
> > > +	 * XXX hshoexer: With SEV-ES we should be able to skip this.
> >
> > Can we fix this XXX please before commit?
>
> yes, already done.
>
> > ...
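
(For the XCR0 one I'd assume the fix is simply restoring the comment
the diff removed, i.e.:

        /* With SEV-ES we cannot force access XCR0, thus no intercept */
        if (xsave_mask && !vcpu->vc_seves)

which reads fine to me.)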