From: Alexander Bluhm
Subject: Re: SEV-ES guest: locore #VC trap handling
To: Mike Larkin
Cc: tech@openbsd.org
Date: Mon, 23 Jun 2025 01:36:20 +0200

On Sat, Jun 21, 2025 at 10:43:23PM -0700, Mike Larkin wrote:
> On Fri, Jun 20, 2025 at 09:06:37PM +0200, Alexander Bluhm wrote:
> > On Fri, Jun 20, 2025 at 12:10:19AM -0700, Mike Larkin wrote:
> > > On Sun, Jun 15, 2025 at 04:55:12PM +0200, Alexander Bluhm wrote:
> > > > On Sat, Jun 14, 2025 at 10:23:53PM -0700, Mike Larkin wrote:
> > > > > On Wed, May 21, 2025 at 05:10:27PM +0200, Hans-Jörg Höxer wrote:
> > > > > > Hi,
> > > > > >
> > > > > > this change deals with locore for SEV-ES enabled guests. The approach
> > > > > > might be a bit controversial. And it requires a diff for vmd(8), that
> > > > > > I've also attached, to simplify the discussion:
> > > > > >
> > > > > > SEV-ES guest: locore #VC trap handling
> > > > > >
> > > > > > When locore is executed by a SEV-ES enabled guest the first cpuid
> > > > > > instruction will raise a #VC trap that will need to be handled.
> > > > > > However, at that point in time the guest does not know whether it's
> > > > > > a guest at all, if it is running on an AMD CPU with SEV-ES enabled,
> > > > > > etc.
> > > > > >
> > > > > > To resolve this chicken-egg situation we unconditionally set up a
> > > > > > #VC trap handler. If we are not a SEV-ES enabled guest, or we are
> > > > > > running on some non-AMD CPU, we will not raise #VC (hopefully). On
> > > > > > Intel CPUs the vector for #VC is reserved.
> > > > > >
> > > > > > As vmd(8) configures the runtime for locore to be in 32 bit
> > > > > > compatibility mode a raised #VC exception will switch to long mode.
> > > > > > And the CPU will expect a 64 bit entry in the IDT. When running
> > > > > > on e.g. KVM locore is executed in 32 bit legacy mode. There the
> > > > > > CPU will expect a 32 bit entry in the IDT.
> > > > > >
> > > > > > To accommodate both situations, we set up both 64 and 32 bit handlers
> > > > > > in the IDT.
> > > > > >
> > > > > > Additionally, vmd(8) has to set up a long mode segment in the GDT.
> > > > > >
> > > > > > Both #VC trap handlers use the MSR protocol to talk to the hypervisor
> > > > > > to emulate CPUID. The MSR protocol only supports "simple" CPUID
> > > > > > without subfunctions.
> > > > > >
> > > > > > Note: When SEV-ES is enabled, the hypervisor cannot intercept
> > > > > > writes to EFER beforehand, only after the write. Thus on vmm(4)
> > > > > > with a directly executed kernel we are in compatibility mode and
> > > > > > EFER_LMA is set. As resetting EFER_LMA raises #GP we have to
> > > > > > preserve it.
> > > > > >
> > > > > > Take care,
> > > > > > HJ.
> > > > >
> > > > > This one probably needs a lot of testing. What tests have you guys done so
> > > > > far? My SEV-ES machine is available now, I will test this and let you know
> > > > > but we probably want a lot of testing from different host/guest combos.
> > > >
> > > > I know that this part does not break SEV (without ES) on vmm/vmd
> > > > and kvm/qemu.
> > > >
> > > > Below is the complete diff that hshoexer@ sent some months ago
> > > > rebased to current. With that I can run an OpenBSD guest with SEV-ES
> > > > in vmm/vmd. On kvm/qemu SEV still works, but I cannot get a guest
> > > > running with SEV-ES.
> > > >
> > > > So if you agree with the locore0 part, that was in the original
> > > > mail of this thread, I would like to commit it.
> > > >
> > > > bluhm
> > >
> > > I think this diff needs a fair amount of work before it can go in.
> >
> > The latest diff I sent is everything that is missing for SEV-ES
> > support. It has some old XXX from months ago, that have been
> > resolved. It was never intended for commit, I provided it just in
> > case you want to test the whole thing. In your previous mail you
> > said that you want to test something. The huge diff was not intended
> > for review.
> >
> > > See below for comments. This is going to require a ton of testing. How do
> > > you want to proceed?
> > > For example, I think this needs to be tested as a
> > > guest on a number of SEV/SEV-ES capable hypervisors and a wide variety
> > > of hardware. You already mentioned one breakage above, right?
> >
> > The only problem I see is SEV-ES on a Linux KVM/qemu host. But we
> > never had support for that before. I would like to address this
> > after OpenBSD running on vmm/vmd has been committed.
> >
> > I do have Linux KVM/qemu and OpenBSD vmm/vmd on AMD and Intel. On
> > AMD I can run SEV with an OpenBSD guest, but not SEV-ES because the
> > diff is not fully committed yet. As currently SEV-ES is not supported,
> > this is not a regression.
> >
> > Below is the diff that needs review. It is the locore0 part of the
> > original mail in this thread. The only testing that has to be done
> > is that it does not break anything. SEV-ES will not work, as it
> > is not complete.
> >
> > Two of your remarks still apply with this next step.
> >
> > > > @@ -193,6 +194,58 @@ bi_size_ok:
> > > > pushl $PSL_MBO
> > > > popfl
> > > >
> > > > + /*
> > > > + * Setup temporary #VC trap handler, in case we are running
> > > > + * on an AMD CPU in SEV-ES guest mode. Will be reset by
> > > > + * init_x86_64().
> > > > + * We are setting up two handlers:
> > > > + *
> > > > + * 1) locore_vc_trap32: Triggered when we are running in
> > > > + * 32-bit legacy mode.
> > > > + *
> > > > + * 2) locore_vc_trap64: Triggered when we are running in
> > > > + * 32-bit compatibility mode.
> > > > + *
> > > > + * The latter one is used by vmd(8).
> > >
> > > Please clarify; *when* is this used by vmd? I believe you mean when
> > > we do a direct kernel launch? If not, then why do we need the 32 bit
> > > one?
> >
> > There are two ways we may enter the kernel. KVM/qemu uses a special
> > TianoCore EFI implementation. For vmm/vmd currently we support
> > direct kernel boot only. hshoexer@ explained to me why we need
> > both 32 bit methods here, but I forgot.
> >
> > > > /* XXX merge these */
> > > > call init_x86_64
> > > > call main
> > > >
> > > > + /* MSR Protocol Request Codes */
> > > > +#define MSRPROTO_CPUID_REQ 0x4
> > > > +#define MSRPROTO_TERM_REQ 0x100
> > > > +
> > > > +vc_cpuid64:
> > > > + shll $30, %eax /* requested register */
> > > > + orl $MSRPROTO_CPUID_REQ, %eax
> > > > + movl %ebx, %edx /* CPUID function */
> > > > + movl $MSR_SEV_GHCB, %ecx
> > > > + wrmsr
> > > > + rep vmmcall
> > >
> > > Out of curiosity, why is the rep prefix needed here?
> >
> > I don't know. hshoexer?
> >
> > Below is the next diff, that I want to commit. It is guest side
> > only. My testing can only cover that nothing breaks.
> >
> > ok?
> >
> > bluhm
>
> I think this locore diff below looks ok to me but it changes some very
> fundamental early boot behavior. I was asking before what your plan is for
> testing this before commit. I don't think testing this on just a handful
> of machines is sufficient. It's particularly tough with this one as if it
> breaks someone, the machine will pretty much be bricked and require fixing
> via install media. That's why I'm hesitant.

As we do not touch disk or boot loader, boot> obsd should be enough
to recover. I have tested it on laptops and servers, Intel and AMD,
virtual and physical machines, with and without SEV. No fallout so
far. I will commit tomorrow when hshoexer is in the office. Thanks
for all your effort looking into this.

> If you are ready to handle any fallout, then sure. It's probably the right
> time in the release process for this. ok mlarkin on the diff itself in that
> case. There are a few things we can change later (like using .Lxxx in all
> symbols in locore0 and there are a few style(9) things) but nothing that
> warrants fixing now.
>
> P.S. hshoexer: The locore0 page gets unmapped right after we are done with it;
> can you please verify that the #VC traps and locore0 IDTs aren't needed after
> that unmapping?
If I remember correctly, this handler is only temporarily used during
boot. It will be replaced with a permanent handler as soon as the C
code initializes. This will be in a follow-up diff.

bluhm

> > -ml
>
> > > On Wed, May 21, 2025 at 05:19:14PM +0200, Hans-Jörg Höxer wrote:
> > > > SEV-ES guest: locore #VC trap handling
> > > >
> > > > When locore is executed by a SEV-ES enabled guest the first cpuid
> > > > instruction will raise a #VC trap that will need to be handled.
> > > > However, at that point in time the guest does not know whether it's
> > > > a guest at all, if it is running on an AMD CPU with SEV-ES enabled,
> > > > etc.
> > > >
> > > > To resolve this chicken-egg situation we unconditionally set up a
> > > > #VC trap handler. If we are not a SEV-ES enabled guest, or we are
> > > > running on some non-AMD CPU, we will not raise #VC (hopefully). On
> > > > Intel CPUs the vector for #VC is reserved.
> > > >
> > > > As vmd(8) configures the runtime for locore to be in 32 bit
> > > > compatibility mode a raised #VC exception will switch to long mode.
> > > > And the CPU will expect a 64 bit entry in the IDT. When running
> > > > on e.g. KVM locore is executed in 32 bit legacy mode. There the
> > > > CPU will expect a 32 bit entry in the IDT.
> > > >
> > > > To accommodate both situations, we set up both 64 and 32 bit handlers
> > > > in the IDT.
> > > >
> > > > Additionally, vmd(8) has to set up a long mode segment in the GDT.
> > > >
> > > > Both #VC trap handlers use the MSR protocol to talk to the hypervisor
> > > > to emulate CPUID. The MSR protocol only supports "simple" CPUID
> > > > without subfunctions.
> > > >
> > > > Note: When SEV-ES is enabled, the hypervisor cannot intercept
> > > > writes to EFER beforehand, only after the write. Thus on vmm(4)
> > > > with a directly executed kernel we are in compatibility mode and
> > > > EFER_LMA is set. As resetting EFER_LMA raises #GP we have to
> > > > preserve it.
> > Index: arch/amd64/amd64/locore0.S
> > ===================================================================
> > RCS file: /data/mirror/openbsd/cvs/src/sys/arch/amd64/amd64/locore0.S,v
> > diff -u -p -r1.27 locore0.S
> > --- arch/amd64/amd64/locore0.S 5 May 2025 23:02:39 -0000 1.27
> > +++ arch/amd64/amd64/locore0.S 20 Jun 2025 16:31:14 -0000
> > @@ -111,6 +111,9 @@
> > #include
> > #include
> > #include
> > +#include
> > +#include
> > +#include
> >
> > /*
> > * override user-land alignment before including asm.h
> > @@ -193,6 +196,58 @@ bi_size_ok:
> > pushl $PSL_MBO
> > popfl
> >
> > + /*
> > + * Setup temporary #VC trap handler, in case we are running
> > + * on an AMD CPU in SEV-ES guest mode. Will be reset by
> > + * init_x86_64().
> > + * We are setting up two handlers:
> > + *
> > + * 1) locore_vc_trap32: Triggered when we are running in
> > + * 32-bit legacy mode.
> > + *
> > + * 2) locore_vc_trap64: Triggered when we are running in
> > + * 32-bit compatibility mode.
> > + *
> > + * The latter one is used by vmd(8).
> > + */
> > + movl $RELOC(early_idt), %ecx
> > + movl $T_VC, %edx
> > + leal (%ecx, %edx, 8), %ecx /* 32bit #VC IDT slot */
> > +
> > + pushl %cs /* get current %cs */
> > + popl %ebx
> > + shll $16, %ebx
> > +
> > + movl $RELOC(locore_vc_trap32), %eax
> > + andl $0x0000ffff, %eax
> > + orl %ebx, %eax /* use current %cs */
> > + movl %eax, (%ecx)
> > +
> > + movl $RELOC(locore_vc_trap32), %eax
> > + andl $0xffff0000, %eax
> > + orl $((0x80 | SDT_SYS386IGT) << 8), %eax
> > + movl %eax, 4(%ecx)
> > +
> > + movl $RELOC(early_idt), %ecx
> > + movl $(2 * T_VC), %edx
> > + leal (%ecx, %edx, 8), %ecx /* 64bit #VC IDT slot */
> > +
> > + movl $RELOC(locore_vc_trap64), %eax
> > + andl $0x0000ffff, %eax
> > + orl $(GSEL(3, SEL_KPL) << 16), %eax
> > + movl %eax, (%ecx)
> > +
> > + movl $RELOC(locore_vc_trap64), %eax
> > + andl $0xffff0000, %eax
> > + orl $((0x80 | SDT_SYS386IGT) << 8), %eax
> > + movl %eax, 4(%ecx)
> > + xorl %eax, %eax
> > + movl %eax, 8(%ecx)
> > + movl %eax, 12(%ecx)
> > +
> > + movl $RELOC(idtlc), %eax
> > + lidt (%eax)
> > +
> > /* Reset debug control registers */
> > xorl %eax,%eax
> > movl %eax,%dr6
> > @@ -631,8 +686,14 @@ store_pte:
> > */
> > movl $MSR_EFER,%ecx
> > rdmsr
> > + movl %eax,%ebx
> > xorl %eax,%eax /* XXX */
> > orl $(EFER_LME|EFER_SCE),%eax
> > + /* If set, preserve LMA */
> > + testl $EFER_LMA,%ebx
> > + jz efer_nxe
> > + orl $EFER_LMA,%eax
> > +efer_nxe:
> > movl RELOC((pg_nx + 4)), %ebx
> > cmpl $0, %ebx
> > je write_efer
> > @@ -745,6 +806,118 @@ longmode_hi:
> > call init_x86_64
> > call main
> >
> > +vc_cpuid64:
> > + shll $30, %eax /* requested register */
> > + orl $MSR_PROTO_CPUID_REQ, %eax
> > + movl %ebx, %edx /* CPUID function */
> > + movl $MSR_SEV_GHCB, %ecx
> > + wrmsr
> > + rep vmmcall
> > + rdmsr
> > + ret
> > +
> > + .globl locore_vc_trap64
> > +locore_vc_trap64:
> > + pushq %rax
> > + pushq %rbx
> > + pushq %rcx
> > + pushq %rdx
> > +
> > + cmpl $SVM_VMEXIT_CPUID, 32(%rsp)
> > + jne .Lterminate64
> > +
> > + movl %eax, %ebx /* save CPUID function */
> > +
> > + movl $0, %eax /* request cpuid, get %eax */
> > + call vc_cpuid64
> > + movq %rdx, 24(%rsp)
> > +
> > + movl $1, %eax /* get %ebx */
> > + call vc_cpuid64
> > + movq %rdx, 16(%rsp)
> > +
> > + movl $2, %eax /* get %ecx */
> > + call vc_cpuid64
> > + movq %rdx, 8(%rsp)
> > +
> > + movl $3, %eax /* get %edx */
> > + call vc_cpuid64
> > + movq %rdx, 0(%rsp)
> > +
> > + popq %rdx
> > + popq %rcx
> > + popq %rbx
> > + popq %rax
> > + addq $8, %rsp
> > + addq $2, (%rsp)
> > + iretq
> > +
> > +.Lterminate64:
> > + movl $MSR_PROTO_TERMINATE, %eax
> > + movl $MSR_SEV_GHCB, %ecx
> > + wrmsr
> > + rep vmmcall
> > +.Lterm_loop64:
> > + hlt
> > + jmp .Lterm_loop64
> > +
> > + .code32
> > +vc_cpuid32:
> > + shll $30, %eax /* requested register */
> > + orl $MSR_PROTO_CPUID_REQ, %eax
> > + movl %ebx, %edx /* CPUID function */
> > + movl $MSR_SEV_GHCB, %ecx
> > + wrmsr
> > + rep vmmcall
> > + rdmsr
> > + ret
> > +
> > + .globl locore_vc_trap32
> > +locore_vc_trap32:
> > + pushl %eax
> > + pushl %ebx
> > + pushl %ecx
> > + pushl %edx
> > +
> > + cmpl $SVM_VMEXIT_CPUID, 16(%esp)
> > + jne .Lterminate32
> > +
> > + movl %eax, %ebx /* save CPUID function */
> > +
> > + movl $0, %eax /* request cpuid, get %eax */
> > + call vc_cpuid32
> > + movl %edx, 12(%esp)
> > +
> > + movl $1, %eax /* get %ebx */
> > + call vc_cpuid32
> > + movl %edx, 8(%esp)
> > +
> > + movl $2, %eax /* get %ecx */
> > + call vc_cpuid32
> > + movl %edx, 4(%esp)
> > +
> > + movl $3, %eax /* get %edx */
> > + call vc_cpuid32
> > + movl %edx, 0(%esp)
> > +
> > + popl %edx
> > + popl %ecx
> > + popl %ebx
> > + popl %eax
> > + addl $4, %esp
> > + addl $2, (%esp)
> > + iret
> > +
> > +.Lterminate32:
> > + movl $MSR_PROTO_TERMINATE, %eax
> > + movl $MSR_SEV_GHCB, %ecx
> > + wrmsr
> > + rep vmmcall
> > +.Lterm_loop32:
> > + hlt
> > + jmp .Lterm_loop32
> > +
> > +
> > .section .codepatch,"a"
> > .align 8, 0xcc
> > .globl codepatch_begin
> > @@ -757,6 +930,20 @@ codepatch_end:
> > .previous
> >
> > .data
> > + .globl idtlc /* temporary locore IDT */
> > +idtlc:
> > + .word early_idt_end-early_idt-1
> > + .long _RELOC(early_idt)
> > + .align 64, 0xcc
> > +
> > + .globl early_idt
> > +early_idt:
> > + .rept NIDT
> > + .quad 0x0000000000000000
> > + .quad 0x0000000000000000
> > + .endr
> > +early_idt_end:
> > +
> > .globl gdt64
> > gdt64:
> > .word gdt64_end-gdt64_start-1
> > Index: arch/amd64/include/ghcb.h
> > ===================================================================
> > RCS file: /data/mirror/openbsd/cvs/src/sys/arch/amd64/include/ghcb.h,v
> > diff -u -p -r1.2 ghcb.h
> > --- arch/amd64/include/ghcb.h 28 May 2025 07:59:05 -0000 1.2
> > +++ arch/amd64/include/ghcb.h 20 Jun 2025 16:31:14 -0000
> > @@ -19,6 +19,8 @@
> > #ifndef _MACHINE_GHCB_H_
> > #define _MACHINE_GHCB_H_
> >
> > +#ifndef _LOCORE
> > +
> > #include
> >
> > #define GHCB_OFFSET(m) ((m) / 8)
> > @@ -98,12 +100,15 @@ struct ghcb_sync {
> > int sz_c;
> > int sz_d;
> > };
> > +#endif /* !_LOCORE */
> >
> > /* Definitions used with the MSR protocol */
> > #define MSR_PROTO_CPUID_REQ 0x4
> > #define MSR_PROTO_CPUID_RESP 0x5
> > #define MSR_PROTO_TERMINATE 0x100
> >
> > +#ifndef _LOCORE
> > +
> > void ghcb_clear(struct ghcb_sa *);
> > int ghcb_valbm_set(uint8_t *, int);
> > int ghcb_valbm_isset(uint8_t *, int);
> > @@ -114,5 +119,7 @@ void ghcb_sync_val(int, int, struct ghcb
> > void ghcb_sync_out(struct trapframe *, uint64_t, uint64_t, uint64_t,
> > struct ghcb_sa *, struct ghcb_sync *);
> > void ghcb_sync_in(struct trapframe *, struct ghcb_sa *, struct ghcb_sync *);
> > +
> > +#endif /* !_LOCORE */
> >
> > #endif /* !_MACHINE_GHCB_H_ */
> > Index: arch/amd64/include/trap.h
> > ===================================================================
> > RCS file: /data/mirror/openbsd/cvs/src/sys/arch/amd64/include/trap.h,v
> > diff -u -p -r1.5 trap.h
> > --- arch/amd64/include/trap.h 15 Apr 2023 01:22:50 -0000 1.5
> > +++ arch/amd64/include/trap.h 20 Jun 2025 16:31:14 -0000
> > @@ -62,3 +62,4 @@
> > #define T_XMM 19 /* SSE FP exception */
> > #define T_VE 20 /* virtualization exception */
> > #define T_CP 21 /* control protection exception */
> > +#define T_VC 29 /* VMM communication exception */
> > Index: arch/amd64/include/vmmvar.h
> > ===================================================================
> > RCS file: /data/mirror/openbsd/cvs/src/sys/arch/amd64/include/vmmvar.h,v
> > diff -u -p -r1.114 vmmvar.h
> > --- arch/amd64/include/vmmvar.h 24 May 2025 12:47:00 -0000 1.114
> > +++ arch/amd64/include/vmmvar.h 20 Jun 2025 16:31:14 -0000
> > @@ -21,6 +21,8 @@
> > #ifndef _MACHINE_VMMVAR_H_
> > #define _MACHINE_VMMVAR_H_
> >
> > +#ifndef _LOCORE
> > +
> > #define VMM_HV_SIGNATURE "OpenBSDVMM58"
> >
> > /* VMX: Basic Exit Reasons */
> > @@ -94,6 +96,8 @@
> > #define VMX_MAX_CR3_TARGETS 256
> > #define VMX_VMCS_PA_CLEAR 0xFFFFFFFFFFFFFFFFUL
> >
> > +#endif /* ! _LOCORE */
> >
> > /*
> > * SVM: Intercept codes (exit reasons)
> > */
> > @@ -262,6 +266,8 @@
> > #define SVM_VMEXIT_VMGEXIT 0x403
> > #define SVM_VMEXIT_INVALID -1
> >
> > +#ifndef _LOCORE
> > +
> > /*
> > * Exception injection vectors (these correspond to the CPU exception types
> > * defined in the SDM.)
> > @@ -1056,5 +1062,7 @@ int vcpu_reset_regs(struct vcpu *, struc
> > int svm_get_vmsa_pa(uint32_t, uint32_t, uint64_t *);
> >
> > #endif /* _KERNEL */
> > +
> > +#endif /* ! _LOCORE */
> >
> > #endif /* ! _MACHINE_VMMVAR_H_ */
> >