SEV-ES guest: locore #VC trap handling
On Fri, Jun 20, 2025 at 09:06:37PM +0200, Alexander Bluhm wrote:
> On Fri, Jun 20, 2025 at 12:10:19AM -0700, Mike Larkin wrote:
> > On Sun, Jun 15, 2025 at 04:55:12PM +0200, Alexander Bluhm wrote:
> > > On Sat, Jun 14, 2025 at 10:23:53PM -0700, Mike Larkin wrote:
> > > > > On Wed, May 21, 2025 at 05:10:27PM +0200, Hans-Jörg Höxer wrote:
> > > > > Hi,
> > > > >
> > > > > this change deals with locore for SEV-ES enabled guests. The approach
> > > > > might be a bit controversial. And it requires a diff for vmd(8), that
> > > > > I've also attached, to simplify the discussion:
> > > > >
> > > > > SEV-ES guest: locore #VC trap handling
> > > > >
> > > > > When locore is executed by a SEV-ES enabled guest the first cpuid
> > > > > instruction will raise a #VC trap that will need to be handled.
> > > > > However, at that point in time the guest does not know whether it's
> > > > > a guest at all, if it is running on an AMD cpu with SEV-ES enabled,
> > > > > etc.
> > > > >
> > > > > To resolve this chicken-egg situation we unconditionally set up
> > > > > a #VC trap handler.  If we are not a SEV-ES enabled guest, or we
> > > > > are running on some non-AMD CPU, we will not raise #VC
> > > > > (hopefully).  On Intel CPUs the vector for #VC is reserved.
> > > > >
> > > > > As vmd(8) configures the runtime for locore to be in 32 bit
> > > > > compatibility mode a raised #VC exception will switch to long mode.
> > > > > And the CPU will expect a 64 bit entry in the IDT. When running
> > > > > on e.g. KVM, locore is executed in 32 bit legacy mode.  There the
> > > > > CPU will expect a 32 bit entry in the IDT.
> > > > >
> > > > > To accommodate both situations, we set up both 64 and 32 bit handlers
> > > > > in the IDT.
> > > > >
> > > > > Additionally, vmd(8) has to setup a long mode segment in the GDT.
> > > > >
> > > > > Both #VC trap handlers use the MSR protocol to talk to the hypervisor
> > > > > to emulate CPUID. The MSR protocol only supports "simple" CPUID
> > > > > without subfunctions.
> > > > >
> > > > > Note: When SEV-ES is enabled, the hypervisor can not intercept
> > > > > writes to EFER beforehand, only after the write. Thus on vmm(4)
> > > > > with directly executed kernel we are in compatibility mode and
> > > > > EFER_LMA is set. As resetting EFER_LMA raises #GP we have to
> > > > > preserve it.
> > > > >
> > > > > Take care,
> > > > > HJ.
> > > > >
> > > >
> > > > This one probably needs a lot of testing. What tests have you guys done so
> > > > far? My SEV-ES machine is available now, I will test this and let you know
> > > > but we probably want a lot of testing from different host/guest combos.
> > >
> > > I know that this part does not break SEV (without ES) on vmm/vmd
> > > and kvm/qemu.
> > >
> > > Below is the complete diff that hshoexer@ sent some months ago
> > > rebased to current. With that I can run OpenBSD guest with SEV-ES
> > > in vmm/vmd.  On kvm/qemu SEV still works, but I cannot get a guest
> > > running with SEV-ES.
> > >
> > > So if you agree with the locore0 part, that was in the original
> > > mail of this thread, I would like to commit it.
> > >
> > > bluhm
> > >
> >
> > I think this diff needs a fair amount of work before it can go in.
>
> The latest diff I sent is everything that is missing for SEV-ES
> support.  It has some old XXX from months ago that have since been
> resolved.  It was never intended for commit; I provided it just in
> case you want to test the whole thing. In your previous mail you
> said that you want to test something. The huge diff was not intended
> for review.
>
> > See below for comments. This is going to require a ton of testing. How do
> > you want to proceed? For example, I think this needs to be tested as a
> > guest on a number of sev/sev-es capable hypervisors and a wide variety
> > of hardware. You already mentioned one breakage above, right?
>
> The only problem I see is SEV-ES on Linux KVM/qemu host. But we
> never had support for that before. I would like to address this
> after OpenBSD running on vmm/vmd has been committed.
>
> I do have Linux KVM/qemu and OpenBSD vmm/vmd on AMD and Intel. On
> AMD I can run SEV with OpenBSD guest, but not SEV-ES because the
> diff is not fully committed yet.  As SEV-ES is currently not supported,
> this is not a regression.
>
> Below is the diff that needs review. It is the locore0 part of the
> original mail in this thread. The only testing that has to be done
> is that it does not break anything. SEV-ES will not work, as it
> is not complete.
>
> Two of your remarks still apply with this next step.
>
> > > @@ -193,6 +194,58 @@ bi_size_ok:
> > > pushl $PSL_MBO
> > > popfl
> > >
> > > + /*
> > > + * Setup temporary #VC trap handler, in case we are running
> > > + * on an AMD CPU in SEV-ES guest mode. Will be reset by
> > > + * init_x86_64().
> > > + * We are setting up two handlers:
> > > + *
> > > + * 1) locore_vc_trap32: Triggered when we are running in
> > > + * 32-bit legacy mode.
> > > + *
> > > + * 2) locore_vc_trap64: Triggered when we are running in
> > > + * 32-bit compatibility mode.
> > > + *
> > > + * The latter one is used by vmd(8).
> >
> > Please clarify; *when* is this used by vmd? I believe you mean when
> > we do a direct kernel launch? If not, then why do we need the 32 bit
> > one?
>
> There are two ways we may enter the kernel.  KVM/qemu uses a special
> TianoCore EFI implementation.  For vmm/vmd currently we support
> direct kernel boot only. hshoexer@ explained to me why we need
> both 32 bit methods here, but I forgot.
>
> > > /* XXX merge these */
> > > call init_x86_64
> > > call main
> > >
> > > + /* MSR Protocol Request Codes */
> > > +#define MSRPROTO_CPUID_REQ 0x4
> > > +#define MSRPROTO_TERM_REQ 0x100
> > > +
> > > +vc_cpuid64:
> > > + shll $30, %eax /* requested register */
> > > + orl $MSRPROTO_CPUID_REQ, %eax
> > > + movl %ebx, %edx /* CPUID function */
> > > + movl $MSR_SEV_GHCB, %ecx
> > > + wrmsr
> > > + rep vmmcall
> >
> > Out of curiosity, why is the rep prefix needed here?
>
> I don't know. hshoexer?
>
> Below is the next diff, that I want to commit. It is guest side
> only. My testing can only cover that nothing breaks.
>
> ok?
>
> bluhm
>
I think this locore diff below looks ok to me, but it changes some very
fundamental early boot behavior. I was asking before what your plan is for
testing this before commit. I don't think testing this on just a handful
of machines is sufficient. It's particularly tough with this one: if it
breaks someone's machine, it will pretty much be bricked and require fixing
via install media. That's why I'm hesitant.
If you are ready to handle any fallout, then sure. It's probably the right
time in the release process for this. ok mlarkin on the diff itself in that
case. There are a few things we can change later (like using .Lxxx in all
symbols in locore0 and there are a few style(9) things) but nothing that
warrants fixing now.
P.S. hshoexer: The locore0 page gets unmapped right after we are done with it;
can you please verify that the #VC traps and locore0 IDTs aren't needed after
that unmapping?
-ml
> On Wed, May 21, 2025 at 05:19:14PM +0200, Hans-Jörg Höxer wrote:
> > > SEV-ES guest: locore #VC trap handling
> > >
> > > When locore is executed by a SEV-ES enabled guest the first cpuid
> > > instruction will raise a #VC trap that will need to be handled.
> > > However, at that point in time the guest does not know whether it's
> > > a guest at all, if it is running on an AMD cpu with SEV-ES enabled,
> > > etc.
> > >
> > > To resolve this chicken-egg situation we unconditionally set up a
> > > #VC trap handler.  If we are not a SEV-ES enabled guest, or we are
> > > running on some non-AMD CPU, we will not raise #VC (hopefully).  On
> > > Intel CPUs the vector for #VC is reserved.
> > >
> > > As vmd(8) configures the runtime for locore to be in 32 bit
> > > compatibility mode a raised #VC exception will switch to long mode.
> > > And the CPU will expect a 64 bit entry in the IDT. When running
> > > on e.g. KVM, locore is executed in 32 bit legacy mode.  There the
> > > CPU will expect a 32 bit entry in the IDT.
> > >
> > > To accommodate both situations, we set up both 64 and 32 bit handlers
> > > in the IDT.
> > >
> > > Additionally, vmd(8) has to setup a long mode segment in the GDT.
> > >
> > > Both #VC trap handlers use the MSR protocol to talk to the hypervisor
> > > to emulate CPUID. The MSR protocol only supports "simple" CPUID
> > > without subfunctions.
> > >
> > > Note: When SEV-ES is enabled, the hypervisor cannot intercept
> > > writes to EFER beforehand, only after the write. Thus on vmm(4)
> > > with directly executed kernel we are in compatibility mode and
> > > EFER_LMA is set. As resetting EFER_LMA raises #GP we have to
> > > preserve it.
>
> Index: arch/amd64/amd64/locore0.S
> ===================================================================
> RCS file: /data/mirror/openbsd/cvs/src/sys/arch/amd64/amd64/locore0.S,v
> diff -u -p -r1.27 locore0.S
> --- arch/amd64/amd64/locore0.S 5 May 2025 23:02:39 -0000 1.27
> +++ arch/amd64/amd64/locore0.S 20 Jun 2025 16:31:14 -0000
> @@ -111,6 +111,9 @@
> #include <machine/param.h>
> #include <machine/segments.h>
> #include <machine/specialreg.h>
> +#include <machine/trap.h>
> +#include <machine/ghcb.h>
> +#include <machine/vmmvar.h>
>
> /*
> * override user-land alignment before including asm.h
> @@ -193,6 +196,58 @@ bi_size_ok:
> pushl $PSL_MBO
> popfl
>
> + /*
> + * Setup temporary #VC trap handler, in case we are running
> + * on an AMD CPU in SEV-ES guest mode. Will be reset by
> + * init_x86_64().
> + * We are setting up two handlers:
> + *
> + * 1) locore_vc_trap32: Triggered when we are running in
> + * 32-bit legacy mode.
> + *
> + * 2) locore_vc_trap64: Triggered when we are running in
> + * 32-bit compatibility mode.
> + *
> + * The latter one is used by vmd(8).
> + */
> + movl $RELOC(early_idt), %ecx
> + movl $T_VC, %edx
> + leal (%ecx, %edx, 8), %ecx /* 32bit #VC IDT slot */
> +
> + pushl %cs /* get current %cs */
> + popl %ebx
> + shll $16, %ebx
> +
> + movl $RELOC(locore_vc_trap32), %eax
> + andl $0x0000ffff, %eax
> + orl %ebx, %eax /* use current %cs */
> + movl %eax, (%ecx)
> +
> + movl $RELOC(locore_vc_trap32), %eax
> + andl $0xffff0000, %eax
> + orl $((0x80 | SDT_SYS386IGT) << 8), %eax
> + movl %eax, 4(%ecx)
> +
> + movl $RELOC(early_idt), %ecx
> + movl $(2 * T_VC), %edx
> + leal (%ecx, %edx, 8), %ecx /* 64bit #VC IDT slot */
> +
> + movl $RELOC(locore_vc_trap64), %eax
> + andl $0x0000ffff, %eax
> + orl $(GSEL(3, SEL_KPL) << 16), %eax
> + movl %eax, (%ecx)
> +
> + movl $RELOC(locore_vc_trap64), %eax
> + andl $0xffff0000, %eax
> + orl $((0x80 | SDT_SYS386IGT) << 8), %eax
> + movl %eax, 4(%ecx)
> + xorl %eax, %eax
> + movl %eax, 8(%ecx)
> + movl %eax, 12(%ecx)
> +
> + movl $RELOC(idtlc), %eax
> + lidt (%eax)
> +
> /* Reset debug control registers */
> xorl %eax,%eax
> movl %eax,%dr6
> @@ -631,8 +686,14 @@ store_pte:
> */
> movl $MSR_EFER,%ecx
> rdmsr
> + movl %eax,%ebx
> xorl %eax,%eax /* XXX */
> orl $(EFER_LME|EFER_SCE),%eax
> + /* If set, preserve LMA */
> + testl $EFER_LMA,%ebx
> + jz efer_nxe
> + orl $EFER_LMA,%eax
> +efer_nxe:
> movl RELOC((pg_nx + 4)), %ebx
> cmpl $0, %ebx
> je write_efer
> @@ -745,6 +806,118 @@ longmode_hi:
> call init_x86_64
> call main
>
> +vc_cpuid64:
> + shll $30, %eax /* requested register */
> + orl $MSR_PROTO_CPUID_REQ, %eax
> + movl %ebx, %edx /* CPUID function */
> + movl $MSR_SEV_GHCB, %ecx
> + wrmsr
> + rep vmmcall
> + rdmsr
> + ret
> +
> + .globl locore_vc_trap64
> +locore_vc_trap64:
> + pushq %rax
> + pushq %rbx
> + pushq %rcx
> + pushq %rdx
> +
> + cmpl $SVM_VMEXIT_CPUID, 32(%rsp)
> + jne .Lterminate64
> +
> + movl %eax, %ebx /* save CPUID function */
> +
> + movl $0, %eax /* request cpuid, get %eax */
> + call vc_cpuid64
> + movq %rdx, 24(%rsp)
> +
> + movl $1, %eax /* get %ebx */
> + call vc_cpuid64
> + movq %rdx, 16(%rsp)
> +
> + movl $2, %eax /* get %ecx */
> + call vc_cpuid64
> + movq %rdx, 8(%rsp)
> +
> + movl $3, %eax /* get %edx */
> + call vc_cpuid64
> + movq %rdx, 0(%rsp)
> +
> + popq %rdx
> + popq %rcx
> + popq %rbx
> + popq %rax
> + addq $8, %rsp
> + addq $2, (%rsp)
> + iretq
> +
> +.Lterminate64:
> + movl $MSR_PROTO_TERMINATE, %eax
> + movl $MSR_SEV_GHCB, %ecx
> + wrmsr
> + rep vmmcall
> +.Lterm_loop64:
> + hlt
> + jmp .Lterm_loop64
> +
> + .code32
> +vc_cpuid32:
> + shll $30, %eax /* requested register */
> + orl $MSR_PROTO_CPUID_REQ, %eax
> + movl %ebx, %edx /* CPUID function */
> + movl $MSR_SEV_GHCB, %ecx
> + wrmsr
> + rep vmmcall
> + rdmsr
> + ret
> +
> + .globl locore_vc_trap32
> +locore_vc_trap32:
> + pushl %eax
> + pushl %ebx
> + pushl %ecx
> + pushl %edx
> +
> + cmpl $SVM_VMEXIT_CPUID, 16(%esp)
> + jne .Lterminate32
> +
> + movl %eax, %ebx /* save CPUID function */
> +
> + movl $0, %eax /* request cpuid, get %eax */
> + call vc_cpuid32
> + movl %edx, 12(%esp)
> +
> + movl $1, %eax /* get %ebx */
> + call vc_cpuid32
> + movl %edx, 8(%esp)
> +
> + movl $2, %eax /* get %ecx */
> + call vc_cpuid32
> + movl %edx, 4(%esp)
> +
> + movl $3, %eax /* get %edx */
> + call vc_cpuid32
> + movl %edx, 0(%esp)
> +
> + popl %edx
> + popl %ecx
> + popl %ebx
> + popl %eax
> + addl $4, %esp
> + addl $2, (%esp)
> + iret
> +
> +.Lterminate32:
> + movl $MSR_PROTO_TERMINATE, %eax
> + movl $MSR_SEV_GHCB, %ecx
> + wrmsr
> + rep vmmcall
> +.Lterm_loop32:
> + hlt
> + jmp .Lterm_loop32
> +
> +
> .section .codepatch,"a"
> .align 8, 0xcc
> .globl codepatch_begin
> @@ -757,6 +930,20 @@ codepatch_end:
> .previous
>
> .data
> + .globl idtlc /* temporary locore IDT */
> +idtlc:
> + .word early_idt_end-early_idt-1
> + .long _RELOC(early_idt)
> + .align 64, 0xcc
> +
> + .globl early_idt
> +early_idt:
> + .rept NIDT
> + .quad 0x0000000000000000
> + .quad 0x0000000000000000
> + .endr
> +early_idt_end:
> +
> .globl gdt64
> gdt64:
> .word gdt64_end-gdt64_start-1
> Index: arch/amd64/include/ghcb.h
> ===================================================================
> RCS file: /data/mirror/openbsd/cvs/src/sys/arch/amd64/include/ghcb.h,v
> diff -u -p -r1.2 ghcb.h
> --- arch/amd64/include/ghcb.h 28 May 2025 07:59:05 -0000 1.2
> +++ arch/amd64/include/ghcb.h 20 Jun 2025 16:31:14 -0000
> @@ -19,6 +19,8 @@
> #ifndef _MACHINE_GHCB_H_
> #define _MACHINE_GHCB_H_
>
> +#ifndef _LOCORE
> +
> #include <machine/frame.h>
>
> #define GHCB_OFFSET(m) ((m) / 8)
> @@ -98,12 +100,15 @@ struct ghcb_sync {
> int sz_c;
> int sz_d;
> };
> +#endif /* !_LOCORE */
>
> /* Definitions used with the MSR protocol */
> #define MSR_PROTO_CPUID_REQ 0x4
> #define MSR_PROTO_CPUID_RESP 0x5
> #define MSR_PROTO_TERMINATE 0x100
>
> +#ifndef _LOCORE
> +
> void ghcb_clear(struct ghcb_sa *);
> int ghcb_valbm_set(uint8_t *, int);
> int ghcb_valbm_isset(uint8_t *, int);
> @@ -114,5 +119,7 @@ void ghcb_sync_val(int, int, struct ghcb
> void ghcb_sync_out(struct trapframe *, uint64_t, uint64_t, uint64_t,
> struct ghcb_sa *, struct ghcb_sync *);
> void ghcb_sync_in(struct trapframe *, struct ghcb_sa *, struct ghcb_sync *);
> +
> +#endif /* !_LOCORE */
>
> #endif /* !_MACHINE_GHCB_H_ */
> Index: arch/amd64/include/trap.h
> ===================================================================
> RCS file: /data/mirror/openbsd/cvs/src/sys/arch/amd64/include/trap.h,v
> diff -u -p -r1.5 trap.h
> --- arch/amd64/include/trap.h 15 Apr 2023 01:22:50 -0000 1.5
> +++ arch/amd64/include/trap.h 20 Jun 2025 16:31:14 -0000
> @@ -62,3 +62,4 @@
> #define T_XMM 19 /* SSE FP exception */
> #define T_VE 20 /* virtualization exception */
> #define T_CP 21 /* control protection exception */
> +#define T_VC 29 /* VMM communication exception */
> Index: arch/amd64/include/vmmvar.h
> ===================================================================
> RCS file: /data/mirror/openbsd/cvs/src/sys/arch/amd64/include/vmmvar.h,v
> diff -u -p -r1.114 vmmvar.h
> --- arch/amd64/include/vmmvar.h 24 May 2025 12:47:00 -0000 1.114
> +++ arch/amd64/include/vmmvar.h 20 Jun 2025 16:31:14 -0000
> @@ -21,6 +21,8 @@
> #ifndef _MACHINE_VMMVAR_H_
> #define _MACHINE_VMMVAR_H_
>
> +#ifndef _LOCORE
> +
> #define VMM_HV_SIGNATURE "OpenBSDVMM58"
>
> /* VMX: Basic Exit Reasons */
> @@ -94,6 +96,8 @@
> #define VMX_MAX_CR3_TARGETS 256
> #define VMX_VMCS_PA_CLEAR 0xFFFFFFFFFFFFFFFFUL
>
> +#endif /* ! _LOCORE */
> +
> /*
> * SVM: Intercept codes (exit reasons)
> */
> @@ -262,6 +266,8 @@
> #define SVM_VMEXIT_VMGEXIT 0x403
> #define SVM_VMEXIT_INVALID -1
>
> +#ifndef _LOCORE
> +
> /*
> * Exception injection vectors (these correspond to the CPU exception types
> * defined in the SDM.)
> @@ -1056,5 +1062,7 @@ int vcpu_reset_regs(struct vcpu *, struc
> int svm_get_vmsa_pa(uint32_t, uint32_t, uint64_t *);
>
> #endif /* _KERNEL */
> +
> +#endif /* ! _LOCORE */
>
> #endif /* ! _MACHINE_VMMVAR_H_ */
>