From: hshoexer Subject: Re: vmm(4): For SEV-ES enabled guest raise #VC on MMIO related #NPF To: tech@openbsd.org Cc: mlarkin@openbsd.org, dv@openbsd.org Date: Wed, 26 Nov 2025 10:47:00 +0100 On Mon, Nov 17, 2025 at 08:44:21PM +0100, hshoexer wrote: > On Sun, Nov 16, 2025 at 06:17:24PM -0800, Mike Larkin wrote: > > On Thu, Nov 13, 2025 at 01:04:08PM +0100, hshoexer wrote: > > > Hi, > > > > > > for SEV-ES enabled guests MMIO should raise a #VC trap with error > > > code SVM_VMEXIT_NPF. This is required as vmm(4) can not analyze > > > the guest code causing the exit; the guest itself has to assist > > > which requires a #VC trap being raised. > > > > > > As SEV-ES guest we try to avoid MMIO by using paravirtualization. > > > Nonetheless, vmm(4) should enforce #NPF/MMIO for SEV guests; e.g. > > > to find bugs. > > > > > > Enforcing #VC on MMIO is accomplished by setting up a mapping in > > > the nested page table that has reserved bits set. The bits of the > > > physical address portion of the PTE that are affected by physical > > > address bit reduction are reserved. When these bits are set, a #VC > > > trap with error code SVM_VMEXIT_NPF is raised when the guest resumes. > > > Further MMIO to the same page will raise #VC right away. > > > > > > This approach is recommended by [1] section 4.1.5. It can be tested > > > with regress/sys/arch/amd64/seves_mmio/. > > > > > > For non-SEV-ES guests nothing changes, we forward the fault to > > > vmd(8) as before. > > > > > > [1] https://docs.amd.com/v/u/en-US/56421 > > > > > > > I will need more detail as to what is going on in the diff below, before I can > > proceed. > > > > it looks like you are letting the guest map any gpa to (?) some HPA whose > > calculation is not obvious or clear. are you trying to force a bad mapping? > > why not just use PROT_NONE for the mapping and force it to some completely > > bogus HPA? that way you're sure the guest can't sneak some mappings past you? > > > > or is this what you are doing? I can't understand from the diff. is this what > > the "address reduction" part is? > > yes, that's what I'm trying to do. > > But let's take a step back: On amd64 physical addresses have a > maximum size of 52 bits. The top most 12 bits are reserved (and > in the PTE these 12 bits are used for eg. NX XO, etc.) > > With SEV enabled the bit 51 is repurposed as crypt bit, indicating > whether the addressed memory is to be encrypted or not. The next > few bits are used for the ASID. > > Now: According to [2] p. 37 the actually addressable physical > memory and the number of ASID is determind by the firmware. On my > EPYC 9124 the physical address size is reduced from 52 to 46 > (indicated by cpuid 0x000001f.ebx). The bits 51:47 are considered > reserved. > > Now, to raise a #VC exception on MMIO the GHCB manual [1] section > 4.1.5 claims that the nested page table needs an entry for MMIO > space that has a physcial address with these reserved bits set. > So the entry looks like this: 0x000fc00000000000. > > So any GPA that is assigned to the MMIO range of physcial addresses > (type VMM_MEM_TYPE_MMIO) will raise a nested page fault on (first) > access. In case of a SEV enabled guest I'd say we now have to set > up a nested page table entry, that has those reserved bits set. > When resuming the guest the access is retried and now yields #VC > (and on future access, too). And we can use that particular entry > for any GPA from the MMIO range. > > And that's what I'm trying to do with that diff. And > regress/sys/arch/amd64/seves_mmio/ yields a positive result (ie. > #VC gets raised). > > When I just set an entry with PROT_NONE I keep getting NP faults > (to be handle by vmm) and not a #VC in the guest. does this explanation make sense to you? > [1] https://docs.amd.com/v/u/en-US/56421 > [2] https://www.amd.com/content/dam/amd/en/documents/epyc-technical-docs/tuning-guides/58207-using-sev-with-amd-epyc-processors.pdf > > > also, I'm not sure we can just "only" use pmap_enter anymore to put mappings > > in the guest; I know dv was doing work in this area a few months ago WRT > > uvm_share and its interaction with uvm_fault. do we need to use uvm_fault here? > > if so, what would the permissions be? (dv?) > > good point! I'm not sure. dv any advice? any suggestions regarding the pmap_enter() chunk? Am I missing something there? Take care, HJ. > > -ml > > > > > > > ---------------------------------- > > > diff --git a/sys/arch/amd64/amd64/identcpu.c b/sys/arch/amd64/amd64/identcpu.c > > > index ac4e845a1aa..86f05f1a7bf 100644 > > > --- a/sys/arch/amd64/amd64/identcpu.c > > > +++ b/sys/arch/amd64/amd64/identcpu.c > > > @@ -67,6 +67,8 @@ int cpuspeed; > > > > > > int amd64_has_xcrypt; > > > int amd64_pos_cbit; /* C bit position for SEV */ > > > +int amd64_phys_addrsz; /* Physcial address size */ > > > +int amd64_phys_red; /* Physcial address size reduction */ > > > int amd64_min_noes_asid; > > > int has_rdrand; > > > int has_rdseed; > > > @@ -676,8 +678,9 @@ identifycpu(struct cpu_info *ci) > > > /* speculation control features */ > > > if (ci->ci_vendor == CPUV_AMD) { > > > if (ci->ci_pnfeatset >= 0x80000008) { > > > - CPUID(0x80000008, dummy, ci->ci_feature_amdspec_ebx, > > > - dummy, dummy); > > > + CPUID(0x80000008, ci->ci_feature_amdspec_eax, > > > + ci->ci_feature_amdspec_ebx, dummy, dummy); > > > + amd64_phys_addrsz = ci->ci_feature_amdspec_eax & 0xff; > > > pcpuid(ci, "80000008", 'b', > > > CPUID_MEMBER(ci_feature_amdspec_ebx), > > > CPUID_AMDSPEC_EBX_BITS); > > > @@ -711,6 +714,7 @@ identifycpu(struct cpu_info *ci) > > > 'd', CPUID_MEMBER(ci_feature_amdsev_edx), > > > CPUID_AMDSEV_EDX_BITS); > > > amd64_pos_cbit = (ci->ci_feature_amdsev_ebx & 0x3f); > > > + amd64_phys_red = ((ci->ci_feature_amdsev_ebx >> 6) & 0x3f); > > > amd64_min_noes_asid = ci->ci_feature_amdsev_edx; > > > if (cpu_sev_guestmode && CPU_IS_PRIMARY(ci)) > > > printf("\n%s: SEV%s guest mode", ci->ci_dev->dv_xname, > > > diff --git a/sys/arch/amd64/amd64/vmm_machdep.c b/sys/arch/amd64/amd64/vmm_machdep.c > > > index abcd177a700..be6c0aea6ca 100644 > > > --- a/sys/arch/amd64/amd64/vmm_machdep.c > > > +++ b/sys/arch/amd64/amd64/vmm_machdep.c > > > @@ -5023,6 +5023,7 @@ int > > > svm_handle_np_fault(struct vcpu *vcpu) > > > { > > > uint64_t gpa; > > > + paddr_t hpa; > > > int gpa_memtype, ret = 0; > > > struct vmcb *vmcb = (struct vmcb *)vcpu->vc_control_va; > > > struct vm_exit_eptviolation *vee = &vcpu->vc_exit.vee; > > > @@ -5039,14 +5040,25 @@ svm_handle_np_fault(struct vcpu *vcpu) > > > ret = svm_fault_page(vcpu, gpa); > > > break; > > > case VMM_MEM_TYPE_MMIO: > > > - vee->vee_fault_type = VEE_FAULT_MMIO_ASSIST; > > > - if (ci->ci_vmm_cap.vcc_svm.svm_decode_assist) { > > > - vee->vee_insn_len = vmcb->v_n_bytes_fetched; > > > - memcpy(&vee->vee_insn_bytes, vmcb->v_guest_ins_bytes, > > > - sizeof(vee->vee_insn_bytes)); > > > - vee->vee_insn_info |= VEE_BYTES_VALID; > > > + if (vcpu->vc_seves) { > > > + vee->vee_fault_type = VEE_FAULT_HANDLED; > > > + /* Setup invalid mapping to raise #VC in guest */ > > > + hpa = (1ULL << amd64_phys_red) - 1; > > > + hpa <<= (amd64_phys_addrsz - amd64_phys_red); > > > + ret = pmap_enter(vcpu->vc_parent->vm_pmap, > > > + trunc_page(gpa), hpa, > > > + PROT_READ | PROT_WRITE | PROT_EXEC, 0); > > > + } else { > > > + vee->vee_fault_type = VEE_FAULT_MMIO_ASSIST; > > > + if (ci->ci_vmm_cap.vcc_svm.svm_decode_assist) { > > > + vee->vee_insn_len = vmcb->v_n_bytes_fetched; > > > + memcpy(&vee->vee_insn_bytes, > > > + vmcb->v_guest_ins_bytes, > > > + sizeof(vee->vee_insn_bytes)); > > > + vee->vee_insn_info |= VEE_BYTES_VALID; > > > + } > > > + ret = EAGAIN; > > > } > > > - ret = EAGAIN; > > > break; > > > default: > > > printf("%s: unknown memory type %d for GPA 0x%llx\n", > > > diff --git a/sys/arch/amd64/include/cpu.h b/sys/arch/amd64/include/cpu.h > > > index c7d34e634cf..2c8616dede9 100644 > > > --- a/sys/arch/amd64/include/cpu.h > > > +++ b/sys/arch/amd64/include/cpu.h > > > @@ -170,6 +170,7 @@ struct cpu_info { > > > u_int32_t ci_feature_sefflags_ebx;/* [I] */ > > > u_int32_t ci_feature_sefflags_ecx;/* [I] */ > > > u_int32_t ci_feature_sefflags_edx;/* [I] */ > > > + u_int32_t ci_feature_amdspec_eax; /* [I] */ > > > u_int32_t ci_feature_amdspec_ebx; /* [I] */ > > > u_int32_t ci_feature_amdsev_eax; /* [I] */ > > > u_int32_t ci_feature_amdsev_ebx; /* [I] */ > > > @@ -418,6 +419,8 @@ void identifycpu(struct cpu_info *); > > > int cpu_amd64speed(int *); > > > extern int cpuspeed; > > > extern int amd64_pos_cbit; > > > +extern int amd64_phys_addrsz; > > > +extern int amd64_phys_red; > > > extern int amd64_min_noes_asid; > > > > > > /* machdep.c */ > > > > > >