Download raw body.
Reworking VMM's nested paging & guest memory (de-vmspace-ification)
[EXT] Re: Reworking VMM's nested paging & guest memory (de-vmspace-ification)
[EXT] Re: Reworking VMM's nested paging & guest memory (de-vmspace-ification)
[EXT] Reworking VMM's nested paging & guest memory (de-vmspace-ification)
On Tue, May 13, 2025 at 03:42:08PM +0200, Hans-Jörg Höxer wrote:
> Hi,
>
> On Tue, May 06, 2025 at 02:34:37PM -0400, Dave Voutila wrote:
> > ...
> > Worked with mpi@ on tuning the lifecycle of uao's and now happy with how
> > this looks. Asking for ok's at this point!
>
> I've tested your diff with my SEV work in progress tree and I do not
> see regressions.
>
> The code looks code to me.
actually the code looks good to me.
> > > The more I looked at refining the direct use of uvm_map() the
> > > more I realized this is also a good time to ditch having the vmd process
> > > "prefind" ranges via mmap(2)/munmap(2) like vmd has been doing. This
> > > version no longer uses UVM_FLAG_FIXED and instead let's uvm_map()
> > > determine the mapping locations. The mappings are now also set with
> > > UVM_FLAG_CONCEAL and MAP_INHERIT_NONE. Lastly, I now set the mappings as
> > > immutable so userland can't muck with them as vmm is supposed to be
> > > managing them.
> > >
> > > Plus, with some slight cheating by tidying up comments, the diff is
> > > technically net-negative! The bulk comes from removing the hacky
> > > behavior in vmd.
> > >
> > > diffstat refs/heads/master refs/heads/vmm-uobj-v5
> > > M sys/arch/amd64/amd64/vmm_machdep.c | 110+ 63-
> > > M sys/dev/vmm/vmm.c | 104+ 46-
> > > M sys/dev/vmm/vmm.h | 10+ 4-
> > > M usr.sbin/vmd/vm.c | 16+ 128-
> > >
> > > 4 files changed, 240 insertions(+), 241 deletions(-)
> > >
> >
> > No longer net-negative...but only net 10 lines, so not bad?
> >
> > 4 files changed, 255 insertions(+), 245 deletions(-)
> >
> > >> Today, vmm uses a UVM vmspace to represent both the nested page table
> > >> (pmap) for the guest as well as guest memory in the form of amap's
> > >> within the vmspace's map. This has sort of worked, but I've been hunting
> > >> constant issues related to vmm triggering uvm_map_teardown inside vmm.
> > >>
> > >> In short, using a full blown UVM vmspace to represent a guest makes us
> > >> do some funky things that are not typical in other hypervisors and leads
> > >> to some lifecycle headaches:
> > >>
> > >> 1. vmm faults pages into the vmspace's map, which is shared via a
> > >> uvm_share() function to share map entries into a target process's
> > >> own address space. (This uvm_share() function exists *only* for vmm
> > >> to assist in sharing guest memory between vmm's vmspace in the
> > >> kernel and the vmd process's address space.)
> > >>
> > >> 2. This also forces use of uvm_share() for vmd child processes that
> > >> need r/w access to guest memory, such as the virtio block/network
> > >> emulation.
> > >>
> > >> 3. Teardown/cleanup means vmm's vmspace may or may not outlive the vmd
> > >> process with which it's sharing mappings.
> > >>
> > >> Now that vmd is multi-process for guest emulation, the extra abstraction
> > >> is simply a burden and increases the surface area to hunt for the source
> > >> of bugs. Plus, vmm is relying on a UVM function no other kernel system
> > >> uses making it a bit of a snowflake.
> > >>
> > >> This diff removes the use of a vmspace by:
> > >>
> > >> 1. creating and managing a pmap directly for representing the nested
> > >> page tables (i.e. EPT page tables on Intel) and using pmap_extract /
> > >> pmap_enter / pmap_remove for working with them.
> > >>
> > >> 2. creates and manages UVM aobjs to represent each slot/range of guest
> > >> memory with an explicit size provided at creation time. These are
> > >> uvm_map()'d into userland vmd processes and no use of uvm_share() is
> > >> needed.
> > >>
> > >> 3. cleanup is simplified to dropping a refcount on the aobjs, wiping
> > >> page table entries in the pmap, and then a pmap_destroy.
> > >>
> > >> The major changes are in vm_create, vm_teardown, svm_fault_page, and
> > >> vmx_fault_page functions. There's some tidying that can be done
> > >> afterwards.
> > >>
> > >> I've been stress testing this on one of the most cursed Intel machines
> > >> that is good at finding race conditions. I've also tested it lightly on
> > >> an AMD machine I own to validate SVM, so more tests are very welcome at
> > >> this point.
> > >>
> > >> OKs welcome, but I'd prefer some test reports before looking to commit,
> > >> but since we're recently unlocked from release prep, I'd prefer sooner
> > >> rather than later.
> > >>
> >
> > diffstat refs/heads/master refs/heads/vmm-uobj-v7
> > M sys/arch/amd64/amd64/vmm_machdep.c | 110+ 63-
> > M sys/dev/vmm/vmm.c | 119+ 50-
> > M sys/dev/vmm/vmm.h | 10+ 4-
> > M usr.sbin/vmd/vm.c | 16+ 128-
> >
> > 4 files changed, 255 insertions(+), 245 deletions(-)
> >
> > diff refs/heads/master refs/heads/vmm-uobj-v7
> > commit - 51b51be201a312011f0589e0b53ca83d03adb6f3
> > commit + f855c7801b5f793a2473f6a74980edb29f1cc9d0
> > blob - 0a3be15cd64d4399bd4d038152d44763c9bebe1e
> > blob + 33e312395cd9300bbaab5bd9dd44051ac72bf24b
> > --- sys/arch/amd64/amd64/vmm_machdep.c
> > +++ sys/arch/amd64/amd64/vmm_machdep.c
> > @@ -116,6 +116,7 @@ int vmm_inject_db(struct vcpu *);
> > void vmx_handle_intr(struct vcpu *);
> > void vmx_handle_misc_enable_msr(struct vcpu *);
> > int vmm_get_guest_memtype(struct vm *, paddr_t);
> > +vaddr_t vmm_translate_gpa(struct vm *, paddr_t);
> > int vmx_get_guest_faulttype(void);
> > int svm_get_guest_faulttype(struct vmcb *);
> > int vmx_get_exit_qualification(uint64_t *);
> > @@ -900,54 +901,19 @@ vmx_remote_vmclear(struct cpu_info *ci, struct vcpu *v
> > int
> > vm_impl_init(struct vm *vm, struct proc *p)
> > {
> > - int i, mode, ret;
> > - vaddr_t mingpa, maxgpa;
> > - struct vm_mem_range *vmr;
> > -
> > /* If not EPT or RVI, nothing to do here */
> > switch (vmm_softc->mode) {
> > case VMM_MODE_EPT:
> > - mode = PMAP_TYPE_EPT;
> > + pmap_convert(vm->vm_pmap, PMAP_TYPE_EPT);
> > break;
> > case VMM_MODE_RVI:
> > - mode = PMAP_TYPE_RVI;
> > + pmap_convert(vm->vm_pmap, PMAP_TYPE_RVI);
> > break;
> > default:
> > printf("%s: invalid vmm mode %d\n", __func__, vmm_softc->mode);
> > return (EINVAL);
> > }
> >
> > - vmr = &vm->vm_memranges[0];
> > - mingpa = vmr->vmr_gpa;
> > - vmr = &vm->vm_memranges[vm->vm_nmemranges - 1];
> > - maxgpa = vmr->vmr_gpa + vmr->vmr_size;
> > -
> > - /*
> > - * uvmspace_alloc (currently) always returns a valid vmspace
> > - */
> > - vm->vm_vmspace = uvmspace_alloc(mingpa, maxgpa, TRUE, FALSE);
> > - vm->vm_map = &vm->vm_vmspace->vm_map;
> > -
> > - /* Map the new map with an anon */
> > - DPRINTF("%s: created vm_map @ %p\n", __func__, vm->vm_map);
> > - for (i = 0; i < vm->vm_nmemranges; i++) {
> > - vmr = &vm->vm_memranges[i];
> > - ret = uvm_share(vm->vm_map, vmr->vmr_gpa,
> > - PROT_READ | PROT_WRITE | PROT_EXEC,
> > - &p->p_vmspace->vm_map, vmr->vmr_va, vmr->vmr_size);
> > - if (ret) {
> > - printf("%s: uvm_share failed (%d)\n", __func__, ret);
> > - /* uvmspace_free calls pmap_destroy for us */
> > - KERNEL_LOCK();
> > - uvmspace_free(vm->vm_vmspace);
> > - vm->vm_vmspace = NULL;
> > - KERNEL_UNLOCK();
> > - return (ENOMEM);
> > - }
> > - }
> > -
> > - pmap_convert(vm->vm_map->pmap, mode);
> > -
> > return (0);
> > }
> >
> > @@ -1661,7 +1627,7 @@ vcpu_reset_regs_svm(struct vcpu *vcpu, struct vcpu_reg
> >
> > /* NPT */
> > vmcb->v_np_enable = SVM_ENABLE_NP;
> > - vmcb->v_n_cr3 = vcpu->vc_parent->vm_map->pmap->pm_pdirpa;
> > + vmcb->v_n_cr3 = vcpu->vc_parent->vm_pmap->pm_pdirpa;
> >
> > /* SEV */
> > if (vcpu->vc_sev)
> > @@ -1675,7 +1641,7 @@ vcpu_reset_regs_svm(struct vcpu *vcpu, struct vcpu_reg
> > /* xcr0 power on default sets bit 0 (x87 state) */
> > vcpu->vc_gueststate.vg_xcr0 = XFEATURE_X87 & xsave_mask;
> >
> > - vcpu->vc_parent->vm_map->pmap->eptp = 0;
> > + vcpu->vc_parent->vm_pmap->eptp = 0;
> >
> > return ret;
> > }
> > @@ -2596,7 +2562,7 @@ vcpu_init_vmx(struct vcpu *vcpu)
> > }
> >
> > /* Configure EPT Pointer */
> > - eptp = vcpu->vc_parent->vm_map->pmap->pm_pdirpa;
> > + eptp = vcpu->vc_parent->vm_pmap->pm_pdirpa;
> > msr = rdmsr(IA32_VMX_EPT_VPID_CAP);
> > if (msr & IA32_EPT_VPID_CAP_PAGE_WALK_4) {
> > /* Page walk length 4 supported */
> > @@ -2620,7 +2586,7 @@ vcpu_init_vmx(struct vcpu *vcpu)
> > goto exit;
> > }
> >
> > - vcpu->vc_parent->vm_map->pmap->eptp = eptp;
> > + vcpu->vc_parent->vm_pmap->eptp = eptp;
> >
> > /* Host CR0 */
> > cr0 = rcr0() & ~CR0_TS;
> > @@ -3513,7 +3479,7 @@ vmm_translate_gva(struct vcpu *vcpu, uint64_t va, uint
> >
> > DPRINTF("%s: read pte level %d @ GPA 0x%llx\n", __func__,
> > level, pte_paddr);
> > - if (!pmap_extract(vcpu->vc_parent->vm_map->pmap, pte_paddr,
> > + if (!pmap_extract(vcpu->vc_parent->vm_pmap, pte_paddr,
> > &hpa)) {
> > DPRINTF("%s: cannot extract HPA for GPA 0x%llx\n",
> > __func__, pte_paddr);
> > @@ -3694,11 +3660,11 @@ vcpu_run_vmx(struct vcpu *vcpu, struct vm_run_params *
> >
> > /* We're now using this vcpu's EPT pmap on this cpu. */
> > atomic_swap_ptr(&ci->ci_ept_pmap,
> > - vcpu->vc_parent->vm_map->pmap);
> > + vcpu->vc_parent->vm_pmap);
> >
> > /* Invalidate EPT cache. */
> > vid_ept.vid_reserved = 0;
> > - vid_ept.vid_eptp = vcpu->vc_parent->vm_map->pmap->eptp;
> > + vid_ept.vid_eptp = vcpu->vc_parent->vm_pmap->eptp;
> > if (invept(ci->ci_vmm_cap.vcc_vmx.vmx_invept_mode,
> > &vid_ept)) {
> > printf("%s: invept\n", __func__);
> > @@ -4484,6 +4450,29 @@ vmm_get_guest_memtype(struct vm *vm, paddr_t gpa)
> > return (VMM_MEM_TYPE_UNKNOWN);
> > }
> >
> > +vaddr_t
> > +vmm_translate_gpa(struct vm *vm, paddr_t gpa)
> > +{
> > + int i = 0;
> > + vaddr_t hva = 0;
> > + struct vm_mem_range *vmr = NULL;
> > +
> > + /*
> > + * Translate GPA -> userland HVA in proc p. Find the memory range
> > + * and use it to translate to the HVA.
> > + */
> > + for (i = 0; i < vm->vm_nmemranges; i++) {
> > + vmr = &vm->vm_memranges[i];
> > + if (gpa >= vmr->vmr_gpa && gpa < vmr->vmr_gpa + vmr->vmr_size) {
> > + hva = vmr->vmr_va + (gpa - vmr->vmr_gpa);
> > + break;
> > + }
> > + }
> > +
> > + return (hva);
> > +}
> > +
> > +
> > /*
> > * vmx_get_exit_qualification
> > *
> > @@ -4565,15 +4554,45 @@ svm_get_guest_faulttype(struct vmcb *vmcb)
> > int
> > svm_fault_page(struct vcpu *vcpu, paddr_t gpa)
> > {
> > - paddr_t pa = trunc_page(gpa);
> > - int ret;
> > + struct proc *p = curproc;
> > + paddr_t hpa, pa = trunc_page(gpa);
> > + vaddr_t hva;
> > + int ret = 1;
> >
> > - ret = uvm_fault_wire(vcpu->vc_parent->vm_map, pa, pa + PAGE_SIZE,
> > - PROT_READ | PROT_WRITE | PROT_EXEC);
> > - if (ret)
> > - printf("%s: uvm_fault returns %d, GPA=0x%llx, rip=0x%llx\n",
> > - __func__, ret, (uint64_t)gpa, vcpu->vc_gueststate.vg_rip);
> > + hva = vmm_translate_gpa(vcpu->vc_parent, pa);
> > + if (hva == 0) {
> > + printf("%s: unable to translate gpa 0x%llx\n", __func__,
> > + (uint64_t)pa);
> > + return (EINVAL);
> > + }
> >
> > + /* If we don't already have a backing page... */
> > + if (!pmap_extract(p->p_vmspace->vm_map.pmap, hva, &hpa)) {
> > + /* ...fault a RW page into the p's address space... */
> > + ret = uvm_fault_wire(&p->p_vmspace->vm_map, hva,
> > + hva + PAGE_SIZE, PROT_READ | PROT_WRITE);
> > + if (ret) {
> > + printf("%s: uvm_fault failed %d hva=0x%llx\n", __func__,
> > + ret, (uint64_t)hva);
> > + return (ret);
> > + }
> > +
> > + /* ...and then get the mapping. */
> > + if (!pmap_extract(p->p_vmspace->vm_map.pmap, hva, &hpa)) {
> > + printf("%s: failed to extract hpa for hva 0x%llx\n",
> > + __func__, (uint64_t)hva);
> > + return (EINVAL);
> > + }
> > + }
> > +
> > + /* Now we insert a RWX mapping into the guest's RVI pmap. */
> > + ret = pmap_enter(vcpu->vc_parent->vm_pmap, pa, hpa,
> > + PROT_READ | PROT_WRITE | PROT_EXEC, 0);
> > + if (ret) {
> > + printf("%s: pmap_enter failed pa=0x%llx, hpa=0x%llx\n",
> > + __func__, (uint64_t)pa, (uint64_t)hpa);
> > + }
> > +
> > return (ret);
> > }
> >
> > @@ -4640,8 +4659,10 @@ svm_handle_np_fault(struct vcpu *vcpu)
> > int
> > vmx_fault_page(struct vcpu *vcpu, paddr_t gpa)
> > {
> > + struct proc *p = curproc;
> > int fault_type, ret;
> > - paddr_t pa = trunc_page(gpa);
> > + paddr_t hpa, pa = trunc_page(gpa);
> > + vaddr_t hva;
> >
> > fault_type = vmx_get_guest_faulttype();
> > switch (fault_type) {
> > @@ -4656,19 +4677,45 @@ vmx_fault_page(struct vcpu *vcpu, paddr_t gpa)
> > break;
> > }
> >
> > - /* We may sleep during uvm_fault_wire(), so reload VMCS. */
> > - vcpu->vc_last_pcpu = curcpu();
> > - ret = uvm_fault_wire(vcpu->vc_parent->vm_map, pa, pa + PAGE_SIZE,
> > - PROT_READ | PROT_WRITE | PROT_EXEC);
> > - if (vcpu_reload_vmcs_vmx(vcpu)) {
> > - printf("%s: failed to reload vmcs\n", __func__);
> > + hva = vmm_translate_gpa(vcpu->vc_parent, pa);
> > + if (hva == 0) {
> > + printf("%s: unable to translate gpa 0x%llx\n", __func__,
> > + (uint64_t)pa);
> > return (EINVAL);
> > }
> >
> > - if (ret)
> > - printf("%s: uvm_fault returns %d, GPA=0x%llx, rip=0x%llx\n",
> > - __func__, ret, (uint64_t)gpa, vcpu->vc_gueststate.vg_rip);
> > + /* If we don't already have a backing page... */
> > + if (!pmap_extract(p->p_vmspace->vm_map.pmap, hva, &hpa)) {
> > + /* ...fault a RW page into the p's address space... */
> > + vcpu->vc_last_pcpu = curcpu(); /* uvm_fault may sleep. */
> > + ret = uvm_fault_wire(&p->p_vmspace->vm_map, hva,
> > + hva + PAGE_SIZE, PROT_READ | PROT_WRITE);
> > + if (ret) {
> > + printf("%s: uvm_fault failed %d hva=0x%llx\n", __func__,
> > + ret, (uint64_t)hva);
> > + return (ret);
> > + }
> > + if (vcpu_reload_vmcs_vmx(vcpu)) {
> > + printf("%s: failed to reload vmcs\n", __func__);
> > + return (EINVAL);
> > + }
> >
> > + /* ...and then get the mapping. */
> > + if (!pmap_extract(p->p_vmspace->vm_map.pmap, hva, &hpa)) {
> > + printf("%s: failed to extract hpa for hva 0x%llx\n",
> > + __func__, (uint64_t)hva);
> > + return (EINVAL);
> > + }
> > + }
> > +
> > + /* Now we insert a RWX mapping into the guest's EPT pmap. */
> > + ret = pmap_enter(vcpu->vc_parent->vm_pmap, pa, hpa,
> > + PROT_READ | PROT_WRITE | PROT_EXEC, 0);
> > + if (ret) {
> > + printf("%s: pmap_enter failed pa=0x%llx, hpa=0x%llx\n",
> > + __func__, (uint64_t)pa, (uint64_t)hpa);
> > + }
> > +
> > return (ret);
> > }
> >
> > @@ -4953,7 +5000,7 @@ vmx_load_pdptes(struct vcpu *vcpu)
> > return (EINVAL);
> > }
> >
> > - if (!pmap_extract(vcpu->vc_parent->vm_map->pmap, (vaddr_t)cr3,
> > + if (!pmap_extract(vcpu->vc_parent->vm_pmap, (vaddr_t)cr3,
> > (paddr_t *)&cr3_host_phys)) {
> > DPRINTF("%s: nonmapped guest CR3, setting PDPTEs to 0\n",
> > __func__);
> > @@ -6499,7 +6546,7 @@ vmm_update_pvclock(struct vcpu *vcpu)
> >
> > if (vcpu->vc_pvclock_system_gpa & PVCLOCK_SYSTEM_TIME_ENABLE) {
> > pvclock_gpa = vcpu->vc_pvclock_system_gpa & 0xFFFFFFFFFFFFFFF0;
> > - if (!pmap_extract(vm->vm_map->pmap, pvclock_gpa, &pvclock_hpa))
> > + if (!pmap_extract(vm->vm_pmap, pvclock_gpa, &pvclock_hpa))
> > return (EINVAL);
> > pvclock_ti = (void*) PMAP_DIRECT_MAP(pvclock_hpa);
> >
> > blob - 0a86ddbecd3fbd3627336902c7209f6d023bc3f0
> > blob + 4d39ff3b6910e1b905c959476073eed89471f407
> > --- sys/dev/vmm/vmm.c
> > +++ sys/dev/vmm/vmm.c
> > @@ -25,6 +25,9 @@
> > #include <sys/malloc.h>
> > #include <sys/signalvar.h>
> >
> > +#include <uvm/uvm_extern.h>
> > +#include <uvm/uvm_aobj.h>
> > +
> > #include <machine/vmmvar.h>
> >
> > #include <dev/vmm/vmm.h>
> > @@ -356,6 +359,9 @@ vm_create(struct vm_create_params *vcp, struct proc *p
> > size_t memsize;
> > struct vm *vm;
> > struct vcpu *vcpu;
> > + struct uvm_object *uao;
> > + struct vm_mem_range *vmr;
> > + unsigned int uvmflags = 0;
> >
> > memsize = vm_create_check_mem_ranges(vcp);
> > if (memsize == 0)
> > @@ -378,13 +384,61 @@ vm_create(struct vm_create_params *vcp, struct proc *p
> > /* Instantiate and configure the new vm. */
> > vm = pool_get(&vm_pool, PR_WAITOK | PR_ZERO);
> >
> > + /* Create the VM's identity. */
> > vm->vm_creator_pid = p->p_p->ps_pid;
> > + strncpy(vm->vm_name, vcp->vcp_name, VMM_MAX_NAME_LEN - 1);
> > +
> > + /* Create the pmap for nested paging. */
> > + vm->vm_pmap = pmap_create();
> > +
> > + /* Initialize memory slots. */
> > vm->vm_nmemranges = vcp->vcp_nmemranges;
> > memcpy(vm->vm_memranges, vcp->vcp_memranges,
> > vm->vm_nmemranges * sizeof(vm->vm_memranges[0]));
> > - vm->vm_memory_size = memsize;
> > - strncpy(vm->vm_name, vcp->vcp_name, VMM_MAX_NAME_LEN - 1);
> > + vm->vm_memory_size = memsize; /* Calculated above. */
> >
> > + uvmflags = UVM_MAPFLAG(PROT_READ | PROT_WRITE, PROT_READ | PROT_WRITE,
> > + MAP_INHERIT_NONE, MADV_NORMAL, UVM_FLAG_CONCEAL);
> > + for (i = 0; i < vm->vm_nmemranges; i++) {
> > + vmr = &vm->vm_memranges[i];
> > + if (vmr->vmr_type == VM_MEM_MMIO)
> > + continue;
> > +
> > + uao = NULL;
> > + uao = uao_create(vmr->vmr_size, UAO_FLAG_CANFAIL);
> > + if (uao == NULL) {
> > + printf("%s: failed to initialize memory slot\n",
> > + __func__);
> > + vm_teardown(&vm);
> > + return (ENOMEM);
> > + }
> > +
> > + /* Map the UVM aobj into the process. It owns this reference. */
> > + ret = uvm_map(&p->p_vmspace->vm_map, &vmr->vmr_va,
> > + vmr->vmr_size, uao, 0, 0, uvmflags);
> > + if (ret) {
> > + printf("%s: uvm_map failed: %d\n", __func__, ret);
> > + uao_detach(uao);
> > + vm_teardown(&vm);
> > + return (ENOMEM);
> > + }
> > +
> > + /* Make this mapping immutable so userland cannot change it. */
> > + ret = uvm_map_immutable(&p->p_vmspace->vm_map, vmr->vmr_va,
> > + vmr->vmr_va + vmr->vmr_size, 1);
> > + if (ret) {
> > + printf("%s: uvm_map_immutable failed: %d\n", __func__,
> > + ret);
> > + uvm_unmap(&p->p_vmspace->vm_map, vmr->vmr_va,
> > + vmr->vmr_va + vmr->vmr_size);
> > + vm_teardown(&vm);
> > + return (ret);
> > + }
> > +
> > + uao_reference(uao); /* Take a reference for vmm. */
> > + vm->vm_memory_slot[i] = uao;
> > + }
> > +
> > if (vm_impl_init(vm, p)) {
> > printf("failed to init arch-specific features for vm %p\n", vm);
> > vm_teardown(&vm);
> > @@ -433,6 +487,10 @@ vm_create(struct vm_create_params *vcp, struct proc *p
> > vmm_softc->vm_ct++;
> > vmm_softc->vcpu_ct += vm->vm_vcpu_ct;
> >
> > + /* Update the userland process's view of guest memory. */
> > + memcpy(vcp->vcp_memranges, vm->vm_memranges,
> > + vcp->vcp_nmemranges * sizeof(vcp->vcp_memranges[0]));
> > +
> > rw_exit_write(&vmm_softc->vm_lock);
> >
> > return (0);
> > @@ -482,19 +540,6 @@ vm_create_check_mem_ranges(struct vm_create_params *vc
> > }
> >
> > /*
> > - * Make sure that all virtual addresses are within the address
> > - * space of the process and that they do not wrap around.
> > - * Calling uvm_share() when creating the VM will take care of
> > - * further checks.
> > - */
> > - if (vmr->vmr_va < VM_MIN_ADDRESS ||
> > - vmr->vmr_va >= VM_MAXUSER_ADDRESS ||
> > - vmr->vmr_size >= VM_MAXUSER_ADDRESS - vmr->vmr_va) {
> > - DPRINTF("guest va not within range or wraps\n");
> > - return (0);
> > - }
> > -
> > - /*
> > * Make sure that guest physical memory ranges do not overlap
> > * and that they are ascending.
> > */
> > @@ -529,10 +574,11 @@ vm_create_check_mem_ranges(struct vm_create_params *vc
> > void
> > vm_teardown(struct vm **target)
> > {
> > - size_t nvcpu = 0;
> > + size_t i, nvcpu = 0;
> > + vaddr_t sva, eva;
> > struct vcpu *vcpu, *tmp;
> > struct vm *vm = *target;
> > - struct vmspace *vm_vmspace;
> > + struct uvm_object *uao;
> >
> > KERNEL_ASSERT_UNLOCKED();
> >
> > @@ -540,22 +586,29 @@ vm_teardown(struct vm **target)
> > SLIST_FOREACH_SAFE(vcpu, &vm->vm_vcpu_list, vc_vcpu_link, tmp) {
> > SLIST_REMOVE(&vm->vm_vcpu_list, vcpu, vcpu, vc_vcpu_link);
> > vcpu_deinit(vcpu);
> > -
> > pool_put(&vcpu_pool, vcpu);
> > nvcpu++;
> > }
> >
> > - vm_impl_deinit(vm);
> > + /* Remove guest mappings from our nested page tables. */
> > + for (i = 0; i < vm->vm_nmemranges; i++) {
> > + sva = vm->vm_memranges[i].vmr_gpa;
> > + eva = sva + vm->vm_memranges[i].vmr_size - 1;
> > + pmap_remove(vm->vm_pmap, sva, eva);
> > + }
> >
> > - /* teardown guest vmspace */
> > - KERNEL_LOCK();
> > - vm_vmspace = vm->vm_vmspace;
> > - if (vm_vmspace != NULL) {
> > - vm->vm_vmspace = NULL;
> > - uvmspace_free(vm_vmspace);
> > + /* Release UVM anon objects backing our guest memory. */
> > + for (i = 0; i < vm->vm_nmemranges; i++) {
> > + uao = vm->vm_memory_slot[i];
> > + vm->vm_memory_slot[i] = NULL;
> > + if (uao != NULL)
> > + uao_detach(uao);
> > }
> > - KERNEL_UNLOCK();
> >
> > + /* At this point, no UVM-managed pages should reference our pmap. */
> > + pmap_destroy(vm->vm_pmap);
> > + vm->vm_pmap = NULL;
> > +
> > pool_put(&vm_pool, vm);
> > *target = NULL;
> > }
> > @@ -612,7 +665,7 @@ vm_get_info(struct vm_info_params *vip)
> >
> > out[i].vir_memory_size = vm->vm_memory_size;
> > out[i].vir_used_size =
> > - pmap_resident_count(vm->vm_map->pmap) * PAGE_SIZE;
> > + pmap_resident_count(vm->vm_pmap) * PAGE_SIZE;
> > out[i].vir_ncpus = vm->vm_vcpu_ct;
> > out[i].vir_id = vm->vm_id;
> > out[i].vir_creator_pid = vm->vm_creator_pid;
> > @@ -793,17 +846,17 @@ vcpu_must_stop(struct vcpu *vcpu)
> > * Return values:
> > * 0: if successful
> > * ENOENT: if the vm cannot be found by vm_find
> > - * EPERM: if the vm cannot be accessed by the current process
> > - * EINVAL: if the provide memory ranges fail checks
> > - * ENOMEM: if uvm_share fails to find available memory in the destination map
> > + * other errno on uvm_map or uvm_map_immutable failures
> > */
> > int
> > vm_share_mem(struct vm_sharemem_params *vsp, struct proc *p)
> > {
> > - int ret = EINVAL;
> > - size_t i, n;
> > + int ret = EINVAL, unmap = 0;
> > + size_t i, failed_uao = 0, n;
> > struct vm *vm;
> > struct vm_mem_range *src, *dst;
> > + struct uvm_object *uao;
> > + unsigned int uvmflags;
> >
> > ret = vm_find(vsp->vsp_vm_id, &vm);
> > if (ret)
> > @@ -830,36 +883,52 @@ vm_share_mem(struct vm_sharemem_params *vsp, struct pr
> > if (src->vmr_size != dst->vmr_size)
> > goto out;
> >
> > - /* Check our intended destination is page-aligned. */
> > - if (dst->vmr_va & PAGE_MASK)
> > + /* The virtual addresses will be chosen by uvm_map(). */
> > + if (vsp->vsp_va[i] != 0)
> > goto out;
> > }
> >
> > - /*
> > - * Share each range individually with the calling process. We do
> > - * not need PROC_EXEC as the emulated devices do not need to execute
> > - * instructions from guest memory.
> > - */
> > + /* Share each UVM aobj with the calling process. */
> > + uvmflags = UVM_MAPFLAG(PROT_READ | PROT_WRITE, PROT_READ | PROT_WRITE,
> > + MAP_INHERIT_NONE, MADV_NORMAL, UVM_FLAG_CONCEAL);
> > for (i = 0; i < n; i++) {
> > - src = &vm->vm_memranges[i];
> > dst = &vsp->vsp_memranges[i];
> > -
> > - /* Skip MMIO range. */
> > - if (src->vmr_type == VM_MEM_MMIO)
> > + if (dst->vmr_type == VM_MEM_MMIO)
> > continue;
> >
> > - DPRINTF("sharing gpa=0x%lx for pid %d @ va=0x%lx\n",
> > - src->vmr_gpa, p->p_p->ps_pid, dst->vmr_va);
> > - ret = uvm_share(&p->p_vmspace->vm_map, dst->vmr_va,
> > - PROT_READ | PROT_WRITE, vm->vm_map, src->vmr_gpa,
> > - src->vmr_size);
> > + uao = vm->vm_memory_slot[i];
> > + KASSERT(uao != NULL);
> > +
> > + ret = uvm_map(&p->p_p->ps_vmspace->vm_map, &vsp->vsp_va[i],
> > + dst->vmr_size, uao, 0, 0, uvmflags);
> > if (ret) {
> > - printf("%s: uvm_share failed (%d)\n", __func__, ret);
> > - break;
> > + printf("%s: uvm_map failed: %d\n", __func__, ret);
> > + unmap = (i > 0) ? 1 : 0;
> > + failed_uao = i;
> > + goto out;
> > }
> > + uao_reference(uao); /* Add a reference for the process. */
> > +
> > + ret = uvm_map_immutable(&p->p_p->ps_vmspace->vm_map,
> > + vsp->vsp_va[i], vsp->vsp_va[i] + dst->vmr_size, 1);
> > + if (ret) {
> > + printf("%s: uvm_map_immutable failed: %d\n",
> > + __func__, ret);
> > + unmap = 1;
> > + failed_uao = i + 1;
> > + goto out;
> > + }
> > }
> > ret = 0;
> > out:
> > + if (unmap) {
> > + /* Unmap mapped aobjs, which drops the process's reference. */
> > + for (i = 0; i < failed_uao; i++) {
> > + dst = &vsp->vsp_memranges[i];
> > + uvm_unmap(&p->p_p->ps_vmspace->vm_map,
> > + vsp->vsp_va[i], vsp->vsp_va[i] + dst->vmr_size);
> > + }
> > + }
> > refcnt_rele_wake(&vm->vm_refcnt);
> > return (ret);
> > }
> > blob - 621dd08080c5f40e4b594737b4ea68f1ce98d5d1
> > blob + 78d0681cda32a01b11914fd55315230f817c64a5
> > --- sys/dev/vmm/vmm.h
> > +++ sys/dev/vmm/vmm.h
> > @@ -95,6 +95,9 @@ struct vm_sharemem_params {
> > uint32_t vsp_vm_id;
> > size_t vsp_nmemranges;
> > struct vm_mem_range vsp_memranges[VMM_MAX_MEM_RANGES];
> > +
> > + /* Output parameters from VMM_IOC_SHAREMEM */
> > + vaddr_t vsp_va[VMM_MAX_MEM_RANGES];
> > };
> >
> > struct vm_run_params {
> > @@ -138,7 +141,7 @@ struct vm_rwvmparams_params {
> > #define VMM_IOC_READVMPARAMS _IOWR('V', 9, struct vm_rwvmparams_params)
> > /* Set VM params */
> > #define VMM_IOC_WRITEVMPARAMS _IOW('V', 10, struct vm_rwvmparams_params)
> > -#define VMM_IOC_SHAREMEM _IOW('V', 11, struct vm_sharemem_params)
> > +#define VMM_IOC_SHAREMEM _IOWR('V', 11, struct vm_sharemem_params)
> >
> > #ifdef _KERNEL
> >
> > @@ -169,14 +172,17 @@ enum {
> > * V vmm_softc's vm_lock
> > */
> > struct vm {
> > - struct vmspace *vm_vmspace; /* [K] */
> > - vm_map_t vm_map; /* [K] */
> > + pmap_t vm_pmap; /* [r] */
> > +
> > uint32_t vm_id; /* [I] */
> > pid_t vm_creator_pid; /* [I] */
> > +
> > size_t vm_nmemranges; /* [I] */
> > size_t vm_memory_size; /* [I] */
> > - char vm_name[VMM_MAX_NAME_LEN];
> > struct vm_mem_range vm_memranges[VMM_MAX_MEM_RANGES];
> > + struct uvm_object *vm_memory_slot[VMM_MAX_MEM_RANGES]; /* [I] */
> > +
> > + char vm_name[VMM_MAX_NAME_LEN];
> > struct refcnt vm_refcnt; /* [a] */
> >
> > struct vcpu_head vm_vcpu_list; /* [v] */
> > blob - e399c0c0439a02117bc53c62cbb59c950b176773
> > blob + 30b12b0e30936bddf75484aeedf60a34b1296d89
> > --- usr.sbin/vmd/vm.c
> > +++ usr.sbin/vmd/vm.c
> > @@ -49,7 +49,6 @@ static void vm_dispatch_vmm(int, short, void *);
> > static void *event_thread(void *);
> > static void *vcpu_run_loop(void *);
> > static int vmm_create_vm(struct vmd_vm *);
> > -static int alloc_guest_mem(struct vmd_vm *);
> > static int send_vm(int, struct vmd_vm *);
> > static int dump_vmr(int , struct vm_mem_range *);
> > static int dump_mem(int, struct vmd_vm *);
> > @@ -201,7 +200,8 @@ start_vm(struct vmd_vm *vm, int fd)
> > if (!(vm->vm_state & VM_STATE_RECEIVED))
> > create_memory_map(vcp);
> >
> > - ret = alloc_guest_mem(vm);
> > + /* Create the vm in vmm(4). */
> > + ret = vmm_create_vm(vm);
> > if (ret) {
> > struct rlimit lim;
> > char buf[FMT_SCALED_STRSIZE];
> > @@ -209,15 +209,11 @@ start_vm(struct vmd_vm *vm, int fd)
> > if (fmt_scaled(lim.rlim_cur, buf) == 0)
> > fatalx("could not allocate guest memory (data "
> > "limit is %s)", buf);
> > + } else {
> > + errno = ret;
> > + log_warn("could not create vm");
> > }
> > - errno = ret;
> > - log_warn("could not allocate guest memory");
> > - return (ret);
> > - }
> >
> > - /* We've allocated guest memory, so now create the vm in vmm(4). */
> > - ret = vmm_create_vm(vm);
> > - if (ret) {
> > /* Let the vmm process know we failed by sending a 0 vm id. */
> > vcp->vcp_id = 0;
> > atomicio(vwrite, fd, &vcp->vcp_id, sizeof(vcp->vcp_id));
> > @@ -756,62 +752,6 @@ vcpu_reset(uint32_t vmid, uint32_t vcpu_id, struct vcp
> > }
> >
> > /*
> > - * alloc_guest_mem
> > - *
> > - * Allocates memory for the guest.
> > - * Instead of doing a single allocation with one mmap(), we allocate memory
> > - * separately for every range for the following reasons:
> > - * - ASLR for the individual ranges
> > - * - to reduce memory consumption in the UVM subsystem: if vmm(4) had to
> > - * map the single mmap'd userspace memory to the individual guest physical
> > - * memory ranges, the underlying amap of the single mmap'd range would have
> > - * to allocate per-page reference counters. The reason is that the
> > - * individual guest physical ranges would reference the single mmap'd region
> > - * only partially. However, if every guest physical range has its own
> > - * corresponding mmap'd userspace allocation, there are no partial
> > - * references: every guest physical range fully references an mmap'd
> > - * range => no per-page reference counters have to be allocated.
> > - *
> > - * Return values:
> > - * 0: success
> > - * !0: failure - errno indicating the source of the failure
> > - */
> > -int
> > -alloc_guest_mem(struct vmd_vm *vm)
> > -{
> > - void *p;
> > - int ret = 0;
> > - size_t i, j;
> > - struct vm_create_params *vcp = &vm->vm_params.vmc_params;
> > - struct vm_mem_range *vmr;
> > -
> > - for (i = 0; i < vcp->vcp_nmemranges; i++) {
> > - vmr = &vcp->vcp_memranges[i];
> > -
> > - /*
> > - * We only need R/W as userland. vmm(4) will use R/W/X in its
> > - * mapping.
> > - *
> > - * We must use MAP_SHARED so emulated devices will be able
> > - * to generate shared mappings.
> > - */
> > - p = mmap(NULL, vmr->vmr_size, PROT_READ | PROT_WRITE,
> > - MAP_ANON | MAP_CONCEAL | MAP_SHARED, -1, 0);
> > - if (p == MAP_FAILED) {
> > - ret = errno;
> > - for (j = 0; j < i; j++) {
> > - vmr = &vcp->vcp_memranges[j];
> > - munmap((void *)vmr->vmr_va, vmr->vmr_size);
> > - }
> > - return (ret);
> > - }
> > - vmr->vmr_va = (vaddr_t)p;
> > - }
> > -
> > - return (ret);
> > -}
> > -
> > -/*
> > * vmm_create_vm
> > *
> > * Requests vmm(4) to create a new VM using the supplied creation
> > @@ -1384,87 +1324,35 @@ vm_pipe_recv(struct vm_dev_pipe *p)
> > }
> >
> > /*
> > - * Re-map the guest address space using vmm(4)'s VMM_IOC_SHARE
> > + * Re-map the guest address space using vmm(4)'s VMM_IOC_SHAREMEM
> > *
> > - * Returns 0 on success, non-zero in event of failure.
> > + * Returns 0 on success or an errno in event of failure.
> > */
> > int
> > remap_guest_mem(struct vmd_vm *vm, int vmm_fd)
> > {
> > - struct vm_create_params *vcp;
> > - struct vm_mem_range *vmr;
> > + size_t i;
> > + struct vm_create_params *vcp = &vm->vm_params.vmc_params;
> > struct vm_sharemem_params vsp;
> > - size_t i, j;
> > - void *p = NULL;
> > - int ret;
> >
> > if (vm == NULL)
> > - return (1);
> > + return (EINVAL);
> >
> > - vcp = &vm->vm_params.vmc_params;
> > -
> > - /*
> > - * Initialize our VM shared memory request using our original
> > - * creation parameters. We'll overwrite the va's after mmap(2).
> > - */
> > + /* Initialize using our original creation parameters. */
> > memset(&vsp, 0, sizeof(vsp));
> > vsp.vsp_nmemranges = vcp->vcp_nmemranges;
> > vsp.vsp_vm_id = vcp->vcp_id;
> > memcpy(&vsp.vsp_memranges, &vcp->vcp_memranges,
> > sizeof(vsp.vsp_memranges));
> >
> > - /*
> > - * Use mmap(2) to identify virtual address space for our mappings.
> > - */
> > - for (i = 0; i < VMM_MAX_MEM_RANGES; i++) {
> > - if (i < vsp.vsp_nmemranges) {
> > - vmr = &vsp.vsp_memranges[i];
> > -
> > - /* Ignore any MMIO ranges. */
> > - if (vmr->vmr_type == VM_MEM_MMIO) {
> > - vmr->vmr_va = 0;
> > - vcp->vcp_memranges[i].vmr_va = 0;
> > - continue;
> > - }
> > -
> > - /* Make initial mappings for the memrange. */
> > - p = mmap(NULL, vmr->vmr_size, PROT_READ, MAP_ANON, -1,
> > - 0);
> > - if (p == MAP_FAILED) {
> > - ret = errno;
> > - log_warn("%s: mmap", __func__);
> > - for (j = 0; j < i; j++) {
> > - vmr = &vcp->vcp_memranges[j];
> > - munmap((void *)vmr->vmr_va,
> > - vmr->vmr_size);
> > - }
> > - return (ret);
> > - }
> > - vmr->vmr_va = (vaddr_t)p;
> > - vcp->vcp_memranges[i].vmr_va = vmr->vmr_va;
> > - }
> > - }
> > -
> > - /*
> > - * munmap(2) now that we have va's and ranges that don't overlap. vmm
> > - * will use the va's and sizes to recreate the mappings for us.
> > - */
> > - for (i = 0; i < vsp.vsp_nmemranges; i++) {
> > - vmr = &vsp.vsp_memranges[i];
> > - if (vmr->vmr_type == VM_MEM_MMIO)
> > - continue;
> > - if (munmap((void*)vmr->vmr_va, vmr->vmr_size) == -1)
> > - fatal("%s: munmap", __func__);
> > - }
> > -
> > - /*
> > - * Ask vmm to enter the shared mappings for us. They'll point
> > - * to the same host physical memory, but will have a randomized
> > - * virtual address for the calling process.
> > - */
> > + /* Ask vmm(4) to enter a shared mapping to guest memory. */
> > if (ioctl(vmm_fd, VMM_IOC_SHAREMEM, &vsp) == -1)
> > return (errno);
> >
> > + /* Update with the location of the new mappings. */
> > + for (i = 0; i < vsp.vsp_nmemranges; i++)
> > + vcp->vcp_memranges[i].vmr_va = vsp.vsp_va[i];
> > +
> > return (0);
> > }
> >
>
> --
> Dr. Hans-Jörg Höxer Hans-Joerg_Hoexer@genua.de
> Senior Expert Kryptographie
> eXtreme Kernel and Crypto Development
>
> genua GmbH
> Domagkstrasse 7, 85551 Kirchheim bei München
> tel +49 89 991950-0, fax -999, www.genua.eu
>
> Geschäftsführer: Matthias Ochs, Marc Tesch
> Amtsgericht München HRB 98238
> genua ist ein Unternehmen der Bundesdruckerei-Gruppe.
>
>
>
>
>
>
> Julia Template
>
> Betreff : Re: [EXT] Re: Reworking VMM's nested paging & guest memory (de-vmspace-ification)
> Sender : owner-tech+M103328=hshoexer=genua.de@openbsd.org
> Envelope Sender : owner-tech+M103328=hshoexer=genua.de@openbsd.org
> Sender Name : Hans-Jörg Höxer
> Sender Domain : openbsd.org
> Message ID : <aCNMMATdK0ZpRfZl@genua.de>
> Mail Size : 44172
> Time : 13.05.2025 15:42:51 (Di 13 Mai 2025 15:42:51 CEST)
> Julia Commands : Keine Kommandos verwendet
>
> Die Nachricht war signiert.
>
> Allgemeine Informationen zur Signatur:
> Die Signatur der Nachricht war gültig.
> Der Nachrichtenumschlag war S/MIME signiert.
>
> S/MIME-Engine Antworten:
> Envelope Signer : /C=DE/O=genua GmbH/CN=Hans-J\xC3\xB6rg H\xC3\xB6xer/L=Kirchheim b. M\xC3\xBCnchen/GN=Hans-J\xC3\xB6rg/SN=H\xC3\xB6xer/organizationIdentifier=NTRDE-D2601V.HRB98238/serialNumber=CSM043478563/ST=Bayern
>
> Info Signatur : CMS Inhaltstyp : pkcs7-signedData (1.2.840.113549.1.7.2)
> Verschlüsselter Content-Type : pkcs7-data (1.2.840.113549.1.7.1)
> Signatur Zertifikat Fingerprint SHA1: 8A:6A:1A:A3:1A:B4:EA:25:18:45:2E:6E:E2:CE:DC:64:F0:3D:9B:26
> Signatur Zertifikat Schlüssellänge: 4096 (RSA)
> Signatur Zertifikat Seriennummer: 74:36:DA:14:8D:EC:3B:D4:F6:35:7A:F3:88:82:CB:FE
> Signatur Zertifikat gültig ab: Sep 17 14:21:48 2024 GMT
> Signatur Zertifikat gültig bis: Sep 20 14:21:48 2025 GMT
> Signatur Zertifikat gültig für die Verschlüsselung: Ja
> Signatur Zertifikat gültig für die Signatur: Ja
> Signatur Zertifikats CA /C=DE/O=D-Trust GmbH/CN=D-TRUST Application Certificates CA 3-1 2013
> Signatur MD : sha256 (2.16.840.1.101.3.4.2.1)
> Signaturalgorithmus : rsaEncryption (1.2.840.113549.1.1.1)
> Signaturzeitpunkt: May 13 13:42:10 2025 GMT
> Signaturzertifikat enthält nicht die Email-Adresse des Absenders 'owner-tech+M103328=hshoexer=genua.de@openbsd.org'
>
> MD Signatur : : sha256 (2.16.840.1.101.3.4.2.1)
>
> Signature Engine Response :
> Verify Engine Response :
> Verification OK (0)
>
> Qualified Verify Engine Response :
>
>
>
>
>
>
>
> Ende des Julia-Templates
--
Dr. Hans-Jörg Höxer Hans-Joerg_Hoexer@genua.de
Senior Expert Kryptographie
eXtreme Kernel and Crypto Development
genua GmbH
Domagkstrasse 7, 85551 Kirchheim bei München
tel +49 89 991950-0, fax -999, www.genua.eu
Geschäftsführer: Matthias Ochs, Marc Tesch
Amtsgericht München HRB 98238
genua ist ein Unternehmen der Bundesdruckerei-Gruppe.
Reworking VMM's nested paging & guest memory (de-vmspace-ification)
[EXT] Re: Reworking VMM's nested paging & guest memory (de-vmspace-ification)
[EXT] Re: Reworking VMM's nested paging & guest memory (de-vmspace-ification)
[EXT] Reworking VMM's nested paging & guest memory (de-vmspace-ification)