From: Dave Voutila
Subject: Re: Reworking VMM's nested paging & guest memory (de-vmspace-ification)
To: tech@openbsd.org
Cc: mlarkin@openbsd.org
Date: Tue, 06 May 2025 14:34:37 -0400

Dave Voutila writes:

> Dave Voutila writes:
>
>> tl;dr: testers wanted on a variety of hardware where you run
>> vmm/vmd. This needs *both* building a new kernel *and* vmd.
>>
>
> Some positive test reports so far, and it looks like bluhm's issue was
> hardware related, but I would still like confirmation that this *works*
> with SEV mode enabled on a guest vm.
>
> mpi@ gave some feedback on leaking aobjs, so I incorporated it into this
> version. Now on uvm_map() failure the aobj that failed mapping has an
> accompanying uao_detach() call. I was missing this in vm_create and
> vm_teardown wasn't doing it for me. Also had the same problem in
> vm_share_mem.

Worked with mpi@ on tuning the lifecycle of the uaos and I'm now happy
with how this looks. Asking for ok's at this point!

>
> The more I looked at refining the direct use of uvm_map() the
> more I realized this is also a good time to ditch having the vmd process
> "prefind" ranges via mmap(2)/munmap(2) like vmd has been doing. This
> version no longer uses UVM_FLAG_FIXED and instead lets uvm_map()
> determine the mapping locations. The mappings are now also set with
> UVM_FLAG_CONCEAL and MAP_INHERIT_NONE. Lastly, I now set the mappings as
> immutable so userland can't muck with them, as vmm is supposed to be
> managing them.
>
> Plus, with some slight cheating by tidying up comments, the diff is
> technically net-negative! The bulk comes from removing the hacky
> behavior in vmd.
>
> diffstat refs/heads/master refs/heads/vmm-uobj-v5
>  M  sys/arch/amd64/amd64/vmm_machdep.c  |  110+   63-
>  M  sys/dev/vmm/vmm.c                   |  104+   46-
>  M  sys/dev/vmm/vmm.h                   |   10+    4-
>  M  usr.sbin/vmd/vm.c                   |   16+  128-
>
> 4 files changed, 240 insertions(+), 241 deletions(-)
>

No longer net-negative... but only net 10 lines, so not bad?

 4 files changed, 255 insertions(+), 245 deletions(-)

>> Today, vmm uses a UVM vmspace to represent both the nested page table
>> (pmap) for the guest as well as guest memory in the form of amaps
>> within the vmspace's map. This has sort of worked, but I've been hunting
>> constant issues related to vmm triggering uvm_map_teardown inside vmm.
>>
>> In short, using a full blown UVM vmspace to represent a guest makes us
>> do some funky things that are not typical in other hypervisors and leads
>> to some lifecycle headaches:
>>
>> 1. vmm faults pages into the vmspace's map, which is shared via a
>> uvm_share() function to share map entries into a target process's
>> own address space. (This uvm_share() function exists *only* for vmm
>> to assist in sharing guest memory between vmm's vmspace in the
>> kernel and the vmd process's address space.)
>>
>> 2. This also forces use of uvm_share() for vmd child processes that
>> need r/w access to guest memory, such as the virtio block/network
>> emulation.
>>
>> 3. Teardown/cleanup means vmm's vmspace may or may not outlive the vmd
>> process with which it's sharing mappings.
>>
>> Now that vmd is multi-process for guest emulation, the extra abstraction
>> is simply a burden and increases the surface area to hunt for the source
>> of bugs. Plus, vmm is relying on a UVM function no other kernel system
>> uses, making it a bit of a snowflake.
>>
>> This diff removes the use of a vmspace by:
>>
>> 1. creating and managing a pmap directly for representing the nested
>> page tables (i.e.
EPT page tables on Intel) and using pmap_extract / >> pmap_enter / pmap_remove for working with them. >> >> 2. creates and manages UVM aobjs to represent each slot/range of guest >> memory with an explicit size provided at creation time. These are >> uvm_map()'d into userland vmd processes and no use of uvm_share() is >> needed. >> >> 3. cleanup is simplified to dropping a refcount on the aobjs, wiping >> page table entries in the pmap, and then a pmap_destroy. >> >> The major changes are in vm_create, vm_teardown, svm_fault_page, and >> vmx_fault_page functions. There's some tidying that can be done >> afterwards. >> >> I've been stress testing this on one of the most cursed Intel machines >> that is good at finding race conditions. I've also tested it lightly on >> an AMD machine I own to validate SVM, so more tests are very welcome at >> this point. >> >> OKs welcome, but I'd prefer some test reports before looking to commit, >> but since we're recently unlocked from release prep, I'd prefer sooner >> rather than later. >> diffstat refs/heads/master refs/heads/vmm-uobj-v7 M sys/arch/amd64/amd64/vmm_machdep.c | 110+ 63- M sys/dev/vmm/vmm.c | 119+ 50- M sys/dev/vmm/vmm.h | 10+ 4- M usr.sbin/vmd/vm.c | 16+ 128- 4 files changed, 255 insertions(+), 245 deletions(-) diff refs/heads/master refs/heads/vmm-uobj-v7 commit - 51b51be201a312011f0589e0b53ca83d03adb6f3 commit + f855c7801b5f793a2473f6a74980edb29f1cc9d0 blob - 0a3be15cd64d4399bd4d038152d44763c9bebe1e blob + 33e312395cd9300bbaab5bd9dd44051ac72bf24b --- sys/arch/amd64/amd64/vmm_machdep.c +++ sys/arch/amd64/amd64/vmm_machdep.c @@ -116,6 +116,7 @@ int vmm_inject_db(struct vcpu *); void vmx_handle_intr(struct vcpu *); void vmx_handle_misc_enable_msr(struct vcpu *); int vmm_get_guest_memtype(struct vm *, paddr_t); +vaddr_t vmm_translate_gpa(struct vm *, paddr_t); int vmx_get_guest_faulttype(void); int svm_get_guest_faulttype(struct vmcb *); int vmx_get_exit_qualification(uint64_t *); @@ -900,54 +901,19 @@ vmx_remote_vmclear(struct cpu_info *ci, struct vcpu *v int vm_impl_init(struct vm *vm, struct proc *p) { - int i, mode, ret; - vaddr_t mingpa, maxgpa; - struct vm_mem_range *vmr; - /* If not EPT or RVI, nothing to do here */ switch (vmm_softc->mode) { case VMM_MODE_EPT: - mode = PMAP_TYPE_EPT; + pmap_convert(vm->vm_pmap, PMAP_TYPE_EPT); break; case VMM_MODE_RVI: - mode = PMAP_TYPE_RVI; + pmap_convert(vm->vm_pmap, PMAP_TYPE_RVI); break; default: printf("%s: invalid vmm mode %d\n", __func__, vmm_softc->mode); return (EINVAL); } - vmr = &vm->vm_memranges[0]; - mingpa = vmr->vmr_gpa; - vmr = &vm->vm_memranges[vm->vm_nmemranges - 1]; - maxgpa = vmr->vmr_gpa + vmr->vmr_size; - - /* - * uvmspace_alloc (currently) always returns a valid vmspace - */ - vm->vm_vmspace = uvmspace_alloc(mingpa, maxgpa, TRUE, FALSE); - vm->vm_map = &vm->vm_vmspace->vm_map; - - /* Map the new map with an anon */ - DPRINTF("%s: created vm_map @ %p\n", __func__, vm->vm_map); - for (i = 0; i < vm->vm_nmemranges; i++) { - vmr = &vm->vm_memranges[i]; - ret = uvm_share(vm->vm_map, vmr->vmr_gpa, - PROT_READ | PROT_WRITE | PROT_EXEC, - &p->p_vmspace->vm_map, vmr->vmr_va, vmr->vmr_size); - if (ret) { - printf("%s: uvm_share failed (%d)\n", __func__, ret); - /* uvmspace_free calls pmap_destroy for us */ - KERNEL_LOCK(); - uvmspace_free(vm->vm_vmspace); - vm->vm_vmspace = NULL; - KERNEL_UNLOCK(); - return (ENOMEM); - } - } - - pmap_convert(vm->vm_map->pmap, mode); - return (0); } @@ -1661,7 +1627,7 @@ vcpu_reset_regs_svm(struct vcpu *vcpu, struct vcpu_reg /* NPT */ vmcb->v_np_enable 
= SVM_ENABLE_NP; - vmcb->v_n_cr3 = vcpu->vc_parent->vm_map->pmap->pm_pdirpa; + vmcb->v_n_cr3 = vcpu->vc_parent->vm_pmap->pm_pdirpa; /* SEV */ if (vcpu->vc_sev) @@ -1675,7 +1641,7 @@ vcpu_reset_regs_svm(struct vcpu *vcpu, struct vcpu_reg /* xcr0 power on default sets bit 0 (x87 state) */ vcpu->vc_gueststate.vg_xcr0 = XFEATURE_X87 & xsave_mask; - vcpu->vc_parent->vm_map->pmap->eptp = 0; + vcpu->vc_parent->vm_pmap->eptp = 0; return ret; } @@ -2596,7 +2562,7 @@ vcpu_init_vmx(struct vcpu *vcpu) } /* Configure EPT Pointer */ - eptp = vcpu->vc_parent->vm_map->pmap->pm_pdirpa; + eptp = vcpu->vc_parent->vm_pmap->pm_pdirpa; msr = rdmsr(IA32_VMX_EPT_VPID_CAP); if (msr & IA32_EPT_VPID_CAP_PAGE_WALK_4) { /* Page walk length 4 supported */ @@ -2620,7 +2586,7 @@ vcpu_init_vmx(struct vcpu *vcpu) goto exit; } - vcpu->vc_parent->vm_map->pmap->eptp = eptp; + vcpu->vc_parent->vm_pmap->eptp = eptp; /* Host CR0 */ cr0 = rcr0() & ~CR0_TS; @@ -3513,7 +3479,7 @@ vmm_translate_gva(struct vcpu *vcpu, uint64_t va, uint DPRINTF("%s: read pte level %d @ GPA 0x%llx\n", __func__, level, pte_paddr); - if (!pmap_extract(vcpu->vc_parent->vm_map->pmap, pte_paddr, + if (!pmap_extract(vcpu->vc_parent->vm_pmap, pte_paddr, &hpa)) { DPRINTF("%s: cannot extract HPA for GPA 0x%llx\n", __func__, pte_paddr); @@ -3694,11 +3660,11 @@ vcpu_run_vmx(struct vcpu *vcpu, struct vm_run_params * /* We're now using this vcpu's EPT pmap on this cpu. */ atomic_swap_ptr(&ci->ci_ept_pmap, - vcpu->vc_parent->vm_map->pmap); + vcpu->vc_parent->vm_pmap); /* Invalidate EPT cache. */ vid_ept.vid_reserved = 0; - vid_ept.vid_eptp = vcpu->vc_parent->vm_map->pmap->eptp; + vid_ept.vid_eptp = vcpu->vc_parent->vm_pmap->eptp; if (invept(ci->ci_vmm_cap.vcc_vmx.vmx_invept_mode, &vid_ept)) { printf("%s: invept\n", __func__); @@ -4484,6 +4450,29 @@ vmm_get_guest_memtype(struct vm *vm, paddr_t gpa) return (VMM_MEM_TYPE_UNKNOWN); } +vaddr_t +vmm_translate_gpa(struct vm *vm, paddr_t gpa) +{ + int i = 0; + vaddr_t hva = 0; + struct vm_mem_range *vmr = NULL; + + /* + * Translate GPA -> userland HVA in proc p. Find the memory range + * and use it to translate to the HVA. + */ + for (i = 0; i < vm->vm_nmemranges; i++) { + vmr = &vm->vm_memranges[i]; + if (gpa >= vmr->vmr_gpa && gpa < vmr->vmr_gpa + vmr->vmr_size) { + hva = vmr->vmr_va + (gpa - vmr->vmr_gpa); + break; + } + } + + return (hva); +} + + /* * vmx_get_exit_qualification * @@ -4565,15 +4554,45 @@ svm_get_guest_faulttype(struct vmcb *vmcb) int svm_fault_page(struct vcpu *vcpu, paddr_t gpa) { - paddr_t pa = trunc_page(gpa); - int ret; + struct proc *p = curproc; + paddr_t hpa, pa = trunc_page(gpa); + vaddr_t hva; + int ret = 1; - ret = uvm_fault_wire(vcpu->vc_parent->vm_map, pa, pa + PAGE_SIZE, - PROT_READ | PROT_WRITE | PROT_EXEC); - if (ret) - printf("%s: uvm_fault returns %d, GPA=0x%llx, rip=0x%llx\n", - __func__, ret, (uint64_t)gpa, vcpu->vc_gueststate.vg_rip); + hva = vmm_translate_gpa(vcpu->vc_parent, pa); + if (hva == 0) { + printf("%s: unable to translate gpa 0x%llx\n", __func__, + (uint64_t)pa); + return (EINVAL); + } + /* If we don't already have a backing page... */ + if (!pmap_extract(p->p_vmspace->vm_map.pmap, hva, &hpa)) { + /* ...fault a RW page into the p's address space... */ + ret = uvm_fault_wire(&p->p_vmspace->vm_map, hva, + hva + PAGE_SIZE, PROT_READ | PROT_WRITE); + if (ret) { + printf("%s: uvm_fault failed %d hva=0x%llx\n", __func__, + ret, (uint64_t)hva); + return (ret); + } + + /* ...and then get the mapping. 
*/ + if (!pmap_extract(p->p_vmspace->vm_map.pmap, hva, &hpa)) { + printf("%s: failed to extract hpa for hva 0x%llx\n", + __func__, (uint64_t)hva); + return (EINVAL); + } + } + + /* Now we insert a RWX mapping into the guest's RVI pmap. */ + ret = pmap_enter(vcpu->vc_parent->vm_pmap, pa, hpa, + PROT_READ | PROT_WRITE | PROT_EXEC, 0); + if (ret) { + printf("%s: pmap_enter failed pa=0x%llx, hpa=0x%llx\n", + __func__, (uint64_t)pa, (uint64_t)hpa); + } + return (ret); } @@ -4640,8 +4659,10 @@ svm_handle_np_fault(struct vcpu *vcpu) int vmx_fault_page(struct vcpu *vcpu, paddr_t gpa) { + struct proc *p = curproc; int fault_type, ret; - paddr_t pa = trunc_page(gpa); + paddr_t hpa, pa = trunc_page(gpa); + vaddr_t hva; fault_type = vmx_get_guest_faulttype(); switch (fault_type) { @@ -4656,19 +4677,45 @@ vmx_fault_page(struct vcpu *vcpu, paddr_t gpa) break; } - /* We may sleep during uvm_fault_wire(), so reload VMCS. */ - vcpu->vc_last_pcpu = curcpu(); - ret = uvm_fault_wire(vcpu->vc_parent->vm_map, pa, pa + PAGE_SIZE, - PROT_READ | PROT_WRITE | PROT_EXEC); - if (vcpu_reload_vmcs_vmx(vcpu)) { - printf("%s: failed to reload vmcs\n", __func__); + hva = vmm_translate_gpa(vcpu->vc_parent, pa); + if (hva == 0) { + printf("%s: unable to translate gpa 0x%llx\n", __func__, + (uint64_t)pa); return (EINVAL); } - if (ret) - printf("%s: uvm_fault returns %d, GPA=0x%llx, rip=0x%llx\n", - __func__, ret, (uint64_t)gpa, vcpu->vc_gueststate.vg_rip); + /* If we don't already have a backing page... */ + if (!pmap_extract(p->p_vmspace->vm_map.pmap, hva, &hpa)) { + /* ...fault a RW page into the p's address space... */ + vcpu->vc_last_pcpu = curcpu(); /* uvm_fault may sleep. */ + ret = uvm_fault_wire(&p->p_vmspace->vm_map, hva, + hva + PAGE_SIZE, PROT_READ | PROT_WRITE); + if (ret) { + printf("%s: uvm_fault failed %d hva=0x%llx\n", __func__, + ret, (uint64_t)hva); + return (ret); + } + if (vcpu_reload_vmcs_vmx(vcpu)) { + printf("%s: failed to reload vmcs\n", __func__); + return (EINVAL); + } + /* ...and then get the mapping. */ + if (!pmap_extract(p->p_vmspace->vm_map.pmap, hva, &hpa)) { + printf("%s: failed to extract hpa for hva 0x%llx\n", + __func__, (uint64_t)hva); + return (EINVAL); + } + } + + /* Now we insert a RWX mapping into the guest's EPT pmap. 
*/ + ret = pmap_enter(vcpu->vc_parent->vm_pmap, pa, hpa, + PROT_READ | PROT_WRITE | PROT_EXEC, 0); + if (ret) { + printf("%s: pmap_enter failed pa=0x%llx, hpa=0x%llx\n", + __func__, (uint64_t)pa, (uint64_t)hpa); + } + return (ret); } @@ -4953,7 +5000,7 @@ vmx_load_pdptes(struct vcpu *vcpu) return (EINVAL); } - if (!pmap_extract(vcpu->vc_parent->vm_map->pmap, (vaddr_t)cr3, + if (!pmap_extract(vcpu->vc_parent->vm_pmap, (vaddr_t)cr3, (paddr_t *)&cr3_host_phys)) { DPRINTF("%s: nonmapped guest CR3, setting PDPTEs to 0\n", __func__); @@ -6499,7 +6546,7 @@ vmm_update_pvclock(struct vcpu *vcpu) if (vcpu->vc_pvclock_system_gpa & PVCLOCK_SYSTEM_TIME_ENABLE) { pvclock_gpa = vcpu->vc_pvclock_system_gpa & 0xFFFFFFFFFFFFFFF0; - if (!pmap_extract(vm->vm_map->pmap, pvclock_gpa, &pvclock_hpa)) + if (!pmap_extract(vm->vm_pmap, pvclock_gpa, &pvclock_hpa)) return (EINVAL); pvclock_ti = (void*) PMAP_DIRECT_MAP(pvclock_hpa); blob - 0a86ddbecd3fbd3627336902c7209f6d023bc3f0 blob + 4d39ff3b6910e1b905c959476073eed89471f407 --- sys/dev/vmm/vmm.c +++ sys/dev/vmm/vmm.c @@ -25,6 +25,9 @@ #include #include +#include +#include + #include #include @@ -356,6 +359,9 @@ vm_create(struct vm_create_params *vcp, struct proc *p size_t memsize; struct vm *vm; struct vcpu *vcpu; + struct uvm_object *uao; + struct vm_mem_range *vmr; + unsigned int uvmflags = 0; memsize = vm_create_check_mem_ranges(vcp); if (memsize == 0) @@ -378,13 +384,61 @@ vm_create(struct vm_create_params *vcp, struct proc *p /* Instantiate and configure the new vm. */ vm = pool_get(&vm_pool, PR_WAITOK | PR_ZERO); + /* Create the VM's identity. */ vm->vm_creator_pid = p->p_p->ps_pid; + strncpy(vm->vm_name, vcp->vcp_name, VMM_MAX_NAME_LEN - 1); + + /* Create the pmap for nested paging. */ + vm->vm_pmap = pmap_create(); + + /* Initialize memory slots. */ vm->vm_nmemranges = vcp->vcp_nmemranges; memcpy(vm->vm_memranges, vcp->vcp_memranges, vm->vm_nmemranges * sizeof(vm->vm_memranges[0])); - vm->vm_memory_size = memsize; - strncpy(vm->vm_name, vcp->vcp_name, VMM_MAX_NAME_LEN - 1); + vm->vm_memory_size = memsize; /* Calculated above. */ + uvmflags = UVM_MAPFLAG(PROT_READ | PROT_WRITE, PROT_READ | PROT_WRITE, + MAP_INHERIT_NONE, MADV_NORMAL, UVM_FLAG_CONCEAL); + for (i = 0; i < vm->vm_nmemranges; i++) { + vmr = &vm->vm_memranges[i]; + if (vmr->vmr_type == VM_MEM_MMIO) + continue; + + uao = NULL; + uao = uao_create(vmr->vmr_size, UAO_FLAG_CANFAIL); + if (uao == NULL) { + printf("%s: failed to initialize memory slot\n", + __func__); + vm_teardown(&vm); + return (ENOMEM); + } + + /* Map the UVM aobj into the process. It owns this reference. */ + ret = uvm_map(&p->p_vmspace->vm_map, &vmr->vmr_va, + vmr->vmr_size, uao, 0, 0, uvmflags); + if (ret) { + printf("%s: uvm_map failed: %d\n", __func__, ret); + uao_detach(uao); + vm_teardown(&vm); + return (ENOMEM); + } + + /* Make this mapping immutable so userland cannot change it. */ + ret = uvm_map_immutable(&p->p_vmspace->vm_map, vmr->vmr_va, + vmr->vmr_va + vmr->vmr_size, 1); + if (ret) { + printf("%s: uvm_map_immutable failed: %d\n", __func__, + ret); + uvm_unmap(&p->p_vmspace->vm_map, vmr->vmr_va, + vmr->vmr_va + vmr->vmr_size); + vm_teardown(&vm); + return (ret); + } + + uao_reference(uao); /* Take a reference for vmm. 
*/ + vm->vm_memory_slot[i] = uao; + } + if (vm_impl_init(vm, p)) { printf("failed to init arch-specific features for vm %p\n", vm); vm_teardown(&vm); @@ -433,6 +487,10 @@ vm_create(struct vm_create_params *vcp, struct proc *p vmm_softc->vm_ct++; vmm_softc->vcpu_ct += vm->vm_vcpu_ct; + /* Update the userland process's view of guest memory. */ + memcpy(vcp->vcp_memranges, vm->vm_memranges, + vcp->vcp_nmemranges * sizeof(vcp->vcp_memranges[0])); + rw_exit_write(&vmm_softc->vm_lock); return (0); @@ -482,19 +540,6 @@ vm_create_check_mem_ranges(struct vm_create_params *vc } /* - * Make sure that all virtual addresses are within the address - * space of the process and that they do not wrap around. - * Calling uvm_share() when creating the VM will take care of - * further checks. - */ - if (vmr->vmr_va < VM_MIN_ADDRESS || - vmr->vmr_va >= VM_MAXUSER_ADDRESS || - vmr->vmr_size >= VM_MAXUSER_ADDRESS - vmr->vmr_va) { - DPRINTF("guest va not within range or wraps\n"); - return (0); - } - - /* * Make sure that guest physical memory ranges do not overlap * and that they are ascending. */ @@ -529,10 +574,11 @@ vm_create_check_mem_ranges(struct vm_create_params *vc void vm_teardown(struct vm **target) { - size_t nvcpu = 0; + size_t i, nvcpu = 0; + vaddr_t sva, eva; struct vcpu *vcpu, *tmp; struct vm *vm = *target; - struct vmspace *vm_vmspace; + struct uvm_object *uao; KERNEL_ASSERT_UNLOCKED(); @@ -540,22 +586,29 @@ vm_teardown(struct vm **target) SLIST_FOREACH_SAFE(vcpu, &vm->vm_vcpu_list, vc_vcpu_link, tmp) { SLIST_REMOVE(&vm->vm_vcpu_list, vcpu, vcpu, vc_vcpu_link); vcpu_deinit(vcpu); - pool_put(&vcpu_pool, vcpu); nvcpu++; } - vm_impl_deinit(vm); + /* Remove guest mappings from our nested page tables. */ + for (i = 0; i < vm->vm_nmemranges; i++) { + sva = vm->vm_memranges[i].vmr_gpa; + eva = sva + vm->vm_memranges[i].vmr_size - 1; + pmap_remove(vm->vm_pmap, sva, eva); + } - /* teardown guest vmspace */ - KERNEL_LOCK(); - vm_vmspace = vm->vm_vmspace; - if (vm_vmspace != NULL) { - vm->vm_vmspace = NULL; - uvmspace_free(vm_vmspace); + /* Release UVM anon objects backing our guest memory. */ + for (i = 0; i < vm->vm_nmemranges; i++) { + uao = vm->vm_memory_slot[i]; + vm->vm_memory_slot[i] = NULL; + if (uao != NULL) + uao_detach(uao); } - KERNEL_UNLOCK(); + /* At this point, no UVM-managed pages should reference our pmap. 
*/ + pmap_destroy(vm->vm_pmap); + vm->vm_pmap = NULL; + pool_put(&vm_pool, vm); *target = NULL; } @@ -612,7 +665,7 @@ vm_get_info(struct vm_info_params *vip) out[i].vir_memory_size = vm->vm_memory_size; out[i].vir_used_size = - pmap_resident_count(vm->vm_map->pmap) * PAGE_SIZE; + pmap_resident_count(vm->vm_pmap) * PAGE_SIZE; out[i].vir_ncpus = vm->vm_vcpu_ct; out[i].vir_id = vm->vm_id; out[i].vir_creator_pid = vm->vm_creator_pid; @@ -793,17 +846,17 @@ vcpu_must_stop(struct vcpu *vcpu) * Return values: * 0: if successful * ENOENT: if the vm cannot be found by vm_find - * EPERM: if the vm cannot be accessed by the current process - * EINVAL: if the provide memory ranges fail checks - * ENOMEM: if uvm_share fails to find available memory in the destination map + * other errno on uvm_map or uvm_map_immutable failures */ int vm_share_mem(struct vm_sharemem_params *vsp, struct proc *p) { - int ret = EINVAL; - size_t i, n; + int ret = EINVAL, unmap = 0; + size_t i, failed_uao = 0, n; struct vm *vm; struct vm_mem_range *src, *dst; + struct uvm_object *uao; + unsigned int uvmflags; ret = vm_find(vsp->vsp_vm_id, &vm); if (ret) @@ -830,36 +883,52 @@ vm_share_mem(struct vm_sharemem_params *vsp, struct pr if (src->vmr_size != dst->vmr_size) goto out; - /* Check our intended destination is page-aligned. */ - if (dst->vmr_va & PAGE_MASK) + /* The virtual addresses will be chosen by uvm_map(). */ + if (vsp->vsp_va[i] != 0) goto out; } - /* - * Share each range individually with the calling process. We do - * not need PROC_EXEC as the emulated devices do not need to execute - * instructions from guest memory. - */ + /* Share each UVM aobj with the calling process. */ + uvmflags = UVM_MAPFLAG(PROT_READ | PROT_WRITE, PROT_READ | PROT_WRITE, + MAP_INHERIT_NONE, MADV_NORMAL, UVM_FLAG_CONCEAL); for (i = 0; i < n; i++) { - src = &vm->vm_memranges[i]; dst = &vsp->vsp_memranges[i]; - - /* Skip MMIO range. */ - if (src->vmr_type == VM_MEM_MMIO) + if (dst->vmr_type == VM_MEM_MMIO) continue; - DPRINTF("sharing gpa=0x%lx for pid %d @ va=0x%lx\n", - src->vmr_gpa, p->p_p->ps_pid, dst->vmr_va); - ret = uvm_share(&p->p_vmspace->vm_map, dst->vmr_va, - PROT_READ | PROT_WRITE, vm->vm_map, src->vmr_gpa, - src->vmr_size); + uao = vm->vm_memory_slot[i]; + KASSERT(uao != NULL); + + ret = uvm_map(&p->p_p->ps_vmspace->vm_map, &vsp->vsp_va[i], + dst->vmr_size, uao, 0, 0, uvmflags); if (ret) { - printf("%s: uvm_share failed (%d)\n", __func__, ret); - break; + printf("%s: uvm_map failed: %d\n", __func__, ret); + unmap = (i > 0) ? 1 : 0; + failed_uao = i; + goto out; } + uao_reference(uao); /* Add a reference for the process. */ + + ret = uvm_map_immutable(&p->p_p->ps_vmspace->vm_map, + vsp->vsp_va[i], vsp->vsp_va[i] + dst->vmr_size, 1); + if (ret) { + printf("%s: uvm_map_immutable failed: %d\n", + __func__, ret); + unmap = 1; + failed_uao = i + 1; + goto out; + } } ret = 0; out: + if (unmap) { + /* Unmap mapped aobjs, which drops the process's reference. 
*/ + for (i = 0; i < failed_uao; i++) { + dst = &vsp->vsp_memranges[i]; + uvm_unmap(&p->p_p->ps_vmspace->vm_map, + vsp->vsp_va[i], vsp->vsp_va[i] + dst->vmr_size); + } + } refcnt_rele_wake(&vm->vm_refcnt); return (ret); } blob - 621dd08080c5f40e4b594737b4ea68f1ce98d5d1 blob + 78d0681cda32a01b11914fd55315230f817c64a5 --- sys/dev/vmm/vmm.h +++ sys/dev/vmm/vmm.h @@ -95,6 +95,9 @@ struct vm_sharemem_params { uint32_t vsp_vm_id; size_t vsp_nmemranges; struct vm_mem_range vsp_memranges[VMM_MAX_MEM_RANGES]; + + /* Output parameters from VMM_IOC_SHAREMEM */ + vaddr_t vsp_va[VMM_MAX_MEM_RANGES]; }; struct vm_run_params { @@ -138,7 +141,7 @@ struct vm_rwvmparams_params { #define VMM_IOC_READVMPARAMS _IOWR('V', 9, struct vm_rwvmparams_params) /* Set VM params */ #define VMM_IOC_WRITEVMPARAMS _IOW('V', 10, struct vm_rwvmparams_params) -#define VMM_IOC_SHAREMEM _IOW('V', 11, struct vm_sharemem_params) +#define VMM_IOC_SHAREMEM _IOWR('V', 11, struct vm_sharemem_params) #ifdef _KERNEL @@ -169,14 +172,17 @@ enum { * V vmm_softc's vm_lock */ struct vm { - struct vmspace *vm_vmspace; /* [K] */ - vm_map_t vm_map; /* [K] */ + pmap_t vm_pmap; /* [r] */ + uint32_t vm_id; /* [I] */ pid_t vm_creator_pid; /* [I] */ + size_t vm_nmemranges; /* [I] */ size_t vm_memory_size; /* [I] */ - char vm_name[VMM_MAX_NAME_LEN]; struct vm_mem_range vm_memranges[VMM_MAX_MEM_RANGES]; + struct uvm_object *vm_memory_slot[VMM_MAX_MEM_RANGES]; /* [I] */ + + char vm_name[VMM_MAX_NAME_LEN]; struct refcnt vm_refcnt; /* [a] */ struct vcpu_head vm_vcpu_list; /* [v] */ blob - e399c0c0439a02117bc53c62cbb59c950b176773 blob + 30b12b0e30936bddf75484aeedf60a34b1296d89 --- usr.sbin/vmd/vm.c +++ usr.sbin/vmd/vm.c @@ -49,7 +49,6 @@ static void vm_dispatch_vmm(int, short, void *); static void *event_thread(void *); static void *vcpu_run_loop(void *); static int vmm_create_vm(struct vmd_vm *); -static int alloc_guest_mem(struct vmd_vm *); static int send_vm(int, struct vmd_vm *); static int dump_vmr(int , struct vm_mem_range *); static int dump_mem(int, struct vmd_vm *); @@ -201,7 +200,8 @@ start_vm(struct vmd_vm *vm, int fd) if (!(vm->vm_state & VM_STATE_RECEIVED)) create_memory_map(vcp); - ret = alloc_guest_mem(vm); + /* Create the vm in vmm(4). */ + ret = vmm_create_vm(vm); if (ret) { struct rlimit lim; char buf[FMT_SCALED_STRSIZE]; @@ -209,15 +209,11 @@ start_vm(struct vmd_vm *vm, int fd) if (fmt_scaled(lim.rlim_cur, buf) == 0) fatalx("could not allocate guest memory (data " "limit is %s)", buf); + } else { + errno = ret; + log_warn("could not create vm"); } - errno = ret; - log_warn("could not allocate guest memory"); - return (ret); - } - /* We've allocated guest memory, so now create the vm in vmm(4). */ - ret = vmm_create_vm(vm); - if (ret) { /* Let the vmm process know we failed by sending a 0 vm id. */ vcp->vcp_id = 0; atomicio(vwrite, fd, &vcp->vcp_id, sizeof(vcp->vcp_id)); @@ -756,62 +752,6 @@ vcpu_reset(uint32_t vmid, uint32_t vcpu_id, struct vcp } /* - * alloc_guest_mem - * - * Allocates memory for the guest. - * Instead of doing a single allocation with one mmap(), we allocate memory - * separately for every range for the following reasons: - * - ASLR for the individual ranges - * - to reduce memory consumption in the UVM subsystem: if vmm(4) had to - * map the single mmap'd userspace memory to the individual guest physical - * memory ranges, the underlying amap of the single mmap'd range would have - * to allocate per-page reference counters. 
The reason is that the - * individual guest physical ranges would reference the single mmap'd region - * only partially. However, if every guest physical range has its own - * corresponding mmap'd userspace allocation, there are no partial - * references: every guest physical range fully references an mmap'd - * range => no per-page reference counters have to be allocated. - * - * Return values: - * 0: success - * !0: failure - errno indicating the source of the failure - */ -int -alloc_guest_mem(struct vmd_vm *vm) -{ - void *p; - int ret = 0; - size_t i, j; - struct vm_create_params *vcp = &vm->vm_params.vmc_params; - struct vm_mem_range *vmr; - - for (i = 0; i < vcp->vcp_nmemranges; i++) { - vmr = &vcp->vcp_memranges[i]; - - /* - * We only need R/W as userland. vmm(4) will use R/W/X in its - * mapping. - * - * We must use MAP_SHARED so emulated devices will be able - * to generate shared mappings. - */ - p = mmap(NULL, vmr->vmr_size, PROT_READ | PROT_WRITE, - MAP_ANON | MAP_CONCEAL | MAP_SHARED, -1, 0); - if (p == MAP_FAILED) { - ret = errno; - for (j = 0; j < i; j++) { - vmr = &vcp->vcp_memranges[j]; - munmap((void *)vmr->vmr_va, vmr->vmr_size); - } - return (ret); - } - vmr->vmr_va = (vaddr_t)p; - } - - return (ret); -} - -/* * vmm_create_vm * * Requests vmm(4) to create a new VM using the supplied creation @@ -1384,87 +1324,35 @@ vm_pipe_recv(struct vm_dev_pipe *p) } /* - * Re-map the guest address space using vmm(4)'s VMM_IOC_SHARE + * Re-map the guest address space using vmm(4)'s VMM_IOC_SHAREMEM * - * Returns 0 on success, non-zero in event of failure. + * Returns 0 on success or an errno in event of failure. */ int remap_guest_mem(struct vmd_vm *vm, int vmm_fd) { - struct vm_create_params *vcp; - struct vm_mem_range *vmr; + size_t i; + struct vm_create_params *vcp = &vm->vm_params.vmc_params; struct vm_sharemem_params vsp; - size_t i, j; - void *p = NULL; - int ret; if (vm == NULL) - return (1); + return (EINVAL); - vcp = &vm->vm_params.vmc_params; - - /* - * Initialize our VM shared memory request using our original - * creation parameters. We'll overwrite the va's after mmap(2). - */ + /* Initialize using our original creation parameters. */ memset(&vsp, 0, sizeof(vsp)); vsp.vsp_nmemranges = vcp->vcp_nmemranges; vsp.vsp_vm_id = vcp->vcp_id; memcpy(&vsp.vsp_memranges, &vcp->vcp_memranges, sizeof(vsp.vsp_memranges)); - /* - * Use mmap(2) to identify virtual address space for our mappings. - */ - for (i = 0; i < VMM_MAX_MEM_RANGES; i++) { - if (i < vsp.vsp_nmemranges) { - vmr = &vsp.vsp_memranges[i]; - - /* Ignore any MMIO ranges. */ - if (vmr->vmr_type == VM_MEM_MMIO) { - vmr->vmr_va = 0; - vcp->vcp_memranges[i].vmr_va = 0; - continue; - } - - /* Make initial mappings for the memrange. */ - p = mmap(NULL, vmr->vmr_size, PROT_READ, MAP_ANON, -1, - 0); - if (p == MAP_FAILED) { - ret = errno; - log_warn("%s: mmap", __func__); - for (j = 0; j < i; j++) { - vmr = &vcp->vcp_memranges[j]; - munmap((void *)vmr->vmr_va, - vmr->vmr_size); - } - return (ret); - } - vmr->vmr_va = (vaddr_t)p; - vcp->vcp_memranges[i].vmr_va = vmr->vmr_va; - } - } - - /* - * munmap(2) now that we have va's and ranges that don't overlap. vmm - * will use the va's and sizes to recreate the mappings for us. - */ - for (i = 0; i < vsp.vsp_nmemranges; i++) { - vmr = &vsp.vsp_memranges[i]; - if (vmr->vmr_type == VM_MEM_MMIO) - continue; - if (munmap((void*)vmr->vmr_va, vmr->vmr_size) == -1) - fatal("%s: munmap", __func__); - } - - /* - * Ask vmm to enter the shared mappings for us. 
They'll point - * to the same host physical memory, but will have a randomized - * virtual address for the calling process. - */ + /* Ask vmm(4) to enter a shared mapping to guest memory. */ if (ioctl(vmm_fd, VMM_IOC_SHAREMEM, &vsp) == -1) return (errno); + /* Update with the location of the new mappings. */ + for (i = 0; i < vsp.vsp_nmemranges; i++) + vcp->vcp_memranges[i].vmr_va = vsp.vsp_va[i]; + return (0); }
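
For anyone reviewing the refcounting and teardown ordering, here's a rough
summary of the intended lifecycle, written as a C comment rather than
compilable code. It only restates what the diff above already does (names
match the diff); error paths, locking and MMIO ranges are glossed over, so
treat it as a sketch and trust the diff wherever they disagree.

/*
 * vm_create():     vm_pmap = pmap_create(), later pmap_convert()'d to
 *                  EPT/RVI in vm_impl_init().
 *                  Per non-MMIO memrange:
 *                    uao_create()            ref owned by the uvm_map() entry
 *                    uvm_map() + uvm_map_immutable()  into vmd's vm_map;
 *                                            kernel picks vmr_va (CONCEAL,
 *                                            MAP_INHERIT_NONE, immutable)
 *                    uao_reference()         extra ref held in vm_memory_slot[i]
 *
 * vm_share_mem():  per non-MMIO memrange, for a vmd child process:
 *                    uvm_map() + uvm_map_immutable()  into the child's vm_map
 *                    uao_reference()         one more ref per process mapping
 *
 * nested fault:    vmm_translate_gpa()       GPA -> HVA via vm_memranges[]
 *                  uvm_fault_wire()          if no backing page yet (RW)
 *                  pmap_extract()            HVA -> HPA in the proc's pmap
 *                  pmap_enter()              GPA -> HPA into vm_pmap (RWX)
 *
 * vm_teardown():   pmap_remove()             wipe nested PTEs per memrange
 *                  uao_detach()              drop vmm's ref on each slot
 *                  pmap_destroy()            free the nested pmap; remaining
 *                                            aobj refs go away with the
 *                                            processes' mappings.
 */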