From: Kirill A. Korinsky <kirill@korins.ky>
Subject: Re: Reworking VMM's nested paging & guest memory (de-vmspace-ification)
To: Dave Voutila <dv@sisu.io>
Cc: tech@openbsd.org, mlarkin@openbsd.org
Date: Tue, 15 Apr 2025 22:47:07 +0200

On Tue, 15 Apr 2025 02:18:46 +0200,
Dave Voutila <dv@sisu.io> wrote:
> 
> tl;dr: testers wanted on a variety of hardware where you run
> vmm/vmd. This needs *both* building a new kernel *and* vmd.
>

Works well on:

cpu0 at mainbus0: apid 0 (boot processor)
cpu0: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz, 3292.33 MHz, 06-8e-0c, patch 000000fc
cpu0: cpuid 1 edx=bfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE> ecx=77fafbbf<SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,EST,TM2,SSSE3,SDBG,FMA3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND>
cpu0: cpuid 6 eax=27f7<SENSOR,ARAT> ecx=9<EFFFREQ>
cpu0: cpuid 7.0 ebx=29c67af<FSGSBASE,TSC_ADJUST,SGX,BMI1,AVX2,SMEP,BMI2,ERMS,INVPCID,MPX,RDSEED,ADX,SMAP,CLFLUSHOPT,PT> edx=bc000600<SRBDS_CTRL,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD>
cpu0: cpuid a vers=4, gp=4, gpwidth=48, ff=3, ffwidth=48
cpu0: cpuid d.1 eax=f<XSAVEOPT,XSAVEC,XGETBV1,XSAVES>
cpu0: cpuid 80000001 edx=2c100800<NXE,PAGE1GB,RDTSCP,LONG> ecx=121<LAHF,ABM,3DNOWP>
cpu0: cpuid 80000007 edx=100<ITSC>
cpu0: msr 10a=a0a0c2b<IBRS_ALL,SKIP_L1DFL,MDS_NO,MISC_PKG_CT,ENERGY_FILT,FB_CLEAR,RRSBA,GDS_CTRL,RFDS_NO>
cpu0: 32KB 64b/line 8-way D-cache, 32KB 64b/line 8-way I-cache, 256KB 64b/line 4-way L2 cache, 6MB 64b/line 12-way L3 cache
cpu0: smt 0, core 0, package 0


> Today, vmm uses a UVM vmspace to represent both the nested page table
> (pmap) for the guest and guest memory in the form of amaps within the
> vmspace's map. This has mostly worked, but I've been chasing constant
> issues related to uvm_map_teardown being triggered from within vmm.
> 
> In short, using a full-blown UVM vmspace to represent a guest makes us
> do some funky things that aren't typical of other hypervisors and leads
> to some lifecycle headaches (the current setup is sketched after this
> list):
> 
>  1. vmm faults pages into the vmspace's map, which is shared via a
>     uvm_share() function to share map entries into a target process's
>     own address space. (This uvm_share() function exists *only* for vmm
>     to assist in sharing guest memory between vmm's vmspace in the
>     kernel and the vmd process's address space.)
> 
>  2. This also forces use of uvm_share() for vmd child processes that
>     need r/w access to guest memory, such as the virtio block/network
>     emulation.
> 
>  3. Teardown/cleanup means vmm's vmspace may or may not outlive the vmd
>     process with which it's sharing mappings.
> 
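> For reference, the setup being replaced looks roughly like this
> (condensed from the removed hunk in vm_impl_init() below; error
> handling elided):
> 
> 	/* One full UVM vmspace per guest, spanning the guest physical range. */
> 	vm->vm_vmspace = uvmspace_alloc(mingpa, maxgpa, TRUE, FALSE);
> 	vm->vm_map = &vm->vm_vmspace->vm_map;
> 
> 	/* Share each guest memory range into the vmd process's map. */
> 	for (i = 0; i < vm->vm_nmemranges; i++) {
> 		vmr = &vm->vm_memranges[i];
> 		uvm_share(vm->vm_map, vmr->vmr_gpa,
> 		    PROT_READ | PROT_WRITE | PROT_EXEC,
> 		    &p->p_vmspace->vm_map, vmr->vmr_va, vmr->vmr_size);
> 	}
> 
> 	pmap_convert(vm->vm_map->pmap, mode);
> 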
> Now that vmd is multi-process for guest emulation, the extra abstraction
> is simply a burden and increases the surface area to cover when hunting
> for the source of bugs. Plus, vmm is relying on a UVM function that no
> other kernel subsystem uses, making it a bit of a snowflake.
> 
> This diff removes the use of a vmspace by:
> 
>  1. creating and managing a pmap directly to represent the nested
>     page tables (i.e. EPT page tables on Intel) and using pmap_extract /
>     pmap_enter / pmap_remove to work with them.
> 
>  2. creating and managing UVM aobjs to represent each slot/range of guest
>     memory, with an explicit size provided at creation time. These are
>     uvm_map()'d into userland vmd processes, so no use of uvm_share() is
>     needed (see the sketch after this list).
> 
>  3. simplifying cleanup to dropping a refcount on the aobjs, wiping the
>     page table entries in the pmap, and then a pmap_destroy.
> 
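> In vm_create(), the per-slot setup then becomes roughly the following
> (condensed from the diff below; error handling and the MMIO case
> elided):
> 
> 	/* Create the pmap backing the nested page tables (EPT/RVI). */
> 	vm->vm_pmap = pmap_create();
> 
> 	for (i = 0; i < vm->vm_nmemranges; i++) {
> 		/* One anonymous UVM object per slot, sized up front. */
> 		uao = uao_create(vm->vm_memranges[i].vmr_size,
> 		    UAO_FLAG_CANFAIL);
> 
> 		/* Map it r/w into the creating vmd process at vmr_va. */
> 		uao_reference(uao);
> 		uvm_map(&p->p_vmspace->vm_map, &vm->vm_memranges[i].vmr_va,
> 		    vm->vm_memranges[i].vmr_size, uao, 0, 0, uvmflag);
> 
> 		vm->vm_memory_slot[i] = uao;
> 	}
> 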
> The major changes are in the vm_create, vm_teardown, svm_fault_page, and
> vmx_fault_page functions. There's some tidying that can be done
> afterwards.
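> 
> The new nested-page-fault path has the same shape for EPT and RVI
> (condensed from the svm_fault_page()/vmx_fault_page() hunks below;
> error handling and the VMCS reload elided):
> 
> 	/* GPA -> HVA via the matching vm_mem_range. */
> 	hva = vmm_translate_gpa(vcpu->vc_parent, pa);
> 
> 	/* Make sure vmd's address space has a backing page for the HVA... */
> 	if (!pmap_extract(p->p_vmspace->vm_map.pmap, hva, &hpa)) {
> 		uvm_fault_wire(&p->p_vmspace->vm_map, hva, hva + PAGE_SIZE,
> 		    PROT_READ | PROT_WRITE);
> 		pmap_extract(p->p_vmspace->vm_map.pmap, hva, &hpa);
> 	}
> 
> 	/* ...then mirror it into the nested pmap at the faulting GPA. */
> 	pmap_enter(vcpu->vc_parent->vm_pmap, pa, hpa,
> 	    PROT_READ | PROT_WRITE | PROT_EXEC, 0);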
> 
> I've been stress testing this on one of the most cursed Intel machines
> that is good at finding race conditions. I've also tested it lightly on
> an AMD machine I own to validate SVM, so more tests are very welcome at
> this point.
> 
> OKs welcome, but I'd prefer some test reports before looking to commit;
> since we've recently been unlocked from release prep, sooner rather than
> later would be appreciated.
> 
> Note: this doesn't nuke uvm_share(), but that can come afterwards if
> this lands and nobody needs/wants it.
> 
> -dv
> 
> diff refs/heads/master refs/heads/vmm-aobj
> commit - ade9dbe6546b6b7741371ee61e71f48f219d977a
> commit + ea39d23aa67862930c7cbccaaa31182d2d11a86f
> blob - e3205f48eedda8f4158d9a8d42e6164534921844
> blob + e58730ec23702850524db522dd82d750589d4c18
> --- sys/arch/amd64/amd64/vmm_machdep.c
> +++ sys/arch/amd64/amd64/vmm_machdep.c
> @@ -115,6 +115,7 @@ int vmm_inject_db(struct vcpu *);
>  void vmx_handle_intr(struct vcpu *);
>  void vmx_handle_misc_enable_msr(struct vcpu *);
>  int vmm_get_guest_memtype(struct vm *, paddr_t);
> +vaddr_t vmm_translate_gpa(struct vm *, paddr_t);
>  int vmx_get_guest_faulttype(void);
>  int svm_get_guest_faulttype(struct vmcb *);
>  int vmx_get_exit_qualification(uint64_t *);
> @@ -905,54 +906,19 @@ vmx_remote_vmclear(struct cpu_info *ci, struct vcpu *v
>  int
>  vm_impl_init(struct vm *vm, struct proc *p)
>  {
> -	int i, mode, ret;
> -	vaddr_t mingpa, maxgpa;
> -	struct vm_mem_range *vmr;
> -
>  	/* If not EPT or RVI, nothing to do here */
>  	switch (vmm_softc->mode) {
>  	case VMM_MODE_EPT:
> -		mode = PMAP_TYPE_EPT;
> +		pmap_convert(vm->vm_pmap, PMAP_TYPE_EPT);
>  		break;
>  	case VMM_MODE_RVI:
> -		mode = PMAP_TYPE_RVI;
> +		pmap_convert(vm->vm_pmap, PMAP_TYPE_RVI);
>  		break;
>  	default:
>  		printf("%s: invalid vmm mode %d\n", __func__, vmm_softc->mode);
>  		return (EINVAL);
>  	}
> 
> -	vmr = &vm->vm_memranges[0];
> -	mingpa = vmr->vmr_gpa;
> -	vmr = &vm->vm_memranges[vm->vm_nmemranges - 1];
> -	maxgpa = vmr->vmr_gpa + vmr->vmr_size;
> -
> -	/*
> -	 * uvmspace_alloc (currently) always returns a valid vmspace
> -	 */
> -	vm->vm_vmspace = uvmspace_alloc(mingpa, maxgpa, TRUE, FALSE);
> -	vm->vm_map = &vm->vm_vmspace->vm_map;
> -
> -	/* Map the new map with an anon */
> -	DPRINTF("%s: created vm_map @ %p\n", __func__, vm->vm_map);
> -	for (i = 0; i < vm->vm_nmemranges; i++) {
> -		vmr = &vm->vm_memranges[i];
> -		ret = uvm_share(vm->vm_map, vmr->vmr_gpa,
> -		    PROT_READ | PROT_WRITE | PROT_EXEC,
> -		    &p->p_vmspace->vm_map, vmr->vmr_va, vmr->vmr_size);
> -		if (ret) {
> -			printf("%s: uvm_share failed (%d)\n", __func__, ret);
> -			/* uvmspace_free calls pmap_destroy for us */
> -			KERNEL_LOCK();
> -			uvmspace_free(vm->vm_vmspace);
> -			vm->vm_vmspace = NULL;
> -			KERNEL_UNLOCK();
> -			return (ENOMEM);
> -		}
> -	}
> -
> -	pmap_convert(vm->vm_map->pmap, mode);
> -
>  	return (0);
>  }
> 
> @@ -1666,7 +1632,7 @@ vcpu_reset_regs_svm(struct vcpu *vcpu, struct vcpu_reg
> 
>  	/* NPT */
>  	vmcb->v_np_enable = SVM_ENABLE_NP;
> -	vmcb->v_n_cr3 = vcpu->vc_parent->vm_map->pmap->pm_pdirpa;
> +	vmcb->v_n_cr3 = vcpu->vc_parent->vm_pmap->pm_pdirpa;
> 
>  	/* SEV */
>  	if (vcpu->vc_sev)
> @@ -1680,7 +1646,7 @@ vcpu_reset_regs_svm(struct vcpu *vcpu, struct vcpu_reg
>  	/* xcr0 power on default sets bit 0 (x87 state) */
>  	vcpu->vc_gueststate.vg_xcr0 = XFEATURE_X87 & xsave_mask;
> 
> -	vcpu->vc_parent->vm_map->pmap->eptp = 0;
> +	vcpu->vc_parent->vm_pmap->eptp = 0;
> 
>  	return ret;
>  }
> @@ -2601,7 +2567,7 @@ vcpu_init_vmx(struct vcpu *vcpu)
>  	}
> 
>  	/* Configure EPT Pointer */
> -	eptp = vcpu->vc_parent->vm_map->pmap->pm_pdirpa;
> +	eptp = vcpu->vc_parent->vm_pmap->pm_pdirpa;
>  	msr = rdmsr(IA32_VMX_EPT_VPID_CAP);
>  	if (msr & IA32_EPT_VPID_CAP_PAGE_WALK_4) {
>  		/* Page walk length 4 supported */
> @@ -2625,7 +2591,7 @@ vcpu_init_vmx(struct vcpu *vcpu)
>  		goto exit;
>  	}
> 
> -	vcpu->vc_parent->vm_map->pmap->eptp = eptp;
> +	vcpu->vc_parent->vm_pmap->eptp = eptp;
> 
>  	/* Host CR0 */
>  	cr0 = rcr0() & ~CR0_TS;
> @@ -3490,7 +3456,7 @@ vmm_translate_gva(struct vcpu *vcpu, uint64_t va, uint
> 
>  		DPRINTF("%s: read pte level %d @ GPA 0x%llx\n", __func__,
>  		    level, pte_paddr);
> -		if (!pmap_extract(vcpu->vc_parent->vm_map->pmap, pte_paddr,
> +		if (!pmap_extract(vcpu->vc_parent->vm_pmap, pte_paddr,
>  		    &hpa)) {
>  			DPRINTF("%s: cannot extract HPA for GPA 0x%llx\n",
>  			    __func__, pte_paddr);
> @@ -3671,11 +3637,11 @@ vcpu_run_vmx(struct vcpu *vcpu, struct vm_run_params *
> 
>  			/* We're now using this vcpu's EPT pmap on this cpu. */
>  			atomic_swap_ptr(&ci->ci_ept_pmap,
> -			    vcpu->vc_parent->vm_map->pmap);
> +			    vcpu->vc_parent->vm_pmap);
> 
>  			/* Invalidate EPT cache. */
>  			vid_ept.vid_reserved = 0;
> -			vid_ept.vid_eptp = vcpu->vc_parent->vm_map->pmap->eptp;
> +			vid_ept.vid_eptp = vcpu->vc_parent->vm_pmap->eptp;
>  			if (invept(ci->ci_vmm_cap.vcc_vmx.vmx_invept_mode,
>  			    &vid_ept)) {
>  				printf("%s: invept\n", __func__);
> @@ -4461,6 +4427,29 @@ vmm_get_guest_memtype(struct vm *vm, paddr_t gpa)
>  	return (VMM_MEM_TYPE_UNKNOWN);
>  }
> 
> +vaddr_t
> +vmm_translate_gpa(struct vm *vm, paddr_t gpa)
> +{
> +	int i = 0;
> +	vaddr_t hva = 0;
> +	struct vm_mem_range *vmr = NULL;
> +
> +	/*
> +	 * Translate GPA -> userland HVA in proc p. Find the memory range
> +	 * and use it to translate to the HVA.
> +	 */
> +	for (i = 0; i < vm->vm_nmemranges; i++) {
> +		vmr = &vm->vm_memranges[i];
> +		if (gpa >= vmr->vmr_gpa && gpa < vmr->vmr_gpa + vmr->vmr_size) {
> +			hva = vmr->vmr_va + (gpa - vmr->vmr_gpa);
> +			break;
> +		}
> +	}
> +
> +	return (hva);
> +}
> +
> +
>  /*
>   * vmx_get_exit_qualification
>   *
> @@ -4542,15 +4531,45 @@ svm_get_guest_faulttype(struct vmcb *vmcb)
>  int
>  svm_fault_page(struct vcpu *vcpu, paddr_t gpa)
>  {
> -	paddr_t pa = trunc_page(gpa);
> -	int ret;
> +	struct proc *p = curproc;
> +	paddr_t hpa, pa = trunc_page(gpa);
> +	vaddr_t hva;
> +	int ret = 1;
> 
> -	ret = uvm_fault_wire(vcpu->vc_parent->vm_map, pa, pa + PAGE_SIZE,
> -	    PROT_READ | PROT_WRITE | PROT_EXEC);
> -	if (ret)
> -		printf("%s: uvm_fault returns %d, GPA=0x%llx, rip=0x%llx\n",
> -		    __func__, ret, (uint64_t)gpa, vcpu->vc_gueststate.vg_rip);
> +	hva = vmm_translate_gpa(vcpu->vc_parent, pa);
> +	if (hva == 0) {
> +		printf("%s: unable to translate gpa 0x%llx\n", __func__,
> +		    (uint64_t)pa);
> +		return (EINVAL);
> +	}
> 
> +	/* If we don't already have a backing page... */
> +	if (!pmap_extract(p->p_vmspace->vm_map.pmap, hva, &hpa)) {
> +		/* ...fault a RW page into the p's address space... */
> +		ret = uvm_fault_wire(&p->p_vmspace->vm_map, hva,
> +		    hva + PAGE_SIZE, PROT_READ | PROT_WRITE);
> +		if (ret) {
> +			printf("%s: uvm_fault failed %d hva=0x%llx\n", __func__,
> +			    ret, (uint64_t)hva);
> +			return (ret);
> +		}
> +
> +		/* ...and then get the mapping. */
> +		if (!pmap_extract(p->p_vmspace->vm_map.pmap, hva, &hpa)) {
> +			printf("%s: failed to extract hpa for hva 0x%llx\n",
> +			    __func__, (uint64_t)hva);
> +			return (EINVAL);
> +		}
> +	}
> +
> +	/* Now we insert a RWX mapping into the guest's RVI pmap. */
> +	ret = pmap_enter(vcpu->vc_parent->vm_pmap, pa, hpa,
> +	    PROT_READ | PROT_WRITE | PROT_EXEC, 0);
> +	if (ret) {
> +		printf("%s: pmap_enter failed pa=0x%llx, hpa=0x%llx\n",
> +		    __func__, (uint64_t)pa, (uint64_t)hpa);
> +	}
> +
>  	return (ret);
>  }
> 
> @@ -4617,8 +4636,10 @@ svm_handle_np_fault(struct vcpu *vcpu)
>  int
>  vmx_fault_page(struct vcpu *vcpu, paddr_t gpa)
>  {
> +	struct proc *p = curproc;
>  	int fault_type, ret;
> -	paddr_t pa = trunc_page(gpa);
> +	paddr_t hpa, pa = trunc_page(gpa);
> +	vaddr_t hva;
> 
>  	fault_type = vmx_get_guest_faulttype();
>  	switch (fault_type) {
> @@ -4633,19 +4654,45 @@ vmx_fault_page(struct vcpu *vcpu, paddr_t gpa)
>  		break;
>  	}
> 
> -	/* We may sleep during uvm_fault_wire(), so reload VMCS. */
> -	vcpu->vc_last_pcpu = curcpu();
> -	ret = uvm_fault_wire(vcpu->vc_parent->vm_map, pa, pa + PAGE_SIZE,
> -	    PROT_READ | PROT_WRITE | PROT_EXEC);
> -	if (vcpu_reload_vmcs_vmx(vcpu)) {
> -		printf("%s: failed to reload vmcs\n", __func__);
> +	hva = vmm_translate_gpa(vcpu->vc_parent, pa);
> +	if (hva == 0) {
> +		printf("%s: unable to translate gpa 0x%llx\n", __func__,
> +		    (uint64_t)pa);
>  		return (EINVAL);
>  	}
> 
> -	if (ret)
> -		printf("%s: uvm_fault returns %d, GPA=0x%llx, rip=0x%llx\n",
> -		    __func__, ret, (uint64_t)gpa, vcpu->vc_gueststate.vg_rip);
> +	/* If we don't already have a backing page... */
> +	if (!pmap_extract(p->p_vmspace->vm_map.pmap, hva, &hpa)) {
> +		/* ...fault a RW page into the p's address space... */
> +		vcpu->vc_last_pcpu = curcpu(); /* uvm_fault may sleep. */
> +		ret = uvm_fault_wire(&p->p_vmspace->vm_map, hva,
> +		    hva + PAGE_SIZE, PROT_READ | PROT_WRITE);
> +		if (ret) {
> +			printf("%s: uvm_fault failed %d hva=0x%llx\n", __func__,
> +			    ret, (uint64_t)hva);
> +			return (ret);
> +		}
> +		if (vcpu_reload_vmcs_vmx(vcpu)) {
> +			printf("%s: failed to reload vmcs\n", __func__);
> +			return (EINVAL);
> +		}
> 
> +		/* ...and then get the mapping. */
> +		if (!pmap_extract(p->p_vmspace->vm_map.pmap, hva, &hpa)) {
> +			printf("%s: failed to extract hpa for hva 0x%llx\n",
> +			    __func__, (uint64_t)hva);
> +			return (EINVAL);
> +		}
> +	}
> +
> +	/* Now we insert a RWX mapping into the guest's EPT pmap. */
> +	ret = pmap_enter(vcpu->vc_parent->vm_pmap, pa, hpa,
> +	    PROT_READ | PROT_WRITE | PROT_EXEC, 0);
> +	if (ret) {
> +		printf("%s: pmap_enter failed pa=0x%llx, hpa=0x%llx\n",
> +		    __func__, (uint64_t)pa, (uint64_t)hpa);
> +	}
> +
>  	return (ret);
>  }
> 
> @@ -4930,7 +4977,7 @@ vmx_load_pdptes(struct vcpu *vcpu)
>  		return (EINVAL);
>  	}
> 
> -	if (!pmap_extract(vcpu->vc_parent->vm_map->pmap, (vaddr_t)cr3,
> +	if (!pmap_extract(vcpu->vc_parent->vm_pmap, (vaddr_t)cr3,
>  	    (paddr_t *)&cr3_host_phys)) {
>  		DPRINTF("%s: nonmapped guest CR3, setting PDPTEs to 0\n",
>  		    __func__);
> @@ -6459,7 +6506,7 @@ vmm_update_pvclock(struct vcpu *vcpu)
> 
>  	if (vcpu->vc_pvclock_system_gpa & PVCLOCK_SYSTEM_TIME_ENABLE) {
>  		pvclock_gpa = vcpu->vc_pvclock_system_gpa & 0xFFFFFFFFFFFFFFF0;
> -		if (!pmap_extract(vm->vm_map->pmap, pvclock_gpa, &pvclock_hpa))
> +		if (!pmap_extract(vm->vm_pmap, pvclock_gpa, &pvclock_hpa))
>  			return (EINVAL);
>  		pvclock_ti = (void*) PMAP_DIRECT_MAP(pvclock_hpa);
> 
> blob - 0a86ddbecd3fbd3627336902c7209f6d023bc3f0
> blob + f5bfc6cce441f23bc97fd94ef63684e59b3161cd
> --- sys/dev/vmm/vmm.c
> +++ sys/dev/vmm/vmm.c
> @@ -25,6 +25,9 @@
>  #include <sys/malloc.h>
>  #include <sys/signalvar.h>
> 
> +#include <uvm/uvm_extern.h>
> +#include <uvm/uvm_aobj.h>
> +
>  #include <machine/vmmvar.h>
> 
>  #include <dev/vmm/vmm.h>
> @@ -356,6 +359,8 @@ vm_create(struct vm_create_params *vcp, struct proc *p
>  	size_t memsize;
>  	struct vm *vm;
>  	struct vcpu *vcpu;
> +	struct uvm_object *uao;
> +	unsigned int uvmflag = 0;
> 
>  	memsize = vm_create_check_mem_ranges(vcp);
>  	if (memsize == 0)
> @@ -378,13 +383,48 @@ vm_create(struct vm_create_params *vcp, struct proc *p
>  	/* Instantiate and configure the new vm. */
>  	vm = pool_get(&vm_pool, PR_WAITOK | PR_ZERO);
> 
> +	/* Create the VM's identity. */
>  	vm->vm_creator_pid = p->p_p->ps_pid;
> +	strncpy(vm->vm_name, vcp->vcp_name, VMM_MAX_NAME_LEN - 1);
> +
> +	/* Create the pmap for nested paging. */
> +	vm->vm_pmap = pmap_create();
> +
> +	/* Initialize memory slots. */
>  	vm->vm_nmemranges = vcp->vcp_nmemranges;
>  	memcpy(vm->vm_memranges, vcp->vcp_memranges,
>  	    vm->vm_nmemranges * sizeof(vm->vm_memranges[0]));
> -	vm->vm_memory_size = memsize;
> -	strncpy(vm->vm_name, vcp->vcp_name, VMM_MAX_NAME_LEN - 1);
> +	vm->vm_memory_size = memsize; /* Calculated above. */
> +	uvmflag = UVM_MAPFLAG(PROT_READ | PROT_WRITE, PROT_READ | PROT_WRITE,
> +	    MAP_INHERIT_SHARE, 0, UVM_FLAG_FIXED);
> +	for (i = 0; i < vm->vm_nmemranges; i++) {
> +		uao = NULL;
> +		if (vm->vm_memranges[i].vmr_type != VM_MEM_MMIO) {
> +			uao = uao_create(vm->vm_memranges[i].vmr_size,
> +			    UAO_FLAG_CANFAIL);
> +			if (uao == NULL) {
> +				printf("%s: failed to initialize memory slot\n",
> +				    __func__);
> +				vm_teardown(&vm);
> +				return (ENOMEM);
> +			}
> 
> +			/* Map the UVM aobj into the process. */
> +			uao_reference(uao);
> +			ret = uvm_map(&p->p_vmspace->vm_map,
> +			    &vm->vm_memranges[i].vmr_va,
> +			    vm->vm_memranges[i].vmr_size, uao, 0, 0, uvmflag);
> +			if (ret) {
> +				printf("%s: uvm_map failed: %d\n", __func__,
> +				    ret);
> +				vm_teardown(&vm);
> +				return (ENOMEM);
> +			}
> +		}
> +		vm->vm_memory_slot[i] = uao;
> +	}
> +
> +
>  	if (vm_impl_init(vm, p)) {
>  		printf("failed to init arch-specific features for vm %p\n", vm);
>  		vm_teardown(&vm);
> @@ -487,13 +527,13 @@ vm_create_check_mem_ranges(struct vm_create_params *vc
>  		 * Calling uvm_share() when creating the VM will take care of
>  		 * further checks.
>  		 */
> -		if (vmr->vmr_va < VM_MIN_ADDRESS ||
> +/*		if (vmr->vmr_va < VM_MIN_ADDRESS ||
>  		    vmr->vmr_va >= VM_MAXUSER_ADDRESS ||
>  		    vmr->vmr_size >= VM_MAXUSER_ADDRESS - vmr->vmr_va) {
>  			DPRINTF("guest va not within range or wraps\n");
>  			return (0);
>  		}
> -
> +*/
>  		/*
>  		 * Make sure that guest physical memory ranges do not overlap
>  		 * and that they are ascending.
> @@ -529,10 +569,11 @@ vm_create_check_mem_ranges(struct vm_create_params *vc
>  void
>  vm_teardown(struct vm **target)
>  {
> -	size_t nvcpu = 0;
> +	size_t i, nvcpu = 0;
> +	vaddr_t sva, eva;
>  	struct vcpu *vcpu, *tmp;
>  	struct vm *vm = *target;
> -	struct vmspace *vm_vmspace;
> +	struct uvm_object *uao;
> 
>  	KERNEL_ASSERT_UNLOCKED();
> 
> @@ -545,17 +586,25 @@ vm_teardown(struct vm **target)
>  		nvcpu++;
>  	}
> 
> -	vm_impl_deinit(vm);
> +	/* Remove guest mappings from our nested page tables. */
> +	for (i = 0; i < vm->vm_nmemranges; i++) {
> +		sva = vm->vm_memranges[i].vmr_gpa;
> +		eva = sva + vm->vm_memranges[i].vmr_size - 1;
> +		pmap_remove(vm->vm_pmap, sva, eva);
> +	}
> 
> -	/* teardown guest vmspace */
> -	KERNEL_LOCK();
> -	vm_vmspace = vm->vm_vmspace;
> -	if (vm_vmspace != NULL) {
> -		vm->vm_vmspace = NULL;
> -		uvmspace_free(vm_vmspace);
> +	/* Release UVM anon objects backing our guest memory. */
> +	for (i = 0; i < vm->vm_nmemranges; i++) {
> +		uao = vm->vm_memory_slot[i];
> +		vm->vm_memory_slot[i] = NULL;
> +		if (uao != NULL)
> +			uao_detach(uao);
>  	}
> -	KERNEL_UNLOCK();
> 
> +	/* At this point, no UVM-managed pages should reference our pmap. */
> +	pmap_destroy(vm->vm_pmap);
> +	vm->vm_pmap = NULL;
> +
>  	pool_put(&vm_pool, vm);
>  	*target = NULL;
>  }
> @@ -612,7 +661,7 @@ vm_get_info(struct vm_info_params *vip)
> 
>  		out[i].vir_memory_size = vm->vm_memory_size;
>  		out[i].vir_used_size =
> -		    pmap_resident_count(vm->vm_map->pmap) * PAGE_SIZE;
> +		    pmap_resident_count(vm->vm_pmap) * PAGE_SIZE;
>  		out[i].vir_ncpus = vm->vm_vcpu_ct;
>  		out[i].vir_id = vm->vm_id;
>  		out[i].vir_creator_pid = vm->vm_creator_pid;
> @@ -804,6 +853,8 @@ vm_share_mem(struct vm_sharemem_params *vsp, struct pr
>  	size_t i, n;
>  	struct vm *vm;
>  	struct vm_mem_range *src, *dst;
> +	struct uvm_object *uao;
> +	unsigned int uvmflags;
> 
>  	ret = vm_find(vsp->vsp_vm_id, &vm);
>  	if (ret)
> @@ -840,6 +891,8 @@ vm_share_mem(struct vm_sharemem_params *vsp, struct pr
>  	 * not need PROC_EXEC as the emulated devices do not need to execute
>  	 * instructions from guest memory.
>  	 */
> +	uvmflags = UVM_MAPFLAG(PROT_READ | PROT_WRITE, PROT_READ | PROT_WRITE,
> +	    MAP_INHERIT_SHARE, 0, UVM_FLAG_FIXED);
>  	for (i = 0; i < n; i++) {
>  		src = &vm->vm_memranges[i];
>  		dst = &vsp->vsp_memranges[i];
> @@ -848,14 +901,16 @@ vm_share_mem(struct vm_sharemem_params *vsp, struct pr
>  		if (src->vmr_type == VM_MEM_MMIO)
>  			continue;
> 
> -		DPRINTF("sharing gpa=0x%lx for pid %d @ va=0x%lx\n",
> -		    src->vmr_gpa, p->p_p->ps_pid, dst->vmr_va);
> -		ret = uvm_share(&p->p_vmspace->vm_map, dst->vmr_va,
> -		    PROT_READ | PROT_WRITE, vm->vm_map, src->vmr_gpa,
> -		    src->vmr_size);
> +		uao = vm->vm_memory_slot[i];
> +		KASSERT(uao != NULL);
> +
> +		uao_reference(uao);
> +		ret = uvm_map(&p->p_p->ps_vmspace->vm_map, &dst->vmr_va,
> +		    src->vmr_size, uao, 0, 0, uvmflags);
>  		if (ret) {
> -			printf("%s: uvm_share failed (%d)\n", __func__, ret);
> -			break;
> +			printf("%s: uvm_map failed: %d\n", __func__, ret);
> +			uao_detach(uao);
> +			goto out;
>  		}
>  	}
>  	ret = 0;
> blob - ca5a152f550c3795714128f1a0d764dd2e593e59
> blob + 9017362553a0b9dad5ecb34c8ca9da69c66180b9
> --- sys/dev/vmm/vmm.h
> +++ sys/dev/vmm/vmm.h
> @@ -168,14 +168,17 @@ enum {
>   *	V	vmm_softc's vm_lock
>   */
>  struct vm {
> -	struct vmspace		 *vm_vmspace;		/* [K] */
> -	vm_map_t		 vm_map;		/* [K] */
> +	pmap_t			 vm_pmap;		/* [r] */
> +
>  	uint32_t		 vm_id;			/* [I] */
>  	pid_t			 vm_creator_pid;	/* [I] */
> +
>  	size_t			 vm_nmemranges;		/* [I] */
>  	size_t			 vm_memory_size;	/* [I] */
> -	char			 vm_name[VMM_MAX_NAME_LEN];
>  	struct vm_mem_range	 vm_memranges[VMM_MAX_MEM_RANGES];
> +	struct uvm_object	*vm_memory_slot[VMM_MAX_MEM_RANGES]; /* [I] */
> +
> +	char			 vm_name[VMM_MAX_NAME_LEN];
>  	struct refcnt		 vm_refcnt;		/* [a] */
> 
>  	struct vcpu_head	 vm_vcpu_list;		/* [v] */
> blob - e399c0c0439a02117bc53c62cbb59c950b176773
> blob + ed8cfb028d218f73fd9feeeb02af36030bfd4c7d
> --- usr.sbin/vmd/vm.c
> +++ usr.sbin/vmd/vm.c
> @@ -808,6 +808,15 @@ alloc_guest_mem(struct vmd_vm *vm)
>  		vmr->vmr_va = (vaddr_t)p;
>  	}
> 
> +	/*
> +	 * XXX for now, we munmap(2) because vmm(4) will remap uvm aobjs at
> +	 * these addresses. Updating VMM_IOC_CREATE and VMM_IOC_SHAREMEM to
> +	 * do this for us will let us remove the mmap(2)/munmap(2) dance.
> +	 */
> +	for (i = 0; i < vcp->vcp_nmemranges; i++) {
> +		vmr = &vcp->vcp_memranges[i];
> +		munmap((void *)vmr->vmr_va, vmr->vmr_size);
> +	}
>  	return (ret);
>  }
> 
> @@ -1448,6 +1457,9 @@ remap_guest_mem(struct vmd_vm *vm, int vmm_fd)
>  	/*
>  	 * munmap(2) now that we have va's and ranges that don't overlap. vmm
>  	 * will use the va's and sizes to recreate the mappings for us.
> +	 *
> +	 * XXX can remove this once VMM_IOC_SHAREMEM handles mapping aobj's
> +	 * for us.
>  	 */
>  	for (i = 0; i < vsp.vsp_nmemranges; i++) {
>  		vmr = &vsp.vsp_memranges[i];
> 

-- 
wbr, Kirill