Index | Thread | Search

From:
Dave Voutila <dv@sisu.io>
Subject:
Re: [EXT] Re: Allow vmm guests to save/restore pkru via xsave region
To:
tech@openbsd.org
Date:
Tue, 02 Sep 2025 18:00:39 -0400

Download raw body.

Thread
Hans-Jörg Höxer <hshoexer@genua.de> writes:

> Hi,
>
> On Wed, Aug 20, 2025 at 04:12:07PM -0400, Dave Voutila wrote:
>> Dave Voutila <dv@sisu.io> writes:
>>
>> > Dave Voutila <dv@sisu.io> writes:
>> >
>> >> We're currently letting a guest see the PKU bit via cpuid(0x7, 0) but
>> >> then not exposing the XSAVE area bit for PKRU via cpuid(0xd, 0).
>> >>
>> >> This is fine for OpenBSD guests as we don't use XSAVE/XRSTOR to
>> >> save/load PKRU state on context switches given how we use PKU today for
>> >> x-only on Intel. (We explicitly use RDPKRU/WRPKRU.)
>> >>
>> >> Newer Linux guests see PKU support and then freak out when the support
>> >> for XSAVE doesn't exist. This puts them into a very "fun" state where
>> >> they don't panic but instead thrash about like a lunatic. It's sad,
>> >> really. See [1] for details.
>> >>
>> >> This diff exposes the cpuid bit for the PKRU xsave area if the host has
>> >> enabled PKU mode. It also adds handling to let the guest set the xcr0
>> >> bit to let XSAVE/XRSTOR work for the PKRU state.
>> >>
>> >> It does not change host PKU or XSAVE behavior.
>> >>
>> >> The cpuid emulation change is a bit hacky in reporting xsave area max
>> >> size and current size, but since the PKRU state comes after all the
>> >> AVX512 stuff it's basically just needing to use the size of `struct
>> >> savefpu` plus 64 bits (32 for PKRU, 32 for padding/alignment).
>> >
>> > Had a slight mistake in previous diff and was adding 64 bytes to the
>> > end, not 64 bits. Doh. Now this looks less like it's using a magic
>> > number of 64 and instead uses a uint64_t.
>> >
>>
>> This version makes a comment more intelligible and fixes what looks like
>> a related issue in the xsetbv handler (not masking to just the bits in
>> the xsave_mask that are XCR0 related).
>>
>> >>
>> >> ok? feedback?
>
> I can reproduce the issue with the current debian netinst; I'm seeing
> the hang from [1].  With your diff, kernel boots and netinst installation
> works.  And I see no regression with my OpenBSD guests.
>
> Diff looks good.
>

So before committing I did a bit more testing and found while this
allowed linux guests to boot, they were complaining about not being able
to compute the xsave area needed for pkru and disabling it.

It turns out we need to also emulate another subleaf (9) of cpuid 0xd to
disclose the size and offset of the PKRU area. The below diff fixes that
and the sample code [1] from the Linux man page works.

https://man7.org/linux/man-pages/man7/pkeys.7.html

> ok hshoexer
>
>> >>
>> >> [1] https://marc.info/?l=openbsd-misc&m=175505530816494&w=2
>> >>
>> >>
>>
>> diff refs/heads/master refs/heads/vmm-xsave-pku
>> commit - b981b677995ab0627cfaec84db748a34dc362f99
>> commit + 8808090b6a2e33830bfce786baad06590e4edbd1
>> blob - bd1909f33280ca49fa1c546c313f54ac3a52d9b1
>> blob + 471b12b223a8b12fdca49d7dd21df999516b6e23
>> --- sys/arch/amd64/amd64/vmm_machdep.c
>> +++ sys/arch/amd64/amd64/vmm_machdep.c
>> @@ -3468,8 +3468,8 @@ vmm_fpusave(struct vcpu *vcpu)
>>  	}
>>
>>  	/*
>> -	 * Save full copy of FPU state - guest content is always
>> -	 * a subset of host's save area (see xsetbv exit handler)
>> +	 * Save a copy of FPU state - guest content is always
>> +	 * a subset of host's save area. PKRU is saved separately.
>>  	 */
>>  	fpusavereset(&vcpu->vc_g_fpu);
>>  	vcpu->vc_fpuinited = 1;
>> @@ -5927,7 +5927,7 @@ svm_handle_xsetbv(struct vcpu *vcpu)
>>  int
>>  vmm_handle_xsetbv(struct vcpu *vcpu, uint64_t *rax)
>>  {
>> -	uint64_t *rdx, *rcx, val;
>> +	uint64_t *rdx, *rcx, val, mask = xsave_mask & XFEATURE_XCR0_MASK;
>>
>>  	rcx = &vcpu->vc_gueststate.vg_rcx;
>>  	rdx = &vcpu->vc_gueststate.vg_rdx;
>> @@ -5943,8 +5943,12 @@ vmm_handle_xsetbv(struct vcpu *vcpu, uint64_t *rax)
>>  		return (vmm_inject_gp(vcpu));
>>  	}
>>
>> +	/* If we're exposing PKRU features, allow guests to set PKRU in xcr0. */
>> +	if (vmm_softc->sc_md.pkru_enabled)
>> +		mask |= XFEATURE_PKRU;
>> +
>>  	val = *rax + (*rdx << 32);
>> -	if (val & ~xsave_mask) {
>> +	if (val & ~mask) {
>>  		DPRINTF("%s: guest specified xcr0 outside xsave_mask %lld\n",
>>  		    __func__, val);
>>  		return (vmm_inject_gp(vcpu));
>> @@ -6187,6 +6191,19 @@ vmm_handle_cpuid_0xd(struct vcpu *vcpu, uint32_t suble
>>  		}
>>  		eax = xsave_mask & XFEATURE_XCR0_MASK;
>>  		edx = (xsave_mask & XFEATURE_XCR0_MASK) >> 32;
>> +
>> +		/*
>> +		 * Emulate support for the pkru xsave region if the
>> +		 * host has pku enabled. This allow guests to enable
>> +		 * it in xcr0 and use xsave/xrstor on context switches
>> +		 * to save or restore pkru.
>> +		 */
>> +		if (vmm_softc->sc_md.pkru_enabled) {
>> +			eax |= XFEATURE_PKRU;
>> +			ecx = sizeof(struct savefpu) + sizeof(uint64_t);
>> +			if (xcr0 & XFEATURE_PKRU)
>> +				ebx = ecx;
>> +		}
>>  	} else if (subleaf == 1) {
>>  		/* mask out XSAVEC, XSAVES, and XFD support */
>>  		eax &= XSAVE_XSAVEOPT | XSAVE_XGETBV1;
>>