Download raw body.
Improving time keeping in x86 VMs
On Sun, 18 May 2025, Mike Larkin wrote:
> On Sat, May 17, 2025 at 05:39:35PM +0200, Stefan Fritsch wrote:
> > On Sat, 12 Apr 2025, Stefan Fritsch wrote:
> >
> > > On Fri, 11 Apr 2025, Mike Larkin wrote:
> > >
> > > > On Wed, Mar 12, 2025 at 09:27:50PM +0100, Stefan Fritsch wrote:
> > > > > On Tue, 11 Mar 2025, Dave Voutila wrote:
> > > > >
> > > > > > Stefan Fritsch <sf@openbsd.org> writes:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > after talking to bluhm@ about time keeping issues of linux guests on
> > > > > > > vmm, I have looked into making linux attach to vmm's pvclock. In doing
> > > > > > > so, I also found some issues in the pvclock(4) guest driver. So, here
> > > > > > > comes a larger diff that hopefully improves time keeping in VMs, both
> > > > > > > with openbsd as guest and as hypervisor.
> > > > > >
> > > > > > My kingdom for something to kill off my hacky Linux pvclock module.
> > > > > >
> > > > > > Some questions / comments below.
> > > > > >
> > > > > > >
> > > > > > > The diff changes in pvclock(4):
> > > > > > >
> > > > > > > * Fix integer overflows during multiplication by taking an assembler
> > > > > > > implementation from FreeBSD that uses 128 bit intermediate results.
> > > > > > >
> > > > > > > * Increase accuracy by disabling interrupts while reading the clock
> > > > > > >
> > > > > > > * Fix the not-TSC_STABLE handling which was busted because it wrongly
> > > > > > > compared 32 and 64 bit values. (This requires some atomic hackery on
> > > > > > > i386).
> > > > > > >
> > > > > > > * Add a timedelta sensor using the KVM WALL_CLOCK MSR, similar to the
> > > > > > > sensor in vmmci(4)
> > > > > >
> > > > > > Lovely...
> > > > >
> > > > > Today, I found a diff by Scott Cheloha from 2022 that was never
> > > > > committed and that I have missed before. It has some overlap with my
> > > > > pvclock diff. I will look at it some more and produce a combined diff.
> > >
> > > Sorry, I got side-tracked by other issues that were morge urgent. Updated
> > > diff below. The vmm part is unchanged but the pvclock part has several
> > > fixes, some inspired by Scott's earlier diff.
> > >
> > > This would need to be split up into several diffs before committing. I am
> > > not 100% sure that the memory clobbers in i386_atomic_testset_uq and
> > > i386_atomic_cas_uq are neccessary, but I think they are. If the new
> > > i386_atomic_* functions are not wanted in atomic.h, they could of course
> > > be moved into pvclock.c. But i386_atomic_testset_uq() is already in
> > > atomic.h.
> >
> > There was something wrong with the i386 assembler atomics that caused
> > havoc if the TSC stable bit was not set by the hypervisor. Updated diff
> > below.
> >
> > diff --git a/sys/arch/amd64/amd64/vmm_machdep.c b/sys/arch/amd64/amd64/vmm_machdep.c
> > index 0a3be15cd64..5a1570aa8d9 100644
> > --- a/sys/arch/amd64/amd64/vmm_machdep.c
> > +++ b/sys/arch/amd64/amd64/vmm_machdep.c
> > @@ -146,6 +146,7 @@ int svm_get_vmsa_pa(uint32_t, uint32_t, uint64_t *);
> > int vmm_gpa_is_valid(struct vcpu *vcpu, paddr_t gpa, size_t obj_size);
> > void vmm_init_pvclock(struct vcpu *, paddr_t);
> > int vmm_update_pvclock(struct vcpu *);
> > +void vmm_pv_wall_clock(struct vcpu *, paddr_t);
> > int vmm_pat_is_valid(uint64_t);
> >
> > #ifdef MULTIPROCESSOR
> > @@ -185,6 +186,7 @@ extern uint64_t tsc_frequency;
> > extern int tsc_is_invariant;
> >
> > const char *vmm_hv_signature = VMM_HV_SIGNATURE;
> > +const char *kvm_hv_signature = "KVMKVMKVM\0\0\0";
> >
> > const struct kmem_pa_mode vmm_kp_contig = {
> > .kp_constraint = &no_constraint,
> > @@ -5578,6 +5580,10 @@ vmx_handle_wrmsr(struct vcpu *vcpu)
> > vmm_init_pvclock(vcpu,
> > (*rax & 0xFFFFFFFFULL) | (*rdx << 32));
> > break;
> > + case KVM_MSR_WALL_CLOCK:
> > + vmm_pv_wall_clock(vcpu,
> > + (*rax & 0xFFFFFFFFULL) | (*rdx << 32));
> > + break;
> > #ifdef VMM_DEBUG
> > default:
> > /*
> > @@ -5639,6 +5645,10 @@ svm_handle_msr(struct vcpu *vcpu)
> > vmm_init_pvclock(vcpu,
> > (*rax & 0xFFFFFFFFULL) | (*rdx << 32));
> > break;
> > + case KVM_MSR_WALL_CLOCK:
> > + vmm_pv_wall_clock(vcpu,
> > + (*rax & 0xFFFFFFFFULL) | (*rdx << 32));
> > + break;
> > default:
> > /* Log the access, to be able to identify unknown MSRs */
> > DPRINTF("%s: wrmsr exit, msr=0x%llx, discarding data "
> > @@ -5994,6 +6004,20 @@ vmm_handle_cpuid(struct vcpu *vcpu)
> > *rcx = 0;
> > *rdx = 0;
> > break;
> > + case 0x40000100: /* Hypervisor information KVM */
> > + *rax = 0x40000101;
> > + *rbx = *((uint32_t *)&kvm_hv_signature[0]);
> > + *rcx = *((uint32_t *)&kvm_hv_signature[4]);
> > + *rdx = *((uint32_t *)&kvm_hv_signature[8]);
> > + break;
> > + case 0x40000101: /* KVM hypervisor features */
> > + *rax = (1 << KVM_FEATURE_CLOCKSOURCE2) |
> > + (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT) |
> > + (1 << KVM_FEATURE_NOP_IO_DELAY);
> > + *rbx = 0;
> > + *rcx = 0;
> > + *rdx = 0;
> > + break;
>
> This makes us claim we are kvm. Even though you are only advertising support
> for these 3 features, that makes a big assumption that every linux kernel out
> there respects a neutered/weird "KVM" like this.
>
> How many varieties of guest kernels did you test this on? I think we need to
> be very careful if we do this. We are basically claiming to be something
> that doesn't really exist anywhere else.
I have only tested this with relatively new linux kernels. One hypervisor
implementing some pv features of another in order to take advantage of
existing guest pv support is nothing unusual. For example qemu/KVM can
emulate features from Hyper-V and vmware. However, this is configurable.
If you don't want more knobs in vmd(4), I see two ways forward:
One could commit it anyway and see if there are any complaints until the
release. If there are major problems, one could still revert it. If there
are minor problems, maybe one could point users to Linux's nopv command
line option, which exists since 2019 / 5.3.
Or one could redefine the current VMM_HV_SIGNATURE feature bits in a way
that allows one to detect the new KVM_MSR_WALL_CLOCK support and then try
to get a patch into linux to detect VMM_HV_SIGNATURE and those feature
bits.
What do you think?
Cheers,
Stefan
>
> -ml
>
> > case 0x80000000: /* Extended function level */
> > /* We don't emulate past 0x8000001f currently. */
> > *rax = min(curcpu()->ci_pnfeatset, 0x8000001f);
> > @@ -6508,7 +6532,7 @@ vmm_update_pvclock(struct vcpu *vcpu)
> > (++vcpu->vc_pvclock_version << 1) | 0x1;
> >
> > pvclock_ti->ti_tsc_timestamp = rdtsc();
> > - nanotime(&tv);
> > + nanouptime(&tv);
> > pvclock_ti->ti_system_time =
> > tv.tv_sec * 1000000000L + tv.tv_nsec;
> > pvclock_ti->ti_tsc_shift = 12;
> > @@ -6522,6 +6546,36 @@ vmm_update_pvclock(struct vcpu *vcpu)
> > return (0);
> > }
> >
> > +void
> > +vmm_pv_wall_clock(struct vcpu *vcpu, paddr_t gpa)
> > +{
> > + struct pvclock_wall_clock *pvclock_wc;
> > + struct timespec tv;
> > + struct vm *vm = vcpu->vc_parent;
> > + paddr_t hpa;
> > +
> > + if (!vmm_gpa_is_valid(vcpu, gpa, sizeof(struct pvclock_wall_clock)))
> > + goto err;
> > +
> > + /* XXX: handle case when this struct goes over page boundaries */
> > + if ((gpa & PAGE_MASK) + sizeof(struct pvclock_wall_clock) > PAGE_SIZE)
> > + goto err;
> > +
> > + if (!pmap_extract(vm->vm_map->pmap, gpa, &hpa))
> > + goto err;
> > + pvclock_wc = (void*) PMAP_DIRECT_MAP(hpa);
> > + pvclock_wc->wc_version |= 0x1;
> > + nanoboottime(&tv);
> > + pvclock_wc->wc_sec = tv.tv_sec;
> > + pvclock_wc->wc_nsec = tv.tv_nsec;
> > + pvclock_wc->wc_version += 1;
> > +
> > + return;
> > +err:
> > + vmm_inject_gp(vcpu);
> > +}
> > +
> > +
> > int
> > vmm_pat_is_valid(uint64_t pat)
> > {
Improving time keeping in x86 VMs