From: Mike Larkin <mlarkin@nested.page>
Subject: Re: Improving time keeping in x86 VMs
To: Stefan Fritsch <sf@openbsd.org>
Cc: Dave Voutila <dv@sisu.io>, tech@openbsd.org
Date: Fri, 11 Apr 2025 00:12:03 -0700

On Wed, Mar 12, 2025 at 09:27:50PM +0100, Stefan Fritsch wrote:
> On Tue, 11 Mar 2025, Dave Voutila wrote:
>
> > Stefan Fritsch <sf@openbsd.org> writes:
> >
> > > Hi,
> > >
> > > after talking to bluhm@ about time keeping issues of linux guests on
> > > vmm, I have looked into making linux attach to vmm's pvclock. In doing
> > > so, I also found some issues in the pvclock(4) guest driver. So, here
> > > comes a larger diff that hopefully improves time keeping in VMs, both
> > > with openbsd as guest and as hypervisor.
> >
> > My kingdom for something to kill off my hacky Linux pvclock module.
> >
> > Some questions / comments below.
> >
> > >
> > > The diff changes in pvclock(4):
> > >
> > > * Fix integer overflows during multiplication by taking an assembler
> > >   implementation from FreeBSD that uses 128 bit intermediate results.
> > >
> > > * Increase accuracy by disabling interrupts while reading the clock
> > >
> > > * Fix the not-TSC_STABLE handling which was busted because it wrongly
> > >   compared 32 and 64 bit values. (This requires some atomic hackery on
> > >   i386).
> > >
> > > * Add a timedelta sensor using the KVM WALL_CLOCK MSR, similar to the
> > >   sensor in vmmci(4)
> >
> > Lovely...
>
> Today, I found a diff by Scott Cheloha from 2022 that was never
> committed and that I have missed before. It has some overlap with my
> pvclock diff. I will look at it some more and produce a combined diff.
>
> >
> > >
> > > In vmm(4):
> > >
> > > * Expose pvclock to linux guests by adding KVM compatible cpuid pages.
> > >   While there, also expose the NOP_IO_DELAY feature bit which may make
> > >   linux skip some ioport 80 reads. This should be ok at least until
> > >   vmm(4) implements device pass through.
> > >
> >
> > ...however, this makes me nervous.
> >
> > We've purposely not advertised ourselves as KVM in the past to avoid
> > possibly triggering any KVM-based code paths in Linux that assume
> > blindly that we are KVM.
> >
> > I'd like to see testing on a variety of Linux kernel versions to see if
> > there's fallout. Any modules that trust kvm_para_available() to detect
> > if the host is actually KVM may make assumptions that don't hold true.
>
> I have only done some testing with a linux 6.6 guest. It seems that all
> kvm features have feature bits. But of course there may be bugs in the
> feature detection or missing feature bit checks. I found one somewhat
> related bug fix in 5.16 (760849b14). It may make sense to add a knob to
> vmd so that the user can switch this on and off. That would be still a
> lot easier than installing a custom kernel module in the guest.
>

I don't know what ever came of this thread but please no "knobs added to
vmd" for this.

-ml

> >
> >
> > > * As there is only a single feature bit for both KVM_MSR_SYSTEM_TIME
> > >   and KVM_MSR_WALL_CLOCK, implement the latter in vmm.
> > >
> > > If the integer overflows in pvclock(4) cause problems or not depends
> > > on the multiplicator and shift values chosen by the hypervisor, which
> > > in turn depend on the CPU's TSC frequency. So there may be systems
> > > where this is not a problem at all, while on other systems it may
> > > cause great havoc.
> > >
> > > The vmm and pvclock parts of the diff work independently from each other.
> > > Testers and comments are welcome.
> > >
> > > Cheers,
> > > Stefan
> > >
> > >
> > > diff --git a/sys/arch/amd64/amd64/vmm_machdep.c b/sys/arch/amd64/amd64/vmm_machdep.c
> > > index e3205f48eed..f2ecb0668b0 100644
> > > --- a/sys/arch/amd64/amd64/vmm_machdep.c
> > > +++ b/sys/arch/amd64/amd64/vmm_machdep.c
> > > @@ -142,6 +142,7 @@ void svm_set_dirty(struct vcpu *, uint32_t);
> > >  int vmm_gpa_is_valid(struct vcpu *vcpu, paddr_t gpa, size_t obj_size);
> > >  void vmm_init_pvclock(struct vcpu *, paddr_t);
> > >  int vmm_update_pvclock(struct vcpu *);
> > > +void vmm_pv_wall_clock(struct vcpu *, paddr_t);
> > >  int vmm_pat_is_valid(uint64_t);
> > >
> > >  #ifdef MULTIPROCESSOR
> > > @@ -181,6 +182,7 @@ extern uint64_t tsc_frequency;
> > >  extern int tsc_is_invariant;
> > >
> > >  const char *vmm_hv_signature = VMM_HV_SIGNATURE;
> > > +const char *kvm_hv_signature = "KVMKVMKVM\0\0\0";
> > >
> > >  const struct kmem_pa_mode vmm_kp_contig = {
> > >  	.kp_constraint = &no_constraint,
> > > @@ -5555,6 +5557,10 @@ vmx_handle_wrmsr(struct vcpu *vcpu)
> > >  		vmm_init_pvclock(vcpu,
> > >  		    (*rax & 0xFFFFFFFFULL) | (*rdx  << 32));
> > >  		break;
> > > +	case KVM_MSR_WALL_CLOCK:
> > > +		vmm_pv_wall_clock(vcpu,
> > > +		    (*rax & 0xFFFFFFFFULL) | (*rdx  << 32));
> > > +		break;
> > >  #ifdef VMM_DEBUG
> > >  	default:
> > >  		/*
> > > @@ -5616,6 +5622,10 @@ svm_handle_msr(struct vcpu *vcpu)
> > >  			vmm_init_pvclock(vcpu,
> > >  			    (*rax & 0xFFFFFFFFULL) | (*rdx  << 32));
> > >  			break;
> > > +		case KVM_MSR_WALL_CLOCK:
> > > +			vmm_pv_wall_clock(vcpu,
> > > +			    (*rax & 0xFFFFFFFFULL) | (*rdx  << 32));
> > > +			break;
> > >  		default:
> > >  			/* Log the access, to be able to identify unknown MSRs */
> > >  			DPRINTF("%s: wrmsr exit, msr=0x%llx, discarding data "
> > > @@ -5971,6 +5981,20 @@ vmm_handle_cpuid(struct vcpu *vcpu)
> > >  		*rcx = 0;
> > >  		*rdx = 0;
> > >  		break;
> > > +	case 0x40000100:	/* Hypervisor information KVM */
> >
> > Where did you get these CPUIDs from? I thought 0x40000000 was what Linux
> > used for KVM detection per what we already do in VMM and:
> >
> >  https://www.kernel.org/doc/html/latest/virt/kvm/x86/cpuid.html
> >
> > I see there's something in linux/arch/x86/kvm/cpuid.c about masking off
> > 0x100 and using those values...but still not sure how this use of
> > 0x40000100 would somehow supersede what vmm(4) is already doing. Is this
> > some trick to not convince OpenBSD guests that we're KVM?
>
> The behavior introduced by qemu/kvm/linux and copied by many other OSes is
> to look for hypervisor signatures in steps of 0x100. OpenBSD pvbus(4) does
> this, too. This way, a hypervisor can offer paravirtualized interfaces
> compatible with several different hypervisors and the guest can pick the
> interface it supports. For example, if qemu/kvm is configured to offer
> HyperV interfaces, it will put the HyperV cpuid pages at 0x40000000 and
> the KVM cpuid pages at 0x40000100.
>
> >
> > > +		*rax = 0x40000101;
> > > +		*rbx = *((uint32_t *)&kvm_hv_signature[0]);
> > > +		*rcx = *((uint32_t *)&kvm_hv_signature[4]);
> > > +		*rdx = *((uint32_t *)&kvm_hv_signature[8]);
> > > +		break;
> > > +	case 0x40000101:	/* KVM hypervisor features */
> > > +		*rax = (1 << KVM_FEATURE_CLOCKSOURCE2) |
> > > +		    (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT) |
> > > +		    (1 << KVM_FEATURE_NOP_IO_DELAY);
> > > +		*rbx = 0;
> > > +		*rcx = 0;
> > > +		*rdx = 0;
> > > +		break;
> > >  	case 0x80000000:	/* Extended function level */
> > >  		/* We don't emulate past 0x8000001f currently. */
> > >  		*rax = min(curcpu()->ci_pnfeatset, 0x8000001f);
> > > @@ -6468,7 +6492,7 @@ vmm_update_pvclock(struct vcpu *vcpu)
> > >  		    (++vcpu->vc_pvclock_version << 1) | 0x1;
> > >
> > >  		pvclock_ti->ti_tsc_timestamp = rdtsc();
> > > -		nanotime(&tv);
> > > +		nanouptime(&tv);
> > >  		pvclock_ti->ti_system_time =
> > >  		    tv.tv_sec * 1000000000L + tv.tv_nsec;
> > >  		pvclock_ti->ti_tsc_shift = 12;
> > > @@ -6482,6 +6506,36 @@ vmm_update_pvclock(struct vcpu *vcpu)
> > >  	return (0);
> > >  }
> > >
> > > +void
> > > +vmm_pv_wall_clock(struct vcpu *vcpu, paddr_t gpa)
> > > +{
> > > +	struct pvclock_wall_clock *pvclock_wc;
> > > +	struct timespec tv;
> >
> >    ^^^^^ spaces vs. tabs
> >
> > > +        struct vm *vm = vcpu->vc_parent;
> > > +        paddr_t hpa;
> > > +
> > > +	if (!vmm_gpa_is_valid(vcpu, gpa, sizeof(struct pvclock_wall_clock)))
> > > +		goto err;
> > > +
> > > +	/* XXX: handle case when this struct goes over page boundaries */
> > > +	if ((gpa & PAGE_MASK) + sizeof(struct pvclock_wall_clock) > PAGE_SIZE)
> > > +		goto err;
> > > +
> > > +	if (!pmap_extract(vm->vm_map->pmap, gpa, &hpa))
> > > +		goto err;
> > > +	pvclock_wc = (void*) PMAP_DIRECT_MAP(hpa);
> > > +	pvclock_wc->wc_version |= 0x1;
> > > +	nanoboottime(&tv);
> > > +	pvclock_wc->wc_sec = tv.tv_sec;
> > > +	pvclock_wc->wc_nsec = tv.tv_nsec;
> > > +	pvclock_wc->wc_version += 1;
> > > +
> > > +	return;
> > > +err:
> > > +	vmm_inject_gp(vcpu);
> > > +}
> > > +
> > > +
> > >  int
> > >  vmm_pat_is_valid(uint64_t pat)
> > >  {
> > > diff --git a/sys/arch/i386/include/atomic.h b/sys/arch/i386/include/atomic.h
> > > index 45aedb819c8..e779b51869f 100644
> > > --- a/sys/arch/i386/include/atomic.h
> > > +++ b/sys/arch/i386/include/atomic.h
> > > @@ -274,7 +274,18 @@ static __inline u_int64_t
> > >  i386_atomic_testset_uq(volatile u_int64_t *ptr, u_int64_t val)
> > >  {
> > >  	__asm__ volatile ("\n1:\t" _LOCK " cmpxchg8b (%1); jnz 1b" : "+A" (val) :
> > > -	    "r" (ptr), "b" ((u_int32_t)val), "c" ((u_int32_t)(val >> 32)));
> > > +	    "r" (ptr), "b" ((u_int32_t)val), "c" ((u_int32_t)(val >> 32)) :
> > > +	    "cc");
> > > +	return val;
> > > +}
> > > +
> > > +static __inline u_int64_t
> > > +i386_atomic_load_uq(volatile u_int64_t *ptr)
> > > +{
> > > +	uint64_t val;
> > > +	__asm__ volatile ("movl %%ebx,%%eax; movl %%ecx, %%edx; "
> > > +	    _LOCK " cmpxchg8b (%1)" : "=&A" (val) : "r" (ptr) :
> > > +	    "cc");
> > >  	return val;
> > >  }
> > >
> > > diff --git a/sys/dev/pv/pvclock.c b/sys/dev/pv/pvclock.c
> > > index 994fc4a337c..c3f4c0049e0 100644
> > > --- a/sys/dev/pv/pvclock.c
> > > +++ b/sys/dev/pv/pvclock.c
> > > @@ -22,6 +22,7 @@
> > >
> > >  #include <sys/param.h>
> > >  #include <sys/systm.h>
> > > +#include <sys/timeout.h>
> > >  #include <sys/timetc.h>
> > >
> > >  #include <machine/cpu.h>
> > > @@ -31,13 +32,21 @@
> > >  #include <dev/pv/pvvar.h>
> > >  #include <dev/pv/pvreg.h>
> > >
> > > -uint pvclock_lastcount;
> > > +uint64_t pvclock_lastcount;
> > > +
> > > +struct pvpage {
> > > +	struct pvclock_time_info	ti;
> > > +	struct pvclock_wall_clock	wc;
> > > +};
> > >
> > >  struct pvclock_softc {
> > >  	struct device		 sc_dev;
> > > -	void			*sc_time;
> > > +	struct pvpage		*sc_page;
> > >  	paddr_t			 sc_paddr;
> > >  	struct timecounter	*sc_tc;
> > > +	struct ksensordev	 sc_sensordev;
> > > +	struct ksensor		 sc_sensor;
> > > +	struct timeout		 sc_tick;
> > >  };
> > >
> > >  #define DEVNAME(_s)			((_s)->sc_dev.dv_xname)
> > > @@ -46,12 +55,16 @@ int	 pvclock_match(struct device *, void *, void *);
> > >  void	 pvclock_attach(struct device *, struct device *, void *);
> > >  int	 pvclock_activate(struct device *, int);
> > >
> > > +uint64_t pvclock_get(struct timecounter *);
> > >  uint	 pvclock_get_timecount(struct timecounter *);
> > > +void	 pvclock_tick_hook(struct device *);
> > >
> > >  static inline uint32_t
> > >  	 pvclock_read_begin(const struct pvclock_time_info *);
> > >  static inline int
> > >  	 pvclock_read_done(const struct pvclock_time_info *, uint32_t);
> > > +static inline uint64_t
> > > +	 pvclock_scale_delta(uint64_t, uint32_t, int);
> > >
> > >  const struct cfattach pvclock_ca = {
> > >  	sizeof(struct pvclock_softc),
> > > @@ -119,26 +132,28 @@ void
> > >  pvclock_attach(struct device *parent, struct device *self, void *aux)
> > >  {
> > >  	struct pvclock_softc		*sc = (struct pvclock_softc *)self;
> > > +	struct pv_attach_args		*pva = aux;
> > >  	struct pvclock_time_info	*ti;
> > >  	paddr_t			 	 pa;
> > >  	uint32_t			 version;
> > >  	uint8_t				 flags;
> > > +	struct pvbus_hv			*kvm;
> > >
> > > -	if ((sc->sc_time = km_alloc(PAGE_SIZE,
> > > +	if ((sc->sc_page = km_alloc(PAGE_SIZE,
> > >  	    &kv_any, &kp_zero, &kd_nowait)) == NULL) {
> > >  		printf(": time page allocation failed\n");
> > >  		return;
> > >  	}
> > > -	if (!pmap_extract(pmap_kernel(), (vaddr_t)sc->sc_time, &pa)) {
> > > +	if (!pmap_extract(pmap_kernel(), (vaddr_t)sc->sc_page, &pa)) {
> > >  		printf(": time page PA extraction failed\n");
> > > -		km_free(sc->sc_time, PAGE_SIZE, &kv_any, &kp_zero);
> > > +		km_free(sc->sc_page, PAGE_SIZE, &kv_any, &kp_zero);
> > >  		return;
> > >  	}
> > >
> > >  	wrmsr(KVM_MSR_SYSTEM_TIME, pa | PVCLOCK_SYSTEM_TIME_ENABLE);
> > >  	sc->sc_paddr = pa;
> > >
> > > -	ti = sc->sc_time;
> > > +	ti = &sc->sc_page->ti;
> > >  	do {
> > >  		version = pvclock_read_begin(ti);
> > >  		flags = ti->ti_flags;
> > > @@ -162,6 +177,22 @@ pvclock_attach(struct device *parent, struct device *self, void *aux)
> > >
> > >  	tc_init(sc->sc_tc);
> > >
> > > +	/*
> > > +	 * The openbsd vmm pvclock does not support the WALL_CLOCK msr,
> > > +	 * therefore we look only for kvm.
> > > +	 */
> > > +	kvm = &pva->pva_hv[PVBUS_KVM];
> > > +	if (kvm->hv_features & (1 << KVM_FEATURE_CLOCKSOURCE2)) {
> > > +		strlcpy(sc->sc_sensordev.xname, sc->sc_dev.dv_xname,
> > > +		    sizeof(sc->sc_sensordev.xname));
> > > +		sc->sc_sensor.type = SENSOR_TIMEDELTA;
> > > +		sc->sc_sensor.status = SENSOR_S_UNKNOWN;
> > > +		sensor_attach(&sc->sc_sensordev, &sc->sc_sensor);
> > > +		sensordev_install(&sc->sc_sensordev);
> > > +
> > > +		config_mountroot(self, pvclock_tick_hook);
> > > +	}
> > > +
> > >  	printf("\n");
> > >  }
> > >
> > > @@ -200,17 +231,67 @@ pvclock_read_done(const struct pvclock_time_info *ti,
> > >  	return (ti->ti_version == version);
> > >  }
> > >
> > > -uint
> > > -pvclock_get_timecount(struct timecounter *tc)
> > > +static inline uint64_t
> > > +pvclock_scale_delta(uint64_t delta, uint32_t mul_frac, int shift)
> > > +{
> > > +	uint64_t product;
> > > +
> > > +	if (shift < 0)
> > > +		delta >>= -shift;
> > > +	else
> > > +		delta <<= shift;
> > > +#if defined(__i386__)
> > > +	{
> > > +		uint32_t tmp1, tmp2;
> > > +
> > > +		/**
> > > +		 * For i386, the formula looks like:
> > > +		 *
> > > +		 *   lower = (mul_frac * (delta & UINT_MAX)) >> 32
> > > +		 *   upper = mul_frac * (delta >> 32)
> > > +		 *   product = lower + upper
> > > +		 */
> > > +		__asm__ (
> > > +			"mul  %5       ; "
> > > +			"mov  %4,%%eax ; "
> > > +			"mov  %%edx,%4 ; "
> > > +			"mul  %5       ; "
> > > +			"xor  %5,%5    ; "
> > > +			"add  %4,%%eax ; "
> > > +			"adc  %5,%%edx ; "
> > > +			: "=A" (product), "=r" (tmp1), "=r" (tmp2)
> > > +			: "a" ((uint32_t)delta), "1" ((uint32_t)(delta >> 32)),
> > > +			  "2" (mul_frac) );
> > > +	}
> > > +#elif defined(__amd64__)
> > > +	{
> > > +		unsigned long tmp;
> > > +
> > > +		__asm__ (
> > > +			"mulq %[mul_frac] ; shrd $32, %[hi], %[lo]"
> > > +			: [lo]"=a" (product), [hi]"=d" (tmp)
> > > +			: "0" (delta), [mul_frac]"rm"((uint64_t)mul_frac));
> > > +	}
> > > +#else
> > > +#error "pvclock: unsupported x86 architecture?"
> > > +#endif
> > > +	return (product);
> > > +}
> > > +
> > > +uint64_t
> > > +pvclock_get(struct timecounter *tc)
> > >  {
> > >  	struct pvclock_softc		*sc = tc->tc_priv;
> > >  	struct pvclock_time_info	*ti;
> > >  	uint64_t			 tsc_timestamp, system_time, delta, ctr;
> > > +	uint64_t			 lastcount;
> > >  	uint32_t			 version, mul_frac;
> > >  	int8_t				 shift;
> > >  	uint8_t				 flags;
> > > +	int				 s;
> > >
> > > -	ti = sc->sc_time;
> > > +	ti = &sc->sc_page->ti;
> > > +	s = splhigh();
> > >  	do {
> > >  		version = pvclock_read_begin(ti);
> > >  		system_time = ti->ti_system_time;
> > > @@ -218,26 +299,82 @@ pvclock_get_timecount(struct timecounter *tc)
> > >  		mul_frac = ti->ti_tsc_to_system_mul;
> > >  		shift = ti->ti_tsc_shift;
> > >  		flags = ti->ti_flags;
> > > +		delta = rdtsc();
> > >  	} while (!pvclock_read_done(ti, version));
> > > +	splx(s);
> > >
> > >  	/*
> > >  	 * The algorithm is described in
> > > -	 * linux/Documentation/virtual/kvm/msr.txt
> > > +	 * linux/Documentation/virt/kvm/x86/msr.rst
> > >  	 */
> > > -	delta = rdtsc() - tsc_timestamp;
> > > -	if (shift < 0)
> > > -		delta >>= -shift;
> > > -	else
> > > -		delta <<= shift;
> > > -	ctr = ((delta * mul_frac) >> 32) + system_time;
> > > +	delta -= tsc_timestamp;
> > > +	ctr = pvclock_scale_delta(delta, mul_frac, shift) + system_time;
> > >
> > >  	if ((flags & PVCLOCK_FLAG_TSC_STABLE) != 0)
> > >  		return (ctr);
> > >
> > > -	if (ctr < pvclock_lastcount)
> > > -		return (pvclock_lastcount);
> > > -
> > > -	atomic_swap_uint(&pvclock_lastcount, ctr);
> > > +#if defined(__i386__)
> > > +	/*
> > > +	 * We are running on virtualization. Therefore we can assume that we
> > > +	 * have cmpxchg8b, available on pentium and newer.
> > > +	 */
> > > +	lastcount = i386_atomic_load_uq(&pvclock_lastcount);
> > > +#elif defined(__amd64__)
> > > +	lastcount = pvclock_lastcount;
> > > +#else
> > > +#error "pvclock: unsupported x86 architecture?"
> > > +#endif
> > > +	if (ctr < lastcount)
> > > +		return (lastcount);
> > > +
> > > +#if defined(__i386__)
> > > +	ctr = i386_atomic_testset_uq(&pvclock_lastcount, ctr);
> > > +#elif defined(__amd64__)
> > > +	atomic_swap_64(&pvclock_lastcount, ctr);
> > > +#else
> > > +#error "pvclock: unsupported x86 architecture?"
> > > +#endif
> > >
> > >  	return (ctr);
> > >  }
> > > +
> > > +uint
> > > +pvclock_get_timecount(struct timecounter *tc)
> > > +{
> > > +	return (pvclock_get(tc));
> > > +}
> > > +
> > > +void
> > > +pvclock_tick(void *arg)
> > > +{
> > > +	struct pvclock_softc		*sc = arg;
> > > +	struct timespec			 ts;
> > > +	struct pvclock_wall_clock	*wc = &sc->sc_page->wc;
> > > +	int64_t			 	 value;
> > > +
> > > +	wrmsr(KVM_MSR_WALL_CLOCK, sc->sc_paddr + offsetof(struct pvpage, wc));
> > > +	while (wc->wc_version & 0x1)
> > > +		virtio_membar_sync();
> > > +	if (wc->wc_sec) {
> > > +		nanotime(&ts);
> > > +		value = TIMESPEC_TO_NSEC(&ts) -
> > > +		    SEC_TO_NSEC(wc->wc_sec) - wc->wc_nsec -
> > > +		    pvclock_get(&pvclock_timecounter);
> > > +
> > > +		TIMESPEC_TO_TIMEVAL(&sc->sc_sensor.tv, &ts);
> > > +		sc->sc_sensor.value = value;
> > > +		sc->sc_sensor.status = SENSOR_S_OK;
> > > +	} else
> > > +		sc->sc_sensor.status = SENSOR_S_UNKNOWN;
> > > +
> > > +	timeout_add_sec(&sc->sc_tick, 15);
> > > +}
> > > +
> > > +void
> > > +pvclock_tick_hook(struct device *self)
> > > +{
> > > +	struct pvclock_softc	*sc = (struct pvclock_softc *)self;
> > > +
> > > +	timeout_set(&sc->sc_tick, pvclock_tick, sc);
> > > +	pvclock_tick(sc);
> > > +}
> >
>