
From: Mike Larkin <mlarkin@nested.page>
Subject: Re: vmd: add checksum offload for guests
To: Jan Klemkow <j.klemkow@wemelug.de>
Cc: David Gwynne <david@gwynne.id.au>, Dave Voutila <dv@sisu.io>, Klemens Nanni <kn@openbsd.org>, Alexander Bluhm <bluhm@openbsd.org>, tech@openbsd.org
Date: Thu, 15 Jan 2026 14:08:43 -0800

On Thu, Jan 15, 2026 at 11:04:45PM +0100, Jan Klemkow wrote:
> On Fri, Jan 16, 2026 at 07:35:35AM +1000, David Gwynne wrote:
> > On Thu, Jan 15, 2026 at 12:32:13PM -0800, Mike Larkin wrote:
> > > On Thu, Jan 15, 2026 at 09:00:18PM +0100, Jan Klemkow wrote:
> > > > On Thu, Jan 15, 2026 at 01:46:59PM -0500, Dave Voutila wrote:
> > > > > Jan Klemkow <j.klemkow@wemelug.de> writes:
> > > > >
> > > > > > On Wed, Jan 14, 2026 at 05:27:07PM -0500, Dave Voutila wrote:
> > > > > >> Jan Klemkow <j.klemkow@wemelug.de> writes:
> > > > > >>
> > > > > >> > On Wed, Jan 14, 2026 at 02:21:18PM -0500, Dave Voutila wrote:
> > > > > >> >> Jan Klemkow <jan@openbsd.org> writes:
> > > > > >> >>
> > > > > >> >> > On Sat, May 24, 2025 at 06:14:38AM +0000, Klemens Nanni wrote:
> > > > > >> >> >> 24.05.2025 06:33, Jan Klemkow ??????????:
> > > > > >> >> >> Still breaks:
> > > > > >> >> >>
> > > > > >> >> >> May 24 09:12:25 atar vmd[44493]: vionet_tx: bad source address 22:8d:47:b5:88:f6
> > > > > >> >> >> May 24 09:12:56 atar last message repeated 25 time
> > > > > >> >> >>
> > > > > >> >> >> Linux VM is completely offline.
> > > > > >> >> >
> > > > > >> >> > There was a bug in the csum_start and csum_offset calculation which is fixed in
> > > > > >> >> > the following diff.  I tested it successfully with Debian/Linux and OpenBSD
> > > > > >> >> > guests.
> > > > > >> >> >
> > > > > >> >> > This diff introduces optional checksum offloading for VMM guests.
> > > > > >> >> >
> > > > > >> >> > Tests are welcome.
> > > > > >> >> >
> > > > > >> >> > ok?
> > > > > >> >> >
> > > > > >> >>
> > > > > >> >> Questions in line about pledge changes.
> > > > > >> >>
> > > > > >> >> Other question is broader: why the need for memory copying with this
> > > > > >> >> offload feature?
> > > > > >> >
> > > > > >> >> I don't know how this offload works,
> > > > > >> >
> > > > > >> > When packets are just transmitted between host and guests, no one is
> > > > > >> > calculating the checksum at all.  Thus, we save two checksum calculations per
> > > > > >> > packet.  One on the sender side and another on the receiver side.  Just in case
> > > > > >> > of a transmit out of the machine via physical network, it needs to be
> > > > > >> > calculated.
> > > > > >> >
> > > > >
> > > > > So in reviewing the virtio spec I'm quite confused where checksumming is
> > > > > > occurring with this proposed change.
> > > > >
> > > > > From Virtio 1.2, 5.1.6.4.1 Device Requirements: Processing of Incoming
> > > > > Packets:
> > > > >
> > > > >   If the VIRTIO_NET_F_GUEST_CSUM feature has been negotiated, the device
> > > > >   MAY set the VIRTIO_NET_HDR_F_NEEDS_CSUM bit in flags, if so:
> > > > >
> > > > >   1. the device MUST validate the packet checksum at offset csum_offset
> > > > >      from csum_start as well as all preceding offsets;
> > > > >
> > > > >   2. the device MUST set the packet checksum stored in the receive buffer
> > > > >      to the TCP/UDP pseudo header;
> > > > >
> > > > >   3. the device MUST set csum_start and csum_offset such that
> > > > >      calculating a ones' complement checksum from csum_start up until
> > > > >      the end of the packet and storing the result at offset csum_offset
> > > > >      from csum_start will result in a fully checksummed packet;
> > > > >
> > > > > This reads like vionet needs to be computing checksums in some cases
> > > > > when we set VIRTIO_NET_HDR_F_NEEDS_CSUM...which appears to happen in
> > > > > thdr2vhdr() on the rx side.
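
To make the csum_start/csum_offset wording quoted above concrete, here is a
minimal sketch in C of what "completing" such a packet means.  It assumes the
checksum field already holds the pseudo header sum, as item 2 requires.  The
names csum_fold(), csum_complete() and csum_example() are made up for
illustration; they are not part of vmd(8), the kernel, or the diff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Ones' complement sum over len bytes, folded to 16 bits (RFC 1071 style). */
static uint16_t
csum_fold(const uint8_t *p, size_t len)
{
	uint64_t sum = 0;

	while (len > 1) {
		sum += ((uint32_t)p[0] << 8) | p[1];
		p += 2;
		len -= 2;
	}
	if (len == 1)			/* odd trailing byte, padded with 0 */
		sum += (uint32_t)p[0] << 8;
	while (sum >> 16)		/* fold the carries back in */
		sum = (sum & 0xffff) + (sum >> 16);

	return ((uint16_t)sum);
}

/*
 * Sum everything from csum_start to the end of the packet and store the
 * complement at csum_start + csum_offset.  Returns -1 if the offsets do
 * not fit inside the packet.
 */
int
csum_complete(uint8_t *pkt, size_t len, size_t csum_start, size_t csum_offset)
{
	uint16_t sum;

	if (csum_start > len || len - csum_start < 2 ||
	    csum_offset > len - csum_start - 2)
		return (-1);

	sum = (uint16_t)~csum_fold(pkt + csum_start, len - csum_start);
	pkt[csum_start + csum_offset] = sum >> 8;
	pkt[csum_start + csum_offset + 1] = sum & 0xff;
	return (0);
}

/* Toy 8 byte "segment"; bytes 2..3 hold a made-up pseudo header sum 0x0005. */
static void
csum_example(void)
{
	uint8_t seg[8] = { 0x12, 0x34, 0x00, 0x05, 0xab, 0xcd, 0x00, 0x01 };

	assert(csum_complete(seg, sizeof(seg), 0, 2) == 0);
	assert(seg[2] == 0x41 && seg[3] == 0xf8);
}
```

Nothing in this sketch runs on the host-to-guest path when both sides have
negotiated the offload; it only shows the work that is deferred until a packet
has to leave through an interface that cannot do it.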
> > > >
> > > > The other sections make the idea more clear:
> > > >
> > > > From Virtio 1.2, 5.1.5 Device Initialization:
> > > >
> > > > 	Note: For example, a network packet transported between two guests on
> > > > 	the same system might not need checksumming at all, nor segmentation,
> > > > 	if both guests are amenable.
> > > >
> > > > This means we just have to keep the information that the packet has an
> > > > incomplete checksum, in case the packet leaves our machine.
> > > >
> > > > Any guest just accepts a packet with VIRTIO_NET_HDR_F_NEEDS_CSUM, without
> > > > computing and verifying the checksum.  Also, our host system accepts this kind
> > > > of packet.  We store this information with the M_TCP_CSUM_OUT flag in our mbuf
> > > > pkthdr.  So, a device like em(4) with checksum offloading computes the checksum
> > > > for us before sending it on the wire.  Or, our stack is computing the checksum,
> > > > if the device doesn't have this feature.
> > > >
> > > > thdr2vhdr() and vhdr2thdr() just translate the feature bits between the tun_hdr
> > > > and virtio_net_hdr structs.  Nothing is computed.
> > > >
> > > > > (It also blindly sets that flag and assumes the guest/driver actually
> > > > > negotiated VIRTIO_NET_F_GUEST_CSUM, which is not a safe assumption.)
> > > >
> > > > We just activate the tun(4) offloading feature, if we successfully negotiated
> > > > the VIRTIO_NET_F_GUEST_CSUM feature in vionet_update_offload().
> > > >
> > > > The guest could set the VIRTIO_NET_HDR_F_NEEDS_CSUM flag regardless of a
> > > > negotiated VIRTIO_NET_F_GUEST_CSUM feature.  What should we do in this case?
> > > >
> > > > At the moment, we translate the bits anyway and keep this information.
> > > > We could also just ignore the bit?  Or, we could drop the packet?
> > > >
> > > > What's your favorite behavior?  I'm fine with everything.
> > > >
> > >
> > > As I read that I nodded that packets between host and guest might not need the
> > > checksumming. And you mentioned that you tested openbsd and linux guests.
> > >
> > > But what about two guests talking to each other via a veb or something like that?
> > >
> > > Who does the checksumming then?
> >
> > no one.
>
> Yes, when both guests support checksum offloading.
>
> In the case that the receiving guest has no checksum offload capability,
> just like a physical interface on the veb(4) without it, veb(4) computes the
> checksum before delivering the packet to the interface that lacks this
> capability.
>
> veb(4) itself provides checksum offloading and computes the checksum if
> tap(4) does not provide the IFCAP_CSUM_TCPv4 feature.  And tap(4) only
> provides this feature if it was activated via the TUNSCAP ioctl(2).
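
The fallback described above can be modeled as a small decision at delivery
time.  This is only a toy sketch of the rule as stated in this thread, not
veb(4) code: CAP_CSUM stands in for the IFCAP_CSUM_* bits, and struct port,
struct pkt and deliver() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical capability bit, standing in for IFCAP_CSUM_TCPv4 and friends. */
#define CAP_CSUM	0x1

struct port {
	unsigned int	caps;		/* offload capabilities of the egress port */
};

struct pkt {
	bool		csum_pending;	/* checksum field not yet filled in */
};

/*
 * Deliver a packet to an egress port.  Returns true if the checksum had
 * to be computed in software because the port cannot take it as-is.
 */
static bool
deliver(struct pkt *p, const struct port *out)
{
	if (p->csum_pending && !(out->caps & CAP_CSUM)) {
		/* a real bridge would compute and store the checksum here */
		p->csum_pending = false;
		return (true);
	}

	/* either nothing is pending, or the port accepts it unfinished */
	return (false);
}
```

With two offload-capable guests the first branch is never taken, which matches
the "no one" answer; it is only taken when the receiving side has not enabled
the capability.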
>
> > there's an unstated assumption that the hardware (eg, cpu and ram) a
> > hypervisor and the guests run on is ok, and the software is correct.
> >
> > if you take a step back, the idea behind the checksums in packets
> > is to help identify if a packet has been damaged as it travels
> > across a medium like a network cable in an ethernet network. the
> > important information is that the packet is ok, not that checksum
> > is actually calculated. it just happens that you use the actual checksum
> > for packets coming off the wire to make that claim about the packet
> > being ok.
> >
> > if the path for a packet is inside a hypervisor between two guests,
> > does it matter if the checksums are actually calculated if the platform
> > is already trusted? we know the packet is good.
>
> Good summary of the reasoning behind this.
> Thanks David.
>

Does this "just work" no matter what guests I run? That's really all I care
about.

-ml

> > > > > >> >> but adding in more memcpy never seems like an improvement for efficiency.
> > > > > >> >
> > > > > > >> > If we used the virtio_net_hdr header between tun(4) and vmd(8), as Linux
> > > > > > >> > and FreeBSD do, we could remove all the memcpy(3)s and packet parsing in this
> > > > > > >> > diff.  I would prefer that and also suggested it in the past, but claudio and
> > > > > > >> > dlg want to have an OpenBSD-specific tun_hdr for this [1].
> > > > > >> >
> > > > > > >> > Thus, we have to swap tun_hdr for virtio_net_hdr and vice versa while
> > > > > > >> > forwarding packets between tun(4) and the guest.  I'm fine with this solution
> > > > > > >> > now.  At least I can say it doesn't decrease the performance in a significant
> > > > > > >> > way.
> > > > > >> >
> > > > > >> > [1]: https://marc.info/?l=openbsd-tech&m=173076189216230
> > > > > >> >
> > > > > >> >> Does the tun header stuff not have some requirement to not be fragmented
> > > > > >> >> across buffers?
> > > > > >> >
> > > > > >> > The tun_hdr itself is not fragmented, but the other packet headers
> > > > > >> > (ether_header, ip and tcphdr/udphdr) might be fragmented.
> > > > > >> >
> > > > > >> >> > Index: sys/kern/kern_pledge.c
> > > > > >> >> > ===================================================================
> > > > > >> >> > RCS file: /cvs/src/sys/kern/kern_pledge.c,v
> > > > > >> >> > diff -u -p -r1.335 kern_pledge.c
> > > > > >> >> > --- sys/kern/kern_pledge.c	13 Nov 2025 20:59:14 -0000	1.335
> > > > > >> >> > +++ sys/kern/kern_pledge.c	14 Jan 2026 17:25:57 -0000
> > > > > >> >> > @@ -46,6 +46,7 @@
> > > > > >> >> >  #include <net/route.h>
> > > > > >> >> >  #include <net/if.h>
> > > > > >> >> >  #include <net/if_var.h>
> > > > > >> >> > +#include <net/if_tun.h>
> > > > > >> >> >  #include <netinet/in.h>
> > > > > >> >> >  #include <netinet6/in6_var.h>
> > > > > >> >> >  #include <netinet6/nd6.h>
> > > > > >> >> > @@ -1337,6 +1338,12 @@ pledge_ioctl(struct proc *p, long com, s
> > > > > >> >> >  		    cdevsw[major(vp->v_rdev)].d_open == vmmopen) {
> > > > > >> >> >  			error = pledge_ioctl_vmm(p, com);
> > > > > >> >> >  			if (error == 0)
> > > > > >> >> > +				return 0;
> > > > > >> >> > +		}
> > > > > >> >> > +		if ((fp->f_type == DTYPE_VNODE) &&
> > > > > >> >> > +		    (vp->v_type == VCHR) &&
> > > > > >> >> > +		    (cdevsw[major(vp->v_rdev)].d_open == tapopen)) {
> > > > > >> >> > +			if (com == TUNSCAP)
> > > > > >> >> >  				return 0;
> > > > > >> >> >  		}
> > > > > >> >>
> > > > > >> >> The diff scope here isn't showing the actual logic change: this adds
> > > > > >> >> capabilities to the "vmm" pledge. Can this be something specific to
> > > > > >> >> tap/tun devices or something related? See below for more commentary.
> > > > > >> >
> > > > > >> > Is the condition "d_open == tapopen" not specific to tap/tun enough?
> > > > > >> >
> > > > > > >> > I already discussed this with hshoexer, mlarkin and deraadt[2].  But if you
> > > > > > >> > have an idea how to make this more specific and tighter, I'm open to better
> > > > > > >> > solutions.
> > > > > >> >
> > > > > >> > [2]: https://marc.info/?l=openbsd-tech&m=174787581327675&w=2
> > > > > >> >
> > > > > >>
> > > > > >> I think the right way to do this is to have an already privileged
> > > > > > >> process make the change to the tap device. Expanding the privileges of
> > > > > >> vionet to add all vmm ioctl capabilities for a single tun/tap ioctl is
> > > > > >> overkill.
> > > > > >>
> > > > > >> The "vmd" process already opens the tap device on behalf of the vm, so
> > > > > >> it can do the initial TUNSCAP ioctl with all the bits 0 if that's a
> > > > > >> necessity? (I'm not sure why we need to call this with no bits set, but
> > > > > >> if there's no harm even if the guest doesn't enable the offload, let's
> > > > > >> do it at open(2) time.)
> > > > > >>
> > > > > >> I think if we want to add "can do ioctl's on a tap(4)" to the "vmm"
> > > > > >> pledge, it's best to have vionet ask its parent process (which already
> > > > > >> has "vmm") instead of expanding the capabilities of the vionet process.
> > > > > >>
> > > > > >> I have a design in my head for this. If you can give me 2-3 days I can
> > > > > >> put together how the IPC will work.
> > > > > >>
> > > > > >> It will require keeping a copy of the tap descriptor open in the vm
> > > > > >> process, but tying it to that lifetime as well shouldn't be an issue.
> > > > > >
> > > > > > > Here is a version of my diff with ioctl(2) for TUNSCAP in virtio.c.  I use a
> > > > > > > new imsg type VIODEV_MSG_TUNSCAP to transfer the bits.  Thus, vionet.c does
> > > > > > > not need to call ioctl(2) and doesn't need pledge "vmm".
> > > > > > >
> > > > > > > Is this the way you want to have it?
> > > > >
> > > > > > Yes, in general that's along the lines of what I was thinking. In talking to dlg@ it
> > > > > should be fine for this change to be asynchronous as long as the TUNSCAP
> > > > > mode is already enabled and we're just changing feature bits.
> > > > >
> > > > > >
> > > > > > > PS: I saw your mail about local interfaces a few minutes ago.  I'll check it.
> > > > > >
> > > > > > Thanks,
> > > > > > Jan
> > > > > >
> > > > > > Index: sys/kern/kern_pledge.c
> > > > > > ===================================================================
> > > > > > RCS file: /cvs/src/sys/kern/kern_pledge.c,v
> > > > > > diff -u -p -r1.335 kern_pledge.c
> > > > > > --- sys/kern/kern_pledge.c	13 Nov 2025 20:59:14 -0000	1.335
> > > > > > +++ sys/kern/kern_pledge.c	14 Jan 2026 17:25:57 -0000
> > > > > > @@ -46,6 +46,7 @@
> > > > > >  #include <net/route.h>
> > > > > >  #include <net/if.h>
> > > > > >  #include <net/if_var.h>
> > > > > > +#include <net/if_tun.h>
> > > > > >  #include <netinet/in.h>
> > > > > >  #include <netinet6/in6_var.h>
> > > > > >  #include <netinet6/nd6.h>
> > > > > > @@ -1337,6 +1338,12 @@ pledge_ioctl(struct proc *p, long com, s
> > > > > >  		    cdevsw[major(vp->v_rdev)].d_open == vmmopen) {
> > > > > >  			error = pledge_ioctl_vmm(p, com);
> > > > > >  			if (error == 0)
> > > > > > +				return 0;
> > > > > > +		}
> > > > > > +		if ((fp->f_type == DTYPE_VNODE) &&
> > > > > > +		    (vp->v_type == VCHR) &&
> > > > > > +		    (cdevsw[major(vp->v_rdev)].d_open == tapopen)) {
> > > > > > +			if (com == TUNSCAP)
> > > > > >  				return 0;
> > > > > >  		}
> > > > > >  	}
> > > > > > Index: usr.sbin/vmd/vionet.c
> > > > > > ===================================================================
> > > > > > RCS file: /cvs/src/usr.sbin/vmd/vionet.c,v
> > > > > > diff -u -p -r1.29 vionet.c
> > > > > > --- usr.sbin/vmd/vionet.c	14 Jan 2026 03:09:05 -0000	1.29
> > > > > > +++ usr.sbin/vmd/vionet.c	15 Jan 2026 15:55:28 -0000
> > > > > > @@ -22,7 +22,12 @@
> > > > > >  #include <dev/pv/virtioreg.h>
> > > > > >
> > > > > >  #include <net/if.h>
> > > > > > +#include <net/if_tun.h>
> > > > > >  #include <netinet/in.h>
> > > > > > +#include <netinet/ip.h>
> > > > > > +#include <netinet/ip6.h>
> > > > > > +#include <netinet/tcp.h>
> > > > > > +#include <netinet/udp.h>
> > > > > >  #include <netinet/if_ether.h>
> > > > > >
> > > > > >  #include <errno.h>
> > > > > > @@ -50,6 +55,7 @@
> > > > > >
> > > > > >  #define VIRTIO_NET_CONFIG_MAC		 0 /*  8 bit x 6 byte */
> > > > > >
> > > > > > +#define VIRTIO_NET_F_GUEST_CSUM	(1 << 1)
> > > > > >  #define VIRTIO_NET_F_MAC	(1 << 5)
> > > > > >  #define RXQ	0
> > > > > >  #define TXQ	1
> > > > > > @@ -65,7 +71,7 @@ static void *rx_run_loop(void *);
> > > > > >  static void *tx_run_loop(void *);
> > > > > >  static int vionet_rx(struct virtio_dev *, int);
> > > > > >  static ssize_t vionet_rx_copy(struct vionet_dev *, int, const struct iovec *,
> > > > > > -    int, size_t);
> > > > > > +    int, size_t, struct tun_hdr *th);
> > > > > >  static ssize_t vionet_rx_zerocopy(struct vionet_dev *, int,
> > > > > >      const struct iovec *, int);
> > > > > >  static void vionet_rx_event(int, short, void *);
> > > > > > @@ -84,6 +90,10 @@ static void read_pipe_rx(int, short, voi
> > > > > >  static void read_pipe_tx(int, short, void *);
> > > > > >  static void vionet_assert_pic_irq(struct virtio_dev *);
> > > > > >  static void vionet_deassert_pic_irq(struct virtio_dev *);
> > > > > > +static void vhdr2thdr(struct virtio_net_hdr *, struct tun_hdr *,
> > > > > > +    const struct iovec *, int);
> > > > > > +static void thdr2vhdr(struct tun_hdr *, struct virtio_net_hdr *,
> > > > > > +    const struct iovec *, int);
> > > > > >
> > > > > >  /* Device Globals */
> > > > > >  struct event ev_tap;
> > > > > > @@ -300,6 +310,30 @@ fail:
> > > > > >  }
> > > > > >
> > > > > >  /*
> > > > > > + * Update and sync offload features with tap(4).
> > > > > > + */
> > > > > > +static void
> > > > > > +vionet_update_offload(struct virtio_dev *dev)
> > > > > > +{
> > > > > > +	struct viodev_msg	msg;
> > > > > > +	int			ret;
> > > > > > +
> > > > > > +	memset(&msg, 0, sizeof(msg));
> > > > > > +	msg.irq = dev->irq;
> > > > > > +	msg.type = VIODEV_MSG_TUNSCAP;
> > > > > > +
> > > > > > +	if (dev->driver_feature & VIRTIO_NET_F_GUEST_CSUM) {
> > > > > > +		msg.data |= IFCAP_CSUM_TCPv4 | IFCAP_CSUM_UDPv4;
> > > > > > +		msg.data |= IFCAP_CSUM_TCPv6 | IFCAP_CSUM_UDPv6;
> > > > > > +	}
> > > > > > +
> > > > > > +	ret = imsg_compose_event2(&dev->async_iev, IMSG_DEVOP_MSG, 0, 0, -1,
> > > > > > +	    &msg, sizeof(msg), ev_base_main);
> > > > > > +	if (ret == -1)
> > > > > > > +		log_warnx("%s: failed to send TUNSCAP message", __func__);
> > > > > > +}
> > > > > > +
> > > > > > +/*
> > > > > >   * vionet_rx
> > > > > >   *
> > > > > >   * Pull packet from the provided fd and fill the receive-side virtqueue. We
> > > > > > @@ -321,6 +355,7 @@ vionet_rx(struct virtio_dev *dev, int fd
> > > > > >  	struct virtio_net_hdr *hdr = NULL;
> > > > > >  	struct virtio_vq_info *vq_info;
> > > > > >  	struct iovec *iov;
> > > > > > +	struct tun_hdr th;
> > > > > >  	int notify = 0;
> > > > > >  	ssize_t sz;
> > > > > >  	uint8_t status = 0;
> > > > > > @@ -351,8 +386,8 @@ vionet_rx(struct virtio_dev *dev, int fd
> > > > > >  			goto reset;
> > > > > >  		}
> > > > > >
> > > > > > -		iov = &iov_rx[0];
> > > > > > -		iov_cnt = 1;
> > > > > > +		iov = &iov_rx[1];
> > > > > > +		iov_cnt = 2;
> > > > > >
> > > > > >  		/*
> > > > > >  		 * First descriptor should be at least as large as the
> > > > > > @@ -373,7 +408,6 @@ vionet_rx(struct virtio_dev *dev, int fd
> > > > > >  		if (iov->iov_base == NULL)
> > > > > >  			goto reset;
> > > > > >  		hdr = iov->iov_base;
> > > > > > -		memset(hdr, 0, sizeof(struct virtio_net_hdr));
> > > > > >
> > > > > >  		/* Tweak the iovec to account for the virtio_net_hdr. */
> > > > > >  		iov->iov_len -= sizeof(struct virtio_net_hdr);
> > > > > > @@ -418,22 +452,26 @@ vionet_rx(struct virtio_dev *dev, int fd
> > > > > >  			goto reset;
> > > > > >  		}
> > > > > >
> > > > > > -		hdr->num_buffers = iov_cnt;
> > > > > > -
> > > > > >  		/*
> > > > > >  		 * If we're enforcing hardware address or handling an injected
> > > > > >  		 * packet, we need to use a copy-based approach.
> > > > > >  		 */
> > > > > >  		if (vionet->lockedmac || fd != vionet->data_fd)
> > > > > >  			sz = vionet_rx_copy(vionet, fd, iov_rx, iov_cnt,
> > > > > > -			    chain_len);
> > > > > > -		else
> > > > > > +			    chain_len, &th);
> > > > > > +		else {
> > > > > > +			iov_rx[0].iov_base = &th;
> > > > > > +			iov_rx[0].iov_len = sizeof(th);
> > > > > >  			sz = vionet_rx_zerocopy(vionet, fd, iov_rx, iov_cnt);
> > > > > > +		}
> > > > > >  		if (sz == -1)
> > > > > >  			goto reset;
> > > > > >  		if (sz == 0)	/* No packets, so bail out for now. */
> > > > > >  			break;
> > > > > >
> > > > > > +		thdr2vhdr(&th, hdr, iov_rx + 1, iov_cnt - 1);
> > > > > > +		hdr->num_buffers = iov_cnt - 1;
> > > > > > +
> > > > > >  		/*
> > > > > >  		 * Account for the prefixed header since it wasn't included
> > > > > >  		 * in the copy or zerocopy operations.
> > > > > > @@ -473,9 +511,9 @@ reset:
> > > > > >   */
> > > > > >  ssize_t
> > > > > >  vionet_rx_copy(struct vionet_dev *dev, int fd, const struct iovec *iov,
> > > > > > -    int iov_cnt, size_t chain_len)
> > > > > > +    int iov_cnt, size_t chain_len, struct tun_hdr *th)
> > > > > >  {
> > > > > > -	static uint8_t		 buf[VIONET_HARD_MTU];
> > > > > > +	static uint8_t		 buf[sizeof(struct tun_hdr) + VIONET_HARD_MTU];
> > > > > >  	struct packet		*pkt = NULL;
> > > > > >  	struct ether_header	*eh = NULL;
> > > > > >  	uint8_t			*payload = buf;
> > > > > > @@ -483,9 +521,10 @@ vionet_rx_copy(struct vionet_dev *dev, i
> > > > > >  	ssize_t			 sz;
> > > > > >
> > > > > >  	/* If reading from the tap(4), try to right-size the read. */
> > > > > > -	if (fd == dev->data_fd)
> > > > > > -		nbytes = MIN(chain_len, VIONET_HARD_MTU);
> > > > > > -	else if (fd == pipe_inject[READ])
> > > > > > +	if (fd == dev->data_fd) {
> > > > > > +		nbytes = sizeof(struct tun_hdr) +
> > > > > > +		    MIN(chain_len, VIONET_HARD_MTU);
> > > > > > +	} else if (fd == pipe_inject[READ])
> > > > > >  		nbytes = sizeof(struct packet);
> > > > > >  	else {
> > > > > >  		log_warnx("%s: invalid fd: %d", __func__, fd);
> > > > > > @@ -504,10 +543,20 @@ vionet_rx_copy(struct vionet_dev *dev, i
> > > > > >  			return (-1);
> > > > > >  		}
> > > > > >  		return (0);
> > > > > > -	} else if (fd == dev->data_fd && sz < VIONET_MIN_TXLEN) {
> > > > > > +	} else if (fd == dev->data_fd) {
> > > > > > +		if ((size_t)sz < sizeof(struct tun_hdr)) {
> > > > > > +			log_warnx("%s: short tun_hdr", __func__);
> > > > > > +			return (0);
> > > > > > +		}
> > > > > > +		memcpy(th, payload, sizeof *th);
> > > > > > +		payload += sizeof(struct tun_hdr);
> > > > > > +		sz -= sizeof(struct tun_hdr);
> > > > > > +
> > > > > >  		/* If reading the tap(4), we should get valid ethernet. */
> > > > > > -		log_warnx("%s: invalid packet size", __func__);
> > > > > > -		return (0);
> > > > > > +		if (sz < VIONET_MIN_TXLEN) {
> > > > > > +			log_warnx("%s: invalid packet size", __func__);
> > > > > > +			return (0);
> > > > > > +		}
> > > > > >  	} else if (fd == pipe_inject[READ] && sz != sizeof(struct packet)) {
> > > > > >  		log_warnx("%s: invalid injected packet object (sz=%ld)",
> > > > > >  		    __func__, sz);
> > > > > > @@ -585,6 +634,12 @@ vionet_rx_zerocopy(struct vionet_dev *de
> > > > > >  	sz = readv(fd, iov, iov_cnt);
> > > > > >  	if (sz == -1 && errno == EAGAIN)
> > > > > >  		return (0);
> > > > > > +
> > > > > > +	if ((size_t)sz < sizeof(struct tun_hdr))
> > > > > > +		return (0);
> > > > > > +
> > > > > > +	sz -= sizeof(struct tun_hdr);
> > > > > > +
> > > > > >  	return (sz);
> > > > > >  }
> > > > > >
> > > > > > @@ -666,6 +721,8 @@ vionet_tx(struct virtio_dev *dev)
> > > > > >  	struct iovec *iov;
> > > > > >  	struct packet pkt;
> > > > > >  	uint8_t status = 0;
> > > > > > +	struct virtio_net_hdr *vhp;
> > > > > > +	struct tun_hdr th;
> > > > > >
> > > > > >  	status = dev->status & VIRTIO_CONFIG_DEVICE_STATUS_DRIVER_OK;
> > > > > >  	if (status != VIRTIO_CONFIG_DEVICE_STATUS_DRIVER_OK) {
> > > > > > @@ -692,8 +749,10 @@ vionet_tx(struct virtio_dev *dev)
> > > > > >  			goto reset;
> > > > > >  		}
> > > > > >
> > > > > > -		iov = &iov_tx[0];
> > > > > > -		iov_cnt = 0;
> > > > > > > +		/* the 0th slot will be used by the tun_hdr */
> > > > > > +
> > > > > > +		iov = &iov_tx[1];
> > > > > > +		iov_cnt = 1;
> > > > > >  		chain_len = 0;
> > > > > >
> > > > > >  		/*
> > > > > > @@ -704,13 +763,16 @@ vionet_tx(struct virtio_dev *dev)
> > > > > >  			log_warnx("%s: invalid descriptor length", __func__);
> > > > > >  			goto reset;
> > > > > >  		}
> > > > > > -		iov->iov_len = desc->len;
> > > > > >
> > > > > > -		if (iov->iov_len > sizeof(struct virtio_net_hdr)) {
> > > > > > -			/* Chop off the virtio header, leaving packet data. */
> > > > > > -			iov->iov_len -= sizeof(struct virtio_net_hdr);
> > > > > > -			iov->iov_base = hvaddr_mem(desc->addr +
> > > > > > -			    sizeof(struct virtio_net_hdr), iov->iov_len);
> > > > > > +		/* Chop the virtio net header off */
> > > > > > +		vhp = hvaddr_mem(desc->addr, sizeof(*vhp));
> > > > > > +		if (vhp == NULL)
> > > > > > +			goto reset;
> > > > > > +
> > > > > > +		iov->iov_len = desc->len - sizeof(*vhp);
> > > > > > +		if (iov->iov_len > 0) {
> > > > > > +			iov->iov_base = hvaddr_mem(desc->addr + sizeof(*vhp),
> > > > > > +			    iov->iov_len);
> > > > > >  			if (iov->iov_base == NULL)
> > > > > >  				goto reset;
> > > > > >
> > > > > > @@ -758,7 +820,7 @@ vionet_tx(struct virtio_dev *dev)
> > > > > >  		 * descriptor with packet data contains a large enough buffer
> > > > > >  		 * for this inspection.
> > > > > >  		 */
> > > > > > -		iov = &iov_tx[0];
> > > > > > +		iov = &iov_tx[1];
> > > > > >  		if (vionet->lockedmac) {
> > > > > >  			if (iov->iov_len < ETHER_HDR_LEN) {
> > > > > >  				log_warnx("%s: insufficient header data",
> > > > > > @@ -784,6 +846,15 @@ vionet_tx(struct virtio_dev *dev)
> > > > > >  			}
> > > > > >  		}
> > > > > >
> > > > > > +		/*
> > > > > > +		 * if we look at more of vhp we might need to copy
> > > > > > +		 * it so it's aligned properly
> > > > > > +		 */
> > > > > > +		vhdr2thdr(vhp, &th, iov_tx + 1, iov_cnt - 1);
> > > > > > +
> > > > > > +		iov_tx[0].iov_base = &th;
> > > > > > +		iov_tx[0].iov_len = sizeof(th);
> > > > > > +
> > > > > >  		/* Write our packet to the tap(4). */
> > > > > >  		sz = writev(vionet->data_fd, iov_tx, iov_cnt);
> > > > > >  		if (sz == -1 && errno != ENOBUFS) {
> > > > > > @@ -1114,6 +1185,7 @@ vionet_cfg_write(struct virtio_dev *dev,
> > > > > >  		dev->driver_feature &= dev->device_feature;
> > > > > >  		DPRINTF("%s: driver features 0x%llx", __func__,
> > > > > >  		    dev->driver_feature);
> > > > > > +		vionet_update_offload(dev);
> > > > > >  		break;
> > > > > >  	case VIO1_PCI_CONFIG_MSIX_VECTOR:
> > > > > >  		/* Ignore until we support MSIX. */
> > > > > > @@ -1555,6 +1627,155 @@ vionet_assert_pic_irq(struct virtio_dev
> > > > > >  	    &msg, sizeof(msg), ev_base_main);
> > > > > >  	if (ret == -1)
> > > > > >  		log_warnx("%s: failed to assert irq %d", __func__, dev->irq);
> > > > > > +}
> > > > > > +
> > > > > > +static int
> > > > > > +memcpyv(void *buf, size_t len, size_t off, const struct iovec *iov, int iovcnt)
> > > > > > +{
> > > > > > +	uint8_t *dst = buf;
> > > > > > +	size_t l;
> > > > > > +
> > > > > > +	for (;;) {
> > > > > > +		if (iovcnt == 0)
> > > > > > +			return (-1);
> > > > > > +
> > > > > > +		if (off < iov->iov_len)
> > > > > > +			break;
> > > > > > +
> > > > > > +		off -= iov->iov_len;
> > > > > > +		iov++;
> > > > > > +		iovcnt--;
> > > > > > +	}
> > > > > > +
> > > > > > +	l = off + len;
> > > > > > +	if (l > iov->iov_len)
> > > > > > +		l = iov->iov_len;
> > > > > > +	l -= off;
> > > > > > +
> > > > > > +	memcpy(dst, (const uint8_t *)iov->iov_base + off, l);
> > > > > > +	dst += l;
> > > > > > +	len -= l;
> > > > > > +
> > > > > > +	if (len == 0)
> > > > > > +		return (0);
> > > > > > +
> > > > > > +	for (;;) {
> > > > > > +		if (iovcnt == 0)
> > > > > > +			return (-1);
> > > > > > +
> > > > > > +		l = len;
> > > > > > +		if (l > iov->iov_len)
> > > > > > +			l = iov->iov_len;
> > > > > > +
> > > > > > +		memcpy(dst, (const uint8_t *)iov->iov_base, l);
> > > > > > +		dst += l;
> > > > > > +		len -= l;
> > > > > > +
> > > > > > +		if (len == 0)
> > > > > > +			break;
> > > > > > +
> > > > > > +		iov++;
> > > > > > +		iovcnt--;
> > > > > > +	}
> > > > > > +
> > > > > > +	return (0);
> > > > > > +}
> > > > > > +
> > > > > > +static void
> > > > > > +hdr_extract(const struct iovec *iov, int iovcnt, size_t *off, uint8_t *proto)
> > > > > > +{
> > > > > > +	size_t		offs;
> > > > > > +	uint16_t	etype;
> > > > > > +
> > > > > > +	if (memcpyv(&etype, sizeof(etype),
> > > > > > +	    offsetof(struct ether_header, ether_type),
> > > > > > +	    iov, iovcnt) == -1)
> > > > > > +		return;
> > > > > > +
> > > > > > +	*off = sizeof(struct ether_header);
> > > > > > +
> > > > > > +	if (etype == htons(ETHERTYPE_VLAN)) {
> > > > > > +		if (memcpyv(&etype, sizeof(etype),
> > > > > > +		    offsetof(struct ether_vlan_header, evl_proto),
> > > > > > +		    iov, iovcnt) == -1)
> > > > > > +			return;
> > > > > > +
> > > > > > +		*off = sizeof(struct ether_vlan_header);
> > > > > > +	}
> > > > > > +
> > > > > > +	if (etype == htons(ETHERTYPE_IP)) {
> > > > > > +		uint8_t hl;
> > > > > > +
> > > > > > +		/* Get ipproto field from IP header. */
> > > > > > +		offs = *off + offsetof(struct ip, ip_p);
> > > > > > +		if (memcpyv(proto, sizeof(*proto), offs, iov, iovcnt) == -1)
> > > > > > +			return;
> > > > > > +
> > > > > > +		/* Get IP header length field from IP header. */
> > > > > > +		offs = *off;
> > > > > > +		if (memcpyv(&hl, sizeof(hl), offs, iov, iovcnt) == -1)
> > > > > > +			return;
> > > > > > +
> > > > > > +		*off += (hl & 0x0f) << 2;
> > > > > > +	} else if (etype == htons(ETHERTYPE_IPV6)) {
> > > > > > +		/* Get next header field from IP header. */
> > > > > > +		offs = *off + offsetof(struct ip6_hdr, ip6_nxt);
> > > > > > +		if (memcpyv(proto, sizeof(*proto), offs, iov, iovcnt) == -1)
> > > > > > +			return;
> > > > > > +
> > > > > > +		*off += sizeof(struct ip6_hdr);
> > > > > > +	}
> > > > > > +}
> > > > > > +
> > > > > > +static void
> > > > > > +vhdr2thdr(struct virtio_net_hdr *vh, struct tun_hdr *th,
> > > > > > +    const struct iovec *iov, int iovcnt)
> > > > > > +{
> > > > > > +	memset(th, 0, sizeof(*th));
> > > > > > +
> > > > > > +	if (vh->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
> > > > > > > +		size_t	off = 0;
> > > > > > > +		uint8_t	proto = 0;	/* hdr_extract() may return early */
> > > > > > +
> > > > > > +		hdr_extract(iov, iovcnt, &off, &proto);
> > > > > > +
> > > > > > +		switch (proto) {
> > > > > > +		case IPPROTO_TCP:
> > > > > > +			th->th_flags |= TUN_H_TCP_CSUM;
> > > > > > +			break;
> > > > > > +
> > > > > > +		case IPPROTO_UDP:
> > > > > > +			th->th_flags |= TUN_H_UDP_CSUM;
> > > > > > +			break;
> > > > > > +		}
> > > > > > +	}
> > > > > > +}
> > > > > > +
> > > > > > +static void
> > > > > > +thdr2vhdr(struct tun_hdr *th, struct virtio_net_hdr *vh,
> > > > > > +    const struct iovec *iov, int iovcnt)
> > > > > > +{
> > > > > > > +	size_t	off = 0;
> > > > > > > +	uint8_t	proto = 0;	/* hdr_extract() may return early */
> > > > > > +
> > > > > > +	memset(vh, 0, sizeof(*vh));
> > > > > > +
> > > > > > +	if (th->th_flags & (TUN_H_TCP_CSUM | TUN_H_UDP_CSUM)) {
> > > > > > +		hdr_extract(iov, iovcnt, &off, &proto);
> > > > > > +
> > > > > > +		vh->flags |= VIRTIO_NET_HDR_F_NEEDS_CSUM;
> > > > > > +		vh->csum_start = off;
> > > > > > +
> > > > > > +		switch (proto) {
> > > > > > +		case IPPROTO_TCP:
> > > > > > +			vh->csum_offset = offsetof(struct tcphdr, th_sum);
> > > > > > +			break;
> > > > > > +
> > > > > > +		case IPPROTO_UDP:
> > > > > > +			vh->csum_offset = offsetof(struct udphdr, uh_sum);
> > > > > > +			break;
> > > > > > +		}
> > > > > > +	}
> > > > > >  }
> > > > > >
> > > > > >  /*
> > > > > > Index: usr.sbin/vmd/virtio.c
> > > > > > ===================================================================
> > > > > > RCS file: /cvs/src/usr.sbin/vmd/virtio.c,v
> > > > > > diff -u -p -r1.134 virtio.c
> > > > > > --- usr.sbin/vmd/virtio.c	14 Jan 2026 03:09:05 -0000	1.134
> > > > > > +++ usr.sbin/vmd/virtio.c	15 Jan 2026 15:55:36 -0000
> > > > > > @@ -19,6 +19,7 @@
> > > > > >  #include <sys/param.h>	/* PAGE_SIZE */
> > > > > >  #include <sys/socket.h>
> > > > > >  #include <sys/wait.h>
> > > > > > +#include <sys/ioctl.h>
> > > > > >
> > > > > >  #include <dev/pci/pcireg.h>
> > > > > >  #include <dev/pci/pcidevs.h>
> > > > > > @@ -28,6 +29,7 @@
> > > > > >  #include <dev/vmm/vmm.h>
> > > > > >
> > > > > >  #include <net/if.h>
> > > > > > +#include <net/if_tun.h>
> > > > > >  #include <netinet/in.h>
> > > > > >  #include <netinet/if_ether.h>
> > > > > >
> > > > > > @@ -64,6 +66,8 @@ SLIST_HEAD(virtio_dev_head, virtio_dev)
> > > > > >
> > > > > >  #define MAXPHYS	(64 * 1024)	/* max raw I/O transfer size */
> > > > > >
> > > > > > +#define VIRTIO_NET_F_CSUM	(1<<0)
> > > > > > +#define VIRTIO_NET_F_GUEST_CSUM	(1<<1)
> > > > > >  #define VIRTIO_NET_F_MAC	(1<<5)
> > > > > >
> > > > > >  #define VMMCI_F_TIMESYNC	(1<<0)
> > > > > > @@ -1020,6 +1024,8 @@ virtio_init(struct vmd_vm *vm, int child
> > > > > >  	/* Virtio 1.x Network Devices */
> > > > > >  	if (vmc->vmc_nnics > 0) {
> > > > > >  		for (i = 0; i < vmc->vmc_nnics; i++) {
> > > > > > +			struct tun_capabilities	tcap;
> > > > > > +
> > > > > >  			dev = malloc(sizeof(struct virtio_dev));
> > > > > >  			if (dev == NULL) {
> > > > > >  				log_warn("calloc failure allocating vionet");
> > > > > > @@ -1034,7 +1040,8 @@ virtio_init(struct vmd_vm *vm, int child
> > > > > >  			}
> > > > > >  			virtio_dev_init(vm, dev, id, VIONET_QUEUE_SIZE_DEFAULT,
> > > > > >  			    VIRTIO_NET_QUEUES,
> > > > > > -			    (VIRTIO_NET_F_MAC | VIRTIO_F_VERSION_1));
> > > > > > +			    (VIRTIO_NET_F_MAC | VIRTIO_NET_F_CSUM |
> > > > > > +				VIRTIO_NET_F_GUEST_CSUM | VIRTIO_F_VERSION_1));
> > > > > >
> > > > > >  			if (pci_add_bar(id, PCI_MAPREG_TYPE_IO, virtio_pci_io,
> > > > > >  			    dev) == -1) {
> > > > > > @@ -1056,6 +1063,14 @@ virtio_init(struct vmd_vm *vm, int child
> > > > > >  			dev->vmm_id = vm->vm_vmmid;
> > > > > >  			dev->vionet.data_fd = child_taps[i];
> > > > > >
> > > > > > +			/*
> > > > > > +			 * IFCAPs are tweaked after feature negotiation with
> > > > > > +			 * the guest later.
> > > > > > +			 */
> > > > > > +			memset(&tcap, 0, sizeof(tcap));
> > > > > > +			if (ioctl(dev->vionet.data_fd, TUNSCAP, &tcap) == -1)
> > > > > > +				fatal("tap(4) TUNSCAP");
> > > > > > +
> > > > > >  			/* MAC address has been assigned by the parent */
> > > > > >  			memcpy(&dev->vionet.mac, &vmc->vmc_macs[i], 6);
> > > > > >  			dev->vionet.lockedmac =
> > > > > > @@ -1532,10 +1547,12 @@ virtio_dev_launch(struct vmd_vm *vm, str
> > > > > >  		}
> > > > > >
> > > > > >  		/* Close data fds. Only the child device needs them now. */
> > > > > > -		if (virtio_dev_closefds(dev) == -1) {
> > > > > > -			log_warnx("%s: failed to close device data fds",
> > > > > > -			    __func__);
> > > > > > -			goto err;
> > > > > > +		if (dev->dev_type != VMD_DEVTYPE_NET) {
> > > > > > +			if (virtio_dev_closefds(dev) == -1) {
> > > > > > +				log_warnx("%s: failed to close device data fds",
> > > > > > +				    __func__);
> > > > > > +				goto err;
> > > > > > +			}
> > > > > >  		}
> > > > > >
> > > > > >  		/* 2. Send over details on the VM (including memory fds). */
> > > > > > @@ -1758,6 +1775,18 @@ handle_dev_msg(struct viodev_msg *msg, s
> > > > > >  	case VIODEV_MSG_ERROR:
> > > > > >  		log_warnx("%s: device reported error", __func__);
> > > > > >  		break;
> > > > > > +	case VIODEV_MSG_TUNSCAP:
> > > > > > +	{
> > > > > > +		struct tun_capabilities	tcap;
> > > > > > +
> > > > > > +		memset(&tcap, 0, sizeof(tcap));
> > > > > > +		tcap.tun_if_capabilities = msg->data;
> > > > > > +
> > > > > > +		if (ioctl(gdev->vionet.data_fd, TUNSCAP, &tcap) == -1)
> > > > > > +			fatal("%s: tap(4) TUNSCAP", __func__);
> > > > > > +
> > > > > > +		break;
> > > > > > +	}
> > > > > >  	case VIODEV_MSG_INVALID:
> > > > > >  	case VIODEV_MSG_IO_READ:
> > > > > >  	case VIODEV_MSG_IO_WRITE:
> > > > > > Index: usr.sbin/vmd/virtio.h
> > > > > > ===================================================================
> > > > > > RCS file: /cvs/src/usr.sbin/vmd/virtio.h,v
> > > > > > diff -u -p -r1.60 virtio.h
> > > > > > --- usr.sbin/vmd/virtio.h	14 Jan 2026 03:09:05 -0000	1.60
> > > > > > +++ usr.sbin/vmd/virtio.h	15 Jan 2026 15:55:48 -0000
> > > > > > @@ -134,6 +134,7 @@ struct viodev_msg {
> > > > > >  #define VIODEV_MSG_IO_WRITE	5
> > > > > >  #define VIODEV_MSG_DUMP		6
> > > > > >  #define VIODEV_MSG_SHUTDOWN	7
> > > > > > +#define VIODEV_MSG_TUNSCAP	8
> > > > > >
> > > > > >  	uint16_t reg;		/* VirtIO register */
> > > > > >  	uint8_t io_sz;		/* IO instruction size */
> > > > > > @@ -309,6 +310,9 @@ struct virtio_net_hdr {
> > > > > >  	uint16_t padding_reserved;
> > > > > >  	*/
> > > > > >  };
> > > > > > +
> > > > > > +#define VIRTIO_NET_HDR_F_NEEDS_CSUM	1 /* flags */
> > > > > > +#define VIRTIO_NET_HDR_F_DATA_VALID	2 /* flags */
> > > > > >
> > > > > >  enum vmmci_cmd {
> > > > > >  	VMMCI_NONE = 0,
> > > > >
> > > >
> > >
> >
> >