From: Stefan Fritsch Subject: Re: vmd(8): TCP/UDP checksum offloading for guests To: Claudio Jeker Cc: Jan Klemkow , tech@openbsd.org, David Gwynne Date: Tue, 5 Nov 2024 15:50:26 +0100 On Tue, 5 Nov 2024, Claudio Jeker wrote: > On Mon, Nov 04, 2024 at 05:25:17PM +0100, Jan Klemkow wrote: > > On Fri, Aug 23, 2024 at 02:20:37PM GMT, Claudio Jeker wrote: > > > On Fri, Aug 23, 2024 at 12:29:10PM +0200, Jan Klemkow wrote: > > > > On Fri, Aug 23, 2024 at 09:48:58AM +0200, Claudio Jeker wrote: > > > > > On Fri, Aug 23, 2024 at 11:05:43AM +1000, David Gwynne wrote: > > > > > > On Fri, Aug 16, 2024 at 02:51:06PM +0200, Jan Klemkow wrote: > > > > > > > This diff introduces the transmit side of TCP/UDP checksum offloading > > > > > > > for vmd(8) guests. It uses a similar implementation as FreeBSD and > > > > > > > Linux. vmd(8) tell tap(4) via a new ioctl to expect virtio net header > > > > > > > structs. These are used to transport offloading feature bits between > > > > > > > guest and tap(4) device. The tap(4) translates the feature bits between > > > > > > > virtio net header and our mbufs. > > > > > > > > > > > > > > This diff just contains the sending part, to keep it short. > > > > > > > > > > > > > > When both parts are implemented, we never calculate a checksum of > > > > > > > packets between our guests or the host system. As its described in the > > > > > > > virtio-network specification: > > > > > > > > > > > > > > A network packet transported between two guests on the same > > > > > > > system might not need checksumming at all, nor segmentation, > > > > > > > if both guests are amenable. > > > > > > > > > > > > > > This behavior is similar to our loopback traffic, as we also don't > > > > > > > calculate checksums of packets between two processes here. > > > > > > > > > > > > > > Tests and any kind of feedback is welcome! > > > > > > > > > > > > i love the idea, but i really dislike how tun and virtio get tied > > > > > > together by this diff. > > > > > > > > > > > > ideally tun and only tun would get extended to be able to communicate > > > > > > this info between userland and the kernel in a way that makes sense for > > > > > > our stack. separately, vmd would be responsible for mapping the virtio > > > > > > semantics to our tun driver semantics. > > > > > > > > > > > > i understand that tun/tap is kind of a psuedo standard interface > > > > > > between a bunch of different systems, and we're catching up to some of > > > > > > those systems in this area, but making struct virtio_net_hdr part > > > > > > of the tun api seems a bit much. virtio can tweak this struct when > > > > > > it makes sense for a guest talking to a hv via a memory mapped ring > > > > > > and visa versa, but automatically inheriting the change for userland > > > > > > talking to the kernel on an fd is annoying. > > > > > > > > > > I'm very much on the same page with that. Linux did a disservice to > > > > > everyone (once again). I would prefer if tun/tap had their own extension. > > > > > The virtio header is a pain especially since they altered the format and > > > > > you now have to handle two different forms. > > > > > So this dependency of virtio and vio structs in tun/tap is just terrible. > > > > > > > > Linux uses this interface for KVM and FreeBSD also uses it for bhyve. > > > > Why don't we use it for VMM, too? > > > > > > Because is it sub-standard and a layer violation of major proportions. > > > We are not just followers adding bad code to our kernel. I think OpenBSD > > > was always about doing the right thing even if is hard. > > > > > > > At least there is an specification [1] for this interface. Which is > > > > quiet stable with backward compatibility. Thus, it don't seems to break > > > > in the future. > > > > > > Until they break it again with more backwards compat and whatever. Why tie > > > tun/tap together with virtio? tun and tap are used for other things as > > > well. Some time ago I looked at what linux did for tun / tap and walked > > > away exactly because of this. Their solution is a hack on a hack with a > > > hack and a layer violation on top. > > > > > > > I get your point, that you don't want to deal with two different > > > > versions of the virtio net header structure in our kernel. To get rid > > > > of the legacy format we have to upgrade vmd(8) from virtio 0.9 to 1.0. > > > > > > > > [1]: https://docs.oasis-open.org/virtio/virtio/v1.2/virtio-v1.2.html > > > > > > > > What would an OpenBSD only extension to the tap(4) interface looks like? > > > > Do we define our own structure tap_net_hdr? And vmd(8) has to translate > > > > between virtio_net_hdr and tap_net_hdr? > > > > > > Yes. Exactly. We could create one sensible header that would allow for all > > > the things we care. Not only TX/RX csum offload, e.g. multi packet send > > > and recv and multiqueue support. > > > > > > In linux land virtio and tap is now so interlinked that it is hard to > > > improve either of them. > > > > > > > vmd(8) don't need the information in virtio_net_hdr. tap(4) does. > > > > tap(4) is the only one here, who can transfer information between mbuf > > > > and virtio_net_hdr. Like, our drivers in dev/if_*.c translate between > > > > their receive descriptors and our mbuf structure. > > > > > > No, our drivers do not create an ethernet extension header that is sent to > > > the switch to do the checksum offload there... This is what people created > > > here. They introduce signaling over two device abstractions. That is just > > > bad design. virtio is already a mess and I would prefer if that mess > > > remains in vmd / qemu. > > > > > > > > > will any ports make use of this? would this be used outside vmd? i > > > > > > have vague memories of tailscale omgoptimising a wireguard > > > > > > implementation to do offloads which probably uses this, but nothing > > > > > > else springs to mind. > > > > > > > > > > From my understanding on qemu and maybe tailscale/openvpn would care. > > > > > IMO this is easy enough to handle in ports. > > > > > > > > Yes, qemu uses this interface. tailscale and openvpn don't as far as I > > > > see it in their sources. > > > > After talking to claudio at h2k24 I don't see any other version of this > > at the moment. We only have two consumers of this API (vmd and qemu). > > Every other API would make it much more complicated without any benefit > > for other userland tools. > > > > At least its an optional extension to tap(4) and don't change the > > default behavior. > > > > ok? > > Looking at the diff it still disgusts me very much. The virtio header > includes a lot more than required and it is unclear how tap should handle > stuff that has bits set that it does not really no about. > > Just the bits that have to be moved into a public header shows that this > is something that was not thought through by the people pushing for this. I think there is a bit of a misunderstanding here. The feature bits do not have to be moved to a public header for the tun(4) change, but moving all the "hardware" definitions to a if_vioreg.h header is a good idea. Then vmd can use the bits from there instead of bringing its own definitions. This is completely independent of what tun(4) does. > > We only need CSUM, TSO, LRO flags and maybe VLAN MTU ones. A lot of the > other other feature bits are just screaming for bad behaviour. > So this should really be done less terrible. The complete definition of the virtio_net_hdr in virtio 1.2 is this: struct virtio_net_hdr { #define VIRTIO_NET_HDR_F_NEEDS_CSUM 1 #define VIRTIO_NET_HDR_F_DATA_VALID 2 #define VIRTIO_NET_HDR_F_RSC_INFO 4 u8 flags; #define VIRTIO_NET_HDR_GSO_NONE 0 #define VIRTIO_NET_HDR_GSO_TCPV4 1 #define VIRTIO_NET_HDR_GSO_UDP 3 #define VIRTIO_NET_HDR_GSO_TCPV6 4 #define VIRTIO_NET_HDR_GSO_UDP_L4 5 #define VIRTIO_NET_HDR_GSO_ECN 0x80 u8 gso_type; le16 hdr_len; le16 gso_size; le16 csum_start; le16 csum_offset; le16 num_buffers; le32 hash_value; #define VIRTIO_NET_HASH_REPORT_NONE 0 #define VIRTIO_NET_HASH_REPORT_IPv4 1 #define VIRTIO_NET_HASH_REPORT_TCPv4 2 #define VIRTIO_NET_HASH_REPORT_UDPv4 3 #define VIRTIO_NET_HASH_REPORT_IPv6 4 #define VIRTIO_NET_HASH_REPORT_TCPv6 5 #define VIRTIO_NET_HASH_REPORT_UDPv6 6 #define VIRTIO_NET_HASH_REPORT_IPv6_EX 7 #define VIRTIO_NET_HASH_REPORT_TCPv6_EX 8 #define VIRTIO_NET_HASH_REPORT_UDPv6_EX 9 le16 hash_report; le16 padding_reserved; }; But I don't really know vmd and tun and cannot comment on Jan's approach to offloading. Some more comments are inline below. Cheers, Stefan > > > Thanks, > > Jan > > > > Index: sys/dev/pv/if_vio.c > > =================================================================== > > RCS file: /cvs/src/sys/dev/pv/if_vio.c,v > > diff -u -p -r1.45 if_vio.c > > --- sys/dev/pv/if_vio.c 1 Aug 2024 11:13:19 -0000 1.45 > > +++ sys/dev/pv/if_vio.c 16 Aug 2024 08:47:15 -0000 > > @@ -59,154 +59,7 @@ > > #define DPRINTF(x...) > > #endif > > > > -/* > > - * if_vioreg.h: > > - */ > > -/* Configuration registers */ > > -#define VIRTIO_NET_CONFIG_MAC 0 /* 8bit x 6byte */ > > -#define VIRTIO_NET_CONFIG_STATUS 6 /* 16bit */ > > - > > -/* Feature bits */ > > -#define VIRTIO_NET_F_CSUM (1ULL<<0) > > -#define VIRTIO_NET_F_GUEST_CSUM (1ULL<<1) > > -#define VIRTIO_NET_F_CTRL_GUEST_OFFLOADS (1ULL<<2) > > -#define VIRTIO_NET_F_MTU (1ULL<<3) > > -#define VIRTIO_NET_F_MAC (1ULL<<5) > > -#define VIRTIO_NET_F_GSO (1ULL<<6) > > -#define VIRTIO_NET_F_GUEST_TSO4 (1ULL<<7) > > -#define VIRTIO_NET_F_GUEST_TSO6 (1ULL<<8) > > -#define VIRTIO_NET_F_GUEST_ECN (1ULL<<9) > > -#define VIRTIO_NET_F_GUEST_UFO (1ULL<<10) > > -#define VIRTIO_NET_F_HOST_TSO4 (1ULL<<11) > > -#define VIRTIO_NET_F_HOST_TSO6 (1ULL<<12) > > -#define VIRTIO_NET_F_HOST_ECN (1ULL<<13) > > -#define VIRTIO_NET_F_HOST_UFO (1ULL<<14) > > -#define VIRTIO_NET_F_MRG_RXBUF (1ULL<<15) > > -#define VIRTIO_NET_F_STATUS (1ULL<<16) > > -#define VIRTIO_NET_F_CTRL_VQ (1ULL<<17) > > -#define VIRTIO_NET_F_CTRL_RX (1ULL<<18) > > -#define VIRTIO_NET_F_CTRL_VLAN (1ULL<<19) > > -#define VIRTIO_NET_F_CTRL_RX_EXTRA (1ULL<<20) > > -#define VIRTIO_NET_F_GUEST_ANNOUNCE (1ULL<<21) > > -#define VIRTIO_NET_F_MQ (1ULL<<22) > > -#define VIRTIO_NET_F_CTRL_MAC_ADDR (1ULL<<23) > > -#define VIRTIO_NET_F_HOST_USO (1ULL<<56) > > -#define VIRTIO_NET_F_HASH_REPORT (1ULL<<57) > > -#define VIRTIO_NET_F_GUEST_HDRLEN (1ULL<<59) > > -#define VIRTIO_NET_F_RSS (1ULL<<60) > > -#define VIRTIO_NET_F_RSC_EXT (1ULL<<61) > > -#define VIRTIO_NET_F_STANDBY (1ULL<<62) > > -#define VIRTIO_NET_F_SPEED_DUPLEX (1ULL<<63) Everything from here ... > > -/* > > - * Config(8) flags. The lowest byte is reserved for generic virtio stuff. > > - */ > > - > > -/* Workaround for vlan related bug in qemu < version 2.0 */ > > -#define CONFFLAG_QEMU_VLAN_BUG (1<<8) > > - > > -static const struct virtio_feature_name virtio_net_feature_names[] = { > > -#if VIRTIO_DEBUG > > - { VIRTIO_NET_F_CSUM, "CSum" }, > > - { VIRTIO_NET_F_GUEST_CSUM, "GuestCSum" }, > > - { VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, "CtrlGuestOffl" }, > > - { VIRTIO_NET_F_MTU, "MTU", }, > > - { VIRTIO_NET_F_MAC, "MAC" }, > > - { VIRTIO_NET_F_GSO, "GSO" }, > > - { VIRTIO_NET_F_GUEST_TSO4, "GuestTSO4" }, > > - { VIRTIO_NET_F_GUEST_TSO6, "GuestTSO6" }, > > - { VIRTIO_NET_F_GUEST_ECN, "GuestECN" }, > > - { VIRTIO_NET_F_GUEST_UFO, "GuestUFO" }, > > - { VIRTIO_NET_F_HOST_TSO4, "HostTSO4" }, > > - { VIRTIO_NET_F_HOST_TSO6, "HostTSO6" }, > > - { VIRTIO_NET_F_HOST_ECN, "HostECN" }, > > - { VIRTIO_NET_F_HOST_UFO, "HostUFO" }, > > - { VIRTIO_NET_F_MRG_RXBUF, "MrgRXBuf" }, > > - { VIRTIO_NET_F_STATUS, "Status" }, > > - { VIRTIO_NET_F_CTRL_VQ, "CtrlVQ" }, > > - { VIRTIO_NET_F_CTRL_RX, "CtrlRX" }, > > - { VIRTIO_NET_F_CTRL_VLAN, "CtrlVLAN" }, > > - { VIRTIO_NET_F_CTRL_RX_EXTRA, "CtrlRXExtra" }, > > - { VIRTIO_NET_F_GUEST_ANNOUNCE, "GuestAnnounce" }, > > - { VIRTIO_NET_F_MQ, "MQ" }, > > - { VIRTIO_NET_F_CTRL_MAC_ADDR, "CtrlMAC" }, > > - { VIRTIO_NET_F_HOST_USO, "HostUso" }, > > - { VIRTIO_NET_F_HASH_REPORT, "HashRpt" }, > > - { VIRTIO_NET_F_GUEST_HDRLEN, "GuestHdrlen" }, > > - { VIRTIO_NET_F_RSS, "RSS" }, > > - { VIRTIO_NET_F_RSC_EXT, "RSSExt" }, > > - { VIRTIO_NET_F_STANDBY, "Stdby" }, > > - { VIRTIO_NET_F_SPEED_DUPLEX, "SpdDplx" }, > > -#endif > > - { 0, NULL } > > -}; ... to here should stay in if_vio.c . > > - > > -/* Status */ > > -#define VIRTIO_NET_S_LINK_UP 1 > > - > > -/* Packet header structure */ > > -struct virtio_net_hdr { > > - uint8_t flags; > > - uint8_t gso_type; > > - uint16_t hdr_len; > > - uint16_t gso_size; > > - uint16_t csum_start; > > - uint16_t csum_offset; > > - > > - /* only present if VIRTIO_NET_F_MRG_RXBUF is negotiated */ > > - uint16_t num_buffers; > > -} __packed; > > - > > -#define VIRTIO_NET_HDR_F_NEEDS_CSUM 1 /* flags */ > > -#define VIRTIO_NET_HDR_F_DATA_VALID 2 /* flags */ > > -#define VIRTIO_NET_HDR_GSO_NONE 0 /* gso_type */ > > -#define VIRTIO_NET_HDR_GSO_TCPV4 1 /* gso_type */ > > -#define VIRTIO_NET_HDR_GSO_UDP 3 /* gso_type */ > > -#define VIRTIO_NET_HDR_GSO_TCPV6 4 /* gso_type */ > > -#define VIRTIO_NET_HDR_GSO_ECN 0x80 /* gso_type, |'ed */ > > - > > -#define VIRTIO_NET_MAX_GSO_LEN (65536+ETHER_HDR_LEN) > > - > > -/* Control virtqueue */ > > -struct virtio_net_ctrl_cmd { > > - uint8_t class; > > - uint8_t command; > > -} __packed; > > -#define VIRTIO_NET_CTRL_RX 0 > > -# define VIRTIO_NET_CTRL_RX_PROMISC 0 > > -# define VIRTIO_NET_CTRL_RX_ALLMULTI 1 > > - > > -#define VIRTIO_NET_CTRL_MAC 1 > > -# define VIRTIO_NET_CTRL_MAC_TABLE_SET 0 > > - > > -#define VIRTIO_NET_CTRL_VLAN 2 > > -# define VIRTIO_NET_CTRL_VLAN_ADD 0 > > -# define VIRTIO_NET_CTRL_VLAN_DEL 1 > > - > > -#define VIRTIO_NET_CTRL_GUEST_OFFLOADS 5 > > -# define VIRTIO_NET_CTRL_GUEST_OFFLOADS_SET 0 > > - > > -struct virtio_net_ctrl_status { > > - uint8_t ack; > > -} __packed; > > -#define VIRTIO_NET_OK 0 > > -#define VIRTIO_NET_ERR 1 > > - > > -struct virtio_net_ctrl_rx { > > - uint8_t onoff; > > -} __packed; > > - > > -struct virtio_net_ctrl_guest_offloads { > > - uint64_t offloads; > > -} __packed; > > - > > -struct virtio_net_ctrl_mac_tbl { > > - uint32_t nentries; > > - uint8_t macs[][ETHER_ADDR_LEN]; > > -} __packed; > > - > > -struct virtio_net_ctrl_vlan { > > - uint16_t id; > > -} __packed; > > +#include "if_vioreg.h" > > > > /* > > * if_viovar.h: > > Index: sys/dev/pv/if_vioreg.h > > =================================================================== > > RCS file: sys/dev/pv/if_vioreg.h > > diff -N sys/dev/pv/if_vioreg.h > > --- /dev/null 1 Jan 1970 00:00:00 -0000 > > +++ sys/dev/pv/if_vioreg.h 16 Aug 2024 08:47:15 -0000 > > @@ -0,0 +1,182 @@ > > +/* $OpenBSD$ */ > > + > > +/* > > + * Copyright (c) 2012 Stefan Fritsch, Alexander Fiveg. > > + * Copyright (c) 2010 Minoura Makoto. > > + * All rights reserved. > > + * > > + * Redistribution and use in source and binary forms, with or without > > + * modification, are permitted provided that the following conditions > > + * are met: > > + * 1. Redistributions of source code must retain the above copyright > > + * notice, this list of conditions and the following disclaimer. > > + * 2. Redistributions in binary form must reproduce the above copyright > > + * notice, this list of conditions and the following disclaimer in the > > + * documentation and/or other materials provided with the distribution. > > + * > > + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR > > + * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES > > + * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. > > + * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, > > + * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT > > + * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, > > + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY > > + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT > > + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF > > + * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. > > + */ > > + > > +/* Configuration registers */ > > +#define VIRTIO_NET_CONFIG_MAC 0 /* 8bit x 6byte */ > > +#define VIRTIO_NET_CONFIG_STATUS 6 /* 16bit */ > > + > > +/* Feature bits */ > > +#define VIRTIO_NET_F_CSUM (1ULL<<0) > > +#define VIRTIO_NET_F_GUEST_CSUM (1ULL<<1) > > +#define VIRTIO_NET_F_CTRL_GUEST_OFFLOADS (1ULL<<2) > > +#define VIRTIO_NET_F_MTU (1ULL<<3) > > +#define VIRTIO_NET_F_MAC (1ULL<<5) > > +#define VIRTIO_NET_F_GSO (1ULL<<6) > > +#define VIRTIO_NET_F_GUEST_TSO4 (1ULL<<7) > > +#define VIRTIO_NET_F_GUEST_TSO6 (1ULL<<8) > > +#define VIRTIO_NET_F_GUEST_ECN (1ULL<<9) > > +#define VIRTIO_NET_F_GUEST_UFO (1ULL<<10) > > +#define VIRTIO_NET_F_HOST_TSO4 (1ULL<<11) > > +#define VIRTIO_NET_F_HOST_TSO6 (1ULL<<12) > > +#define VIRTIO_NET_F_HOST_ECN (1ULL<<13) > > +#define VIRTIO_NET_F_HOST_UFO (1ULL<<14) > > +#define VIRTIO_NET_F_MRG_RXBUF (1ULL<<15) > > +#define VIRTIO_NET_F_STATUS (1ULL<<16) > > +#define VIRTIO_NET_F_CTRL_VQ (1ULL<<17) > > +#define VIRTIO_NET_F_CTRL_RX (1ULL<<18) > > +#define VIRTIO_NET_F_CTRL_VLAN (1ULL<<19) > > +#define VIRTIO_NET_F_CTRL_RX_EXTRA (1ULL<<20) > > +#define VIRTIO_NET_F_GUEST_ANNOUNCE (1ULL<<21) > > +#define VIRTIO_NET_F_MQ (1ULL<<22) > > +#define VIRTIO_NET_F_CTRL_MAC_ADDR (1ULL<<23) > > +#define VIRTIO_NET_F_HOST_USO (1ULL<<56) > > +#define VIRTIO_NET_F_HASH_REPORT (1ULL<<57) > > +#define VIRTIO_NET_F_GUEST_HDRLEN (1ULL<<59) > > +#define VIRTIO_NET_F_RSS (1ULL<<60) > > +#define VIRTIO_NET_F_RSC_EXT (1ULL<<61) > > +#define VIRTIO_NET_F_STANDBY (1ULL<<62) > > +#define VIRTIO_NET_F_SPEED_DUPLEX (1ULL<<63) > > +/* > > + * Config(8) flags. The lowest byte is reserved for generic virtio stuff. > > + */ > > + > > +/* Workaround for vlan related bug in qemu < version 2.0 */ > > +#define CONFFLAG_QEMU_VLAN_BUG (1<<8) > > + > > +static const struct virtio_feature_name virtio_net_feature_names[] = { > > +#if VIRTIO_DEBUG > > + { VIRTIO_NET_F_CSUM, "CSum" }, > > + { VIRTIO_NET_F_GUEST_CSUM, "GuestCSum" }, > > + { VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, "CtrlGuestOffl" }, > > + { VIRTIO_NET_F_MTU, "MTU", }, > > + { VIRTIO_NET_F_MAC, "MAC" }, > > + { VIRTIO_NET_F_GSO, "GSO" }, > > + { VIRTIO_NET_F_GUEST_TSO4, "GuestTSO4" }, > > + { VIRTIO_NET_F_GUEST_TSO6, "GuestTSO6" }, > > + { VIRTIO_NET_F_GUEST_ECN, "GuestECN" }, > > + { VIRTIO_NET_F_GUEST_UFO, "GuestUFO" }, > > + { VIRTIO_NET_F_HOST_TSO4, "HostTSO4" }, > > + { VIRTIO_NET_F_HOST_TSO6, "HostTSO6" }, > > + { VIRTIO_NET_F_HOST_ECN, "HostECN" }, > > + { VIRTIO_NET_F_HOST_UFO, "HostUFO" }, > > + { VIRTIO_NET_F_MRG_RXBUF, "MrgRXBuf" }, > > + { VIRTIO_NET_F_STATUS, "Status" }, > > + { VIRTIO_NET_F_CTRL_VQ, "CtrlVQ" }, > > + { VIRTIO_NET_F_CTRL_RX, "CtrlRX" }, > > + { VIRTIO_NET_F_CTRL_VLAN, "CtrlVLAN" }, > > + { VIRTIO_NET_F_CTRL_RX_EXTRA, "CtrlRXExtra" }, > > + { VIRTIO_NET_F_GUEST_ANNOUNCE, "GuestAnnounce" }, > > + { VIRTIO_NET_F_MQ, "MQ" }, > > + { VIRTIO_NET_F_CTRL_MAC_ADDR, "CtrlMAC" }, > > + { VIRTIO_NET_F_HOST_USO, "HostUso" }, > > + { VIRTIO_NET_F_HASH_REPORT, "HashRpt" }, > > + { VIRTIO_NET_F_GUEST_HDRLEN, "GuestHdrlen" }, > > + { VIRTIO_NET_F_RSS, "RSS" }, > > + { VIRTIO_NET_F_RSC_EXT, "RSSExt" }, > > + { VIRTIO_NET_F_STANDBY, "Stdby" }, > > + { VIRTIO_NET_F_SPEED_DUPLEX, "SpdDplx" }, > > +#endif > > + { 0, NULL } > > +}; > > + > > +/* Status */ > > +#define VIRTIO_NET_S_LINK_UP 1 > > + > > +/* Packet header structure */ > > +struct virtio_net_hdr { > > + uint8_t flags; > > + uint8_t gso_type; > > + uint16_t hdr_len; > > + uint16_t gso_size; > > + uint16_t csum_start; > > + uint16_t csum_offset; > > + > > + /* only present if VIRTIO_NET_F_MRG_RXBUF is negotiated */ > > + uint16_t num_buffers; > > +} __packed; > > + > > +struct virtio_net_hdr_legacy { > > + uint8_t flags; > > + uint8_t gso_type; > > + uint16_t hdr_len; > > + uint16_t gso_size; > > + uint16_t csum_start; > > + uint16_t csum_offset; > > +} __packed; > > + > > +#define VIRTIO_NET_HDR_F_NEEDS_CSUM 1 /* flags */ > > +#define VIRTIO_NET_HDR_F_DATA_VALID 2 /* flags */ > > +#define VIRTIO_NET_HDR_GSO_NONE 0 /* gso_type */ > > +#define VIRTIO_NET_HDR_GSO_TCPV4 1 /* gso_type */ > > +#define VIRTIO_NET_HDR_GSO_UDP 3 /* gso_type */ > > +#define VIRTIO_NET_HDR_GSO_TCPV6 4 /* gso_type */ > > +#define VIRTIO_NET_HDR_GSO_ECN 0x80 /* gso_type, |'ed */ > > + > > +#define VIRTIO_NET_MAX_GSO_LEN (65536+ETHER_HDR_LEN) > > + > > +/* Control virtqueue */ > > +struct virtio_net_ctrl_cmd { > > + uint8_t class; > > + uint8_t command; > > +} __packed; > > +#define VIRTIO_NET_CTRL_RX 0 > > +# define VIRTIO_NET_CTRL_RX_PROMISC 0 > > +# define VIRTIO_NET_CTRL_RX_ALLMULTI 1 > > + > > +#define VIRTIO_NET_CTRL_MAC 1 > > +# define VIRTIO_NET_CTRL_MAC_TABLE_SET 0 > > + > > +#define VIRTIO_NET_CTRL_VLAN 2 > > +# define VIRTIO_NET_CTRL_VLAN_ADD 0 > > +# define VIRTIO_NET_CTRL_VLAN_DEL 1 > > + > > +#define VIRTIO_NET_CTRL_GUEST_OFFLOADS 5 > > +# define VIRTIO_NET_CTRL_GUEST_OFFLOADS_SET 0 > > + > > +struct virtio_net_ctrl_status { > > + uint8_t ack; > > +} __packed; > > +#define VIRTIO_NET_OK 0 > > +#define VIRTIO_NET_ERR 1 > > + > > +struct virtio_net_ctrl_rx { > > + uint8_t onoff; > > +} __packed; > > + > > +struct virtio_net_ctrl_guest_offloads { > > + uint64_t offloads; > > +} __packed; > > + > > +struct virtio_net_ctrl_mac_tbl { > > + uint32_t nentries; > > + uint8_t macs[][ETHER_ADDR_LEN]; > > +} __packed; > > + > > +struct virtio_net_ctrl_vlan { > > + uint16_t id; > > +} __packed; > > Index: sys/net/if_tun.c > > =================================================================== > > RCS file: /cvs/src/sys/net/if_tun.c,v > > diff -u -p -r1.240 if_tun.c > > --- sys/net/if_tun.c 23 Dec 2023 10:52:54 -0000 1.240 > > +++ sys/net/if_tun.c 16 Aug 2024 12:00:46 -0000 > > @@ -63,8 +63,15 @@ > > #include > > > > #include > > +#include > > +#include > > +#include > > +#include > > #include > > > > +#include > > +#include > > + > > #include "bpfilter.h" > > #if NBPFILTER > 0 > > #include > > @@ -92,6 +99,7 @@ struct tun_softc { > > dev_t sc_dev; > > struct refcnt sc_refs; > > unsigned int sc_reading; > > + int sc_vhdrlen; > > }; > > > > #ifdef TUN_DEBUG > > @@ -779,6 +787,17 @@ tun_dev_ioctl(dev_t dev, u_long cmd, voi > > bcopy(data, sc->sc_ac.ac_enaddr, > > sizeof(sc->sc_ac.ac_enaddr)); > > break; > > + case TAPSVNETHDR: > > + if (*(int *)data != 0 && > > + *(int *)data != sizeof(struct virtio_net_hdr) && > > + *(int *)data != sizeof(struct virtio_net_hdr_legacy)) { > > + error = EINVAL; > > + break; > > + } > > + > > + sc->sc_vhdrlen = *(int *)data; > > + > > + break; > > default: > > error = ENOTTY; > > break; > > @@ -828,6 +847,17 @@ tun_dev_read(dev_t dev, struct uio *uio, > > bpf_mtap(ifp->if_bpf, m0, BPF_DIRECTION_OUT); > > #endif > > > > + size_t len = ulmin(uio->uio_resid, sc->sc_vhdrlen); > > + if (len > 0) { > > + struct virtio_net_hdr vhdr; > > + > > + bzero(&vhdr, sizeof(vhdr)); > > + > > + error = uiomove(&vhdr, len, uio); > > + if (error != 0) > > + goto free; > > + } > > + > > m = m0; > > while (uio->uio_resid > 0) { > > size_t len = ulmin(uio->uio_resid, m->m_len); > > @@ -841,7 +871,7 @@ tun_dev_read(dev_t dev, struct uio *uio, > > if (m == NULL) > > break; > > } > > - > > +free: > > m_freem(m0); > > > > put: > > @@ -870,6 +900,7 @@ tun_dev_write(dev_t dev, struct uio *uio > > struct tun_softc *sc; > > struct ifnet *ifp; > > struct mbuf *m0; > > + struct virtio_net_hdr vhdr; > > int error = 0; > > size_t mlen; > > > > @@ -905,9 +936,31 @@ tun_dev_write(dev_t dev, struct uio *uio > > m0->m_pkthdr.len = m0->m_len = mlen; > > m_adj(m0, align); > > > > + if (sc->sc_vhdrlen > 0) { > > + error = uiomove(&vhdr, sc->sc_vhdrlen, uio); > > + if (error != 0) > > + goto drop; > > + } > > + > > error = uiomove(mtod(m0, void *), m0->m_len, uio); > > if (error != 0) > > goto drop; > > + > > + if (sc->sc_vhdrlen > 0) { > > + if (ISSET(vhdr.flags, VIRTIO_NET_HDR_F_NEEDS_CSUM)) { > > + struct ether_extracted ext; > > + > > + ether_extract_headers(m0, &ext); > > + > > + if (ext.tcp) { > > + SET(m0->m_pkthdr.csum_flags, M_TCP_CSUM_IN_OK); > > + SET(m0->m_pkthdr.csum_flags, M_TCP_CSUM_OUT); > > + } else if (ext.udp) { > > + SET(m0->m_pkthdr.csum_flags, M_UDP_CSUM_IN_OK); > > + SET(m0->m_pkthdr.csum_flags, M_UDP_CSUM_OUT); > > + } > > + } > > + } > > > > NET_LOCK(); > > if_vinput(ifp, m0); > > Index: sys/net/if_tun.h > > =================================================================== > > RCS file: /cvs/src/sys/net/if_tun.h,v > > diff -u -p -r1.15 if_tun.h > > --- sys/net/if_tun.h 6 Feb 2007 10:49:40 -0000 1.15 > > +++ sys/net/if_tun.h 16 Aug 2024 12:00:43 -0000 > > @@ -68,4 +68,7 @@ struct tuninfo { > > #define TUNSDEBUG _IOW('t', 94, int) > > #define TUNGDEBUG _IOR('t', 95, int) > > > > +/* ioctl to set the virtio-net header size */ > > +#define TAPSVNETHDR _IOW('t', 96, int) > > + > > #endif /* _NET_IF_TUN_H_ */ > > Index: usr.sbin/vmd/vionet.c > > =================================================================== > > RCS file: /cvs/src/usr.sbin/vmd/vionet.c,v > > diff -u -p -r1.16 vionet.c > > --- usr.sbin/vmd/vionet.c 12 Jul 2024 14:34:08 -0000 1.16 > > +++ usr.sbin/vmd/vionet.c 16 Aug 2024 12:00:43 -0000 > > @@ -18,6 +18,7 @@ > > */ > > #include > > #include > > +#include > > > > #include > > #include > > @@ -46,11 +47,6 @@ > > extern char *__progname; > > extern struct vmd_vm *current_vm; > > > > -struct packet { > > - uint8_t *buf; > > - size_t len; > > -}; > > - > > static void *rx_run_loop(void *); > > static void *tx_run_loop(void *); > > static int vionet_rx(struct vionet_dev *, int); > > @@ -102,13 +98,7 @@ vionet_main(int fd, int fd_vmm) > > struct vm_create_params *vcp; > > ssize_t sz; > > int ret; > > - > > - /* > > - * stdio - needed for read/write to disk fds and channels to the vm. > > - * vmm + proc - needed to create shared vm mappings. > > - */ > > - if (pledge("stdio vmm proc", NULL) == -1) > > - fatal("pledge"); > > + int vhdrlen = sizeof(struct virtio_net_hdr); > > > > /* Initialize iovec arrays. */ > > memset(iov_rx, 0, sizeof(iov_rx)); > > @@ -133,6 +123,17 @@ vionet_main(int fd, int fd_vmm) > > ", vmm fd = %d", __func__, vionet->data_fd, dev.sync_fd, > > dev.async_fd, fd_vmm); > > > > + /* Activate virtio_net_hdr for tap(4) device. */ > > + if (ioctl(vionet->data_fd, TAPSVNETHDR, &vhdrlen) == -1) > > + fatal("%s: TAPSVNETHDR option on tap(4) failed", __func__); > > + > > + /* > > + * stdio - needed for read/write to disk fds and channels to the vm. > > + * vmm + proc - needed to create shared vm mappings. > > + */ > > + if (pledge("stdio vmm proc", NULL) == -1) > > + fatal("pledge"); > > + > > /* Receive our vm information from the vm process. */ > > memset(&vm, 0, sizeof(vm)); > > sz = atomicio(read, dev.sync_fd, &vm, sizeof(vm)); > > @@ -413,6 +414,7 @@ vionet_rx(struct vionet_dev *dev, int fd > > > > iov = &iov_rx[0]; > > iov_cnt = 1; > > + chain_len = 0; > > > > /* > > * First descriptor should be at least as large as the > > @@ -425,22 +427,9 @@ vionet_rx(struct vionet_dev *dev, int fd > > goto reset; > > } > > > > - /* > > - * Insert the virtio_net_hdr and adjust len/base. We do the > > - * pointer math here before it's a void*. > > - */ > > iov->iov_base = hvaddr_mem(desc->addr, iov->iov_len); > > if (iov->iov_base == NULL) > > goto reset; > > - memset(iov->iov_base, 0, sizeof(struct virtio_net_hdr)); > > - > > - /* Tweak the iovec to account for the virtio_net_hdr. */ > > - iov->iov_len -= sizeof(struct virtio_net_hdr); > > - iov->iov_base = hvaddr_mem(desc->addr + > > - sizeof(struct virtio_net_hdr), iov->iov_len); > > - if (iov->iov_base == NULL) > > - goto reset; > > - chain_len = iov->iov_len; > > > > /* > > * Walk the remaining chain and collect remaining addresses > > @@ -491,12 +480,6 @@ vionet_rx(struct vionet_dev *dev, int fd > > if (sz == 0) /* No packets, so bail out for now. */ > > break; > > > > - /* > > - * Account for the prefixed header since it wasn't included > > - * in the copy or zerocopy operations. > > - */ > > - sz += sizeof(struct virtio_net_hdr); > > - > > /* Mark our buffers as used. */ > > used->ring[used->idx & VIONET_QUEUE_MASK].id = hdr_idx; > > used->ring[used->idx & VIONET_QUEUE_MASK].len = sz; > > @@ -533,22 +516,19 @@ vionet_rx_copy(struct vionet_dev *dev, i > > int iov_cnt, size_t chain_len) > > { > > static uint8_t buf[VIONET_HARD_MTU]; > > - struct packet *pkt = NULL; > > struct ether_header *eh = NULL; > > uint8_t *payload = buf; > > size_t i, chunk, nbytes, copied = 0; > > ssize_t sz; > > > > - /* If reading from the tap(4), try to right-size the read. */ > > - if (fd == dev->data_fd) > > - nbytes = MIN(chain_len, VIONET_HARD_MTU); > > - else if (fd == pipe_inject[READ]) > > - nbytes = sizeof(struct packet); > > - else { > > + if (fd != dev->data_fd && fd != pipe_inject[READ]) { > > log_warnx("%s: invalid fd: %d", __func__, fd); > > return (-1); > > } > > > > + /* If reading from the tap(4), try to right-size the read. */ > > + nbytes = MIN(chain_len, VIONET_HARD_MTU); > > + > > /* > > * Try to pull a packet. The fd should be non-blocking and we don't > > * care if we under-read (i.e. sz != nbytes) as we may not have a > > @@ -560,36 +540,18 @@ vionet_rx_copy(struct vionet_dev *dev, i > > log_warn("%s: error reading packet", __func__); > > return (-1); > > } > > + log_warn("%s: EAGAIN", __func__); > > return (0); > > - } else if (fd == dev->data_fd && sz < VIONET_MIN_TXLEN) { > > + } else if (fd == dev->data_fd && (size_t)sz < VIONET_MIN_TXLEN) { > > /* If reading the tap(4), we should get valid ethernet. */ > > log_warnx("%s: invalid packet size", __func__); > > return (0); > > - } else if (fd == pipe_inject[READ] && sz != sizeof(struct packet)) { > > - log_warnx("%s: invalid injected packet object (sz=%ld)", > > - __func__, sz); > > - return (0); > > - } > > - > > - /* Decompose an injected packet, if that's what we're working with. */ > > - if (fd == pipe_inject[READ]) { > > - pkt = (struct packet *)buf; > > - if (pkt->buf == NULL) { > > - log_warnx("%s: invalid injected packet, no buffer", > > - __func__); > > - return (0); > > - } > > - if (sz < VIONET_MIN_TXLEN || sz > VIONET_MAX_TXLEN) { > > - log_warnx("%s: invalid injected packet size", __func__); > > - goto drop; > > - } > > - payload = pkt->buf; > > - sz = (ssize_t)pkt->len; > > } > > > > /* Validate the ethernet header, if required. */ > > if (dev->lockedmac) { > > - eh = (struct ether_header *)(payload); > > + eh = (struct ether_header *)(payload > > + + sizeof(struct virtio_net_hdr)); > > if (!ETHER_IS_MULTICAST(eh->ether_dhost) && > > memcmp(eh->ether_dhost, dev->mac, > > sizeof(eh->ether_dhost)) != 0) > > @@ -610,10 +572,6 @@ vionet_rx_copy(struct vionet_dev *dev, i > > } > > > > drop: > > - /* Free any injected packet buffer. */ > > - if (pkt != NULL) > > - free(pkt->buf); > > - > > return (copied); > > } > > > > @@ -724,7 +682,6 @@ vionet_tx(struct virtio_dev *dev) > > struct virtio_vq_info *vq_info; > > struct ether_header *eh; > > struct iovec *iov; > > - struct packet pkt; > > uint8_t status = 0; > > > > status = vionet->cfg.device_status > > @@ -766,17 +723,15 @@ vionet_tx(struct virtio_dev *dev) > > if (iov->iov_len < sizeof(struct virtio_net_hdr)) { > > log_warnx("%s: invalid descriptor length", __func__); > > goto reset; > > - } else if (iov->iov_len > sizeof(struct virtio_net_hdr)) { > > - /* Chop off the virtio header, leaving packet data. */ > > - iov->iov_len -= sizeof(struct virtio_net_hdr); > > - chain_len += iov->iov_len; > > - iov->iov_base = hvaddr_mem(desc->addr + > > - sizeof(struct virtio_net_hdr), iov->iov_len); > > - if (iov->iov_base == NULL) > > - goto reset; > > - iov_cnt++; > > } > > > > + /* Handel the first descriptor. */ > > + chain_len += iov->iov_len; > > + iov->iov_base = hvaddr_mem(desc->addr, iov->iov_len); > > + if (iov->iov_base == NULL) > > + goto reset; > > + iov_cnt++; > > + > > /* > > * Walk the chain and collect remaining addresses and lengths. > > */ > > @@ -818,6 +773,8 @@ vionet_tx(struct virtio_dev *dev) > > * for this inspection. > > */ > > iov = &iov_tx[0]; > > + if (iov->iov_len == sizeof(struct virtio_net_hdr)) > > + iov = &iov_tx[1]; > > if (vionet->lockedmac) { > > if (iov->iov_len < ETHER_HDR_LEN) { > > log_warnx("%s: insufficient header data", > > @@ -839,6 +796,7 @@ vionet_tx(struct virtio_dev *dev) > > if (dhcpsz > 0) { > > log_debug("%s: detected dhcp request of %zu bytes", > > __func__, dhcpsz); > > + sz = chain_len; > > goto drop; > > } > > } > > @@ -859,20 +817,23 @@ drop: > > > > /* Facilitate DHCP reply injection, if needed. */ > > if (dhcpsz > 0) { > > - pkt.buf = dhcppkt; > > - pkt.len = dhcpsz; > > - sz = write(pipe_inject[WRITE], &pkt, sizeof(pkt)); > > - if (sz == -1 && errno != EAGAIN) { > > + struct virtio_net_hdr vhdr = { 0 }; > > + struct iovec iovec[2] = { > > + { .iov_base = &vhdr, .iov_len = sizeof(vhdr) }, > > + { .iov_base = dhcppkt, .iov_len = dhcpsz } > > + }; > > + > > + sz = writev(pipe_inject[WRITE], iovec, 2); > > + > > + if (sz == -1 && errno != EAGAIN) > > log_warn("%s: packet injection", __func__); > > - free(pkt.buf); > > - } else if (sz == -1 && errno == EAGAIN) { > > + else if (sz == -1 && errno == EAGAIN) > > log_debug("%s: dropping dhcp reply", __func__); > > - free(pkt.buf); > > - } else if (sz != sizeof(pkt)) { > > + else if (sz != (ssize_t)(sizeof(vhdr) + dhcpsz)) > > log_warnx("%s: failed packet injection", > > __func__); > > - free(pkt.buf); > > - } > > + > > + free(dhcppkt); > > log_debug("%s: injected dhcp reply with %ld bytes", > > __func__, sz); > > } > > Index: usr.sbin/vmd/virtio.c > > =================================================================== > > RCS file: /cvs/src/usr.sbin/vmd/virtio.c,v > > diff -u -p -r1.115 virtio.c > > --- usr.sbin/vmd/virtio.c 10 Jul 2024 09:27:33 -0000 1.115 > > +++ usr.sbin/vmd/virtio.c 16 Aug 2024 12:00:46 -0000 > > @@ -60,6 +60,7 @@ SLIST_HEAD(virtio_dev_head, virtio_dev) > > > > #define MAXPHYS (64 * 1024) /* max raw I/O transfer size */ > > > > +#define VIRTIO_NET_F_CSUM (1<<0) > > #define VIRTIO_NET_F_MAC (1<<5) > > > > #define VMMCI_F_TIMESYNC (1<<0) > > @@ -584,6 +585,9 @@ virtio_init(struct vmd_vm *vm, int child > > /* MAC address has been assigned by the parent */ > > memcpy(&dev->vionet.mac, &vmc->vmc_macs[i], 6); > > dev->vionet.cfg.device_feature = VIRTIO_NET_F_MAC; > > + > > + /* enable checksum offloading */ > > + dev->vionet.cfg.device_feature |= VIRTIO_NET_F_CSUM; > > > > dev->vionet.lockedmac = > > vmc->vmc_ifflags[i] & VMIFF_LOCKED ? 1 : 0; > > Index: usr.sbin/vmd/virtio.h > > =================================================================== > > RCS file: /cvs/src/usr.sbin/vmd/virtio.h,v > > diff -u -p -r1.52 virtio.h > > --- usr.sbin/vmd/virtio.h 10 Jul 2024 09:27:33 -0000 1.52 > > +++ usr.sbin/vmd/virtio.h 16 Aug 2024 12:00:43 -0000 > > @@ -49,7 +49,7 @@ > > > > /* Virtio network device is backed by tap(4), so inherit limits */ > > #define VIONET_HARD_MTU TUNMRU > > -#define VIONET_MIN_TXLEN ETHER_HDR_LEN > > +#define VIONET_MIN_TXLEN (sizeof(struct virtio_net_hdr) + ETHER_HDR_LEN) > > #define VIONET_MAX_TXLEN VIONET_HARD_MTU + ETHER_HDR_LEN > > > > /* VMM Control Interface shutdown timeout (in seconds) */ > >