From: Alexander Bluhm
Subject: Re: vmx TSO pullup headers into first mbuf
To: David Gwynne
Cc: tech@openbsd.org
Date: Thu, 13 Jun 2024 11:44:22 +0200

On Thu, Jun 13, 2024 at 04:42:05PM +1000, David Gwynne wrote:
> ipv4 forwarding used to do this, but it was changed to copy data into a
> fake mbuf on the stack. this avoided an mbuf allocation "just in case"
> for every forwarded packet, which was a noticeable speed improvement.
> sounds like the idea could be useful here too.
>
> the changes were in src/sys/netinet/ip_input.c r1.167 to 1.169.

The problem is that the IPv6 RFC requires sending a 1280 byte ICMP6
reply.  That is too large for our stack.  And the protocol needs a
large ICMP6 reply due to unrestricted header chains.  Complexity
kills performance in this case.

A 1200 byte per-CPU cache for the ICMP6 copy comes to mind.  Is
copying 1200 bytes faster than two mbuf allocations with a shared
cluster?

The shared cluster has another problem.  When pf does NAT or RDR,
it also modifies the ICMP6 copy.  pf ignores M_READONLY.  Then the
ICMP6 packet does not contain the original packet and is dropped
by the remote site.

> vmx still needs this code if a weird packet ends up being sent for other
> reasons though. if the problem is caused by v6 packets, shouldnt the
> hdrlen consider sizeof(struct ip6_hdr) instead of sizeof(struct ip)?

The clean solution would be to call ether_extract_headers() early.
Note that vmx_load_mbuf() might modify the mbuf, so we would have
to call it twice.

Instead I implemented this heuristic.  In TCP output you get an
mbuf with the headers and a next mbuf cluster with a copy of the
send buffer.  The first mbuf contains the Ether+IP+TCP header.  I
implemented a minimal IPv4 length check to avoid a needless memmove
in this common case.  The first mbuf of an IPv6 packet could be
too short and slip through this check.  I guess that does not
happen in practice.  If you want correct values, you need the IP
version and the TCP option length.
I want to avoid additional packet parsing in all cases for a case
that should not happen.  The worst thing that can happen with my
current implementation is that the vmx(4) virtual hardware refuses
to send and the dropped tx counter increases.

bluhm

> On 13/06/2024 08:11, Alexander Bluhm wrote:
> > Hi,
> >
> > In vmx(4) TSO must pullup headers into first mbuf.
> >
> > Forwarding IPv6 packets from vmx with LRO to vmx with TSO does not
> > work.  vmx(4) has the requirement that all headers are in the first
> > mbuf.  ip6_forward() is quite dumb.  It calls m_copym() to create
> > a mbuf that might be used for sending ICMP6 later.  After passing
> > the forwarded packet down to ether_encap(), m_prepend() is used to
> > restore the ethernet header.  As the mbuf cluster has been copied,
> > it is read only now.  That means m_prepend() does not provide the
> > empty space at the beginning of the cluster, but allocates a new
> > mbuf that contains only the ethernet header.  vmx(4) cannot transmit
> > such a TSO packet and drops it.
> >
> > Solution is to call m_pullup() in vmxnet3_start().  If we ended up
> > in such a miserable condition, use the first mbuf in the chain and
> > move all headers into it.
> >
> > ok?
> >
> > bluhm
> >
> > Index: dev/pci/if_vmx.c
> > ===================================================================
> > RCS file: /data/mirror/openbsd/cvs/src/sys/dev/pci/if_vmx.c,v
> > diff -u -p -r1.87 if_vmx.c
> > --- dev/pci/if_vmx.c	7 Jun 2024 08:44:25 -0000	1.87
> > +++ dev/pci/if_vmx.c	12 Jun 2024 14:34:11 -0000
> > @@ -1619,6 +1619,8 @@ vmxnet3_start(struct ifqueue *ifq)
> >  	rgen = ring->gen;
> >
> >  	for (;;) {
> > +		int hdrlen;
> > +
> >  		if (free <= NTXSEGS) {
> >  			ifq_set_oactive(ifq);
> >  			break;
> > @@ -1627,6 +1629,29 @@ vmxnet3_start(struct ifqueue *ifq)
> >  		m = ifq_dequeue(ifq);
> >  		if (m == NULL)
> >  			break;
> > +
> > +		/*
> > +		 * Headers for Ether, IP, TCP including options must lay in
> > +		 * first mbuf to support TSO.  Usually our stack gets that
> > +		 * right.  To avoid packet parsing here, make a rough estimate
> > +		 * for simple IPv4.  Cases seen in the wild contain only ether
> > +		 * header in separate mbuf.  To support IPv6 with TCP options,
> > +		 * move as much as possible into first mbuf.  Realloc mbuf
> > +		 * before bus dma load.
> > +		 */
> > +		hdrlen = sizeof(struct ether_header) + sizeof(struct ip) +
> > +		    sizeof(struct tcphdr);
> > +		if (ISSET(m->m_pkthdr.csum_flags, M_TCP_TSO) &&
> > +		    m->m_len < hdrlen && hdrlen <= m->m_pkthdr.len) {
> > +			hdrlen = MHLEN;
> > +			hdrlen -= mtod(m, unsigned long) & (sizeof(long) - 1);
> > +			if (hdrlen > m->m_pkthdr.len)
> > +				hdrlen = m->m_pkthdr.len;
> > +			if ((m = m_pullup(m, hdrlen)) == NULL) {
> > +				ifq->ifq_errors++;
> > +				continue;
> > +			}
> > +		}
> >
> >  		map = ring->dmap[prod];
> >