From: Alexander Bluhm <bluhm@openbsd.org>
Subject: Re: parallel TCP input
To: Hrvoje Popovski <hrvoje@srce.hr>
Cc: tech@openbsd.org
Date: Fri, 7 Feb 2025 23:47:53 +0100

On Fri, Feb 07, 2025 at 10:31:01PM +0100, Hrvoje Popovski wrote:
> On 5.2.2025. 19:25, Alexander Bluhm wrote:
> > Hi,
> > 
> > To run TCP input in parallel, each packet needs the socket lock.
> > As lock and unlock per packet kills throughput for single stream
> > TCP, diff below keeps the socket locked when consecutive packets
> > belong to the same socket.
> > 
> > I keep a list of TCP packets in struct softnet.  So each softnet
> > thread has its own list to accumulate packets.  After processing
> > all packets from an interface input queue, TCP input is called for
> > each TCP packet.  The socket lock is only unlocked and locked if
> > the TCP socket changes between the packets.  This solves the
> > performance issue.
> > 
> > To remember the softnet thread and packet offset, I use the cookie
> > field in the mbuf header.  Hopefully nothing else uses cookies in
> > this path.
> > 
> > Please test with various interfaces and pseudo devices.
> > Are there setups where my idea does not work?
> > How much faster does the TCP stack get?
> > Are there performance regressions?
> > 
> > Note that at most 4 CPUs are used for network.  You might want to
> > increase NET_TASKQ.  Also you need network interfaces that support
> > multiqueue.  ix, ixl, igc, bnxt, vio, vmx, ...
> > 
> > bluhm
> 
> Hi,
> 
> I've tested ix, bnxt and mcx on the same machine and it seems that mcx
> has performance problems with 1 tcp stream, with or without your diff.
> With this diff mcx gets a nice performance boost with more than 1 tcp
> stream, which is not the case without this diff...

The reason ix is much better is that it supports large receive
offload (LRO).  This effect is seen best with a single stream.

I guess the mcx driver is less than optimal.
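
To recap the idea from my original mail: while a softnet thread drains
an interface input queue, TCP packets are put on a per softnet thread
list instead of going through tcp_input() immediately; the softnet
index and the header offset travel in the mbuf header cookie field
(ph_cookie).  Afterwards the list is drained roughly like this (a
simplified sketch with made up helper and member names, not the actual
diff, and with error handling omitted):

	struct socket *so = NULL;
	struct mbuf *m;

	while ((m = ml_dequeue(&sn->sn_tcp_ml)) != NULL) {
		/* look up the socket this segment belongs to */
		struct socket *nso = tcp_segment_to_socket(m);

		/* only cycle the socket lock when the socket changes */
		if (nso != so) {
			if (so != NULL)
				sounlock(so);
			so = nso;
			if (so != NULL)
				solock(so);
		}
		tcp_input_locked(m, so);
	}
	if (so != NULL)
		sounlock(so);

This way a burst of consecutive segments for a single connection pays
for the socket lock only once.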

> NET_TASKQ=8
> 
> 1 tcp stream
> ix - 9.38 Gbits/sec
> mcx - 4.21 Gbits/sec
> bnxt - 6.95 Gbits/sec
> 
> 8 tcp streams
> ix - 9.41 Gbits/sec
> mcx - 5.15 Gbits/sec
> bnxt - 6.68 Gbits/sec
>
> with diff
> 1 tcp stream
> ix - 9.38 Gbits/sec
> mcx - 4.20 Gbits/sec
> bnxt - 7.88 Gbits/sec
> 
> 8 tcp streams
> ix - 9.36 Gbits/sec
> mcx - 9.21 Gbits/sec
> bnxt - 9.34 Gbits/sec
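
By the way, NET_TASKQ is a compile time define (in sys/net/if.c, if I
remember correctly), so running with NET_TASKQ=8 as above means
building a kernel with something like:

	#define NET_TASKQ	8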
> 
> 
> Spreading interrupts over queues is done by mixing source and
> destination IP, not src or dst tcp/udp ports?  Is that right?  If so,
> to test this properly I need to fire up some virtual hosts?

The Intel cards that I am using seem to spread over IP when using
UDP.  But with TCP they also spread over source ports, so I can use a
single tcpbench -s with tcpbench -n 10 to see an effect.  Running
systat together with the tests shows the interrupt distribution
nicely.
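
Concretely that is roughly the following on the two test machines
(hostname is a placeholder; systat's default view shows the interrupt
counters per queue):

	server# tcpbench -s
	client# tcpbench -n 10 <server>
	server# systat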

Thanks for testing.

bluhm