parallel TCP input
On Fri, Feb 07, 2025 at 10:31:01PM +0100, Hrvoje Popovski wrote:
> On 5.2.2025. 19:25, Alexander Bluhm wrote:
> > Hi,
> >
> > To run TCP input in parallel, each packet needs the socket lock.
> > As lock and unlock per packet kills throughput for single stream
> > TCP, the diff below keeps the socket locked while consecutive packets
> > belong to the same socket.
> >
> > I keep a list of TCP packets in struct softnet.  So each softnet
> > thread has its own list to accumulate packets.  After processing
> > all packets from an interface input queue, TCP input is called for
> > each TCP packet.  The socket lock is only unlocked and locked if
> > the TCP socket changes between the packets.  This solves the
> > performance issue.
> >
> > To remember the softnet thread and packet offset, I use the cookie
> > field in the mbuf header.  Hopefully nothing else uses cookies in
> > this path.
> >
> > Please test with various interfaces and pseudo devices.
> > Are there setups where my idea does not work?
> > How much faster does the TCP stack get?
> > Are there performance regressions?
> >
> > Note that at most 4 CPUs are used for the network.  You might want to
> > increase NET_TASKQ.  You also need network interfaces that support
> > multiqueue: ix, ixl, igc, bnxt, vio, vmx, ...
> >
> > bluhm
>
> Hi,
>
> I've tested ix, bnxt and mcx on the same machine, and it seems that mcx
> has performance problems with 1 TCP stream, with or without your diff.
> With this diff mcx gets a nice performance boost with more than 1 TCP
> stream, which is not the case without this diff...

The reason ix is so much better is that it supports large receive
offload.  This effect is seen best with a single stream.  I guess the
mcx driver is less than optimal.
> NET_TASKQ=8
>
> 1 tcp stream
> ix   - 9.38 Gbits/sec
> mcx  - 4.21 Gbits/sec
> bnxt - 6.95 Gbits/sec
>
> 8 tcp streams
> ix   - 9.41 Gbits/sec
> mcx  - 5.15 Gbits/sec
> bnxt - 6.68 Gbits/sec
>
> with diff
> 1 tcp stream
> ix   - 9.38 Gbits/sec
> mcx  - 4.20 Gbits/sec
> bnxt - 7.88 Gbits/sec
>
> 8 tcp streams
> ix   - 9.36 Gbits/sec
> mcx  - 9.21 Gbits/sec
> bnxt - 9.34 Gbits/sec
>
> Spreading interrupts over queues is done by mixing source and
> destination IP, not src or dst tcp/udp ports?  Is that right?  If so,
> to test this properly do I need to fire up some virtual hosts?

The Intel cards that I am using seem to spread over IP when using UDP.
But with TCP they also spread over source ports, so I can use a single
tcpbench -s with tcpbench -n 10 to see an effect.  Running systat
together with the tests shows the interrupt distribution nicely.

Thanks for testing.

bluhm