From: Job Snijders
Subject: Add sysctl to disable Nagle's algorithm (RFC 896 - Congestion Control)
To: tech@openbsd.org
Date: Mon, 13 May 2024 18:41:55 +0000

Dear all,

Back in the early 1980s, a suggestion, now known as "Nagle's algorithm",
was put forward on how to improve TCP congestion control. See RFC 896.

Nagle's algorithm can cause consecutive small packets from userland
applications to be coalesced into a single TCP packet. This happens at
the cost of an increase in latency: the sender queues data locally until
it either receives an acknowledgement from the remote side or sufficient
additional data has piled up to send a full-sized segment.

This approach might have been advantageous 40 - 50 years ago, when
multiple users were concurrently working behind 1200 baud lines. Nagle's
algorithm discourages sending tiny segments when the data to be sent
increases in small increments. The trade-off is "sacrificing a degree of
interactivity" in exchange for "increased throughput".

Recently, the applicability and usefulness of Nagle's algorithm in our
times has come into question.

Nagle's algorithm interacts poorly with delayed ACKs (RFC 813), as per
Nagle himself: https://news.ycombinator.com/item?id=10608356
and a more complete description:
https://datatracker.ietf.org/doc/html/draft-minshall-nagle

Some also argue "Given the vast amount of work a modern server can do in
even a few hundred microseconds, delaying sending data for even one RTT
isn’t clearly a win." https://brooker.co.za/blog/2024/05/09/nagle.html

In base, various applications have taken it upon themselves to disable
Nagle's algorithm: ssh, httpd, iscsid, relayd, bgpd, and unwind. Bluhm
and I are not aware of applications that explicitly enable Nagle.

The standards say in RFC 9293 section 3.7.4:

    "A TCP implementation SHOULD implement the Nagle algorithm to
    coalesce short segments. However, there MUST be a way for an
    application to disable the Nagle algorithm on an individual
    connection."

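For reference, the per-connection mechanism that the RFC 9293 text above
requires is the TCP_NODELAY socket option. The following is only a minimal
sketch of how an application might use it; set_nodelay() is a hypothetical
helper name chosen for illustration and is not taken from any of the
daemons listed above.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/*
 * set_nodelay() is a hypothetical helper, not an existing API.
 * on = 1 sets TCP_NODELAY (Nagle's algorithm off for this socket),
 * on = 0 clears it (Nagle's algorithm on for this socket).
 * Returns 0 on success, -1 with errno set on failure.
 */
int
set_nodelay(int fd, int on)
{
	return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
}
```

A caller would typically invoke set_nodelay(fd, 1) right after socket(2)
or accept(2) and before the first write.
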
So, why not take it a step further and allow for the algorithm to be
disabled on the whole system? :-)

The below changeset introduces sysctl net.inet.tcp.nodelay, which if set
to 1 will simply cause TCP_NODELAY to be set on all TCP sockets.

Note that with net.inet.tcp.nodelay set to 1, applications can still
inspect and disable TCP_NODELAY using getsockopt() and setsockopt().

Perhaps in the future - after more study & contemplation - we'll change
this sysctl's default from 0 to 1?

Kind regards,

Job

Index: lib/libc/sys/sysctl.2
===================================================================
RCS file: /cvs/src/lib/libc/sys/sysctl.2,v
diff -u -p -r1.57 sysctl.2
--- lib/libc/sys/sysctl.2 14 Oct 2023 19:02:16 -0000 1.57
+++ lib/libc/sys/sysctl.2 13 May 2024 17:52:55 -0000
@@ -1774,6 +1774,13 @@ was set on all TCP sockets.
 .It Li tcp.mssdflt Pq Va net.inet.tcp.mssdflt
 The maximum segment size that is used as default for non-local connections.
 The default value is 512.
+.It Li tcp.nodelay Pq Va net.inet.tcp.nodelay
+If set to 1, act as if the option
+.Dv TCP_NODELAY
+was set on all TCP sockets; see
+.Xr tcp 4
+for more information.
+The default value is 0.
 .It Li tcp.reasslimit Pq Va net.inet.tcp.reasslimit
 The maximum number of out-of-order TCP segments the system will store for
 reassembly.
Index: sys/netinet/tcp_subr.c
===================================================================
RCS file: /cvs/src/sys/netinet/tcp_subr.c,v
diff -u -p -r1.201 tcp_subr.c
--- sys/netinet/tcp_subr.c 17 Apr 2024 20:48:51 -0000 1.201
+++ sys/netinet/tcp_subr.c 13 May 2024 17:52:55 -0000
@@ -121,6 +121,7 @@ int tcp_do_ecn = 0; /* RFC3168 ECN enab
 #endif
 int tcp_do_rfc3390 = 2; /* Increase TCP's Initial Window to 10*mss */
 int tcp_do_tso = 1; /* TCP segmentation offload for output */
+int tcp_do_nodelay = 0; /* disable RFC 896 Congestion Control? */
 
 #ifndef TCB_INITIAL_HASH_SIZE
 #define TCB_INITIAL_HASH_SIZE 128
@@ -445,6 +446,8 @@ tcp_newtcpcb(struct inpcb *inp, int wait
 
 	tp->sack_enable = tcp_do_sack;
 	tp->t_flags = tcp_do_rfc1323 ? (TF_REQ_SCALE|TF_REQ_TSTMP) : 0;
+	if (tcp_do_nodelay)
+		tp->t_flags |= TF_NODELAY;
 	tp->t_inpcb = inp;
 	/*
 	 * Init srtt to TCPTV_SRTTBASE (0), so we can tell that we have no
Index: sys/netinet/tcp_usrreq.c
===================================================================
RCS file: /cvs/src/sys/netinet/tcp_usrreq.c,v
diff -u -p -r1.231 tcp_usrreq.c
--- sys/netinet/tcp_usrreq.c 12 Apr 2024 16:07:09 -0000 1.231
+++ sys/netinet/tcp_usrreq.c 13 May 2024 17:52:55 -0000
@@ -168,6 +168,7 @@ const struct sysctl_bounded_args tcpctl_
 	{ TCPCTL_RFC3390, &tcp_do_rfc3390, 0, 2 },
 	{ TCPCTL_ALWAYS_KEEPALIVE, &tcp_always_keepalive, 0, 1 },
 	{ TCPCTL_TSO, &tcp_do_tso, 0, 1 },
+	{ TCPCTL_NODELAY, &tcp_do_nodelay, 0, 1 },
 };
 
 struct inpcbtable tcbtable;
Index: sys/netinet/tcp_var.h
===================================================================
RCS file: /cvs/src/sys/netinet/tcp_var.h,v
diff -u -p -r1.178 tcp_var.h
--- sys/netinet/tcp_var.h 13 May 2024 01:15:53 -0000 1.178
+++ sys/netinet/tcp_var.h 13 May 2024 17:52:55 -0000
@@ -483,7 +483,8 @@ struct tcpstat {
 #define TCPCTL_ROOTONLY 24 /* return root only port bitmap */
 #define TCPCTL_SYN_HASH_SIZE 25 /* number of buckets in the hash */
 #define TCPCTL_TSO 26 /* enable TCP segmentation offload */
-#define TCPCTL_MAXID 27
+#define TCPCTL_NODELAY 27 /* disable RFC 896 Congestion Control */
+#define TCPCTL_MAXID 28
 
 #define TCPCTL_NAMES { \
 	{ 0, 0 }, \
@@ -513,6 +514,7 @@ struct tcpstat {
 	{ "rootonly", CTLTYPE_STRUCT }, \
 	{ "synhashsize", CTLTYPE_INT }, \
 	{ "tso", CTLTYPE_INT }, \
+	{ "nodelay", CTLTYPE_INT }, \
 }
 
 struct tcp_ident_mapping {
@@ -688,6 +690,7 @@ extern int tcp_sackhole_limit; /* max en
 extern int tcp_do_ecn; /* RFC3168 ECN enabled/disabled? */
 extern int tcp_do_rfc3390; /* RFC3390 Increasing TCP's Initial Window */
 extern int tcp_do_tso; /* enable TSO for TCP output packets */
+extern int tcp_do_nodelay; /* Nagle's algorithm enabled/disabled? */
 extern struct pool tcpqe_pool;
 extern int tcp_reass_limit; /* max entries for tcp reass queues */
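
The note in the message that applications can still inspect and disable
TCP_NODELAY while net.inet.tcp.nodelay is 1 could look roughly like this
from userland. This is a sketch under the assumption that the patch
behaves as described; get_nodelay() and enable_nagle() are hypothetical
helper names, not part of the diff.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/*
 * Hypothetical helper: report the current TCP_NODELAY state of fd.
 * Returns 1 if set, 0 if clear, -1 on error (errno set by getsockopt).
 */
int
get_nodelay(int fd)
{
	int v = 0;
	socklen_t len = sizeof(v);

	if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &v, &len) == -1)
		return -1;
	return v != 0;
}

/*
 * Hypothetical helper: clear TCP_NODELAY so that this one socket uses
 * Nagle's algorithm again, whatever the initial state of the flag was.
 */
int
enable_nagle(int fd)
{
	int off = 0;

	return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &off, sizeof(off));
}
```

On a fresh TCP socket, get_nodelay(fd) would be expected to mirror the
sysctl setting, and it should return 0 after enable_nagle(fd) succeeds.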