From: Job Snijders <job@openbsd.org>
Subject: Add sysctl to disable Nagle's algorithm (RFC 896 - Congestion Control)
To: tech@openbsd.org
Date: Mon, 13 May 2024 18:41:55 +0000

Dear all,

Back in the early 1980s, a suggestion was put forward on how to improve
TCP congestion control; it became known as "Nagle's algorithm". See
RFC 896.

Nagle's algorithm can cause consecutive small packets from userland
applications to be coalesced into a single TCP packet. This happens at
the cost of an increase in latency: the sender is locally queuing up
data until it either receives an acknowledgement from the remote side or
sufficient additional data has piled up to send a full-sized segment.
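
The send-or-queue decision described above can be sketched in a few lines
of C. This is a simplified illustration only: the struct, its fields and
the function name are hypothetical, and it is not the logic found in
OpenBSD's tcp_output().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Simplified sketch of the sender-side decision made by Nagle's
 * algorithm.  All names here are hypothetical.
 */
struct nagle_state {
	size_t	queued;		/* bytes waiting to be sent */
	size_t	unacked;	/* bytes sent but not yet acknowledged */
	size_t	mss;		/* maximum segment size */
	bool	nodelay;	/* TCP_NODELAY set on this connection */
};

/* Return true if a segment may be sent now, false if data stays queued. */
static bool
nagle_may_send(const struct nagle_state *s)
{
	assert(s->mss > 0);

	if (s->queued == 0)
		return false;	/* nothing to send */
	if (s->nodelay)
		return true;	/* Nagle disabled: send immediately */
	if (s->queued >= s->mss)
		return true;	/* a full-sized segment is ready */
	if (s->unacked == 0)
		return true;	/* nothing in flight: no reason to wait */
	return false;		/* small write with data in flight: wait */
}
```

The last branch is where the latency comes from: a small write stays
queued until the outstanding data has been acknowledged.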

This approach might have been advantageous 40 - 50 years ago, when
multiple users were concurrently working behind 1200 baud lines. Nagle's
algorithm discourages sending tiny segments when the data to be sent
increases in small increments. The trade-off is "sacrificing a
degree of interactivity" in exchange for "increased throughput".

Recently, the applicability and usefulness of Nagle's algorithm in
our times have come into question. Nagle's algorithm interacts negatively
with Delayed Acks (RFC 813), as per Nagle himself:
https://news.ycombinator.com/item?id=10608356 and a more complete
description: https://datatracker.ietf.org/doc/html/draft-minshall-nagle

But some argue "Given the vast amount of work a modern server can do in
even a few hundred microseconds, delaying sending data for even one RTT
isn’t clearly a win." https://brooker.co.za/blog/2024/05/09/nagle.html

In base, various applications have taken it upon themselves to disable
Nagle's algorithm: ssh, httpd, iscsid, relayd, bgpd, and unwind. Bluhm
and I are not aware of applications that explicitly enable Nagle.

The standards say in RFC 9293 section 3.7.4: "A TCP implementation
SHOULD implement the Nagle algorithm to coalesce short segments.
However, there MUST be a way for an application to disable the Nagle
algorithm on an individual connection."

So, why not take it a step further and allow for the algorithm to be
disabled on the whole system? :-)

The changeset below introduces sysctl net.inet.tcp.nodelay, which, if set
to 1, will simply cause TCP_NODELAY to be set on all newly created TCP
sockets.

Note that with net.inet.tcp.nodelay set to 1, applications can still
inspect and disable TCP_NODELAY using getsockopt() and setsockopt().
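
As a minimal sketch of what that looks like from userland, the two
helpers below wrap the standard getsockopt()/setsockopt() calls for
TCP_NODELAY. The helper names are hypothetical and used only for
illustration; error handling is left to the caller.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <assert.h>
#include <stddef.h>

/* Helper names are hypothetical, for illustration only. */

/*
 * Store 1 in *enabled if TCP_NODELAY is set on fd, otherwise 0.
 * Returns 0 on success, -1 on error with errno set.
 */
static int
get_tcp_nodelay(int fd, int *enabled)
{
	int val = 0;
	socklen_t len = sizeof(val);

	assert(enabled != NULL);
	if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &val, &len) == -1)
		return -1;
	*enabled = (val != 0);
	return 0;
}

/*
 * on = 1 disables Nagle's algorithm on fd, on = 0 re-enables it.
 * Returns 0 on success, -1 on error with errno set.
 */
static int
set_tcp_nodelay(int fd, int on)
{
	return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
}
```

With the proposed sysctl set to 1, a freshly created TCP socket would be
expected to report the option as set, and an application that wants
Nagle's algorithm back could call set_tcp_nodelay(fd, 0) on it.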

Perhaps in the future - after more study & contemplation - we'll want to
change this sysctl's default from 0 to 1?

Kind regards,

Job


Index: lib/libc/sys/sysctl.2
===================================================================
RCS file: /cvs/src/lib/libc/sys/sysctl.2,v
diff -u -p -r1.57 sysctl.2
--- lib/libc/sys/sysctl.2	14 Oct 2023 19:02:16 -0000	1.57
+++ lib/libc/sys/sysctl.2	13 May 2024 17:52:55 -0000
@@ -1774,6 +1774,13 @@ was set on all TCP sockets.
 .It Li tcp.mssdflt Pq Va net.inet.tcp.mssdflt
 The maximum segment size that is used as default for non-local connections.
 The default value is 512.
+.It Li tcp.nodelay Pq Va net.inet.tcp.nodelay
+If set to 1, act as if the option
+.Dv TCP_NODELAY
+was set on all TCP sockets; see
+.Xr tcp 4
+for more information.
+The default value is 0.
 .It Li tcp.reasslimit Pq Va net.inet.tcp.reasslimit
 The maximum number of out-of-order TCP
 segments the system will store for reassembly.
Index: sys/netinet/tcp_subr.c
===================================================================
RCS file: /cvs/src/sys/netinet/tcp_subr.c,v
diff -u -p -r1.201 tcp_subr.c
--- sys/netinet/tcp_subr.c	17 Apr 2024 20:48:51 -0000	1.201
+++ sys/netinet/tcp_subr.c	13 May 2024 17:52:55 -0000
@@ -121,6 +121,7 @@ int	tcp_do_ecn = 0;		/* RFC3168 ECN enab
 #endif
 int	tcp_do_rfc3390 = 2;	/* Increase TCP's Initial Window to 10*mss */
 int	tcp_do_tso = 1;		/* TCP segmentation offload for output */
+int	tcp_do_nodelay = 0;	/* disable RFC 896 Congestion Control? */
 
 #ifndef TCB_INITIAL_HASH_SIZE
 #define	TCB_INITIAL_HASH_SIZE	128
@@ -445,6 +446,8 @@ tcp_newtcpcb(struct inpcb *inp, int wait
 
 	tp->sack_enable = tcp_do_sack;
 	tp->t_flags = tcp_do_rfc1323 ? (TF_REQ_SCALE|TF_REQ_TSTMP) : 0;
+	if (tcp_do_nodelay)
+		tp->t_flags |= TF_NODELAY;
 	tp->t_inpcb = inp;
 	/*
 	 * Init srtt to TCPTV_SRTTBASE (0), so we can tell that we have no
Index: sys/netinet/tcp_usrreq.c
===================================================================
RCS file: /cvs/src/sys/netinet/tcp_usrreq.c,v
diff -u -p -r1.231 tcp_usrreq.c
--- sys/netinet/tcp_usrreq.c	12 Apr 2024 16:07:09 -0000	1.231
+++ sys/netinet/tcp_usrreq.c	13 May 2024 17:52:55 -0000
@@ -168,6 +168,7 @@ const struct sysctl_bounded_args tcpctl_
 	{ TCPCTL_RFC3390, &tcp_do_rfc3390, 0, 2 },
 	{ TCPCTL_ALWAYS_KEEPALIVE, &tcp_always_keepalive, 0, 1 },
 	{ TCPCTL_TSO, &tcp_do_tso, 0, 1 },
+	{ TCPCTL_NODELAY, &tcp_do_nodelay, 0, 1 },
 };
 
 struct	inpcbtable tcbtable;
Index: sys/netinet/tcp_var.h
===================================================================
RCS file: /cvs/src/sys/netinet/tcp_var.h,v
diff -u -p -r1.178 tcp_var.h
--- sys/netinet/tcp_var.h	13 May 2024 01:15:53 -0000	1.178
+++ sys/netinet/tcp_var.h	13 May 2024 17:52:55 -0000
@@ -483,7 +483,8 @@ struct	tcpstat {
 #define TCPCTL_ROOTONLY	       24 /* return root only port bitmap */
 #define	TCPCTL_SYN_HASH_SIZE   25 /* number of buckets in the hash */
 #define	TCPCTL_TSO	       26 /* enable TCP segmentation offload */
-#define	TCPCTL_MAXID	       27
+#define	TCPCTL_NODELAY	       27 /* disable RFC 896 Congestion Control */
+#define	TCPCTL_MAXID	       28
 
 #define	TCPCTL_NAMES { \
 	{ 0, 0 }, \
@@ -513,6 +514,7 @@ struct	tcpstat {
 	{ "rootonly",	CTLTYPE_STRUCT }, \
 	{ "synhashsize",	CTLTYPE_INT }, \
 	{ "tso",	CTLTYPE_INT }, \
+	{ "nodelay",	CTLTYPE_INT }, \
 }
 
 struct tcp_ident_mapping {
@@ -688,6 +690,7 @@ extern	int tcp_sackhole_limit;	/* max en
 extern	int tcp_do_ecn;		/* RFC3168 ECN enabled/disabled? */
 extern	int tcp_do_rfc3390;	/* RFC3390 Increasing TCP's Initial Window */
 extern	int tcp_do_tso;		/* enable TSO for TCP output packets */
+extern	int tcp_do_nodelay;	/* Nagle's algorithm enabled/disabled? */
 
 extern	struct pool tcpqe_pool;
 extern	int tcp_reass_limit;	/* max entries for tcp reass queues */