From: Henry Ford <henryfordkjv@gmail.com>
Subject: Re: timing-dependent(?) display in xterm(1)
To: Ingo Schwarze <schwarze@usta.de>
Cc: tech@openbsd.org
Date: Wed, 14 May 2025 14:25:06 -0400

On Mon, May 12, 2025 at 09:40:35 AM -0400, Walter Alejandro Iglesias <wai@roquesor.com> wrote:
> On Sat, May 10, 2025 at 11:12:49PM +0200, Ingo Schwarze wrote:
> > I cannot say yet how this happened, but *if* this was caused by
> > an update to xterm(1) from upstream, then that means trusting
> > upstream with xterm(1) is not a good idea.
> 
> The bug was introduced (intentionally?) in xterm 391:
> 
>   https://invisible-island.net/xterm/xterm.log.html#xterm_391
> 
> I've downloaded, compiled and tested 390 and 391 versions.  I could
> reproduce the bug with 391 version, xterm-390 works as expected.  Test
> with xterm-390:
> 
>   $ printf '\x9bc\n'
>   �c
> 
> 
> -- 
> Walter
 
This behaviour is caused by the following code in xterm(1), which passes
a byte in the range 0x80-0xbf through unchanged if it is immediately
followed by a byte in the range 0x21-0x7f (and no UTF-8 sequence has
already been started).
Such a byte is then processed as normal, i.e. if it is a C1 control
character it is acted on.
This only works if the next byte is already available to xterm(1) in
the same input buffer; xterm(1) will not wait for more input to arrive,
and substitutes the replacement character instead.

(from xenocara app/xterm/ptydata.c r1.161 line 98):
>	} else if (c < 0xc0) {
>	    /* We received a continuation byte */
>	    if (utf_count < 1) {
>		if (screen->c1_printable) {
>		    data->utf_data = (IChar) c;
>		} else if ((i + 1) < length
>			   && data->next[i + 1] > 0x20
>			   && data->next[i + 1] < 0x80) {
>		    /*
>		     * Allow for C1 control string if the next byte is
>		     * available for inspection.
>		     */
>		    data->utf_data = (IChar) c;
>		} else {
>		    /*
>		     * We received a continuation byte before receiving a
>		     * sequence state, or a failed attempt to use a C1 control
>		     * string.
>		     */
>		    data->utf_data = (IChar) UCS_REPL;
>		}
>		data->utf_size = (i + 1);
>		break;
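
To make the dependence on buffer contents concrete, here is a minimal
standalone model of the check quoted above.  It is only a sketch, not
the xterm code: the function name model_stray_byte and its parameters
are made up for illustration, and the value 0xfffd for UCS_REPL is an
assumption (U+FFFD, the Unicode replacement character).

```c
#include <stddef.h>

/* Assumed value: U+FFFD, the Unicode replacement character. */
#define UCS_REPL 0xfffdu

/*
 * Hypothetical model of the branch quoted above.  buf[i] is a stray
 * byte in the range 0x80-0xbf (no UTF-8 sequence in progress) and len
 * is the number of bytes currently in the input buffer.  Returns the
 * value that would be stored in utf_data.
 */
static unsigned
model_stray_byte(const unsigned char *buf, size_t len, size_t i,
    int c1_printable)
{
	unsigned c = buf[i];

	if (c1_printable)
		return c;
	/* Pass the byte through only if a following byte is already here. */
	if (i + 1 < len && buf[i + 1] > 0x20 && buf[i + 1] < 0x80)
		return c;
	return UCS_REPL;
}
```

With the bytes 0x9b 'c' '\n' all in one buffer (len 3), the model
returns 0x9b for the first byte.  If only the 0x9b has arrived so far
(len 1), it returns the replacement character, even though the byte
stream is the same.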
 
This was changed in xterm snapshot 390d, the diff for which is
available at: https://github.com/ThomasDickey/xterm-snapshots/commit/69daabcca67cfc021973ae910ec2020ef32d14cf

The relevant portion of the diff is reproduced below.
If this diff is reverted inside the xterm directory of a xenocara
source tree, the behaviour returns to the way it was before xterm-391.
 
It seems like the intent here is to disambiguate invalid UTF-8 from
C1 control characters (and other data in the range 0x80-0xbf) by
checking if they are followed by a printable ASCII byte (the test
actually admits 0x21-0x7f).
Unfortunately this is implemented in such a way that the behaviour
depends on the timing with which the bytes are sent.
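
One way to exercise the timing dependence is to send the same three
bytes either in a single write(2) or with a pause after the first
byte.  The following is only a sketch: send_c1_sequence and write_all
are hypothetical helper names, and neither mode guarantees how
xterm(1) will see the bytes split across its reads of the pty; the
delay just makes a split very likely.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <errno.h>
#include <stddef.h>
#include <time.h>
#include <unistd.h>

/* Write all of buf to fd, retrying on short writes and EINTR. */
static int
write_all(int fd, const unsigned char *buf, size_t len)
{
	while (len > 0) {
		ssize_t n = write(fd, buf, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		buf += n;
		len -= (size_t)n;
	}
	return 0;
}

/*
 * Hypothetical helper: send 0x9b 'c' '\n' to fd.  With delay_ms <= 0
 * all three bytes go out in one write(2); otherwise the 0x9b is
 * written alone, followed by a pause, then the remaining bytes.
 * Returns 0 on success, -1 on write error.
 */
static int
send_c1_sequence(int fd, long delay_ms)
{
	static const unsigned char seq[] = { 0x9b, 'c', '\n' };
	struct timespec ts;

	if (delay_ms <= 0)
		return write_all(fd, seq, sizeof(seq));

	if (write_all(fd, seq, 1) < 0)
		return -1;
	ts.tv_sec = delay_ms / 1000;
	ts.tv_nsec = (delay_ms % 1000) * 1000000L;
	nanosleep(&ts, NULL);
	return write_all(fd, seq + 1, sizeof(seq) - 1);
}
```

Calling send_c1_sequence(STDOUT_FILENO, 0) and then
send_c1_sequence(STDOUT_FILENO, 500) from a program run inside
xterm(1) should make it easy to compare the two cases side by side.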

--- ptydata.c
+++ ptydata.c
@@ -98,13 +98,24 @@ decodeUtf8(TScreen *screen, PtyData *data)
 	} else if (c < 0xc0) {
 	    /* We received a continuation byte */
 	    if (utf_count < 1) {
-		/*
-		 * We received a continuation byte before receiving a sequence
-		 * state.  Or an attempt to use a C1 control string.  Either
-		 * way, it is mapped to the replacement character, unless
-		 * allowed by optional feature.
-		 */
-		data->utf_data = (IChar) (screen->c1_printable ? c : UCS_REPL);
+		if (screen->c1_printable) {
+		    data->utf_data = (IChar) c;
+		} else if ((i + 1) < length
+			   && data->next[i + 1] > 0x20
+			   && data->next[i + 1] < 0x80) {
+		    /*
+		     * Allow for C1 control string if the next byte is
+		     * available for inspection.
+		     */
+		    data->utf_data = (IChar) c;
+		} else {
+		    /*
+		     * We received a continuation byte before receiving a
+		     * sequence state, or a failed attempt to use a C1 control
+		     * string.
+		     */
+		    data->utf_data = (IChar) UCS_REPL;
+		}
 		data->utf_size = (i + 1);
 		break;
 	    } else if (screen->utf8_weblike