
From: Ingo Schwarze <schwarze@usta.de>
Subject: Re: timing-dependent(?) display in xterm(1)
To: Walter Alejandro Iglesias <wai@roquesor.com>
Cc: tech@openbsd.org
Date: Sat, 10 May 2025 23:12:49 +0200

Hello Walter,

Walter Alejandro Iglesias wrote on Sat, May 10, 2025 at 08:46:42PM +0200:
> On Sat, May 10, 2025 at 04:19:06PM +0200, Ingo Schwarze wrote:

>> while working on VI command line editing mode in ksh(1), i stumbled
>> over the following, which i believe might possibly be a quirk
>> in xterm(1).  I'm not yet sure what is going on, hence the question
>> mark after the word "timing-dependent".
>> [...]

> I've been reading this:
>   https://invisible-island.net/xterm/bad-utf8/

Thank you, that is interesting.

It does not talk about my question, though (xterm(1) output being
inconsistent with itself).  To the contrary, that document appears
to implicitly support my assumption that every terminal should
represent every sequence of input bytes in some well-defined way,
even though many different ways to handle sequences that are invalid
in UTF-8 exist.  But the implicit assumption seems to be that every
terminal should pick one way, not change its ways depending on the
weather.
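
To illustrate what "pick one way" can look like, here is a minimal
Python sketch, not taken from xterm(1) or from the linked document.
It decodes the same invalid byte sequence under two of Python's
built-in error policies; the helper name render() is made up for
this example.  The two policies disagree with each other, but each
one gives the same answer every time.

```python
def render(data: bytes, policy: str) -> str:
    """Decode bytes as UTF-8, handling invalid bytes per the given policy."""
    return data.decode("utf-8", errors=policy)


sample = b"\x9ax"  # lone continuation byte followed by ASCII 'x'

# Policy 1: substitute U+FFFD REPLACEMENT CHARACTER for each invalid byte.
print(repr(render(sample, "replace")))

# Policy 2: show each invalid byte as a visible \xNN escape.
print(repr(render(sample, "backslashreplace")))
```

Either policy would be a defensible choice for a terminal; the point
is only that the result depends on the input bytes and nothing else.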

> And downloaded this file:
>   https://invisible-island.net/xterm/bad-utf8/UTF-8-test-20150828.txt

Right, that is somewhat similar to the mandoc UTF-8 test suite - except
that it's not a test suite.

> In many cases, under xterm, UTF-8 continuation bytes followed by an
> ASCII character do weird things.  The most weird case is:
> 
>   $ printf "\x9ax\n"

   $ printf "\x9ax\n" 
  x
  ^[[?1;2c $ 1;2c

WOW.  That is horrific.  Note the "1;2c" ends up in the editable area
for the next command line, so if i press ENTER again, i get

  ksh: 1: not found
  ksh: 2c: not found
   $ 

That's not only utterly broken but maybe even a security risk,
in particular considering that the printed digit-punctuation-salad
looks suspiciously similar to ANSI escape sequences - and then we
have this in /usr/src/gnu/usr.bin/perl/lib/unicore/UnicodeData.txt:

  009A;<control>;Cc;0;BN;;;;;N;SINGLE CHARACTER INTRODUCER;;;;
  009B;<control>;Cc;0;BN;;;;;N;CONTROL SEQUENCE INTRODUCER;;;;

So i suspect that xterm(1) incorrectly reinterprets the invalid
byte 0x9a as U+009A (which would be c2 9a in UTF-8, not a lone 9a)
and then proceeds to use that incorrectly translated character
as a C1 control character.  That is all the more outrageous because
IIRC, xterm(1) promises to never interpret C1 controls in UTF-8 mode.
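
The byte-level part of that claim can be checked independently of any
terminal.  The following Python sketch is an illustration only: the
variable names are made up, and decoding as Latin-1 is merely one way
to model the suspected mistranslation, not a statement about xterm(1)
internals.

```python
import unicodedata

lone = b"\x9a"

# A lone 0x9a is a continuation byte without a lead byte: invalid UTF-8.
try:
    lone.decode("utf-8")
    lone_is_valid = True
except UnicodeDecodeError:
    lone_is_valid = False

# The correct UTF-8 encoding of U+009A is the two-byte sequence c2 9a.
encoded = "\u009a".encode("utf-8")

# Assumption for illustration: treating the raw byte value as a code
# point (which is what Latin-1 decoding does) yields exactly U+009A.
mistranslated = lone.decode("latin-1")

# General category of U+009A; "Cc" means control character.
category = unicodedata.category(mistranslated)

print(lone_is_valid, encoded.hex(), hex(ord(mistranslated)), category)
```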

I believe i tested that this was actually the case some years ago.
It appears this got broken, and in our default configuration,
xterm(1) now not only interprets C1 sequences from UTF-8 - which
is a very unsafe practice in itself - but even *generates* such
sequences from invalid input in invalid ways.

What this boils down to is that at least one security feature i
believed we had in our xterm(1) is no longer effective.  At least
that answers my question: re-securing xterm(1) is clearly more
important than improving UTF-8 support in ksh(1).

I cannot say yet how this happened, but *if* this was caused by
an update to xterm(1) from upstream, then that means trusting
upstream with xterm(1) is not a good idea.

Yours,
  Ingo