timing-dependent(?) display in xterm(1)
On Sat, May 10, 2025 at 11:12:49PM +0200, Ingo Schwarze wrote:
> Hello Walter,
>
> Walter Alejandro Iglesias wrote on Sat, May 10, 2025 at 08:46:42PM +0200:
> > On Sat, May 10, 2025 at 04:19:06PM +0200, Ingo Schwarze wrote:
>
> >> while working on VI command line editing mode in ksh(1), i stumbled
> >> over the following, which i believe might possibly be a quirk
> >> in xterm(1). I'm not yet sure what is going on, hence the question
> >> mark after the word "timing-dependent".
> >> [...]
>
> > I've been reading this:
> > https://invisible-island.net/xterm/bad-utf8/
>
> Thank you, that is interesting.
>
> It does not talk about my question,
I should've deleted the last sentence in the quoted text. I forgot that
you're more exigent than a C compiler. :-)
> though (xterm(1) output being
> inconsistent with itself). To the contrary, that document appears
> to implicitly support my assumption that every terminal should
> represent every sequence of input bytes in some well-defined way,
This is exactly what I meant by linking that document. It supports your
assumption[1]:
----------------
/* Combine UTF-8 into Unicode */
/* Incomplete characters silently ignored,
* should probably be better represented by U+fffc
* (replacement character).
*/
Actually, the proper Unicode replacement character is U+FFFD, but
that was a start. Shortly after, Kuhn provided a patch to use U+FFFD
(in xterm patch #99).
--------------------------------
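To see which byte sequences count as malformed in the first place, here
is a minimal sketch. It assumes an iconv(1) that exits non-zero on
invalid input (GNU libiconv and glibc behave that way); the helper name
is_utf8 is made up for illustration. Nothing is written to the terminal
except the two echo lines.

```shell
# Return success if stdin is well-formed UTF-8.
# (is_utf8 is an illustrative helper name, not an existing tool.)
is_utf8() {
	iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# U+009A encoded properly: the two bytes c2 9a (octal 302 232).
printf '\302\232' | is_utf8 && echo "c2 9a: well-formed"

# A lone continuation byte 9a followed by ASCII 'x'.
printf '\232x' | is_utf8 || echo "9a 78: malformed"
```

Octal escapes are used instead of \x because POSIX printf(1) only
guarantees the octal form.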
> even though many different ways to handle sequences that are invalid
> in UTF-8 exist. But the implicit assumption seems to be that every
> terminal should pick one way, not change its ways depending on the
> weather.
>
> > And downloaded this file:
> > https://invisible-island.net/xterm/bad-utf8/UTF-8-test-20150828.txt
>
> Right, that is somewhat similar to the mandoc UTF-8 test suite - except
> that it's not a test suite.
My intention was to kill two birds with one stone.
>
> > In many cases, under xterm, UTF-8 continuation bytes followed by an
> > ASCII character do weird things. The most weird case is:
> >
> > $ printf "\x9ax\n"
>
> $ printf "\x9ax\n"
> x
> ^[[?1;2c $ 1;2c
>
> WOW. That is horrific. Note the "1;2c" ends up in the editable area
> for the next command line, so if i press ENTER again, i get
>
> ksh: 1: not found
> ksh: 2c: not found
> $
>
> That's not only utterly broken but maybe even a security risk,
> in particular considering that the printed digit-punctuation-salad
> looks suspiciously similar to ANSI escape sequences - and then we
> have this in /usr/src/gnu/usr.bin/perl/lib/unicore/UnicodeData.txt:
>
> 009A;<control>;Cc;0;BN;;;;;N;SINGLE CHARACTER INTRODUCER;;;;
> 009B;<control>;Cc;0;BN;;;;;N;CONTROL SEQUENCE INTRODUCER;;;;
>
> So i suspect that xterm(1) incorrectly reinterprets the invalid
> byte 0x9a to U+009A (which would be c2 9a in UTF-8, not a lone 9a)
> and then proceeds to use that incorrectly translated character
> as a C1 control character. That is all the more outrageous because
> IIRC, xterm(1) promises to never interpret C1 controls in UTF-8 mode.
There you are. Let me confirm your suspicion, according to this:
https://en.wikipedia.org/wiki/C0_and_C1_control_codes
'\x80' is a "Padding Character". Let's see what happens with \x85 which
is defined as "Next Line":
$ printf '\x85x\n'
(empty line)
x
The \x9a is the "Single Character Introducer":
$ PS1='this is my prompt$ '
this is my prompt$ printf '\x9ax\n'
x
^[[?1;2cthis is my prompt$ ?1;2c
(I redefined the prompt first to let you see that it gets printed in
the middle of the output.)
The \x9b is the "Control Sequence Introducer"; you can get the same
output as above with:
$ printf '\x9bcx'
$ printf '\x9b0cx'
In other terminals you can get the same with the escape character:
$ printf '\033[c'
$ printf '\033[0c'
The sequence of numbers printed is the device attributes response:
https://vt100.net/docs/vt510-rm/DA1.html
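The difference between a lone 0x9a and a properly encoded U+009A can be
checked at the byte level without involving the terminal at all. This
is only a sketch: hexbytes is a helper name I made up, and the output
is piped through od(1) so that no control byte ever reaches xterm.

```shell
# Dump stdin as space-separated lowercase hex bytes.
# (hexbytes is an illustrative helper name.)
hexbytes() {
	od -An -tx1 | tr -s ' \n' ' ' | sed 's/^ //; s/ $//'
}

# A lone continuation byte 0x9a (octal 232) followed by 'x':
# not valid UTF-8 on its own.
lone=$(printf '\232x' | hexbytes)

# U+009A encoded in UTF-8 is the two bytes c2 9a (octal 302 232).
c1=$(printf '\302\232x' | hexbytes)

echo "lone byte: $lone"
echo "U+009A:    $c1"
```

Only the second form is a legitimate way to spell the C1 character in
UTF-8; the first is what the printf commands above actually send.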
>
> I believe i tested that this was actually the case some years ago.
> It appears this got broken, and in our default configuration,
> xterm(1) now not only interprets C1 sequences from UTF-8 - which
> is a very unsafe practice in itself - but even *generates* such
> sequences from invalid input in invalid ways.
By the way, this is the output under Linux (xterm and bash):
$ PS1='this is my prompt$ '
this is my prompt$ printf '\x9ax\n'
x
this is my prompt$ ?64;1;2;6;9;15;16;17;18;21;22;28c
>
> What this boils down to is that at least one security feature i
> believed we had in our xterm(1) is no longer effective. At least
> that answers my question: re-securing xterm(1) is clearly more
> important than improving UTF-8 support in ksh(1).
This will capture the attention of other specialized minds here. ;-)
>
> I cannot say yet how this happened, but *if* this was caused by
> an update to xterm(1) from upstream, then that means trusting
> upstream with xterm(1) is not a good idea.
[1] We could guess that xterm worked in the way explained in the
document I linked.
My lateral thinking is a bit helpful after all.
>
> Yours,
> Ingo
--
Walter