From: Omar Polo <op@omarpolo.com>
Subject: Re: [REPOST] ksh: utf8 full width character support for emacs.c
To: Ingo Schwarze <schwarze@usta.de>
Cc: tech@openbsd.org
Date: Thu, 03 Apr 2025 20:41:48 +0200

Ingo Schwarze <schwarze@usta.de> wrote:
> Hello Omar,
> 
> Omar Polo wrote on Sun, Mar 30, 2025 at 08:37:06PM +0200:
> 
> > grapheme clusters (i.e. what a user perceives as a "character")
> > can be more than one code point long.
> 
> That is (more or less) accurate - and ludicrously complicated.
> There is a long annex to the Unicode standard on this topic,
>   Unicode Text Segmentation, Unicode Standard Annex #29
>   https://www.unicode.org/reports/tr29/
> 
> Note that "grapheme clusters" are not the same as "user-perceived
> characters", and there are several different types of grapheme clusters
> (legacy, extended, tailored, ...).

oops, i fell into the trap of "someone is wrong on the internet" just to
spit out more wrong stuff myself, apologies ;-)

> [...]
> 
> Taking these three assumptions together allows the following
> vastly simplified modus operandi for line editing:
> 
>  A. Each width-1 and each width-2 code point starts a
>     user-perceived character.
>  B. Zero or more subsequent width-0 code points are part
>     of the same user-perceived character.
>  C. The display width of the user-perceived character
>     is the width of its first code point.
>  D. The cursor can only be placed on these user-perceived characters,
>     and each of them can only be edited as a whole.

fwiw i second that this might be enough for ksh.  it would fix so many of
the current issues with non-ascii text that what remains is more than
tolerable i guess.
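
to check my own understanding of rules A-C, here is a rough sketch of
how i read them.  It is not the actual emacs.c code: decode(),
toy_width() and next_char() are names made up for illustration,
toy_width() is a hypothetical stand-in for wcwidth(3) that only knows
a handful of ranges, and the decoder does not reject overlong forms or
surrogates.  The buffer stays plain bytes, in line with your point
about not switching to wchar_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for wcwidth(3); a few ranges only. */
static int
toy_width(uint32_t cp)
{
	if (cp >= 0x0300 && cp <= 0x036f)	/* combining diacritics */
		return 0;
	if ((cp >= 0x1100 && cp <= 0x115f) ||	/* Hangul Jamo leads */
	    (cp >= 0x3040 && cp <= 0x9fff) ||	/* roughly kana .. CJK */
	    (cp >= 0xac00 && cp <= 0xd7a3) ||	/* Hangul syllables */
	    (cp >= 0xff01 && cp <= 0xff60))	/* fullwidth forms */
		return 2;
	return 1;
}

/*
 * Decode one code point from s[0..len) and return the number of
 * bytes consumed.  Invalid or truncated input is consumed as a
 * single byte and reported as U+FFFD, so it still gets width 1.
 */
static size_t
decode(const unsigned char *s, size_t len, uint32_t *cp)
{
	size_t i, n;

	assert(len > 0);
	if (s[0] < 0x80) {
		*cp = s[0];
		return 1;
	} else if ((s[0] & 0xe0) == 0xc0) {
		n = 2;
		*cp = s[0] & 0x1f;
	} else if ((s[0] & 0xf0) == 0xe0) {
		n = 3;
		*cp = s[0] & 0x0f;
	} else if ((s[0] & 0xf8) == 0xf0) {
		n = 4;
		*cp = s[0] & 0x07;
	} else
		n = 0;			/* stray continuation or 0xf8..0xff */
	if (n == 0 || n > len) {
		*cp = 0xfffd;
		return 1;
	}
	for (i = 1; i < n; i++) {
		if ((s[i] & 0xc0) != 0x80) {
			*cp = 0xfffd;
			return 1;
		}
		*cp = (*cp << 6) | (s[i] & 0x3f);
	}
	return n;
}

/*
 * Return the length in bytes of the user-perceived character that
 * starts at s and store its display width in *width.
 */
static size_t
next_char(const unsigned char *s, size_t len, int *width)
{
	uint32_t cp;
	size_t n, used;

	if (len == 0) {
		*width = 0;
		return 0;
	}
	used = decode(s, len, &cp);
	*width = toy_width(cp);		/* rule C: width of first code point */
	while (used < len) {
		n = decode(s + used, len - used, &cp);
		if (toy_width(cp) != 0)
			break;		/* rule A: a new character starts */
		used += n;		/* rule B: width 0 joins this one */
	}
	return used;
}
```

Rule D would then amount to only ever moving the cursor (and deleting)
in steps of whatever next_char() returns.  A width-0 code point at the
very start of the buffer comes out as a character of width 0 here;
that corner would need a decision in a real implementation.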

> Doing this does *not* require switching from "char" to "wchar_t"
> globally throughout the shell.  Worse, such a switch would destroy
> some functionality that the shell now provides.  For example, try the
> following command:
> 
>    $ echo $(printf "\xff") | hexdump -C
>   0000000  ff 0a
>   0000002
> 
> The printf(1) command produces the byte 0xff.  The shell reads that byte
> and constructs the command "echo weird-byte" which prints the weird byte
> to standard output as desired.  If the shell internally stored its
> input in wchar_t, that would not be possible because that byte cannot
> be stored in wchar_t.  Depending on how we would handle such input,
> this might result in the shell dying from EILSEQ, or printing 0xbfc3
> instead of 0xff, or any number of other weird changes in behavior.

agreed; my favourite part is that you don't even need printf; just
iterating over file names is enough (since they're just bytestrings)
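
to make the "0xbfc3 instead of 0xff" case concrete, here is a small
sketch.  It assumes a hypothetical shell that widens every input byte
to a code point and re-encodes it as UTF-8 on output; utf8_encode()
and naive_roundtrip() are made-up helpers, not anything in ksh, and
the encoder does no validation (surrogates, values above 0x10ffff).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode cp as UTF-8, return the byte count. */
static size_t
utf8_encode(uint32_t cp, unsigned char out[4])
{
	if (cp < 0x80) {
		out[0] = cp;
		return 1;
	}
	if (cp < 0x800) {
		out[0] = 0xc0 | (cp >> 6);
		out[1] = 0x80 | (cp & 0x3f);
		return 2;
	}
	if (cp < 0x10000) {
		out[0] = 0xe0 | (cp >> 12);
		out[1] = 0x80 | ((cp >> 6) & 0x3f);
		out[2] = 0x80 | (cp & 0x3f);
		return 3;
	}
	out[0] = 0xf0 | (cp >> 18);
	out[1] = 0x80 | ((cp >> 12) & 0x3f);
	out[2] = 0x80 | ((cp >> 6) & 0x3f);
	out[3] = 0x80 | (cp & 0x3f);
	return 4;
}

/* Widen one input byte to a code point, then write it back out. */
static size_t
naive_roundtrip(unsigned char byte, unsigned char out[4])
{
	uint32_t wide = byte;		/* 0xff turns into U+00FF */

	return utf8_encode(wide, out);
}
```

with this, the single byte 0xff goes in and the two bytes c3 bf come
out, so a file name containing 0xff would no longer name the same
file.  Keeping the buffer as char avoids the round trip entirely.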

> [...]
> 
> Omar, note that i gave up on refuting what Christian Schulte is saying
> on a point-by-point basis because what he says is typically a mixture of
> half-truths, misunderstandings, misleading claims, and arguments at best
> tangentially related to the topic under discussion.  So unless you are
> in urgent need of wild goose feathers, consider hunting elsewhere...

yeah, as i was saying i just fell for an almost pointless correction.
But it led to your precise reply, which is always a pleasure to
read!

> But given that you have shown interest, i thought it might make sense
> to explain what is *actually* going on here.

i guess i have to put my code where my mouth is.  I like your outline
for handling text in ksh interactive mode; can't promise anything but
i'll give it an honest spin!

Thank you!