From: Ingo Schwarze <schwarze@usta.de>
Subject: Re: [REPOST] ksh: utf8 full width character support for emacs.c
To: Omar Polo <op@omarpolo.com>
Cc: tech@openbsd.org
Date: Mon, 31 Mar 2025 01:47:14 +0200

Hello Omar,

Omar Polo wrote on Sun, Mar 30, 2025 at 08:37:06PM +0200:

> grapheme clusters (i.e. what a user perceives as a "character")
> can be more than one code point long.

That is (more or less) accurate - and ludicrously complicated.
There is a long annex to the Unicode standard on this topic,
  Unicode Text Segmentation, Unicode Standard Annex #29
  https://www.unicode.org/reports/tr29/

Note that "grapheme clusters" are not the same as "user-perceived
characters", and there are several different types of grapheme clusters
(legacy, extended, tailored, ...).


On using wchar_t:

> which i'm not sure it actually fixes anything, given that a wchar_t is
> not a grapheme cluster.
> 
> ("anything" in the context of the thread, i.e. hitting the left arrow on
> the key and seeing the cursor skipping over a character rather than
> being stuck somewhere between the wchar_t LATIN SMALL LETTER A and the
> wchar_t COMBINING ACUTE ACCENT)

You are right that if all you do is switch from char to wchar_t, that
solves almost none of the problems the OP's patch aimed to tackle.

Then again, switching to wchar_t *and* using wcwidth(3), when taken
together, can be leveraged to solve *most* of the problem because
the following *vastly simplified* model more or less works in most
situations, at least much better than what we have now:

 1. Assume that user-perceived characters are identical to
    legacy grapheme clusters.
    (Of course, that assumption breaks down when dealing with
     user-perceived characters that are larger than grapheme clusters,
     but that is relatively rare, in particular in languages based on
     Latin, Greek or Cyrillic scripts.)
 2. Assume that all width-0 code points are combining characters
    and vice versa.
    (Of course, combining characters exist that are not width-0,
     and width-0 code points exist that are not combining, but
     the detrimental effects on line editing tend to be relatively
     rare and maybe also relatively mild.)
 3. Assume that users want to operate on user-perceived characters
    (e.g. move the cursor by one of them at a time, replace them
     one at a time, and so on)
    rather than wanting to edit *parts* of user-perceived characters
    (which does make sense with some scripts and languages, and for
     some languages it depends on the context and editing purpose
     whether it is desirable or not.)

Taking these three assumptions together allows the following
vastly simplified modus operandi for line editing:

 A. Each width-1 and each width-2 code point starts a
    user-perceived character.
 B. Zero or more subsequent width-0 code points are part
    of the same user-perceived character.
 C. The display width of the user-perceived character
    is the width of its first code point.
 D. The cursor can only be placed on these user-perceived characters,
    and each of them can only be edited as a whole.
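
As a rough illustration of rules A to D, the following C sketch walks
over an array holding one display width per code point.  The function
names (upc_end, upc_prev, upc_width) and the idea of precomputing the
widths into an int array are assumptions made for this example, not
code from ksh; in real code the widths would come from wcwidth(3).

```c
#include <stddef.h>

/*
 * Hypothetical helpers, not taken from ksh.  w[] holds one display
 * width per code point, as wcwidth(3) would report it (0, 1 or 2),
 * and n is the number of code points.
 */

/*
 * Rules A and B: starting at index pos, return the index just past
 * the user-perceived character that begins there, i.e. skip the
 * starting code point and all width-0 code points following it.
 */
size_t
upc_end(const int *w, size_t n, size_t pos)
{
	if (pos >= n)
		return n;
	pos++;
	while (pos < n && w[pos] == 0)
		pos++;
	return pos;
}

/*
 * Rule D, cursor left: return the index where the user-perceived
 * character preceding pos starts, stepping back over width-0
 * code points until a code point of non-zero width is found.
 */
size_t
upc_prev(const int *w, size_t n, size_t pos)
{
	if (pos > n)
		pos = n;
	while (pos > 0) {
		pos--;
		if (w[pos] != 0)
			break;
	}
	return pos;
}

/* Rule C: the display width is the width of the first code point. */
int
upc_width(const int *w, size_t n, size_t pos)
{
	return pos < n ? w[pos] : 0;
}
```

For "a" followed by COMBINING ACUTE ACCENT and "b", the widths are
1, 0, 1, so upc_end(w, 3, 0) returns 2 and upc_prev(w, 3, 2) returns 0:
the cursor skips the accent together with its base letter.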

Doing this does *not* require switching from "char" to "wchar_t"
globally throughout the shell.  Worse, such a switch would destroy
some functionality that the shell now provides.  For example, try the
following command:

   $ echo $(printf "\xff") | hexdump -C
  0000000  ff 0a
  0000002

The printf(1) command produces the byte 0xff.  The shell reads that byte
and constructs the command "echo weird-byte" which prints the weird byte
to standard output as desired.  If the shell stored its input internally
in wchar_t, that would not be possible because that byte cannot
be stored in wchar_t.  Depending on how we handled such input,
this might result in the shell dying from EILSEQ, or printing 0xbfc3
instead of 0xff, or any number of other weird changes in behavior.
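
To make the two failure modes concrete, here is a small C sketch.  It
assumes a UTF-8 locale and that the "0xbfc3" case arises from treating
the byte as code point U+00FF and re-encoding it; the helper names are
made up for this example.  The byte 0xff can never occur in well-formed
UTF-8, which is why a strict decoder would fail with EILSEQ, and U+00FF
encodes as the two bytes c3 bf, which a 16-bit little-endian dump shows
as bfc3.

```c
#include <stddef.h>

/*
 * Illustration only: encode a code point below U+0800 as UTF-8.
 * Returns the number of bytes written to out (1 or 2), or 0 if
 * the code point is out of range for this sketch.
 */
size_t
utf8_encode_small(unsigned int cp, unsigned char out[2])
{
	if (cp < 0x80) {
		out[0] = (unsigned char)cp;
		return 1;
	}
	if (cp < 0x800) {
		out[0] = (unsigned char)(0xc0 | (cp >> 6));
		out[1] = (unsigned char)(0x80 | (cp & 0x3f));
		return 2;
	}
	return 0;
}

/*
 * The bytes 0xc0, 0xc1 and 0xf5 to 0xff never occur in
 * well-formed UTF-8, neither as lead nor as continuation bytes.
 */
int
utf8_never_valid(unsigned char b)
{
	return b == 0xc0 || b == 0xc1 || b >= 0xf5;
}
```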

Instead of switching globally to wchar_t, one option might be
to use mbtowc(3) and wcwidth(3) locally only in those places
where information on display widths is needed, without destroying the
ability to properly handle arbitrary bytes.  So far, we have not done
that because it entails considerable complication, and we didn't deem
the priority of width-2 and width-0 line editing at the shell prompt
all that high.  Maybe it's worth doing at some point.  Maybe.
Certainly not in such a dirty way as in the patch the OP posted...
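
A minimal sketch of what such local use of mbtowc(3) and wcwidth(3)
could look like, keeping the buffer as plain bytes.  The function
unit_at() is hypothetical and not part of ksh, and the choice to treat
undecodable bytes and non-printable characters as width 1 is an
assumption of this example, not an established policy.

```c
#define _XOPEN_SOURCE 700	/* glibc declares wcwidth(3) only with this */
#include <stdlib.h>
#include <wchar.h>

/*
 * Hypothetical helper: inspect the bytes at s (len bytes available),
 * store the display width of the first unit in *width and return
 * the number of bytes it occupies.  The buffer itself stays a plain
 * byte array, so arbitrary bytes survive unchanged.
 */
size_t
unit_at(const char *s, size_t len, int *width)
{
	wchar_t wc;
	int n, w;

	if (len == 0) {
		*width = 0;
		return 0;
	}
	n = mbtowc(&wc, s, len);
	if (n <= 0) {
		/*
		 * Invalid or incomplete sequence, or a NUL byte:
		 * keep it as a single byte of assumed width 1.
		 */
		(void)mbtowc(NULL, NULL, 0);	/* reset conversion state */
		*width = 1;
		return 1;
	}
	w = wcwidth(wc);
	*width = w < 0 ? 1 : w;		/* non-printable: assume width 1 */
	return (size_t)n;
}
```

A caller would invoke setlocale(LC_CTYPE, "") once at startup and then
step through the edit buffer with unit_at(), grouping each width-0 unit
with the unit preceding it.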

Finally, note that the Perl code in our base system
contains some of the information that would be needed to refine
the assumptions 1. to 3. above in some respects - but most of these
refinements would not be supported by any C library, not even by
a C library boasting full POSIX locale support.  Such refinements
typically require specialized Unicode libraries that go far beyond
what any C library can do - and i think in shell command line editing,
going beyond the assumptions 1. to 3. would be quite crazy.
Like, who would expect shell command line editing to work well
for Hangul text processing?


Omar, note that i gave up on refuting what Christian Schulte is saying
on a point-by-point basis because what he says is typically a mixture of
half-truths, misunderstandings, misleading claims, and arguments at best
tangentially related to the topic under discussion.  So unless you are
in urgent need of wild goose feathers, consider hunting elsewhere...

But given that you have shown interest, i thought it might make sense
to explain what is *actually* going on here.

Yours,
  Ingo