From: enh <enh@google.com>
Subject: Re: [REPOST] ksh: utf8 full width character support for emacs.c
To: Ingo Schwarze <schwarze@usta.de>
Cc: Omar Polo <op@omarpolo.com>, tech@openbsd.org
Date: Mon, 31 Mar 2025 08:40:04 -0400

On Sun, Mar 30, 2025 at 7:47 PM Ingo Schwarze <schwarze@usta.de> wrote:
>
> Hello Omar,
>
> Omar Polo wrote on Sun, Mar 30, 2025 at 08:37:06PM +0200:
>
> > grapheme clusters (i.e. what a user perceives as a "character")
> > can be more than one code point long.
>
> That is (more or less) accurate - and ludicrously complicated.
> There is a long annex to the Unicode standard on this topic,
>   Unicode Text Segmentation, Unicode Standard Annex #29
>   https://www.unicode.org/reports/tr29/
>
> Note that "grapheme clusters" are not the same as "user-perceived
> characters", and there are several different types of grapheme clusters
> (legacy, extended, tailored, ...).
>
>
> On using wchar_t:
>
> > which i'm not sure it actually fixes anything, given that a wchar_t is
> > not a grapheme cluster.
> >
> > ("anything" in the context of the thread, i.e. hitting the left arrow
> > key and seeing the cursor skipping over a character rather than
> > being stuck somewhere between the wchar_t LATIN SMALL LETTER A and the
> > wchar_t COMBINING ACUTE ACCENT)
>
> You are right that if all you do is switch from char to wchar_t, that
> solves almost none of the problems the OP's patch aimed to tackle.
>
> Then again, switching to wchar_t *and* using wcwidth(3), when taken
> together, can be leveraged to solve *most* of the problem because
> the following *vastly simplified* model more or less works in most
> situations, at least much better than what we have now:
>
>  1. Assume that user-perceived characters are identical to
>     legacy grapheme clusters.
>     (Of course, that assumption breaks down when dealing with
>      user perceived characters that are larger than grapheme clusters,
>      but that is relatively rare, in particular in languages based on
>      latin, greek or cyrillic scripts.)
>  2. Assume that all width-0 code points are combining characters
>     and vice versa.
>     (Of course, combining characters exist that are not width-0,
>      and width-0 code points exist that are not combining, but
>      the detrimental effects on line editing tend to be relatively
>      rare and maybe also relatively mild.)
>  3. Assume that users want to operate on user-perceived characters
>     (e.g. move the cursor by one of them at a time, replace them
>      one at a time, and so on)
>     rather than wanting to edit *parts* of user-perceived characters
>     (which does make sense with some scripts and languages, and for
>      some languages it depends on the context and editing purpose
>      whether it is desirable or not.)
>
> Taking these three assumptions together allows the following
> vastly simplified modus operandi for line editing:
>
>  A. Each width-1 and each width-2 code point starts a
>     user-perceived character.
>  B. Zero or more subsequent width-0 code points are part
>     of the same user-perceived character.
>  C. The display width of the user-perceived character
>     is the width of its first code point.
>  D. The cursor can only be placed on these user-perceived characters,
>     and each of them can only be edited as a whole.
>
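
(A minimal sketch of rules A. to D. in C, for illustration only. It
assumes things the mail does not say: that the line is already
available as an array of code points, and that the width lookup is
passed in as a callback. All names here are hypothetical. demo_width()
is a toy stand-in so the sketch does not depend on the locale; real
code would presumably pass a wrapper around wcwidth(3) instead.)

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical callback type: display width of one code point. */
typedef int (*width_fn)(wchar_t);

/*
 * Toy stand-in for wcwidth(3), NOT a real width table:
 * U+0300..U+036F (combining diacriticals) are width 0,
 * U+4E00..U+9FFF (CJK ideographs) are width 2, all else width 1.
 */
int demo_width(wchar_t wc)
{
    if (wc >= 0x0300 && wc <= 0x036F)
        return 0;
    if (wc >= 0x4E00 && wc <= 0x9FFF)
        return 2;
    return 1;
}

/* Rules A and B: index where the next user-perceived character starts. */
size_t cluster_next(const wchar_t *s, size_t n, size_t pos, width_fn width)
{
    assert(pos <= n);
    if (pos < n)
        pos++;                  /* step over the starting code point */
    while (pos < n && width(s[pos]) == 0)
        pos++;                  /* absorb the width-0 followers */
    return pos;
}

/* Rule D: moving left lands on the start of the previous character. */
size_t cluster_prev(const wchar_t *s, size_t n, size_t pos, width_fn width)
{
    assert(pos <= n);
    if (pos > 0)
        pos--;
    while (pos > 0 && width(s[pos]) == 0)
        pos--;                  /* back up to the width-1 or width-2 starter */
    return pos;
}

/* Rule C: the character is as wide as its first code point. */
int cluster_width(const wchar_t *s, size_t n, size_t pos, width_fn width)
{
    assert(pos < n);
    return width(s[pos]);
}
```

With the code points 'a', U+0301, U+4E2D, 'b', moving right from index
0 skips the accent and lands on index 2, and moving left from index 2
returns to index 0 rather than stopping between the letter and its
accent.
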
> Doing this does *not* require switching from "char" to "wchar_t"
> globally throughout the shell.  Worse, such a switch would destroy
> some functionality that the shell now provides.  For example, try the
> following command:
>
>    $ echo $(printf "\xff") | hexdump -C
>   0000000  ff 0a
>   0000002
>
> The printf(1) command produces the byte 0xff.  The shell reads that byte
> and constructs the command "echo weird-byte" which prints the weird byte
> to standard output as desired.  If the shell internally stored its
> input in wchar_t, that would not be possible because that byte cannot
> be stored in wchar_t.  Depending on how we would handle such input,
> this might result in the shell dying from EILSEQ, or printing 0xbfc3
> instead of 0xff, or any number of other weird changes in behavior.
>
> Instead of switching globally to wchar_t, one option might be to
> use mbtowc(3) and wcwidth(3) locally only in those places
> where information on display widths is needed, without destroying the
> ability to properly handle arbitrary bytes.  So far, we did not do
> that because it implies quite some complication, and we didn't deem
> the priority of width-2 and width-0 line editing at the shell prompt
> all that high.  Maybe it's worth doing at some point.  Maybe.
> Certainly not in such a dirty way as in the patch the OP posted...
>
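
(A rough sketch of what "locally" could look like, under assumptions
of my own: the buffer stays a plain char array, and one character is
decoded at the point where a width is needed. It uses mbrtowc(3), the
restartable sibling of mbtowc(3). The function name and the policy of
treating an undecodable byte as a one-byte unit are hypothetical, not
something the mail or ksh prescribes. The decoded code point would
then be handed to wcwidth(3).)

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/*
 * Decode one character from the byte buffer s (len bytes available)
 * according to the current locale.  Returns the number of bytes it
 * occupies, which is always at least 1.
 *
 * On success *ok is 1 and *wc holds the code point.  If the bytes do
 * not decode (invalid or truncated sequence), *ok is 0, *wc holds the
 * raw byte value, and exactly one byte is consumed, so the caller can
 * keep that byte in the buffer unchanged.
 */
size_t decode_or_byte(const char *s, size_t len, wchar_t *wc, int *ok)
{
    mbstate_t st;
    size_t n;

    assert(len > 0);
    memset(&st, 0, sizeof(st));     /* start from the initial state */
    n = mbrtowc(wc, s, len, &st);
    if (n == (size_t)-1 || n == (size_t)-2) {
        *wc = (unsigned char)s[0];  /* remember the raw byte */
        *ok = 0;
        return 1;
    }
    *ok = 1;
    return n == 0 ? 1 : n;          /* a NUL byte still occupies one byte */
}
```

Because the buffer itself is never converted, a stray 0xff byte stays a
0xff byte; only the cursor arithmetic ever sees the wchar_t.
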
> Finally, note that our Perl code contained in the base system
> contains some of the information that would be needed to refine
> the assumptions 1. to 3. above in some respects - but most of these
> refinements would not be supported by any C library, not even by
> a C library boasting full POSIX locale support.  Such refinements
> typically require specialized Unicode libraries that go far beyond
> what any C library can do - and i think in shell command line editing,
> going beyond the assumptions 1. to 3. would be quite crazy.
> Like, who would expect shell command line editing to work well
> for Hangul text processing?

(funnily enough, in _practice_ this will actually work quite well
because IMEs give you the pre-composed syllables rather than the
individual jamo. and if you really did want to cope with absolutely
everything you could possibly encounter, my comments
https://android-review.googlesource.com/c/platform/bionic/+/2983497
where i looked into this in detail in the context of Android's
icu-based wcwidth() implementation suggest that your earlier "vastly
simplified" algorithm would work fine for that too ... assuming you
don't have the same bug in your wcwidth() that i fixed in Android's
with that change :-) )

> Omar, note that i gave up on refuting what Christian Schulte is saying
> on a point-by-point basis because what he says is typically a mixture of
> half-truths, misunderstandings, misleading claims, and arguments at best
> tangentially related to the topic under discussion.  So unless you are
> in urgent need of wild goose feathers, consider hunting elsewhere...
>
> But given that you have shown interest, i thought it might make sense
> to explain what is *actually* going on here.
>
> Yours,
>   Ingo
>