[REPOST] ksh: utf8 full width character support for emacs.c
On Sun, Mar 30, 2025 at 7:47 PM Ingo Schwarze <schwarze@usta.de> wrote:
>
> Hello Omar,
>
> Omar Polo wrote on Sun, Mar 30, 2025 at 08:37:06PM +0200:
>
> > grapheme clusters (i.e. what a user perceives as a "character")
> > can be more than one code point long.
>
> That is (more or less) accurate - and ludicrously complicated.
> There is a long annex to the Unicode standard on this topic,
> Unicode Text Segmentation, Unicode Standard Annex #29
> https://www.unicode.org/reports/tr29/
>
> Note that "grapheme clusters" are not the same as "user-perceived
> characters", and there are several different types of grapheme clusters
> (legacy, extended, tailored, ...).
>
>
> On using wchar_t:
>
> > which i'm not sure it actually fixes anything, given that a wchar_t is
> > not a grapheme cluster.
> >
> > ("anything" in the context of the thread, i.e. hitting the left arrow on
> > the key and seeing the cursor skipping over a character rather than
> > being stuck somewhere between the wchar_t LATIN SMALL LETTER A and the
> > wchar_t COMBINING ACUTE ACCENT)
>
> You are right that if all you do is switch from char to wchar_t, that
> solves almost none of the problems the OP's patch aimed to tackle.
>
> Then again, switching to wchar_t *and* using wcwidth(3), when taken
> together, can be leveraged to solve *most* of the problem because
> the following *vastly simplified* model more or less works in most
> situations, at least much better than what we have now:
>
> 1. Assume that user-perceived characters are identical to
> legacy grapheme clusters.
> (Of course, that assumption breaks down when dealing with
> user perceived characters that are larger than grapheme clusters,
> but that is relatively rare, in particular in languages based on
> latin, greek or cyrillic scripts.)
> 2. Assume that all width-0 code points are combining characters
> and vice versa.
> (Of course, combining characters exist that are not width-0,
> and width-0 code points exist that are not combining, but
> the detrimental effects on line editing tend to be relatively
> rare and maybe also relatively mild.)
> 3. Assume that users want to operate on user-perceived characters
> (e.g. move the cursor by one of them at a time, replace them
> one at a time, and so on)
> rather than wanting to edit *parts* of user-perceived characters
> (which does make sense with some scripts and languages, and for
> some languages it depends on the context and editing purpose
> whether it is desirable or not.)
>
> Taking these three assumptions together allows the following
> vastly simplified modus operandi for line editing:
>
> A. Each width-1 and each width-2 code point starts a
> user-perceived character.
> B. Zero or more subsequent width-0 code points are part
> of the same user-perceived character.
> C. The display width of the user-perceived character
> is the width of its first code point.
> D. The cursor can only be placed on these user-perceived characters,
> and each of them can only be edited as a whole.
>
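Rules A to D above can be sketched in a few lines of C. This is only an illustration under stated assumptions, not code from ksh's emacs.c: the function names are made up, and the caller is assumed to have already filled `widths[i]` with the wcwidth(3) result (0, 1 or 2) for the i-th code point of the edit line.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helpers, not taken from ksh.  widths[i] is assumed to
 * hold the wcwidth(3) result (0, 1 or 2) for the i-th code point of
 * the edit line; n is the number of code points.
 */

/* Rules A and B: step forward over one user-perceived character. */
size_t
next_char(const int *widths, size_t n, size_t pos)
{
	assert(pos <= n);
	if (pos < n)
		pos++;			/* the code point that starts it */
	while (pos < n && widths[pos] == 0)
		pos++;			/* trailing width-0 code points */
	return pos;
}

/* Rule D: step backward to the start of the previous character. */
size_t
prev_char(const int *widths, size_t pos)
{
	while (pos > 0 && widths[--pos] == 0)
		continue;		/* skip width-0 code points */
	return pos;
}

/* Rule C: the display width is the width of the first code point. */
int
char_width(const int *widths, size_t n, size_t pos)
{
	assert(pos < n);
	return widths[pos];
}
```

With widths {1, 0, 2} (say LATIN SMALL LETTER A, COMBINING ACUTE ACCENT, then a width-2 code point), moving right from position 0 lands on position 2, and moving left from 2 lands back on 0, so the cursor never stops between the base letter and its accent.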
> Doing this does *not* require switching from "char" to "wchar_t"
> globally throughout the shell. Worse, such a switch would destroy
> some functionality that the shell now provides. For example, try the
> following command:
>
> $ echo $(printf "\xff") | hexdump -C
> 0000000 ff 0a
> 0000002
>
> The printf(1) command produces the byte 0xff. The shell reads that byte
> and constructs the command "echo weird-byte" which prints the weird byte
> to standard output as desired. If the shell internally stored its
> input in wchar_t, that would not be possible because that byte cannot
> be stored in wchar_t. Depending on how we would handle such input,
> this might result in the shell dying from EILSEQ, or printing 0xbfc3
> instead of 0xff, or any number of other weird changes in behavior.
>
> Instead of switching globally to wchar_t, one option might be to
> use mbtowc(3) and wcwidth(3) locally only in those places
> where information on display widths is needed, without destroying the
> ability to properly handle arbitrary bytes. So far, we did not do
> that because it implies quite some complication, and we didn't deem
> the priority of width-2 and width-0 line editing at the shell prompt
> all that high. Maybe it's worth doing at some point. Maybe.
> Certainly not in such a dirty way as in the patch the OP posted...
>
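A rough sketch of what "locally" could look like: the line stays a plain char buffer, and mbtowc(3) plus wcwidth(3) are consulted only to count columns. The function name `display_width` is hypothetical. Counting one column for an undecodable byte or a non-printable code point is an assumption made for this sketch, not something ksh is known to do. The result depends on the LC_CTYPE locale the caller has set with setlocale(3).

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700	/* for wcwidth(3) on some systems */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

/*
 * Hypothetical helper: count the display columns of the len bytes at s
 * without converting the buffer itself.  Bytes that do not decode in
 * the current locale stay untouched in the buffer and are counted as
 * one column each (an assumption of this sketch).
 */
size_t
display_width(const char *s, size_t len)
{
	size_t cols = 0, i = 0;

	assert(s != NULL || len == 0);
	(void)mbtowc(NULL, NULL, 0);		/* reset conversion state */
	while (i < len) {
		wchar_t wc;
		int n = mbtowc(&wc, s + i, len - i);

		if (n <= 0) {
			/* invalid or incomplete sequence, or a NUL byte */
			(void)mbtowc(NULL, NULL, 0);
			cols++;
			i++;
			continue;
		}
		int w = wcwidth(wc);
		cols += (w < 0) ? 1 : (size_t)w;	/* non-printable: 1 */
		i += (size_t)n;
	}
	return cols;
}
```

Because only the width calculation decodes anything, a stray byte like 0xff is still passed through unchanged, which is the property the echo/printf example relies on.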
> Finally, note that our Perl code contained in the base system
> contains some of the information that would be needed to refine
> the assumptions 1. to 3. above in some respects - but most of these
> refinements would not be supported by any C library, not even by
> a C library boasting full POSIX locale support. Such refinements
> typically require specialized Unicode libraries that go far beyond
> what any C library can do - and i think in shell command line editing,
> going beyond the assumptions 1. to 3. would be quite crazy.
> Like, who would expect shell command line editing to work well
> for Hangul text processing?
(funnily enough, in _practice_ this will actually work quite well
because IMEs give you the pre-composed syllables rather than the
individual jamo. and if you really did want to cope with absolutely
everything you could possibly encounter, my comments
https://android-review.googlesource.com/c/platform/bionic/+/2983497
where i looked into this in detail in the context of Android's
icu-based wcwidth() implementation suggest that your earlier "vastly
simplified" algorithm would work fine for that too ... assuming you
don't have the same bug in your wcwidth() that i fixed in Android's
with that change :-) )
> Omar, note that i gave up on refuting what Christian Schulte is saying
> on a point-by-point basis because what he says is typically a mixture of
> half-truths, misunderstandings, misleading claims, and arguments at best
> tangentially related to the topic under discussion. So unless you are
> in urgent need of wild goose feathers, consider hunting elsewhere...
>
> But given that you have shown interest, i thought it might make sense
> to explain what is *actually* going on here.
>
> Yours,
> Ingo
>