From: enh
Subject: Re: [REPOST] ksh: utf8 full width character support for emacs.c
To: Ingo Schwarze
Cc: Omar Polo, tech@openbsd.org
Date: Mon, 31 Mar 2025 08:40:04 -0400

On Sun, Mar 30, 2025 at 7:47 PM Ingo Schwarze wrote:
>
> Hello Omar,
>
> Omar Polo wrote on Sun, Mar 30, 2025 at 08:37:06PM +0200:
>
> > grapheme clusters (i.e. what a user perceives as a "character")
> > can be more than one code point long.
>
> That is (more or less) accurate - and ludicrously complicated.
> There is a long annex to the Unicode standard on this topic,
> Unicode Text Segmentation, Unicode Standard Annex #29
> https://www.unicode.org/reports/tr29/
>
> Note that "grapheme clusters" are not the same as "user-perceived
> characters", and there are several different types of grapheme clusters
> (legacy, extended, tailored, ...).
>
>
> On using wchar_t:
>
> > which i'm not sure it actually fixes anything, given that a wchar_t is
> > not a grapheme cluster.
> >
> > ("anything" in the context of the thread, i.e. hitting the left arrow on
> > the key and seeing the cursor skipping over a character rather than
> > being stuck somewhere between the wchar_t LATIN SMALL LETTER A and the
> > wchar_t COMBINING ACUTE ACCENT)
>
> You are right that if all you do is switching from char to wchar_t, that
> solves almost nothing of the problems the OP's patch aimed to tackle.
>
> Then again, switching to wchar_t *and* using wcwidth(3), when taken
> together, can be leveraged to solve *most* of the problem because
> the following *vastly simplified* model more or less works in most
> situations, at least much better than what we have now:
>
>  1. Assume that user-perceived characters are identical to
>     legacy grapheme clusters.
>     (Of course, that assumption breaks down when dealing with
>     user perceived characters that are larger than grapheme clusters,
>     but that is relatively rare, in particular in languages based on
>     latin, greek or cyrillic scripts.)
>  2. Assume that all width-0 code points are combining characters
>     and vice versa.
>     (Of course, combining characters exist that are not width-0,
>     and width-0 code points exist that are not combining, but
>     the detrimental effects on line editing tend to be relatively
>     rare and maybe also relatively mild.)
>  3. Assume that users want to operate on user-perceived characters
>     (e.g. move the cursor by one of them at a time, replace them
>     one at a time, and so on)
>     rather than wanting to edit *parts* of user-perceived characters
>     (which does make sense with some scripts and languages, and for
>     some languages it depends on the context and editing purpose
>     whether it is desirable or not.)
>
> Taking these three assumptions together allows the following
> vastly simplified modus operandi for line editing:
>
>  A. Each width-1 and each width-2 code point starts a
>     user-perceived character.
>  B. Zero or more subsequent width-0 code points are part
>     of the same user-perceived character.
>  C. The display width of the user-perceived character
>     is the width of its first code point.
>  D. The cursor can only be placed on these user-perceived characters,
>     and each of them can only be edited as a whole.
>
> Doing this does *not* require switching from "char" to "wchar_t"
> globally throughout the shell.  Worse, such a switch would destroy
> some functionality that the shell now provides.  For example, try the
> following command:
>
>    $ echo $(printf "\xff") | hexdump -C
>    0000000 ff 0a
>    0000002
>
> The printf(1) command produces the byte 0xff.  The shell reads that byte
> and constructs the command "echo weird-byte" which prints the weird byte
> to standard output as desired.  If the shell would internally store its
> input in wchar_t, that would not be possible because that byte cannot
> be stored in wchar_t.  Depending on how we would handle such input,
> this might result in the shell dying from EILSEQ, or printing 0xbfc3
> instead of 0xff, or any number of other weird changes in behavior.
>
> Instead of switching globally to wchar_t, what might be an option
> might be to use mbtowc(3) and wcwidth(3) locally only in those places
> where information on display widths is needed, without destroying the
> ability to properly handle arbitrary bytes.  So far, we did not do
> that because it implies quite some complication, and we didn't deem
> the priority of width-2 and width-0 line editing at the shell prompt
> all that high.  Maybe it's worth doing at some point.  Maybe.
> Certainly not in such a dirty way as in the patch the OP posted...
>
> Finally, note that our Perl code contained in the base system
> contains some of the information that would be needed to refine
> the assumptions 1. to 3. above in some respects - but most of these
> refinements would not be supported by any C library, not even by
> a C library boasting full POSIX locale support.  Such refinements
> typically require specialized Unicode libraries that go far beyond
> what any C library can do - and i think in shell command line editing,
> going beyond the assumptions 1. to 3. would be quite crazy.
> Like, who would expect shell command line editing to work well
> for Hangul text processing?

(funnily enough, in _practice_ this will actually work quite well
because IMEs give you the pre-composed syllables rather than the
individual jamo. and if you really did want to cope with absolutely
everything you could possibly encounter, my comments
https://android-review.googlesource.com/c/platform/bionic/+/2983497
where i looked into this in detail in the context of Android's
icu-based wcwidth() implementation suggest that your earlier "vastly
simplified" algorithm would work fine for that too ... assuming you
don't have the same bug in your wcwidth() that i fixed in Android's
with that change :-) )

> Omar, note that i gave up on refuting what Christian Schulte is saying
> on a point-by-point basis because what he says is typically a mixture of
> half-truths, misunderstandings, misleading claims, and arguments at best
> tangentially related to the topic under discussion.  So unless you are
> in urgent need of wild goose feathers, consider hunting elsewhere...
>
> But given that you have shown interest, i thought it might make sense
> to explain what is *actually* going on here.
>
> Yours,
>   Ingo
>
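
For illustration, the quoted rules A. to D. can be sketched in C roughly
as follows.  This is only a sketch under stated assumptions, not code
from ksh or from the patch under discussion: struct cluster,
cluster_at(), cursor_right(), cursor_left() and demo_width() are
hypothetical names, and demo_width() is a toy stand-in for wcwidth(3)
so that the example does not depend on the locale.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical type: one user-perceived character in the simplified model. */
struct cluster {
	size_t	len;	/* code points in it, including trailing width-0 ones */
	int	cols;	/* display columns: the width of the first code point */
};

/* Width callback; in real use this would be a thin wrapper around wcwidth(3). */
typedef int (*width_fn)(wchar_t);

/*
 * Toy stand-in for wcwidth(3), an assumption made for this example only:
 * U+0300..U+036F (combining diacritics) are width 0,
 * U+4E00..U+9FFF (CJK ideographs) are width 2, everything else is width 1.
 */
int
demo_width(wchar_t wc)
{
	if (wc >= 0x0300 && wc <= 0x036F)
		return 0;
	if (wc >= 0x4E00 && wc <= 0x9FFF)
		return 2;
	return 1;
}

/*
 * Rules A to C: the code point at s[0] starts the cluster and decides its
 * width; any width-0 code points that follow are attached to it.
 * n is the number of code points available, not counting a terminator.
 */
struct cluster
cluster_at(const wchar_t *s, size_t n, width_fn width_of)
{
	struct cluster c = { 0, 0 };
	int w;

	assert(width_of != NULL);
	if (n == 0)
		return c;
	w = width_of(s[0]);
	/* Assumption: a negative (non-printable) width occupies no columns. */
	c.cols = w > 0 ? w : 0;
	c.len = 1;
	while (c.len < n && width_of(s[c.len]) == 0)
		c.len++;
	return c;
}

/* Rule D: moving right skips the whole cluster that starts at pos. */
size_t
cursor_right(const wchar_t *s, size_t n, size_t pos, width_fn width_of)
{
	if (pos >= n)
		return n;
	return pos + cluster_at(s + pos, n - pos, width_of).len;
}

/* Rule D: moving left backs up over width-0 code points to the cluster start. */
size_t
cursor_left(const wchar_t *s, size_t pos, width_fn width_of)
{
	if (pos == 0)
		return 0;
	pos--;
	while (pos > 0 && width_of(s[pos]) == 0)
		pos--;
	return pos;
}
```

With the buffer { 'a', U+0301, U+4E2D, 'b' }, moving right from index 0
lands on index 2, so the cursor never stops between the letter and its
combining accent.  In a real editor the wchar_t values would presumably
come from mbtowc(3) applied locally to the byte buffer; how to treat
bytes that mbtowc(3) rejects is left open here.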