From: Ingo Schwarze
Subject: Re: [REPOST] ksh: utf8 full width character support for emacs.c
To: Omar Polo
Cc: tech@openbsd.org
Date: Mon, 31 Mar 2025 01:47:14 +0200

Hello Omar,

Omar Polo wrote on Sun, Mar 30, 2025 at 08:37:06PM +0200:

> grapheme clusters (i.e. what a user perceives as a "character")
> can be more than one code point long.

That is (more or less) accurate - and ludicrously complicated.
There is a long annex to the Unicode standard on this topic,

  Unicode Text Segmentation, Unicode Standard Annex #29
  https://www.unicode.org/reports/tr29/

Note that "grapheme clusters" are not the same as "user-perceived
characters", and there are several different types of grapheme
clusters (legacy, extended, tailored, ...).

On using wchar_t:

> which i'm not sure it actually fixes anything, given that a wchar_t is
> not a grapheme cluster.
>
> ("anything" in the context of the thread, i.e. hitting the left arrow on
> the key and seeing the cursor skipping over a character rather than
> being stuck somewhere between the wchar_t LATIN SMALL LETTER A and the
> wchar_t COMBINING ACUTE ACCENT)

You are right that if all you do is switch from char to wchar_t,
that solves almost none of the problems the OP's patch aimed to tackle.

Then again, switching to wchar_t *and* using wcwidth(3), taken
together, can be leveraged to solve *most* of the problem because
the following *vastly simplified* model more or less works in most
situations, at least much better than what we have now:

 1. Assume that user-perceived characters are identical to legacy
    grapheme clusters.
    (Of course, that assumption breaks down when dealing with
    user-perceived characters that are larger than grapheme clusters,
    but that is relatively rare, in particular in languages based
    on Latin, Greek or Cyrillic scripts.)

 2. Assume that all width-0 code points are combining characters
    and vice versa.
    (Of course, combining characters exist that are not width-0,
    and width-0 code points exist that are not combining, but the
    detrimental effects on line editing tend to be relatively rare
    and maybe also relatively mild.)

 3. Assume that users want to operate on user-perceived characters
    (e.g. move the cursor by one of them at a time, replace them
    one at a time, and so on) rather than wanting to edit *parts*
    of user-perceived characters.
    (Editing parts does make sense with some scripts and languages,
    and for some languages it depends on the context and editing
    purpose whether it is desirable or not.)

Taking these three assumptions together allows the following vastly
simplified modus operandi for line editing:

 A. Each width-1 and each width-2 code point starts a user-perceived
    character.

 B. Zero or more subsequent width-0 code points are part of the
    same user-perceived character.

 C. The display width of the user-perceived character is the width
    of its first code point.

 D. The cursor can only be placed on these user-perceived characters,
    and each of them can only be edited as a whole.

Doing this does *not* require switching from "char" to "wchar_t"
globally throughout the shell.  Worse, such a switch would destroy
some functionality that the shell now provides.  For example, try
the following command:

   $ echo $(printf "\xff") | hexdump -C
   0000000 ff 0a
   0000002

The printf(1) command produces the byte 0xff.  The shell reads that
byte and constructs the command "echo weird-byte", which prints the
weird byte to standard output as desired.  If the shell internally
stored its input in wchar_t, that would not be possible because that
byte is not a valid character and cannot be stored in wchar_t.
Depending on how we handled such input, this might result in the
shell dying from EILSEQ, or printing 0xc3 0xbf (the UTF-8 encoding
of U+00FF) instead of 0xff, or any number of other weird changes in
behavior.

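To illustrate that rules A. to D. can work on a plain char buffer,
here is a minimal sketch, not code from ksh.  The names edit_unit,
decode_one(), next_unit(), unit_left() and init_ctype() are made up
for this example.  Showing raw bytes and non-printable code points as
one column is an assumption, too; a real editor might pick another
representation.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700   /* glibc declares wcwidth() only with this */
#endif
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical type: one editing unit in the sense of rules A. to D. */
struct edit_unit {
    size_t len;     /* number of bytes belonging to the unit */
    int    width;   /* display columns, from the first code point */
};

/* Pick the character encoding from the environment; 0 on failure. */
int
init_ctype(void)
{
    return setlocale(LC_CTYPE, "") != NULL;
}

/*
 * Decode one code point from the n > 0 bytes at s.
 * Stores its byte length in *len and returns its wcwidth(3).
 * Invalid or incomplete sequences and NUL stay single raw bytes
 * and are reported with width -1.
 */
static int
decode_one(const char *s, size_t n, size_t *len)
{
    mbstate_t mbs;
    wchar_t   wc;
    size_t    r;

    memset(&mbs, 0, sizeof(mbs));
    r = mbrtowc(&wc, s, n, &mbs);
    if (r == (size_t)-1 || r == (size_t)-2 || r == 0) {
        *len = 1;
        return -1;
    }
    *len = r;
    return wcwidth(wc);
}

/*
 * Rules A. to C.: the unit at s consists of the first code point
 * plus all directly following width-0 code points.
 * Returns 0 at the end of the buffer, 1 otherwise.
 */
int
next_unit(const char *s, size_t n, struct edit_unit *u)
{
    size_t len, pos;
    int    w;

    assert(s != NULL && u != NULL);
    if (n == 0)
        return 0;
    w = decode_one(s, n, &len);
    /* Assumption: raw bytes and non-printables take one column. */
    u->width = w < 0 ? 1 : w;
    pos = len;
    while (pos < n && decode_one(s + pos, n - pos, &len) == 0)
        pos += len;
    u->len = pos;
    return 1;
}

/*
 * Rule D.: byte offset where "cursor left" from byte offset cur
 * lands.  Scans from the start of the line for simplicity.
 */
size_t
unit_left(const char *s, size_t n, size_t cur)
{
    struct edit_unit u;
    size_t pos = 0, prev = 0;

    while (pos < cur && next_unit(s + pos, n - pos, &u)) {
        prev = pos;
        pos += u.len;
    }
    return prev;
}
```

With a UTF-8 LC_CTYPE, the bytes "a", 0xcc 0x81 (COMBINING ACUTE
ACCENT), "b" should come out as two units of 3 bytes and 1 byte, so
"cursor left" from behind the accent lands on the "a" rather than
between the letter and the accent.  A lone 0xff byte stays a unit of
its own and is never converted, so the buffer still holds exactly the
bytes that were typed.
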
Instead of switching globally to wchar_t, one option might be to
use mbtowc(3) and wcwidth(3) locally, only in those places where
information on display widths is needed, without destroying the
ability to properly handle arbitrary bytes.  So far, we did not do
that because it implies quite some complication, and we didn't deem
the priority of width-2 and width-0 line editing at the shell prompt
all that high.  Maybe it's worth doing at some point.  Maybe.
Certainly not in such a dirty way as in the patch the OP posted...

Finally, note that the Perl code in our base system contains some
of the information that would be needed to refine the assumptions
1. to 3. above in some respects - but most of these refinements
would not be supported by any C library, not even by a C library
boasting full POSIX locale support.  Such refinements typically
require specialized Unicode libraries that go far beyond what any
C library can do - and i think in shell command line editing, going
beyond the assumptions 1. to 3. would be quite crazy.  Like, who
would expect shell command line editing to work well for Hangul
text processing?

Omar, note that i gave up on refuting what Christian Schulte is
saying on a point-by-point basis because what he says is typically
a mixture of half-truths, misunderstandings, misleading claims, and
arguments at best tangentially related to the topic under discussion.
So unless you are in urgent need of wild goose feathers, consider
hunting elsewhere...

But given that you have shown interest, i thought it might make
sense to explain what is *actually* going on here.

Yours,
  Ingo