From: Omar Polo
Subject: Re: [REPOST] ksh: utf8 full width character support for emacs.c
To: Ingo Schwarze
Cc: tech@openbsd.org
Date: Thu, 03 Apr 2025 20:41:48 +0200

Ingo Schwarze wrote:
> Hello Omar,
>
> Omar Polo wrote on Sun, Mar 30, 2025 at 08:37:06PM +0200:
>
> > grapheme clusters (i.e. what a user perceives as a "character")
> > can be more than one code point long.
>
> That is (more or less) accurate - and ludicrously complicated.
> There is a long annex to the Unicode standard on this topic,
> Unicode Text Segmentation, Unicode Standard Annex #29
>   https://www.unicode.org/reports/tr29/
>
> Note that "grapheme clusters" are not the same as "user-perceived
> characters", and there are several different types of grapheme clusters
> (legacy, extended, tailored, ...).

oops, i fell into the trap of "someone is wrong on the internet" just to
spit out more wrong stuff, apologies ;-)

> [...]
>
> Taking these three assumptions together allows the following
> vastly simplified modus operandi for line editing:
>
>  A. Each width-1 and each width-2 code point starts a
>     user-perceived character.
>  B. Zero or more subsequent width-0 code points are part
>     of the same user-perceived character.
>  C. The display width of the user-perceived character
>     is the width of its first code point.
>  D. The cursor can only be placed on these user-perceived characters,
>     and each of them can only be edited as a whole.

fwiw i second that this might be enough for ksh.  it would fix so many
of the current issues with non-ascii text that what remains is more than
tolerable i guess.

> Doing this does *not* require switching from "char" to "wchar_t"
> globally throughout the shell.  Worse, such a switch would destroy
> some functionality that the shell now provides.  For example, try the
> following command:
>
>    $ echo $(printf "\xff") | hexdump -C
>   0000000 ff 0a
>   0000002
>
> The printf(1) command produces the byte 0xff.  The shell reads that byte
> and constructs the command "echo weird-byte" which prints the weird byte
> to standard output as desired.  If the shell internally stored its
> input in wchar_t, that would not be possible because that byte cannot
> be stored in wchar_t.  Depending on how we would handle such input,
> this might result in the shell dying from EILSEQ, or printing 0xbfc3
> instead of 0xff, or any number of other weird changes in behavior.

agreed; my favourite part is that you don't even need printf; just
iterating over file names is enough (since they're just bytestrings)

> [...]
>
> Omar, note that i gave up on refuting what Christian Schulte is saying
> on a point-by-point basis because what he says is typically a mixture of
> half-truths, misunderstandings, misleading claims, and arguments at best
> tangentially related to the topic under discussion.  So unless you are
> in urgent need of wild goose feathers, consider hunting elsewhere...

yeah, as i was saying i just fell for an almost pointless correction.
But it led to your precise reply, which is always a pleasure to read!

> But given that you have shown interest, i thought it might make sense
> to explain what is *actually* going on here.

i guess i have to put my code where my mouth is.  I like your outline
for handling text in ksh interactive mode; can't promise anything but
i'll give it an honest spin!

Thank you!
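
For illustration only, rules A to D quoted above could be sketched roughly as follows. This is not code from ksh or from any posted diff: the names `cp_width`, `utf8_decode` and `upc_next` are hypothetical, and `cp_width` is a deliberately tiny stand-in for wcwidth(3) that knows only a handful of ranges. The buffer stays a plain byte array, and bytes that are not valid UTF-8 are assumed to count as one width-1 unit each, so a stray 0xff survives untouched.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, very rough stand-in for wcwidth(3); illustration only. */
int
cp_width(uint32_t cp)
{
	if ((cp >= 0x0300 && cp <= 0x036F) || cp == 0x200D)
		return 0;	/* combining diacritics, zero width joiner */
	if ((cp >= 0x1100 && cp <= 0x115F) ||	/* Hangul Jamo initials */
	    (cp >= 0x2E80 && cp <= 0xA4CF && cp != 0x303F) ||	/* CJK */
	    (cp >= 0xAC00 && cp <= 0xD7A3) ||	/* Hangul syllables */
	    (cp >= 0xFF01 && cp <= 0xFF60))	/* fullwidth forms */
		return 2;
	return 1;
}

/*
 * Decode one UTF-8 sequence from s[0..len).  Returns the number of
 * bytes consumed and stores the code point in *cp; returns 0 if the
 * bytes are not a valid, shortest-form sequence.
 */
size_t
utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
	size_t i, n;
	uint32_t c, min;

	if (len == 0)
		return 0;
	if (s[0] < 0x80) {
		*cp = s[0];
		return 1;
	} else if ((s[0] & 0xE0) == 0xC0) {
		n = 2; c = s[0] & 0x1F; min = 0x80;
	} else if ((s[0] & 0xF0) == 0xE0) {
		n = 3; c = s[0] & 0x0F; min = 0x800;
	} else if ((s[0] & 0xF8) == 0xF0) {
		n = 4; c = s[0] & 0x07; min = 0x10000;
	} else
		return 0;
	if (len < n)
		return 0;
	for (i = 1; i < n; i++) {
		if ((s[i] & 0xC0) != 0x80)
			return 0;
		c = (c << 6) | (s[i] & 0x3F);
	}
	if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
		return 0;
	*cp = c;
	return n;
}

/*
 * Return the byte length of the user-perceived character starting at
 * s and store its display width in *width.
 */
size_t
upc_next(const unsigned char *s, size_t len, int *width)
{
	uint32_t cp;
	size_t n, used;

	assert(len > 0);
	used = utf8_decode(s, len, &cp);
	if (used == 0) {
		/* Assumption: an invalid byte is its own width-1 unit. */
		*width = 1;
		return 1;
	}
	/* Rule C: the width is that of the first code point. */
	*width = cp_width(cp);
	/* Rule B: absorb any following width-0 code points. */
	while (used < len) {
		n = utf8_decode(s + used, len - used, &cp);
		if (n == 0 || cp_width(cp) != 0)
			break;
		used += n;
	}
	return used;
}
```

Under rule D, a forward cursor motion would advance the buffer offset by the return value and the screen column by `*width`, and a delete would remove that many bytes in one go. A width-0 code point at the very start of the buffer has nothing to attach to; this sketch simply treats it as a zero-width unit of its own.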