From: Omar Polo
Subject: Re: [REPOST] ksh: utf8 full width character support for emacs.c
To: Christian Schulte
Cc: tech@openbsd.org
Date: Sun, 30 Mar 2025 20:37:06 +0200

Christian Schulte wrote:
> Of course there are consequences. char in C has always and will always
> correspond to a byte (8 bit). That's the root cause of the issue at
> hand. Well. In Java char is 16 bit so it's suffering the same
> limitations. There never has been a type in C reflecting a byte other
> than char. But char stands for character, not byte. These days a
> character in unicode is at least a 20 bit value.

sorry for nitpicking, but no.  grapheme clusters (i.e. what a user
perceives as a "character") can be more than one code point long.
you can easily see emojis as long as ten code points.

(just using emojis as an example, actual human languages make use of
combining characters as well)

> Regarding that
> "setlocale" thing. There always has been an implicit setlocale from day
> one making char to mean 7 bit US-ASCII, although it's historically an 8
> bit value. So all those other encodings making use of that 8 bit
> property, like iso8895 and such were always just a work around to
> provide more than those ASCII characters. Just give up on that "a
> char(acter) is a byte" invariant and accept that a char(acter) these
> days is at least a 20 bit value. There is no drawback on e.g. shell
> scripts containing unicode and such, if the shell would just give up on
> that "a character is a byte" invariant, which it could by giving up on
> char in favour of wchar_t.

i'm not sure that actually fixes anything, given that a wchar_t is
not a grapheme cluster.

("anything" in the context of the thread, i.e. hitting the left arrow
key and seeing the cursor skip over a whole character rather than
getting stuck somewhere between the wchar_t LATIN SMALL LETTER A and
the wchar_t COMBINING ACUTE ACCENT)
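
To make the last point concrete, here is a minimal sketch in C of the two cursor models.  It is not taken from ksh or from the patch under discussion: the function names are hypothetical, the buffer holds bare code points as uint32_t rather than wchar_t so that no locale is involved, and is_combining() only knows the Combining Diacritical Marks block (U+0300..U+036F).  Real grapheme cluster segmentation needs the Unicode tables and rules, so treat this as an illustration only.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: true only for the Combining Diacritical Marks
 * block (U+0300..U+036F).  A deliberate simplification; real code
 * needs full Unicode property tables.
 */
static int
is_combining(uint32_t cp)
{
	return cp >= 0x0300 && cp <= 0x036F;
}

/*
 * "One code point is one character" model: step back by exactly one
 * element.  This can leave the cursor between a base letter and the
 * accent that follows it.
 */
static size_t
left_by_codepoint(const uint32_t *buf, size_t pos)
{
	(void)buf;	/* the buffer contents do not matter in this model */
	return pos > 0 ? pos - 1 : 0;
}

/*
 * Cluster-aware model (simplified): skip any combining marks directly
 * to the left of the cursor, then skip the base code point as well.
 */
static size_t
left_by_cluster(const uint32_t *buf, size_t pos)
{
	while (pos > 0 && is_combining(buf[pos - 1]))
		pos--;
	return pos > 0 ? pos - 1 : 0;
}
```

With the line { 'x', U+0061, U+0301 } and the cursor at the end (position 3), left_by_codepoint() returns 2, which is between LATIN SMALL LETTER A and COMBINING ACUTE ACCENT, while left_by_cluster() returns 1, just after the 'x'.  Widening the element type does not change the first result; only the cluster logic does.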