[REPOST] ksh: utf8 full width character support for emacs.c
On 3/21/25 12:15, Ingo Schwarze wrote:
> Hello,
>
> Gong Zhile wrote on Wed, Mar 19, 2025 at 10:15:42AM +0800:
>
>> There isn't any wchar_t involved in that patch. It took a UTF-8 rune
>> (codepoint) from a cstring and process it. But, as enh has pointed out,
>> refactoring it to elevate wcwidth(3) is surely a good idea.
>
> I wouldn't say "surely", but i would say "more likely".
>
> You have picked quite a tricky task here. Hurdles include:
> * To use wcwidth(3), you need wchar_t values.
> * I see no reasonable way how you could get such values
>   other than by using libc functions like mbtowc(3).
> * To use these functions, it is necessary to use setlocale(3)
>   or functions like newlocale(3)/uselocale(3), which the shell
>   does not use yet.
> * One needs to consider whether using ???locale(3) in the shell
>   carries any risk, or needs to be restricted to certain areas of
>   the code. It's not yet clear to me that suddenly running all
>   the code in the shell under a locale different from the C locale
>   would have no detrimental consequences.

Of course there are consequences.

A char in C has always corresponded to a byte (8 bits in practice), and
it always will. That is the root cause of the issue at hand. In Java,
char is 16 bits, so it suffers from the same kind of limitation.

C has never had a type for a byte other than char. But char stands for
character, not byte, and these days a Unicode character needs up to
21 bits.

Regarding that "setlocale" thing: there has been an implicit setlocale
from day one, making char mean 7-bit US-ASCII even though it has
historically been an 8-bit value. All the other encodings that make use
of the eighth bit, like ISO 8859 and such, were always just a workaround
to provide more than the ASCII characters.

Just give up on the "a char(acter) is a byte" invariant and accept that
a char(acter) these days needs up to 21 bits. There is no drawback for
e.g. shell scripts containing Unicode if the shell simply gives up on
that invariant, which it could do by giving up on char in favour of
wchar_t.

-- Christian
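
To make the "a character is not a byte" point concrete, here is a minimal sketch of pulling one UTF-8 code point out of a C string without touching the locale machinery, which is roughly the approach the quoted text attributes to the patch. It is not code from that patch: the helper name utf8_decode and its interface are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not taken from the ksh patch.
 * Decode one UTF-8 sequence from s, of which n bytes are available.
 * On success, store the code point in *cp and return the number of
 * bytes consumed (1 to 4).  Return 0 for invalid or truncated input.
 */
static size_t
utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
	size_t len, i;
	uint32_t c, min;

	if (n == 0)
		return 0;

	if (s[0] < 0x80) {			/* plain ASCII, one byte */
		*cp = s[0];
		return 1;
	} else if ((s[0] & 0xE0) == 0xC0) {	/* 110xxxxx: 2 bytes */
		len = 2; c = s[0] & 0x1F; min = 0x80;
	} else if ((s[0] & 0xF0) == 0xE0) {	/* 1110xxxx: 3 bytes */
		len = 3; c = s[0] & 0x0F; min = 0x800;
	} else if ((s[0] & 0xF8) == 0xF0) {	/* 11110xxx: 4 bytes */
		len = 4; c = s[0] & 0x07; min = 0x10000;
	} else {
		return 0;			/* stray continuation or bad lead byte */
	}

	if (n < len)
		return 0;			/* truncated sequence */

	for (i = 1; i < len; i++) {
		if ((s[i] & 0xC0) != 0x80)	/* must be 10xxxxxx */
			return 0;
		c = (c << 6) | (s[i] & 0x3F);
	}

	/* Reject overlong forms, surrogates and values past U+10FFFF. */
	if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
		return 0;

	*cp = c;
	return len;
}
```

A decoder like this only yields the code point. Finding out how many terminal columns it occupies still needs a width table or wcwidth(3), which is where the wchar_t and locale hurdles listed above come in.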