From: Christian Schulte <cs@schulte.it>
Subject: Re: [REPOST] ksh: utf8 full width character support for emacs.c
To: tech@openbsd.org
Date: Tue, 18 Mar 2025 08:18:19 +0100

On 3/18/25 06:12, Christian Schulte wrote:
> On 3/16/25 15:49, Gong Zhile wrote:
>> Full-width characters are common on systems set up for Asian languages such as
>> Chinese, Japanese and Korean. These characters take up twice the width of a
>> normal ASCII character, but x_size counts each of them as one unit. When
>> navigating between such characters in emacs line editing mode, the cursor loses
>> track and messes up the line, making input really difficult.
>>
>> I tried to make x_size count correctly, using static variables in the function
>> and a lookup table generated from ‘EastAsianWidth.txt’. Characters that mainly
>> count as width 2 are Kanji, Katakana, Hiragana, Hangul, Roman full-width
>> characters, emojis etc.
>>
>> Expected behavior (after patching): the cursor lands correctly while
>> navigating between full-width characters, and line editing commands (like
>> x_transpose) perform correctly.
>>
>> Known issue: when running off the screen with full-width chars, it fails to
>> place the angle bracket correctly. (This is not easy to deal with when
>> full-width characters cross xx_cols.)
>>
>> Tested on: rxvt-unicode, xterm
> 
> wchar_t on OpenBSD and most other Unix-like OSes is 32-bit UTF-32.
> Others use 16-bit UTF-16 with surrogate values for everything > 0xffff.
> Some (microcontroller) libraries use 8-bit UTF-8. It's detectable by
> compiling
> 
> wchar_t *s = L"\U0010ffff";
> 
> and seeing what the compiler produces.
> 
> UTF-32: wcslen(s) == 1 && *s == 0x10ffff
> UTF-16: wcslen(s) == 2 && s[0] == 0xdbff && s[1] == 0xdfff
> UTF-8:  wcslen(s) == 4 && s[0] == 0xf4 && s[1] == 0x8f
> 	&& s[2] == 0xbf && s[3] == 0xbf
> 
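
The check quoted above can be wrapped in a small function. This is only a
sketch: wide_literal_encoding() is a hypothetical helper name, not something
from ksh or libc, and it compares the stored values directly, exactly as in
the three conditions above.

```c
#include <stddef.h>
#include <wchar.h>

/*
 * Hypothetical helper (not part of ksh): report how the compiler
 * encoded a wide string literal, judged by how it stores U+10FFFF.
 * Returns 32, 16 or 8 for UTF-32, UTF-16 or UTF-8, and 0 if none
 * of the three patterns from the mail matches.
 */
static int
wide_literal_encoding(void)
{
	static const wchar_t s[] = L"\U0010ffff";
	size_t n = wcslen(s);

	/* One unit holding the code point itself. */
	if (n == 1 && s[0] == 0x10ffff)
		return 32;
	/* High and low surrogate pair. */
	if (n == 2 && s[0] == 0xdbff && s[1] == 0xdfff)
		return 16;
	/* Four units holding the UTF-8 byte sequence. */
	if (n == 4 && s[0] == 0xf4 && s[1] == 0x8f &&
	    s[2] == 0xbf && s[3] == 0xbf)
		return 8;
	return 0;
}
```

Calling it from main() and printing the result is enough to see which case a
given toolchain falls into.
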
> Getting full Unicode support would mean replacing every 8-bit char
> with wchar_t and using the wide character string functions instead of
> the 8-bit string functions. Anything else will always be a non-portable
> hack. The same goes for multibyte strings. That could mean everything.
> Such a shell would be cool to have, of course, but it is quite a
> refactoring effort. So you would end up with the current shell
> unchanged and a new shell (uksh) to choose as a starting point, only to
> notice that this will work only once every other piece of software has
> been refactored from char to wchar_t.
> 
>> +
>> +int u8_to_cpt(const char *buf, unsigned long *cpt) {
> 
> If this is supposed to convert UTF-8 to UTF-32, it's wrong.

I did not mean to be disrespectful with this. In the same way that OpenBSD
went from 32-bit time_t to 64-bit time_t, OpenBSD would need to go from 8-bit
char to 32-bit char to get "full" Unicode support. That's not that easy.
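
To make the UTF-8 to UTF-32 remark more concrete, here is a minimal sketch of
what a strict conversion of one sequence has to check. It is not the patch's
u8_to_cpt (whose body is not quoted here); utf8_decode() and its signature are
made up for illustration. It rejects truncated input, bad continuation bytes,
overlong forms, surrogates and values above 0x10ffff.

```c
#include <stddef.h>

/*
 * Hypothetical helper: decode one UTF-8 sequence from buf (at most
 * len bytes) into *cp. Returns the number of bytes consumed (1 to 4),
 * or 0 if the input is truncated or not valid UTF-8.
 */
static int
utf8_decode(const unsigned char *buf, size_t len, unsigned long *cp)
{
	unsigned long c, min;
	int i, n;

	if (len == 0)
		return 0;
	c = buf[0];
	if (c < 0x80) {			/* plain ASCII */
		*cp = c;
		return 1;
	} else if ((c & 0xe0) == 0xc0) {
		n = 2; c &= 0x1f; min = 0x80;
	} else if ((c & 0xf0) == 0xe0) {
		n = 3; c &= 0x0f; min = 0x800;
	} else if ((c & 0xf8) == 0xf0) {
		n = 4; c &= 0x07; min = 0x10000;
	} else
		return 0;		/* stray continuation or bad lead byte */

	if (len < (size_t)n)
		return 0;		/* truncated sequence */
	for (i = 1; i < n; i++) {
		if ((buf[i] & 0xc0) != 0x80)
			return 0;	/* not a continuation byte */
		c = (c << 6) | (buf[i] & 0x3f);
	}
	/* Reject overlong forms, out-of-range values and surrogates. */
	if (c < min || c > 0x10ffff || (c >= 0xd800 && c <= 0xdfff))
		return 0;
	*cp = c;
	return n;
}
```

The decoded value is what a width lookup (wcwidth(3), or a table built from
EastAsianWidth.txt) would take as input.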

-- .
Christian