From: Christian Schulte
Subject: Re: [REPOST] ksh: utf8 full width character support for emacs.c
To: Gong Zhile, tech@openbsd.org
Date: Thu, 20 Mar 2025 05:37:44 +0100

On 3/20/25 05:09, Christian Schulte wrote:
> On 3/19/25 03:15, Gong Zhile wrote:
>> On Tue, 2025-03-18 at 06:12 +0100, Christian Schulte wrote:
>>> On 3/16/25 15:49, Gong Zhile wrote:
>>>> Full width characters are commonly used in Asian language systems
>>>> like Chinese, Japanese and Korean. Those characters take double the
>>>> width of a normal ascii char, but x_size only counts them as one
>>>> unit. When navigating between those characters in emacs line editing
>>>> mode, the cursor loses track and messes up the line, making it
>>>> really difficult to input.
>>>>
>>>> I tried to make x_size count correctly with static variables in the
>>>> function and a lookup in a table generated from ‘EastAsianWidth.txt’.
>>>> Characters mainly counted with a width of 2 are: Kanji, Katakana,
>>>> Hiragana, Hangul, Roman full-width characters, emojis, etc.
>>>>
>>>> Expected behavior (after patching): the cursor should land correctly
>>>> while navigating between full width characters, and line editing
>>>> commands (like x_transpose) perform correctly.
>>>>
>>>> Known issue: when heading off the screen with full width chars, it
>>>> fails to place the angle bracket correctly. (Not easy to deal with
>>>> when full width characters cross xx_cols.)
>>>>
>>>> Tested on: rxvt-unicode, xterm
>>>
>>> wchar_t on OpenBSD and most other unix-like OSes is 32 bit UTF-32.
>>> Others use 16 bit UTF-16 with surrogate values for everything >
>>> 0xffff. Some (microcontroller) libraries use 8 bit UTF-8. It's
>>> detectable by compiling
>>>
>>>     wchar_t *s = L"\U0010ffff";
>>>
>>> and seeing what the compiler produces.
>>>
>>> UTF-32: wcslen(s) == 1 && *s == 0x10ffff
>>> UTF-16: wcslen(s) == 2 && s[0] == 0xdbff && s[1] == 0xdfff
>>> UTF-8:  wcslen(s) == 4 && s[0] == 0xf4 && s[1] == 0x8f
>>>         && s[2] == 0xbf && s[3] == 0xbf
>>>
>>> Getting full unicode support would mean replacing every 8 bit char
>>> with wchar_t and using the wide character string functions instead of
>>> the 8 bit string functions. Everything else will always be a
>>> non-portable hack. Same for multi byte strings. That could mean
>>> everything. Such a shell would be cool to have, of course. Quite a
>>> refactoring effort. So you would end up with the current shell
>>> unchanged, and a new shell (uksh) to choose as a starting point, just
>>> to notice that this will only work once every other piece of software
>>> is refactored from char to wchar_t as well.
>>
>> In fact, in its current state ksh is already working quite well with
>> utf-8, thanks to the earlier work regarding utf-8 support, except for
>> the problem with multi-column characters. This is simply a step to
>> make it better with utf-8.
>
> Just keep in mind that UTF-8 is just a hack to overcome the 8 bit limit
> of char, which is not compatible with unicode; unicode currently
> requires at least 21 bits, with the highest codepoint equal to
> 0x10ffff. Nothing mandates that those 8 bit char multibyte strings
> actually contain UTF-8.
> For example.

This is very misleading.

-- 
Christian
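
[Supplementary note, not part of the thread]

The wchar_t probe quoted above can also be checked at run time. The
following is only a minimal standalone sketch of that test, not code
from the thread; it prints how the compiler encoded L"\U0010ffff":

    #include <stdio.h>
    #include <wchar.h>

    int
    main(void)
    {
            wchar_t *s = L"\U0010ffff";
            size_t len = wcslen(s);

            printf("sizeof(wchar_t) = %zu, wcslen(s) = %zu\n",
                sizeof(wchar_t), len);
            for (size_t i = 0; i < len; i++)
                    printf("s[%zu] = 0x%lx\n", i, (unsigned long)s[i]);
            /* On OpenBSD (32 bit wchar_t, UTF-32): wcslen(s) == 1
             * and s[0] == 0x10ffff. */
            return 0;
    }

On a platform with 16 bit wchar_t the same program would report the two
surrogate code units instead, matching the table quoted above.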
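[Supplementary note, not part of the thread]

On the x_size point itself: POSIX exposes the display width of a wide
character via wcwidth(3), which reports 2 columns for East Asian full
width characters. A hedged sketch, assuming a UTF-8 locale is available;
the helper name count_columns is made up for illustration:

    #include <locale.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <wchar.h>

    /* Count terminal columns of a multibyte string, -1 on invalid input. */
    static int
    count_columns(const char *s)
    {
            wchar_t wc;
            int n, w, cols = 0;

            while (*s != '\0') {
                    n = mbtowc(&wc, s, MB_CUR_MAX);
                    if (n <= 0)
                            return -1;
                    w = wcwidth(wc);
                    if (w > 0)      /* full width characters count as 2 */
                            cols += w;
                    s += n;
            }
            return cols;
    }

    int
    main(void)
    {
            /* How the bytes are interpreted depends on the locale;
             * nothing forces them to be UTF-8. */
            setlocale(LC_CTYPE, "en_US.UTF-8");
            /* "abc" plus U+6F22 (3 bytes of UTF-8): 3 + 2 = 5 columns. */
            printf("%d\n", count_columns("abc\xe6\xbc\xa2"));
            return 0;
    }

This only illustrates the width lookup; the proposed emacs.c patch uses
its own table generated from EastAsianWidth.txt rather than the libc
locale machinery.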