From: Christian Schulte
Subject: Re: [REPOST] ksh: utf8 full width character support for emacs.c
To: Gong Zhile, tech@openbsd.org
Date: Fri, 21 Mar 2025 09:53:33 +0100

On 3/19/25 03:15, Gong Zhile wrote:
> On Tue, 2025-03-18 at 06:12 +0100, Christian Schulte wrote:
>> On 3/16/25 15:49, Gong Zhile wrote:
>>> Full width characters are commonly used in Asian language systems like
>>> Chinese, Japanese and Korean. Those characters take double the width
>>> of a normal ASCII char, but x_size only counts them as one unit. When
>>> navigating between those characters in emacs line editing mode, the
>>> cursor loses track and messes up the line, making input really
>>> difficult.
>>>
>>> I tried to make x_size count correctly with static variables in the
>>> function and a lookup in a table generated from 'EastAsianWidth.txt'.
>>> Characters that mainly count as size 2 are: Kanji, Katakana, Hiragana,
>>> Hangul, Roman Full-Width Characters, emojis etc.
>>>
>>> Expected behavior (after patching): the cursor should land correctly
>>> while navigating between full width characters, and line editing
>>> commands (like x_transpose) should perform correctly.
>>>
>>> Known issue: When heading off the screen with full width chars, it
>>> fails to place the angle bracket correctly. (Not easy to deal with
>>> when full width characters cross xx_cols.)
>>>
>>> Tested on: rxvt-unicode, xterm
>>
>> wchar_t on OpenBSD and most other unix like OSes is 32 bit UTF 32.
>> Others use 16 bit UTF 16 with surrogate values for everything > 0xffff.
>> Some (microcontroller) libraries use 8 bit UTF 8. It's detectable by
>> compiling
>>
>> wchar_t *s = L"\U0010ffff";
>>
>> and seeing what the compiler produces.
>>
>> UTF 32: wcslen(s) == 1 && *s == 0x10ffff
>> UTF 16: wcslen(s) == 2 && s[0] == 0xdbff && s[1] == 0xdfff
>> UTF 8:  wcslen(s) == 4 && s[0] == 0xf4 && s[1] == 0x8f
>>         && s[2] == 0xbf && s[3] == 0xbf
>>
>> Getting full unicode support would mean replacing every 8 bit char
>> with wchar_t and using wide character string functions instead of the
>> 8 bit string functions. Everything else will always be a non-portable
>> hack. Same for multi byte strings. That could mean everything. Such a
>> shell would be cool to have, of course. Quite a refactoring effort. So
>> you would end up with the current shell unchanged, and a new shell
>> (uksh) to choose as a starting point, just to notice that this will only
>> work when every other piece of software is refactored from char to
>> wchar_t.
>
> In fact, in its current state, ksh already works quite well with utf-8,
> thanks to the earlier work on utf-8 support, except for the problem with
> multi-column characters. This is simply a step to make it better with
> utf-8.
>
>>> +
>>> +int u8_to_cpt(const char *buf, unsigned long *cpt) {
>>
>> If this is supposed to convert UTF 8 to UTF 32, it's wrong.
>
> There isn't any wchar_t involved in that patch. It takes a UTF-8 rune
> (code point) from a C string and processes it. But, as enh has pointed
> out, refactoring it to leverage wcwidth(3) is surely a good idea.

That's the point. It should. Take a look at as a starting point.

--
Christian
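
The wchar_t check quoted above can be written as a small C function. This is only a minimal sketch, assuming a C99 compiler that accepts \U escapes in wide string literals. The names wchar_encoding and detect_wchar_encoding are hypothetical and are not part of ksh or of the patch under discussion.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical result type for this sketch; not taken from ksh. */
enum wchar_encoding { WC_UNKNOWN, WC_UTF32, WC_UTF16, WC_UTF8 };

/*
 * Inspect how the compiler encoded U+10FFFF in a wide string literal.
 * U+10FFFF is the highest code point, so it needs the maximum number
 * of code units in UTF-16 (2) and UTF-8 (4), and a single unit in
 * UTF-32.
 */
enum wchar_encoding
detect_wchar_encoding(void)
{
	static const wchar_t s[] = L"\U0010ffff";
	size_t n = wcslen(s);

	if (n == 1 && s[0] == 0x10ffff)
		return WC_UTF32;
	if (n == 2 && s[0] == 0xdbff && s[1] == 0xdfff)
		return WC_UTF16;
	/* Assumes an 8 bit wchar_t holds the bytes as non-negative values. */
	if (n == 4 && s[0] == 0xf4 && s[1] == 0x8f &&
	    s[2] == 0xbf && s[3] == 0xbf)
		return WC_UTF8;
	return WC_UNKNOWN;
}
```

Calling detect_wchar_encoding() from a test program and printing the result is enough to see which of the three cases applies on a given toolchain. Going by the quoted text, OpenBSD should report the UTF 32 case.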