From: Gong Zhile
Subject: Re: [REPOST] ksh: utf8 full width character support for emacs.c
To: Christian Schulte, tech@openbsd.org
Date: Wed, 19 Mar 2025 10:15:42 +0800

On Tue, 2025-03-18 at 06:12 +0100, Christian Schulte wrote:
> On 3/16/25 15:49, Gong Zhile wrote:
> > Full width characters are commonly used in Asian language system like
> > Chinese, Japanese and Korean etc. Those characters took double the
> > width of a normal ascii char but x_size only counts them in one unit.
> > When navigating between those characters in emacs line editing mode,
> > the cursor would lose track and mess up the line making it really
> > difficult to input.
> >
> > I tried to make x_size counts correctly with static variables in func
> > and looking up in a table generated from ‘EastAsianWidth.txt’.
> > Characters mainly count in a size of 2 are: Kanji, Katakana, Hiragana,
> > Hangul, Roman Full-Width Characters, emojis etc.
> >
> > Expected behavior (After patching): cursor should land correctly while
> > navigating between full width characters, line editing commands (like
> > x_transpose) correctly perform.
> >
> > Known issue: When heading off the screen with full width chars, it
> > fails to place the angle bracket correctly. (Not easy to deal with
> > when full width characters crossing xx_cols)
> >
> > Tested on: rxvt-unicode, xterm
>
> wchar_t on OpenBSD and most other unix like OSes is 32 bit UTF 32.
> Others use 16 bit UTF 16 with surrogate values for everything > 0xffff.
> Some (microcontroller) libraries use 8 bit UTF 8. It's detectable by
> compiling
>
> wchar_t *s = L"\U0010ffff";
>
> and see what the compiler will produce.
>
> UTF 32: wcslen(s) == 1 && *s == 0x10ffff
> UTF 16: wcslen(s) == 2 && s[0] == 0xdbff && s[1] == 0xdfff
> UTF 8: wcslen(s) == 4 && s[0] == 0xf4 && s[1] == 0x8f
>        && s[2] == 0xbf && s[3] == 0xbf
>
> Getting full unicode support would mean to replace everything 8 bit char
> with wchar_t and use wide character string functions instead of the 8
> bit string functions. Everything else will always be a non-portable
> hack. Same for multi byte strings. That could mean everything. Such a
> shell would be cool to have, of course. Quite a refactoring effort. So
> you would end up with the current shell unchanged, and a new shell
> (uksh) to choose as a starting point, just to notice that this will only
> work when every other software will be refactored from char to wchar_t.

In fact, in its current state ksh already works quite well with UTF-8,
thanks to the earlier work on UTF-8 support. The remaining problem is
multi-column characters. This patch is simply a step toward making the
UTF-8 handling better.

> > +
> > +int u8_to_cpt(const char *buf, unsigned long *cpt) {
>
> If this is supposed to convert UTF 8 to UTF 32, it's wrong.

There isn't any wchar_t involved in that patch. The function takes one
UTF-8 rune (code point) from a C string and processes it. But, as enh
has pointed out, refactoring it to leverage wcwidth(3) is surely a good
idea.
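
As a rough illustration of the wcwidth(3) direction (a sketch only, not
the patch and not how emacs.c is structured): the hypothetical helper
str_columns() below walks a multibyte string with mbrtowc(3) and sums
the column widths reported by wcwidth(3). It assumes the caller has
already selected a UTF-8 LC_CTYPE locale with setlocale(3). Counting
undecodable bytes and non-printable characters as one column is an
arbitrary choice made for the example.

```c
#define _XOPEN_SOURCE 700	/* exposes wcwidth(3) on some systems */
#include <assert.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper, not part of the patch: return the number of
 * terminal columns occupied by the multibyte string s in the current
 * LC_CTYPE locale.  Bytes that do not decode, and characters for
 * which wcwidth(3) returns -1, are counted as one column each.
 */
static int
str_columns(const char *s)
{
	mbstate_t st;
	size_t len, n;
	wchar_t wc;
	int cols = 0, w;

	assert(s != NULL);
	memset(&st, 0, sizeof(st));
	len = strlen(s);
	while (len > 0) {
		n = mbrtowc(&wc, s, len, &st);
		if (n == 0)
			break;		/* decoded a NUL; should not happen */
		if (n == (size_t)-1 || n == (size_t)-2) {
			/* Invalid or truncated sequence: skip one byte. */
			memset(&st, 0, sizeof(st));
			n = 1;
			w = 1;
		} else {
			w = wcwidth(wc);
			if (w < 0)
				w = 1;
		}
		s += n;
		len -= n;
		cols += w;
	}
	return cols;
}
```

With something like this, the width table generated from
EastAsianWidth.txt would no longer be needed, at the cost of depending
on the locale's character tables.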
> > +    const unsigned char *ubuf = buf;
> > +
> > +    if (ubuf[0] <= 0x7F) {
> > +        *cpt = ubuf[0];
> > +        return 1;
> > +    } else if ((ubuf[0] & 0xE0) == 0xC0) {
>                               0xC0
>
> > +        *cpt = ((ubuf[0] & 0x1F) << 6) | (ubuf[1] & 0x3F);
> > +        return 2;
> > +    } else if ((ubuf[0] & 0xF0) == 0xE0) {
>                               0xE0
>
> > +        *cpt = ((ubuf[0] & 0x0F) << 12)
> > +            | ((ubuf[1] & 0x3F) << 6)
> > +            | (ubuf[2] & 0x3F);
> > +        return 3;
> > +    } else if ((ubuf[0] & 0xF8) == 0xF0) {
>                               0xF0
>
> > +        *cpt = ((ubuf[0] & 0x07) << 18)
> > +            | ((ubuf[1] & 0x3F) << 12)
> > +            | ((ubuf[2] & 0x3F) << 6)
> > +            | (ubuf[3] & 0x3F);
> > +        return 4;
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +#endif
> > Index: bin/ksh/unicode.h
> > ===================================================================
> > --- bin/ksh/unicode.h    (new file)
> > +++ bin/ksh/unicode.h    (working copy)
> > --- /dev/null 2024-12-17 11:54:03.396000088 +0800
> > +++ bin/ksh/unicode.h 2024-12-17 09:19:00.521730569 +0800
> > @@ -0,0 +1,7 @@
> > +#ifndef UNICODE_H
> > +#define UNICODE_H
> > +
> > +int is_fullwidth(unsigned long);
> > +int u8_to_cpt(const char *, unsigned long *);
> > +
> > +#endif /* UNICODE_H */
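
For comparison, the quoted u8_to_cpt() trusts its input: it does not
check continuation bytes, the buffer length, or overlong encodings. A
stricter decoder could look roughly like the sketch below. The name
u8_decode_strict() and the extra len parameter are hypothetical and
appear nowhere in the patch. This is only one way to do the checks,
not a proposed replacement.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stricter variant of the quoted u8_to_cpt(), for
 * illustration only.  Decodes one UTF-8 sequence from buf (at most
 * len bytes) into *cpt and returns its length in bytes, or 0 if the
 * sequence is truncated, malformed, overlong, a surrogate, or above
 * U+10FFFF.
 */
static int
u8_decode_strict(const char *buf, size_t len, unsigned long *cpt)
{
	const unsigned char *u = (const unsigned char *)buf;
	unsigned long c, min;
	int i, n;

	assert(buf != NULL && cpt != NULL);
	if (len == 0)
		return 0;
	if (u[0] <= 0x7F) {
		*cpt = u[0];
		return 1;
	} else if ((u[0] & 0xE0) == 0xC0) {
		n = 2; c = u[0] & 0x1F; min = 0x80;
	} else if ((u[0] & 0xF0) == 0xE0) {
		n = 3; c = u[0] & 0x0F; min = 0x800;
	} else if ((u[0] & 0xF8) == 0xF0) {
		n = 4; c = u[0] & 0x07; min = 0x10000;
	} else
		return 0;	/* stray continuation or invalid lead byte */

	if (len < (size_t)n)
		return 0;	/* sequence runs past the buffer */
	for (i = 1; i < n; i++) {
		if ((u[i] & 0xC0) != 0x80)
			return 0;	/* not a continuation byte */
		c = (c << 6) | (u[i] & 0x3F);
	}
	/* Reject overlong forms, surrogates and out-of-range values. */
	if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
		return 0;
	*cpt = c;
	return n;
}
```

The lead-byte masks are the same ones used in the quoted patch. Only
the validation around them is new.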