From: Gong Zhile <gongzl@stu.hebust.edu.cn>
Subject: Re: [REPOST] ksh: utf8 full width character support for emacs.c
To: Christian Schulte <cs@schulte.it>, tech@openbsd.org
Date: Wed, 19 Mar 2025 10:15:42 +0800

On Tue, 2025-03-18 at 06:12 +0100, Christian Schulte wrote:
> On 3/16/25 15:49, Gong Zhile wrote:
> > Full width characters are commonly used in Asian language system like
> > Chinese, Japanese and Korean etc. Those characters took double the
> > width of a normal ascii char but x_size only counts them in one unit.
> > When navigating between those characters in emacs line editing mode,
> > the cursor would lose track and mess up the line making it really
> > difficult to input.
> > 
> > I tried to make x_size counts correctly with static variables in func
> > and looking up in a table generated from ‘EastAsianWidth.txt’.
> > Characters mainly count in a size of 2 are: Kanji, Katakana, Hiragana,
> > Hangul, Roman Full-Width Characters, emojis etc.
> > 
> > Expected behavior (After patching): cursor should land correctly while
> > navigating between full width characters, line editing commands (like
> > x_transpose) correctly perform.
> > 
> > Known issue: When heading off the screen with full width chars, it
> > fails to place the angle bracket correctly. (Not easy to deal with when
> > full width characters crossing xx_cols)
> > 
> > Tested on: rxvt-unicode, xterm
> 
> wchar_t on OpenBSD and most other unix like OSes is 32 bit UTF 32.
> Others use 16 bit UTF 16 with surrogate values for everything > 0xffff.
> Some (microcontroller) libraries use 8 bit UTF 8. It's detectable by
> compiling
> 
> wchar_t *s = L"\U0010ffff";
> 
> and see what the compiler will produce.
> 
> UTF 32: wcslen(s) == 1 && *s == 0x10ffff
> UTF 16: wcslen(s) == 2 && s[0] == 0xdbff && s[1] == 0xdfff
> UTF 8: wcslen(s) == 4 && s[0] == 0xf4 && s[1] == 0x8f
> 	&& s[2] == 0xbf && s[3] == 0xbf
> 
> Getting full unicode support would mean to replace everything 8 bit char
> with wchar_t and use wide character string functions instead of the 8
> bit string functions. Everything else will always be a non-portable
> hack. Same for multi byte strings. That could mean everything. Such a
> shell would be cool to have, of course. Quite a refactoring effort. So
> you would end up with the current shell unchanged, and a new shell
> (uksh) to choose as a starting point, just to notice that this will only
> work when every other software will be refactored from char to wchar_t.

In fact, ksh in its current state already works quite well with UTF-8,
thanks to the earlier UTF-8 support work. The remaining problem is the
handling of multi-column characters, and this patch is simply a step toward
making that better.

> > +
> > +int u8_to_cpt(const char *buf, unsigned long *cpt) {
> 
> If this is supposed to convert UTF 8 to UTF 32, it's wrong.

There isn't any wchar_t involved in that patch. It takes a UTF-8 rune
(code point) from a C string and processes it. But, as enh has pointed out,
refactoring it to leverage wcwidth(3) is surely a good idea.
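
To illustrate the direction, here is a minimal sketch of a width lookup
built on mbrtowc(3) and wcwidth(3) instead of a hand-made table. The name
mb_cols() is hypothetical and not part of the patch. It assumes a UTF-8
LC_CTYPE has been selected with setlocale(3), which would need checking
for ksh, and counting non-printable characters as one column is only a
placeholder policy.

```c
#define _XOPEN_SOURCE 700	/* exposes wcwidth(3) on some libcs */
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper: decode one multibyte character from s (at most
 * n bytes) in the current LC_CTYPE locale and report its column width.
 * Returns the number of bytes consumed, or 0 at the end of the string
 * or on an invalid or incomplete sequence; *cols is 0 in that case.
 */
size_t
mb_cols(const char *s, size_t n, int *cols)
{
	mbstate_t st;
	wchar_t wc;
	size_t len;
	int w;

	assert(s != NULL && cols != NULL);
	*cols = 0;
	memset(&st, 0, sizeof(st));	/* start in the initial shift state */
	len = mbrtowc(&wc, s, n, &st);
	if (len == 0 || len == (size_t)-1 || len == (size_t)-2)
		return 0;
	w = wcwidth(wc);
	/* Assumption: count non-printable characters as one column. */
	*cols = w < 0 ? 1 : w;
	return len;
}
```

A caller would walk the edit buffer with it, advancing by the returned
byte count and summing *cols, after something like
setlocale(LC_CTYPE, "C.UTF-8") has succeeded.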

> > +	const unsigned char *ubuf = buf;
> > +
> > +	if (ubuf[0] <= 0x7F) {
> > +		*cpt = ubuf[0];
> > +		return 1;
> > +	} else if ((ubuf[0] & 0xE0) == 0xC0) {
>                               0xC0
> 
> > +		*cpt = ((ubuf[0] & 0x1F) << 6) | (ubuf[1] & 0x3F);
> > +		return 2;
> > +	} else if ((ubuf[0] & 0xF0) == 0xE0) {
>                               0xE0
> 
> > +		*cpt = ((ubuf[0] & 0x0F) << 12)
> > +			| ((ubuf[1] & 0x3F) << 6)
> > +			| (ubuf[2] & 0x3F);
> > +		return 3;
> > +	} else if ((ubuf[0] & 0xF8) == 0xF0) {
>                               0xF0
> 
> > +		*cpt = ((ubuf[0] & 0x07) << 18)
> > +			| ((ubuf[1] & 0x3F) << 12)
> > +			| ((ubuf[2] & 0x3F) << 6)
> > +			| (ubuf[3] & 0x3F);
> > +		return 4;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +#endif
> > Index: bin/ksh/unicode.h
> > ===================================================================
> > --- bin/ksh/unicode.h    (new file)
> > +++ bin/ksh/unicode.h    (working copy)
> > --- /dev/null	2024-12-17 11:54:03.396000088 +0800
> > +++ bin/ksh/unicode.h	2024-12-17 09:19:00.521730569 +0800
> > @@ -0,0 +1,7 @@
> > +#ifndef UNICODE_H
> > +#define UNICODE_H
> > +
> > +int is_fullwidth(unsigned long);
> > +int u8_to_cpt(const char *, unsigned long *);
> > +
> > +#endif	/* UNICODE_H */
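
One real weakness of the quoted u8_to_cpt() is that it trusts its input:
continuation bytes are never checked, so a truncated sequence at the end
of a string can read past the terminating NUL. If the hand-rolled decoder
stays, a stricter variant might look like the sketch below. The name
u8_to_cpt_strict() is hypothetical, and rejecting overlong forms,
surrogates and values above U+10FFFF is my assumption about what "valid"
should mean here.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stricter variant of u8_to_cpt(): decodes one code point
 * from buf and returns the number of bytes consumed, or 0 if the
 * sequence is malformed. Continuation bytes are checked one at a time,
 * so a NUL terminator stops the scan before anything past it is read.
 */
int
u8_to_cpt_strict(const char *buf, unsigned long *cpt)
{
	const unsigned char *ubuf = (const unsigned char *)buf;
	unsigned long c, min;
	int i, len;

	assert(buf != NULL && cpt != NULL);
	if (ubuf[0] <= 0x7F) {
		*cpt = ubuf[0];
		return 1;
	} else if ((ubuf[0] & 0xE0) == 0xC0) {
		len = 2; min = 0x80; c = ubuf[0] & 0x1F;
	} else if ((ubuf[0] & 0xF0) == 0xE0) {
		len = 3; min = 0x800; c = ubuf[0] & 0x0F;
	} else if ((ubuf[0] & 0xF8) == 0xF0) {
		len = 4; min = 0x10000; c = ubuf[0] & 0x07;
	} else
		return 0;	/* stray continuation or invalid lead byte */

	for (i = 1; i < len; i++) {
		if ((ubuf[i] & 0xC0) != 0x80)
			return 0;	/* truncated or not a continuation */
		c = (c << 6) | (ubuf[i] & 0x3F);
	}
	/* Reject overlong forms, out-of-range values and surrogates. */
	if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
		return 0;
	*cpt = c;
	return len;
}
```

Like the original, it returns 1 for a NUL byte, so callers still have to
test for the end of the string themselves.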