[REPOST] ksh: utf8 full width character support for emacs.c
On 3/16/25 15:49, Gong Zhile wrote:
> Full-width characters are common on systems set up for East Asian languages
> such as Chinese, Japanese and Korean. They take twice the width of a normal
> ASCII character, but x_size counts each of them as one unit. When navigating
> between those characters in emacs line editing mode, the cursor loses track
> and messes up the line, making input really difficult.
>
> I tried to make x_size count correctly by using static variables in the
> function and looking characters up in a table generated from
> ‘EastAsianWidth.txt’. The characters counted as width 2 are mainly Kanji,
> Katakana, Hiragana, Hangul, Roman full-width characters, emojis etc.
>
> Expected behavior (after patching): the cursor lands correctly while
> navigating between full-width characters, and line editing commands (like
> x_transpose) work correctly.
>
> Known issue: when a line with full-width characters runs off the screen, the
> angle bracket is not placed correctly. (This is not easy to handle when
> full-width characters cross xx_cols.)
>
> Tested on: rxvt-unicode, xterm
wchar_t on OpenBSD and most other Unix-like OSes is 32-bit UTF-32.
Others use 16-bit UTF-16, with surrogate pairs for everything > 0xffff.
Some (microcontroller) libraries use 8-bit UTF-8. You can detect which
one you have by compiling
const wchar_t *s = L"\U0010ffff";
and checking what the compiler produces:
UTF-32: wcslen(s) == 1 && *s == 0x10ffff
UTF-16: wcslen(s) == 2 && s[0] == 0xdbff && s[1] == 0xdfff
UTF-8: wcslen(s) == 4 && s[0] == 0xf4 && s[1] == 0x8f
&& s[2] == 0xbf && s[3] == 0xbf
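
A minimal sketch of that check as a C function, assuming a hosted C99
compiler. wchar_encoding_bits() is a made-up name for illustration and not
anything from ksh. The casts to wchar_t keep the comparisons valid whether
wchar_t is signed or unsigned.

```c
#include <stddef.h>
#include <wchar.h>

/*
 * Classify the wchar_t encoding by looking at how the compiler
 * encoded U+10FFFF. Returns 32, 16 or 8, or 0 if the layout is
 * not one of the three described above.
 * Hypothetical helper, for illustration only.
 */
int
wchar_encoding_bits(void)
{
	static const wchar_t s[] = L"\U0010ffff";
	size_t n = wcslen(s);

	/* UTF-32: one unit holding the code point itself */
	if (n == 1 && s[0] == (wchar_t)0x10ffff)
		return 32;
	/* UTF-16: high surrogate followed by low surrogate */
	if (n == 2 && s[0] == (wchar_t)0xdbff && s[1] == (wchar_t)0xdfff)
		return 16;
	/* UTF-8: four bytes f4 8f bf bf */
	if (n == 4 && s[0] == (wchar_t)0xf4 && s[1] == (wchar_t)0x8f &&
	    s[2] == (wchar_t)0xbf && s[3] == (wchar_t)0xbf)
		return 8;
	return 0;
}
```

Calling it from main() and printing the result is enough to see which of
the three cases a given toolchain falls into.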
Getting full Unicode support would mean replacing every 8-bit char with
wchar_t and using the wide character string functions instead of the
8-bit string functions. Everything else will always be a non-portable
hack. The same goes for multibyte strings. That could touch everything.
Such a shell would be cool to have, of course, but it is quite a
refactoring effort. So you would end up with the current shell
unchanged, and a new shell (uksh) to choose as a starting point, only to
notice that this works only once all other software has been refactored
from char to wchar_t as well.
> +
> +int u8_to_cpt(const char *buf, unsigned long *cpt) {
If this is supposed to convert UTF-8 to UTF-32, it's wrong.
> + const unsigned char *ubuf = buf;
> +
> + if (ubuf[0] <= 0x7F) {
> + *cpt = ubuf[0];
> + return 1;
> + } else if ((ubuf[0] & 0xE0) == 0xC0) {
0xC0
> + *cpt = ((ubuf[0] & 0x1F) << 6) | (ubuf[1] & 0x3F);
> + return 2;
> + } else if ((ubuf[0] & 0xF0) == 0xE0) {
0xE0
> + *cpt = ((ubuf[0] & 0x0F) << 12)
> + | ((ubuf[1] & 0x3F) << 6)
> + | (ubuf[2] & 0x3F);
> + return 3;
> + } else if ((ubuf[0] & 0xF8) == 0xF0) {
0xF0
> + *cpt = ((ubuf[0] & 0x07) << 18)
> + | ((ubuf[1] & 0x3F) << 12)
> + | ((ubuf[2] & 0x3F) << 6)
> + | (ubuf[3] & 0x3F);
> + return 4;
> + }
> +
> + return 0;
> +}
> +
> +#endif
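
For comparison, here is a sketch of a decoder that validates its input. It
is not part of the patch and is only one guess at what a stricter
u8_to_cpt could look like. The name u8_decode_checked() and the extra
length argument are assumptions made for this example. It rejects
truncated sequences, bytes that are not continuation bytes, overlong
encodings, surrogates and values above 0x10ffff.

```c
#include <stddef.h>

/*
 * Decode one UTF-8 sequence from buf[0..len-1] into *cpt.
 * Returns the number of bytes consumed (1 to 4), or 0 if the
 * input is empty, truncated or malformed; *cpt is untouched on
 * failure. Hypothetical helper, for illustration only.
 */
int
u8_decode_checked(const char *buf, size_t len, unsigned long *cpt)
{
	const unsigned char *u = (const unsigned char *)buf;
	unsigned long c, min;
	size_t i, n;

	if (len == 0)
		return 0;
	if (u[0] <= 0x7F) {
		*cpt = u[0];
		return 1;
	} else if ((u[0] & 0xE0) == 0xC0) {
		n = 2; min = 0x80; c = u[0] & 0x1F;
	} else if ((u[0] & 0xF0) == 0xE0) {
		n = 3; min = 0x800; c = u[0] & 0x0F;
	} else if ((u[0] & 0xF8) == 0xF0) {
		n = 4; min = 0x10000; c = u[0] & 0x07;
	} else
		return 0;	/* stray continuation or invalid lead byte */

	if (len < n)
		return 0;	/* truncated sequence */
	for (i = 1; i < n; i++) {
		if ((u[i] & 0xC0) != 0x80)
			return 0;	/* not a continuation byte */
		c = (c << 6) | (u[i] & 0x3F);
	}
	if (c < min)
		return 0;	/* overlong encoding */
	if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
		return 0;	/* out of range or surrogate */
	*cpt = c;
	return (int)n;
}
```

A caller would pass the number of bytes remaining in the edit buffer as
len, so a multibyte sequence cut off at the end of the buffer is reported
as an error instead of being read past the end.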
> Index: bin/ksh/unicode.h
> ===================================================================
> --- bin/ksh/unicode.h (new file)
> +++ bin/ksh/unicode.h (working copy)
> --- /dev/null 2024-12-17 11:54:03.396000088 +0800
> +++ bin/ksh/unicode.h 2024-12-17 09:19:00.521730569 +0800
> @@ -0,0 +1,7 @@
> +#ifndef UNICODE_H
> +#define UNICODE_H
> +
> +int is_fullwidth(unsigned long);
> +int u8_to_cpt(const char *, unsigned long *);
> +
> +#endif /* UNICODE_H */
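
The table lookup behind is_fullwidth() is not quoted here, so the
following is only a sketch of how such a lookup is commonly done: a sorted
array of code point ranges searched with a binary search. The name
is_fullwidth_sketch() and the struct are invented for this example, and
the table is a small hand-picked excerpt of wide and full-width ranges,
not the full data from EastAsianWidth.txt.

```c
#include <stddef.h>

/* Inclusive range of code points; hypothetical type for this sketch. */
struct cp_range {
	unsigned long first;
	unsigned long last;
};

/* Excerpt only, sorted by first code point. A real table is much longer. */
static const struct cp_range wide_ranges[] = {
	{ 0x1100, 0x115F },	/* Hangul Jamo initial consonants */
	{ 0x3041, 0x3096 },	/* Hiragana */
	{ 0x30A1, 0x30FA },	/* Katakana */
	{ 0x4E00, 0x9FFF },	/* CJK Unified Ideographs */
	{ 0xAC00, 0xD7A3 },	/* Hangul Syllables */
	{ 0xFF01, 0xFF60 },	/* Fullwidth ASCII variants */
	{ 0xFFE0, 0xFFE6 },	/* Fullwidth signs */
};

/* Return 1 if cpt falls in one of the ranges above, else 0. */
int
is_fullwidth_sketch(unsigned long cpt)
{
	size_t lo = 0;
	size_t hi = sizeof(wide_ranges) / sizeof(wide_ranges[0]);

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (cpt < wide_ranges[mid].first)
			hi = mid;
		else if (cpt > wide_ranges[mid].last)
			lo = mid + 1;
		else
			return 1;
	}
	return 0;
}
```

With a lookup like this, a width-aware x_size would add 2 for code points
where the function returns 1 and 1 otherwise.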
--
Christian