From: Omar Polo <op@omarpolo.com>
Subject: Re: [REPOST] ksh: utf8 full width character support for emacs.c
To: Christian Schulte <cs@schulte.it>
Cc: tech@openbsd.org
Date: Sun, 30 Mar 2025 20:37:06 +0200

Christian Schulte <cs@schulte.it> wrote:
> Of course there are consequences. char in C has always and will always
> correspond to a byte (8 bit). That's the root cause of the issue at
> hand. Well. In Java char is 16 bit so it's suffering the same
> limitations. There never has been a type in C reflecting a byte other
> than char. But char stands for character, not byte. These days a
> character in unicode is at least a 20 bit value.

sorry for nitpicking, but no.  grapheme clusters (i.e. what a user
perceives as a "character") can be more than one code point long.
you can easily find emojis that are ten code points long.

(just using emojis as an example, actual human languages make use of
combining characters as well)
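
to make that concrete, here's a rough sketch (not code from ksh; the
function name and the sample string are made up for illustration) that
counts the code points in a UTF-8 string by skipping continuation
bytes.  it assumes the input is valid UTF-8.  the sample is the "family"
emoji, which is drawn as one glyph but is made of four emoji joined by
three ZERO WIDTH JOINERs.

```c
#include <assert.h>
#include <stddef.h>

/*
 * One user-perceived character: MAN, ZWJ, WOMAN, ZWJ, GIRL, ZWJ, BOY
 * (U+1F468 U+200D U+1F469 U+200D U+1F467 U+200D U+1F466), UTF-8 encoded.
 */
static const char family[] =
    "\xF0\x9F\x91\xA8" "\xE2\x80\x8D"
    "\xF0\x9F\x91\xA9" "\xE2\x80\x8D"
    "\xF0\x9F\x91\xA7" "\xE2\x80\x8D"
    "\xF0\x9F\x91\xA6";

/*
 * Hypothetical helper: count the code points in a NUL-terminated string,
 * assuming it is valid UTF-8.  Every byte that is not a continuation
 * byte (10xxxxxx) starts a new code point.
 */
size_t
count_code_points(const char *s)
{
	size_t n = 0;

	assert(s != NULL);
	for (; *s != '\0'; s++)
		if (((unsigned char)*s & 0xC0) != 0x80)
			n++;
	return n;
}
```

count_code_points(family) comes out as 7 for something the user sees
as a single character, and "a" followed by COMBINING ACUTE ACCENT
("a\xCC\x81") counts as 2.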

> Regarding that
> "setlocale" thing. There always has been an implicit setlocale from day
> one making char to mean 7 bit US-ASCII, although it's historically an 8
> bit value. So all those other encodings making use of that 8 bit
> property, like iso8895 and such were always just a work around to
> provide more than those ASCII characters. Just give up on that "a
> char(acter) is a byte" invariant and accept that a char(acter) these
> days is at least a 20 bit value. There is no drawback on e.g. shell
> scripts containing unicode and such, if the shell would just give up on
> that "a character is a byte" invariant, which it could by giving up on
> char in favour of wchar_t.

which i'm not sure actually fixes anything, given that a wchar_t is
not a grapheme cluster.

("anything" in the context of the thread, i.e. hitting the left arrow
key and seeing the cursor skip over a whole character rather than
getting stuck somewhere between the wchar_t LATIN SMALL LETTER A and the
wchar_t COMBINING ACUTE ACCENT)
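
a minimal sketch of what i mean, assuming a wchar_t that holds one
Unicode code point.  none of this is from emacs.c: the function names
are invented, and is_combining() only knows about the Combining
Diacritical Marks block (U+0300..U+036F), which is nowhere near real
grapheme cluster segmentation.  it's just enough to show the two
cursor behaviours.

```c
#include <stddef.h>
#include <wchar.h>

/*
 * Hypothetical and deliberately incomplete: treats only U+0300..U+036F
 * as combining.  Real segmentation needs the Unicode rules and tables.
 */
static int
is_combining(wchar_t wc)
{
	return wc >= 0x0300 && wc <= 0x036F;
}

/* Naive "cursor left": step back by exactly one wchar_t. */
size_t
left_naive(size_t pos)
{
	return pos > 0 ? pos - 1 : 0;
}

/*
 * "Cursor left" that keeps stepping back while it is sitting on a
 * combining mark, so it stops on the base character instead.
 */
size_t
left_cluster(const wchar_t *buf, size_t pos)
{
	if (pos == 0)
		return 0;
	pos--;
	while (pos > 0 && is_combining(buf[pos]))
		pos--;
	return pos;
}
```

with the line { L'x', L'a', 0x0301, 0 } and the cursor at the end
(position 3), left_naive() lands on position 2, between the "a" and
its accent, while left_cluster() lands on position 1, in front of the
whole accented letter.  switching from char to wchar_t alone only gets
you the first one.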