From: Ingo Schwarze <schwarze@usta.de>
Subject: Re: [REPOST] ksh: utf8 full width character support for emacs.c
To: enh@google.com
Cc: Omar Polo <op@omarpolo.com>, tech@openbsd.org
Date: Mon, 31 Mar 2025 15:59:04 +0200

Hello Elliott,

enh wrote on Mon, Mar 31, 2025 at 08:40:04AM -0400:
> On Sun, Mar 30, 2025 at 7:47 PM Ingo Schwarze wrote:

>> the following *vastly simplified* model more or less works
>>  1. Assume that user-perceived characters are identical to
>>     legacy grapheme clusters.
>>  2. Assume that all width-0 code points are combining characters
>>     and vice versa.
>>  3. Assume that users want to operate on user-perceived characters
>>     rather than wanting to edit *parts* of user-perceived characters
[...]
>> Finally, note that our Perl code contained in the base system
>> contains some of the information that would be needed to refine
>> the assumptions 1. to 3. above in some respects - but most of these
>> refinements would not be supported by any C library, not even by
>> a C library boasting full POSIX locale support.  Such refinements
>> typically require specialized Unicode libraries that go far beyond
>> what any C library can do - and i think in shell command line editing,
>> going beyond the assumptions 1. to 3. would be quite crazy.
>> Like, who would expect shell command line editing to work well
>> for Hangul text processing?

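To make the simplified model quoted above concrete, here is a minimal
sketch in C of how assumptions 1. to 3. translate into finding the
extent of one user-perceived character.  It is an illustration only,
not code from ksh or from the proposed patch.  The names toy_width()
and cluster_len() are hypothetical, and toy_width() is a deliberately
tiny stand-in for wcwidth(3) that knows a single combining block, so
that the example does not depend on any locale setup.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for wcwidth(3).  It only knows the
 * Combining Diacritical Marks block; everything else is width 1.
 */
static int
toy_width(uint32_t cp)
{
	if (cp >= 0x0300 && cp <= 0x036F)
		return 0;
	return 1;
}

/*
 * Return the number of code points that make up the user-perceived
 * character starting at s[0], under the simplified model: one base
 * code point followed by every directly following width-0 code point.
 * Returns 0 for an empty buffer.
 */
static size_t
cluster_len(const uint32_t *s, size_t n)
{
	size_t i;

	if (n == 0)
		return 0;
	for (i = 1; i < n && toy_width(s[i]) == 0; i++)
		continue;
	return i;
}
```

With such a helper, cursor motion and deletion would step over
cluster_len() code points at a time instead of over one code point,
which is what assumption 3. asks for.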
> (funnily enough, in _practice_ this will actually work quite well
> because IMEs give you the pre-composed syllables rather than the
> individual jamo. and if you really did want to cope with absolutely
> everything you could possibly encounter, my comments
> https://android-review.googlesource.com/c/platform/bionic/+/2983497
> where i looked into this in detail in the context of Android's
> icu-based wcwidth() implementation suggest that your earlier "vastly
> simplified" algorithm would work fine for that too ...

Oh wow.  I always considered the bionic C library as an implementation
geared towards simplicity, minimalism, and compactness.  It turns out
that is only part of the truth, right?  Providing substantial ICU
components inside libc is hard to call "minimalistic".

I mean, seriously, stuff like the following being available inside
the C library feels like quite advanced technology to me,
certainly significantly exceeding what POSIX requires.

  UCHAR_DEFAULT_IGNORABLE_CODE_POINT
  UCHAR_HANGUL_SYLLABLE_TYPE
  U_HST_VOWEL_JAMO
  U_HST_TRAILING_JAMO

That's the kind of stuff i had in mind when referring to Perl above.
It also illustrates my point that some of the concepts involved in
making this work perfectly are script- and language-specific.

> assuming you don't have the same bug in your wcwidth() that i fixed
> in Android's with that change :-) )

Thanks for sending this bugfix upstream to us!  :-D
But no, not only do we have the same bug you fixed here,
not only do we have several bugs that were already fixed in your code
before you fixed the bug you are talking about (for example,
we use UCHAR_EAST_ASIAN_WIDTH regardless and do not consider
vowel and trailing jamo as special cases), but i don't think
fixing any of this is in scope for the OpenBSD project,
at least not right now.
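
For illustration only, treating vowel and trailing jamo as special
cases could look roughly like the sketch below, written without ICU.
It is neither the bionic code nor a proposal for OpenBSD libc.  Both
function names are hypothetical, and the code point ranges are an
assumption based on the Unicode Hangul_Syllable_Type data that should
be checked against the current data files before relying on them.

```c
#include <stdint.h>

/*
 * Hypothetical helper: return 1 if cp is a conjoining vowel or
 * trailing jamo.  Ranges are assumed from Hangul_Syllable_Type
 * (values V and T) and are hardcoded here instead of asking ICU.
 */
static int
is_vowel_or_trailing_jamo(uint32_t cp)
{
	return (cp >= 0x1160 && cp <= 0x11FF) ||	/* Hangul Jamo V, T */
	    (cp >= 0xD7B0 && cp <= 0xD7C6) ||		/* Extended-B V */
	    (cp >= 0xD7CB && cp <= 0xD7FB);		/* Extended-B T */
}

/*
 * Hypothetical wrapper: table_width is whatever an East Asian Width
 * based lookup returned for cp.  Vowel and trailing jamo are forced
 * to width 0 so that they attach to the preceding leading jamo.
 */
static int
jamo_aware_width(uint32_t cp, int table_width)
{
	if (is_vowel_or_trailing_jamo(cp))
		return 0;
	return table_width;
}
```

Under the simplified model, jamo forced to width 0 in this way are
simply handled like combining characters.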

I worry that if i proposed adding parts of ICU's internal APIs to
our libc in order to improve wcwidth(3), i might get myself
shot on sight.  OpenBSD ideas of simplicity and minimalism
appear to be a bit more pronounced than Google's.  =:c)

Still, i consider your feedback valuable because it provides an
excellent illustration that perfection with respect to Unicode
is almost impossible, but nonetheless making things better is often
possible and can often use surprisingly radical simplifications of
concepts - Unicode purists might call such simplifications incorrect
and disgusting, OpenBSD purists might call the same simplifications
absurdly bloated, but i don't doubt that there is a range of tasks
where they can actually be useful.

Yours,
  Ingo