From: Ingo Schwarze
Subject: Re: [REPOST] ksh: utf8 full width character support for emacs.c
To: enh@google.com
Cc: Omar Polo, tech@openbsd.org
Date: Mon, 31 Mar 2025 15:59:04 +0200

Hello Elliott,

enh wrote on Mon, Mar 31, 2025 at 08:40:04AM -0400:
> On Sun, Mar 30, 2025 at 7:47 PM Ingo Schwarze wrote:

>> the following *vastly simplified* model more or less works
>>  1. Assume that user-perceived characters are identical to
>>     legacy grapheme clusters.
>>  2. Assume that all width-0 code points are combining characters
>>     and vice versa.
>>  3. Assume that users want to operate on user-perceived characters
>>     rather than wanting to edit *parts* of user-perceived characters
[...]
>> Finally, note that our Perl code contained in the base system
>> contains some of the information that would be needed to refine
>> the assumptions 1. to 3. above in some respects - but most of these
>> refinements would not be supported by any C library, not even by
>> a C library boasting full POSIX locale support.  Such refinements
>> typically require specialized Unicode libraries that go far beyond
>> what any C library can do - and i think in shell command line editing,
>> going beyond the assumptions 1. to 3. would be quite crazy.
>> Like, who would expect shell command line editing to work well
>> for Hangul text processing?

> (funnily enough, in _practice_ this will actually work quite well
> because IMEs give you the pre-composed syllables rather than the
> individual jamo. and if you really did want to cope with absolutely
> everything you could possibly encounter, my comments
> https://android-review.googlesource.com/c/platform/bionic/+/2983497
> where i looked into this in detail in the context of Android's
> icu-based wcwidth() implementation suggest that your earlier "vastly
> simplified" algorithm would work fine for that too ...

Oh wow.  I always considered the bionic C library as an implementation
geared towards simplicity, minimalism, and compactness.
It turns out that is only part of the truth, right?  Providing
substantial ICU components inside libc is hard to call "minimalistic".

I mean, seriously, stuff like the following being available inside
the C library feels like quite advanced technology to me, certainly
significantly exceeding what POSIX requires.

  UCHAR_DEFAULT_IGNORABLE_CODE_POINT
  UCHAR_HANGUL_SYLLABLE_TYPE
  U_HST_VOWEL_JAMO
  U_HST_TRAILING_JAMO

That's the kind of stuff i had in mind when referring to Perl above.
It also illustrates my point that some of the concepts involved in
making this work perfectly are script- and language-specific.

> assuming you don't have the same bug in your wcwidth() that i fixed
> in Android's with that change :-) )

Thanks for sending this bugfix upstream to us!  :-D

But no.  Not only do we have the same bug you fixed here, and not only
do we have several bugs that were already fixed in your code before
you fixed the bug you are talking about (for example, we use
UCHAR_EAST_ASIAN_WIDTH regardless and do not consider vowel and
trailing jamo as special cases), but i also don't think fixing any of
this is in scope for the OpenBSD project, at least not right now.

I worry that if i proposed adding partial ICU internal APIs to our
libc in order to improve wcwidth(3), i might get myself shot on sight.
OpenBSD ideas of simplicity and minimalism appear to be a bit more
pronounced than Google's.  =:c)

Still, i consider your feedback valuable because it provides an
excellent illustration that perfection with respect to Unicode is
almost impossible, but that making things better is nonetheless often
possible and can often rely on surprisingly radical simplifications
of concepts.  Unicode purists might call such simplifications
incorrect and disgusting, and OpenBSD purists might call the same
simplifications absurdly bloated, but i don't doubt that there is
a range of tasks where they can actually be useful.

Yours,
  Ingo