From: enh
Subject: Re: [REPOST] ksh: utf8 full width character support for emacs.c
To: Ingo Schwarze
Cc: Omar Polo, tech@openbsd.org
Date: Mon, 31 Mar 2025 17:57:17 -0400

On Mon, Mar 31, 2025 at 9:59 AM Ingo Schwarze wrote:
>
> Hello Elliott,
>
> enh wrote on Mon, Mar 31, 2025 at 08:40:04AM -0400:
> > On Sun, Mar 30, 2025 at 7:47 PM Ingo Schwarze wrote:
>
> >> the following *vastly simplified* model more or less works
> >> 1. Assume that user-perceived characters are identical to
> >>    legacy grapheme clusters.
> >> 2. Assume that all width-0 code points are combining characters
> >>    and vice versa.
> >> 3. Assume that users want to operate on user-perceived characters
> >>    rather than wanting to edit *parts* of user-perceived characters
> [...]
> >> Finally, note that our Perl code contained in the base system
> >> contains some of the information that would be needed to refine
> >> the assumptions 1. to 3. above in some respects - but most of these
> >> refinements would not be supported by any C library, not even by
> >> a C library boasting full POSIX locale support. Such refinements
> >> typically require specialized Unicode libraries that go far beyond
> >> what any C library can do - and i think in shell command line editing,
> >> going beyond the assumptions 1. to 3. would be quite crazy.
> >> Like, who would expect shell command line editing to work well
> >> for Hangul text processing?
>
> > (funnily enough, in _practice_ this will actually work quite well
> > because IMEs give you the pre-composed syllables rather than the
> > individual jamo. and if you really did want to cope with absolutely
> > everything you could possibly encounter, my comments
> > https://android-review.googlesource.com/c/platform/bionic/+/2983497
> > where i looked into this in detail in the context of Android's
> > icu-based wcwidth() implementation suggest that your earlier "vastly
> > simplified" algorithm would work fine for that too ...
>
> Oh wow.
> I always considered the bionic C library as an implementation
> geared towards simplicity, minimalism, and compactness. It turns out
> that is only part of the truth, right? Providing substantial ICU
> components inside libc is hard to call "minimalistic".

i'm not sure i would have used any of those words? in particular, i've
always found "minimalism" a problematic word in all walks of life.
relevant to ours is the old joke from the MS Word folks along the lines
of "sure, any given user only uses 5% of our features ... the trouble is
they're all using a _different_ 5%". our team used to have a banner
reading "correctness, performance, completeness ... in that order",
which is probably as close to a philosophy as i remember.

but in this specific area of i18n/l10n, bionic is definitely a bit
interesting. the "rules" were basically:

1. most of the libc/POSIX api in this area is just plain useless. it was
   written before anyone knew what the problems were, by people who
   weren't shipping in nearly enough locales to even have any way to
   test an idea for fitness.
2. we should avoid luring people into the equivalent of "the bash script
   trap" where there's enough support for you to get started, but then
   you suddenly realize your existing solution is part of the problem.
   better to just make it clear from the beginning that -- if you're
   serious about i18n/l10n -- you wanted java or icu4c.

_but_ wcwidth() is a weird special case. on the one hand -- as you can
see from the bionic implementation -- that's _not_ something that icu
supports, nor is it something there's a unicode tr for. on the other,
it's legitimately quite important even for basic tty stuff. and it's
_definitely_ hairy enough that you don't want every piece of tty-based
code that tries to line up columns or do line editing to have to
reinvent that (and stay up to date with each year's new batch of unicode
code points).
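to make the "line up columns" case concrete, here is a minimal python sketch of the simplified model quoted above (width-0 code points attach to the preceding character). it is an illustration only, not code from ksh, bionic, or OpenBSD: `char_width`, `clusters`, `display_width`, and `pad` are hypothetical helper names, and python's `unicodedata` module stands in for wcwidth(3), so the results follow whatever unicode version the python build ships, and control characters are not handled.

```python
import unicodedata


def char_width(ch):
    """Rough stand-in for wcwidth(3): 0, 1, or 2 columns for one code point.

    Assumption: combining marks and format characters take no column,
    East Asian Wide/Fullwidth take two, and everything else takes one.
    Control characters (where wcwidth(3) returns -1) are not handled.
    """
    if unicodedata.combining(ch) or unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1


def clusters(text):
    """Group every width-0 code point with the character before it."""
    out = []
    for ch in text:
        if out and char_width(ch) == 0:
            out[-1] += ch   # attach to the preceding base character
        else:
            out.append(ch)  # start a new user-perceived character
    return out


def display_width(text):
    """Number of terminal columns the text is expected to occupy."""
    return sum(char_width(ch) for ch in text)


def pad(text, columns):
    """Pad with spaces to a column count rather than a code point count."""
    return text + " " * max(0, columns - display_width(text))


if __name__ == "__main__":
    for word in ["abc", "e\u0301", "\u65e5\u672c"]:
        print(pad(word, 6) + "|", clusters(word))
```

a line editor built on this would move the cursor and delete by the elements of `clusters()`, and would compute cursor columns with `display_width()` instead of counting bytes or code points.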
which comes nicely back to "minimalism" and one of the big flaws in any
argument based on minimalism --- "for whom?". some would argue that it's
"more minimal" to not have this in the OS. others would argue that that
just leads to it being reinvented in every would-be caller. neither side
is right or wrong; it's a philosophical divide of the kind that
"minimalism" tends to obscure. i suspect the closest to "minimalist" we
get -- though obviously this doesn't match everyone's definition --
would be "do it properly, or don't do it at all". hence no , but since
we do have wcwidth(), we ensure that it's always using the latest
unicode data.

> I mean, seriously, stuff like the following being available inside
> the C library feels like quite advanced technology to me,
> certainly significantly exceeding what POSIX requires.
>
>   UCHAR_DEFAULT_IGNORABLE_CODE_POINT
>   UCHAR_HANGUL_SYLLABLE_TYPE
>   U_HST_VOWEL_JAMO
>   U_HST_TRAILING_JAMO

if you look at ToT, you'll see it's written in rust now, using icux...
i _think_ that makes bionic the first libc to have a piece implemented
in rust? :-)

(in addition to being a useful proof of concept for both "reimplementing
parts of c libraries in rust on Android" _and_ "moving from icu4c to
[rust-based] icux", the new version is very slightly faster, but more
importantly works in static binaries as well as dynamic binaries
[because the rust stuff is factored in a way that lets us just include
it in libc rather than needing to dlopen(), and bionic's dlopen() --
like many -- doesn't work in static binaries].)

> That's the kind of stuff i had in mind when referring to Perl above.
> It also illustrates my point that some of the concepts involved in
> making this work perfectly are script- and language-specific.

yeah, one nice thing with wcwidth() that made it a good candidate is
that [afaik] it's not _locale dependent_. so much of this i18n/l10n
stuff is, and then you're really into the weeds.
(and, worse than that, often in a way that the c/posix apis just can't
represent.)

(i'll spare you my comment about the guy i knew who argued with a
straight face that _perl_ was "minimal" because you didn't need all
those other things out of /bin --- all you needed was perl. that was
probably the day i learned that "minimal" tends to mean "just the stuff
i like, not the stuff you like" more than anything actually meaningful!)

> > assuming you don't have the same bug in your wcwidth() that i fixed
> > in Android's with that change :-) )
>
> Thanks for sending this bugfix upstream to us! :-D

heh. if you're interested in this stuff, it's definitely worth running
the bionic tests. (insert lament that there's no c unit testing that
everyone agrees on, which means the unit tests are all c++.)

> But no, not only do we have the same bug you fixed here,
> not only do we have several bugs that were already fixed in your code
> before you fixed the bug you are talking about (for example,
> we use UCHAR_EAST_ASIAN_WIDTH regardless and do not consider
> vowel and trailing jamo as special cases), but i don't think
> fixing any of this is in scope for the OpenBSD project,
> at least not right now.
>
> I worry if i would propose adding partial ICU internal APIs into
> our libc in order to improve wcwidth(3), i might get myself
> shot on sight. OpenBSD ideas of simplicity and minimalism
> appear to be a bit more pronounced than Google's. =:c)

i mean, you can totally do this with a script -- probably even in sed
:-) -- yourself. but since we already have people whose job it is to
keep icu updated, and because [having learned the hard way] sometimes
being _consistent_ across apis is better than being right [half the
time]... icu actually seemed like an obvious choice.
i think that's perhaps a real philosophical change that has occurred
during my time working on bionic --- it's a common failure mode on
Android (and presumably other OSes) to think "i work on a $X that just
happens to run on $OS" rather than "i work on $OS's $X". it's a subtle
wording difference, but a hugely different outlook, and one that can
lead to what might seem like odd choices to others.

> Still, i consider your feedback valuable because it provides an
> excellent illustration that perfection with respect to Unicode
> is almost impossible, but nonetheless making things better is often
> possible and can often use surprisingly radical simplifications of
> concepts - Unicode purists might call such simplifications incorrect
> and disgusting, OpenBSD purists might call the same simplifications
> absurdly bloated, but i don't doubt that there is a range of tasks
> where they can actually be useful.

exactly. and in case it wasn't clear, let me be explicit: i joined in to
support your pragmatic view. "half an eye is better than no eye" :-)

(in case anyone's read me repeatedly argue _against_ supporting grapheme
clusters or other notions of user-perceivable "character" over on the
toybox mailing list, i think the big thing here is that you _can_ make
command-line editing "pretty good in practice" for significantly more
people without solving _all_ the problems. *cough* bidi *cough*. whereas
trying to teach tr(1), say, about notions of "character" beyond byte,
something that comes up on the toybox mailing list now and then ... i'm
just not sure who _needs_ that, or for what. plan9 went to a lot of
trouble to do that, but afaik "just to show that traditional algorithms
could be extended to unicode without scaling issues" -- which, sure, was
independently useful work in its own right back when not everything was
8-bit clean! -- rather than because they were needed by actual human
users.)

> Yours,
>   Ingo