From: enh <enh@google.com>
Subject: Re: [REPOST] ksh: utf8 full width character support for emacs.c
To: Ingo Schwarze <schwarze@usta.de>
Cc: Omar Polo <op@omarpolo.com>, tech@openbsd.org
Date: Mon, 31 Mar 2025 17:57:17 -0400

On Mon, Mar 31, 2025 at 9:59 AM Ingo Schwarze <schwarze@usta.de> wrote:
>
> Hello Elliott,
>
> enh wrote on Mon, Mar 31, 2025 at 08:40:04AM -0400:
> > On Sun, Mar 30, 2025 at 7:47 PM Ingo Schwarze wrote:
>
> >> the following *vastly simplified* model more or less works
> >>  1. Assume that user-perceived characters are identical to
> >>     legacy grapheme clusters.
> >>  2. Assume that all width-0 code points are combining characters
> >>     and vice versa.
> >>  3. Assume that users want to operate on user-perceived characters
> >>     rather than wanting to edit *parts* of user-perceived characters
> [...]
> >> Finally, note that our Perl code contained in the base system
> >> contains some of the information that would be needed to refine
> >> the assumptions 1. to 3. above in some respects - but most of these
> >> refinements would not be supported by any C library, not even by
> >> a C library boasting full POSIX locale support.  Such refinements
> >> typically require specialized Unicode libraries that go far beyond
> >> what any C library can do - and i think in shell command line editing,
> >> going beyond the assumptions 1. to 3. would be quite crazy.
> >> Like, who would expect shell command line editing to work well
> >> for Hangul text processing?
>
> > (funnily enough, in _practice_ this will actually work quite well
> > because IMEs give you the pre-composed syllables rather than the
> > individual jamo. and if you really did want to cope with absolutely
> > everything you could possibly encounter, my comments
> > https://android-review.googlesource.com/c/platform/bionic/+/2983497
> > where i looked into this in detail in the context of Android's
> > icu-based wcwidth() implementation suggest that your earlier "vastly
> > simplified" algorithm would work fine for that too ...
>
> Oh wow.  I always considered the bionic C library as an implementation
> geared towards simplicity, minimalism, and compactness.  It turns out
> that is only part of the truth, right?  Providing substantial ICU
> components inside libc is hard to call "minimalistic".

i'm not sure i would have used any of those words? in particular, i've
always found "minimalism" a problematic word in all walks of life.
relevant to ours is the old joke from the MS Word folks along the
lines of "sure, any given user only uses 5% of our features ... the
trouble is they're all using a _different_ 5%".

our team used to have a banner reading "correctness, performance,
completeness ... in that order", which is probably as close to a
philosophy as i remember.

but in this specific area of i18n/l10n, bionic is definitely a bit
interesting. the "rules" were basically:

1. most of the libc/POSIX api in this area is just plain useless. it
was written before anyone knew what the problems were, by people who
weren't shipping in nearly enough locales to even have any way to test
an idea for fitness.

2. we should avoid luring people into the equivalent of "the bash
script trap" where there's enough support for you to get started, but
then you suddenly realize your existing solution is part of the
problem. better to just make it clear from the beginning that -- if
you're serious about i18n/l10n -- you wanted java or icu4c.

_but_ wcwidth() is a weird special case. on the one hand -- as you can
see from the bionic implementation -- that's _not_ something that icu
supports, nor is it something there's a unicode tr for. on the other,
it's legitimately quite important even for basic tty stuff. and it's
_definitely_ hairy enough that you don't want every piece of tty-based
code that tries to line up columns or do line editing have to reinvent
that (and stay up to date with each year's new batch of unicode code
points).
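
(a minimal sketch of what such a caller looks like, assuming a POSIX libc that declares wcwidth() once _XOPEN_SOURCE is defined. `display_width` is a made-up helper name, not part of bionic or any other libc, and all the hard parts stay hidden inside wcwidth() itself:)

```c
#define _XOPEN_SOURCE 700	/* ask the headers to declare wcwidth() */
#include <assert.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper: return the number of terminal columns needed to
 * display the multibyte string s in the current LC_CTYPE locale, or -1
 * if s holds an invalid sequence or a character that wcwidth() reports
 * as having no defined width (for example a control character).
 */
int
display_width(const char *s)
{
	mbstate_t st;
	size_t left, n;
	wchar_t wc;
	int cols = 0, w;

	assert(s != NULL);
	memset(&st, 0, sizeof(st));
	left = strlen(s);
	while (left > 0) {
		n = mbrtowc(&wc, s, left, &st);
		if (n == (size_t)-1 || n == (size_t)-2)
			return -1;	/* invalid or truncated sequence */
		if (n == 0)
			break;		/* embedded NUL; cannot happen here */
		w = wcwidth(wc);
		if (w < 0)
			return -1;	/* no width defined for this character */
		cols += w;
		s += n;
		left -= n;
	}
	return cols;
}
```

a caller lining up columns would call setlocale(LC_CTYPE, "") once at startup and then pad each field by the difference between the column width it wants and display_width(field). how to treat a -1 result is left to the caller in this sketch.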

which comes nicely back to "minimalism" and one of the big flaws in
any argument based on minimalism --- "for whom?". some would argue
that it's "more minimal" to not have this in the OS. others would
argue that that just leads to it being reinvented in every would-be
caller. neither side is right or wrong; it's a philosophical divide of
the kind that "minimalism" tends to obscure.

i suspect the closest to "minimalist" we get -- though obviously this
doesn't match everyone's definition -- would be "do it properly, or
don't do it at all". hence no <monetary.h>, but since we do have
wcwidth(), we ensure that it's always using the latest unicode data.

> I mean, seriously, stuff like the following being available inside
> the C library feels like quite advanced technology to me,
> certainly significantly exceeding what POSIX requires.
>
>   UCHAR_DEFAULT_IGNORABLE_CODE_POINT
>   UCHAR_HANGUL_SYLLABLE_TYPE
>   U_HST_VOWEL_JAMO
>   U_HST_TRAILING_JAMO

if you look at ToT, you'll see it's written in rust now, using icu4x...
i _think_ that makes bionic the first libc to have a piece implemented
in rust? :-)

(in addition to being a useful proof of concept for both
"reimplementing parts of c libraries in rust on Android" _and_ "moving
from icu4c to [rust-based] icu4x", the new version is very slightly
faster, but more importantly works in static binaries as well as
dynamic binaries [because the rust stuff is factored in a way that
lets us just include it in libc rather than needing to dlopen(), and
bionic's dlopen() -- like many -- doesn't work in static binaries].)

> That's the kind of stuff i had in mind when referring to Perl above.
> It also illustrates my point that some of the concepts involved in
> making this work perfectly are script- and language-specific.

yeah, one nice thing with wcwidth() that made it a good candidate is
that [afaik] it's not _locale dependent_. so much of this i18n/l10n
stuff is, and then you're really into the weeds. (and, worse than
that, often in a way that the c/posix apis just can't represent.)

(i'll spare you my comment about the guy i knew who argued with a
straight face that _perl_ was "minimal" because you didn't need all
those other things out of /bin --- all you needed was perl. that was
probably the day i learned that "minimal" tends to mean "just the
stuff i like, not the stuff you like" more than anything actually
meaningful!)

> > assuming you don't have the same bug in your wcwidth() that i fixed
> > in Android's with that change :-) )
>
> Thanks for sending this bugfix upstream to us!  :-D

heh. if you're interested in this stuff, it's definitely worth running
the bionic tests.

(insert lament that there's no c unit testing that everyone agrees on,
which means the unit tests are all c++.)

> But no, not only do we have the same bug you fixed here,
> not only do we have several bugs that were already fixed in your code
> before you fixed the bug you are talking about (for example,
> we use UCHAR_EAST_ASIAN_WIDTH regardless and do not consider
> vowel and trailing jamo as special cases), but i don't think
> fixing any of this is in scope for the OpenBSD project,
> at least not right now.
>
> I worry if i would propose adding partial ICU internal APIs into
> our libc in order to improve wcwidth(3), i might get myself
> shot on sight.  OpenBSD ideas of simplicity and minimalism
> appear to be a bit more pronounced than Google's.  =:c)

i mean, you can totally do this with a script -- probably even in sed
:-) -- yourself. but since we already have people whose job it is to
keep icu updated, and because [having learned the hard way] sometimes
being _consistent_ across apis is better than being right [half the
time]... icu actually seemed like an obvious choice.

i think that's perhaps a real philosophical change that has occurred
during my time working on bionic --- it's a common failure mode on
Android (and presumably other OSes) to think "i work on a $X that just
happens to run on $OS" rather than "i work on $OS's $X". it's a subtle
wording difference, but a hugely different outlook, and one that can
lead to what might seem like odd choices to others.

> Still, i consider your feedback valuable because it provides an
> excellent illustration that perfection with respect to Unicode
> is almost impossible, but nonetheless making things better is often
> possible and can often use surprisingly radical simplifications of
> concepts - Unicode purists might call such simplifications incorrect
> and disgusting, OpenBSD purists might call the same simplifications
> absurdly bloated, but i don't doubt that there is a range of tasks
> where they can actually be useful.

exactly. and in case it wasn't clear, let me be explicit: i joined in
to support your pragmatic view. "half an eye is better than no eye"
:-)

(in case anyone's read me repeatedly argue _against_ supporting
grapheme clusters or other notions of user-perceivable "character"
over on the toybox mailing list, i think the big thing here is that
you _can_ make command-line editing "pretty good in practice" for
significantly more people without solving _all_ the problems. *cough*
bidi *cough*. whereas trying to teach tr(1), say, about notions of
"character" beyond byte, something that comes up on the toybox mailing
list now and then ... i'm just not sure who _needs_ that, or for what.
plan9 went to a lot of trouble to do that, but afaik "just to show
that traditional algorithms could be extended to unicode without
scaling issues" -- which, sure, was independently useful work in its
own right back when not everything was 8-bit clean! -- rather than
because they were needed by actual human users.)
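
(to make the "vastly simplified" model quoted at the top concrete, here is a rough sketch of cursor movement under assumptions 1. to 3.: every width-0 code point is treated as belonging to the preceding character. this is not the ksh patch under discussion. `prev_char_start`, `next_char_end` and `toy_width` are hypothetical names, and `toy_width` is only a stand-in so the sketch does not depend on any locale:)

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical stand-in for wcwidth(): width 0 for U+0300..U+036F only. */
static int
toy_width(wchar_t wc)
{
	return (wc >= 0x0300 && wc <= 0x036F) ? 0 : 1;
}

/*
 * Cursor left: return the index where the user-perceived character
 * before pos starts, skipping back over trailing width-0 code points.
 */
size_t
prev_char_start(const wchar_t *buf, size_t pos, int (*width)(wchar_t))
{
	assert(buf != NULL && width != NULL);
	if (pos == 0)
		return 0;
	pos--;
	while (pos > 0 && width(buf[pos]) == 0)
		pos--;
	return pos;
}

/*
 * Cursor right: return the index just past the user-perceived character
 * starting at pos, swallowing any width-0 code points that follow it.
 */
size_t
next_char_end(const wchar_t *buf, size_t len, size_t pos,
    int (*width)(wchar_t))
{
	assert(buf != NULL && width != NULL);
	if (pos >= len)
		return len;
	pos++;
	while (pos < len && width(buf[pos]) == 0)
		pos++;
	return pos;
}
```

a real editor would pass wcwidth itself (it has the same `int (wchar_t)` shape) instead of `toy_width`, and would have to decide separately what to do with code points for which wcwidth() returns -1.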

> Yours,
>   Ingo