From:
Omar Polo <op@omarpolo.com>
Subject:
Re: [REPOST] ksh: utf8 full width character support for emacs.c
To:
Christian Schulte <cs@schulte.it>
Cc:
tech@openbsd.org
Date:
Sun, 30 Mar 2025 20:37:06 +0200

  • Christian Schulte <cs@schulte.it> wrote:
    > Of course there are consequences. char in C has always and will always
    > correspond to a byte (8 bit). That's the root cause of the issue at
    > hand. Well. In Java char is 16 bit so it's suffering the same
    > limitations. There never has been a type in C reflecting a byte other
    > than char. But char stands for character, not byte. These days a
    > character in unicode is at least a 20 bit value.
    
    sorry for nitpicking, but no.  grapheme clusters (i.e. what a user
    perceives as a "character") can be more than one code point long.
    you can easily see emojis as long as ten code points.
    
    (just using emojis as an example, actual human languages make use of
    combining characters as well)
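
To make the byte / code point / "character" distinction concrete, here is a minimal sketch in C. The function name `utf8_codepoints` is made up for illustration, and the code assumes its input is well-formed UTF-8. It only counts code points; it does not find grapheme cluster boundaries.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count the code points in a NUL-terminated UTF-8 string.
 * Assumption: the input is well-formed UTF-8, so every byte that is
 * not a continuation byte (10xxxxxx) starts a new code point.
 */
size_t
utf8_codepoints(const char *s)
{
	size_t n = 0;

	assert(s != NULL);
	for (; *s != '\0'; s++)
		if (((unsigned char)*s & 0xC0) != 0x80)
			n++;
	return n;
}
```

With this sketch, `utf8_codepoints("a\xcc\x81")` (LATIN SMALL LETTER A followed by COMBINING ACUTE ACCENT) returns 2 for a string of 3 bytes, although a user sees a single character.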
    
    > Regarding that
    > "setlocale" thing. There always has been an implicit setlocale from day
    > one making char to mean 7 bit US-ASCII, although it's historically an 8
    > bit value. So all those other encodings making use of that 8 bit
    > property, like iso8895 and such were always just a work around to
    > provide more than those ASCII characters. Just give up on that "a
    > char(acter) is a byte" invariant and accept that a char(acter) these
    > days is at least a 20 bit value. There is no drawback on e.g. shell
    > scripts containing unicode and such, if the shell would just give up on
    > that "a character is a byte" invariant, which it could by giving up on
    > char in favour of wchar_t.
    
    which i'm not sure actually fixes anything, given that a wchar_t is
    not a grapheme cluster.
    
    ("anything" in the context of the thread, i.e. hitting the left arrow
    key and seeing the cursor skip over a whole character rather than
    getting stuck somewhere between the wchar_t LATIN SMALL LETTER A and
    the wchar_t COMBINING ACUTE ACCENT)
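
The cursor problem described above can be sketched in C as follows. This is an illustration only, not code from ksh or from the patch under discussion: `is_combining` and `cursor_left` are hypothetical helpers, and `is_combining` only knows the Combining Diacritical Marks block (U+0300..U+036F), which is far from full grapheme cluster segmentation. It also assumes wchar_t holds Unicode code points.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/*
 * Hypothetical helper: true only for U+0300..U+036F.
 * Real segmentation needs the Unicode grapheme cluster rules.
 */
int
is_combining(wchar_t wc)
{
	return wc >= 0x0300 && wc <= 0x036F;
}

/*
 * Move the cursor one user-visible step to the left.
 * pos is an index into buf in wchar_t units.  A naive "pos - 1"
 * can land between a base letter and its combining mark; this
 * version keeps going until it reaches the base letter.
 */
size_t
cursor_left(const wchar_t *buf, size_t pos)
{
	assert(buf != NULL);
	if (pos == 0)
		return 0;
	pos--;
	while (pos > 0 && is_combining(buf[pos]))
		pos--;
	return pos;
}
```

For the buffer `L"xa\u0301y"` with the cursor on the `y` (index 3), `cursor_left` returns 1, the index of the `a`, whereas stepping back one wchar_t would stop at index 2, on the combining accent.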
    
    