From: Christian Schulte <cs@schulte.it>
Subject: Re: [REPOST] ksh: utf8 full width character support for emacs.c
To: tech@openbsd.org
Date: Sat, 22 Mar 2025 11:30:54 +0100

Thread
  • Christian Schulte:

    [REPOST] ksh: utf8 full width character support for emacs.c

  • On 3/21/25 12:15, Ingo Schwarze wrote:
    > Hello,
    > 
    > Gong Zhile wrote on Wed, Mar 19, 2025 at 10:15:42AM +0800:
    > 
    >> There isn't any wchar_t involved in that patch. It took a UTF-8 rune
    >> (codepoint) from a cstring and processed it. But, as enh has pointed out,
    >> refactoring it to leverage wcwidth(3) is surely a good idea.
    > 
    > I wouldn't say "surely", but i would say "more likely".
    > 
    > You have picked quite a tricky task here.  Hurdles include:
    >  * To use wcwidth(3), you need wchar_t values.
    >  * I see no reasonable way how you could get such values
    >    other than by using libc functions like mbtowc(3).
    >  * To use these functions, it is necessary to use setlocale(3)
    >    or functions like newlocale(3)/uselocale(3), which the shell
    >    does not use yet.
    >  * One needs to consider whether using ???locale(3) in the shell
    >    carries any risk, or needs to be restricted to certain areas of
    >    the code.  It's not yet clear to me that suddenly running all
    >    the code in the shell under a locale different from the C locale
    >    would have no detrimental consequences.
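
    A minimal sketch of what the hurdles quoted above amount to in code,
    not taken from the patch under discussion: switch LC_CTYPE to a UTF-8
    locale only for the duration of one helper, convert bytes to wchar_t
    with mbrtowc(3) (the restartable sibling of mbtowc(3)), and sum up
    wcwidth(3). The helper name display_width() and the locale names it
    tries are assumptions made for illustration; which locale names exist
    varies by system.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700	/* exposes wcwidth(3) and strdup(3) on some libcs */
#endif
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper, not part of ksh: return the number of terminal
 * columns needed for the UTF-8 string s, or -1 if no UTF-8 locale could
 * be selected or s contains an invalid or non-printable character.
 */
int
display_width(const char *s)
{
	mbstate_t st;
	char *saved;
	size_t left;
	int total = 0;

	assert(s != NULL);

	/* Remember the current LC_CTYPE so it can be restored afterwards. */
	saved = setlocale(LC_CTYPE, NULL);
	if (saved == NULL || (saved = strdup(saved)) == NULL)
		return -1;

	/* Assumed locale names; availability differs between systems. */
	if (setlocale(LC_CTYPE, "C.UTF-8") == NULL &&
	    setlocale(LC_CTYPE, "en_US.UTF-8") == NULL) {
		free(saved);
		return -1;
	}

	memset(&st, 0, sizeof(st));	/* initial conversion state */
	left = strlen(s);
	while (left > 0) {
		wchar_t wc;
		size_t n = mbrtowc(&wc, s, left, &st);	/* bytes -> wchar_t */
		int w;

		if (n == (size_t)-1 || n == (size_t)-2 || n == 0) {
			total = -1;	/* invalid or truncated sequence */
			break;
		}
		w = wcwidth(wc);	/* 0, 1 or 2 columns; -1 if unprintable */
		if (w < 0) {
			total = -1;
			break;
		}
		total += w;
		s += n;
		left -= n;
	}

	/* Leave the rest of the program under its original locale. */
	setlocale(LC_CTYPE, saved);
	free(saved);
	return total;
}
```

    When a UTF-8 locale is available, display_width("ls") should yield 2
    and a string of two CJK ideographs should yield 4 on typical libcs.
    setlocale(3) changes process-wide state, which is why the sketch
    restores the previous value before returning.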
    
    Of course there are consequences. In C, a char has always been exactly
    one byte (8 bits on every POSIX system) and always will be. That is
    the root cause of the issue at hand. Java's char is 16 bits wide, so
    it suffers from the same kind of limitation. C has never had a type
    for a byte other than char, yet char stands for character, not byte.
    These days a Unicode character is a code point of up to 21 bits.
    Regarding that "setlocale" thing: there has always been an implicit
    setlocale to the "C" locale at program startup, which makes char mean
    7-bit US-ASCII although it is historically an 8-bit value. All those
    other encodings that exploit the eighth bit, like ISO 8859 and such,
    were always just a workaround to provide more than the ASCII
    characters. Just give up on the "a char(acter) is a byte" invariant
    and accept that a char(acter) these days needs up to 21 bits. There
    is no drawback for e.g. shell scripts containing Unicode if the shell
    gave up on that invariant, which it could do by giving up on char in
    favour of wchar_t.
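
    To illustrate the byte versus character distinction without touching
    the locale at all, here is a small sketch of a UTF-8 decoder. The
    names utf8_decode() and utf8_count() are hypothetical and belong
    neither to ksh nor to the patch; the point is only that one
    character may span up to four char values and yields a code point no
    larger than U+10FFFF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Decode one UTF-8 sequence starting at s and store the code point in
 * *cp. Returns the number of bytes consumed (1 to 4), or 0 if the
 * sequence is malformed, overlong, a surrogate, or above U+10FFFF.
 */
size_t
utf8_decode(const char *s, uint32_t *cp)
{
	/* Smallest code point that legitimately needs n bytes. */
	static const uint32_t min[] = { 0, 0, 0x80, 0x800, 0x10000 };
	const unsigned char *p = (const unsigned char *)s;
	uint32_t c;
	size_t n, i;

	if (p[0] < 0x80) {
		c = p[0];
		n = 1;
	} else if ((p[0] & 0xE0) == 0xC0) {
		c = p[0] & 0x1F;
		n = 2;
	} else if ((p[0] & 0xF0) == 0xE0) {
		c = p[0] & 0x0F;
		n = 3;
	} else if ((p[0] & 0xF8) == 0xF0) {
		c = p[0] & 0x07;
		n = 4;
	} else
		return 0;	/* stray continuation or invalid lead byte */

	for (i = 1; i < n; i++) {
		/* Also stops at a premature terminating NUL. */
		if ((p[i] & 0xC0) != 0x80)
			return 0;
		c = (c << 6) | (p[i] & 0x3F);
	}

	if (c < min[n] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
		return 0;

	*cp = c;
	return n;
}

/*
 * Count code points in the NUL-terminated string s.
 * Returns (size_t)-1 if s is not valid UTF-8.
 */
size_t
utf8_count(const char *s)
{
	size_t count = 0, n;
	uint32_t cp;

	assert(s != NULL);
	while (*s != '\0') {
		if ((n = utf8_decode(s, &cp)) == 0)
			return (size_t)-1;
		s += n;
		count++;
	}
	return count;
}
```

    For the six-byte string "\xE6\x97\xA5\xE6\x9C\xAC", strlen(3) reports
    6 while utf8_count() reports 2. Neither number is the display width,
    which is what wcwidth(3) would add on top.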
    
    -- 
    Christian
    
    