
From:
Omar Polo <op@omarpolo.com>
Subject:
Re: [REPOST] ksh: utf8 full width character support for emacs.c
To:
Ingo Schwarze <schwarze@usta.de>
Cc:
tech@openbsd.org
Date:
Thu, 03 Apr 2025 20:41:48 +0200


Thread
  • Daniel Tameling:

    [REPOST] ksh: utf8 full width character support for emacs.c

  • Omar Polo:

    [REPOST] ksh: utf8 full width character support for emacs.c

  • ori@eigenstate.org:

    [REPOST] ksh: utf8 full width character support for emacs.c

  • Ingo Schwarze <schwarze@usta.de> wrote:
    > Hello Omar,
    > 
    > Omar Polo wrote on Sun, Mar 30, 2025 at 08:37:06PM +0200:
    > 
    > > grapheme clusters (i.e. what a user perceives as a "character")
    > > can be more than one code point long.
    > 
    > That is (more or less) accurate - and ludicrously complicated.
    > There is a long annex to the Unicode standard on this topic,
    >   Unicode Text Segmentation, Unicode Standard Annex #29
    >   https://www.unicode.org/reports/tr29/
    > 
    > Note that "grapheme clusters" are not the same as "user-perceived
    > characters", and there are several different types of grapheme clusters
    > (legacy, extended, tailored, ...).
    
    oops, i fell into the trap of "someone is wrong on the internet" just to
    spit out more wrong stuff, apologies ;-)
    
    > [...]
    > 
    > Taking these three assumptions together allows the following
    > vastly simplified modus operandi for line editing:
    > 
    >  A. Each width-1 and each width-2 code point starts a
    >     user-perceived character.
    >  B. Zero or more subsequent width-0 code points are part
    >     of the same user-perceived character.
    >  C. The display width of the user-perceived character
    >     is the width of its first code point.
    >  D. The cursor can only be placed on these user-perceived characters,
    >     and each of them can only be edited as a whole.
    
    fwiw i second that this might be enough for ksh.  it would fix so many of
    the current issues with non-ascii text that what remains is more than
    tolerable i guess.
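
For illustration only, rules A to D quoted above could be sketched roughly like this. It is not code from ksh or from any posted diff: `toy_width()` is a hypothetical stand-in for wcwidth(3) that knows just a few ranges, and `decode()` and `next_char()` are made-up names. Invalid bytes are assumed to count as one width-1 character each, so arbitrary byte strings still make progress.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for wcwidth(3): a few rough ranges only. */
static int
toy_width(uint32_t cp)
{
	if (cp >= 0x0300 && cp <= 0x036F)	/* combining diacritics */
		return 0;
	if ((cp >= 0x1100 && cp <= 0x115F) ||	/* Hangul Jamo leaders */
	    (cp >= 0x2E80 && cp <= 0xA4CF) ||	/* CJK and neighbours */
	    (cp >= 0xAC00 && cp <= 0xD7A3) ||	/* Hangul syllables */
	    (cp >= 0xFF01 && cp <= 0xFF60))	/* fullwidth forms */
		return 2;
	return 1;
}

/*
 * Decode one code point from s (n >= 1 bytes left) into *cp and return
 * its length in bytes.  Malformed or truncated input is returned as a
 * single raw byte.  No overlong or surrogate checks: this is a sketch.
 */
static size_t
decode(const unsigned char *s, size_t n, uint32_t *cp)
{
	size_t i, len;
	uint32_t c;

	if (s[0] < 0x80) {
		*cp = s[0];
		return 1;
	} else if ((s[0] & 0xE0) == 0xC0) {
		len = 2; c = s[0] & 0x1F;
	} else if ((s[0] & 0xF0) == 0xE0) {
		len = 3; c = s[0] & 0x0F;
	} else if ((s[0] & 0xF8) == 0xF0) {
		len = 4; c = s[0] & 0x07;
	} else {
		*cp = s[0];		/* stray continuation or 0xF8..0xFF */
		return 1;
	}
	if (len > n) {
		*cp = s[0];		/* truncated sequence */
		return 1;
	}
	for (i = 1; i < len; i++) {
		if ((s[i] & 0xC0) != 0x80) {
			*cp = s[0];	/* bad continuation byte */
			return 1;
		}
		c = (c << 6) | (s[i] & 0x3F);
	}
	*cp = c;
	return len;
}

/*
 * Return the byte length of the user-perceived character starting at
 * byte offset pos and store its display width in *width.
 */
size_t
next_char(const char *buf, size_t len, size_t pos, int *width)
{
	const unsigned char *s = (const unsigned char *)buf;
	size_t n, start = pos;
	uint32_t cp;

	assert(buf != NULL && width != NULL && pos < len);
	pos += decode(s + pos, len - pos, &cp);
	*width = toy_width(cp);		/* rule C: width of first code point */
	while (pos < len) {
		n = decode(s + pos, len - pos, &cp);
		if (toy_width(cp) != 0)
			break;		/* rule A: a new character starts */
		pos += n;		/* rule B: width-0 code point joins */
	}
	return pos - start;
}
```

Rule D would then amount to moving the cursor and deleting only in units of the byte length that `next_char()` returns, while the buffer itself stays a plain `char` array.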
    
    > Doing this does *not* require switching from "char" to "wchar_t"
    > globally throughout the shell.  Worse, such a switch would destroy
    > some functionality that the shell now provides.  For example, try the
    > following command:
    > 
    >    $ echo $(printf "\xff") | hexdump -C
    >   0000000  ff 0a
    >   0000002
    > 
    > The printf(1) command produces the byte 0xff.  The shell reads that byte
    > and constructs the command "echo weird-byte" which prints the weird byte
    > to standard output as desired.  If the shell would internally store its
    > input in wchar_t, that would not be possible because that byte cannot
    > be stored in wchar_t.  Depending on how we would handle such input,
    > this might result in the shell dying from EILSEQ, or printing 0xbfc3
    > instead of 0xff, or any number of other weird changes in behavior.
    
    agreed; my favourite part is that you don't even need printf; just
    iterating over file names is enough (since they're just byte strings)
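
A small sketch of why the byte 0xff is awkward, under the assumption that the "0xbfc3" quoted above refers to the bytes c3 bf, i.e. U+00FF re-encoded as UTF-8 and shown as a little-endian 16-bit word. The function names are hypothetical and nothing here comes from ksh: `can_appear_in_utf8()` reports whether a byte value can occur anywhere in well-formed UTF-8, and `encode_small()` shows what a Latin-1 style reinterpretation of 0xff would write back out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Well-formed UTF-8 only uses 0x00..0x7F (ASCII), 0x80..0xBF
 * (continuation bytes) and 0xC2..0xF4 (lead bytes).  The values
 * 0xC0, 0xC1 and 0xF5..0xFF never occur.
 */
int
can_appear_in_utf8(unsigned char b)
{
	return b < 0xC0 || (b >= 0xC2 && b <= 0xF4);
}

/*
 * Encode a code point up to U+07FF as UTF-8 and return the number of
 * bytes written (1 or 2).  Larger code points are outside this sketch.
 */
size_t
encode_small(uint32_t cp, unsigned char out[2])
{
	assert(cp <= 0x7FF);
	if (cp < 0x80) {
		out[0] = (unsigned char)cp;
		return 1;
	}
	out[0] = (unsigned char)(0xC0 | (cp >> 6));
	out[1] = (unsigned char)(0x80 | (cp & 0x3F));
	return 2;
}
```

With these, `can_appear_in_utf8(0xff)` is false, so a UTF-8 decoder has no code point to hand back for that byte, and `encode_small(0xff, out)` yields the two bytes c3 bf rather than the original single byte. A `char` buffer simply carries 0xff through unchanged, which is what file names containing such bytes rely on.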
    
    > [...]
    > 
    > Omar, note that i gave up on refuting what Christian Schulte is saying
    > on a point-by-point basis because what he says is typically a mixture of
    > half-truths, misunderstandings, misleading claims, and arguments at best
    > tangentially related to the topic under discussion.  So unless you are
    > in urgent need of wild goose feathers, consider hunting elsewhere...
    
    yeah, as i was saying i just fell for an almost pointless correction.
    But it resulted in your precise reply, which is always a pleasure to
    read!
    
    > But given that you have shown interest, i thought it might make sense
    > to explain what is *actually* going on here.
    
    i guess i have to put my code where my mouth is.  I like your outline
    for handling text in ksh interactive mode; can't promise anything but
    i'll give it an honest spin!
    
    Thank you!
    
    
    