From: Daniel Tameling <tamelingdaniel@gmail.com>
Subject: Re: [REPOST] ksh: utf8 full width character support for emacs.c
To: tech@openbsd.org
Date: Thu, 3 Apr 2025 09:10:27 +0200

On Mon, Mar 31, 2025 at 01:47:14AM +0200, Ingo Schwarze wrote:
> 
> Then again, switching to wchar_t *and* using wcwidth(3), when taken
> together, can be leveraged to solve *most* of the problem because
> the following *vastly simplified* model more or less works in most
> situations, at least much better than what we have now:
> 
>  1. Assume that user-perceived characters are identical to
>     legacy grapheme clusters.
>     (Of course, that assumption breaks down when dealing with
>      user-perceived characters that are larger than grapheme clusters,
>      but that is relatively rare, in particular in languages based on
>      Latin, Greek or Cyrillic scripts.)
>  2. Assume that all width-0 code points are combining characters
>     and vice versa.
>     (Of course, combining characters exist that are not width-0,
>      and width-0 code points exist that are not combining, but
>      the detrimental effects on line editing tend to be relatively
>      rare and maybe also relatively mild.)
>  3. Assume that users want to operate on user-perceived characters
>     (e.g. move the cursor by one of them at a time, replace them
>      one at a time, and so on)
>     rather than wanting to edit *parts* of user-perceived characters
>     (which does make sense with some scripts and languages, and for
>      some languages it depends on the context and editing purpose
>      whether it is desirable or not.)
> 
> Taking these three assumptions together allows the following
> vastly simplified modus operandi for line editing:
> 
>  A. Each width-1 and each width-2 code point starts a
>     user-perceived character.
>  B. Zero or more subsequent width-0 code points are part
>     of the same user-perceived character.
>  C. The display width of the user-perceived character
>     is the width of its first code point.
>  D. The cursor can only be placed on these user-perceived characters,
>     and each of them can only be edited as a whole.
> 

Just to mention it: afaik this is what tmux does. It works reasonably
well, even across different OSes. The code for handling this is in
usr.bin/tmux/utf8.c. It stores the results and uses a cache; neither is
necessary for a first attempt (keep in mind that tmux also needs to
render command output and not just user input). Calling mbtowc(3)
followed by wcwidth(3) on each character should be fine at first, and
then one can see how that performs and what issues remain.

--
Best regards,
Daniel