
From: Christian Schulte <cs@schulte.it>
Subject: Re: [REPOST] ksh: utf8 full width character support for emacs.c
To: William Goodspeed <gongzl@stu.hebust.edu.cn>, tech@openbsd.org
Date: Tue, 8 Apr 2025 08:39:55 +0200

On 4/4/25 17:36, William Goodspeed wrote:
> Hi list,
> 
> Thanks to everyone for the attention to the issue and the discussion
> on Unicode.
> 
> The simplest way to make it partially work, imho, would be to replace
> is_fullwidth and u8_to_cpt with wcwidth(3) and u8_to_wc (built on
> mbtowc(3)). To be clear, that doesn't mean refactoring the entire shell
> to wchar; it uses the wchar routines only when a Unicode (non-ASCII)
> character occurs, to calculate its width. Therefore, there won't be a
> performance issue from calling mbtowc(3) repeatedly, as it is only
> called on Unicode characters rather than regular ASCII chars.
> After testing, that works pretty well for me.
> 
> However, there are pros and cons to that solution, and here are some of
> my thoughts:
> 
> 1. The setlocale(3) call can't be avoided. It would be
>    `setlocale(LC_ALL, "")' to make it default to the encoding in
>    the user environment. And it's quite clear that's a must if one
>    wants any i18n from the C library.
> 
> 2. It behaves differently across locales. If C/POSIX (ASCII)
>    or encodings other than UTF-8 are chosen, Unicode characters
>    won't be handled correctly. [But if we fall back to one column
>    whenever wcwidth fails, it will match the current (broken)
>    behaviour.] But one would find it parsing UTF-8 anyway in other
>    locales. Unfortunately, that issue also exists in the current ksh.
>    (Since OpenBSD only supports ASCII and UTF-8, I don't think
>    that's really a big problem.)
> 
> 3. Also, a user-perceived character may not be a single rune. But
>    in a terminal such clusters, emojis for example, aren't combined
>    (at least on urxvt with Unifont). So I think it's already really
>    ambiguous to determine the actual width of a character cluster.
> 
> So in conclusion, I suggest switching to wcwidth(3) and mbtowc(3)
> despite the (possible) disadvantage of calling setlocale(3). It solves
> the problem without introducing too much complexity or other defects.
>    
> Any thoughts?
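
For readers following the thread, the approach proposed above can be
sketched roughly as follows. This is only an illustration under stated
assumptions, not the code from the patch: str_columns is a hypothetical
helper name, the ASCII fast path counts every ASCII byte (control
characters included) as one column, and the one-column fallback mirrors
the fallback described in point 2.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700	/* some systems need this for wcwidth(3) */
#endif
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper (not from the patch): count the terminal columns
 * occupied by the multibyte string s in the current locale.
 */
static int
str_columns(const char *s)
{
	size_t left = strlen(s);
	int cols = 0;

	/* Reset the internal conversion state of mbtowc(3). */
	(void)mbtowc(NULL, NULL, 0);

	while (left > 0) {
		wchar_t wc;
		int len, w;

		if ((unsigned char)*s < 0x80) {
			/* ASCII fast path: mbtowc(3) is never called. */
			s++;
			left--;
			cols++;
			continue;
		}

		len = mbtowc(&wc, s, left);
		if (len <= 0) {
			/* Invalid sequence: skip one byte, assume one column. */
			(void)mbtowc(NULL, NULL, 0);
			s++;
			left--;
			cols++;
			continue;
		}

		w = wcwidth(wc);
		/* wcwidth(3) returns -1 for non-printable characters. */
		cols += (w < 0) ? 1 : w;
		s += len;
		left -= (size_t)len;
	}
	return cols;
}
```

A caller would run setlocale(LC_CTYPE, "") (or LC_ALL) once at startup.
Without that call the program stays in the C locale, and how bytes
outside ASCII are converted then depends on the C library.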

Sorry. I cannot help you with this any further. Nothing mandates that
multibyte strings actually contain UTF-8. Hard-coding UTF-8 handling in
ksh seems wrong to me. The C standard is pretty clear about this. I am
very sorry that no one competent is replying to help you solve the
issue at hand.
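
To illustrate the point about not assuming UTF-8: a program can ask the
locale which encoding is in effect instead of hard-coding one. The
sketch below is a minimal example, not a proposal for ksh;
locale_is_utf8 is a hypothetical name, and comparing against the exact
string "UTF-8" is an assumption, since the codeset name returned by
nl_langinfo(3) can be spelled differently on other systems.

```c
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/*
 * Hypothetical helper: return 1 if the current LC_CTYPE locale reports
 * UTF-8 as its codeset, 0 otherwise.
 */
static int
locale_is_utf8(void)
{
	const char *codeset = nl_langinfo(CODESET);

	return codeset != NULL && strcmp(codeset, "UTF-8") == 0;
}
```

The result reflects whatever locale is active at the time of the call,
so it is only meaningful after setlocale(3) has been called with the
user's environment.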

-- 
Christian