[REPOST] ksh: utf8 full width character support for emacs.c
On 4/4/25 17:36, William Goodspeed wrote:
> Hi list,
>
> Thanks everyone for the attention on the issue and the discussion on
> unicode.
>
> To make it partially work, the simplest way imho would be replacing
> is_fullwidth and u8_to_cpt with wcwidth(3) and u8_to_wc (with mbtowc).
> To be clear, that doesn't mean it will refactor the entire shell to
> wchar; it uses wchar routines only if a unicode character occurs, to
> calculate its width.
> Therefore, there won't be a performance issue of calling mbtowc(3)
> repeatedly, as it only calls it on unicode instead of regular ascii
> chars.
> After testing, that works pretty well for me.
>
> However, there are pros and cons to that solution, and here are some of
> my thoughts:
>
> 1. The setlocale(3) call can't be avoided. It would be
>    `setlocale(LC_ALL, "")' to make it default to the encoding in
>    the user environment. And it's quite clear that's a must if one
>    wants any i18n from the C library.
>
> 2. It behaves differently across locales. If C/POSIX (ASCII)
>    or encodings other than utf-8 are chosen, unicode character
>    correctly. [But if we fall back to one column when wcwidth of
>    something failed, it will match the current (broken) behaviour.]
>    But one would find it parsing utf-8 anyway in other locales.
>    Unfortunately, the issue also exists in the current ksh.
>    (Since OpenBSD only supports ASCII and UTF8, I don't think
>    that's really big of a problem.)
>
> 3. Also, a user-perceived character may not be one rune. But
>    like emojis, in a terminal, they aren't combined together (at least
>    on urxvt with Unifont). So I think it's already really ambiguous
>    to determine the actual width of a cluster character.
>
> So in conclusion, I suggest switching to wcwidth(3) and mbtowc(3)
> despite the (maybe) disadvantage of calling setlocale(3). It solves the
> problem without introducing too much complexity and other defects.
>
> Any thoughts?

Sorry. I cannot help you with this any further.
Nothing mandates multi byte strings to actually contain UTF-8.
Hard coding UTF-8 handling in ksh seems wrong to me. The C standard is
pretty clear about this.

I feel very sorry no one competent is replying to you helping you to
solve the issue at hand.

--
Christian
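
For readers following the thread, here is a rough sketch of the approach proposed in the quoted message: decode with mbtowc(3) only when a non-ASCII byte shows up, ask wcwidth(3) for the column count, and fall back to one column when either call fails. It is not the patch under discussion. The function name display_width is made up for illustration, and the ASCII fast path assumes an ASCII-compatible encoding (true for the ASCII and UTF-8 locales mentioned above). Because it decodes through the locale rather than hard coding UTF-8, the caller is expected to have called setlocale(LC_CTYPE, "") once at startup.

```c
#define _XOPEN_SOURCE 700	/* needed on some systems to declare wcwidth(3) */
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper, not taken from ksh: count the terminal columns
 * used by the multibyte string s under the current LC_CTYPE locale.
 * Undecodable bytes and characters with no defined width (wcwidth()
 * returns -1) are counted as one column each.
 */
int
display_width(const char *s)
{
	size_t left;
	int cols = 0;

	assert(s != NULL);
	left = strlen(s);
	mbtowc(NULL, NULL, 0);		/* reset any conversion state */

	while (left > 0) {
		wchar_t wc;
		int n, w;

		/* Fast path: ASCII bytes never reach the wchar routines. */
		if ((unsigned char)*s < 0x80) {
			cols++;
			s++;
			left--;
			continue;
		}

		n = mbtowc(&wc, s, left);
		if (n <= 0) {
			/* Invalid or truncated sequence: one byte, one column. */
			mbtowc(NULL, NULL, 0);
			cols++;
			s++;
			left--;
			continue;
		}

		w = wcwidth(wc);
		cols += (w < 0) ? 1 : w;
		s += n;
		left -= (size_t)n;
	}
	return cols;
}
```

In a UTF-8 locale a full-width character would normally be reported as two columns by wcwidth(3), while in the C locale the same bytes fall through to the one-column fallback, which is the locale-dependent behaviour described in point 2 of the quoted message.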