[REPOST] ksh: utf8 full width character support for emacs.c
On Tue, Mar 18, 2025 at 06:12:00AM +0100, Christian Schulte wrote:
> > +
> > +int u8_to_cpt(const char *buf, unsigned long *cpt) {
>
> If this is supposed to convert UTF 8 to UTF 32, it's wrong.
>
> > + const unsigned char *ubuf = buf;
> > +
> > + if (ubuf[0] <= 0x7F) {
> > + *cpt = ubuf[0];
> > + return 1;
> > + } else if ((ubuf[0] & 0xE0) == 0xC0) {
> 0xC0
No, this is right. This case is for 2-byte sequences, which are encoded
as 110xxxxx 10xxxxxx. 0xC0 is 11000000; using it as the mask instead of
0xE0 (11100000) would also accept a lead byte of the form 111xxxxx.
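To make the difference concrete, here is a minimal sketch comparing the two
masks, assuming the suggestion was to test the lead byte against 0xC0 only.
Both helper names are hypothetical and exist only for this illustration; they
are not part of the patch.

```c
/* Hypothetical helpers, for illustration only. */

/* The test used in the patch: the top three bits must be exactly 110. */
static int
is_lead2_patch(unsigned char b)
{
	return (b & 0xE0) == 0xC0;
}

/* The 0xC0 mask: only the top two bits are examined. */
static int
is_lead2_c0mask(unsigned char b)
{
	return (b & 0xC0) == 0xC0;
}
```

With these, 0xC3 (11000011) passes both tests, while 0xE3 (11100011), the
lead byte of a 3-byte sequence, is rejected by the first and accepted by the
second.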
>
> > + *cpt = ((ubuf[0] & 0x1F) << 6) | (ubuf[1] & 0x3F);
> > + return 2;
> > + } else if ((ubuf[0] & 0xF0) == 0xE0) {
> 0xE0
Same.
>
> > + *cpt = ((ubuf[0] & 0x0F) << 12)
> > + | ((ubuf[1] & 0x3F) << 6)
> > + | (ubuf[2] & 0x3F);
> > + return 3;
> > + } else if ((ubuf[0] & 0xF8) == 0xF0) {
> 0xF0
Same.
>
> > + *cpt = ((ubuf[0] & 0x07) << 18)
> > + | ((ubuf[1] & 0x3F) << 12)
> > + | ((ubuf[2] & 0x3F) << 6)
> > + | (ubuf[3] & 0x3F);
> > + return 4;
> > + }
> > +
> > + return 0;
> > +}
> > +
> > +#endif
I didn't check the call sites, but this relies on the caller checking
that the input sequence is valid UTF-8 encoding, as it doesn't check
that the continuation bytes are 10xxxxxx.
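If the call sites turn out not to validate their input, a checked variant
could look roughly like the sketch below. The name u8_to_cpt_checked and the
extra len parameter are my own assumptions, not something from the patch; it
uses the same lead-byte tests but refuses truncated sequences and continuation
bytes that are not 10xxxxxx.

```c
#include <stddef.h>

/*
 * Hypothetical checked variant of u8_to_cpt(); not part of the patch.
 * Returns the number of bytes consumed (1-4), or 0 if the lead byte is
 * invalid, the sequence is truncated, or a continuation byte is not of
 * the form 10xxxxxx.
 */
static int
u8_to_cpt_checked(const char *buf, size_t len, unsigned long *cpt)
{
	const unsigned char *ubuf = (const unsigned char *)buf;
	unsigned long c;
	int i, n;

	if (len == 0)
		return 0;

	if (ubuf[0] <= 0x7F) {
		*cpt = ubuf[0];
		return 1;
	} else if ((ubuf[0] & 0xE0) == 0xC0) {
		n = 2;
		c = ubuf[0] & 0x1F;
	} else if ((ubuf[0] & 0xF0) == 0xE0) {
		n = 3;
		c = ubuf[0] & 0x0F;
	} else if ((ubuf[0] & 0xF8) == 0xF0) {
		n = 4;
		c = ubuf[0] & 0x07;
	} else
		return 0;	/* stray continuation or invalid lead byte */

	if (len < (size_t)n)
		return 0;	/* truncated sequence */

	for (i = 1; i < n; i++) {
		if ((ubuf[i] & 0xC0) != 0x80)
			return 0;	/* not a 10xxxxxx continuation byte */
		c = (c << 6) | (ubuf[i] & 0x3F);
	}

	*cpt = c;
	return n;
}
```

This still does not reject overlong encodings, surrogates, or values above
U+10FFFF; it only covers the continuation-byte and length checks mentioned
above.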