From: Lucas Gabriel Vuotto
Subject: Re: [REPOST] ksh: utf8 full width character support for emacs.c
To: Christian Schulte
Cc: tech@openbsd.org
Date: Wed, 19 Mar 2025 00:15:13 +0000

On Tue, Mar 18, 2025 at 06:12:00AM +0100, Christian Schulte wrote:
> > +
> > +int u8_to_cpt(const char *buf, unsigned long *cpt) {
>
> If this is supposed to convert UTF 8 to UTF 32, it's wrong.
>
> > +	const unsigned char *ubuf = buf;
> > +
> > +	if (ubuf[0] <= 0x7F) {
> > +		*cpt = ubuf[0];
> > +		return 1;
> > +	} else if ((ubuf[0] & 0xE0) == 0xC0) {
> 0xC0

No, this is right. This case is for 2-byte sequences, which are encoded
as 110xxxxx 10xxxxxx. 0xC0 is 11000000. Using it as the mask will also
accept a sequence 111xxxxx 10xxxxxx.

>
> > +		*cpt = ((ubuf[0] & 0x1F) << 6) | (ubuf[1] & 0x3F);
> > +		return 2;
> > +	} else if ((ubuf[0] & 0xF0) == 0xE0) {
> 0xE0

Same.

>
> > +		*cpt = ((ubuf[0] & 0x0F) << 12)
> > +		    | ((ubuf[1] & 0x3F) << 6)
> > +		    | (ubuf[2] & 0x3F);
> > +		return 3;
> > +	} else if ((ubuf[0] & 0xF8) == 0xF0) {
> 0xF0

Same.

>
> > +		*cpt = ((ubuf[0] & 0x07) << 18)
> > +		    | ((ubuf[1] & 0x3F) << 12)
> > +		    | ((ubuf[2] & 0x3F) << 6)
> > +		    | (ubuf[3] & 0x3F);
> > +		return 4;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +#endif

I didn't check the call sites, but this relies on the caller checking
that the input sequence is valid UTF-8, as the function doesn't check
that the continuation bytes are 10xxxxxx.
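
For illustration only, here is a minimal sketch of what doing that check
inside the decoder could look like. The name u8_to_cpt_checked and the
extra len parameter are hypothetical and not part of the patch; the
overlong, surrogate and range checks are additions that go beyond the
continuation-byte point. It keeps the lead-byte masks from the diff and
returns 0 for anything that is not a well-formed sequence.

```c
#include <stddef.h>

/*
 * Hypothetical stricter variant of u8_to_cpt(); not part of the patch.
 * Returns the number of bytes consumed, or 0 if buf does not start
 * with a well-formed UTF-8 sequence of at most len bytes.
 */
static int
u8_to_cpt_checked(const char *buf, size_t len, unsigned long *cpt)
{
	const unsigned char *u = (const unsigned char *)buf;
	unsigned long c, min;
	int i, n;

	if (len == 0)
		return 0;

	if (u[0] <= 0x7F) {			/* 0xxxxxxx */
		*cpt = u[0];
		return 1;
	} else if ((u[0] & 0xE0) == 0xC0) {	/* 110xxxxx */
		n = 2;
		c = u[0] & 0x1F;
		min = 0x80;
	} else if ((u[0] & 0xF0) == 0xE0) {	/* 1110xxxx */
		n = 3;
		c = u[0] & 0x0F;
		min = 0x800;
	} else if ((u[0] & 0xF8) == 0xF0) {	/* 11110xxx */
		n = 4;
		c = u[0] & 0x07;
		min = 0x10000;
	} else
		return 0;	/* stray continuation or invalid lead byte */

	if (len < (size_t)n)
		return 0;	/* truncated sequence */

	for (i = 1; i < n; i++) {
		if ((u[i] & 0xC0) != 0x80)	/* must be 10xxxxxx */
			return 0;
		c = (c << 6) | (u[i] & 0x3F);
	}

	/* Reject overlong forms, surrogates and values past U+10FFFF. */
	if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
		return 0;

	*cpt = c;
	return n;
}
```

With this shape, u8_to_cpt_checked("\xC3\x41", 2, &c) returns 0 because
the second byte is not 10xxxxxx, whereas the unchecked version would
return 2 with a bogus code point. Whether such checks belong here or at
the call sites depends on what the callers already guarantee.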