From: Lucas Gabriel Vuotto <lucas@sexy.is>
Subject: Re: [REPOST] ksh: utf8 full width character support for emacs.c
To: Christian Schulte <cs@schulte.it>
Cc: tech@openbsd.org
Date: Wed, 19 Mar 2025 00:15:13 +0000

On Tue, Mar 18, 2025 at 06:12:00AM +0100, Christian Schulte wrote:
> > +
> > +int u8_to_cpt(const char *buf, unsigned long *cpt) {
> 
> If this is supposed to convert UTF 8 to UTF 32, it's wrong.
> 
> > +	const unsigned char *ubuf = buf;
> > +
> > +	if (ubuf[0] <= 0x7F) {
> > +		*cpt = ubuf[0];
> > +		return 1;
> > +	} else if ((ubuf[0] & 0xE0) == 0xC0) {
>                               0xC0

No, this is right. This case is for 2-byte sequences, which are encoded
as 110xxxxx 10xxxxxx. 0xC0 is 11000000, so using it as the mask tests
only the top two bits and would also accept a lead byte of the form
111xxxxx, which starts a 3-byte or longer sequence.
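
A minimal sketch of the difference between the two masks. The helper
names are hypothetical and are not part of the diff:

```c
/* Mask used in the diff: tests the top three bits. */
static int
is_2byte_lead_e0(unsigned char c)
{
	return (c & 0xE0) == 0xC0;	/* matches only 110xxxxx */
}

/* Suggested mask: tests only the top two bits. */
static int
is_2byte_lead_c0(unsigned char c)
{
	return (c & 0xC0) == 0xC0;	/* matches any 11xxxxxx */
}
```

With 0xE2 (11100010, the lead byte of a 3-byte sequence) the first
helper returns 0 and the second returns 1.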

> 
> > +		*cpt = ((ubuf[0] & 0x1F) << 6) | (ubuf[1] & 0x3F);
> > +		return 2;
> > +	} else if ((ubuf[0] & 0xF0) == 0xE0) {
>                               0xE0

Same.

> 
> > +		*cpt = ((ubuf[0] & 0x0F) << 12)
> > +			| ((ubuf[1] & 0x3F) << 6)
> > +			| (ubuf[2] & 0x3F);
> > +		return 3;
> > +	} else if ((ubuf[0] & 0xF8) == 0xF0) {
>                               0xF0

Same.

> 
> > +		*cpt = ((ubuf[0] & 0x07) << 18)
> > +			| ((ubuf[1] & 0x3F) << 12)
> > +			| ((ubuf[2] & 0x3F) << 6)
> > +			| (ubuf[3] & 0x3F);
> > +		return 4;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +#endif

I didn't check the call sites, but this relies on the caller having
already verified that the input is valid UTF-8, as the function itself
doesn't check that the continuation bytes are 10xxxxxx.
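
For illustration only, a sketch of what a checking variant could look
like. The name u8_to_cpt_checked and the len parameter are assumptions
of mine, not something from the diff, and it still does not reject
overlong forms or surrogates:

```c
#include <stddef.h>

/*
 * Hypothetical variant of u8_to_cpt(): returns the number of bytes
 * consumed, or 0 if the lead byte is invalid, the sequence is
 * truncated, or a continuation byte is not of the form 10xxxxxx.
 */
static int
u8_to_cpt_checked(const char *buf, size_t len, unsigned long *cpt)
{
	const unsigned char *ubuf = (const unsigned char *)buf;
	unsigned long c;
	int i, n;

	if (len == 0)
		return 0;
	if (ubuf[0] <= 0x7F) {
		*cpt = ubuf[0];
		return 1;
	} else if ((ubuf[0] & 0xE0) == 0xC0) {
		n = 2;
		c = ubuf[0] & 0x1F;
	} else if ((ubuf[0] & 0xF0) == 0xE0) {
		n = 3;
		c = ubuf[0] & 0x0F;
	} else if ((ubuf[0] & 0xF8) == 0xF0) {
		n = 4;
		c = ubuf[0] & 0x07;
	} else
		return 0;	/* stray continuation or invalid lead byte */

	if (len < (size_t)n)
		return 0;	/* truncated sequence */
	for (i = 1; i < n; i++) {
		if ((ubuf[i] & 0xC0) != 0x80)
			return 0;	/* not a continuation byte */
		c = (c << 6) | (ubuf[i] & 0x3F);
	}
	*cpt = c;
	return n;
}
```

Whether this belongs in the function or at the call sites depends on
what the callers already guarantee.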