From: Stefan Sperling <stsp@stsp.name>
Subject: Re: Implement -E for rs(1)
To: Andy Bradford <amb-sendok-1715569415.nmicbecdkkdmdcombafb@bradfords.org>
Cc: tech@openbsd.org
Date: Thu, 14 Mar 2024 11:27:59 +0100

On Wed, Mar 13, 2024 at 09:03:34PM -0600, Andy Bradford wrote:
> Hello,
> 
> It looks like  -E was never added  to rs(1) for some  reason even though
> the man page and  the code make reference to it. I  recently had need to
> have rs -E work, so I took a stab at implementing it.
> 
> Here is what it does:
> 
> $ echo ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz | ./rs -E 4
> A  B  C  D  E  F  G  H  I  J  K  L  M
> N  O  P  Q  R  S  T  U  V  W  X  Y  Z
> a  b  c  d  e  f  g  h  i  j  k  l  m
> n  o  p  q  r  s  t  u  v  w  x  y  z

Please test the UTF-8 code path as well.

Below is a test case which contains three multibyte-characters, each
with a display width of 1.

Needs xterm(1), since wscons doesn't support UTF-8.

$ export LC_CTYPE=en_US.UTF-8
$ printf '\xc3\xa4bcd\xc3\xa9fgh\xc3\xafjklmnop\n' | ./rs -E 4
ä  ?  b  c  d
é  ?  f  g  h
ï  ?  j  k  l
m  n  o  p  

The expected output would be:

ä  b  c  d
é  f  g  h
ï  j  k  l
m  n  o  p

> @@ -160,14 +163,17 @@
>  		}
>  		p = curline;
>  		while (p != NULL && *p != '\0') {
> -			if (*p == isep) {
> +			if (oneperchar) {
> +				ep->w = mbsavis(&ep->s, p, 1);
> +				p += ep->w;

ep->w corresponds to the return value of wcwidth(). i.e. the amount of
terminal columns the rendered character's font symbol will be using.
This width does not necessarily correspond to the amount of bytes used
to encode the character in the UTF-8 string.

Using display width to advance a pointer is wrong. You will want to use
the amount of input bytes consumed by mbtowc(), which the mbsavis()
interface does not yet expose. You will need to extend mbsavis() to
return both the width and the amount of input bytes used, with a
construct like:

int
mbsavis(char** outp, const char *mbs, int *bytes_used, int nchars)
{
  if (bytes_used != NULL)
    *bytes_used = 0;

  [...]
	} else {
		total_width += width;
		 if (bytes_used != NULL)
			*bytes_used += len;
		while (len-- > 0)
			*dst++ = *src++;
  [...]
  return total_width:
}

Regarding characters of display width > 1, the following command prints
the word "Kanji" in Japanese twice, tabulated by rs.
Each syllable (Kan + Ji) will be using 2 columns on the display.

$ export LC_CTYPE=en_US.UTF-8
$ printf '\xe6\xbc\xa2\xe5\xad\x97\xe6\xbc\xa2\xe5\xad\x97\n' | ./rs -E 4

The current output is:

漢  ?
字  ?
漢  ?
字  ?

I do not speak Japanese but I suspect we should see the symbols
arranged like this instead:

漢  字
漢  字

Getting the array tabulated correctly might be more difficult if a string
contains a mix of characters using 1 column and characters using > 1 column.
If this gets too complicated I would leave this case for later.

> @@ -28,13 +28,16 @@
>  	int	  total_width;  /* Display width of the whole string. */
>  	int	  width;  /* Display width of a single Unicode char. */
>  	int	  len;  /* Length in bytes of UTF-8 encoded string. */
> +	int	  count; /* Count of chars read. */

Above you are leaving count uninitialized. Its value will be whatever
the stack location happens to contain. Instead, 'count' should be set to
either 0 or 1 explicitly to make the code below always work as intended.

> @@ -55,6 +58,7 @@
>  			while (len-- > 0)
>  				*dst++ = *src++;
>  		}
> +		if (nchars > 0 && count++ >= nchars) break;

Use ++count if initialized to 0, keep count++ if initialized to 1.

If count is not initialized, with LC_CTYPE=en_US.UTF-8 in the enviroment
rs -E may produce unexpected output sometimes, even for plain ASCII input.
Two instances of this which I have observed:

$ export LC_CTYPE=en_US.UTF-8
$ echo ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz | ./rs -E 4
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
$ echo ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz | ./rs -E 4
AB  CD  EF  GH  IJ  KL  MN                                                      
OP  QR  ST  UV  WX  YZ  ab                                                      
cd  ef  gh  ij  kl  mn  op                                                      
qr  st  uv  wx  yz