From: Stefan Sperling Subject: Re: Implement -E for rs(1) To: Andy Bradford Cc: tech@openbsd.org Date: Thu, 14 Mar 2024 11:27:59 +0100 On Wed, Mar 13, 2024 at 09:03:34PM -0600, Andy Bradford wrote: > Hello, > > It looks like -E was never added to rs(1) for some reason even though > the man page and the code make reference to it. I recently had need to > have rs -E work, so I took a stab at implementing it. > > Here is what it does: > > $ echo ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz | ./rs -E 4 > A B C D E F G H I J K L M > N O P Q R S T U V W X Y Z > a b c d e f g h i j k l m > n o p q r s t u v w x y z Please test the UTF-8 code path as well. Below is a test case which contains three multibyte-characters, each with a display width of 1. Needs xterm(1), since wscons doesn't support UTF-8. $ export LC_CTYPE=en_US.UTF-8 $ printf '\xc3\xa4bcd\xc3\xa9fgh\xc3\xafjklmnop\n' | ./rs -E 4 ä ? b c d é ? f g h ï ? j k l m n o p The expected output would be: ä b c d é f g h ï j k l m n o p > @@ -160,14 +163,17 @@ > } > p = curline; > while (p != NULL && *p != '\0') { > - if (*p == isep) { > + if (oneperchar) { > + ep->w = mbsavis(&ep->s, p, 1); > + p += ep->w; ep->w corresponds to the return value of wcwidth(). i.e. the amount of terminal columns the rendered character's font symbol will be using. This width does not necessarily correspond to the amount of bytes used to encode the character in the UTF-8 string. Using display width to advance a pointer is wrong. You will want to use the amount of input bytes consumed by mbtowc(), which the mbsavis() interface does not yet expose. You will need to extend mbsavis() to return both the width and the amount of input bytes used, with a construct like: int mbsavis(char** outp, const char *mbs, int *bytes_used, int nchars) { if (bytes_used != NULL) *bytes_used = 0; [...] } else { total_width += width; if (bytes_used != NULL) *bytes_used += len; while (len-- > 0) *dst++ = *src++; [...] return total_width: } Regarding characters of display width > 1, the following command prints the word "Kanji" in Japanese twice, tabulated by rs. Each syllable (Kan + Ji) will be using 2 columns on the display. $ export LC_CTYPE=en_US.UTF-8 $ printf '\xe6\xbc\xa2\xe5\xad\x97\xe6\xbc\xa2\xe5\xad\x97\n' | ./rs -E 4 The current output is: 漢 ? 字 ? 漢 ? 字 ? I do not speak Japanese but I suspect we should see the symbols arranged like this instead: 漢 字 漢 字 Getting the array tabulated correctly might be more difficult if a string contains a mix of characters using 1 column and characters using > 1 column. If this gets too complicated I would leave this case for later. > @@ -28,13 +28,16 @@ > int total_width; /* Display width of the whole string. */ > int width; /* Display width of a single Unicode char. */ > int len; /* Length in bytes of UTF-8 encoded string. */ > + int count; /* Count of chars read. */ Above you are leaving count uninitialized. Its value will be whatever the stack location happens to contain. Instead, 'count' should be set to either 0 or 1 explicitly to make the code below always work as intended. > @@ -55,6 +58,7 @@ > while (len-- > 0) > *dst++ = *src++; > } > + if (nchars > 0 && count++ >= nchars) break; Use ++count if initialized to 0, keep count++ if initialized to 1. If count is not initialized, with LC_CTYPE=en_US.UTF-8 in the enviroment rs -E may produce unexpected output sometimes, even for plain ASCII input. Two instances of this which I have observed: $ export LC_CTYPE=en_US.UTF-8 $ echo ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz | ./rs -E 4 ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz $ echo ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz | ./rs -E 4 AB CD EF GH IJ KL MN OP QR ST UV WX YZ ab cd ef gh ij kl mn op qr st uv wx yz