From: Stefan Sperling <stsp@stsp.name>
Subject: Re: Implement -E for rs(1)
To: Andy Bradford <amb-sendok-1715661111.dafhampkpcdbbbkconfg@bradfords.org>
Cc: tech@openbsd.org
Date: Fri, 15 Mar 2024 10:36:21 +0100

On Thu, Mar 14, 2024 at 10:31:51PM -0600, Andy Bradford wrote:
> With the changes that I made to the code I was able to obtain the output
> that you  expect, but not with  the arguments you suggested.  Here are 3
> iterations:
> 
> $ export LC_CTYPE=en_US.UTF-8
> $ printf '\xe6\xbc\xa2\xe5\xad\x97\xe6\xbc\xa2\xe5\xad\x97\n' | ./rs -E 4
> 漢
> 字
> 漢
> 字
> $ printf '\xe6\xbc\xa2\xe5\xad\x97\xe6\xbc\xa2\xe5\xad\x97\n' | ./rs -E 0 4
> 漢  字  漢  字
> $ printf '\xe6\xbc\xa2\xe5\xad\x97\xe6\xbc\xa2\xe5\xad\x97\n' | ./rs -E 0 2
> 漢  字
> 漢  字
> 
> So you  can see that only  when I specify 2  columns of output do  I get
> what you expect. Is this correct?

I made a mistake in assuming that requesting 4 columns would yield the
output I showed you. The new behaviour you implemented seems correct to me.

> > Getting the array tabulated correctly might be more difficult if a string
> > contains a mix of characters using 1 column and characters using > 1 column.
> 
> Is this what you mean?
> 
> $ export LC_CTYPE=en_US.UTF-8
> $ printf '\xcf\x80\xe6\xbc\xa2\xe5\xad\x97\xe6\xbc\xa2\xe5\xad\x97\xcf\x80\n' | ./rs -E 0 2
> π   漢
> 字  漢
> 字  π
> 

Yes, very nice!

> I did some experimentation with other characters and found some oddities
> that I don't believe are introduced  with my changes but exist somewhere
> else in the library I believe.
> 
> Here it is with my changes:
> 
> $ printf '\xc2\xab\xd2\x88\xd2\x89\xd1\xae25\xc2\xba67\xc2\xba35\xc2\xba\xcf\x86\xcf\x80\xc2\xbb\n' | ./rs -E 0 2
> «  ҈
> ҉   Ѯ
> 2  5
> º  6
> 7  º
> 3  5
> º  φ
> π  »
> 
> And here it is with the rs included with 7.4:
> 
> $ printf '\xc2\xab \xd2\x88 \xd2\x89 \xd1\xae 2 5 \xc2\xba 6 7 \xc2\xba 3 5 \xc2\xba \xcf\x86 \xcf\x80 \xc2\xbb\n' | /usr/bin/rs -E 0 2  
> «  ҈
> ҉   Ѯ
> 2  5
> º  6
> 7  º
> 3  5
> º  φ
> π  »
> 
> Notice  that in  both cases,  the  gutter width  on  line 2  (row 2)  is
> actually 3 spaces  instead of two as  the rest of the  columns appear to
> be. I'm not sure why this is  (perhaps the character somehow got a space
> prepended). I also notice some strange things if I just output that text
> line without  filtering through  rs (the characters  seem to  smash each
> other at the beginning of the string).

The above test case involves combining characters.

With combining characters, a single visible character (grapheme) can be
built from two or more code points. For example, the German letter ä can be
represented in its entirety by Unicode code point 0xE4 https://unicodeplus.com/U+00E4
or it can be represented by the ASCII code point for 'a' (0x61) followed by
the code point for a combining diaeresis (two upper dots):
https://unicodeplus.com/U+0308
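
To make the difference between the two representations concrete, here is
a minimal sketch that counts code points in a UTF-8 string. The function
name utf8_codepoints() is made up for this example and is not part of
rs(1); it assumes the input is valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: count code points in a NUL-terminated UTF-8
 * string by counting every byte that is not a continuation byte
 * (10xxxxxx). Assumes the input is valid UTF-8.
 */
size_t
utf8_codepoints(const char *s)
{
	size_t n = 0;

	assert(s != NULL);
	for (; *s != '\0'; s++) {
		if (((unsigned char)*s & 0xc0) != 0x80)
			n++;
	}
	return n;
}
```

The precomposed form "\xc3\xa4" (U+00E4) counts as one code point, while
the decomposed form "a\xcc\x88" ('a' followed by U+0308) counts as two,
even though both are meant to render as the same letter.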

Combining characters appear with width 0 to rs(1).

Your test case involves the following combining characters:

 # pkg_add uniutils
 $ printf '\xd2\x88'| ExplicateUTF8 | head -n 3
 The sequence 0xD2     0x88     
              11010010 10001000 
 is a valid UTF-8 character encoding equivalent to UTF32 0x00000488.

'Combining Cyrillic Hundred Thousands Sign (U+0488)'
https://unicodeplus.com/U+0488, rendered as:  ҈

 $ printf '\xd2\x89'| ExplicateUTF8 | head -n 3
 The sequence 0xD2     0x89     
              11010010 10001001 
 is a valid UTF-8 character encoding equivalent to UTF32 0x00000489.

'Combining Cyrillic Millions Sign (U+0489)'
https://unicodeplus.com/U+0489, rendered as:  ҉
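
The byte-to-code-point mapping reported by ExplicateUTF8 can also be
checked by hand. Below is a simplified sketch of a UTF-8 decoder; the
name utf8_decode() is an assumption for illustration, and unlike a real
decoder it does not reject overlong forms or surrogates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: decode one UTF-8 sequence of 1 to 4 bytes
 * starting at s. Stores the code point in *cp and returns the number
 * of bytes consumed, or 0 if the sequence is truncated or malformed.
 * Simplified: overlong forms and surrogates are not rejected.
 */
size_t
utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
	size_t i, n;
	uint32_t c;

	assert(s != NULL && cp != NULL);
	if (len == 0)
		return 0;
	if (s[0] < 0x80) {			/* 0xxxxxxx */
		n = 1;
		c = s[0];
	} else if ((s[0] & 0xe0) == 0xc0) {	/* 110xxxxx */
		n = 2;
		c = s[0] & 0x1f;
	} else if ((s[0] & 0xf0) == 0xe0) {	/* 1110xxxx */
		n = 3;
		c = s[0] & 0x0f;
	} else if ((s[0] & 0xf8) == 0xf0) {	/* 11110xxx */
		n = 4;
		c = s[0] & 0x07;
	} else
		return 0;			/* stray continuation byte */
	if (n > len)
		return 0;
	for (i = 1; i < n; i++) {
		if ((s[i] & 0xc0) != 0x80)
			return 0;
		/* Each continuation byte contributes 6 payload bits. */
		c = (c << 6) | (s[i] & 0x3f);
	}
	*cp = c;
	return n;
}
```

For 0xD2 0x88 the lead byte contributes the bits 10010 and the
continuation byte contributes 001000, which together give 0x488.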

During rendering, combining characters are combined with a preceding
character. However, your example uses the combining characters without
any preceding character. I don't think we can expect reasonable behaviour
for such input.

Let's write just those two combining characters to a file:
$ printf '\xd2\x88\xd2\x89' > /tmp/p

In xterm, gnome-terminal, and xfce4-terminal I see no output:

$ cat /tmp/p
$

When I open this file in vim I see a symbol that looks like an overlay
of both combining characters. This shows that results will be ambiguous
for such input, which could be argued to be invalid Unicode, even though
the string uses entirely valid UTF-8 encoding.

If we put the 'Cyrillic Capital Letter Ksi (U+046E)' immediately
before the two combining characters, I see an appropriately embellished
letter Ksi in the terminal:

$ printf '\xd1\xae\xd2\x88\xd2\x89\n' 
Ѯ҈҉
$

In this case rs(1) should see a character of width 1 followed by two
characters of width 0.

> The only  part of my change  that I wasn't  certain how to test  was the
> code found in this condition block:
> 
>                 } else if ((width = wcwidth(wc)) == -1) {
> 
> So if  wcwidth() returns  -1 that  indicates that  the character  is not
> printable; perhaps characters 0x01--0x1f?

Yes, and there are more unprintable ones within the entire range of
Unicode code points.

> I think that even in this case
> the code still  wants to advance bytes_used by the  length of the bytes,
> right?

Indeed. Any UTF-8 character, whether printable or not, is encoded as a
byte string of 1 to 4 bytes, so bytes_used has to advance either way.
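
To illustrate the point, here is a rough sketch of such a loop. It is not
the code from the patch: display_width() is a made-up helper, and both
treating unprintable characters as width 0 and counting an invalid byte
as one column are assumptions on my part.

```c
#define _XOPEN_SOURCE 700	/* for wcwidth() on some systems */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

/*
 * Hypothetical helper, not the code from the patch.
 * Walk a multibyte string in the current locale and return the number
 * of display columns. *bytes_used advances for every character,
 * whether or not it is printable.
 */
int
display_width(const char *s, size_t len, size_t *bytes_used)
{
	wchar_t wc;
	int n, width, cols = 0;

	assert(s != NULL && bytes_used != NULL);
	*bytes_used = 0;
	(void)mbtowc(NULL, NULL, 0);	/* reset conversion state */
	while (*bytes_used < len) {
		n = mbtowc(&wc, s + *bytes_used, len - *bytes_used);
		if (n == -1) {
			/* Assumption: skip an invalid byte, count 1 column. */
			(void)mbtowc(NULL, NULL, 0);
			n = 1;
			width = 1;
		} else if (n == 0) {
			/* Embedded NUL byte: one byte, no columns. */
			n = 1;
			width = 0;
		} else if ((width = wcwidth(wc)) == -1) {
			/* Not printable: no columns, bytes still consumed. */
			width = 0;
		}
		*bytes_used += n;
		cols += width;
	}
	return cols;
}
```

A caller would run setlocale(LC_CTYPE, "") first so that mbtowc() decodes
UTF-8. For the input "a\x01" "b" the control character adds no columns,
but bytes_used still ends up at 3.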

> Updated patch:

Looks good to me. But I would like to get feedback from at least one
additional developer before committing it.