Implement -E for rs(1)
On Thu, Mar 14, 2024 at 10:31:51PM -0600, Andy Bradford wrote:
> With the changes that I made to the code I was able to obtain the output
> that you expect, but not with the arguments you suggested. Here are 3
> iterations:
>
> $ export LC_CTYPE=en_US.UTF-8
> $ printf '\xe6\xbc\xa2\xe5\xad\x97\xe6\xbc\xa2\xe5\xad\x97\n' | ./rs -E 4
> 漢
> 字
> 漢
> 字
> $ printf '\xe6\xbc\xa2\xe5\xad\x97\xe6\xbc\xa2\xe5\xad\x97\n' | ./rs -E 0 4
> 漢 字 漢 字
> $ printf '\xe6\xbc\xa2\xe5\xad\x97\xe6\xbc\xa2\xe5\xad\x97\n' | ./rs -E 0 2
> 漢 字
> 漢 字
>
> So you can see that only when I specify 2 columns of output do I get
> what you expect. Is this correct?
I made a mistake in assuming that requesting 4 columns would yield the
output I showed you. The new behaviour you implemented seems correct to me.
> > Getting the array tabulated correctly might be more difficult if a string
> > contains a mix of characters using 1 column and characters using > 1 column.
>
> Is this what you mean?
>
> $ export LC_CTYPE=en_US.UTF-8
> $ printf '\xcf\x80\xe6\xbc\xa2\xe5\xad\x97\xe6\xbc\xa2\xe5\xad\x97\xcf\x80\n' | ./rs -E 0 2
> π 漢
> 字 漢
> 字 π
>
Yes, very nice!
> I did some experimentation with other characters and found some oddities
> that I don't believe are introduced with my changes but exist somewhere
> else in the library I believe.
>
> Here it is with my changes:
>
> $ printf '\xc2\xab\xd2\x88\xd2\x89\xd1\xae25\xc2\xba67\xc2\xba35\xc2\xba\xcf\x86\xcf\x80\xc2\xbb\n' | ./rs -E 0 2
> « ҈
> ҉ Ѯ
> 2 5
> º 6
> 7 º
> 3 5
> º φ
> π »
>
> And here it is with the rs included with 7.4:
>
> $ printf '\xc2\xab \xd2\x88 \xd2\x89 \xd1\xae 2 5 \xc2\xba 6 7 \xc2\xba 3 5 \xc2\xba \xcf\x86 \xcf\x80 \xc2\xbb\n' | /usr/bin/rs -E 0 2
> « ҈
> ҉ Ѯ
> 2 5
> º 6
> 7 º
> 3 5
> º φ
> π »
>
> Notice that in both cases, the gutter width on line 2 (row 2) is
> actually 3 spaces instead of two as the rest of the columns appear to
> be. I'm not sure why this is (perhaps the character somehow got a space
> prepended). I also notice some strange things if I just output that text
> line without filtering through rs (the characters seem to smash each
> other at the beginning of the string).
The above test case involves combining characters.
With combining characters one can build a single composite character out
of two or more code points. For example, the German letter ä can be represented
in its entirety by Unicode code point 0xE4 https://unicodeplus.com/U+00E4
or it can be represented by the ASCII character 'a' (code point 0x61) followed by
the code point for a combining diaeresis (two upper dots):
https://unicodeplus.com/U+0308
Combining characters appear with width 0 to rs(1).
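To make the two representations concrete, here is a minimal sketch that
counts code points in a UTF-8 string. The function name utf8_codepoints is
made up for this example and the code assumes valid UTF-8 input; it is not
taken from rs(1).

```c
#include <stddef.h>

/*
 * Hypothetical helper: count Unicode code points in a UTF-8 string by
 * counting every byte that is not a continuation byte (10xxxxxx).
 * Assumes the input is valid UTF-8.
 */
size_t
utf8_codepoints(const char *s)
{
	size_t n = 0;

	for (; *s != '\0'; s++)
		if (((unsigned char)*s & 0xc0) != 0x80)
			n++;
	return n;
}
```

The precomposed form "\xc3\xa4" (U+00E4) counts as one code point, while
"a\xcc\x88" ('a' followed by U+0308) counts as two, even though both are
meant to render as a single ä.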
Your test case involves the following combining characters:
# pkg_add uniutils
$ printf '\xd2\x88'| ExplicateUTF8 | head -n 3
The sequence 0xD2 0x88
11010010 10001000
is a valid UTF-8 character encoding equivalent to UTF32 0x00000488.
'Combining Cyrillic Hundred Thousands Sign (U+0488)'
https://unicodeplus.com/U+0488, rendered as: ҈
$ printf '\xd2\x89'| ExplicateUTF8 | head -n 3
The sequence 0xD2 0x89
11010010 10001001
is a valid UTF-8 character encoding equivalent to UTF32 0x00000489.
'Combining Cyrillic Millions Sign (U+0489)'
https://unicodeplus.com/U+0489, rendered as: ҉
During rendering, combining characters are combined with a preceding
character. However, your example uses the combining characters without
any preceding character. I don't think we can expect reasonable behaviour
for such input.
Let's write just those two combining characters to a file:
$ printf '\xd2\x88\xd2\x89' > /tmp/p
In xterm, gnome-terminal, and xfce4-terminal I see no output:
$ cat /tmp/p
$
When I open this file in vim I see a symbol that looks like an overlay
of both combining characters. This shows that results will be ambiguous
for such input, which could be argued to be invalid Unicode even though
the string uses entirely valid UTF-8 encoding.
If we put the 'Cyrillic Capital Letter Ksi (U+046E)' immediately
before the two combining characters, I see an appropriately embellished
letter Ksi in the terminal:
$ printf '\xd1\xae\xd2\x88\xd2\x89\n'
Ѯ҈҉
$
In this case rs(1) should see a character of width 1 followed by two
characters of width 0.
> The only part of my change that I wasn't certain how to test was the
> code found in this condition block:
>
> } else if ((width = wcwidth(wc)) == -1) {
>
> So if wcwidth() returns -1 that indicates that the character is not
> printable; perhaps characters 0x01--0x1f?
Yes, and there are more non-printable ones elsewhere in the range of
Unicode code points.
> I think that even in this case
> the code still wants to advance bytes_used by the length of the bytes,
> right?
Indeed. The UTF-8 encoding of any character, whether printable or not,
can be up to 4 bytes in length.
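As a rough illustration of that point (a sketch only, not code from the
patch), the length of a UTF-8 sequence can be read off its first byte
without knowing whether the character is printable. The helper name
utf8_seq_len is made up for this example:

```c
#include <stddef.h>

/*
 * Hypothetical helper, not taken from the patch: return the length in
 * bytes of the UTF-8 sequence introduced by lead byte b, or 0 if b
 * cannot start a sequence (continuation byte or invalid lead byte).
 */
size_t
utf8_seq_len(unsigned char b)
{
	if (b < 0x80)
		return 1;	/* 0xxxxxxx: ASCII, including 0x01-0x1f */
	if (b >= 0xc2 && b <= 0xdf)
		return 2;	/* 110xxxxx, minus overlong 0xc0 and 0xc1 */
	if (b >= 0xe0 && b <= 0xef)
		return 3;	/* 1110xxxx */
	if (b >= 0xf0 && b <= 0xf4)
		return 4;	/* 11110xxx, up to U+10FFFF */
	return 0;
}
```

A control character such as 0x01 yields 1 and the lead byte 0xd2 of the
combining characters above yields 2, so the byte count can advance by
that amount regardless of what wcwidth() reports for the character.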
> Updated patch:
Looks good to me. But I would like to get feedback from at least one
additional developer before committing it.