From: Stefan Sperling
Subject: Re: Implement -E for rs(1)
To: Andy Bradford
Cc: tech@openbsd.org
Date: Fri, 15 Mar 2024 10:36:21 +0100

On Thu, Mar 14, 2024 at 10:31:51PM -0600, Andy Bradford wrote:
> With the changes that I made to the code I was able to obtain the output
> that you expect, but not with the arguments you suggested. Here are 3
> iterations:
>
> $ export LC_CTYPE=en_US.UTF-8
> $ printf '\xe6\xbc\xa2\xe5\xad\x97\xe6\xbc\xa2\xe5\xad\x97\n' | ./rs -E 4
> 漢
> 字
> 漢
> 字
> $ printf '\xe6\xbc\xa2\xe5\xad\x97\xe6\xbc\xa2\xe5\xad\x97\n' | ./rs -E 0 4
> 漢 字 漢 字
> $ printf '\xe6\xbc\xa2\xe5\xad\x97\xe6\xbc\xa2\xe5\xad\x97\n' | ./rs -E 0 2
> 漢 字
> 漢 字
>
> So you can see that only when I specify 2 columns of output do I get
> what you expect. Is this correct?

I made a mistake in assuming that requesting 4 columns would yield the
output I showed you. The new behaviour you implemented seems correct to me.

> > Getting the array tabulated correctly might be more difficult if a string
> > contains a mix of characters using 1 column and characters using > 1 column.
>
> Is this what you mean?
>
> $ export LC_CTYPE=en_US.UTF-8
> $ printf '\xcf\x80\xe6\xbc\xa2\xe5\xad\x97\xe6\xbc\xa2\xe5\xad\x97\xcf\x80\n' | ./rs -E 0 2
> π 漢
> 字 漢
> 字 π
>

Yes, very nice!

> I did some experimentation with other characters and found some oddities
> that I don't believe are introduced with my changes but exist somewhere
> else in the library I believe.
>
> Here it is with my changes:
>
> $ printf '\xc2\xab\xd2\x88\xd2\x89\xd1\xae25\xc2\xba67\xc2\xba35\xc2\xba\xcf\x86\xcf\x80\xc2\xbb\n' | ./rs -E 0 2
> « ҈
> ҉ Ѯ
> 2 5
> º 6
> 7 º
> 3 5
> º φ
> π »
>
> And here it is with the rs included with 7.4:
>
> $ printf '\xc2\xab \xd2\x88 \xd2\x89 \xd1\xae 2 5 \xc2\xba 6 7 \xc2\xba 3 5 \xc2\xba \xcf\x86 \xcf\x80 \xc2\xbb\n' | /usr/bin/rs -E 0 2
> « ҈
> ҉ Ѯ
> 2 5
> º 6
> 7 º
> 3 5
> º φ
> π »
>
> Notice that in both cases, the gutter width on line 2 (row 2) is
> actually 3 spaces instead of two as the rest of the columns appear to
> be. I'm not sure why this is (perhaps the character somehow got a space
> prepended). I also notice some strange things if I just output that text
> line without filtering through rs (the characters seem to smash each
> other at the beginning of the string).

The above test case involves combining characters.
With combining characters one can build arbitrary composite glyphs out of
two or more code points.

For example, the German letter ä can be represented in its entirety by
Unicode code point 0xE4
https://unicodeplus.com/U+00E4
or it can be represented by the ASCII character 'a' (0x61) followed by
the code point for a combining diaeresis (two upper dots):
https://unicodeplus.com/U+0308

Combining characters appear with width 0 to rs(1).

Your test case involves the following combining characters:

# pkg_add uniutils

$ printf '\xd2\x88'| ExplicateUTF8 | head -n 3
The sequence 0xD2 0x88
11010010 10001000
is a valid UTF-8 character encoding equivalent to UTF32 0x00000488.

'Combining Cyrillic Hundred Thousands Sign (U+0488)'
https://unicodeplus.com/U+0488, rendered as: ҈

$ printf '\xd2\x89'| ExplicateUTF8 | head -n 3
The sequence 0xD2 0x89
11010010 10001001
is a valid UTF-8 character encoding equivalent to UTF32 0x00000489.

'Combining Cyrillic Millions Sign (U+0489)'
https://unicodeplus.com/U+0489, rendered as: ҉

During rendering, combining characters are combined with a preceding character.
However, your example uses the combining characters without any preceding
character. I don't think we can expect reasonable behaviour for such input.

Let's write just those two combining characters to a file:

$ printf '\xd2\x88\xd2\x89' > /tmp/p

In xterm, gnome-terminal, and xfce4-terminal I see no output:

$ cat /tmp/p
$

When I open this file in vim I see a symbol that looks like an overlay of
both combining characters. This shows that results will be ambiguous for
such input, which could be argued to be invalid Unicode, even though the
string uses entirely valid UTF-8 encoding.

If we put the 'Cyrillic Capital Letter Ksi (U+046E)' immediately before
the two combining characters, I see an appropriately embellished letter
Ksi in the terminal:

$ printf '\xd1\xae\xd2\x88\xd2\x89\n'
Ѯ҈҉
$

In this case rs(1) should see a character of width 1 followed by two
characters of width 0.

> The only part of my change that I wasn't certain how to test was the
> code found in this condition block:
>
> } else if ((width = wcwidth(wc)) == -1) {
>
> So if wcwidth() returns -1 that indicates that the character is not
> printable; perhaps characters 0x01--0x1f?

Yes, and there are more unprintable ones within the entire range of
Unicode code points.

> I think that even in this case
> the code still wants to advance bytes_used by the length of the bytes,
> right?

Indeed. Any UTF-8 character byte string, whether printable or not, can be
up to 4 bytes in length.

> Updated patch:

Looks good to me. But I would like to get feedback from at least one
additional developer before committing it.
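
For illustration, the wcwidth() handling discussed above could look roughly
like the sketch below. This is not the code from the patch. The function
name display_width() and its parameters are made up for this example, and
counting an undecodable byte as one column is an assumption, not something
rs(1) is known to do. Like the quoted condition, it gives a character no
columns when wcwidth() returns -1 but still advances past its bytes.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700	/* some systems need this to declare wcwidth() */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

/*
 * Hypothetical helper: return the number of display columns used by the
 * first len bytes of s in the current LC_CTYPE locale, and store the
 * number of bytes consumed in *bytes_used.
 */
int
display_width(const char *s, size_t len, size_t *bytes_used)
{
	wchar_t wc;
	size_t off = 0;
	int cols = 0, n, width;

	assert(s != NULL && bytes_used != NULL);

	(void)mbtowc(NULL, NULL, 0);	/* reset the conversion state */
	while (off < len) {
		n = mbtowc(&wc, s + off, len - off);
		if (n == -1) {
			/* Assumption: an invalid byte takes one column. */
			(void)mbtowc(NULL, NULL, 0);
			n = 1;
			width = 1;
		} else if (n == 0) {
			/* Embedded NUL: one byte, no columns. */
			n = 1;
			width = 0;
		} else if ((width = wcwidth(wc)) == -1) {
			/* Not printable: no columns, bytes still consumed. */
			width = 0;
		}
		off += (size_t)n;
		cols += width;
	}
	*bytes_used = off;
	return cols;
}
```

A caller would first select a UTF-8 locale with setlocale(3). For the Ksi
example above, the six bytes '\xd1\xae\xd2\x88\xd2\x89' should then be
consumed in full, and the result should be one column, provided the
locale reports width 0 for both combining characters.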