From: Ingo Schwarze Subject: ksh vi mode: UTF-8 handling in 'p' command To: tech@openbsd.org Cc: Walter Alejandro Iglesias Date: Mon, 14 Apr 2025 20:20:16 +0200 [ I moved the thread from bugs@ to tech@ because what we are doing here goes beyond fixing bugs; these issues feel more like missing features. ] Walter pointed out multiple issues with UTF-8 handling in the ksh(1) VI mode command line editor. Here is a patch fixing one of these: UTF-8 handling in the "paste after" (p) command. I'm asking for an OK for the patch below. Is there a risk of overflowing some media or something like that? To my surprise, /usr/src/distrib/special/ksh/Makefile appears to compile in both the VI mode and the emacs mode command line editor. This is not the last patch that will ,make the shell grow by a few bytes. How do we test such patches from a "make release" perspective without causing unsustainable testing effort on multiple platforms, including those i don't have and those that are very slow? Rationale --------- From the user perspective, the cursor is located on a user-perceived character. It is an implementation detail invisible to the user that internally, the cursor actually sits on the first byte of the character. Consequently, * before inserting anything after a UTF-8 character, the cursor needs to be advanced to the first byte of the subsequent character * after inserting a UTF-8 character, the cursor needs to be backed up to the first byte of the last character inserted. I deleted the comments that i included in the first version of this patch. Even though it is arguably not obvious why a loop with one per-increment and one post-increment termination condition is needed here, the idiom for (;;) { if (would overrun) break; increment; if (beginning of character found) break; } in the compact form while (not overrun) if (not at beginning of character after increment) break; is widespread enough in this file that commenting it every time would get too wordy. Background ---------- After looking at the code, i conclude that more work was done on UTF-8 handling in emacs mode than in VI mode. There are multiple gaps regarding UTF-8 handling in VI mode. Even though some of them share similar properties, i decided to fix only a single issue in this patch because in addition to the common aspects, the others also have differing aspects complicating the picture. Let's keep things simple by doing one step at a time. Additional issues i'm aware of include similar code implementing the "paste befire" (P) and "delete to EOL" (D) commands. There also appears to be a potential segfault in the "append" (a) xommand, and then there is the issue that Walter reported first, which i did not yet look at (but his patch is almost certainly incorrect). Again, one thing at a time. Even though quite a few UTF-8 features are already supported in VI mode, the test suite for VI mode contains very few tests involving UTF-8, in stark contrast to emacs mode, where most of the test suite is about UTF-8. I think it makes sense to slowly grow the test suite, in particular when adding new features. In those few cases where the VI mode testsuite already tests UTF-8, the suite is somewhat prone to overtesting. For example, for the input "\0303\0266\0033ro" (i.e. UTF-8 char, escape, replace with ASCII) the suite expects the output " # \0303\0266\bo \b\b", that is, not only printing the ASCII character over the former UTF-8 character and backing up the cursor such that it sits on the ASCII character, but printing an additional blank character and backing up another position. The reason is that in file vi.c, the function display() prints at least two bytes when overwriting a UTF-8 character. While that is clearly more robust and less complicated than trying to figure out whether what we are overprinting is a valid UTF-8 charcter, in which case overprinting just a single column would be sufficient, so even though i do not intend to change the code, from the perspective of the test, it would be just fine if the output were " # \0303\0266\bo\b", there is no good reason to test for the additional " \b". Then again, i do not see how this slight overtesting can be avoided, so for now, i'll stick to the existing style. In addition, i freely admit that the tests are not very readable. Then again, it's not obvious to me how to make them more readable - testing a command line editor is not an easy task. I doubt that adding long comments would help - i fear that would only hide the actual tests in a deluge of explanations. In this case, the three new tests test that 1. Adding an ASCII character after a UTF-8 character does not put the new character in the middle of the UTF-8 sequence; 2. After adding a UTF-8 character, the cursor backs up to the right place; 3. Both work at the same time, i.e. adding a UTF-8 character after another UTF-8 character works. Yours, Ingo Index: bin/ksh/vi.c =================================================================== RCS file: /cvs/src/bin/ksh/vi.c,v diff -u -p -r1.60 vi.c --- bin/ksh/vi.c 12 Mar 2021 02:10:25 -0000 1.60 +++ bin/ksh/vi.c 14 Apr 2025 18:18:45 -0000 @@ -834,12 +834,14 @@ vi_cmd(int argcnt, const char *cmd) case 'p': modified = 1; hnum = hlast; - if (es->linelen != 0) - es->cursor++; + while (es->cursor < es->linelen) + if (!isu8cont(es->cbuf[++es->cursor])) + break; while (putbuf(ybuf, yanklen, 0) == 0 && --argcnt > 0) ; - if (es->cursor != 0) - es->cursor--; + while (es->cursor > 0) + if (!isu8cont(es->cbuf[--es->cursor])) + break; if (argcnt != 0) return -1; break; Index: regress/bin/ksh/edit/vi.sh =================================================================== RCS file: /cvs/src/regress/bin/ksh/edit/vi.sh,v diff -u -p -r1.9 vi.sh --- regress/bin/ksh/edit/vi.sh 2 Sep 2021 07:14:15 -0000 1.9 +++ regress/bin/ksh/edit/vi.sh 14 Apr 2025 18:18:45 -0000 @@ -123,6 +123,11 @@ testseq "abcde\0033hDh2P" " # abcde\b\b # p: Paste after current position. testseq "abcd\0033hDhp" " # abcd\b\b \b\b\b\bacdb\b\b" testseq "abcd\0033hDh2p" " # abcd\b\b \b\b\b\bacdcdb\b\b" +testseq "A\0033xa\0303\0200\0033px" " # A\b \b\0303\0200\b\0303\0200A\b \b\b" +testseq "\0302\0251\0033xaA\0033px" \ + " # \0302\0251\b \b\bA\bA\0302\0251\b \b\b\b" +testseq "\0302\0251\0033xa\0303\0200\0033px" \ + " # \0302\0251\b \b\b\0303\0200\b\0303\0200\0302\0251\b \b\b\b" # R: Replace. testseq "abcd\00332h2Rx\0033" " # abcd\b\b\bxx\b"