From: Ingo Schwarze <schwarze@usta.de>
Subject: ksh vi mode: UTF-8 handling in 'p' command
To: tech@openbsd.org
Cc: Walter Alejandro Iglesias <wai@roquesor.com>
Date: Mon, 14 Apr 2025 20:20:16 +0200

[ I moved the thread from bugs@ to tech@ because what we are doing here
  goes beyond fixing bugs; these issues feel more like missing features. ]

Walter pointed out multiple issues with UTF-8 handling in the ksh(1) 
VI mode command line editor.  Here is a patch fixing one of these:
UTF-8 handling in the "paste after" (p) command.

I'm asking for an OK for the patch below.

Is there a risk of overflowing some media or something like that?
To my surprise, /usr/src/distrib/special/ksh/Makefile appears to
compile in both the VI mode and the emacs mode command line editor.

This is not the last patch that will ,make the shell grow by a few
bytes.  How do we test such patches from a "make release" perspective
without causing unsustainable testing effort on multiple platforms,
including those i don't have and those that are very slow?


Rationale
---------
From the user perspective, the cursor is located on a user-perceived
character.  It is an implementation detail invisible to the user
that internally, the cursor actually sits on the first byte of the
character.

Consequently,
 * before inserting anything after a UTF-8 character, the cursor needs
   to be advanced to the first byte of the subsequent character
 * after inserting a UTF-8 character, the cursor needs to be backed up
   to the first byte of the last character inserted.

I deleted the comments that i included in the first version of this
patch.  Even though it is arguably not obvious why a loop with one
per-increment and one post-increment termination condition is needed
here, the idiom

  for (;;) {
	if (would overrun)
		break;
	increment;
	if (beginning of character found)
		break;
  }

in the compact form

  while (not overrun)
	if (not at beginning of character after increment)
		break;

is widespread enough in this file that commenting it every time
would get too wordy.


Background
----------
After looking at the code, i conclude that more work was done on
UTF-8 handling in emacs mode than in VI mode.  There are multiple
gaps regarding UTF-8 handling in VI mode.  Even though some of them
share similar properties, i decided to fix only a single issue in
this patch because in addition to the common aspects, the others
also have differing aspects complicating the picture.  Let's
keep things simple by doing one step at a time.

Additional issues i'm aware of include similar code implementing
the "paste befire" (P) and "delete to EOL" (D) commands.  There also
appears to be a potential segfault in the "append" (a) xommand, and
then there is the issue that Walter reported first, which i did not
yet look at (but his patch is almost certainly incorrect).
Again, one thing at a time.

Even though quite a few UTF-8 features are already supported
in VI mode, the test suite for VI mode contains very few tests
involving UTF-8, in stark contrast to emacs mode, where most of
the test suite is about UTF-8.  I think it makes sense to slowly
grow the test suite, in particular when adding new features.

In those few cases where the VI mode testsuite already tests UTF-8,
the suite is somewhat prone to overtesting.  For example, for the
input "\0303\0266\0033ro" (i.e. UTF-8 char, escape, replace with ASCII)
the suite expects the output " # \0303\0266\bo \b\b", that is, not only
printing the ASCII character over the former UTF-8 character and backing
up the cursor such that it sits on the ASCII character, but printing
an additional blank character and backing up another position.
The reason is that in file vi.c, the function display() prints at
least two bytes when overwriting a UTF-8 character.  While that is
clearly more robust and less complicated than trying to figure out
whether what we are overprinting is a valid UTF-8 charcter, in which
case overprinting just a single column would be sufficient, so even
though i do not intend to change the code, from the perspective of
the test, it would be just fine if the output were " # \0303\0266\bo\b",
there is no good reason to test for the additional " \b".
Then again, i do not see how this slight overtesting can be avoided,
so for now, i'll stick to the existing style.

In addition, i freely admit that the tests are not very readable.
Then again, it's not obvious to me how to make them more readable -
testing a command line editor is not an easy task.  I doubt that
adding long comments would help - i fear that would only hide
the actual tests in a deluge of explanations.

In this case, the three new tests test that
 1. Adding an ASCII character after a UTF-8 character does not
    put the new character in the middle of the UTF-8 sequence;
 2. After adding a UTF-8 character, the cursor backs up
    to the right place;
 3. Both work at the same time, i.e. adding a UTF-8 character after
    another UTF-8 character works.

Yours,
  Ingo


Index: bin/ksh/vi.c
===================================================================
RCS file: /cvs/src/bin/ksh/vi.c,v
diff -u -p -r1.60 vi.c
--- bin/ksh/vi.c	12 Mar 2021 02:10:25 -0000	1.60
+++ bin/ksh/vi.c	14 Apr 2025 18:18:45 -0000
@@ -834,12 +834,14 @@ vi_cmd(int argcnt, const char *cmd)
 
 		case 'p':
 			modified = 1; hnum = hlast;
-			if (es->linelen != 0)
-				es->cursor++;
+			while (es->cursor < es->linelen)
+				if (!isu8cont(es->cbuf[++es->cursor]))
+					break;
 			while (putbuf(ybuf, yanklen, 0) == 0 && --argcnt > 0)
 				;
-			if (es->cursor != 0)
-				es->cursor--;
+			while (es->cursor > 0)
+				if (!isu8cont(es->cbuf[--es->cursor]))
+					break;
 			if (argcnt != 0)
 				return -1;
 			break;
Index: regress/bin/ksh/edit/vi.sh
===================================================================
RCS file: /cvs/src/regress/bin/ksh/edit/vi.sh,v
diff -u -p -r1.9 vi.sh
--- regress/bin/ksh/edit/vi.sh	2 Sep 2021 07:14:15 -0000	1.9
+++ regress/bin/ksh/edit/vi.sh	14 Apr 2025 18:18:45 -0000
@@ -123,6 +123,11 @@ testseq "abcde\0033hDh2P" " # abcde\b\b 
 # p: Paste after current position.
 testseq "abcd\0033hDhp" " # abcd\b\b  \b\b\b\bacdb\b\b"
 testseq "abcd\0033hDh2p" " # abcd\b\b  \b\b\b\bacdcdb\b\b"
+testseq "A\0033xa\0303\0200\0033px" " # A\b \b\0303\0200\b\0303\0200A\b \b\b"
+testseq "\0302\0251\0033xaA\0033px" \
+	" # \0302\0251\b  \b\bA\bA\0302\0251\b  \b\b\b" 
+testseq "\0302\0251\0033xa\0303\0200\0033px" \
+	" # \0302\0251\b  \b\b\0303\0200\b\0303\0200\0302\0251\b  \b\b\b"
 
 # R: Replace.
 testseq "abcd\00332h2Rx\0033" " # abcd\b\b\bxx\b"