From: Ingo Schwarze <schwarze@usta.de>
Subject: ksh vi mode: prevent display corruption if line starts with utf8cont
To: tech@openbsd.org
Date: Tue, 29 Apr 2025 21:14:30 +0200

Hello,

unfortunately, ksh(1) VI mode contains multiple situations where the cursor
can run out of the buffer when UTF-8 continuation bytes appear in places
where they do not form valid UTF-8 sequences.  In a few cases, i have
even seen the shell die from SIGSEGV as a result.

Before i can really start to track down and fix these one by one,
or address the issue with the 'e' command that Walter reported,
the function display() needs to be fixed.  Right now, invalid UTF-8
at the beginning of the command line can corrupt the display so badly
that working on other issues is seriously hindered.

At this point, the goal is *not* for display() to calculate correct
columniation because that would require full UTF-8 syntax validation
plus use of wcwidth(3).  The goals are
 * correct columniation if all characters are valid and width 1
 * even for arbitary byte sequences, no display corruption
To achieve these goals, the following assumptions are made:
 * UTF-8 continuation bytes are assumed to be width 0
 * all other bytes are assumed to be width 1

Now let's consider the following patch:

 1st chunk:
   This is in the first loop buffering bytes from the command line
   buffer cbuf[] into the display buffer wb1[], counting columns
   in "col" as we go.
   If we find a byte that is not an ASCII control character,
   it advances to the next column if it is not a UTF-8 continuation
   byte (i.e. if it is a printable ASCII byte or a UTF-8 start byte).
   What is missing here is that in the first column, even a UTF-8
   continuation byte starts the column - that's not valid UTF-8
   syntax, but xterm(1) will typically display it as the
   U+FFFD REPLACEMENT CHARACTER, which is width 1.

 2nd chunk:
   This is in the second loop comparing the display buffer wb1[]
   to the former display buffer wb2[] and writing characters
   that changed  to the terminal.
   The test "col > 0" is intended to prevent backing up past the
   left margin, but it is wrong for two reasons.
   First, the test is ineffective because we have col == pwidth
   (the width of the prompt) at the beginning of the loop, not col == 0.
   Secondly, decrementing the column number (col) and backing up the
   byte pointer (twb1) have to be independent of each other.
   Consequently, the block needs to be entered for all UTF-8
   continuation bytes; the column number needs to be decremented
   unless we are still in the starting column; and the byte pointer
   needs to be backed up unless we are still at the beginning of the
   buffer.  Even if we are still in the starting column, it can happen
   that we are no longer at the beginning of the buffer: that happens
   when the buffer starts with more than one UTF-8 continuation byte.

 3rd chunk:
   This is also in the second loop, but regards the handling of
   unchanged bytes.  It skips the column increment for UTF-8
   continuation bytes.
   But we should *not* skip advancing the column for the first byte
   on the command line.

Without these three changes, a line starting with one or more
UTF-8 continuation bytes corrupts the display in so far as
 * the first loop undercounts the number of columns by one
 * the second loop overwrites the last character of the prompt
   with the beginning of the command line buffer.

The new test inputs a three-byte UTF-8 character in an unusual way:
it first inserts the two continuation bytes, then jumps back to the
beginning of the line and inserts the UTF-8 start byte.  It checks
that the display does not get corrupted and that the correctly
assembled three-byte UTF-8 character appears at the end.

In case you wonder:  No additional logic is needed when the window
has moved horizontally (winleft > 0).  That's because rewindow()
makes sure that winleft never points to a UTF-8 continuation byte.
That's guaranteed because newcol() never changes the column number
for UTF-8 continuation bytes.

OK?
  Ingo

P.S.
If someone can tell me about a way to type in arbitrary bytes -
in particular bytes > 0x7f - into xterm(1) or wscons(4), that
would be appreciated to make manual testing easier.


Index: bin/ksh/vi.c
===================================================================
RCS file: /cvs/src/bin/ksh/vi.c,v
diff -u -p -r1.63 vi.c
--- bin/ksh/vi.c	27 Apr 2025 17:06:58 -0000	1.63
+++ bin/ksh/vi.c	29 Apr 2025 16:51:05 -0000
@@ -1869,7 +1869,7 @@ display(char *wb1, char *wb2, int leftsi
 					}
 				} else {
 					*twb1++ = ch;
-					if (!isu8cont(ch))
+					if (col == 0 || !isu8cont(ch))
 						col++;
 				}
 			}
@@ -1908,8 +1908,9 @@ display(char *wb1, char *wb2, int leftsi
 			 * the previous byte was the last one written.
 			 */
 
-			if (col > 0 && isu8cont(*twb1)) {
-				col--;
+			if (isu8cont(*twb1)) {
+				if (col > pwidth)
+					col--;
 				if (lastb >= 0 && twb1 == wb1 + lastb + 1)
 					cur_col = col;
 				else while (twb1 > wb1 && isu8cont(*twb1)) {
@@ -1933,7 +1934,7 @@ display(char *wb1, char *wb2, int leftsi
 			}
 			lastb = *twb1 & 0x80 ? twb1 - wb1 : -1;
 			cur_col++;
-		} else if (isu8cont(*twb1))
+		} else if (twb1 > wb1 && isu8cont(*twb1))
 			continue;
 
 		/*
Index: regress/bin/ksh/edit/vi.sh
===================================================================
RCS file: /cvs/src/regress/bin/ksh/edit/vi.sh,v
diff -u -r1.12 vi.sh
--- regress/bin/ksh/edit/vi.sh	27 Apr 2025 17:06:58 -0000	1.12
+++ regress/bin/ksh/edit/vi.sh	29 Apr 2025 16:51:30 -0000
@@ -103,6 +103,7 @@
 # i: Insert.
 testseq "hello\00332hix" " # hello\b\b\bxllo\b\b\b"
 testseq "hello\00332\b2ix\0033" " # hello\b\b\bxllo\b\b\bxllo\b\b\b\b"
+testseq "\0270\0202A\00330i\0340\0033" " # \0270\0202A\b\b\0340\0270\0202A\b\b"
 
 # I: Insert before first non-blank.
 # ^: Move to the first non-whitespace character.