From: Ingo Schwarze <schwarze@usta.de>
Subject: Re: ksh vi mode: prevent display corruption if line starts with utf8cont
To: tech@openbsd.org
Cc: lucas@openbsd.org
Date: Fri, 16 May 2025 15:18:36 +0200

Hello,

Ingo Schwarze wrote on Tue, Apr 29, 2025 at 09:14:30PM +0200:

> unfortunately, ksh(1) VI mode contains multiple situations where the
> cursor can run out of the buffer when UTF-8 continuation bytes appear
> in places where they do not form valid UTF-8 sequences.  In a few
> cases, i have even seen the shell die from SIGSEGV as a result.
> 
> Before i can really start to track down and fix these one by one,
> or address the issue with the 'e' command that Walter reported,
> the function display() needs to be fixed.  Right now, invalid UTF-8
> at the beginning of the command line can corrupt the display so badly
> that working on other issues is seriously hindered.

After partially exploring various rabbit holes that i will have to
return to later, i finally committed my display() diff yesterday because
i finally convinced myself that it indeed makes things better and that
it is a necessary step, and i also had an OK from lucas@.

I did not yet commit the regression test though because lucas@ pointed
out that some similar sequences of input still misbehave (thanks for
that feedback).  It seems better to first get more aspects of
continuation byte display fixed, then commit a set of related tests
together.

The next step is fixing the function ed_mov_opt(); specifically, the
part of it that advances the cursor from the column with the number
"cur_col" to the column with the number "col".

The problem to solve is this:  If the line starts with a UTF-8
continuation byte, the code ignores it as if it were part of the last
character of the prompt and puts the *next* character right after the
prompt, omitting the UTF-8 continuation byte.

The solution that seems easiest to understand restructures the loop in
several respects:
 * Terminate the loop when we reach the target column (ci == col) *and*
   we do not need to finish a character we started - the latter can be
   the case because we are still at the left margin and never started
   any character in the first place (ci == pwidth) or because the
   current byte is not a continuation byte.
 * To simplify the condition controlling the actual printing, advance
   the column count up front, but without advancing the byte pointer
   yet, such that "ci" always contains the current column, no matter
   whether we are on a start or continuation byte.
 * UTF-8 continuation bytes advance the column count if and only if they
   are at the beginning of the line.  So if after newcol() we are still
   at the left margin, advance.
 * Now we can simply print the byte if and only if it is to the right of
   what we printer earlier (ci > cur_col).
 * Only advance the byte pointer at the very end of the loop, when we
   really want to consider the next byte.

Testing this requires doing rightward movements over various byte
sequences.  It helps to first type a sufficiently long string such that
the '0' command moves left by doing a carriage return and printing a
fresh prompt over the existing prompt.  Then you can move right in small
steps (e.g. with 'l') or large steps, including with '$'.

Note that when the line starts with a UTF-8 continuation byte and
the cursors sits at the beginnin of the line, the 'l' command
arguably still doesn't do the right thing and advances one character
too far - but that's a separate issue to be fixed later.

I'm not planning to commit the two new tests with this patch because
that's still not a complete suite for this kind of issues and there are
still related unfixed issues that hinder regression testing, but i'm
including them in this mail because they serve as a preliminary
demonstration of the partial progress we are making.

OK?
  Ingo


Index: bin/ksh/vi.c
===================================================================
RCS file: /cvs/src/bin/ksh/vi.c,v
diff -u -p -r1.64 vi.c
--- bin/ksh/vi.c	15 May 2025 17:07:13 -0000	1.64
+++ bin/ksh/vi.c	16 May 2025 12:40:46 -0000
@@ -2006,10 +2006,15 @@ ed_mov_opt(int col, char *wb)
 
 	/* Advance the cursor. */
 
-	for (ci = pwidth; ci < col || isu8cont(*wb);
-	     ci = newcol((unsigned char)*wb++, ci))
-		if (ci > cur_col || (ci == cur_col && !isu8cont(*wb)))
+	ci = pwidth;
+	while (ci < col || (ci > pwidth && isu8cont(*wb))) {
+		ci = newcol((unsigned char)*wb, ci);
+		if (ci == pwidth)
+			ci++;
+		if (ci > cur_col)
 			x_putc(*wb);
+		wb++;
+	}
 	cur_col = ci;
 }
 
Index: regress/bin/ksh/edit/vi.sh
===================================================================
RCS file: /cvs/src/regress/bin/ksh/edit/vi.sh,v
diff -u -p -r1.12 vi.sh
--- regress/bin/ksh/edit/vi.sh	27 Apr 2025 17:06:58 -0000	1.12
+++ regress/bin/ksh/edit/vi.sh	16 May 2025 12:41:13 -0000
@@ -103,6 +103,9 @@ testseq "hello\003302flD" " # hello\b\b\
 # i: Insert.
 testseq "hello\00332hix" " # hello\b\b\bxllo\b\b\b"
 testseq "hello\00332\b2ix\0033" " # hello\b\b\bxllo\b\b\bxllo\b\b\b\b"
+testseq "\0270\0202A\00330i\0340\0033" " # \0270\0202A\b\b\0340\0270\0202A\b\b"
+testseq "\0200abcdef\00330ix\0033" \
+	" # \0200abcdef\b\r # x\0200abcdef\r # x\0200\b"
 
 # I: Insert before first non-blank.
 # ^: Move to the first non-whitespace character.