From: Lucas Gabriel Vuotto <lucas@openbsd.org>
Subject: Re: ksh vi mode: prevent display corruption if line starts with utf8cont
To: Ingo Schwarze <schwarze@usta.de>
Cc: tech@openbsd.org
Date: Sun, 18 May 2025 05:50:33 +0000

Hi Ingo,

This reads fines and does solve the bug I showed to you. OK lucas.

On Fri, May 16, 2025 at 03:18:36PM +0200, Ingo Schwarze wrote:
> Hello,
> 
> Ingo Schwarze wrote on Tue, Apr 29, 2025 at 09:14:30PM +0200:
> 
> > unfortunately, ksh(1) VI mode contains multiple situations where the
> > cursor can run out of the buffer when UTF-8 continuation bytes appear
> > in places where they do not form valid UTF-8 sequences.  In a few
> > cases, i have even seen the shell die from SIGSEGV as a result.
> > 
> > Before i can really start to track down and fix these one by one,
> > or address the issue with the 'e' command that Walter reported,
> > the function display() needs to be fixed.  Right now, invalid UTF-8
> > at the beginning of the command line can corrupt the display so badly
> > that working on other issues is seriously hindered.
> 
> After partially exploring various rabbit holes that i will have to
> return to later, i finally committed my display() diff yesterday because
> i finally convinced myself that it indeed makes things better and that
> it is a necessary step, and i also had an OK from lucas@.
> 
> I did not yet commit the regression test though because lucas@ pointed
> out that some similar sequences of input still misbehave (thanks for
> that feedback).  It seems better to first get more aspects of
> continuation byte display fixed, then commit a set of related tests
> together.
> 
> The next step is fixing the function ed_mov_opt(); specifically, the
> part of it that advances the cursor from the column with the number
> "cur_col" to the column with the number "col".
> 
> The problem to solve is this:  If the line starts with a UTF-8
> continuation byte, the code ignores it as if it were part of the last
> character of the prompt and puts the *next* character right after the
> prompt, omitting the UTF-8 continuation byte.
> 
> The solution that seems easiest to understand restructures the loop in
> several respects:
>  * Terminate the loop when we reach the target column (ci == col) *and*
>    we do not need to finish a character we started - the latter can be
>    the case because we are still at the left margin and never started
>    any character in the first place (ci == pwidth) or because the
>    current byte is not a continuation byte.
>  * To simplify the condition controlling the actual printing, advance
>    the column count up front, but without advancing the byte pointer
>    yet, such that "ci" always contains the current column, no matter
>    whether we are on a start or continuation byte.
>  * UTF-8 continuation bytes advance the column count if and only if they
>    are at the beginning of the line.  So if after newcol() we are still
>    at the left margin, advance.
>  * Now we can simply print the byte if and only if it is to the right of
>    what we printer earlier (ci > cur_col).
>  * Only advance the byte pointer at the very end of the loop, when we
>    really want to consider the next byte.
> 
> Testing this requires doing rightward movements over various byte
> sequences.  It helps to first type a sufficiently long string such that
> the '0' command moves left by doing a carriage return and printing a
> fresh prompt over the existing prompt.  Then you can move right in small
> steps (e.g. with 'l') or large steps, including with '$'.
> 
> Note that when the line starts with a UTF-8 continuation byte and
> the cursors sits at the beginnin of the line, the 'l' command
> arguably still doesn't do the right thing and advances one character
> too far - but that's a separate issue to be fixed later.
> 
> I'm not planning to commit the two new tests with this patch because
> that's still not a complete suite for this kind of issues and there are
> still related unfixed issues that hinder regression testing, but i'm
> including them in this mail because they serve as a preliminary
> demonstration of the partial progress we are making.
> 
> OK?
>   Ingo
> 
> 
> 
> Index: bin/ksh/vi.c
> ===================================================================
> RCS file: /cvs/src/bin/ksh/vi.c,v
> diff -u -p -r1.64 vi.c
> --- bin/ksh/vi.c	15 May 2025 17:07:13 -0000	1.64
> +++ bin/ksh/vi.c	16 May 2025 12:40:46 -0000
> @@ -2006,10 +2006,15 @@ ed_mov_opt(int col, char *wb)
>  
>  	/* Advance the cursor. */
>  
> -	for (ci = pwidth; ci < col || isu8cont(*wb);
> -	     ci = newcol((unsigned char)*wb++, ci))
> -		if (ci > cur_col || (ci == cur_col && !isu8cont(*wb)))
> +	ci = pwidth;
> +	while (ci < col || (ci > pwidth && isu8cont(*wb))) {
> +		ci = newcol((unsigned char)*wb, ci);
> +		if (ci == pwidth)
> +			ci++;
> +		if (ci > cur_col)
>  			x_putc(*wb);
> +		wb++;
> +	}
>  	cur_col = ci;
>  }
>  
> Index: regress/bin/ksh/edit/vi.sh
> ===================================================================
> RCS file: /cvs/src/regress/bin/ksh/edit/vi.sh,v
> diff -u -p -r1.12 vi.sh
> --- regress/bin/ksh/edit/vi.sh	27 Apr 2025 17:06:58 -0000	1.12
> +++ regress/bin/ksh/edit/vi.sh	16 May 2025 12:41:13 -0000
> @@ -103,6 +103,9 @@ testseq "hello\003302flD" " # hello\b\b\
>  # i: Insert.
>  testseq "hello\00332hix" " # hello\b\b\bxllo\b\b\b"
>  testseq "hello\00332\b2ix\0033" " # hello\b\b\bxllo\b\b\bxllo\b\b\b\b"
> +testseq "\0270\0202A\00330i\0340\0033" " # \0270\0202A\b\b\0340\0270\0202A\b\b"
> +testseq "\0200abcdef\00330ix\0033" \
> +	" # \0200abcdef\b\r # x\0200abcdef\r # x\0200\b"
>  
>  # I: Insert before first non-blank.
>  # ^: Move to the first non-whitespace character.
>