Index | Thread | Search

From:
Lucas Gabriel Vuotto <lucas@openbsd.org>
Subject:
Re: ksh vi mode: prevent display corruption if line starts with utf8cont
To:
Ingo Schwarze <schwarze@usta.de>
Cc:
tech@openbsd.org
Date:
Sun, 18 May 2025 05:50:33 +0000

Download raw body.

Thread
  • Ingo Schwarze:

    ksh vi mode: prevent display corruption if line starts with utf8cont

    • Lucas Gabriel Vuotto:

      ksh vi mode: prevent display corruption if line starts with utf8cont

  • Hi Ingo,
    
    This reads fines and does solve the bug I showed to you. OK lucas.
    
    On Fri, May 16, 2025 at 03:18:36PM +0200, Ingo Schwarze wrote:
    > Hello,
    > 
    > Ingo Schwarze wrote on Tue, Apr 29, 2025 at 09:14:30PM +0200:
    > 
    > > unfortunately, ksh(1) VI mode contains multiple situations where the
    > > cursor can run out of the buffer when UTF-8 continuation bytes appear
    > > in places where they do not form valid UTF-8 sequences.  In a few
    > > cases, i have even seen the shell die from SIGSEGV as a result.
    > > 
    > > Before i can really start to track down and fix these one by one,
    > > or address the issue with the 'e' command that Walter reported,
    > > the function display() needs to be fixed.  Right now, invalid UTF-8
    > > at the beginning of the command line can corrupt the display so badly
    > > that working on other issues is seriously hindered.
    > 
    > After partially exploring various rabbit holes that i will have to
    > return to later, i finally committed my display() diff yesterday because
    > i finally convinced myself that it indeed makes things better and that
    > it is a necessary step, and i also had an OK from lucas@.
    > 
    > I did not yet commit the regression test though because lucas@ pointed
    > out that some similar sequences of input still misbehave (thanks for
    > that feedback).  It seems better to first get more aspects of
    > continuation byte display fixed, then commit a set of related tests
    > together.
    > 
    > The next step is fixing the function ed_mov_opt(); specifically, the
    > part of it that advances the cursor from the column with the number
    > "cur_col" to the column with the number "col".
    > 
    > The problem to solve is this:  If the line starts with a UTF-8
    > continuation byte, the code ignores it as if it were part of the last
    > character of the prompt and puts the *next* character right after the
    > prompt, omitting the UTF-8 continuation byte.
    > 
    > The solution that seems easiest to understand restructures the loop in
    > several respects:
    >  * Terminate the loop when we reach the target column (ci == col) *and*
    >    we do not need to finish a character we started - the latter can be
    >    the case because we are still at the left margin and never started
    >    any character in the first place (ci == pwidth) or because the
    >    current byte is not a continuation byte.
    >  * To simplify the condition controlling the actual printing, advance
    >    the column count up front, but without advancing the byte pointer
    >    yet, such that "ci" always contains the current column, no matter
    >    whether we are on a start or continuation byte.
    >  * UTF-8 continuation bytes advance the column count if and only if they
    >    are at the beginning of the line.  So if after newcol() we are still
    >    at the left margin, advance.
    >  * Now we can simply print the byte if and only if it is to the right of
    >    what we printer earlier (ci > cur_col).
    >  * Only advance the byte pointer at the very end of the loop, when we
    >    really want to consider the next byte.
    > 
    > Testing this requires doing rightward movements over various byte
    > sequences.  It helps to first type a sufficiently long string such that
    > the '0' command moves left by doing a carriage return and printing a
    > fresh prompt over the existing prompt.  Then you can move right in small
    > steps (e.g. with 'l') or large steps, including with '$'.
    > 
    > Note that when the line starts with a UTF-8 continuation byte and
    > the cursors sits at the beginnin of the line, the 'l' command
    > arguably still doesn't do the right thing and advances one character
    > too far - but that's a separate issue to be fixed later.
    > 
    > I'm not planning to commit the two new tests with this patch because
    > that's still not a complete suite for this kind of issues and there are
    > still related unfixed issues that hinder regression testing, but i'm
    > including them in this mail because they serve as a preliminary
    > demonstration of the partial progress we are making.
    > 
    > OK?
    >   Ingo
    > 
    > 
    > 
    > Index: bin/ksh/vi.c
    > ===================================================================
    > RCS file: /cvs/src/bin/ksh/vi.c,v
    > diff -u -p -r1.64 vi.c
    > --- bin/ksh/vi.c	15 May 2025 17:07:13 -0000	1.64
    > +++ bin/ksh/vi.c	16 May 2025 12:40:46 -0000
    > @@ -2006,10 +2006,15 @@ ed_mov_opt(int col, char *wb)
    >  
    >  	/* Advance the cursor. */
    >  
    > -	for (ci = pwidth; ci < col || isu8cont(*wb);
    > -	     ci = newcol((unsigned char)*wb++, ci))
    > -		if (ci > cur_col || (ci == cur_col && !isu8cont(*wb)))
    > +	ci = pwidth;
    > +	while (ci < col || (ci > pwidth && isu8cont(*wb))) {
    > +		ci = newcol((unsigned char)*wb, ci);
    > +		if (ci == pwidth)
    > +			ci++;
    > +		if (ci > cur_col)
    >  			x_putc(*wb);
    > +		wb++;
    > +	}
    >  	cur_col = ci;
    >  }
    >  
    > Index: regress/bin/ksh/edit/vi.sh
    > ===================================================================
    > RCS file: /cvs/src/regress/bin/ksh/edit/vi.sh,v
    > diff -u -p -r1.12 vi.sh
    > --- regress/bin/ksh/edit/vi.sh	27 Apr 2025 17:06:58 -0000	1.12
    > +++ regress/bin/ksh/edit/vi.sh	16 May 2025 12:41:13 -0000
    > @@ -103,6 +103,9 @@ testseq "hello\003302flD" " # hello\b\b\
    >  # i: Insert.
    >  testseq "hello\00332hix" " # hello\b\b\bxllo\b\b\b"
    >  testseq "hello\00332\b2ix\0033" " # hello\b\b\bxllo\b\b\bxllo\b\b\b\b"
    > +testseq "\0270\0202A\00330i\0340\0033" " # \0270\0202A\b\b\0340\0270\0202A\b\b"
    > +testseq "\0200abcdef\00330ix\0033" \
    > +	" # \0200abcdef\b\r # x\0200abcdef\r # x\0200\b"
    >  
    >  # I: Insert before first non-blank.
    >  # ^: Move to the first non-whitespace character.
    > 
    
    
  • Ingo Schwarze:

    ksh vi mode: prevent display corruption if line starts with utf8cont