From: Lucas Gabriel Vuotto Subject: Re: ksh vi mode: prevent display corruption if line starts with utf8cont To: Ingo Schwarze Cc: tech@openbsd.org Date: Sun, 18 May 2025 05:50:33 +0000 Hi Ingo, This reads fines and does solve the bug I showed to you. OK lucas. On Fri, May 16, 2025 at 03:18:36PM +0200, Ingo Schwarze wrote: > Hello, > > Ingo Schwarze wrote on Tue, Apr 29, 2025 at 09:14:30PM +0200: > > > unfortunately, ksh(1) VI mode contains multiple situations where the > > cursor can run out of the buffer when UTF-8 continuation bytes appear > > in places where they do not form valid UTF-8 sequences. In a few > > cases, i have even seen the shell die from SIGSEGV as a result. > > > > Before i can really start to track down and fix these one by one, > > or address the issue with the 'e' command that Walter reported, > > the function display() needs to be fixed. Right now, invalid UTF-8 > > at the beginning of the command line can corrupt the display so badly > > that working on other issues is seriously hindered. > > After partially exploring various rabbit holes that i will have to > return to later, i finally committed my display() diff yesterday because > i finally convinced myself that it indeed makes things better and that > it is a necessary step, and i also had an OK from lucas@. > > I did not yet commit the regression test though because lucas@ pointed > out that some similar sequences of input still misbehave (thanks for > that feedback). It seems better to first get more aspects of > continuation byte display fixed, then commit a set of related tests > together. > > The next step is fixing the function ed_mov_opt(); specifically, the > part of it that advances the cursor from the column with the number > "cur_col" to the column with the number "col". > > The problem to solve is this: If the line starts with a UTF-8 > continuation byte, the code ignores it as if it were part of the last > character of the prompt and puts the *next* character right after the > prompt, omitting the UTF-8 continuation byte. > > The solution that seems easiest to understand restructures the loop in > several respects: > * Terminate the loop when we reach the target column (ci == col) *and* > we do not need to finish a character we started - the latter can be > the case because we are still at the left margin and never started > any character in the first place (ci == pwidth) or because the > current byte is not a continuation byte. > * To simplify the condition controlling the actual printing, advance > the column count up front, but without advancing the byte pointer > yet, such that "ci" always contains the current column, no matter > whether we are on a start or continuation byte. > * UTF-8 continuation bytes advance the column count if and only if they > are at the beginning of the line. So if after newcol() we are still > at the left margin, advance. > * Now we can simply print the byte if and only if it is to the right of > what we printer earlier (ci > cur_col). > * Only advance the byte pointer at the very end of the loop, when we > really want to consider the next byte. > > Testing this requires doing rightward movements over various byte > sequences. It helps to first type a sufficiently long string such that > the '0' command moves left by doing a carriage return and printing a > fresh prompt over the existing prompt. Then you can move right in small > steps (e.g. with 'l') or large steps, including with '$'. > > Note that when the line starts with a UTF-8 continuation byte and > the cursors sits at the beginnin of the line, the 'l' command > arguably still doesn't do the right thing and advances one character > too far - but that's a separate issue to be fixed later. > > I'm not planning to commit the two new tests with this patch because > that's still not a complete suite for this kind of issues and there are > still related unfixed issues that hinder regression testing, but i'm > including them in this mail because they serve as a preliminary > demonstration of the partial progress we are making. > > OK? > Ingo > > > > Index: bin/ksh/vi.c > =================================================================== > RCS file: /cvs/src/bin/ksh/vi.c,v > diff -u -p -r1.64 vi.c > --- bin/ksh/vi.c 15 May 2025 17:07:13 -0000 1.64 > +++ bin/ksh/vi.c 16 May 2025 12:40:46 -0000 > @@ -2006,10 +2006,15 @@ ed_mov_opt(int col, char *wb) > > /* Advance the cursor. */ > > - for (ci = pwidth; ci < col || isu8cont(*wb); > - ci = newcol((unsigned char)*wb++, ci)) > - if (ci > cur_col || (ci == cur_col && !isu8cont(*wb))) > + ci = pwidth; > + while (ci < col || (ci > pwidth && isu8cont(*wb))) { > + ci = newcol((unsigned char)*wb, ci); > + if (ci == pwidth) > + ci++; > + if (ci > cur_col) > x_putc(*wb); > + wb++; > + } > cur_col = ci; > } > > Index: regress/bin/ksh/edit/vi.sh > =================================================================== > RCS file: /cvs/src/regress/bin/ksh/edit/vi.sh,v > diff -u -p -r1.12 vi.sh > --- regress/bin/ksh/edit/vi.sh 27 Apr 2025 17:06:58 -0000 1.12 > +++ regress/bin/ksh/edit/vi.sh 16 May 2025 12:41:13 -0000 > @@ -103,6 +103,9 @@ testseq "hello\003302flD" " # hello\b\b\ > # i: Insert. > testseq "hello\00332hix" " # hello\b\b\bxllo\b\b\b" > testseq "hello\00332\b2ix\0033" " # hello\b\b\bxllo\b\b\bxllo\b\b\b\b" > +testseq "\0270\0202A\00330i\0340\0033" " # \0270\0202A\b\b\0340\0270\0202A\b\b" > +testseq "\0200abcdef\00330ix\0033" \ > + " # \0200abcdef\b\r # x\0200abcdef\r # x\0200\b" > > # I: Insert before first non-blank. > # ^: Move to the first non-whitespace character. >