From: Ingo Schwarze Subject: Re: ksh vi mode: prevent display corruption if line starts with utf8cont To: tech@openbsd.org Cc: lucas@openbsd.org Date: Fri, 16 May 2025 15:18:36 +0200 Hello, Ingo Schwarze wrote on Tue, Apr 29, 2025 at 09:14:30PM +0200: > unfortunately, ksh(1) VI mode contains multiple situations where the > cursor can run out of the buffer when UTF-8 continuation bytes appear > in places where they do not form valid UTF-8 sequences. In a few > cases, i have even seen the shell die from SIGSEGV as a result. > > Before i can really start to track down and fix these one by one, > or address the issue with the 'e' command that Walter reported, > the function display() needs to be fixed. Right now, invalid UTF-8 > at the beginning of the command line can corrupt the display so badly > that working on other issues is seriously hindered. After partially exploring various rabbit holes that i will have to return to later, i finally committed my display() diff yesterday because i finally convinced myself that it indeed makes things better and that it is a necessary step, and i also had an OK from lucas@. I did not yet commit the regression test though because lucas@ pointed out that some similar sequences of input still misbehave (thanks for that feedback). It seems better to first get more aspects of continuation byte display fixed, then commit a set of related tests together. The next step is fixing the function ed_mov_opt(); specifically, the part of it that advances the cursor from the column with the number "cur_col" to the column with the number "col". The problem to solve is this: If the line starts with a UTF-8 continuation byte, the code ignores it as if it were part of the last character of the prompt and puts the *next* character right after the prompt, omitting the UTF-8 continuation byte. The solution that seems easiest to understand restructures the loop in several respects: * Terminate the loop when we reach the target column (ci == col) *and* we do not need to finish a character we started - the latter can be the case because we are still at the left margin and never started any character in the first place (ci == pwidth) or because the current byte is not a continuation byte. * To simplify the condition controlling the actual printing, advance the column count up front, but without advancing the byte pointer yet, such that "ci" always contains the current column, no matter whether we are on a start or continuation byte. * UTF-8 continuation bytes advance the column count if and only if they are at the beginning of the line. So if after newcol() we are still at the left margin, advance. * Now we can simply print the byte if and only if it is to the right of what we printer earlier (ci > cur_col). * Only advance the byte pointer at the very end of the loop, when we really want to consider the next byte. Testing this requires doing rightward movements over various byte sequences. It helps to first type a sufficiently long string such that the '0' command moves left by doing a carriage return and printing a fresh prompt over the existing prompt. Then you can move right in small steps (e.g. with 'l') or large steps, including with '$'. Note that when the line starts with a UTF-8 continuation byte and the cursors sits at the beginnin of the line, the 'l' command arguably still doesn't do the right thing and advances one character too far - but that's a separate issue to be fixed later. I'm not planning to commit the two new tests with this patch because that's still not a complete suite for this kind of issues and there are still related unfixed issues that hinder regression testing, but i'm including them in this mail because they serve as a preliminary demonstration of the partial progress we are making. OK? Ingo Index: bin/ksh/vi.c =================================================================== RCS file: /cvs/src/bin/ksh/vi.c,v diff -u -p -r1.64 vi.c --- bin/ksh/vi.c 15 May 2025 17:07:13 -0000 1.64 +++ bin/ksh/vi.c 16 May 2025 12:40:46 -0000 @@ -2006,10 +2006,15 @@ ed_mov_opt(int col, char *wb) /* Advance the cursor. */ - for (ci = pwidth; ci < col || isu8cont(*wb); - ci = newcol((unsigned char)*wb++, ci)) - if (ci > cur_col || (ci == cur_col && !isu8cont(*wb))) + ci = pwidth; + while (ci < col || (ci > pwidth && isu8cont(*wb))) { + ci = newcol((unsigned char)*wb, ci); + if (ci == pwidth) + ci++; + if (ci > cur_col) x_putc(*wb); + wb++; + } cur_col = ci; } Index: regress/bin/ksh/edit/vi.sh =================================================================== RCS file: /cvs/src/regress/bin/ksh/edit/vi.sh,v diff -u -p -r1.12 vi.sh --- regress/bin/ksh/edit/vi.sh 27 Apr 2025 17:06:58 -0000 1.12 +++ regress/bin/ksh/edit/vi.sh 16 May 2025 12:41:13 -0000 @@ -103,6 +103,9 @@ testseq "hello\003302flD" " # hello\b\b\ # i: Insert. testseq "hello\00332hix" " # hello\b\b\bxllo\b\b\b" testseq "hello\00332\b2ix\0033" " # hello\b\b\bxllo\b\b\bxllo\b\b\b\b" +testseq "\0270\0202A\00330i\0340\0033" " # \0270\0202A\b\b\0340\0270\0202A\b\b" +testseq "\0200abcdef\00330ix\0033" \ + " # \0200abcdef\b\r # x\0200abcdef\r # x\0200\b" # I: Insert before first non-blank. # ^: Move to the first non-whitespace character.