From: Ingo Schwarze Subject: ksh vi mode: prevent display corruption if line starts with utf8cont To: tech@openbsd.org Date: Tue, 29 Apr 2025 21:14:30 +0200 Hello, unfortunately, ksh(1) VI mode contains multiple situations where the cursor can run out of the buffer when UTF-8 continuation bytes appear in places where they do not form valid UTF-8 sequences. In a few cases, i have even seen the shell die from SIGSEGV as a result. Before i can really start to track down and fix these one by one, or address the issue with the 'e' command that Walter reported, the function display() needs to be fixed. Right now, invalid UTF-8 at the beginning of the command line can corrupt the display so badly that working on other issues is seriously hindered. At this point, the goal is *not* for display() to calculate correct columniation because that would require full UTF-8 syntax validation plus use of wcwidth(3). The goals are * correct columniation if all characters are valid and width 1 * even for arbitary byte sequences, no display corruption To achieve these goals, the following assumptions are made: * UTF-8 continuation bytes are assumed to be width 0 * all other bytes are assumed to be width 1 Now let's consider the following patch: 1st chunk: This is in the first loop buffering bytes from the command line buffer cbuf[] into the display buffer wb1[], counting columns in "col" as we go. If we find a byte that is not an ASCII control character, it advances to the next column if it is not a UTF-8 continuation byte (i.e. if it is a printable ASCII byte or a UTF-8 start byte). What is missing here is that in the first column, even a UTF-8 continuation byte starts the column - that's not valid UTF-8 syntax, but xterm(1) will typically display it as the U+FFFD REPLACEMENT CHARACTER, which is width 1. 2nd chunk: This is in the second loop comparing the display buffer wb1[] to the former display buffer wb2[] and writing characters that changed to the terminal. The test "col > 0" is intended to prevent backing up past the left margin, but it is wrong for two reasons. First, the test is ineffective because we have col == pwidth (the width of the prompt) at the beginning of the loop, not col == 0. Secondly, decrementing the column number (col) and backing up the byte pointer (twb1) have to be independent of each other. Consequently, the block needs to be entered for all UTF-8 continuation bytes; the column number needs to be decremented unless we are still in the starting column; and the byte pointer needs to be backed up unless we are still at the beginning of the buffer. Even if we are still in the starting column, it can happen that we are no longer at the beginning of the buffer: that happens when the buffer starts with more than one UTF-8 continuation byte. 3rd chunk: This is also in the second loop, but regards the handling of unchanged bytes. It skips the column increment for UTF-8 continuation bytes. But we should *not* skip advancing the column for the first byte on the command line. Without these three changes, a line starting with one or more UTF-8 continuation bytes corrupts the display in so far as * the first loop undercounts the number of columns by one * the second loop overwrites the last character of the prompt with the beginning of the command line buffer. The new test inputs a three-byte UTF-8 character in an unusual way: it first inserts the two continuation bytes, then jumps back to the beginning of the line and inserts the UTF-8 start byte. It checks that the display does not get corrupted and that the correctly assembled three-byte UTF-8 character appears at the end. In case you wonder: No additional logic is needed when the window has moved horizontally (winleft > 0). That's because rewindow() makes sure that winleft never points to a UTF-8 continuation byte. That's guaranteed because newcol() never changes the column number for UTF-8 continuation bytes. OK? Ingo P.S. If someone can tell me about a way to type in arbitrary bytes - in particular bytes > 0x7f - into xterm(1) or wscons(4), that would be appreciated to make manual testing easier. Index: bin/ksh/vi.c =================================================================== RCS file: /cvs/src/bin/ksh/vi.c,v diff -u -p -r1.63 vi.c --- bin/ksh/vi.c 27 Apr 2025 17:06:58 -0000 1.63 +++ bin/ksh/vi.c 29 Apr 2025 16:51:05 -0000 @@ -1869,7 +1869,7 @@ display(char *wb1, char *wb2, int leftsi } } else { *twb1++ = ch; - if (!isu8cont(ch)) + if (col == 0 || !isu8cont(ch)) col++; } } @@ -1908,8 +1908,9 @@ display(char *wb1, char *wb2, int leftsi * the previous byte was the last one written. */ - if (col > 0 && isu8cont(*twb1)) { - col--; + if (isu8cont(*twb1)) { + if (col > pwidth) + col--; if (lastb >= 0 && twb1 == wb1 + lastb + 1) cur_col = col; else while (twb1 > wb1 && isu8cont(*twb1)) { @@ -1933,7 +1934,7 @@ display(char *wb1, char *wb2, int leftsi } lastb = *twb1 & 0x80 ? twb1 - wb1 : -1; cur_col++; - } else if (isu8cont(*twb1)) + } else if (twb1 > wb1 && isu8cont(*twb1)) continue; /* Index: regress/bin/ksh/edit/vi.sh =================================================================== RCS file: /cvs/src/regress/bin/ksh/edit/vi.sh,v diff -u -r1.12 vi.sh --- regress/bin/ksh/edit/vi.sh 27 Apr 2025 17:06:58 -0000 1.12 +++ regress/bin/ksh/edit/vi.sh 29 Apr 2025 16:51:30 -0000 @@ -103,6 +103,7 @@ # i: Insert. testseq "hello\00332hix" " # hello\b\b\bxllo\b\b\b" testseq "hello\00332\b2ix\0033" " # hello\b\b\bxllo\b\b\bxllo\b\b\b\b" +testseq "\0270\0202A\00330i\0340\0033" " # \0270\0202A\b\b\0340\0270\0202A\b\b" # I: Insert before first non-blank. # ^: Move to the first non-whitespace character.