From: Ingo Schwarze <schwarze@usta.de>
Subject: ksh(1) VI mode: fix rewindow()
To: tech@openbsd.org
Date: Mon, 26 May 2025 00:54:12 +0200

Hello,

in ksh(1) VI editing mode, when you move the cursor to the left or
right out of the editing window, the function rewindow() gets called.
Its purpose is to calculate a new value for es->winleft, which is
the index of the first byte of the editing buffer es->cbuf[] to be
displayed.  After rewindow() returns, display() redraws the line,
starting at the just selected position.

The algorithm rewindow() currently uses goes like this:
 1. The overall goal is to choose es->winleft in such a way that
    the cursor position (corresponding to es->cursor) ends up
    approximately in the middle of the editing window.
    The task is non-trivial because byte positions in the
    buffer do not directly correspond to column positions.
    In particular, UTF-8 continuation bytes do not take up
    column positions, whereas tabs may take up more than one
    column, and so do ASCII control characters (and high bytes
    if the option vi-show8 is set).  Given that some bytes
    take up more than one column, it is not always possible
    to get the cursor *exactly* to the middle of the window,
    but fortunately, *approximately* is good enough.
 2. The function iterates the whole command line from the very
    beginning up to the cursor position - even if the cursor is far
    out to the right and the beginning is completely irrelevant.
 3. During this iteration, it remembers the last two places
    where the column position is (approximately) an integer
    multiple of half the window width - even though we are not
    really interested in distances from the beginning, but
    rather in distances from the cursor.
 4. Finally, it starts *another* iteration at the last but one
    multiple of winwidth/2 (which may be too far left) and
    goes to the right until it thinks that starting there
    might put the cursor in the middle.

This algorithm has multiple defects.
 a) It is written in such a way that it starts the editing window
    with a UTF-8 continuation byte whenever it gets the chance,
    resulting in garbled output that is not valid UTF-8.
 b) Needlessly iterating the whole buffer is inefficient.
 c) Iterating over bytes again that we already inspected during
    the first iteration is inefficient.
 d) Even with the second iteration in step 4, the result is not
    guaranteed to be optimal because the width of a tab depends
    on the position of the editing window, which the second
    iteration in step 4 does not take into account.
 e) The code, even though relatively short, is hard to follow,
    and it is hard to see why it does what it does.

Fixing everything, including completely fixing defect d),
would probably require increasing the order of the algorithm.
Currently, it is something like O(C + W) where C is the cursor
position in bytes from the beginning of the line and W is half
the width of the editing window.  Reliably finding the optimal
position in every case involving tabs and control characters
would probably require an algorithm of the order O(W^2) because
one would have to try several starting position and do an
iteration calculating the cursor position for each of them,
resulting in a quadratic algorithm.  Well, picking a tricky
search algorithm, one might get that down to O(W * log(W)) -
but either way, an exact solution is likely not worth the
significant complication of the code it would cause.

What i propose is solving a), b), c), and e) by using an
algorithm of the order O(W) - i.e. slightly faster -
that is slightly longer, but much easier to understand.
It doesn't become worse with respect to defect d) -
it may be a handful of bytes off to the left or right in
some cases involving literal tabs, but i consider that
irrelevant for practical purposes.

The key to the new algorithm is iterating *backwards* rather than
forwards - which makes sense because we want to start at the
cursor position and than step left until we have backed up
by about half the window width.

OK?
  Ingo


Index: vi.c
===================================================================
RCS file: /cvs/src/bin/ksh/vi.c,v
diff -u -p -r1.66 vi.c
--- vi.c	19 May 2025 14:27:38 -0000	1.66
+++ vi.c	25 May 2025 21:41:00 -0000
@@ -1789,25 +1789,61 @@ outofwin(void)
 static void
 rewindow(void)
 {
-	int	tcur, tcol;
-	int	holdcur1, holdcol1;
-	int	holdcur2, holdcol2;
-
-	holdcur1 = holdcur2 = tcur = 0;
-	holdcol1 = holdcol2 = tcol = 0;
-	while (tcur < es->cursor) {
-		if (tcol - holdcol2 > winwidth / 2) {
-			holdcur1 = holdcur2;
-			holdcol1 = holdcol2;
-			holdcur2 = tcur;
-			holdcol2 = tcol;
+	int		cur;	/* byte# in the main command line buffer */
+	int		col;	/* corresponding display column */
+	int		tabc;	/* columns a tab character can take up */
+	int		thisc;	/* columns the current character requires */
+	unsigned char	uc;	/* the current byte */
+
+	/* The desired cursor position is near the middle of the window. */
+	cur = es->cursor;
+	col = winwidth / 2;
+	tabc = 0;
+
+	/* Step left to find the desired left margin. */
+	while (cur > 0 && col > 0) {
+		uc = es->cbuf[--cur];
+
+		/* Never start the window on a continuation byte. */
+		if (isu8cont(uc))
+			continue;
+
+		if (uc == '\t') {
+			/*
+			 * If two tabs occur close together,
+			 * count the right one, including optional
+			 * characters between both, as 8 columns.
+			 */
+			if (tabc > 0) {
+				col -= 8;
+				/* Prefer starting after a tab. */
+				if (col <= 0)
+					cur++;
+			}
+
+			/*
+			 * A tab can be preceded by up to 7 characters
+			 * without taking up additional space.
+			 */
+			tabc = 7;
+			continue;
+		}
+		thisc = char_len(uc);
+		if (tabc > 0) {
+			if (tabc > thisc) {
+				/* The character still fits in the tab. */
+				tabc -= thisc;
+				continue;
+			}
+			col -= 8;	/* The tab is now full. */
+			thisc -= tabc;	/* This may produce overflow. */
+			tabc = 0;
 		}
-		tcol = newcol((unsigned char) es->cbuf[tcur++], tcol);
+
+		/* Handle a normal character or the overflow. */
+		col -= thisc;
 	}
-	while (tcol - holdcol1 > winwidth / 2)
-		holdcol1 = newcol((unsigned char) es->cbuf[holdcur1++],
-		    holdcol1);
-	es->winleft = holdcur1;
+	es->winleft = cur;
 }
 
 /* Printing the byte ch at display column col moves to which column? */