From: Walter Alejandro Iglesias <wai@roquesor.com>
Subject: ksh vi mode: UTF-8 handling in 'e' command
To: tech@openbsd.org
Cc: Lucas Gabriel Vuotto <lucas@sexy.is>
Date: Tue, 15 Apr 2025 07:13:11 +0200

This is about the first issue I reported in the following thread on
bugs@:

  https://marc.info/?t=174369335800023&r=1&w=2

Due to my lack of experience in C, that thread became noisy.  Following
Ingo's example, I'm reposting the first issue I reported there, with a
clearer explanation and Lucas's patch that fixes it.

What kept me from seeing what the code was doing was that I didn't
understand how isu8cont() works (this isn't the first time I've been
stuck on one of Ted Unangst's ultra-minimalistic creations ;-)).
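
For anyone else puzzled by it, here is a minimal sketch of the idea,
assuming isu8cont() does nothing more than test the two high bits of a
byte (UTF-8 continuation bytes have the form 10xxxxxx).  The names
is_u8_cont() and u8_count_chars() are made up for this illustration and
are not taken from the ksh source.

```c
/*
 * Stand-in for ksh's isu8cont(), under the assumption stated above:
 * a byte is a UTF-8 continuation byte if its top two bits are "10".
 */
static int
is_u8_cont(unsigned char c)
{
	return (c & 0xc0) == 0x80;
}

/*
 * Count characters (not bytes) in a NUL-terminated UTF-8 string by
 * counting every byte that is NOT a continuation byte.
 * Hypothetical helper, for illustration only.
 */
static int
u8_count_chars(const char *s)
{
	int n = 0;

	for (; *s != '\0'; s++)
		if (!is_u8_cont((unsigned char)*s))
			n++;
	return n;
}
```

With EURO SIGN encoded as 0xe2 0x82 0xac, the first byte is a lead byte
and the other two are continuation bytes, so "1.00" followed by the euro
sign is seven bytes but five characters.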

Now that I understand the code, I can explain the problem more clearly.
When the cursor is on, or lands on, a multibyte UTF-8 character, the "e"
command in ksh's vi mode gets stuck on that character and won't advance
no matter how many times you press the "e" key.  This is because the
endword() function (in vi.c) doesn't recognize and skip UTF-8
continuation bytes.  The following patch by Lucas Gabriel Vuotto fixes
the problem.  It also adds a continuation-byte skip to Endword() (the
"E" command) and regression tests for both commands.
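
To make the missing step concrete, here is a small standalone sketch of
"skip the continuation bytes of the character under the cursor before
doing anything else".  It is a simplified model, not the ksh code:
next_char_start() and is_u8_cont() are hypothetical names, and the
continuation test assumes the usual 10xxxxxx bit pattern.

```c
#include <stddef.h>

/* Assumed continuation-byte test: top two bits are "10". */
static int
is_u8_cont(unsigned char c)
{
	return (c & 0xc0) == 0x80;
}

/*
 * Return the index of the first byte after `cursor` that begins a new
 * character, or `len` if the cursor is already on the last character.
 * Hypothetical helper, for illustration only.
 */
static size_t
next_char_start(const char *buf, size_t len, size_t cursor)
{
	size_t n = cursor;

	/* Step past the current byte, then past any continuation bytes. */
	while (++n < len && is_u8_cont((unsigned char)buf[n]))
		;
	return n < len ? n : len;
}
```

For the buffer "one" + EURO SIGN + " two" (ten bytes), a cursor on the
euro sign's lead byte at index 3 advances to the space at index 6
rather than stopping at index 4, which is inside the same character.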


diff refs/heads/master 563a90a52e59962e09d4d2c0897c06024dab84be
commit - 58fd8d0bdc1e6222119987e7aaad111eae245668
commit + 563a90a52e59962e09d4d2c0897c06024dab84be
blob - cdda9cb24b1a4a395547e081ff3adca380d3b6c1
blob + 0b6459cb5f31c2a8e00de8fb837b4e7039d59214
--- bin/ksh/vi.c
+++ bin/ksh/vi.c
@@ -1590,15 +1590,18 @@ backword(int argcnt)
 static int
 endword(int argcnt)
 {
-	int ncursor, skip_space, want_letnum;
+	int ncursor, skip_space, skip_utf8_cont, want_letnum;
 	unsigned char uc;
 
 	ncursor = es->cursor;
 	while (ncursor < es->linelen && argcnt--) {
-		skip_space = 1;
+		skip_space = skip_utf8_cont = 1;
 		want_letnum = -1;
 		while (++ncursor < es->linelen) {
 			uc = es->cbuf[ncursor];
+			if (skip_utf8_cont && isu8cont(uc))
+				continue;
+			skip_utf8_cont = 0;
 			if (isspace(uc)) {
 				if (skip_space)
 					continue;
@@ -1663,6 +1666,9 @@ Endword(int argcnt)
 	ncursor = es->cursor;
 	while (ncursor < es->linelen && argcnt--) {
 		while (++ncursor < es->linelen &&
+		    isu8cont((unsigned char)es->cbuf[ncursor]))
+			;
+		while (++ncursor < es->linelen &&
 		    isspace((unsigned char)es->cbuf[ncursor]))
 			;
 		while (ncursor < es->linelen &&
blob - 2c33d0005da16ffd525336ada48374de632235a9
blob + 348511d26252c3d5f1afe9b9cfecdf8da2bcc272
--- regress/bin/ksh/edit/vi.sh
+++ regress/bin/ksh/edit/vi.sh
@@ -87,6 +87,15 @@ testseq "1.00 two\00330ED" " # 1.00 two\b\r # 1.0     
 # e: Move to end of word.
 testseq "onex two\00330eD" " # onex two\b\r # one     \b\b\b\b\b\b"
 
+# No infinite loop moving to end of {,big} word for non-ASCII UTF-8-ending
+# words.
+# EURO SIGN U+20AC is encoded as bytes 0xe2 0x82 0xac = \0342\0202\0254
+euro='\0342\0202\0254'
+testseq "1.00$euro 2.00 three\00330EED" \
+    " # 1.00$euro 2.00 three\b\r # 1.00$euro 2.0       \b\b\b\b\b\b\b\b"
+testseq "one$euro twox three\00330eeD" \
+    " # one$euro twox three\b\r # one$euro two       \b\b\b\b\b\b\b\b"
+
 # F: Find character backward.
 # ;: Repeat last search.
 # ,: Repeat last search in opposite direction.


-- 
Walter