From: Walter Alejandro Iglesias
Subject: ksh vi mode: UTF-8 handling in 'e' command
To: tech@openbsd.org
Cc: Lucas Gabriel Vuotto
Date: Tue, 15 Apr 2025 07:13:11 +0200

About the first issue I reported in the following thread on bugs@:

  https://marc.info/?t=174369335800023&r=1&w=2

Due to my lack of experience in C, that thread became noisy.  So,
following Ingo's example, I thought it would be a good idea to repost
the first issue I reported there, with a better explanation and with
Lucas's patch that fixes it.

What kept me from seeing what the code was doing was that I didn't
understand how isu8cont() worked (this isn't the first time I've been
stuck on one of Ted Unangst's ultra-minimalistic creations ;-)).  Now
that I understand the code, I can explain the problem more clearly.

When the cursor is on, or lands on, a non-ASCII (multibyte) UTF-8
character, the "e" command in ksh's vi mode gets stuck on that
character and won't advance no matter how many times you press the "e"
key.  This is because the endword() function (file vi.c) doesn't
recognize and skip the UTF-8 continuation bytes of the character under
the cursor.

The following patch by Lucas Gabriel Vuotto fixes the problem.
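
Before the patch itself, a minimal standalone sketch of the mechanism
may help.  It is a simplified model, not the code from vi.c:
end_of_word() and its "fix" parameter are hypothetical names, words are
treated as plain runs of non-space bytes, and is_cont() only assumes the
standard UTF-8 rule that continuation bytes look like 10xxxxxx.

```c
#include <assert.h>
#include <ctype.h>

/* UTF-8 continuation bytes have the bit pattern 10xxxxxx. */
static int
is_cont(unsigned char c)
{
	return (c & 0xc0) == 0x80;
}

/*
 * Simplified "move to end of word" over a byte buffer (hypothetical
 * helper, not the real endword()).  The cursor must sit on the first
 * byte of a character.  With fix == 0, the continuation bytes of the
 * character under the cursor are taken for the start of a word, so the
 * following space ends the motion.  With fix != 0 they are skipped
 * first, as the patch does with skip_utf8_cont.
 */
static int
end_of_word(const char *buf, int len, int cursor, int fix)
{
	int n = cursor, skip_space = 1, skip_cont = fix;
	unsigned char uc;

	assert(cursor >= 0 && cursor < len);
	assert(!is_cont((unsigned char)buf[cursor]));

	while (++n < len) {
		uc = (unsigned char)buf[n];
		if (skip_cont && is_cont(uc))
			continue;
		skip_cont = 0;
		if (isspace(uc)) {
			if (skip_space)
				continue;
			break;
		}
		skip_space = 0;
	}
	/* Step back to the first byte of the last character reached. */
	while (is_cont((unsigned char)buf[--n]))
		continue;
	return n;
}
```

For the 10-byte buffer "one\xe2\x82\xac two" (the word "one" followed
by a EURO SIGN, a space and "two"), end_of_word(buf, 10, 3, 0) returns
3, so the cursor never leaves the euro sign, while
end_of_word(buf, 10, 3, 1) returns 9, the final "o" of "two".
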
diff refs/heads/master 563a90a52e59962e09d4d2c0897c06024dab84be
commit - 58fd8d0bdc1e6222119987e7aaad111eae245668
commit + 563a90a52e59962e09d4d2c0897c06024dab84be
blob - cdda9cb24b1a4a395547e081ff3adca380d3b6c1
blob + 0b6459cb5f31c2a8e00de8fb837b4e7039d59214
--- bin/ksh/vi.c
+++ bin/ksh/vi.c
@@ -1590,15 +1590,18 @@ backword(int argcnt)
 static int
 endword(int argcnt)
 {
-	int ncursor, skip_space, want_letnum;
+	int ncursor, skip_space, skip_utf8_cont, want_letnum;
 	unsigned char uc;
 
 	ncursor = es->cursor;
 	while (ncursor < es->linelen && argcnt--) {
-		skip_space = 1;
+		skip_space = skip_utf8_cont = 1;
 		want_letnum = -1;
 		while (++ncursor < es->linelen) {
 			uc = es->cbuf[ncursor];
+			if (skip_utf8_cont && isu8cont(uc))
+				continue;
+			skip_utf8_cont = 0;
 			if (isspace(uc)) {
 				if (skip_space)
 					continue;
@@ -1663,6 +1666,9 @@ Endword(int argcnt)
 	ncursor = es->cursor;
 	while (ncursor < es->linelen && argcnt--) {
 		while (++ncursor < es->linelen &&
+		    isu8cont((unsigned char)es->cbuf[ncursor]))
+			;
+		while (++ncursor < es->linelen &&
 		    isspace((unsigned char)es->cbuf[ncursor]))
 			;
 		while (ncursor < es->linelen &&
blob - 2c33d0005da16ffd525336ada48374de632235a9
blob + 348511d26252c3d5f1afe9b9cfecdf8da2bcc272
--- regress/bin/ksh/edit/vi.sh
+++ regress/bin/ksh/edit/vi.sh
@@ -87,6 +87,15 @@ testseq "1.00 two\00330ED" " # 1.00 two\b\r # 1.0 
 # e: Move to end of word.
 testseq "onex two\00330eD" " # onex two\b\r # one \b\b\b\b\b\b"
 
+# No infinite loop moving to end of {,big} word for non-ASCII UTF-8-ending
+# words.
+# EURO SIGN U+20AC is encoded as bytes 0xe2 0x82 0xac = \0342\0202\0254
+euro='\0342\0202\0254'
+testseq "1.00$euro 2.00 three\00330EED" \
+	" # 1.00$euro 2.00 three\b\r # 1.00$euro 2.0 \b\b\b\b\b\b\b\b"
+testseq "one$euro twox three\00330eeD" \
+	" # one$euro twox three\b\r # one$euro two \b\b\b\b\b\b\b\b"
+
 # F: Find character backward.
 # ;: Repeat last search.
 # ,: Repeat last search in opposite direction.

-- 
Walter