From: Jonas Bechtel <post@jbechtel.de>
Subject: [PATCH] awk: proposal: reduce utf8 scanning
To: millert@openbsd.org
Cc: tech@openbsd.org
Date: Sun, 2 Jun 2024 10:06:11 +0200


Hi,

(CC to tech@openbsd.org; I don't know whether it will come through.)

I noticed the recent progress on UTF-8 support in awk, which I am
now using.

However, I suggest reducing string scanning as follows:

--- run.c.orig	Sun May  5 00:59:21 2024
+++ run.c.new	Sun Jun  2 09:21:14 2024
@@ -606,16 +606,14 @@ int u8_isutf(const char *s)
 	unsigned char c;
 
 	c = s[0];
-	if (c < 128 || awk_mb_cur_max == 1)
-		return 1; /* what if it's 0? */
-
-	n = strlen(s);
-	if (n >= 2 && ((c>>5) & 0x7) == 0x6 && (s[1] & 0xC0) == 0x80) {
+	if (c < 128 || awk_mb_cur_max == 1) {
+		ret = 1; /* what if it's 0? */
+	} else if (((c>>5) & 0x7) == 0x6 && (s[1] & 0xC0) == 0x80) {
 		ret = 2; /* 110xxxxx 10xxxxxx */
-	} else if (n >= 3 && ((c>>4) & 0xF) == 0xE && (s[1] & 0xC0) == 0x80
+	} else if (((c>>4) & 0xF) == 0xE && (s[1] & 0xC0) == 0x80
 			 && (s[2] & 0xC0) == 0x80) {
 		ret = 3; /* 1110xxxx 10xxxxxx 10xxxxxx */
-	} else if (n >= 4 && ((c>>3) & 0x1F) == 0x1E && (s[1] & 0xC0) == 0x80
+	} else if (((c>>3) & 0x1F) == 0x1E && (s[1] & 0xC0) == 0x80
 			 && (s[2] & 0xC0) == 0x80 && (s[3] & 0xC0) == 0x80) {
 		ret = 4; /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
 	} else {
@@ -1018,7 +1016,7 @@ Cell *substr(Node **a, int nnn)		/* substr(a[0], a[1],
 	DPRINTF("substr: m=%d, n=%d, s=%s\n", m, n, s);
 	y = gettemp();
 	mb = u8_char2byte(s, m-1); /* byte offset of start char in s */
-	nb = u8_char2byte(s, m-1+n);  /* byte offset of end+1 char in s */
+	nb = mb+u8_char2byte(&s[mb], n);  /* byte offset of end+1 char in s */
 
 	temp = s[nb];	/* with thanks to John Linderman */
 	s[nb] = '\0';


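a) strlen is not needed: each branch first checks the lead byte
(0b110xxxxx, 0b1110xxxx or 0b11110xxx) and then requires every
following byte to match 0b10xxxxxx. A NUL byte fails that test and &&
short-circuits, so nothing past the terminator is read. With strlen,
the whole remaining string is scanned again for every single codepoint.
b) The second scan in substr now starts at mb instead of at the
beginning of s, so the first m-1 characters are not scanned twice.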
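
To illustrate both points outside of awk, here is a minimal standalone
sketch. seq_len() and char2byte() are hypothetical stand-ins for
u8_isutf() and u8_char2byte(), not the code from run.c; returning 1 for
an invalid sequence is my assumption, and the awk_mb_cur_max check is
left out.

```c
#include <stddef.h>

/*
 * Hypothetical stand-in for u8_isutf(): byte length (1..4) of the
 * UTF-8 sequence at s. Invalid or truncated sequences count as one
 * byte (an assumption made for this sketch).
 * No strlen(): && stops at the first byte that is not a continuation
 * byte, and NUL is never one, so nothing past the terminator is read.
 */
int seq_len(const char *s)
{
	unsigned char c = (unsigned char)s[0];

	if (c < 0x80)
		return 1;	/* ASCII, including the terminating NUL */
	if ((c & 0xE0) == 0xC0 && (s[1] & 0xC0) == 0x80)
		return 2;	/* 110xxxxx 10xxxxxx */
	if ((c & 0xF0) == 0xE0 && (s[1] & 0xC0) == 0x80
	    && (s[2] & 0xC0) == 0x80)
		return 3;	/* 1110xxxx 10xxxxxx 10xxxxxx */
	if ((c & 0xF8) == 0xF0 && (s[1] & 0xC0) == 0x80
	    && (s[2] & 0xC0) == 0x80 && (s[3] & 0xC0) == 0x80)
		return 4;	/* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
	return 1;		/* invalid: treat as a single byte */
}

/*
 * Hypothetical stand-in for u8_char2byte(): byte offset reached after
 * skipping nchars characters in s, stopping at the end of the string.
 */
size_t char2byte(const char *s, size_t nchars)
{
	size_t off = 0;

	while (nchars-- > 0 && s[off] != '\0')
		off += seq_len(s + off);
	return off;
}
```

With s = "a\xC3\xA4" "b\xE2\x82\xAC" "c" and substr(s, 2, 3), mb is
char2byte(s, 1) = 1, and both char2byte(s, 4) and
mb + char2byte(s + mb, 3) give 7. The second form just does not walk
over the first character again.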


Best Regards
 jbechtel