From: Todd C. Miller <millert@openbsd.org>
Subject: Re: [PATCH] awk: proposal: reduce utf8 scanning
To: Jonas Bechtel <post@jbechtel.de>
Cc: tech@openbsd.org
Date: Sun, 02 Jun 2024 19:31:51 -0600

On Sun, 02 Jun 2024 10:06:11 +0200, Jonas Bechtel wrote:

> a) strlen is not needed because the check for
> 0b110xxxxx/0b1110xxxx/0b11110xxx and 0b10xxxxxx ensure that no 0 is
> read. strlen would read the entire string for every single codepoint 

That change looks fine.
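
To illustrate why the strlen() call is redundant, here is a minimal
sketch of the same style of check.  The name seq_len() is hypothetical
and this is not the code in run.c; it assumes a NUL-terminated string
and ignores awk_mb_cur_max.  Each continuation test fails on a NUL
byte, and && short-circuits, so nothing past the terminator is read.

```c
/*
 * Hypothetical stand-in for a strlen-free sequence check.
 * Returns the length (1..4) of the UTF-8 sequence at s, or 0 if the
 * bytes at s do not form a complete sequence.
 */
static int seq_len(const char *s)
{
	unsigned char c = (unsigned char)s[0];

	if (c < 128)
		return 1;	/* ASCII, including the NUL terminator */

	/*
	 * A NUL byte has its top two bits clear, so (s[i] & 0xC0) == 0x80
	 * is false at the terminator and the later bytes are never read.
	 */
	if ((c >> 5) == 0x6 && (s[1] & 0xC0) == 0x80)
		return 2;	/* 110xxxxx 10xxxxxx */
	if ((c >> 4) == 0xE && (s[1] & 0xC0) == 0x80 &&
	    (s[2] & 0xC0) == 0x80)
		return 3;	/* 1110xxxx 10xxxxxx 10xxxxxx */
	if ((c >> 3) == 0x1E && (s[1] & 0xC0) == 0x80 &&
	    (s[2] & 0xC0) == 0x80 && (s[3] & 0xC0) == 0x80)
		return 4;	/* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
	return 0;		/* truncated or malformed */
}
```

With this sketch, a lead byte that is cut off by the end of the string
(for example "\xC3" alone) yields 0 without any length precomputation.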

> b) starting position for second scan adjusted

Unfortunately, this change causes failures in the awk regression
tests.  You need to add mb to nb if you don't scan the entire string.
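
The offset arithmetic can be sketched as follows.  char2byte() here is
a hypothetical stand-in for u8_char2byte(), assuming well-formed UTF-8;
it is not the function from run.c.  A scan that starts at &s[mb]
returns an offset relative to that position, so mb has to be added to
get an offset into s.

```c
#include <stddef.h>

/*
 * Hypothetical helper: byte offset of the character that is nchars
 * characters past s.  Assumes s is NUL-terminated, well-formed UTF-8
 * and stops early at the terminator.
 */
static size_t char2byte(const char *s, size_t nchars)
{
	size_t i = 0;

	while (nchars > 0 && s[i] != '\0') {
		i++;	/* step over the lead byte */
		/* step over any continuation bytes (10xxxxxx) */
		while (((unsigned char)s[i] & 0xC0) == 0x80)
			i++;
		nchars--;
	}
	return i;
}
```

For the 6-byte string "h\xC3\xA9llo" with m=2 and n=3, scanning from
the start gives mb = char2byte(s, 1) = 1 and an end offset of
char2byte(s, 4) = 5.  Scanning from &s[mb] gives char2byte(&s[1], 3) = 4,
which only matches the full scan once mb is added: 1 + 4 = 5.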

The following diff passes the awk regression suite.

 - todd

Index: run.c
===================================================================
RCS file: /cvs/src/usr.bin/awk/run.c,v
diff -u -p -u -r1.87 run.c
--- run.c	3 Jun 2024 00:55:05 -0000	1.87
+++ run.c	3 Jun 2024 01:27:25 -0000
@@ -602,20 +602,18 @@ Cell *intest(Node **a, int n)	/* a[0] is
 /* return length 1..4 if yes, 0 if no */
 static int u8_isutf(const char *s)
 {
-	int n, ret;
+	int ret;
 	unsigned char c;
 
 	c = s[0];
-	if (c < 128 || awk_mb_cur_max == 1)
-		return 1; /* what if it's 0? */
-
-	n = strlen(s);
-	if (n >= 2 && ((c>>5) & 0x7) == 0x6 && (s[1] & 0xC0) == 0x80) {
+	if (c < 128 || awk_mb_cur_max == 1) {
+		ret = 1; /* what if it's 0? */
+	} else if (((c>>5) & 0x7) == 0x6 && (s[1] & 0xC0) == 0x80) {
 		ret = 2; /* 110xxxxx 10xxxxxx */
-	} else if (n >= 3 && ((c>>4) & 0xF) == 0xE && (s[1] & 0xC0) == 0x80
+	} else if (((c>>4) & 0xF) == 0xE && (s[1] & 0xC0) == 0x80
 			 && (s[2] & 0xC0) == 0x80) {
 		ret = 3; /* 1110xxxx 10xxxxxx 10xxxxxx */
-	} else if (n >= 4 && ((c>>3) & 0x1F) == 0x1E && (s[1] & 0xC0) == 0x80
+	} else if (((c>>3) & 0x1F) == 0x1E && (s[1] & 0xC0) == 0x80
 			 && (s[2] & 0xC0) == 0x80 && (s[3] & 0xC0) == 0x80) {
 		ret = 4; /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
 	} else {
@@ -1018,7 +1016,7 @@ Cell *substr(Node **a, int nnn)		/* subs
 	DPRINTF("substr: m=%d, n=%d, s=%s\n", m, n, s);
 	y = gettemp();
 	mb = u8_char2byte(s, m-1); /* byte offset of start char in s */
-	nb = u8_char2byte(s, m-1+n);  /* byte offset of end+1 char in s */
+	nb = mb + u8_char2byte(&s[mb], n);  /* byte offset of end+1 char in s */
 
 	temp = s[nb];	/* with thanks to John Linderman */
 	s[nb] = '\0';