Mailing List Archive

On Sun, 02 Jun 2024 10:06:11 +0200, Jonas Bechtel wrote:

> a) strlen is not needed because the check for
> 0b110xxxxx/0b1110xxxx/0b11110xxx and 0b10xxxxxx ensure that no 0 is
> read. strlen would read the entire string for every single codepoint

That change looks fine.

> b) starting position for second scan adjusted

Unfortunately, this change causes failures in the awk regression
tests.  You need to add mb to nb if you don't scan the entire string.

The following diff passes the awk regression suite.

 - todd

Index: run.c
===================================================================
RCS file: /cvs/src/usr.bin/awk/run.c,v
diff -u -p -u -r1.87 run.c
--- run.c	3 Jun 2024 00:55:05 -0000	1.87
+++ run.c	3 Jun 2024 01:27:25 -0000
@@ -602,20 +602,18 @@ Cell *intest(Node **a, int n)	/* a[0] is
 /* return length 1..4 if yes, 0 if no */
 static int u8_isutf(const char *s)
 {
-	int n, ret;
+	int ret;
 	unsigned char c;
 
 	c = s[0];
-	if (c < 128 || awk_mb_cur_max == 1)
-		return 1; /* what if it's 0? */
-
-	n = strlen(s);
-	if (n >= 2 && ((c>>5) & 0x7) == 0x6 && (s[1] & 0xC0) == 0x80) {
+	if (c < 128 || awk_mb_cur_max == 1) {
+		ret = 1; /* what if it's 0? */
+	} else if (((c>>5) & 0x7) == 0x6 && (s[1] & 0xC0) == 0x80) {
 		ret = 2; /* 110xxxxx 10xxxxxx */
-	} else if (n >= 3 && ((c>>4) & 0xF) == 0xE && (s[1] & 0xC0) == 0x80
+	} else if (((c>>4) & 0xF) == 0xE && (s[1] & 0xC0) == 0x80
 	  && (s[2] & 0xC0) == 0x80) {
 		ret = 3; /* 1110xxxx 10xxxxxx 10xxxxxx */
-	} else if (n >= 4 && ((c>>3) & 0x1F) == 0x1E && (s[1] & 0xC0) == 0x80
+	} else if (((c>>3) & 0x1F) == 0x1E && (s[1] & 0xC0) == 0x80
 	  && (s[2] & 0xC0) == 0x80 && (s[3] & 0xC0) == 0x80) {
 		ret = 4; /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
 	} else {
@@ -1018,7 +1016,7 @@ Cell *substr(Node **a, int nnn)		/* subs
 	DPRINTF("substr: m=%d, n=%d, s=%s\n", m, n, s);
 	y = gettemp();
 	mb = u8_char2byte(s, m-1); /* byte offset of start char in s */
-	nb = u8_char2byte(s, m-1+n);  /* byte offset of end+1 char in s */
+	nb = mb + u8_char2byte(&s[mb], n);  /* byte offset of end+1 char in s */
 
 	temp = s[nb];	/* with thanks to John Linderman */
 	s[nb] = '\0';
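
For readers following the thread, both points can be illustrated outside of awk with a minimal sketch. The helpers seq_len(), char_to_byte() and substr_end() below are hypothetical simplified stand-ins, not the actual u8_isutf() and u8_char2byte() from run.c, and they ignore awk_mb_cur_max. The sketch assumes a NUL-terminated string. It shows (a) why the continuation-byte checks never read past the terminator, and (b) why an offset obtained by resuming the scan at mb is relative to &s[mb] and needs mb added back.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for u8_isutf(): length (1..4) of the UTF-8
 * sequence starting at s, or 0 if the bytes are not well formed.
 * No strlen() is needed: a NUL byte never matches 10xxxxxx, so each
 * && chain stops at the terminator instead of reading past it.
 */
static int
seq_len(const char *s)
{
	const unsigned char *u = (const unsigned char *)s;

	if (u[0] < 0x80)
		return 1;
	if ((u[0] & 0xE0) == 0xC0 && (u[1] & 0xC0) == 0x80)
		return 2;
	if ((u[0] & 0xF0) == 0xE0 && (u[1] & 0xC0) == 0x80 &&
	    (u[2] & 0xC0) == 0x80)
		return 3;
	if ((u[0] & 0xF8) == 0xF0 && (u[1] & 0xC0) == 0x80 &&
	    (u[2] & 0xC0) == 0x80 && (u[3] & 0xC0) == 0x80)
		return 4;
	return 0;
}

/*
 * Hypothetical stand-in for u8_char2byte(): byte offset, relative to s,
 * reached after skipping nchars characters. Stops at the end of s.
 */
static size_t
char_to_byte(const char *s, size_t nchars)
{
	size_t off = 0;
	int len;

	while (nchars-- > 0 && s[off] != '\0') {
		len = seq_len(s + off);
		off += (len > 0) ? (size_t)len : 1; /* bad byte counts as one char */
	}
	return off;
}

/*
 * Byte offset of the character just past a substring of n characters
 * starting at 1-based character position m. The second scan resumes at
 * mb, so its result is relative to s + mb and mb must be added back.
 */
static size_t
substr_end(const char *s, size_t m, size_t n)
{
	size_t mb;

	assert(m >= 1);
	mb = char_to_byte(s, m - 1);
	return mb + char_to_byte(s + mb, n);
}
```

With the 7-byte string "a\xC3\xA9\xE2\x82\xACz" (four characters of 1, 2, 3 and 1 bytes), m=2 and n=2 give mb = 1. The resumed scan returns 5, which on its own would point into the middle of the three-byte sequence; adding mb yields 6, the same value a full scan from the start of the string produces.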

2024-06-02 08:06 Jonas Bechtel:
[PATCH] awk: proposal: reduce utf8 scanning
- 2024-06-03 01:31 Todd C. Miller:
  [PATCH] awk: proposal: reduce utf8 scanning
- - 2024-06-04 06:19 Jonas Bechtel:
    [PATCH] awk: proposal: reduce utf8 scanning
- - - 2024-06-04 14:37 Todd C. Miller:
      [PATCH] awk: proposal: reduce utf8 scanning