GNU awk: odd behavior of length(), match() and field splitting with multi-byte characters #4775

Ed-Morton · 2024-07-06T12:36:01Z

Description / Steps to reproduce the issue

gawk in Cygwin on Windows reports odd lengths and matches for strings that contain multi-byte characters. Someone on StackOverflow asked about a couple of issues that, so far at least, no one there can explain and that seem to just be bugs. I posted a bug report to the GNU awk bugs mailing list, https://lists.gnu.org/archive/html/bug-gawk/2024-07/msg00000.html, and the feedback there is that it's a Cygwin or MSYS2 port issue. Could you please take a look? I also emailed cygwin at cygwin dot com per the advice from the GNU bug list.

Expected behavior

https://stackoverflow.com/questions/78676444/conditional-replacement-of-arbitrarily-large-strings-that-occur-at-arbitrary-dis#comment138715434_78676444 and https://stackoverflow.com/questions/78676444/conditional-replacement-of-arbitrarily-large-strings-that-occur-at-arbitrary-dis#comment138720207_78676444:

If we write 4 characters (two of them 4-byte UTF-8 sequences, two of them ASCII) as 10 bytes using:

$ echo '61F09F948DF09F948E62' | xxd -r -p > file1
$

and run the following gawk command on it we get the output shown:

$ LC_ALL=en_US.utf8 gawk '{print(length($0))}' file1
6
$

i.e. 6 instead of 4. If we run

$ printf 'F0989A9F' | xxd -r -p | LC_ALL=en_US.utf8 awk -F '' '{print NF, length(); for (i=1; i<=NF; i++) print $i}' | cat -A
2 2$
M-pM-^XM-^Z$
M-^_$
$

it shows that what is intended to be a single 4-byte character is being treated as 2 characters, one 3 bytes and the other 1 byte.
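As a cross-check that does not depend on awk at all, the two hex strings above can be decoded as UTF-8 and counted in another tool. The following is a minimal Python sketch; the `samples` dictionary and the `describe` helper are names made up for this illustration and are not part of the original report.

```python
# Decode the two test inputs from the report and compare byte counts
# with character (code point) counts.
samples = {
    "file1": "61F09F948DF09F948E62",
    "piped input": "F0989A9F",
}


def describe(hex_string):
    """Return (byte count, code point count) for hex-encoded UTF-8 data."""
    raw = bytes.fromhex(hex_string)
    text = raw.decode("utf-8")  # raises UnicodeDecodeError on invalid UTF-8
    return len(raw), len(text)


for name, hex_string in samples.items():
    n_bytes, n_chars = describe(hex_string)
    print(f"{name}: {n_bytes} bytes, {n_chars} character(s)")
```

Both inputs decode without error, giving 10 bytes / 4 characters for file1 and 4 bytes / 1 character for the piped input, which are the counts gawk would be expected to report in a UTF-8 locale.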

https://stackoverflow.com/questions/78690533/why-does-the-match-function-not-work-in-this-particular-situation

If we create some input using:

$ echo '3C6469763E3C6469763E5F3C2F6469763E5F3C68313E6162636465665F3C2F68313E5F3C2F6469763E3C6469763EF09F93853C2F6469763E0A' | xxd -r -p > file2

and then run this on it we get the expected output shown:

$ LC_ALL=en_US.utf8 gawk '{match($0,/^.*_<h1>(.*)_<\/h1>.*$/,a); print a[1]}' file2
abcdef
$

but if we add the IGNORECASE flag we get a blank line output:

$ LC_ALL=en_US.utf8 gawk -vIGNORECASE=1 '{match($0,/^.*_<h1>(.*)_<\/h1>.*$/,a); print a[1]}' file2

$

unless we also remove the end of string delimiter, $, from the end of the regexp:

$ LC_ALL=en_US.utf8 gawk -vIGNORECASE=1 '{match($0,/^.*_<h1>(.*)_<\/h1>.*/,a); print a[1]}' file2
abcdef
$
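For comparison, the same pattern can be run against the same bytes in a different regex engine. The sketch below uses Python's `re` module, which is not gawk's matcher, so it only illustrates what the pattern would be expected to capture with and without case-insensitive matching; the `capture` helper is a hypothetical name introduced here.

```python
import re

# The exact bytes written to file2 in the report, hex-encoded.
hex_line = (
    "3C6469763E3C6469763E5F3C2F6469763E5F3C68313E6162636465665F"
    "3C2F68313E5F3C2F6469763E3C6469763EF09F93853C2F6469763E0A"
)

# Drop the trailing newline, as awk does when it reads a record.
line = bytes.fromhex(hex_line).decode("utf-8").rstrip("\n")

# Same pattern as in the gawk commands; "/" needs no escaping in Python.
pattern = r"^.*_<h1>(.*)_</h1>.*$"


def capture(flags=0):
    """Return the first capture group, or None if the pattern does not match."""
    m = re.search(pattern, line, flags)
    return m.group(1) if m else None


print(capture())               # case-sensitive
print(capture(re.IGNORECASE))  # case-insensitive
```

Both calls print `abcdef`, so at least in this engine the case-insensitive flag and the trailing `$` do not change the captured text.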

Actual behavior

See above.

Verification

I have verified that my MSYS2 is up-to-date before submitting the report (see https://www.msys2.org/docs/updating/)

Windows Version

CYGWIN_NT-10.0-22631

MINGW environments affected

Are you willing to submit a PR?

no


lazka · 2024-07-09T18:58:14Z

Without having had a closer look: a common issue is that Cygwin has a 16-bit wchar_t, and lots of code doesn't expect that.
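One way to see whether a 16-bit wchar_t is at least consistent with the reported numbers is to count UTF-16 code units for the two inputs from the report, since characters outside the Basic Multilingual Plane need two 16-bit units (a surrogate pair). This is only a rough Python sketch under that assumption, not a diagnosis of gawk's internals; `utf16_units` is a hypothetical helper name.

```python
def utf16_units(hex_string):
    """Count the 16-bit code units needed to store hex-encoded UTF-8 text as UTF-16."""
    text = bytes.fromhex(hex_string).decode("utf-8")
    # "utf-16-le" writes no byte order mark, so every 2 bytes is one code unit.
    return len(text.encode("utf-16-le")) // 2


print(utf16_units("61F09F948DF09F948E62"))  # file1 from the report
print(utf16_units("F0989A9F"))              # the single 4-byte character
```

This prints 6 and 2, the same values gawk reported for length() and NF. It does not by itself account for the 3-byte plus 1-byte split of the fields or for the match() behavior with IGNORECASE.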

Ed-Morton · 2024-08-20T10:59:46Z

Is there any more information I can provide for someone to be able to look into this bug?

Ed-Morton added the bug label Jul 6, 2024

lazka transferred this issue from msys2/MINGW-packages Jul 9, 2024
