GNU awk: odd behavior of length(), match() and field splitting with multi-byte characters #4775

Open
2 of 7 tasks
Ed-Morton opened this issue Jul 6, 2024 · 2 comments

@Ed-Morton

Description / Steps to reproduce the issue

gawk in Cygwin on Windows reports odd lengths and matches for strings that contain multi-byte characters. Someone on StackOverflow asked about a couple of issues that, so far at least, no one there can explain and that seem to just be bugs. I posted a bug report to the GNU awk bugs mailing list, https://lists.gnu.org/archive/html/bug-gawk/2024-07/msg00000.html; the feedback there is that it's a Cygwin or MSYS2 port issue. Could you please take a look? I also emailed cygwin at cygwin dot com per the advice from the GNU bug list.

Expected behavior

  1. https://stackoverflow.com/questions/78676444/conditional-replacement-of-arbitrarily-large-strings-that-occur-at-arbitrary-dis#comment138715434_78676444 and https://stackoverflow.com/questions/78676444/conditional-replacement-of-arbitrarily-large-strings-that-occur-at-arbitrary-dis#comment138720207_78676444:

If we write 4 characters (two of them 4-byte UTF-8 characters) as 10 bytes using:

$ echo '61F09F948DF09F948E62' | xxd -r -p > file1
$

and run the following gawk command on it we get the output shown:

$ LC_ALL=en_US.utf8 gawk '{print(length($0))}' file1
6
$

i.e. 6 instead of 4. If we run

$ printf 'F0989A9F' | xxd -r -p | LC_ALL=en_US.utf8 awk -F '' '{print NF, length(); for (i=1; i<=NF; i++) print $i}' | cat -A
2 2$
M-pM-^XM-^Z$
M-^_$
$

it shows that what is intended to be a single 4-byte character is being treated as 2 characters, one of 3 bytes and the other of 1 byte.

  2. https://stackoverflow.com/questions/78690533/why-does-the-match-function-not-work-in-this-particular-situation

If we create some input using:

$ echo '3C6469763E3C6469763E5F3C2F6469763E5F3C68313E6162636465665F3C2F68313E5F3C2F6469763E3C6469763EF09F93853C2F6469763E0A' | xxd -r -p > file2

and then run the following on it, we get the expected output shown:

$ LC_ALL=en_US.utf8 gawk '{match($0,/^.*_<h1>(.*)_<\/h1>.*$/,a); print a[1]}' file2
abcdef
$

but if we add the IGNORECASE flag we get a blank line output:

$ LC_ALL=en_US.utf8 gawk -vIGNORECASE=1 '{match($0,/^.*_<h1>(.*)_<\/h1>.*$/,a); print a[1]}' file2

$

unless we also remove the end-of-string anchor, $, from the end of the regexp:

$ LC_ALL=en_US.utf8 gawk -vIGNORECASE=1 '{match($0,/^.*_<h1>(.*)_<\/h1>.*/,a); print a[1]}' file2
abcdef
$
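Both reproductions above build their input with xxd -r -p. Where xxd is not installed, the same bytes could be produced with a small POSIX shell function. This is only a sketch: hex_to_bytes is a hypothetical helper name that does not appear in the report, and it assumes an even number of hex digits with no whitespace.

```shell
# hex_to_bytes (hypothetical helper): write the bytes named by a string of
# hex digit pairs to stdout. Assumes an even number of hex digits.
hex_to_bytes() {
  hex=$1
  while [ -n "$hex" ]; do
    rest=${hex#??}        # everything after the first two digits
    pair=${hex%"$rest"}   # the first two digits
    hex=$rest
    # Turn the pair into a 3-digit octal escape, which POSIX printf understands.
    printf "\\$(printf '%03o' "$((0x$pair))")"
  done
}

# Same bytes as the first example: 'a', two 4-byte characters, 'b'.
# od dumps them back as hex so the result can be checked by eye.
hex_to_bytes '61F09F948DF09F948E62' | od -An -tx1
```

Redirecting the function's output to file1 or file2 instead of piping it to od should give the same input files as the xxd commands.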

Actual behavior

See above.

Verification

Windows Version

CYGWIN_NT-10.0-22631

MINGW environments affected

  • MINGW64
  • MINGW32
  • UCRT64
  • CLANG64
  • CLANG32
  • CLANGARM64

Are you willing to submit a PR?

no

@Ed-Morton Ed-Morton added the bug label Jul 6, 2024
@lazka
Member

lazka commented Jul 9, 2024

Without having had a closer look: a common issue is that Cygwin has a 16-bit wchar_t, and lots of code doesn't expect that.
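As a rough check of that idea against the numbers in the report, the sketch below counts bytes, code points, and UTF-16 code units for the two inputs used above. It is plain POSIX shell arithmetic, not a look at gawk's internals, and count_units is a hypothetical helper name. It assumes the input is valid UTF-8 given as space-separated hex bytes, and that a 16-bit wchar_t holds one UTF-16 code unit.

```shell
# count_units (hypothetical helper): given space-separated hex bytes of valid
# UTF-8, print the number of bytes, code points, and UTF-16 code units.
count_units() {
  bytes=0 codepoints=0 utf16_units=0
  for b in $1; do
    v=$((0x$b))
    bytes=$((bytes + 1))
    # Continuation bytes (10xxxxxx) do not start a new character.
    if [ $((v & 0xC0)) -ne 128 ]; then
      codepoints=$((codepoints + 1))
      utf16_units=$((utf16_units + 1))
      # Lead bytes F0..F4 start a 4-byte sequence, i.e. a code point above
      # U+FFFF, which UTF-16 encodes as a surrogate pair (two units).
      if [ "$v" -ge 240 ]; then
        utf16_units=$((utf16_units + 1))
      fi
    fi
  done
  echo "bytes=$bytes codepoints=$codepoints utf16_units=$utf16_units"
}

count_units '61 f0 9f 94 8d f0 9f 94 8e 62'   # contents of file1
count_units 'f0 98 9a 9f'                     # the single 4-byte character
```

This prints 4 code points but 6 UTF-16 units for file1, and 1 code point but 2 UTF-16 units for the single character. Those unit counts match the reported length() results of 6 and 2, which is consistent with the 16-bit wchar_t explanation but does not by itself confirm it.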

@lazka lazka transferred this issue from msys2/MINGW-packages Jul 9, 2024
@Ed-Morton
Author

Is there any more information I can provide for someone to be able to look into this bug?
