Make printf %c work on Unicode codepoints #236

Merged
3 commits merged into benhoyt:master
Aug 8, 2024

Conversation

oliverkwebb

I noticed goawk can't handle UTF-8 strings. While there already seems to be ample discussion about this, one UTF-8 feature that doesn't seem to raise the O(n) speed concerns is printf("%c", [CODEPOINT]), which in the one true awk, Ray Gardner's awk, and gawk works with Unicode codepoints and UTF-8 strings:

$ for awk in nawk gawk goawk; do $awk 'BEGIN {printf("%c", "ß")}' | xxd; done
00000000: c39f                                     ..
00000000: c39f                                     ..
00000000: c3                                       .
$ for awk in nawk gawk goawk; do $awk 'BEGIN {printf("%c", 400)}' | xxd; done
00000000: c690                                     ..
00000000: c690                                     ..
00000000: 90                                       .

This is standard behavior in both the one true awk (since it gained Unicode support; the feature is mentioned in the second edition of the AWK book) and gawk, similar to "\u[CODEPOINT]" escapes.

benhoyt added a commit that referenced this pull request Aug 8, 2024
@benhoyt
Owner

benhoyt commented Aug 8, 2024

Hi @oliverkwebb -- thanks for the contribution. I didn't want to do this earlier as the GoAWK functions mostly deal in bytes, but I don't think it hurts to pull in the Unicode behaviour for printf's %c. So I think I'll include this. However, I've put up an alternative at #237 that keeps using the []byte and doesn't require decoding the entire string (if it's longer than 1 rune) to []rune first.

I've also fixed/updated the tests, which were breaking, and added a couple more. If you're okay with #237, let's close this one and merge that. Alternatively you could update this branch with those changes if you'd like your name on it.
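For readers following along, here is a minimal, hypothetical sketch of the kind of Unicode-aware %c handling being discussed: a numeric argument is treated as a codepoint and encoded as UTF-8, while a string argument contributes only its first rune, so the whole string never needs decoding to []rune. This is illustrative only, not the actual code from this PR or #237; formatC and its parameters are invented for the example.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// formatC sketches Unicode-aware %c: numbers are codepoints, strings give
// their first rune. Only the first rune is decoded, never the whole string.
func formatC(isNum bool, num float64, s string) string {
	if isNum {
		return string(rune(int(num))) // encode the codepoint as UTF-8
	}
	r, _ := utf8.DecodeRuneInString(s)
	return string(r)
}

func main() {
	fmt.Printf("%x\n", formatC(true, 400, "")) // c690, matching the nawk/gawk output above
	fmt.Printf("%x\n", formatC(false, 0, "ß")) // c39f
}
```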

@oliverkwebb
Author

Commit 6b2baea

@benhoyt benhoyt left a comment


Looks good -- just one mistake, probably from an old commit/branch.

@@ -405,25 +405,26 @@ func (p *interp) sprintf(format string, args []value) (string, error) {
 case 's':
 	v = p.toString(a)
 case 'd':
-	v = int64(a.num())
+	v = int(a.num())
Owner

Can you please revert this change (and the similar uint one on line 412)? They should go back to int64 and uint64. They are probably from an old branch.

Author

Commit 403d55d

@benhoyt benhoyt changed the title printf %c works on unicode codepoints (behavior in nawk, gawk, wak, etc...,) Make printf %c work on Unicode codepoints Aug 8, 2024
@oliverkwebb
Author

(P.S. I'm not sure why you don't have write access to my PR branch; I have "allow edits by maintainers" checked, and if there's another setting for it, I can't find it.)

@benhoyt benhoyt merged commit f39db9b into benhoyt:master Aug 8, 2024
11 checks passed
@benhoyt
Owner

benhoyt commented Aug 8, 2024

Thanks again for the contribution!

@JohnEarnest

In gawk, this kind of unicode handling can be disabled with -b or --characters-as-bytes. Would you be willing to entertain introducing an equivalent flag for goawk?

Rather selfishly, forcing unicode handling breaks some capabilities of lila under goawk.

@oliverkwebb
Author

oliverkwebb commented Sep 17, 2024 via email

@JohnEarnest

I specifically use %c as a method of printing raw binary data, which is mangled by forcing the interpretation of characters as Unicode instead of bytes. I'm willing to accept that Unicode output is a reasonable default for many applications, but it is very inconvenient to not have a mechanism for opting out of this behavior.

@benhoyt
Owner

benhoyt commented Sep 17, 2024

Hmm, yeah, I might have jumped the gun on printf %c. GoAWK is now kind of half Unicode chars (printf %c), half bytes (everything else). This isn't great -- it should be all or nothing. But because the implementation of using chars instead of bytes changes the performance characteristics (see #95), I'm not willing to turn it on by default. So what do you think of this:

Have the Go API and the goawk command default to bytes (revert the new printf %c handling by default). But you can opt into chars handling with goawk -c (including printf %c) or setting a new interp.Config field Chars bool to true if using the API. Kind of the reverse of gawk -b.
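As a rough illustration of that proposal (not working code against any released GoAWK version): if the hypothetical Chars field were added to interp.Config as described, opting in via the Go API might look roughly like this.

```go
package main

import (
	"fmt"
	"os"

	"github.com/benhoyt/goawk/interp"
	"github.com/benhoyt/goawk/parser"
)

func main() {
	prog, err := parser.ParseProgram([]byte(`BEGIN { printf("%c", 400) }`), nil)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	config := &interp.Config{
		// Chars is the *proposed* field described above; it does not exist yet.
		Chars: true,
	}
	if _, err := interp.ExecProgram(prog, config); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Under the same proposal, the command-line equivalent would simply be goawk -c 'BEGIN { printf("%c", 400) }'.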

If at a later date I manage to get the chars-based handling O(1) -- which is tricky and realistically unlikely to happen -- I'd flip the default of the goawk command (but not the Go API).

Thoughts?

@JohnEarnest

I'd be fine with Unicode char output being off by default and opt-in with -c; it would be great for all of my foreseeable uses. This would also keep GoAWK's defaults more compatible with the behavior of mawk.

@oliverkwebb
Author

Yeah, I'm fine with byte handling being the default and Unicode handling behind -c (although -c in gawk means traditional/compatibility mode)

> If at a later date I manage to get the chars-based handling O(1) -- which is tricky and realistically unlikely to happen

I was thinking about how one would get O(1) Unicode handling, and the only algorithm I could come up with would make any string representation substantially bigger.

The most obvious solution is a byte index <-> Unicode character index converter (which is all Unicode support in awk really is) for any anonymous string. That is inherently O(n), because you don't know what's in the string: you have to scan the string you're indexing to see what's there and count the UTF-8 characters/bytes.

<probably a bad idea>
Although, the contents of the string don't have to be unknown before conversion. If you kept a lookup table of "this many Unicode characters before this one" (you'd have to update it every time the string changes, so other string operations become O(n) if they weren't already), you would get O(1) lookups, while making each string roughly twice as large (more likely 4x, since you need an int32 lookup table for large strings unless you get REALLY clever and cumbersome about how they're stored).

This is faster in the same way that a hashset is faster than an array (it isn't for small values of n, because of overhead).
</probably a bad idea>

Those are my immediate thoughts on O(1) Unicode handling. There are probably better algorithms for this, but I don't know them.
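A rough sketch of the lookup-table idea described above, purely for illustration (indexedString and its methods are invented names, not part of GoAWK): pay O(n) once to record where each rune starts, and later character-index lookups become O(1) at the cost of one extra int per rune.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// indexedString pairs a string with the byte offset of each rune,
// trading memory for O(1) char-index lookups.
type indexedString struct {
	s       string
	offsets []int // offsets[i] = byte offset where rune i starts
}

func newIndexedString(s string) indexedString {
	offsets := make([]int, 0, utf8.RuneCountInString(s))
	for i := 0; i < len(s); {
		offsets = append(offsets, i)
		_, size := utf8.DecodeRuneInString(s[i:])
		i += size
	}
	return indexedString{s: s, offsets: offsets}
}

// charAt returns the i-th character (1-based, as AWK counts) without
// rescanning the string from the start.
func (is indexedString) charAt(i int) string {
	start := is.offsets[i-1]
	if i < len(is.offsets) {
		return is.s[start:is.offsets[i]]
	}
	return is.s[start:]
}

func main() {
	is := newIndexedString("heißt")
	fmt.Println(is.charAt(4)) // ß
}
```

The table would of course have to be rebuilt whenever the string changes, which is exactly the O(n) maintenance cost noted above.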

@benhoyt
Owner

benhoyt commented Sep 18, 2024

I've done this in #243 -- take a look if you like. I'll merge it and do a new release in the next couple of days.
