Make printf %c work on Unicode codepoints #236

Merged
3 commits merged into benhoyt:master
Aug 8, 2024

Conversation

oliverkwebb

I noticed goawk can't handle UTF-8 strings. While there already seems to be ample discussion about this, one UTF-8 feature that doesn't seem to raise the O(n) speed concerns is printf("%c", [CODEPOINT]), which in the one true awk, Ray Gardner's awk, and gawk works with Unicode codepoints and UTF-8 strings:

$ for awk in nawk gawk goawk; do $awk 'BEGIN {printf("%c", "ß")}' | xxd; done
00000000: c39f                                     ..
00000000: c39f                                     ..
00000000: c3                                       .
$ for awk in nawk gawk goawk; do $awk 'BEGIN {printf("%c", 400)}' | xxd; done
00000000: c690                                     ..
00000000: c690                                     ..
00000000: 90                                       .

This is standard behavior in both the one true awk (since it gained Unicode support; the feature is mentioned in the second edition of the AWK book) and gawk, similar to "\u[CODEPOINT]" escapes.

benhoyt added a commit that referenced this pull request Aug 8, 2024
@benhoyt
Owner

benhoyt commented Aug 8, 2024

Hi @oliverkwebb -- thanks for the contribution. I didn't want to do this earlier as the GoAWK functions mostly deal in bytes, but I don't think it hurts to pull in the Unicode behaviour for printf's %c. So I think I'll include this. However, I've put up an alternative at #237 that keeps using the []byte and doesn't require decoding the entire string (if it's longer than 1 rune) to []rune first.

I've also fixed/updated the tests, which were breaking, and added a couple more. If you're okay with #237, let's close this one and merge that. Alternatively you could update this branch with those changes if you'd like your name on it.
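For readers following along, here is a minimal, hypothetical sketch of the kind of Unicode-aware %c handling being discussed: a numeric argument is treated as a codepoint and encoded as UTF-8, while a string argument contributes only its first rune, so the whole string never needs decoding to []rune. This is illustrative only, not the actual code from this PR or #237; formatC and its parameters are invented for the example.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// formatC sketches Unicode-aware %c: numbers are codepoints, strings give
// their first rune. Only the first rune is decoded, never the whole string.
func formatC(isNum bool, num float64, s string) string {
	if isNum {
		return string(rune(int(num))) // encode the codepoint as UTF-8
	}
	r, _ := utf8.DecodeRuneInString(s)
	return string(r)
}

func main() {
	fmt.Printf("%x\n", formatC(true, 400, "")) // c690, matching the nawk/gawk output above
	fmt.Printf("%x\n", formatC(false, 0, "ß")) // c39f
}
```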

@oliverkwebb
Author

Commit 6b2baea

@benhoyt benhoyt left a comment


Looks good -- just one mistake, probably from an old commit/branch.

@@ -405,25 +405,26 @@ func (p *interp) sprintf(format string, args []value) (string, error) {
 case 's':
 	v = p.toString(a)
 case 'd':
-	v = int64(a.num())
+	v = int(a.num())
Owner

Can you please revert this change (and the similar uint one on line 412)? They should go back to int64 and uint64. They are probably from an old branch.

Author

Commit 403d55d

@benhoyt benhoyt changed the title printf %c works on unicode codepoints (behavior in nawk, gawk, wak, etc...,) Make printf %c work on Unicode codepoints Aug 8, 2024
@oliverkwebb
Author

(P.S. I'm not sure why you don't have write access to my PR branch; I have "allow edits by maintainers" checked, and if there's another setting for it, I can't find it.)

@benhoyt benhoyt merged commit f39db9b into benhoyt:master Aug 8, 2024
11 checks passed
@benhoyt
Owner

benhoyt commented Aug 8, 2024

Thanks again for the contribution!

@JohnEarnest

In gawk, this kind of unicode handling can be disabled with -b or --characters-as-bytes. Would you be willing to entertain introducing an equivalent flag for goawk?

Rather selfishly, forcing unicode handling breaks some capabilities of lila under goawk.

@oliverkwebb
Author

oliverkwebb commented Sep 17, 2024 via email

@JohnEarnest

I specifically use %c as a method of printing raw binary data, which is mangled by forcing the interpretation of characters as Unicode instead of bytes. I'm willing to accept that Unicode output is a reasonable default for many applications, but it is very inconvenient to not have a mechanism for opting out of this behavior.

@benhoyt
Owner

benhoyt commented Sep 17, 2024

Hmm, yeah, I might have jumped the gun on printf %c. GoAWK is now kind of half Unicode chars (printf %c), half bytes (everything else). This isn't great -- it should be all or nothing. But because the implementation of using chars instead of bytes changes the performance characteristics (see #95), I'm not willing to turn it on by default. So what do you think of this:

Have the Go API and the goawk command default to bytes (revert the new printf %c handling by default). But you can opt into chars handling with goawk -c (including printf %c) or setting a new interp.Config field Chars bool to true if using the API. Kind of the reverse of gawk -b.
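As a rough illustration of that proposal (not working code against any released GoAWK version): if the hypothetical Chars field were added to interp.Config as described, opting in via the Go API might look roughly like this.

```go
package main

import (
	"fmt"
	"os"

	"github.com/benhoyt/goawk/interp"
	"github.com/benhoyt/goawk/parser"
)

func main() {
	prog, err := parser.ParseProgram([]byte(`BEGIN { printf("%c", 400) }`), nil)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	config := &interp.Config{
		// Chars is the *proposed* field described above; it does not exist yet.
		Chars: true,
	}
	if _, err := interp.ExecProgram(prog, config); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Under the same proposal, the command-line equivalent would simply be goawk -c 'BEGIN { printf("%c", 400) }'.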

If at a later date I manage to get the chars-based handling O(1) -- which is tricky and realistically unlikely to happen -- I'd flip the default of the goawk command (but not the Go API).

Thoughts?

@JohnEarnest

I'd be fine with Unicode char output being off by default and opt-in with -c; it would be great for all of my foreseeable uses. This would also keep GoAWK's defaults more compatible with the behavior of mawk.

@oliverkwebb
Author

Yeah, I'm fine with byte handling being the default and Unicode handling behind -c (although -c in gawk means traditional/compatibility mode)

> If at a later date I manage to get the chars-based handling O(1) -- which is tricky and realistically unlikely to happen

I was thinking about how one would get O(1) Unicode handling, and the only algorithm I could come up with would make any string representation substantially bigger.

The most obvious solution is a byte index <-> Unicode character index converter (which is all Unicode support in awk really is) for any anonymous string. That is inherently O(n), because you don't know what's in the string: you have to scan the string you're indexing to see what's there and count the UTF-8 characters/bytes.

<probably a bad idea>
Although, the contents of the string don't have to be unknown before conversion. If you kept a lookup table of "this many Unicode characters before this one" (you'd have to update it every time the string changes, so other string operations become O(n) if they weren't already), you would get O(1) lookups, while making each string roughly twice as large (more likely 4x, since you need an int32 lookup table for large strings unless you get REALLY clever and cumbersome about how they're stored).

This is faster in the same way that a hashset is faster than an array (it isn't for small values of n, because of overhead).
</probably a bad idea>

Those are my immediate thoughts on O(1) Unicode handling. There are probably better algorithms for this, but I don't know them.
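A rough sketch of the lookup-table idea described above, purely for illustration (indexedString and its methods are invented names, not part of GoAWK): pay O(n) once to record where each rune starts, and later character-index lookups become O(1) at the cost of one extra int per rune.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// indexedString pairs a string with the byte offset of each rune,
// trading memory for O(1) char-index lookups.
type indexedString struct {
	s       string
	offsets []int // offsets[i] = byte offset where rune i starts
}

func newIndexedString(s string) indexedString {
	offsets := make([]int, 0, utf8.RuneCountInString(s))
	for i := 0; i < len(s); {
		offsets = append(offsets, i)
		_, size := utf8.DecodeRuneInString(s[i:])
		i += size
	}
	return indexedString{s: s, offsets: offsets}
}

// charAt returns the i-th character (1-based, as AWK counts) without
// rescanning the string from the start.
func (is indexedString) charAt(i int) string {
	start := is.offsets[i-1]
	if i < len(is.offsets) {
		return is.s[start:is.offsets[i]]
	}
	return is.s[start:]
}

func main() {
	is := newIndexedString("heißt")
	fmt.Println(is.charAt(4)) // ß
}
```

The table would of course have to be rebuilt whenever the string changes, which is exactly the O(n) maintenance cost noted above.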

@benhoyt
Owner

benhoyt commented Sep 18, 2024

I've done this in #243 -- take a look if you like. I'll merge it and do a new release in the next couple of days.
