Make printf %c work on Unicode codepoints #236
Conversation
Also update the printf %c tests
Hi @oliverkwebb -- thanks for the contribution. I didn't want to do this earlier, as the GoAWK functions mostly deal in bytes, but I don't think it hurts to pull in the Unicode behaviour for printf's %c, so I think I'll include this. However, I've put up an alternative at #237 that keeps using the […]. I've also fixed/updated the tests, which were breaking, and added a couple more. If you're okay with #237, let's close this one and merge that. Alternatively, you could update this branch with those changes if you'd like your name on it.
Commit 6b2baea
Looks good -- just one mistake, probably from an old commit/branch.
interp/functions.go (outdated):

    @@ -405,25 +405,26 @@ func (p *interp) sprintf(format string, args []value) (string, error) {
     	case 's':
     		v = p.toString(a)
     	case 'd':
    -		v = int64(a.num())
    +		v = int(a.num())
Can you please revert this change (and the similar `uint` one on line 412)? They should go back to `int64` and `uint64`. They are probably from an old branch.
Commit 403d55d

(P.S. I'm not sure why you don't have write access to my PR branch -- I have "allow edits by maintainers" checked, and if there's another setting for it, I can't find it.)
Thanks again for the contribution!
On Tue, Sep 17, 2024 at 10:51, John Earnest wrote:
In gawk, this kind of unicode handling can be disabled with -b or --characters-as-bytes. Would you be willing to entertain introducing an equivalent flag for goawk?
This patch doesn't cover general-purpose Unicode handling (which, from what I understand, was intentionally left out due to O(n) time complexity). It doesn't affect `substr`, `match`, `toupper`, `tolower`, `index`, or `split` (the 6 functions in awk where Unicode matters).

Although it might break the ability to print bytes like 0xff, since it assumes the argument is a Unicode codepoint. So there is a minor behaviour break with this patch (does it outweigh the benefit of being able to print Unicode code points?)
… Rather selfishly, forcing unicode handling breaks some capabilities of [lila](https://github.com/JohnEarnest/Decker/tree/main/tools/awk) under goawk.
I specifically use […]
Hmm, yeah, I might have jumped the gun on […]. Have the Go API and the […].

If at a later date I manage to get the chars-based handling O(1) -- which is tricky and realistically unlikely to happen -- I'd flip the default of the […].

Thoughts?
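For illustration only -- the flag name and plumbing below are hypothetical, not GoAWK's actual API -- gating %c on a byte/char mode, as gawk does with `-b`/`--characters-as-bytes`, might look like this:

```go
package main

import "fmt"

// formatC renders an awk printf("%c", n) integer argument. charMode is
// a hypothetical flag: true means treat n as a Unicode codepoint,
// false means emit only the low byte (byte mode, like gawk's -b).
func formatC(n int, charMode bool) string {
	if charMode {
		return fmt.Sprintf("%c", rune(n))
	}
	return string([]byte{byte(n)})
}

func main() {
	// Codepoint 960 is U+03C0 (π); in byte mode only the low byte
	// (960 & 0xFF = 0xC0) is emitted.
	fmt.Printf("%q %q\n", formatC(960, true), formatC(960, false))
}
```

The same switch could hang off a CLI flag or an option on the Go API's interpreter config; either way the default chosen here determines backward compatibility.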
I'd be fine with Unicode char output being off by default and opt-in with […].
Yeah, I'm fine with byte handling being the default and Unicode handling with […].

I was thinking about how one would get Unicode handling in O(1), and the only algorithm I could come up with would make any string representation substantially bigger. The most obvious solution is a byte index <-> Unicode character index converter (which is all Unicode support in awk really is) on any anonymous string. Some such index is essential for O(1) lookups, because otherwise you don't know what's in the string: you need to scan the string you're indexing to see what's there and count the UTF-8 characters/bytes. <probably a bad idea> This is faster in the same way that a hashset is faster than an array (it's not, for small values of n, because of overhead). Those are my immediate thoughts on O(1) Unicode handling. There are probably different, better algorithms for this, but I don't know.
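The converter idea above could be sketched in Go roughly as follows (my own illustration, not a proposed GoAWK implementation): scan each string once to record where every rune starts, paying O(n) build time and O(n) extra memory, after which character-index to byte-index lookups are O(1).

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeIndex maps character positions to byte offsets for one string.
type runeIndex struct {
	starts []int // starts[i] = byte offset where rune i begins
}

// build scans the string once: O(n) time, up to O(n) extra space.
func build(s string) runeIndex {
	starts := make([]int, 0, len(s))
	for i := 0; i < len(s); {
		starts = append(starts, i)
		_, size := utf8.DecodeRuneInString(s[i:])
		i += size
	}
	return runeIndex{starts}
}

// byteOffset returns the byte offset of the char-th rune in O(1).
func (r runeIndex) byteOffset(char int) int {
	return r.starts[char]
}

func main() {
	s := "héllo"
	idx := build(s)
	// The first 'l' is character index 2 but starts at byte 3,
	// because 'é' occupies two bytes.
	fmt.Println(idx.byteOffset(2))
}
```

This is exactly the hashset-vs-array trade-off described above: for short strings the per-string slice costs more than it saves, and every string mutation invalidates the index.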
I've done this in #243 -- take a look if you like. I'll merge it and do a new release in the next couple of days.
I noticed goawk couldn't handle UTF-8 strings. While there already seems to be ample discussion about this, one of the UTF-8 features that doesn't seem to incur issues with O(n) speed is

printf("%c", [CODEPOINT])

which, in the one true awk, Ray Gardner's awk, and gawk, works with Unicode codepoints and UTF-8 strings. This is standard behaviour in both the one true awk (since they added Unicode support; the feature is mentioned in the second edition of the awk book) and gawk. It is similar to "\u[CODEPOINT]".