Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement char extension in Rust tools #254

Merged
merged 8 commits into from
Jul 5, 2023
Merged

Conversation

Pat-Lafon
Copy link
Contributor

I've gone and implemented the char extension #253 in bril-rs and it's corresponding Rust tools: bril2txt, bril2json, and brilirs.

The main implementation difference is that I've used a u16 as the internal representation of a char which I believe is an equivalent representation. It is possible that implementation differences could lead to slightly different behavior though the rust tools currently pass all char tests.

I have a couple of notes for what they are worth:

  • With Control characters, the following doesn't "round trip" bril2json < test/interp/char/control_char.bril | bril2txt in the sense that the outputted text file doesn't have the control characters escaped. In this example, the resulting file isn't valid. Similarly, bril2json < test/interp/char/control_char.bril will output some of the control values as \u000b which doesn't seem to cause an issue but might be surprising given that the larger set of escape characters is unsupported(maybe this is fine, I'm not sure).
  • bril2json < ../test/interp/char/char_args.bril | brili '\n' e y errors out and probably shouldn't?
  • Javascript strings allow for "lone surrogates" which is one half of a surrogate pair that is an invalid character on it's own. Maybe brili's check for the range of int2char should be more strict to disallow the following program? `
@main() {
  i1: int = const 56193;

  c2: char = int2char i1;

  print c2;
}

@tmaradc do you have any thoughts on this?

@Pat-Lafon Pat-Lafon force-pushed the char branch 2 times, most recently from fa015ba to d404fda Compare May 30, 2023 19:30
@sampsyo
Copy link
Owner

sampsyo commented May 30, 2023

Awesome; thanks for getting this going! And for catching those problems in the text formatter; it does make sense to escape characters on output IMO (so we end up with a valid source file).

Before we merge this with a u16 as the internal type on the Rust side, what do you think about changing the specification to instead have each char represent a single Unicode scalar codepoint instead of a UTF-16 value? This is subtly different, but different in important ways that I would like because I think UTF-16 is kind of bad. The upshot is that the Bril char type would be an exact match for the Rust char type. We would also not worry about surrogates (or lone surrogates) because those are a UTF-16 artifact and not Unicode scalars.

I could be wrong, but I think the existing JS implementation would still be correct-ish; it just would not support Unicode scalars beyond the BMP. We could create a bug for supporting beyond-BMP characters in the JavaScript stuff and consider this a bug (but not too worrisome of a bug).

@tmaradc
Copy link
Contributor

tmaradc commented May 31, 2023

Nice! I'm really glad that the char extension will have support in other tools.
Thank you for pointing out these notes!

  • With Control characters, the following doesn't "round trip" bril2json < test/interp/char/control_char.bril | bril2txt in the sense that the outputted text file doesn't have the control characters escaped. In this example, the resulting file isn't valid. Similarly, bril2json < test/interp/char/control_char.bril will output some of the control values as \u000b which doesn't seem to cause an issue but might be surprising given that the larger set of escape characters is unsupported(maybe this is fine, I'm not sure).

  • bril2json < ../test/interp/char/char_args.bril | brili '\n' e y errors out and probably shouldn't?

The idea was to use the control characters escaped only in the text format since this is the human-friendly version, so I left it out from brili. But I can see that it would be better to treat them in brili as the bril2txt does.

  • Javascript strings allow for "lone surrogates" which is one half of a surrogate pair that is an invalid character on it's own. Maybe brili's check for the range of int2char should be more strict to disallow the following program? `

Totally agree!

I'm willing to make these changes!

@Pat-Lafon
Copy link
Contributor Author

Before we merge this with a u16 as the internal type on the Rust side, what do you think about changing the specification to instead have each char represent a single Unicode scalar codepoint instead of a UTF-16 value? This is subtly different, but different in important ways that I would like because I think UTF-16 is kind of bad. The upshot is that the Bril char type would be an exact match for the Rust char type. We would also not worry about surrogates (or lone surrogates) because those are a UTF-16 artifact and not Unicode scalars.

It's certainly easier to implement on my end so that would be fine with me. I've added a commit with this change.

I could be wrong, but I think the existing JS implementation would still be correct-ish; it just would not support Unicode scalars beyond the BMP. We could create a bug for supporting beyond-BMP characters in the JavaScript stuff and consider this a bug (but not too worrisome of a bug).

Would need to disallow lone "Surrogate code points" and allow larger "Unicode scalar values" as with the failing char-error/badconversion.bril which I think would be valid.

@Pat-Lafon
Copy link
Contributor Author

The idea was to use the control characters escaped only in the text format since this is the human-friendly version, so I left it out from brili. But I can see that it would be better to treat them in brili as the bril2txt does.

This does make sense though I would side on being consistent with the handling of control characters if possible.

  • Javascript strings allow for "lone surrogates" which is one half of a surrogate pair that is an invalid character on it's own. Maybe brili's check for the range of int2char should be more strict to disallow the following program? `

Totally agree!

Maybe representing brili's char as a u32 would be easiest given Javascript's weirdness with unicode? Not totally sure.

@sampsyo
Copy link
Owner

sampsyo commented Jun 1, 2023

bril2json < ../test/interp/char/char_args.bril | brili '\n' e y errors out and probably shouldn't?

Here is something I don't quite understand yet, about shells in general: for that command line, does the brili executable see the command line argument \n (two bytes in UTF-8, a \ and then an n) or does it see a single newline character as its first argument? If it's the former, I think the program should crash because the argument is not a single character. If it's the latter, sounds good to me and the program should succeed.

The reason is that, IMO, the backslash-escapes should only be the domain of the text format, which brili is not supposed to know about. It should work on raw characters. To the extent that the shell turns \n into newlines, that's the shell's domain.

@tmaradc tmaradc mentioned this pull request Jun 7, 2023
@Pat-Lafon
Copy link
Contributor Author

Here is something I don't quite understand yet, about shells in general: for that command line, does the brili executable see the command line argument \n (two bytes in UTF-8, a \ and then an n) or does it see a single newline character as its first argument? If it's the former, I think the program should crash because the argument is not a single character. If it's the latter, sounds good to me and the program should succeed.

I've found out you can do something like $'\n' to pass the actual newline character(atleast in bash), your above comment is seen as two separate characters which is what happens with the text representation.

The reason is that, IMO, the backslash-escapes should only be the domain of the text format, which brili is not supposed to know about. It should work on raw characters. To the extent that the shell turns \n into newlines, that's the shell's domain.

I don't necessarily agree since the text format and command line arguments are generally both forms of user-provided input and I think they should have similar semantics.

However it's kind of a niche case so an error is probably fine. Usually that makes this a "bad program" and then I would fall back to implementations (such as efficient compilers) are free to do whatever they want to bad programs.

@sampsyo
Copy link
Owner

sampsyo commented Jun 9, 2023

Ah, sure, I dig that notion. It is pretty cool that you can do $'\n' on the shell though… if it's that convenient to pass stuff in like this, maybe that obviates the need for friendly escape parsing in the interpreter?

@Pat-Lafon Pat-Lafon force-pushed the char branch 2 times, most recently from 6137ccc to a2889d2 Compare June 29, 2023 22:27
@Pat-Lafon Pat-Lafon force-pushed the char branch 2 times, most recently from 6928f0e to 8e80c7f Compare June 30, 2023 18:21
Copy link
Owner

@sampsyo sampsyo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good so far! Is the reason store-char.bril is disabled because the new type doesn't interact with the memory extension yet?

bril-rs/brillvm/Makefile Outdated Show resolved Hide resolved
@Pat-Lafon
Copy link
Contributor Author

Looks pretty good so far! Is the reason store-char.bril is disabled because the new type doesn't interact with the memory extension yet?

Specifically, the brillvm and brilift tools do not support the char extension atm(so they use that hack to ignore store-char.bril in the mixed test suite).

This extension should be fully supported and compose well with the memory extension in the (rust)bril2json, bril2txt, and brilirs tools(this pr).

@sampsyo
Copy link
Owner

sampsyo commented Jul 5, 2023

Ah, right; that explains it! Many thanks as ever!!!

@sampsyo sampsyo merged commit 1095460 into sampsyo:main Jul 5, 2023
@Pat-Lafon Pat-Lafon deleted the char branch July 5, 2023 17:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants