illegal encoding warning #2485

bob-carpenter · 2018-03-09T20:22:51Z

Summary:

We should alert users when they use non-ASCII characters outside of comments. This can be handled in the preprocessor and flagged as such.

Related issues:

Obviously if we allow UTF-8 encoded Unicode, it's a different set of illegal byte sequences we have to flag (not all byte sequences form legal UTF-8).

Current Version:

v2.17.1

VMatthijs · 2018-12-13T12:53:16Z

When trying to compile

model {
  real £; 
}

stanc3 now says

Syntax error at file "test.stan", line 2, character 6, lexing error:
   -------------------------------------------------
     1:  model {
     2:    real £;
               ^
     3:  }
   -------------------------------------------------

Invalid input "\194"

bob-carpenter · 2018-12-13T13:53:12Z

Is there a way to render it as you did in the program? I I think "\194" is going to be very confusing for our users!

…

On Dec 13, 2018, at 7:53 AM, Matthijs Vákár ***@***.***> wrote: When trying to compile model { real £; } stanc3 now says Syntax error at file "test.stan", line 2, character 6, lexing error: ------------------------------------------------- 1: model { 2: real £; ^ 3: } ------------------------------------------------- Invalid input "\194" — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or mute the thread.

VMatthijs · 2018-12-13T15:03:20Z

That's fair.

Is the lazy solution of just having

Syntax error at file "test.stan", line 2, column 6, lexing error:
   -------------------------------------------------
     1:  model {
     2:    real £;
               ^
     3:  }
   -------------------------------------------------

Invalid character found.

OK?

bob-carpenter · 2018-12-13T15:27:54Z

I liked the first version better. It's not clear why it's illegal input with this message. Can we say a "non-ASCII character" or something like that? Or do you get the same error if you say real 1; or real _abc; neither of which is legal in Stan.

…

On Dec 13, 2018, at 10:03 AM, Matthijs Vákár ***@***.***> wrote: That's fair. Is the lazy solution of just having Syntax error at file "test.stan", line 2, column 6, lexing error: ------------------------------------------------- 1: model { 2: real £; ^ 3: } ------------------------------------------------- Invalid input OK? — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or mute the thread.

VMatthijs · 2018-12-13T15:31:34Z

It's not only non-ASCII that should trigger lexer errors, right? For instance, shouldn't $ also do that? The difficulty is that ocamllex captures the problematic token and returns it to the error message, but ocamllex cannot deal with unicode properly. It might be possible to just capture the original string using the positions specified by ocamllex.

bob-carpenter · 2018-12-13T16:10:38Z

On Dec 13, 2018, at 10:31 AM, Matthijs Vákár ***@***.***> wrote: It's not only non-ASCII that should trigger lexer errors, right? For instance, shouldn't $ also do that?

That's why I asked---I didn't know how the error was being triggered. Are there situations where we know it's an identifier, so we could say something like expected variable name but found illegal identifier '$'

…

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or mute the thread.

VMatthijs · 2018-12-13T16:48:18Z

We could for ASCII characters if that is preferable.

For non-ASCII, we could try replacing ocamllex with sedlex, which seems to accept unicode. Then, we could also print the £ in the error. I didn't use sedlex to start with, because I thought our conclusion was that we don't care about unicode, because C++ can't handle it. That's why I stuck with ocamllex, as it's tried and tested technology with lots of examples out there.

I can open an issue for it in stanc3.

For the time being, should I move things back to the original error?

VMatthijs · 2018-12-13T21:52:21Z

I looked into sedlex a bit more and I think I'm tempted to stay with ocamllex for now. The switch would be some work as their syntax and wiring is quite different and I have a preference to stick with the more standard technology until more people are using sedlex or we've spoken to people who have first hand experience with it.

bob-carpenter · 2018-12-13T23:36:35Z

The original version was this:

  -------------------------------------------------
     1:  model {
     2:    real £;
               ^
     3:  }
   -------------------------------------------------

Invalid input "\194"

Why is this printed as '£' in the code and '\194' in the error message? Nobody's going to understand what the \194 means.

bob-carpenter · 2018-12-14T00:04:08Z

That's fine, but I'd still like to know why £ renders in the code but not in the error message. Is that just a side effect of using sedlex? Anyway, dealing with illegal character input isn't a big deal. We could always just check all the ASCII code points are under 128 and throw an error right away if they're not.

…

On Dec 13, 2018, at 4:52 PM, Matthijs Vákár ***@***.***> wrote: I looked into sedlex a bit more and I think I'm tempted to stay with ocamllex for now. The switch would be some work as their syntax and wiring is quite different and I have a preference to stick with the more standard technology until more people are using sedlex or we've spoken to people who have first hand experience with it. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or mute the thread.

VMatthijs · 2018-12-14T13:49:08Z

The original version was this:
  -------------------------------------------------
     1:  model {
     2:    real £;
               ^
     3:  }
   -------------------------------------------------

Invalid input "\194"
Why is this printed as '£' in the code and '\194' in the error message? Nobody's going to understand what the \194 means.

Because OCaml can handle unicode. The snippet of code printed above is just extracted from the original file using ocamls IO, so it's fine. The '\194' in the error message is returned by ocamllex which tried to extract it from the original file, but failed as ocamllex isn't set up to work with unicode.

Could you explain once more what solution you'd like to see right now, before we do it properly? (I feel like there are more important issues at the moment. We can also improve some of these messages once the compiler is launched. At the moment, I feel like stanc3 is holding up a lot of other things, so I'd like to get it to a point where it can be swapped out ASAP.)

bob-carpenter · 2018-12-14T17:00:31Z

Thanks for the explanation on the mismatch between OCaml and the lexer on character encodings. When the cycles are available, the best thing to do would be to precheck the input to make sure it doesn't contain non-ASCII code points outside of comments. Another approach would be to process the escapes on the output, but that would be making strong assumptions about when \<digit>+ occurs in error output from the lexer.

…

On Dec 14, 2018, at 8:49 AM, Matthijs Vákár ***@***.***> wrote: The original version was this: ------------------------------------------------- 1: model { 2: real £; ^ 3: } ------------------------------------------------- Invalid input "\194" Why is this printed as '£' in the code and '\194' in the error message? Nobody's going to understand what the \194 means. Because OCaml can handle unicode. The snippet of code printed above is just extracted from the original file using ocamls IO, so it's fine. The '\194' in the error message is returned by ocamllex which tried to extract it from the original file, but failed as ocamllex isn't set up to work with unicode. Could you explain once more what solution you'd like to see right now, before we do it properly. (I feel like there are more important issues at the moment. We can also improve some of these messages once the compiler is launched. At the moment, I feel like stanc3 is holding up a lot of other things, so I'd like to get it to a point where it can be swapped out ASAP.) — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or mute the thread.

rok-cesnovar · 2020-02-25T13:33:22Z

This is the current error in stanc3:

Syntax error in 'examples/schools/schools.stan', line 4, column 15, lexing error:
   -------------------------------------------------
     2:    int<lower=0> J;             // number of schools
     3:    real y[J];                  // estimated treatment effect (school j)
     4:    real<lower=0> σ[J];         // std err of effect estimate (school j)
                         ^
     5:  }
     6:  parameters {
   -------------------------------------------------

Invalid character found.

Is this fine @bob-carpenter? If not, we can transfer this to stanc3 and continue there.

bob-carpenter · 2020-02-25T15:34:34Z

Yes, I think that's OK. Most users will figure out what that means.

It would be even better to also say that an identifier was expected.

bob-carpenter added feature language code cleanup good first issue labels Mar 9, 2018

bob-carpenter added this to the v3 milestone Mar 9, 2018

bob-carpenter assigned mitzimorris Mar 9, 2018

bob-carpenter mentioned this issue Mar 9, 2018

stanc gives useless error message when model source has non-ASCII characters stan-dev/rstan#501

Closed

alashworth mentioned this issue Mar 12, 2019

illegal encoding warning alashworth/test-issue-import#188

Open

rok-cesnovar closed this as completed May 9, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

illegal encoding warning #2485

illegal encoding warning #2485

bob-carpenter commented Mar 9, 2018

VMatthijs commented Dec 13, 2018

bob-carpenter commented Dec 13, 2018 via email

VMatthijs commented Dec 13, 2018 •

edited

Loading

bob-carpenter commented Dec 13, 2018 via email

VMatthijs commented Dec 13, 2018 •

edited

Loading

bob-carpenter commented Dec 13, 2018 via email

VMatthijs commented Dec 13, 2018

VMatthijs commented Dec 13, 2018

bob-carpenter commented Dec 13, 2018

bob-carpenter commented Dec 14, 2018 via email

VMatthijs commented Dec 14, 2018 •

edited

Loading

bob-carpenter commented Dec 14, 2018 via email

rok-cesnovar commented Feb 25, 2020

bob-carpenter commented Feb 25, 2020

illegal encoding warning #2485

illegal encoding warning #2485

Comments

bob-carpenter commented Mar 9, 2018

Summary:

Current Version:

VMatthijs commented Dec 13, 2018

bob-carpenter commented Dec 13, 2018 via email

VMatthijs commented Dec 13, 2018 • edited Loading

bob-carpenter commented Dec 13, 2018 via email

VMatthijs commented Dec 13, 2018 • edited Loading

bob-carpenter commented Dec 13, 2018 via email

VMatthijs commented Dec 13, 2018

VMatthijs commented Dec 13, 2018

bob-carpenter commented Dec 13, 2018

bob-carpenter commented Dec 14, 2018 via email

VMatthijs commented Dec 14, 2018 • edited Loading

bob-carpenter commented Dec 14, 2018 via email

rok-cesnovar commented Feb 25, 2020

bob-carpenter commented Feb 25, 2020

VMatthijs commented Dec 13, 2018 •

edited

Loading

VMatthijs commented Dec 13, 2018 •

edited

Loading

VMatthijs commented Dec 14, 2018 •

edited

Loading