illegal encoding warning #2485

Closed
bob-carpenter opened this issue Mar 9, 2018 · 14 comments

Comments

@bob-carpenter
Contributor

Summary:

We should alert users when they use non-ASCII characters outside of comments. This can be handled in the preprocessor and flagged as such.
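As a rough illustration of the kind of check described here, the following Python sketch scans Stan source text and reports non-ASCII characters that fall outside comments. It is not the stanc or stanc3 implementation; the function name `non_ascii_outside_comments` is hypothetical, and it assumes only `//` line comments and `/* ... */` block comments need to be skipped.

```python
from typing import List, Tuple


def non_ascii_outside_comments(source: str) -> List[Tuple[int, int, str]]:
    """Return (line, column, char) for each non-ASCII character outside comments.

    Lines and columns are 1-based; columns count characters, not bytes.
    Assumption: only // and /* ... */ comments exist in the input.
    """
    hits = []
    line, col = 1, 1
    i, n = 0, len(source)
    in_line_comment = False
    in_block_comment = False
    while i < n:
        ch = source[i]
        if in_line_comment:
            # A line comment ends at the next newline.
            if ch == "\n":
                in_line_comment = False
        elif in_block_comment:
            if source.startswith("*/", i):
                in_block_comment = False
                i += 1      # skip '*'; the '/' is consumed by the advance below
                col += 1
        elif source.startswith("//", i):
            in_line_comment = True
        elif source.startswith("/*", i):
            in_block_comment = True
            i += 1          # skip '/'; the '*' is consumed by the advance below
            col += 1
        elif ord(ch) > 127:
            hits.append((line, col, ch))
        # Advance one character, tracking line and column.
        if ch == "\n":
            line += 1
            col = 1
        else:
            col += 1
        i += 1
    return hits


program = "model {\n  real £; // σ is fine here\n}\n"
for line_no, col_no, ch in non_ascii_outside_comments(program):
    print(f"line {line_no}, column {col_no}: non-ASCII character {ch!r}")
```

Characters inside string literals are flagged too, since the proposal only exempts comments.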

Related issues:

Obviously if we allow UTF-8 encoded Unicode, it's a different set of illegal byte sequences we have to flag (not all byte sequences form legal UTF-8).
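For the UTF-8 case, a minimal Python sketch of locating the first illegal byte sequence is shown below. It leans on Python's built-in strict UTF-8 decoder rather than anything in Stan; the helper name `first_invalid_utf8_offset` is hypothetical.

```python
from typing import Optional


def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the byte offset of the first illegal UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")  # strict decoding raises on malformed input
    except UnicodeDecodeError as err:
        return err.start      # offset of the first offending byte
    return None


valid = "real £;".encode("utf-8")  # '£' is the two bytes 0xC2 0xA3
broken = b"real \xc2;"             # lead byte 0xC2 without its continuation byte

print(first_invalid_utf8_offset(valid))
print(first_invalid_utf8_offset(broken))
```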

Current Version:

v2.17.1

@VMatthijs
Member

When trying to compile

model {
  real £; 
}

stanc3 now says

Syntax error at file "test.stan", line 2, character 6, lexing error:
   -------------------------------------------------
     1:  model {
     2:    real £;
               ^
     3:  }
   -------------------------------------------------

Invalid input "\194"

@bob-carpenter
Contributor Author

bob-carpenter commented Dec 13, 2018 via email

@VMatthijs
Member

VMatthijs commented Dec 13, 2018

That's fair.

Is the lazy solution of just having

Syntax error at file "test.stan", line 2, column 6, lexing error:
   -------------------------------------------------
     1:  model {
     2:    real £;
               ^
     3:  }
   -------------------------------------------------

Invalid character found.

OK?

@bob-carpenter
Contributor Author

bob-carpenter commented Dec 13, 2018 via email

@VMatthijs
Member

VMatthijs commented Dec 13, 2018

It's not only non-ASCII characters that should trigger lexer errors, right? For instance, shouldn't $ do that too? The difficulty is that ocamllex captures the problematic token and passes it to the error message, but ocamllex cannot deal with Unicode properly. It might be possible to just capture the original string using the positions specified by ocamllex.
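The idea of recovering the offending character from the original source by position can be sketched as follows. This is Python rather than OCaml and is not stanc3 code; it assumes the lexer reports a byte offset into UTF-8 encoded source, and the helper name `utf8_char_at` is hypothetical.

```python
def utf8_char_at(data: bytes, offset: int) -> str:
    """Decode the single UTF-8 character that starts at byte `offset`."""
    lead = data[offset]
    # The lead byte determines how many bytes the character occupies.
    if lead < 0x80:
        length = 1
    elif 0xC2 <= lead <= 0xDF:
        length = 2
    elif 0xE0 <= lead <= 0xEF:
        length = 3
    elif 0xF0 <= lead <= 0xF4:
        length = 4
    else:
        raise ValueError(f"byte {lead} at offset {offset} cannot start a UTF-8 character")
    # Strict decoding still rejects malformed continuation bytes.
    return data[offset:offset + length].decode("utf-8")


source = "model {\n  real £;\n}\n".encode("utf-8")
offset = source.index(b"\xc2")  # stand-in for the position a byte-based lexer reports
print(f"Invalid character {utf8_char_at(source, offset)!r} found.")
```

The same lookup works for ASCII characters such as `$`, where the character is a single byte.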

@bob-carpenter
Contributor Author

bob-carpenter commented Dec 13, 2018 via email

@VMatthijs
Member

We could for ASCII characters if that is preferable.

For non-ASCII, we could try replacing ocamllex with sedlex, which seems to accept Unicode. Then we could also print the £ in the error. I didn't use sedlex to start with because I thought our conclusion was that we don't care about Unicode, since C++ can't handle it. That's why I stuck with ocamllex, as it's tried and tested technology with lots of examples out there.

I can open an issue for it in stanc3.

For the time being, should I move things back to the original error?

@VMatthijs
Member

I looked into sedlex a bit more and I'm tempted to stay with ocamllex for now. The switch would be some work, as their syntax and wiring are quite different, and I'd prefer to stick with the more standard technology until more people are using sedlex or we've spoken to people who have first-hand experience with it.

@bob-carpenter
Contributor Author

The original version was this:

  -------------------------------------------------
     1:  model {
     2:    real £;
               ^
     3:  }
   -------------------------------------------------

Invalid input "\194"

Why is this printed as '£' in the code and '\194' in the error message? Nobody's going to understand what the \194 means.

@bob-carpenter
Contributor Author

bob-carpenter commented Dec 14, 2018 via email

@VMatthijs
Member

VMatthijs commented Dec 14, 2018

The original version was this:

  -------------------------------------------------
     1:  model {
     2:    real £;
               ^
     3:  }
   -------------------------------------------------

Invalid input "\194"

Why is this printed as '£' in the code and '\194' in the error message? Nobody's going to understand what the \194 means.

Because OCaml can handle Unicode. The snippet of code printed above is just extracted from the original file using OCaml's I/O, so it's fine. The '\194' in the error message is returned by ocamllex, which tried to extract it from the original file but failed, as ocamllex isn't set up to work with Unicode.
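To make the '\194' concrete: in UTF-8, '£' is encoded as two bytes, and a lexer that works one byte at a time only sees the first of them. The short Python sketch below illustrates this; `ocaml_style_escape` is a hypothetical helper that mimics the three-digit decimal escape OCaml uses for non-printable bytes, not a real library function.

```python
def ocaml_style_escape(byte: int) -> str:
    """Format a byte the way OCaml escapes non-printable bytes: backslash plus three decimal digits."""
    return "\\%03d" % byte


encoded = "£".encode("utf-8")

print(list(encoded))                   # the two byte values that make up '£'
print(ocaml_style_escape(encoded[0]))  # the escape for the first byte only
```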

Could you explain once more what solution you'd like to see right now, before we do it properly? (I feel like there are more important issues at the moment. We can also improve some of these messages once the compiler is launched. At the moment, I feel like stanc3 is holding up a lot of other things, so I'd like to get it to a point where it can be swapped out ASAP.)

@bob-carpenter
Contributor Author

bob-carpenter commented Dec 14, 2018 via email

@rok-cesnovar
Member

This is the current error in stanc3:

Syntax error in 'examples/schools/schools.stan', line 4, column 15, lexing error:
   -------------------------------------------------
     2:    int<lower=0> J;             // number of schools
     3:    real y[J];                  // estimated treatment effect (school j)
     4:    real<lower=0> σ[J];         // std err of effect estimate (school j)
                         ^
     5:  }
     6:  parameters {
   -------------------------------------------------

Invalid character found.

Is this fine @bob-carpenter? If not, we can transfer this to stanc3 and continue there.

@bob-carpenter
Contributor Author

Yes, I think that's OK. Most users will figure out what that means.

It would be even better to also say that an identifier was expected.
