Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New lexer 2 — Electric Boogaloo #557

Merged
merged 59 commits into from
Oct 4, 2020

Conversation

ISSOtm
Copy link
Member

@ISSOtm ISSOtm commented Aug 22, 2020

Fixes #485, for real this time.

What's the point?

  • It does away with the current hacky lexer architecture, and most importantly makes the behavior of macro args consistent (which does change some behavior... but making this backwards-compatible would be too difficult);
  • The more ad-hoc architecture means it should be easier to add new tokens, freeing up all the issues locked behind [Help wanted] Turn lexer back into flex definition #485;
  • Hopefully, a performance boost!
  • Fixed line counting, in theory. It's now fully done by the lexer, too, not the parser!
  • Column counting, too! It's however not used for anything but debug info, because it counts characters as they are shifted, so it would report errors at the end of the last token read, which isn't necessarily where the error occurred...
  • Lexer context switching is now much cleaner;
  • Plainly deflated code size: fstack.c went from 545 to 400 lines, and lexer.c went from 1057 + 598 to 2048

Note that this lexer does not support "naked" braces, and that's intentional: their behavior was wonky anyways, hardly anyone uses them, and it frees them up for other, more useful behavior;

  • Since the lexer no longer expands EQUS by copying into a buffer of fixed size, it's no longer to error out by expanding a large EQUS at the beginning of a buffer (macro or REPT)

It has a few known problems.

  • macOS seems to force newlines at the end of files, so some tests fail to that;
  • This does not build on MinGW due to it not implementing open_memstream. I only used it as a convenience, so we can remove it if needed. Would take some effort, though, and I don't want to duplicate the printing code, either;
  • Windows has a different type system that causes some code to be dead on Linux but not Windows;
  • __FILE__ now leaks memory, but it sees little use, and that code will be changed when Don't terminate RGBDS strings with a NUL #505 is implemented anyways;
  • mmap might actually not be faster? This needs performance testing. That said, the code path is still needed, because such "file buffers" are used for macros and REPT. I'd expect little to no performance gain on small files, but larger files using macros a lot should see an improvement;
  • The mmap path never munmaps files if they contain a macro, even if the macro is PURGEd at any point. I also didn't check if nested macros were handled correctly, but I think they are;
  • This probably won't compile on MSVC because AFAIK it doesn't have mmap, but that should be shimmable around; I'll page @JL2210 to deal with that. (open_memstream probably won't work there either, but it should be done away with for MinGW anyways, as mentioned above);
  • EQUS expansion tracking is now harder, due to the different expansion tracking implementation; tokens that end precisely at the end of an EQUS are not reported as coming from within the EQUS. It means that recurse EQUS "recurse" does infinitely hang RGBASM.

I'm planning to make a "0.4.2-pre" release after this is merged, since this is rewriting the very core of RGBASM, and even after passing the full test suite I still found a bug. The point of that pre-release would be to allow "unstable" downstream packagers and Windows users relying on binaries, to beta-test the new lexer, and help avoiding a horribly broken 0.4.2 release.

Getting #434 vibes? Gee, I wonder. :^)

include/asm/symbol.h Show resolved Hide resolved
src/asm/fstack.c Show resolved Hide resolved
src/asm/fstack.c Outdated Show resolved Hide resolved
src/asm/lexer.c Outdated Show resolved Hide resolved
src/asm/lexer.c Outdated Show resolved Hide resolved
src/asm/lexer.c Show resolved Hide resolved
src/asm/lexer.c Show resolved Hide resolved
src/asm/lexer.c Show resolved Hide resolved
src/asm/lexer.c Outdated Show resolved Hide resolved
src/asm/util.c Show resolved Hide resolved
@ISSOtm ISSOtm force-pushed the new-lexer-electric-boogaloo branch 3 times, most recently from 7f3b0be to 9b6b2d2 Compare August 31, 2020 13:43
@AntonioND
Copy link
Member

So I'm wondering if it makes sense to merge this now as you mentioned here: #569 (comment)

Even if it's not perfect, while it's not merged it will block any improvement in RBGASM, as you have said.

I think that if you're reasonably sure this is working, it could be merged, and then any bugfix can be applied to this code.

Is there any reason why this can't be merged?

@JL2210
Copy link
Contributor

JL2210 commented Sep 21, 2020

  1. Merge conflicts
  2. EQUS expansion tracking is now harder, due to the different expansion tracking implementation; tokens that end precisely at the end of an EQUS are not reported as coming from within the EQUS. It means that recurse EQUS "recurse" does infinitely hang RGBASM.

@AntonioND
Copy link
Member

  1. Merge conflicts

Sure, I can see that too.

  1. EQUS expansion tracking is now harder, due to the different expansion tracking implementation; tokens that end precisely at the end of an EQUS are not reported as coming from within the EQUS. It means that recurse EQUS "recurse" does infinitely hang RGBASM.

That's not much worse than the current situation.

@ISSOtm
Copy link
Member Author

ISSOtm commented Sep 21, 2020

I fixed the conflicts, actually. I also improved the __FILE__ symbol to not malloc each time its string is asked for; this should be fine, as the comment notes. (The old symbol essentially used a static buffer as well.) There is a leak, technically, since the pointer is never freed, but that should be okay, especially as a lot more memory is leaked that way, anyways.

@AntonioND
Copy link
Member

You could leave a TODO comment next to the malloc so that nobody forgets about the leak (I don't know if you have already done this, I haven't checked this last update of the PR!).

@ISSOtm
Copy link
Member Author

ISSOtm commented Sep 23, 2020

Yes, the comment is there. As for merging the PR, the only downright blocker is open_memstream for Windows. I'll figure something out, then I'll ask for review.

@AntonioND
Copy link
Member

We could drop Windows support. :P

@JL2210
Copy link
Contributor

JL2210 commented Sep 23, 2020 via email

@AntonioND
Copy link
Member

To be fair, it's not a bad plan. It's not like we are going to make a release while Windows is broken. All I'd say is to tag a new version right before merging the PR, release binaries, etc, and then there is time to fix whatever is broken.

@ISSOtm
Copy link
Member Author

ISSOtm commented Sep 29, 2020

Updated to use a new file stack info format in object files, which removes the need for open_memstream and fixes #491 as a bonus :)

CI reports a random segfault on Windows and specifically Ubuntu 16.04 with Clang... why?

The lexer itself is very much incomplete, but this is intended to be a
safe point to revert to should further implementation go south.
Macro arg detection, first emitted tokens, primitive (bad) column counting
Add keywords and identifiers
Add comments
Add number literals
Add strings
Add a lot of new tokens
Add (and clean up) IF etc.
Improve reporting of unexpected chars / garbage bytes
Fix bug with and improved error messages when failing to open file
Add verbose-level messages about how files are opened
Enforce that files finish with a newline
Fix chars returned not being cast to unsigned char (may conflict w/ EOF)
Return null path when no file is open, rather than crash
Unify and improve error printing slightly

Known to be missing: macro expansion, REPT blocks, EQUS expansions
And fix line counting with expansion-made newlines.
This has the same bug as the old lexer (equs-newline's output does not
print the second warning as being part of the expansion).
Additionally, we regress equs-recursion, as we are no longer able to
catch this specific EQUS recursion. Simply enough, the new expansion
begins **after** the old one ends! I have found no way to handle that.
Attempt to grow it to the max size first.
Seriously, if this triggers, *how*
MacOS treats them differently, for some reason.
And added a test to check their behavior
There isn't really a better alternative.
Making several mappings instead requires too much bookkeeping.
Gets rid of `open_memstream`, enabling Windows compatibility again
Also fixes gbdev#491 as a nice bonus!
Removes a false positive from Clang static analysis
Translate it to \\n regardless of the lexer mode
"Initialization, sizeof, and the assignment operator ignore the flexible array member."
Oops!
@ISSOtm ISSOtm force-pushed the new-lexer-electric-boogaloo branch from 41b90ad to c246942 Compare October 4, 2020 02:46
Since the lexer buffer wraps, the refilling gets handled in two steps:
First, iff the buffer would wrap, the buffer is refilled until its end.
Then, if more characters are requested, that amount is refilled too.

An important detail is that `read()` may not return as many characters as
requested; for this reason, the first step checks if its `read()` was
"full", and skips the second step otherwise.
This is also where a bug lied.

After a *lot* of trying, I eventually managed to reproduce the bug on an
OpenBSD VM, and after adding a couple of `assert`s in `peekInternal`, this
is what happened, starting at line 724:

0. `lexerState->nbChars` is 0, `lexerState->index` is 19;
1. We end up with `target` = 42, and `writeIndex` = 19;
2. 42 + 19 is greater than `LEXER_BUF_SIZE` (= 42), so the `if` is entered;
3. Within the first `readChars`, **`read` only returns 16 bytes**,
   advancing `writeIndex` to 35 and `target` to 26;
4. Within the second `readChars`, a `read(26)` is issued, overflowing the
   buffer.

The bug should be clear now: **the check at line 750 failed to work!** Why?
Because `readChars` modifies `writeIndex`.
The fix is simply to cache the number of characters expected, and use that.
@ISSOtm
Copy link
Member Author

ISSOtm commented Oct 4, 2020

And with that last commit, this PR is finally ready to merge.

"It's been 3,000 years..."

@ISSOtm ISSOtm marked this pull request as ready for review October 4, 2020 14:24
@ISSOtm ISSOtm merged commit 3036b58 into gbdev:master Oct 4, 2020
@ISSOtm ISSOtm deleted the new-lexer-electric-boogaloo branch March 1, 2021 09:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Help wanted] Turn lexer back into flex definition
5 participants