New lexer 2 — Electric Boogaloo #557

ISSOtm · 2020-08-22T21:50:10Z

Fixes #485, for real this time.

What's the point?

It does away with the current hacky lexer architecture, and most importantly makes the behavior of macro args consistent (which does change some behavior... but making this backwards-compatible would be too difficult);
The more ad-hoc architecture means it should be easier to add new tokens, freeing up all the issues locked behind [Help wanted] Turn lexer back into flex definition #485;
Hopefully, a performance boost!
Fixed line counting, in theory. It's now fully done by the lexer, too, not the parser!
Column counting, too! However, it's not used for anything but debug info for now: it counts characters as they are shifted, so it would report errors at the end of the last token read, which isn't necessarily where the error occurred...
Lexer context switching is now much cleaner;
Plainly deflated code size: fstack.c went from 545 to 400 lines, and lexer.c went from 1057 + 598 to 2048
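
To illustrate the column-counting caveat above, here is a minimal sketch of a lexer that updates its position as characters are shifted. `PosTracker`, `shiftChar`, and `readToken` are names assumed for this example and are not taken from the PR's code:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified position state; the real lexer state holds far more. */
struct PosTracker {
    const char *src; /* NUL-terminated input */
    size_t offset;   /* index of the next unread character */
    unsigned lineNo; /* 1-based line of the next unread character */
    unsigned colNo;  /* 1-based column of the next unread character */
};

/* Consume ("shift") one character, updating the position as a side effect. */
static int shiftChar(struct PosTracker *t)
{
    int c = (unsigned char)t->src[t->offset];

    if (c == '\0')
        return -1; /* end of input: nothing is consumed */
    t->offset++;
    if (c == '\n') {
        t->lineNo++;
        t->colNo = 1;
    } else {
        t->colNo++;
    }
    return c;
}

/* Skip spaces, then consume one space-delimited token; returns its length. */
static size_t readToken(struct PosTracker *t)
{
    size_t len = 0;

    while (t->src[t->offset] == ' ')
        shiftChar(t);
    while (t->src[t->offset] != '\0' && t->src[t->offset] != ' '
           && t->src[t->offset] != '\n') {
        shiftChar(t);
        len++;
    }
    return len;
}
```

After `readToken` returns, `colNo` points just past the token, so an error reported at that moment would be attributed to the token's end rather than its start. Recording the position before shifting the token's first character would be one way around that, at the cost of extra bookkeeping.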

Note that this lexer does not support "naked" braces, and that's intentional: their behavior was wonky anyways, hardly anyone uses them, and it frees them up for other, more useful behavior;

Since the lexer no longer expands EQUS by copying into a buffer of fixed size, it's no longer possible to error out by expanding a large EQUS at the beginning of a buffer (macro or REPT).

It has a few known problems.

macOS seems to force newlines at the end of files, so some tests fail due to that;
This does not build on MinGW due to it not implementing open_memstream. I only used it as a convenience, so we can remove it if needed. Would take some effort, though, and I don't want to duplicate the printing code, either;
Windows has a different type system that causes some code to be dead on Linux but not Windows;
__FILE__ now leaks memory, but it sees little use, and that code will be changed when Don't terminate RGBDS strings with a NUL #505 is implemented anyways;
mmap might actually not be faster? This needs performance testing. That said, the code path is still needed, because such "file buffers" are used for macros and REPT. I'd expect little to no performance gain on small files, but larger files using macros a lot should see an improvement;
The mmap path never munmaps files if they contain a macro, even if the macro is PURGEd at any point. I also didn't check if nested macros were handled correctly, but I think they are;
This probably won't compile on MSVC because AFAIK it doesn't have mmap, but that should be shimmable around; I'll page @JL2210 to deal with that. (open_memstream probably won't work there either, but it should be done away with for MinGW anyways, as mentioned above);
EQUS expansion tracking is now harder, due to the different expansion tracking implementation; tokens that end precisely at the end of an EQUS are not reported as coming from within the EQUS. It means that recurse EQUS "recurse" does infinitely hang RGBASM.

I'm planning to make a "0.4.2-pre" release after this is merged, since this rewrites the very core of RGBASM, and even after passing the full test suite I still found a bug. The point of that pre-release would be to allow "unstable" downstream packagers, and Windows users relying on binaries, to beta-test the new lexer and help avoid a horribly broken 0.4.2 release.

Getting #434 vibes? Gee, I wonder. :^)


AntonioND · 2020-09-20T23:59:14Z

So I'm wondering if it makes sense to merge this now as you mentioned here: #569 (comment)

Even if it's not perfect, while it's not merged it will block any improvement in RGBASM, as you have said.

I think that if you're reasonably sure this is working, it could be merged, and then any bugfix can be applied to this code.

Is there any reason why this can't be merged?

JL2210 · 2020-09-21T00:00:40Z

Merge conflicts
EQUS expansion tracking is now harder, due to the different expansion tracking implementation; tokens that end precisely at the end of an EQUS are not reported as coming from within the EQUS. It means that recurse EQUS "recurse" does infinitely hang RGBASM.

AntonioND · 2020-09-21T00:06:18Z

> Merge conflicts

Sure, I can see that too.

> EQUS expansion tracking is now harder, due to the different expansion tracking implementation; tokens that end precisely at the end of an EQUS are not reported as coming from within the EQUS. It means that recurse EQUS "recurse" does infinitely hang RGBASM.

That's not much worse than the current situation.

ISSOtm · 2020-09-21T16:20:35Z

I fixed the conflicts, actually. I also improved the __FILE__ symbol to not malloc each time its string is asked for; this should be fine, as the comment notes. (The old symbol essentially used a static buffer as well.) There is a leak, technically, since the pointer is never freed, but that should be okay, especially as a lot more memory is leaked that way, anyways.
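
For readers unfamiliar with the pattern described above, here is a rough sketch of a symbol callback that reuses one buffer across calls instead of calling `malloc` every time. `fileSymbolValue` and its quoting logic are assumptions made for illustration; the PR's actual callback is not shown in this thread, and this sketch does not escape quotes inside the file name:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical callback: returns the quoted name of the current file.
 * The buffer is reused across calls and intentionally never freed,
 * which is the (small, one-time) leak discussed above.
 */
static char const *fileSymbolValue(char const *fileName)
{
    static char *buf = NULL;
    static size_t bufSize = 0;
    size_t needed = strlen(fileName) + 3; /* two quotes + terminating NUL */

    if (needed > bufSize) {
        char *newBuf = realloc(buf, needed);

        if (!newBuf)
            return NULL; /* the old buffer stays valid, it is just too small */
        buf = newBuf;
        bufSize = needed;
    }
    snprintf(buf, bufSize, "\"%s\"", fileName);
    return buf;
}
```

The returned pointer is only valid until the next call, which matches the behavior of a static buffer; callers that need to keep the string would have to copy it.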

AntonioND · 2020-09-22T09:05:33Z

You could leave a TODO comment next to the malloc so that nobody forgets about the leak (I don't know if you have already done this, I haven't checked this last update of the PR!).

ISSOtm · 2020-09-23T07:12:55Z

Yes, the comment is there. As for merging the PR, the only downright blocker is open_memstream for Windows. I'll figure something out, then I'll ask for review.

AntonioND · 2020-09-23T14:19:58Z

We could drop Windows support. :P

JL2210 · 2020-09-23T16:16:28Z

I'd just break Windows support until we can remove the need for open_memstream.

AntonioND · 2020-09-23T16:51:40Z

To be fair, it's not a bad plan. It's not like we are going to make a release while Windows is broken. All I'd say is to tag a new version right before merging the PR, release binaries, etc, and then there is time to fix whatever is broken.

ISSOtm · 2020-09-29T02:02:47Z

Updated to use a new file stack info format in object files, which removes the need for open_memstream and fixes #491 as a bonus :)

CI reports a random segfault on Windows and specifically Ubuntu 16.04 with Clang... why?


Add keywords and identifiers
Add comments
Add number literals
Add strings
Add a lot of new tokens
Add (and clean up) IF etc.
Improve reporting of unexpected chars / garbage bytes
Fix bug with and improved error messages when failing to open file
Add verbose-level messages about how files are opened
Enforce that files finish with a newline
Fix chars returned not being cast to unsigned char (may conflict w/ EOF)
Return null path when no file is open, rather than crash
Unify and improve error printing slightly

Known to be missing: macro expansion, REPT blocks, EQUS expansions

And fix line counting with expansion-made newlines. This has the same bug as the old lexer (equs-newline's output does not print the second warning as being part of the expansion). Additionally, we regress equs-recursion, as we are no longer able to catch this specific EQUS recursion. Simply enough, the new expansion begins **after** the old one ends! I have found no way to handle that.
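
The missed recursion can be modeled with a toy expansion stack. This is a simplified sketch under the assumption that an expansion is dropped as soon as its last character has been read; `ExpStack`, `beginExpansion`, `popFinished`, `simulate`, and `MAX_DEPTH` are hypothetical names, not RGBASM's:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEPTH 8 /* hypothetical recursion limit */

struct Expansion {
    const char *text; /* contents being expanded (not copied) */
    size_t offset;    /* how much of it has been read */
};

struct ExpStack {
    struct Expansion stack[MAX_DEPTH];
    size_t depth; /* number of expansions currently active */
};

/* Drop every expansion that has been read to its end. */
static void popFinished(struct ExpStack *s)
{
    while (s->depth > 0) {
        struct Expansion *top = &s->stack[s->depth - 1];

        if (top->text[top->offset] != '\0')
            break;
        s->depth--;
    }
}

/* Start expanding `text`; returns 0 if the depth limit is reached. */
static int beginExpansion(struct ExpStack *s, const char *text)
{
    if (s->depth == MAX_DEPTH)
        return 0;
    s->stack[s->depth++] = (struct Expansion){ text, 0 };
    return 1;
}

/*
 * Model an EQUS whose contents start with its own name (tokenLen chars).
 * Returns how many expansions were started before the depth check fired,
 * capped at maxIters so that the "hang" case terminates here.
 */
static size_t simulate(const char *contents, size_t tokenLen, size_t maxIters)
{
    struct ExpStack s = { .depth = 0 };
    size_t started = 0;

    while (started < maxIters) {
        if (!beginExpansion(&s, contents))
            break;
        started++;
        s.stack[s.depth - 1].offset += tokenLen; /* the lexer reads the name */
        popFinished(&s); /* a fully read expansion ends before the next begins */
    }
    return started;
}
```

In this model, contents of exactly `recurse` never push the depth past 1, so the limit never fires, whereas contents with anything left after the name (such as a trailing space) keep the outer expansion alive and trip the check after `MAX_DEPTH` levels.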


Since the lexer buffer wraps, the refilling gets handled in two steps: first, iff the buffer would wrap, the buffer is refilled until its end. Then, if more characters are requested, that amount is refilled too.

An important detail is that `read()` may not return as many characters as requested; for this reason, the first step checks if its `read()` was "full", and skips the second step otherwise. This is also where the bug lay.

After a *lot* of trying, I eventually managed to reproduce the bug on an OpenBSD VM, and after adding a couple of `assert`s in `peekInternal`, this is what happened, starting at line 724:

0. `lexerState->nbChars` is 0, `lexerState->index` is 19;
1. We end up with `target` = 42, and `writeIndex` = 19;
2. 42 + 19 is greater than `LEXER_BUF_SIZE` (= 42), so the `if` is entered;
3. Within the first `readChars`, **`read` only returns 16 bytes**, advancing `writeIndex` to 35 and `target` to 26;
4. Within the second `readChars`, a `read(26)` is issued, overflowing the buffer.

The bug should be clear now: **the check at line 750 failed to work!** Why? Because `readChars` modifies `writeIndex`. The fix is simply to cache the number of characters expected, and use that.
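
The two-step refill and the fix can be sketched roughly as follows. This is not the code from `lexer.c`: `RingBuf`, `FakeSource`, `fakeRead`, and `refill` are stand-ins assumed for illustration, with `fakeRead` simulating a `read()` that may return fewer bytes than requested:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BUF_SIZE 42 /* same value as LEXER_BUF_SIZE in the walkthrough */

/* Hypothetical stand-in for read(): returns at most `limit` bytes per call. */
struct FakeSource {
    size_t limit;
    size_t calls;
};

static size_t fakeRead(struct FakeSource *src, char *dst, size_t n)
{
    size_t got = n < src->limit ? n : src->limit;

    memset(dst, 'x', got);
    src->calls++;
    return got;
}

struct RingBuf {
    char buf[BUF_SIZE];
    size_t index;   /* position of the next unread character */
    size_t nbChars; /* number of buffered, unread characters */
};

/* Read up to n chars at *writeIndex; advances *writeIndex, shrinks *target. */
static size_t readChars(struct RingBuf *rb, struct FakeSource *src,
                        size_t *writeIndex, size_t *target, size_t n)
{
    assert(*writeIndex + n <= BUF_SIZE); /* one read must not run past the end */
    size_t got = fakeRead(src, &rb->buf[*writeIndex], n);

    *writeIndex = (*writeIndex + got) % BUF_SIZE;
    *target -= got;
    rb->nbChars += got;
    return got;
}

/* Try to buffer `target` more characters (target <= BUF_SIZE - nbChars). */
static void refill(struct RingBuf *rb, struct FakeSource *src, size_t target)
{
    size_t writeIndex = (rb->index + rb->nbChars) % BUF_SIZE;

    if (writeIndex + target > BUF_SIZE) {
        /* Cache the expected count BEFORE readChars changes writeIndex. */
        size_t expected = BUF_SIZE - writeIndex;

        if (readChars(rb, src, &writeIndex, &target, expected) < expected)
            return; /* short read: skip the second step */
    }
    if (target != 0)
        readChars(rb, src, &writeIndex, &target, target);
}
```

With `index` = 19, `nbChars` = 0, and a source limited to 16 bytes per call, `refill(&rb, &src, 42)` stops after the first short read instead of issuing the overflowing second one; the `assert` in `readChars` guards the invariant that was violated in step 4.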

ISSOtm · 2020-10-04T14:24:09Z

And with that last commit, this PR is finally ready to merge.

JL2210 suggested changes Aug 22, 2020


ISSOtm force-pushed the new-lexer-electric-boogaloo branch 3 times, most recently from 7f3b0be to 9b6b2d2 Compare August 31, 2020 13:43

JL2210 reviewed Aug 31, 2020


ISSOtm force-pushed the new-lexer-electric-boogaloo branch from 0189430 to 7aaa1aa Compare September 3, 2020 10:29

ISSOtm mentioned this pull request Sep 10, 2020

Fix __FILE__ when filename contains quotes #566

Merged

JL2210 mentioned this pull request Sep 19, 2020

__FILE__ breaks if filename contains a quote #546

Closed

ISSOtm mentioned this pull request Sep 20, 2020

Beginner friendly issues #569

Closed

ISSOtm force-pushed the new-lexer-electric-boogaloo branch from 7aaa1aa to 660bdf0 Compare September 21, 2020 15:21

ISSOtm force-pushed the new-lexer-electric-boogaloo branch from a9a1be5 to 89538cc Compare September 22, 2020 21:33

ISSOtm force-pushed the new-lexer-electric-boogaloo branch from 89538cc to 30ccd5e Compare September 29, 2020 01:40

Rangi42 reviewed Sep 29, 2020


ISSOtm added 7 commits October 4, 2020 04:37

Implement infrastructure around new lexer

6dc4ce6

The lexer itself is very much incomplete, but this is intended to be a safe point to revert to should further implementation go south.

Implement more functionality

71f8871

Macro arg detection, first emitted tokens, primitive (bad) column counting

Fix PC's name not being passed to parser

e56c6cc

Fix mmap read offset not being initialized

2ec1001

Add EQUS expansion

5ad7a93

ISSOtm added 19 commits October 4, 2020 04:46

Fix INCLUDE ignoring -MG

e33c2ad

Handle comments in line continuations

9e3d7a5

Use common function to discard comments in macro args

ac011fe

Fix C2x use of static_assert

71a0a42

Fix possible capture buffer size overflow

542b5d1

Attempt to grow it to the max size first. Seriously, if this triggers, *how*

Add newlines to all test output

b65ea64

MacOS treats them differently, for some reason.

Fix fixed-point constants not working correctly

c952dd8

And added a test to check their behavior

Move isWhitespace to a place where it makes more sense

dbef51b

Remove unnecessarily nested symbol data union

7381d7b

Harmonize printing distance

b224cab

Fix range-dependent dead code in recursion depth check

96cb5e1

Shim around mmap on Windows

82469ac

Fix possible uninitialized read on Windows

1385235

Move some MSVC-specific defines to platform.h

8e7afb0

Mark not unmapping macro-containing files as okay

930080f

There isn't really a better alternative. Making several mappings instead requires too much bookkeeping.

Implement compact file stacks in object files

5a65188

Gets rid of `open_memstream`, enabling Windows compatibility again.
Also fixes gbdev#491 as a nice bonus!

Change assertion condition in __FILE__ buf dumping

ee9e45b

Removes a false positive from Clang static analysis

Handle \r better

423a7c4

Translate it to \n regardless of the lexer mode

Fix incomplete duplication of REPT nodes

c246942

"Initialization, sizeof, and the assignment operator ignore the flexible array member." Oops!

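The quoted rule is the C standard's: struct assignment and `sizeof` cover only the fixed part of a struct that ends in a flexible array member, so a plain `*copy = *src` silently drops the trailing array. A minimal sketch of a duplication that accounts for it follows; `ReptNode`, `newNode`, and `dupNode` are hypothetical and do not reflect the actual node layout in RGBASM:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical node shape; the real REPT node has different fields. */
struct ReptNode {
    uint32_t depth;   /* number of entries in iters[] */
    uint32_t iters[]; /* flexible array member: one counter per nesting level */
};

static struct ReptNode *newNode(uint32_t depth, const uint32_t *iters)
{
    struct ReptNode *node = malloc(sizeof(*node) + depth * sizeof(node->iters[0]));

    if (!node)
        return NULL;
    node->depth = depth;
    if (depth != 0)
        memcpy(node->iters, iters, depth * sizeof(node->iters[0]));
    return node;
}

/*
 * Duplicate a node including its trailing array. Both the allocation size
 * and the copy length add the array by hand, because sizeof(*src) and
 * `*copy = *src` would only cover the fixed-size header.
 */
static struct ReptNode *dupNode(const struct ReptNode *src)
{
    size_t size = sizeof(*src) + src->depth * sizeof(src->iters[0]);
    struct ReptNode *copy = malloc(size);

    if (copy)
        memcpy(copy, src, size);
    return copy;
}
```
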
ISSOtm force-pushed the new-lexer-electric-boogaloo branch from 41b90ad to c246942 Compare October 4, 2020 02:46

ISSOtm marked this pull request as ready for review October 4, 2020 14:24

AntonioND approved these changes Oct 4, 2020


ISSOtm merged commit 3036b58 into gbdev:master Oct 4, 2020

This was referenced Oct 5, 2020

@ is output in map files #549

Closed

Macro arguments in the middle of expressions. #63

Closed

This was referenced Jan 11, 2021

Expanding recurse EQUS "recurse" or recurse EQUS "\{recurse\}" hangs rgbasm #696

Closed

Measure RGBDS performance #653

Open

ISSOtm deleted the new-lexer-electric-boogaloo branch March 1, 2021 09:07
