Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible improvements for tool builders #29

Open
neon12345 opened this issue Feb 9, 2024 · 12 comments
Open

Possible improvements for tool builders #29

neon12345 opened this issue Feb 9, 2024 · 12 comments

Comments

@neon12345
Copy link

I am the author of a tool for converting the antlr parse tree into a denser AST, which can output a standalone JavaScript version to query/transform/print/visualize using CSS selectors. (The end goal is to use machine learning with rule induction for AST transformation/printing.) Demo

Since there is the possibility of changing things with a new version, it would be nice to have a unique antlr state for as many positions in the grammar as possible. This does not currently apply to recursive rules.

@ftomassetti
Copy link
Collaborator

Hi @neon12345 I trying the demo but I get an error. Can you help me figuring out how to use the tool?
I also created several libraries to generate an AST on top of ANTLR's parse tree. One we created for TypeScript is called Tylasu https://github.com/Strumenta/tylasu . Maybe there are some overlaps with you work.

I think getting suggestions for ANTLR 5 is very useful, especially when coming from tool builders like you.

Regarding your question, I am not sure I get what you mean with "unique antlr state for as many positions in the grammar as possible". Could you give me an example?

@neon12345
Copy link
Author

@ftomassetti

Your ISP needs to support IPv6. Other than that, when you press the Run button, the source code on the top left is sent to the server and the returned JavaScript AST is transformed with the script at the bottom into the top right result editor. There is also some help accessible with the bottom tab.

@neon12345
Copy link
Author

@ftomassetti

For recursive rules, the parentState is set for the context. What we need from the context and currently have to patch in, is the state before the precpred.

        _localctx = _tracker.createInstance<PatternContext>(parentContext, parentState);
        pushNewRecursionContext(_localctx, startState, RulePattern);
        setState(1913);

        if (!(precpred(_ctx, 2))) throw FailedPredicateException(this, "precpred(_ctx, 2)");
        setState(1914);
        match(Swift5Parser::AS);

@ericvergnaud ericvergnaud pinned this issue Feb 9, 2024
@ericvergnaud ericvergnaud unpinned this issue Feb 9, 2024
@ftomassetti
Copy link
Collaborator

ftomassetti commented Feb 12, 2024

@neon12345 this is how I see the demo: https://www.loom.com/share/1523249bd9f2458785db29ef2c4ea421
I guess my ISP should support IPv6

@neon12345
Copy link
Author

neon12345 commented Feb 12, 2024

@ftomassetti
Can you test IPv6 compatibility with for example https://ipv6-test.com or https://test-ipv6.com/ to be sure? Also try ping antlr.syncsocial.de

@kaby76
Copy link
Collaborator

kaby76 commented Feb 12, 2024

For recursive rules, the parentState is set for the context. What we need from the context and currently have to patch in, is the state before the precpred.

Can't this be compute using "parentState" and the ATN containing that NFA state number? Given a parse tree, get "parentState" and "see if it is one of your recursive productions". Then, examine the ATN: follow the edges in the ATN back until you find a transition that involves "_p". Presumably, this would be a PrecedencePredicateTransition edge.

@neon12345
Copy link
Author

Can't this be compute using "parentState" and the ATN containing that NFA state number?

The way how we work with antlr is a visitor that gives the information about the position in the grammar with the antlr states. From a parsing perspective, it is best to get the information from the current context without many calculations. Thus from our perspective, it would be more valuable to give another state to the context. But I don't know other use cases and why the state had to be stored this way.

@ftomassetti
Copy link
Collaborator

@ftomassetti Can you test IPv6 compatibility with for example https://ipv6-test.com or https://test-ipv6.com/ to be sure? Also try ping antlr.syncsocial.de

Oh... surprisingly ipv6 does not work for me, so I am afraid I cannot watch the demo

@kaby76
Copy link
Collaborator

kaby76 commented Feb 12, 2024

it would be more valuable to give another state to the context. But I don't know other use cases and why the state had to be stored this way.

Considering the result of a parse is a parse tree, it seems fine to "hang" the "calling state of the ATN to another ATN" in the parent parse tree node. (It really should have been an edge ID because, possibly, one could have multiple edges with the same non-terminal symbol from a state.)

To avoid computing the precedence predicate state in a parse tree traversal, you could pre-compute prior to parse the precedence predicate state from the ATNs, and place them in a O(1) map.

So, for the java grammar, state "1372" would be the "parentState" for an ExpressionContext, but the computed precedence predicate state be "1370".

graphviz (12)

@neon12345
Copy link
Author

@kaby76
Thanks for the suggestions, I don't have a problem getting the values I need and it is fine for me to keep things like they are. I just wonder if the stored information is used as intended or could be improved, considering the state is also not consistently stored in the parse tree.

@neon12345
Copy link
Author

The example works now with ipv4.

With the states available in the antlr parser as proposed, it is easy for us to:

  1. infer antlr independent AST classes with members for the children (combined containers)
  2. infer visit or print methods to walk AST children from their containers in the correct order
  3. create the AST nodes and fill the child containers with an antlr parser

automatically for any language.

Looking at the network traffic (with chrome dev tools) when using the example, one can see the generated AST for the selected language with visit and print methods. (the 3rd blob:null....)

@kaby76
Copy link
Collaborator

kaby76 commented Mar 15, 2025

I'm trying to understand what "antlr state" (original comment, paragraph 2, 1st sentence; and in your latest comment With the states available in the antlr parser...) means. Instead, I'm going to look at your demo and try to describe what it does and draw some conclusions from that.

Looking at the demo, when I click "Run", the upper-right frame outputs:

number of class methods: 1
method name: test2

The demo apparently performs a parse of some Swift code in the upper-left frame, generating a parse tree, then walks over the parse tree by executing some JavaScript code in the lower frame.

var root_funcs = current.queryAll("Class_body Function_declaration");
println('number of class methods: ' + root_funcs.length);
root_funcs.forEach(el =>{
    println('method name: ' + el.query('Function_name').query('Identifier').text());
})

Assuming the grammar for Swift is one from grammars-v4, the engine is probably in Java because all the grammars provided there are all implemented for Java, and only Java. And, since the script is written in JavaScript, you've likely worked out a way to export/import Antlr4 parse trees from a Java server to the script running in the browser. And, you defined an API that contains methods on parse tree nodes for query(string), queryAll(string) and static current. The script finds all function_declaration nodes and prints out the name of the function.

Right now, Antlr generates a class for each parser node type and methods to access the children of the node named after the parser rule used on the right-hand side of a rule. Alternatively, the Antlr runtime provides an XPath interpreter that adds a query language over the parse tree. Though it is not implemented for the JavaScript/TypeScript target, it is with Mike's Antlr-ng/Antlr4ng system. Your API in your demo is equivalent to the XPath API.

Currently, in Antlr, to get the name of the parse tree node, you need to have access to the Parser class because parse tree nodes themselves do not contain the string "class_body". The name of the parse tree node is computed from the "rule index" contained in the parse tree node. In other words, the parse tree is not full fidelity. It's incomplete. Presumably, in the demo, the parse tree that you send to the client contains the parse tree node name.

The Trash Toolkit implements an entirely new parse tree because Antlr parse trees are incomplete. The trees that Trash produces include the name of the parse tree node, as well as inter-token strings embedded in the parse tree as "attributes" so that one can construct XPath expressions over attributes. The alternative is visitors and listeners, which is basically programming in assembly language.

The Antlr XPath API might already handle queries like //class_body//function_declaration, which is a short-hand for the descendant-or-self axis, which is equivalent to the query string Class_body Function_declaration in the demo. But, the XPath query language that Antlr implements is not standard. The standard predicate test is to use [...] (see Section "Predicates" in https://www.w3schools.com/xml/xpath_syntax.asp). But, I do find that complex queries involve a sequence of nodes rather a pain in XPath.

One reason why I wrote Trash is because the Antlr XPath in the runtime does not implement a realistic XPath engine. It's not even version 1.0 compliant. Trash uses a real XPath version 2 engine, ported from Xalan, but I've been trying to move to an XPath 3 engine with XQuery using Saxon. Based on this experience, I would recommend that the sub-standard XPath engine be replaced with a variety of high-quality query language engines. Ideally, Antlr5 would output a full-fidelity parse tree serialization. Antlr should not be implementing graph query languages. These are very complicated.

As I mentioned elsewhere, I do look over projects from time to time in GitHub. I have noticed someone trying to write an XQuery engine over Antlr parse trees. https://github.com/AleksanderKruk/antlr-xquery It's the right idea, but people really should not be implementing graph query languages.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants