nmichael44/logo_lsp

Explanation of architecture and project layout

I wrote a Logo Lexer/Parser by hand. The Lexer is fairly straightforward. It lexes the character stream into a stream of

   Token(lexeme: Int, start: Int, end: Int)

where lexeme tells us the kind of Token we are looking at. It's a fairly low-level representation that tries to minimize the memory-allocation overhead for the tokens. The list of possible lexemes is given in the file LogoKeywords. If a complex token is encountered, the parser must call additional functions to recover the rest of the payload (this way we keep tokens small and uniform). The code is in file: logo_parser\LogoLexer.scala.
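
A minimal sketch of how the parser might consume such a compact token (Token mirrors the shape above; the textOf helper is hypothetical, not the repo's actual API):

    // One small, uniform object per token: just the kind and the source span.
    final case class Token(lexeme: Int, start: Int, end: Int)

    // For "complex" tokens (numbers, identifiers, ...) the payload is not
    // stored in the token itself; the parser recovers it from the source
    // text on demand.
    class Lexer(source: String):
      def textOf(tok: Token): String = source.substring(tok.start, tok.end)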

The parser is also written by hand, in recursive-descent style. I chose to do it this way because I wanted to remember all the details of such implementations, which is why I avoided using a tool like ANTLR or JavaCC. I assume the parsers for the various languages IntelliJ supports are also written in this style.

It wraps the Lexer in its own little lexer so that it can provide peekToken-style functionality. This proxy class also came in handy because at some point I realized that I needed to keep the stream of all tokens, and this was a convenient place to collect them.
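
A minimal sketch of such a proxy, reusing the Token shape from the previous sketch and taking the raw lexer as a nextRaw thunk (all names here are illustrative):

    import scala.collection.mutable.ArrayBuffer

    // Adds one-token lookahead on top of the raw lexer and records every
    // token it hands out, so the full token stream is kept as a side effect.
    class PeekingLexer(nextRaw: () => Token):
      private var buffered: Option[Token] = None
      val allTokens = ArrayBuffer.empty[Token]

      def peekToken: Token =
        buffered.getOrElse {
          val t = nextRaw()
          buffered = Some(t)
          t
        }

      def nextToken(): Token =
        val t = peekToken
        buffered = None
        allTokens += t
        t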

Logo is not the easiest language to parse because it plays it fast and loose both with syntax and with semantics. Correctly building the syntax tree requires non-local knowledge. For example, the expressions given to a function are not syntactically delineated, so a call can only be parsed properly if you know the arity of the function. If the function is user-defined, you will, of course, not know its arity up front. The parser therefore maintains state as it goes, to track user-defined function arities (see below for more discussion on this). The code of the parser is in file: logo_parser\LogoParser.scala.

The output of the parser is:

  Either[CompilationError, (LogoAst, Vector[Token])]

If there are errors in your file, you will get a CompilationError. This will tell you whether your error was a Lexer or a Parser error, where the error occurred, and what the token was at the error location. The language server uses this object to construct and publish diagnostics on the didOpen event.
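
Roughly, turning that information into a published diagnostic with lsp4j looks like the sketch below (the lsp4j types and calls are real; the error fields passed in are my assumptions about CompilationError's shape):

    import org.eclipse.lsp4j.{Diagnostic, DiagnosticSeverity, Position, PublishDiagnosticsParams, Range}
    import org.eclipse.lsp4j.services.LanguageClient
    import scala.jdk.CollectionConverters.*

    // line/startChar/endChar/message would come from the CompilationError.
    def publishError(client: LanguageClient, uri: String,
                     line: Int, startChar: Int, endChar: Int, message: String): Unit =
      val range = new Range(new Position(line, startChar), new Position(line, endChar))
      val diag  = new Diagnostic(range, message)
      diag.setSeverity(DiagnosticSeverity.Error)
      client.publishDiagnostics(new PublishDiagnosticsParams(uri, List(diag).asJava))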

If everything goes well and your logo program is syntactically correct, you will get the pair:

  (LogoAst, Vector[Token])

where LogoAst is the top node of the AST (you can check the full AST in file: logo_parser.AST). It's a fairly straightforward AST for a statement/expression language. The only difference is that it contains a sprinkling of tokens throughout, since our goal is not compilation but IDE assistance. These tokens will help us later to construct the data structures we need for def/use navigation and for semantic-token generation. The second component, the Vector of Tokens, is needed so that we don't try to fetch definitions for keywords of the language. We use it to construct a data structure that helps us quickly identify what the user is pointing at with the mouse.

The language server is implemented in file: lsp_server\LogoLanguageServer.scala. I used the library org.eclipse.lsp4j, which simplifies the construction of language servers by doing the JSON-RPC plumbing behind the scenes. What I had to do was implement the methods in the interfaces LanguageServer, TextDocumentService, WorkspaceService and LanguageClientAware.
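
A condensed sketch of that skeleton (the lsp4j interfaces, launcher and capability calls are real; the wiring is illustrative and the two services are elided):

    import java.util.concurrent.CompletableFuture
    import org.eclipse.lsp4j.{InitializeParams, InitializeResult, ServerCapabilities}
    import org.eclipse.lsp4j.launch.LSPLauncher
    import org.eclipse.lsp4j.services.*

    class SkeletonServer extends LanguageServer with LanguageClientAware:
      private var client: LanguageClient = null

      override def connect(c: LanguageClient): Unit = client = c

      override def initialize(params: InitializeParams): CompletableFuture[InitializeResult] =
        val caps = new ServerCapabilities()
        caps.setDefinitionProvider(true) // we answer textDocument/definition
        CompletableFuture.completedFuture(new InitializeResult(caps))

      override def shutdown(): CompletableFuture[Object] =
        CompletableFuture.completedFuture[Object](null)
      override def exit(): Unit = ()
      override def getTextDocumentService(): TextDocumentService = ??? // elided
      override def getWorkspaceService(): WorkspaceService = ???       // elided

    @main def runServer(): Unit =
      val server   = new SkeletonServer
      val launcher = LSPLauncher.createServerLauncher(server, System.in, System.out)
      server.connect(launcher.getRemoteProxy())
      launcher.startListening()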

When a didOpen call comes in, the language server will extract the file, initiate a compilation and, if successful, proceed to construct the State it keeps for that file. The server state is just a Java ConcurrentHashMap from URI to DocumentState (the state we keep for each file). Using a ConcurrentHashMap here is necessary because language servers are supposed to be thread-safe, with multiple requests happening concurrently. The DocumentState is made up of persistent (immutable) objects, so once a call retrieves it, it can traverse the DocumentState with no fear of it changing underneath it. If the document changes, another didOpen call will parse the file, create a new DocumentState and install it. This does not affect any of the existing calls into the language server: they may be working with old data, but they are at least safe. If desired, we could add checks at the end of calls to figure out if recomputation is necessary (by checking whether the state for the document has changed). For this version of the language server, I did not do this.
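
The install step itself is tiny; a sketch, with LogoParser.parse and DocumentState.build standing in for the actual entry points:

    import java.util.concurrent.ConcurrentHashMap

    val documents = new ConcurrentHashMap[String, DocumentState]()

    def onDidOpen(uri: String, text: String): Unit =
      LogoParser.parse(text) match
        case Right((ast, tokens)) =>
          // Build a fresh immutable snapshot and atomically install it;
          // in-flight requests keep traversing the old snapshot safely.
          documents.put(uri, DocumentState.build(text, ast, tokens))
        case Left(error) =>
          () // on failure we publish diagnostics instead (see above)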

The state we maintain per document is the following:

  class DocumentState(
      val fileContent: String,
      val ast: AST.LogoAst,
      val defsMap: DefsMap,
      val defUseMap: UseDefMap,
      val semanticTokens: Vector[(Token, SemTokenType)],
      val semanticTokensArrayRepresentation: ju.List[Integer],
      val allTokensMap: TreeMap[StartLocation, Token],
  )

The fileContent, ast, defsMap and semanticTokens fields are not really used; I kept them there to aid debugging. To implement the use-to-def functionality, we construct the defUseMap, which is a:

  TreeMap[StartLocation, DefLocationWithUseEndCharOffset]

This Map contains a mapping from every token on which we are allowed to ask for a definition to the actual definition. When the definition() function is called, we are given a Position(line, startChar) and have to retrieve the definition. What we do is simply index into this Map looking for the greatest lower bound, i.e., the position nearest to us that is still less than or equal to us. This is because the user may not click at the starting position of the symbol whose definition they are looking for. An additional constraint that must be satisfied is that the endChar of the use token must occur after the position we are currently on (we maintain this information in the value part of the Map).
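
With Scala's immutable TreeMap that greatest-lower-bound lookup is essentially one call; a sketch, where the value type's fields are my guess at what the Map stores:

    import scala.collection.immutable.TreeMap

    type StartLocation = (Int, Int) // (line, startChar), ordered lexicographically
    final case class DefLocationWithUseEndCharOffset(defLoc: StartLocation, useEndChar: Int)

    def findDefinition(defUseMap: TreeMap[StartLocation, DefLocationWithUseEndCharOffset],
                       pos: StartLocation): Option[StartLocation] =
      // Greatest entry whose start location is <= the cursor position...
      defUseMap.rangeTo(pos).lastOption.collect {
        // ...accepted only if the use token extends to or past the cursor column.
        case ((line, _), v) if line == pos._1 && v.useEndChar >= pos._2 => v.defLoc
      }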

So finding a definition is O(log(n)), where n is the number of tokens that can have a definition in the user program. As I have discovered, definition is called on pretty much everything (semantic tokens are not taken into account). Thus, we need a way to quickly identify tokens which have no definition (like keywords, for example). To do this we construct allTokensMap: TreeMap[StartLocation, Token], which, given a location (i.e., (Line, StartChar)), can tell you the kind of token that was there. This way we can respond immediately (again in O(log(n)) time, with n the number of tokens) for tokens that have no definitions.

To respond to the semantic-tokens API, we simply return semanticTokensArrayRepresentation, which is also computed every time we rebuild the DocumentState. The representation is constructed from semanticTokens, which is built at the same time.
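
For reference, the array the LSP expects encodes five integers per token (deltaLine, deltaStartChar, length, tokenType, tokenModifiers), with positions relative to the previous token. A sketch of that conversion, assuming line/column-based token records:

    // Assumed, line/column-based shape; the real code derives line/column
    // from the Token's (start, end) offsets.
    final case class SemTok(line: Int, startChar: Int, length: Int, tokenType: Int)

    def toLspArray(toks: Vector[SemTok]): Vector[Int] =
      var prevLine = 0
      var prevChar = 0
      toks.sortBy(t => (t.line, t.startChar)).flatMap { t =>
        val dLine = t.line - prevLine
        // deltaStartChar is relative only when we stay on the same line.
        val dChar = if dLine == 0 then t.startChar - prevChar else t.startChar
        prevLine = t.line
        prevChar = t.startChar
        Vector(dLine, dChar, t.length, t.tokenType, 0 /* no modifiers */)
      }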

Building the DocumentState is most of the work. This is done in the Scala singleton for DocumentState, by recursively traversing the AST (found in logo_parser\AST) and analyzing each statement or expression, with the double aim of extracting the semantic tokens and identifying all identifiers that have a meaningful definition, thereby constructing the Maps described above. The code (though long) is fairly straightforward.
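
Schematically, the traversal is a fold like the one below (the node and accumulator shapes are made up for illustration; the real AST in logo_parser\AST is much richer):

    final case class Token(lexeme: Int, start: Int, end: Int)

    enum Node:
      case ProcDef(name: Token, body: List[Node])
      case ProcCall(name: Token, args: List[Node])
      case NumberLit(tok: Token)

    final case class Acc(semToks: Vector[(Token, String)],
                         defs: Map[String, Token],       // proc name -> defining token
                         uses: Vector[(Token, Token)])   // (use token, def token)

    def analyze(src: String, n: Node, acc: Acc): Acc =
      def text(t: Token) = src.substring(t.start, t.end)
      n match
        case Node.ProcDef(name, body) =>
          val a = acc.copy(semToks = acc.semToks :+ (name -> "function"),
                           defs    = acc.defs + (text(name) -> name))
          body.foldLeft(a)((b, c) => analyze(src, c, b))
        case Node.ProcCall(name, args) =>
          // Emit a semantic token and, if the callee is known, a use -> def link.
          val a0 = acc.copy(semToks = acc.semToks :+ (name -> "function"))
          val a  = acc.defs.get(text(name)).fold(a0)(d => a0.copy(uses = a0.uses :+ (name -> d)))
          args.foldLeft(a)((b, c) => analyze(src, c, b))
        case Node.NumberLit(tok) =>
          acc.copy(semToks = acc.semToks :+ (tok -> "number"))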

As for the extra assignment requirement, I implemented publication of errors (diagnostics). Those errors can come from either the Lexer or the Parser, and that is reported in the diagnostic. The diagnostic contains both the offending token and the exact position in the file where it occurred. IntelliJ correctly uses the diagnostic and highlights the problematic position.

Tests

I have also committed a bunch of Logo programs that I used to test the parser and the language server. Those are in src\main\test\resources. You can run the program both as a language server and as a parser for Logo, by modifying the flag asServer in file main.scala.

How to connect it to an LSP client

I just installed the plugin you suggested and added a new language server pointing to the jar file of my LSP server. I created an association of '*.logo' files with the new language server, and IntelliJ was able to start the server every time I opened a .logo file.

How to build and run the LSP server

The server is written in Scala, so you will need sbt to build it. Thankfully, sbt is fully integrated into IntelliJ, so it's really easy. Run sbt (icon at the bottom left of IntelliJ) and type

  assembly

This will build the project and create a .jar file ready to go. It will also report where the .jar file is, so for testing you can point at it there. I used the plugin you suggested for testing, so maybe that's the easiest way for you to try it out as well. My execution command was:

C:\Users\nmich\.jdks\openjdk-23.0.1\bin\java.exe -jar "C:\Users\nmich\work\logo_lsp\target\scala-3.3.6\logo_lsp.jar"

but any recent Java version should do as well.

Parser Limitations

  1. When calling functions, Logo does not have a syntactic mechanism to delineate the arguments to the function. This is a challenge for any parser of Logo because, to properly parse the language, you need to know how many arguments each function or command expects (its arity). So what we do is remember a function's arity, when we first encounter it, in some global parser state (see the sketch after this list), and then use that info when we find calls to that function. At the same time, Logo allows the declaration of a function to happen after its use. So it can happen that you have to parse a call to a function you know nothing about, as in the valid program below.

    to sayHi print "Hi someone end
    to someone print "dude end
    sayHi
    

    The function "someone" here is defined after its use, and this is legal Logo. The Logo parser needs to know the arity of a user-defined command to correctly parse its arguments and build the Abstract Syntax Tree (AST), but that information isn't available when the call is first encountered. The most common and generally robust solution to this is a two-pass parsing approach, but that would be too much work for this assignment.

  2. Consider the parsing of the following command that takes two expressions as arguments:

    setxy 5 -8
    

    The parser will see three tokens: "5", "-", and "8". The usual expression-parsing algorithm will then parse this as one argument (5-8) (clearly not what the user intended), and no second expression will be found. This is again a problem caused by the fact that Logo does not properly delineate the arguments of functions. My workaround is to start the parsing of such (consecutive) expressions not at the top of the expression grammar but at parseMultTerm. This way the most common cases (with negative constants, like the one above) are correctly handled, and something like setxy 1+2 8 will generate an error that the user has to fix by employing advanced tools (i.e., parentheses, as in setxy (1+2) 8). To get the exact semantics of the turtleAcademy compiler, one would need to make the parser sensitive to whitespace, which is a bridge too far for this assignment. Unfortunately, both patterns appear in some of the example programs on the page; the fix is to modify them to use parentheses where the problem appears.
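
A minimal sketch of the arity bookkeeping described in item 1 (names and seeded built-ins are illustrative):

    import scala.collection.mutable

    // Parser-global table: procedure name -> arity. Built-ins are seeded
    // up front; user procedures are added as their `to ... end` headers
    // are parsed.
    val arities = mutable.Map("print" -> 1, "setxy" -> 2, "forward" -> 1)

    def recordUserProc(name: String, params: List[String]): Unit =
      arities(name) = params.length

    // A call site consumes exactly arityOf(name) argument expressions.
    // For a procedure we haven't seen yet we must guess; this sketch
    // guesses 0, which is exactly what makes forward references brittle.
    def arityOf(name: String): Int = arities.getOrElse(name, 0)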
