Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Use JuliaSyntax to count things out of parsed source code #79

Draft
wants to merge 35 commits into
base: main
Choose a base branch
from

Conversation

ericphanson
Copy link
Member

xref #63

so far:

julia> analyze_syntax(PackageAnalyzer)
Dict{String, Int64} with 6 entries:
  "struct"   => 6
  "usings"   => 16
  "call"     => 593
  "function" => 53
  "imports"  => 1
  "exports"  => 7

src/count_loc.jl Outdated
# TODO:
# Handle `@doc` calls?
# What about inline comments #= comment =#?
# Can a docstring not start at the beginning of a line?
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

julia> JuliaSyntax.parseall(JuliaSyntax.SyntaxNode, "x; \"some docstring\"\nf")
line:col│ tree                                   │ file_name
   1:1  │[toplevel]
   1:1  │  [toplevel]
   1:1  │    x
   1:3  │    [macrocall]
   1:3  │      core_@doc
   1:4  │      [string]
   1:5  │        "some docstring"
   2:1  │      f

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, that makes sense, thanks for the example. For the purpose "counting lines of code", I think I need to make a decision here if this line counts as a "docstring" line or a "code" line. My inclination is to just treat it as a code line, since that seems easier implementation-wise and I think it is also fairly explainable (it counts as the first thing that happens on that line).

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a pretty unusual case to be honest, so I don't actually think it matters for the purposes of package stats.

But you could count it as both a code line and a docstring line if you wanted? It is, after all, literally both.

Depends if you think it's important to have a 1:1 mapping between lines and classification

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is important, yeah. I think it “feels” buggy otherwise, if you try it on a file and the counts don’t add up to the line count of the file you see in your editor. And I think it adds extra conceptual complexity.

src/count_loc.jl Outdated
# Handle `@doc` calls?
# What about inline comments #= comment =#?
# Can a docstring not start at the beginning of a line?
# Can there be multiple string nodes on the same line as a docstring?
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unsure exactly what this means. But probably. consider this:

julia> JuliaSyntax.parseall(JuliaSyntax.SyntaxNode, "\"doc1\" x; \"doc2 \$an_interpolation\" y")
line:col│ tree                                   │ file_name
   1:1  │[toplevel]
   1:1  │  [toplevel]
   1:1  │    [macrocall]
   1:1  │      core_@doc
   1:1  │      [string]
   1:2  │        "doc1"
   1:8  │      x
   1:10 │    [macrocall]
   1:10 │      core_@doc
   1:11 │      [string]
   1:12 │        "doc2 "
   1:18 │        an_interpolation
   1:36 │      y

Copy link
Member Author

@ericphanson ericphanson Mar 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah good point. Not sure how to handle this. What I do to count docstrings is I find a docstring call, take the first string node after that, and then look where that node ends (which line). Then I count all those lines as docstring lines. This handles things like

"""
abc
def
"""
f

which should count as 4 lines I think (or maybe 2?).

If someone did your example, I think that's fine to count as 1 docstring line. But something like

"""
abc
def
"""
function f end; """
a
b
c
"""
function g end

currently is handled badly:

julia> PackageAnalyzer.LineCategories("test/lines_of_code/docstrings.jl")
┌ Debug: Parsing test/lines_of_code/docstrings.jl
└ @ PackageAnalyzer ~/PackageAnalyzer.jl/src/count_loc.jl:109
1    | Docstring | """
2    | Docstring | abc
3    | Docstring | def
4    | Docstring | """
5    | Code      | f; """
6    | Code      | a
7    | Code      | b
8    | Code      | c
9    | Code      | """
10   | Code      | g

BTW: this case does not parse on JuliaSyntax v0.3.0, 0.3.1, or 0.3.2 (but does on v0.2). Even a slightly simple variant fails:

julia> JuliaSyntax.parse(JuliaSyntax.GreenNode, """
       "
       a
       "
       f
       g
       """; ignore_trivia=false)
ERROR: ParseError:
Error: unexpected text after parsing statement
@ line 5:1
"
f
g

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect a lot of these problems would go away if you give up on the 1:1 mapping between source files and classification, and allow each line to have multiple labels.

For the failing parse, you need JuliaSyntax.parseall - that's for parsing a whole file of top-level statements, not just a single statement:

julia> JuliaSyntax.parseall(JuliaSyntax.GreenNode, """
              "
              a
              "
              f
              g
              """; ignore_trivia=false)
     1:10     │[toplevel]
     1:7      │  [macrocall]
     1:0      │    core_@doc1:5      │    [string]
     1:1"
     2:4      │      String             ✔
     5:5      │      "
     6:6      │    NewlineWs
     7:7      │    Identifier           ✔
     8:8      │  NewlineWs
     9:9      │  Identifier             ✔
    10:10     │  NewlineWs

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh ok, I didn’t get that, thanks.

I agree it would make these decisions easier… I kinda hate to give it up though. I also don’t think that’s how any other loc tool works. Would love to see examples otherwise though.

Copy link

@c42f c42f Mar 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you don't want to give it up that's fine. But you could make it part of the implementation:

  • Pass 1: Classify all lines into multiple categories according to which nodes touch them
  • Pass 2: Reduce each line to a single category according to a precedence rule - something like code > docstring > comment or whatever

This would also fix the "what to do about inline comments" in an easy way.

Copy link
Member Author

@ericphanson ericphanson May 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great idea. I was also able to push it up into the traversal so I don't need to keep a list of all objects on every line, and I think having a clear rule made that easier too.

edit: into the traversal of the GreenNode, not all the way up to replace constructing the GreenNode, which would be even more efficient

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds great!

src/count_loc.jl Outdated Show resolved Hide resolved
src/count_loc.jl Show resolved Hide resolved
src/count_loc.jl Outdated Show resolved Hide resolved
src/count_loc.jl Outdated Show resolved Hide resolved
src/syntax.jl Outdated Show resolved Hide resolved
src/count_loc.jl Outdated
using .CategorizeLines

# TODO:
# Handle `@doc` calls?
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think @doc matters. It's rarely used and supporting it won't matter much for aggregate package stats. And when it is used, it can be used for things other than attaching docstrings to objects. The @doc api is bizarre if you ask me :)

src/count_loc.jl Outdated

# TODO:
# Handle `@doc` calls?
# What about inline comments #= comment =#?
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unsure. Context?

Copy link
Member Author

@ericphanson ericphanson Mar 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I was trying to decide if they should count as "code" lines or "comment" lines. I think calling them "code" here is fine though and then won't need any special handling, if there's other code on the line. But I might need special handling for multiline comments.

@c42f
Copy link

c42f commented Mar 6, 2023

Overall comments on this:

  • Having a well defined category precedence (eg, code > docstring > comment) would make categorizing lines easy, given there can be multiple categories per line. (cf in the comment about a two-pass system.)
  • The API for JuliaSyntax.SyntaxNode is ... honestly kind of WIP and not entirely awesome haha 😬 I feel like some additions to that could really help you here, though I'm not entirely sure what they should be. Ideas welcome.

@c42f
Copy link

c42f commented Mar 14, 2023

I've just thought of an upstream change that would help make this cleaner.

I'm planning to emit docstrings as nodes of kind K"doc" in the next version - this reflects the fact that it's special surface syntax, not an explicit macro call. (You can already detect this by detecting the special kind K"core_@doc", but I think emitting a K"macrocall" for this is one of those cases where lowering has crept into the parser a little.)

src/syntax.jl Outdated
elseif k == K"doc"
kids = JuliaSyntax.children(node)
# Is this ever not a string?
kind(kids[1].raw) == K"string" || return
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This must always be a string now: In JuliaSyntax 0.4 I made sure all strings are wrapped in K"string", regardless of whether they're literals or contain interpolations,

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok, thanks for letting me know!

In general, I am not sure what to do with expected invariants in this kind of code. I think maybe I should copy over @maybecatch from https://github.com/beacon-biosignals/SlackThreads.jl/blob/74351c2863ec9a1cf22732873d4d2816aa9c140d/src/SlackThreads.jl#L27-L49 so we can get errors when testing and emit them as logs (maybe debug logs) at runtime, since we want to be able to run this over General and all kinds of things can happen.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could just use an @assert? This particular invariant shouldn't be able to be broken regardless of broken package code - it's the parser's job to ensure that's true.

But I agree catch-and-log is also generally appropriate near the top level of a tool which you'll run across all of General. It's nice for the tool to continue even if a single package breaks an invariant you expected when writing the code.

Copy link

@c42f c42f May 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's the parser's job to ensure that's true

This wouldn't be true if we needed to analyze code after macro expansion of course. That involves executing user code and all bets are off. But we don't do that yet :-)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will the parser maintain that invariant even if the source code is somehow invalid?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Because currently we analyze all *.jl files, even ones that are never included in package code, and may not actually be Julia source code…)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, by the definition of a doc block really: we wouldn't parse something as a K"doc" unless it has a string first followed by an expression on the same or next line.

Copy link
Member Author

@ericphanson ericphanson May 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok, makes sense. I put it in an assert but wrapped it in a try-catch. Sounds like I can remove that wrapper. The outer try-catch will get it still if something does go wrong still (but we will give up on all static analysis in that case).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks promising! What is the status? What is missing?
I would be interested in: Number of functions, and average and maximal lines of code per function...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, those are helpful to know @ufechner7 ! I got stuck trying to figure out when a function is being extended from another package/base (e.g. a new method for getindex) vs being introduced in this package. It is tricky in the presence of submodules. My latest thought has been to skip all that, for now at least, and just count "methods" rather than "functions". However I moved to first land #86 separately.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants