WIP: Use JuliaSyntax to count things out of parsed source code #79

ericphanson · 2023-01-14T13:17:01Z

xref #63

so far:

julia> analyze_syntax(PackageAnalyzer)
Dict{String, Int64} with 6 entries:
  "struct"   => 6
  "usings"   => 16
  "call"     => 593
  "function" => 53
  "imports"  => 1
  "exports"  => 7

c42f · 2023-03-04T04:23:01Z

src/count_loc.jl

+# TODO:
+# Handle `@doc` calls?
+# What about inline comments #= comment =#?
+# Can a docstring not start at the beginning of a line?


Yes.

julia> JuliaSyntax.parseall(JuliaSyntax.SyntaxNode, "x; \"some docstring\"\nf")
line:col│ tree                                   │ file_name
   1:1  │[toplevel]
   1:1  │  [toplevel]
   1:1  │    x
   1:3  │    [macrocall]
   1:3  │      core_@doc
   1:4  │      [string]
   1:5  │        "some docstring"
   2:1  │      f

ok, that makes sense, thanks for the example. For the purpose "counting lines of code", I think I need to make a decision here if this line counts as a "docstring" line or a "code" line. My inclination is to just treat it as a code line, since that seems easier implementation-wise and I think it is also fairly explainable (it counts as the first thing that happens on that line).

It's a pretty unusual case to be honest, so I don't actually think it matters for the purposes of package stats.

But you could count it as both a code line and a docstring line if you wanted? It is, after all, literally both.

Depends on whether you think it's important to have a 1:1 mapping between lines and classification.

I think it is important, yeah. I think it “feels” buggy otherwise, if you try it on a file and the counts don’t add up to the line count of the file you see in your editor. And I think it adds extra conceptual complexity.

c42f · 2023-03-04T04:25:06Z

src/count_loc.jl

+# Handle `@doc` calls?
+# What about inline comments #= comment =#?
+# Can a docstring not start at the beginning of a line?
+# Can there be multiple string nodes on the same line as a docstring?


Unsure exactly what this means. But probably. Consider this:

julia> JuliaSyntax.parseall(JuliaSyntax.SyntaxNode, "\"doc1\" x; \"doc2 \$an_interpolation\" y")
line:col│ tree                                   │ file_name
   1:1  │[toplevel]
   1:1  │  [toplevel]
   1:1  │    [macrocall]
   1:1  │      core_@doc
   1:1  │      [string]
   1:2  │        "doc1"
   1:8  │      x
   1:10 │    [macrocall]
   1:10 │      core_@doc
   1:11 │      [string]
   1:12 │        "doc2 "
   1:18 │        an_interpolation
   1:36 │      y

ah good point. Not sure how to handle this. What I do to count docstrings is I find a docstring call, take the first string node after that, and then look where that node ends (which line). Then I count all those lines as docstring lines. This handles things like

"""
abc
def
"""
f

which should count as 4 lines I think (or maybe 2?).
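
The "look where that node ends" step can be sketched without any parser at all. This is only an illustration, not the code in this PR: `lines_spanned` is a hypothetical helper, and a plain text search for the triple quotes stands in for locating the string node in the syntax tree.

```julia
# Hypothetical helper: map a byte range in `src` to the range of 1-based
# line numbers it touches. Not part of PackageAnalyzer or JuliaSyntax.
function lines_spanned(src::AbstractString, bytes::UnitRange{Int})
    # Byte index at which each line starts.
    line_starts = [1]
    for (i, c) in pairs(src)
        c == '\n' && push!(line_starts, i + 1)
    end
    # searchsortedlast finds the last line whose start is <= the byte index.
    first_line = searchsortedlast(line_starts, first(bytes))
    last_line = searchsortedlast(line_starts, last(bytes))
    return first_line:last_line
end

src = "\"\"\"\nabc\ndef\n\"\"\"\nf\n"

# Stand-in for "the first string node after the docstring call": here the
# opening and closing triple quotes are found by text search (an assumption
# made for the sketch; a real implementation would use the node's byte span).
open_q = findfirst("\"\"\"", src)
close_q = findnext("\"\"\"", src, last(open_q) + 1)

doc_lines = lines_spanned(src, first(open_q):last(close_q))
println("docstring lines: ", doc_lines, " (", length(doc_lines), " lines)")
```

Under this rule the example counts as 4 docstring lines, since both delimiter lines are included in the span.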

If someone did your example, I think that's fine to count as 1 docstring line. But something like

"""
abc
def
"""
function f end; """
a
b
c
"""
function g end

currently is handled badly:

julia> PackageAnalyzer.LineCategories("test/lines_of_code/docstrings.jl")
┌ Debug: Parsing test/lines_of_code/docstrings.jl
└ @ PackageAnalyzer ~/PackageAnalyzer.jl/src/count_loc.jl:109
1  | Docstring | """
2  | Docstring | abc
3  | Docstring | def
4  | Docstring | """
5  | Code      | f; """
6  | Code      | a
7  | Code      | b
8  | Code      | c
9  | Code      | """
10 | Code      | g

BTW: this case does not parse on JuliaSyntax v0.3.0, 0.3.1, or 0.3.2 (but does on v0.2). Even a slightly simpler variant fails:

julia> JuliaSyntax.parse(JuliaSyntax.GreenNode, """
       "
       a
       "
       f
       g
       """; ignore_trivia=false)
ERROR: ParseError:
Error: unexpected text after parsing statement
@ line 5:1
"
f
g

I suspect a lot of these problems would go away if you give up on the 1:1 mapping between source files and classification, and allow each line to have multiple labels.

For the failing parse, you need JuliaSyntax.parseall - that's for parsing a whole file of top-level statements, not just a single statement:

julia> JuliaSyntax.parseall(JuliaSyntax.GreenNode, """
       "
       a
       "
       f
       g
       """; ignore_trivia=false)
     1:10     │[toplevel]
     1:7      │  [macrocall]
     1:0      │    core_@doc         ✔
     1:5      │    [string]
     1:1      │      "
     2:4      │      String          ✔
     5:5      │      "
     6:6      │    NewlineWs
     7:7      │    Identifier        ✔
     8:8      │  NewlineWs
     9:9      │  Identifier          ✔
     10:10    │  NewlineWs

Ahh ok, I didn’t get that, thanks.

I agree it would make these decisions easier… I kinda hate to give it up though. I also don’t think that’s how any other loc tool works. Would love to see examples otherwise though.

If you don't want to give it up that's fine. But you could make it part of the implementation:

Pass 1: Classify all lines into multiple categories according to which nodes touch them

Pass 2: Reduce each line to a single category according to a precedence rule - something like code > docstring > comment or whatever

This would also fix the "what to do about inline comments" in an easy way.
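
A minimal sketch of the two passes, to make the idea concrete. Everything here is hypothetical: the `LineCategory` enum, the function names, and the `spans` input (standing in for whatever a traversal of the syntax tree would produce) are assumptions for illustration, not the PR's actual code.

```julia
# Illustrative categories. The declaration order encodes the precedence
# Code > Docstring > Comment > Blank, because enum values compare by their
# underlying integers.
@enum LineCategory Blank Comment Docstring Code

# Pass 1: record every category that touches each line.
# `spans` is a hypothetical list of (category, line range) pairs.
function collect_categories(nlines::Int, spans)
    per_line = [Set{LineCategory}() for _ in 1:nlines]
    for (category, lines) in spans
        for l in lines
            push!(per_line[l], category)
        end
    end
    return per_line
end

# Pass 2: reduce each line to its single highest-precedence category.
reduce_categories(per_line) = [isempty(s) ? Blank : maximum(s) for s in per_line]

# The 10-line example with two docstrings, where the second docstring
# opens on the same line as `function f end;`.
spans = [
    (Docstring, 1:4),
    (Code, 5:5),
    (Docstring, 5:9),
    (Code, 10:10),
]
per_line = collect_categories(10, spans)
result = reduce_categories(per_line)

for (i, category) in enumerate(result)
    println(lpad(i, 2), " | ", category)
end
```

With this rule, line 5 is touched by both `Code` and `Docstring` and resolves to `Code`, lines 6 to 9 stay `Docstring`, and the result still has exactly one label per line, so the counts add up to the file's line count.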

This is a great idea. I was also able to push it up into the traversal so I don't need to keep a list of all objects on every line, and I think having a clear rule made that easier too.

edit: into the traversal of the GreenNode, not all the way up to replace constructing the GreenNode, which would be even more efficient

Sounds great!

c42f · 2023-03-04T05:06:28Z

src/count_loc.jl

+using .CategorizeLines
+
+# TODO:
+# Handle `@doc` calls?


I don't think @doc matters. It's rarely used and supporting it won't matter much for aggregate package stats. And when it is used, it can be used for things other than attaching docstrings to objects. The @doc api is bizarre if you ask me :)

c42f · 2023-03-04T10:22:09Z

src/count_loc.jl

+
+# TODO:
+# Handle `@doc` calls?
+# What about inline comments #= comment =#?


Unsure. Context?

I think I was trying to decide if they should count as "code" lines or "comment" lines. I think calling them "code" here is fine though and then won't need any special handling, if there's other code on the line. But I might need special handling for multiline comments.

c42f · 2023-03-06T02:30:20Z

Overall comments on this:

Having a well-defined category precedence (e.g., code > docstring > comment) would make categorizing lines easy, given there can be multiple categories per line. (cf. the earlier comment about a two-pass system.)
The API for JuliaSyntax.SyntaxNode is ... honestly kind of WIP and not entirely awesome haha 😬 I feel like some additions to that could really help you here, though I'm not entirely sure what they should be. Ideas welcome.

c42f · 2023-03-14T06:16:54Z

I've just thought of an upstream change that would help make this cleaner.

I'm planning to emit docstrings as nodes of kind K"doc" in the next version - this reflects the fact that it's special surface syntax, not an explicit macro call. (You can already detect this by detecting the special kind K"core_@doc", but I think emitting a K"macrocall" for this is one of those cases where lowering has crept into the parser a little.)

… JuliaSyntax changes

c42f · 2023-05-21T04:06:54Z

src/syntax.jl

+        elseif k == K"doc"
+            kids = JuliaSyntax.children(node)
+            # Is this ever not a string?
+            kind(kids[1].raw) == K"string" || return


This must always be a string now: In JuliaSyntax 0.4 I made sure all strings are wrapped in K"string", regardless of whether they're literals or contain interpolations.

Ah ok, thanks for letting me know!

In general, I am not sure what to do with expected invariants in this kind of code. I think maybe I should copy over @maybecatch from https://github.com/beacon-biosignals/SlackThreads.jl/blob/74351c2863ec9a1cf22732873d4d2816aa9c140d/src/SlackThreads.jl#L27-L49 so we can get errors when testing and emit them as logs (maybe debug logs) at runtime, since we want to be able to run this over General and all kinds of things can happen.

You could just use an @assert? This particular invariant shouldn't be able to be broken regardless of broken package code - it's the parser's job to ensure that's true.

But I agree catch-and-log is also generally appropriate near the top level of a tool which you'll run across all of General. It's nice for the tool to continue even if a single package breaks an invariant you expected when writing the code.
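
Roughly, the catch-and-log pattern near the top level could look like the sketch below. The names `analyze_all` and `analyze` are made up for illustration and are not PackageAnalyzer's API; the point is only that a broken invariant (for example a failed `@assert` deep in the analysis) is logged for one package while the run continues for the rest.

```julia
# `analyze` is a hypothetical per-package analysis function supplied by the
# caller; `packages` is any iterable of package names.
function analyze_all(analyze, packages)
    results = Dict{String,Any}()
    for pkg in packages
        try
            results[pkg] = analyze(pkg)
        catch err
            # Log and keep going so that one bad package does not stop the run.
            # Debug-level logs are hidden by default and can be enabled when testing.
            @debug "Static analysis failed" pkg exception = (err, catch_backtrace())
            results[pkg] = nothing
        end
    end
    return results
end

# Usage with a stand-in analysis that fails for one package.
out = analyze_all(["A", "B", "C"]) do pkg
    pkg == "B" && error("unexpected syntax tree shape")
    length(pkg)
end
println(out)
```

Storing `nothing` for a failed package is just one choice; it mirrors "giving up on static analysis" for that package while keeping the other results.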

it's the parser's job to ensure that's true

This wouldn't be true if we needed to analyze code after macro expansion of course. That involves executing user code and all bets are off. But we don't do that yet :-)

Will the parser maintain that invariant even if the source code is somehow invalid?

(Because currently we analyze all *.jl files, even ones that are never included in package code, and may not actually be Julia source code…)

Yes, by the definition of a doc block really: we wouldn't parse something as a K"doc" unless it has a string first followed by an expression on the same or next line.

Ah ok, makes sense. I put it in an assert but wrapped it in a try-catch. Sounds like I can remove that wrapper. The outer try-catch will still catch it if something does go wrong (but we will give up on all static analysis in that case).

This looks promising! What is the status? What is missing?
I would be interested in: Number of functions, and average and maximal lines of code per function...

Thanks, those are helpful to know @ufechner7 ! I got stuck trying to figure out when a function is being extended from another package/base (e.g. a new method for getindex) vs being introduced in this package. It is tricky in the presence of submodules. My latest thought has been to skip all that, for now at least, and just count "methods" rather than "functions". However I moved to first land #86 separately.

ericphanson added 17 commits November 24, 2022 13:40

wip

108b243

Merge remote-tracking branch 'origin/main' into eph/syntax

c553e87

wip

dbbd2c0

wip

6afeadf

wip

41f2838

wip

76c60f2

wip counting stuff

3533b4d

use custom julia parsing

3a8b839

some cleanup

9162fd5

cleanup

3a838b0

wip

5c81a9b

rm using & import counts

529b848

wip

d7a3636

rename

6b5c36e

fix padding

19b34b7

bump JuliaSyntax

58ee021

wip

71f8d16