WIP: Use JuliaSyntax to count things out of parsed source code #79
base: main
Conversation
src/count_loc.jl
Outdated
# TODO:
# Handle `@doc` calls?
# What about inline comments #= comment =#?
# Can a docstring not start at the beginning of a line?
Yes.
julia> JuliaSyntax.parseall(JuliaSyntax.SyntaxNode, "x; \"some docstring\"\nf")
line:col│ tree │ file_name
1:1  │[toplevel]
1:1  │  [toplevel]
1:1  │    x
1:3  │    [macrocall]
1:3  │      core_@doc
1:4  │      [string]
1:5  │        "some docstring"
2:1  │      f
Ok, that makes sense, thanks for the example. For the purpose of counting lines of code, I think I need to decide here whether this line counts as a "docstring" line or a "code" line. My inclination is to just treat it as a code line, since that seems easier implementation-wise and I think it is also fairly explainable (it counts as the first thing that happens on that line).
It's a pretty unusual case to be honest, so I don't actually think it matters for the purposes of package stats.
But you could count it as both a code line and a docstring line if you wanted? It is, after all, literally both.
It depends on whether you think it's important to have a 1:1 mapping between lines and classifications.
I think it is important, yeah. I think it "feels" buggy otherwise, if you try it on a file and the counts don't add up to the line count of the file you see in your editor. And I think allowing multiple labels adds extra conceptual complexity.
src/count_loc.jl
Outdated
# Handle `@doc` calls?
# What about inline comments #= comment =#?
# Can a docstring not start at the beginning of a line?
# Can there be multiple string nodes on the same line as a docstring?
Unsure exactly what this means, but probably. Consider this:
julia> JuliaSyntax.parseall(JuliaSyntax.SyntaxNode, "\"doc1\" x; \"doc2 \$an_interpolation\" y")
line:col│ tree │ file_name
1:1  │[toplevel]
1:1  │  [toplevel]
1:1  │    [macrocall]
1:1  │      core_@doc
1:1  │      [string]
1:2  │        "doc1"
1:8  │      x
1:10 │    [macrocall]
1:10 │      core_@doc
1:11 │      [string]
1:12 │        "doc2 "
1:18 │        an_interpolation
1:36 │      y
Ah, good point. Not sure how to handle this. What I do to count docstrings is: find a docstring call, take the first string node after it, and look at which line that node ends on. Then I count all those lines as docstring lines. This handles things like
"""
abc
def
"""
f
which should count as 4 lines I think (or maybe 2?).
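Concretely, that approach might look something like the rough sketch below (hypothetical helper names, not the PR's actual code; it assumes the K"doc" node kind that newer JuliaSyntax emits for docstrings, which comes up further down in this thread, and the exact accessors for a node's line range may differ between versions). On the 4-line example above, the string node spans lines 1-4, so those would all be marked as docstring lines.

using JuliaSyntax
using JuliaSyntax: SyntaxNode, kind, children, @K_str

# Hypothetical helpers: first and last source line covered by a node.
first_line(node) = JuliaSyntax.source_location(node)[1]
last_line(node) = JuliaSyntax.source_location(node.source, JuliaSyntax.last_byte(node))[1]

# Walk the tree; at each doc node, take its first string child and mark every
# line that child spans as a docstring line.
function docstring_lines!(doclines::Set{Int}, node::SyntaxNode)
    kids = children(node)
    (kids === nothing || isempty(kids)) && return doclines
    if kind(node) == K"doc"
        i = findfirst(k -> kind(k) == K"string", kids)
        if i !== nothing
            str = kids[i]
            union!(doclines, first_line(str):last_line(str))
        end
    end
    foreach(k -> docstring_lines!(doclines, k), kids)
    return doclines
end

docstring_lines(src::AbstractString) =
    docstring_lines!(Set{Int}(), JuliaSyntax.parseall(SyntaxNode, src))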
If someone did your example, I think that's fine to count as 1 docstring line. But something like
"""
abc
def
"""
function f end; """
a
b
c
"""
function g end
currently is handled badly:
julia> PackageAnalyzer.LineCategories("test/lines_of_code/docstrings.jl")
┌ Debug: Parsing test/lines_of_code/docstrings.jl
└ @ PackageAnalyzer ~/PackageAnalyzer.jl/src/count_loc.jl:109
1 | Docstring | """
2 | Docstring | abc
3 | Docstring | def
4 | Docstring | """
5 | Code | f; """
6 | Code | a
7 | Code | b
8 | Code | c
9 | Code | """
10 | Code | g
BTW: this case does not parse on JuliaSyntax v0.3.0, 0.3.1, or 0.3.2 (but it does on v0.2). Even a slightly simpler variant fails:
julia> JuliaSyntax.parse(JuliaSyntax.GreenNode, """
"
a
"
f
g
"""; ignore_trivia=false)
ERROR: ParseError:
Error: unexpected text after parsing statement
@ line 5:1
"
f
g
I suspect a lot of these problems would go away if you give up on the 1:1 mapping between source lines and classifications, and allow each line to have multiple labels.
For the failing parse, you need JuliaSyntax.parseall - that's for parsing a whole file of top-level statements, not just a single statement:
julia> JuliaSyntax.parseall(JuliaSyntax.GreenNode, """
"
a
"
f
g
"""; ignore_trivia=false)
1:10  │[toplevel]
1:7   │  [macrocall]
1:0   │    core_@doc ✔
1:5   │    [string]
1:1   │      "
2:4   │      String ✔
5:5   │      "
6:6   │    NewlineWs
7:7   │    Identifier ✔
8:8   │  NewlineWs
9:9   │  Identifier ✔
10:10 │  NewlineWs
Ahh ok, I didn’t get that, thanks.
I agree it would make these decisions easier… I kinda hate to give it up though. I also don’t think that’s how any other loc tool works. Would love to see examples otherwise though.
If you don't want to give it up, that's fine. But you could make it part of the implementation:
- Pass 1: Classify all lines into multiple categories according to which nodes touch them.
- Pass 2: Reduce each line to a single category according to a precedence rule - something like code > docstring > comment, or whatever (a rough sketch follows below).
This would also fix the "what to do about inline comments" question in an easy way.
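A minimal sketch of that two-pass reduction, with illustrative names (not the PR's implementation; pass 1 is represented abstractly as a per-line set of categories):

# Precedence encoded in the enum order: Code > Docstring > Comment > Blank.
@enum LineCategory Blank Comment Docstring Code

# Pass 2: `labels` is the pass-1 result mapping each line number to the set of
# categories of all nodes touching that line; keep the highest-precedence one.
function reduce_labels(labels::Dict{Int,Set{LineCategory}})
    return Dict(line => maximum(cats) for (line, cats) in labels)
end

# e.g. a line touched by both a docstring node and code reduces to Code:
# reduce_labels(Dict(5 => Set([Docstring, Code])))[5] == Code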
This is a great idea. I was also able to push it up into the traversal so I don't need to keep a list of all objects on every line, and I think having a clear rule made that easier too.
edit: pushed into the traversal of the GreenNode, not all the way up to replacing construction of the GreenNode, which would be even more efficient
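For reference, a rough sketch of what folding the precedence rule into a GreenNode traversal could look like (hypothetical and simplified, not the PR's actual code; it reuses LineCategory from the sketch above, and the exact JuliaSyntax accessors may differ between versions):

using JuliaSyntax
using JuliaSyntax: GreenNode, SourceFile, kind, children, span, @K_str

# Hypothetical refinement rule: map a node kind plus the inherited category to
# the category for this subtree (real code would also handle docstrings, etc.).
refine(k, inherited) = k == K"Comment" ? Comment : inherited

function categorize!(cats::Dict{Int,LineCategory}, file::SourceFile,
                     node::GreenNode, offset::Int, inherited::LineCategory)
    span(node) == 0 && return cats  # skip zero-width synthetic nodes
    cat = refine(kind(node), inherited)
    kids = children(node)
    if kids === nothing || isempty(kids)
        # Leaf: bump the category of every line this token touches, keeping the
        # highest-precedence category seen so far (no per-line sets needed).
        k = kind(node)
        if k != K"NewlineWs" && k != K"Whitespace"
            firstline = JuliaSyntax.source_location(file, offset + 1)[1]
            lastline = JuliaSyntax.source_location(file, offset + span(node))[1]
            for line in firstline:lastline
                cats[line] = max(get(cats, line, Blank), cat)
            end
        end
    else
        # Internal node: recurse, tracking each child's byte offset.
        child_offset = offset
        for child in kids
            categorize!(cats, file, child, child_offset, cat)
            child_offset += span(child)
        end
    end
    return cats
end

# Usage sketch:
# text = read("src/foo.jl", String)
# tree = JuliaSyntax.parseall(GreenNode, text)
# categorize!(Dict{Int,LineCategory}(), SourceFile(text), tree, 0, Code)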
Sounds great!
src/count_loc.jl
Outdated
using .CategorizeLines

# TODO:
# Handle `@doc` calls?
I don't think @doc matters. It's rarely used and supporting it won't matter much for aggregate package stats. And when it is used, it can be used for things other than attaching docstrings to objects. The @doc API is bizarre if you ask me :)
src/count_loc.jl
Outdated
# TODO:
# Handle `@doc` calls?
# What about inline comments #= comment =#?
Unsure. Context?
I think I was trying to decide whether they should count as "code" lines or "comment" lines. I think calling them "code" here is fine though, and then they won't need any special handling if there's other code on the line. But I might need special handling for multiline comments.
Overall comments on this:
I've just thought of an upstream change that would help make this cleaner: I'm planning to emit docstrings as nodes of kind K"doc".
src/syntax.jl
Outdated
elseif k == K"doc"
    kids = JuliaSyntax.children(node)
    # Is this ever not a string?
    kind(kids[1].raw) == K"string" || return
This must always be a string now: in JuliaSyntax 0.4 I made sure all strings are wrapped in K"string", regardless of whether they're literals or contain interpolations.
Ah ok, thanks for letting me know!
In general, I am not sure what to do with expected invariants in this kind of code. I think maybe I should copy over @maybecatch from https://github.com/beacon-biosignals/SlackThreads.jl/blob/74351c2863ec9a1cf22732873d4d2816aa9c140d/src/SlackThreads.jl#L27-L49 so we can get errors when testing and emit them as logs (maybe debug logs) at runtime, since we want to be able to run this over General and all kinds of things can happen.
You could just use an @assert? This particular invariant shouldn't be able to be broken regardless of broken package code - it's the parser's job to ensure that's true.
But I agree catch-and-log is also generally appropriate near the top level of a tool which you'll run across all of General. It's nice for the tool to continue even if a single package breaks an invariant you expected when writing the code.
it's the parser's job to ensure that's true
This wouldn't be true if we needed to analyze code after macro expansion of course. That involves executing user code and all bets are off. But we don't do that yet :-)
Will the parser maintain that invariant even if the source code is somehow invalid?
(Because currently we analyze all *.jl files, even ones that are never included in package code and may not actually be Julia source code…)
Yes, by the definition of a doc block really: we wouldn't parse something as a K"doc" unless it has a string first, followed by an expression on the same or next line.
Ah ok, makes sense. I put it in an assert but wrapped it in a try-catch. Sounds like I can remove that wrapper. The outer try-catch will still catch it if something does go wrong (but we will give up on all static analysis in that case).
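For illustration, the shape of that could be something like this sketch (hypothetical function names, not the PR's actual code):

using JuliaSyntax
using JuliaSyntax: SyntaxNode, kind, children, @K_str

function handle_doc_node(node)
    kids = children(node)
    # The parser guarantees a doc node's first child is a string, so a plain
    # assert (no try-catch wrapper) is enough here.
    @assert kind(kids[1]) == K"string"
    # ... count the lines the docstring spans ...
end

function analyze_file(path)
    try
        tree = JuliaSyntax.parseall(SyntaxNode, read(path, String))
        # ... traverse `tree`, calling helpers such as `handle_doc_node` ...
    catch err
        # One bad file (or a broken invariant) shouldn't abort a run over all
        # of General; log it and give up on static analysis for this file only.
        @debug "Static analysis failed" path exception = (err, catch_backtrace())
        return nothing
    end
end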
This looks promising! What is the status? What is missing?
I would be interested in: the number of functions, and the average and maximum lines of code per function...
Thanks, those are helpful to know @ufechner7! I got stuck trying to figure out when a function is being extended from another package/Base (e.g. a new method for getindex) vs being introduced in this package. It is tricky in the presence of submodules. My latest thought has been to skip all that, for now at least, and just count "methods" rather than "functions". However, I moved to first land #86 separately.
xref #63
so far: