Investigate tree-sitter to replace syntect #1787
Comments
I haven't used tree-sitter as a library but it's really nice in nvim. |
👍 ts is the bee's knees. The official tree-sitter-highlight crate has a few nice examples in the README... I haven't used it before, but I'm interested to try... |
If it's adopted (I don't know yet, I need to see the theming capabilities and just try it on various inputs), it would be its own package that can probably be on crates.io as well. I'd like to move all the line numbers/highlights etc. into it. |
List of parsers from neovim: https://github.com/nvim-treesitter/nvim-treesitter/blob/master/lua/nvim-treesitter/parsers.lua |
What would the tree-sitter output look like? would it list a ton of classes in the generated html like the current syntect solution does when you use css mode? or would it use classes that refer to css variables which we can then set to specific colors and styles, eg: |
Ideally it would be the same kind of output as the current syntect one. |
All the class definitions make the page source code much larger in size, but I can see how it would make things simpler as far as generating goes. If you simply used classes which refer to colors, then you would have to have some sort of lookup table per programming language. (because a bracket in one language might be colored, but in another language it might not be colored or colored differently) I just wish there was a simple way to have much leaner generated html for syntax highlighting while using the css method. |
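A minimal sketch of the leaner output discussed above, assuming the highlighter yields (capture name, text) pairs: each token gets exactly one class derived from its capture name, and the colours live once in the stylesheet. The `hl-` prefix and the `class_for` and `render` helpers are hypothetical and are not part of syntect or tree-sitter-highlight.

```rust
/// Escape the characters that are significant in HTML text.
fn escape_html(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '"' => out.push_str("&quot;"),
            _ => out.push(c),
        }
    }
    out
}

/// Turn a dotted capture name such as "function.builtin" into a single
/// CSS class. The "hl-" prefix is an assumption made for this sketch.
fn class_for(capture: &str) -> String {
    format!("hl-{}", capture.replace('.', "-"))
}

/// Render (capture, text) pairs to HTML. `None` means plain text that
/// needs no wrapping span.
fn render(tokens: &[(Option<&str>, &str)]) -> String {
    let mut html = String::new();
    for &(capture, text) in tokens {
        match capture {
            Some(name) => {
                html.push_str("<span class=\"");
                html.push_str(&class_for(name));
                html.push_str("\">");
                html.push_str(&escape_html(text));
                html.push_str("</span>");
            }
            None => html.push_str(&escape_html(text)),
        }
    }
    html
}
```

With this scheme the class names are shared across languages, so a per-language lookup table is only needed if a theme wants the same capture styled differently per language.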
I have the HTML renderer working, now to figure out how to use a VS Code theme to link scopes with tree-sitter to know which colour to use... |
I think I'll forget the VSCode themes as they can be in JSON, YAML or even JS. It can probably just be a tiny ~20-line key-value ini file since it's not like we are going to have hundreds of scopes. Not much to show so far but I've set up https://github.com/getzola/giallo which is still very much a clone of the HTML renderer in tree-sitter-highlight since I got stuck on theming. |
This solution will be able to work with CSS, right? I ask because the description in the top right of giallo says "Syntax highlighter to HTML using tree-sitter, using VSCode theme", and the wording does not mention CSS, eg: HTML/CSS. Really appreciate the work on this; I do hope CSS will still be supported, so let me know if you need any help/testing. I am actually OK without language-specific scopes on most things because the resulting output will likely be a lot leaner. I cannot speak to what is normal because a LOT of my editing over the years was in Notepad, until a few years ago when I switched to Atom; just recently I switched to Kate because of how long Atom takes to load. |
Yes, it's just a matter of exporting a theme as CSS, which is trivial. |
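As a rough sketch of the key-value theme idea and the CSS export mentioned above: the `scope = #colour` line format, the `;` comment marker, the `hl-` class prefix and both function names are assumptions made for illustration, not giallo's actual format or API.

```rust
use std::collections::BTreeMap;

/// Parse a tiny ini-like theme: one `scope = colour` pair per line.
/// Blank lines, lines starting with ';' and lines without '=' are skipped.
fn parse_theme(input: &str) -> BTreeMap<String, String> {
    let mut theme = BTreeMap::new();
    for line in input.lines() {
        let line = line.trim();
        if line.is_empty() || line.starts_with(';') {
            continue;
        }
        if let Some((scope, colour)) = line.split_once('=') {
            theme.insert(scope.trim().to_string(), colour.trim().to_string());
        }
    }
    theme
}

/// Export the theme as CSS, one rule per scope. Dots in scope names
/// become dashes so that each scope maps to a single class.
fn theme_to_css(theme: &BTreeMap<String, String>) -> String {
    let mut css = String::new();
    for (scope, colour) in theme {
        css.push_str(&format!(
            ".hl-{} {{ color: {}; }}\n",
            scope.replace('.', "-"),
            colour
        ));
    }
    css
}
```

A `BTreeMap` keeps the rules in a stable order, so the generated stylesheet does not change between builds. The colour values a theme would contain are up to the theme author; none are implied here.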
So are you mostly asking for feedback from people that use VSCode? Are you asking if it is common to have a lot of language specific scopes, or are you asking if it would be ok to have less language-specific scopes? I can install vscode in a VM and play around with it for a bit (never used it before) Did not realize vscode was open source was as easy as |
Mostly asking for people with knowledge of tree-sitter to see what they know about scopes. Also curious about neovim-treesitter and how themes/scopes are defined. Differences are probably due to me getting some scopes wrong when looking at the theme and/or missing some necessary scopes. I'll try to fix it when I'm not tired, but it's kind of acceptable. |
Hey, Don't get me wrong it's an awesome tool and on top of that it's really fast, I just think that for a web facing thing it would be nice to have syntax highlighting for the kind of stuff you could have on a website, like for example:
PS: Still think tree-sitter is a nice replacement. I understand that the kind of tool able to do that might not be easy to deal with, and it's really not that hard to use |
The issue with syntect is that we are stuck with two-year-old, buggy Sublime syntaxes (the JS one, for example, can take forever to highlight a snippet), since Sublime introduced new syntax features that syntect does not support. The choices are:
I think 1 is a dead end in the long run as the Sublime Text people can keep changing their spec however they want. I started porting pygments to Rust a while back, and it would be an easy solution for people wanting to add syntaxes since a syntax could be just a yaml/toml file. It would also be annoying to use VSCode/Sublime themes as the scopes are very different. Tree-sitter is nicer in that the highlights are much more accurate, it's easy to port TextMate themes and I wouldn't have to maintain it. It's harder to provide custom syntaxes like Zola currently allows though. |
I've started using Helix themes and queries and the result is really good. With their OneDark theme and the default Rust highlight query: With their OneDark theme and their Rust highlight query: The last screenshot is pretty much the same as opening that file in VSCode. Helix is a really great match as they have already a great collection of themes and a lot of improvements to the default queries. I'll see if they are ok with moving those bits out of the main repo for collaboration, otherwise it can be solved with copying and licensing. |
Yes that bottom one does indeed look really nice! |
Tree-sitter is capable of really nice syntax highlighting but there are some drawbacks to consider. For the 109 languages supported in Helix, the total size of the compiled parsers is 108.5 MiB. Most compiled parsers are somewhere on the order of hundreds of KiB with some larger parsers on the order of ones or tens of MiB. The queries are altogether very small: only 1.7 MiB for all of them. The parsers are also C and many languages have C++ external scanners, so you would need to add compile-time dependencies on a C++ toolchain. It's a large amount of work to add support for a language which doesn't have a tree-sitter parser yet. With regular expression based highlighting you can work incrementally - start with a few highlights and add more as you go - but it's hard to write a parser that incrementally covers the full syntax of a language. Language support has become very mature recently with tree-sitter though and there are even parsers for non-programming languages (I have a few for git commits, configs, rebase syntax, diffs). We're happy to take those tradeoffs with Helix since tree-sitter can be used to build so many features (syntax highlighting, syntax-based motions, textobjects, indentation, rainbow brackets) but those tradeoffs are worth some consideration for Zola. There's a similar project which could be more appropriate: https://lezer.codemirror.net/ but admittedly I haven't used it and I think the language support is less full. Plus then the syntax highlighting would need to be done client-side. All of that being said, I would really love to see tree-sitter syntax highlighting in Zola. At least selfishly since the Helix website uses Zola :) |
I don't like solutions that require client-side highlighting (unnecessary JavaScript), the page would load significantly slower. I prefer a solution that makes efficient use of html/css to style the page. I went out of my way to make the back to top button CSS only for the abridge theme so that it would be one less JavaScript file. I am not completely against JavaScript, I make plenty of use of it in abridge, I just don't like using JavaScript when there is a more efficient way of solving a problem (page speed performance). I wonder if Zola makes use of any other tools/libraries that are also C/C++, or if supporting Helix would be the first one? Very cool that there are parsers for: git commits, configs, rebase syntax, diffs |
Argh, I didn't know the parsers were that big :o. From Helix 22.05:
I'm not planning to add all of those but just the Verilog one is the same size as current Zola x)
Zola is super annoying to build on Windows because of libsass requirements. Definitely not the first one. |
I have a working prototype of tree-sitter highlighting working for zola with Helix themes on branch https://github.com/lf-/zola/tree/tree-painter It uses https://github.com/matze/tree-painter/ as a back end instead of the one @Keats was writing a couple months ago, just because it seems to have all the highlighting to HTML done already. It's probably not upstreamable as is, containing a good many hacks, and also compiles the treesitter stuff statically which is more convenient but makes LTO infeasible due to absurd link times (I've not investigated how to selectively do LTO). Feel free to take any amount of it that you'd like; I don't have resources to clean it up to upstream it. Also, the perf is Not Good. Even with the LTO build I had, my site build went from 36ms to 800ms. I don't know why it's slow, and probably the best way to figure that out is to instrument Zola with
I don't have the same constraints as Zola is designed for (binary size does not bother me, generation time does not bother me as long as it's not a workflow blocker), and I've got it good enough to power my site, so I'm stopping where I got to. Regarding the highlight groups, it's distinctly possible that the different clients are using different queries. The treesitter parsers often come with highlight queries, but nvim-treesitter seems to vendor theirs. My speculation is that a big reason for this is that nvim-treesitter has some nonstandard features such as the The way that I've debugged these is by using my nvim which has nvim-treesitter-playground installed and using Anyway, good luck! Good highlighting is really important to programming blogs, and I almost got rid of Zola over it before realizing it was probably easier to hack it in instead. Sample: Before (notice that some After: |
One difficulty with any form of tree-sitter integration is that building parsers is a nightmare due to Cargo being quite bad at submodules. More details here: matze/tree-painter#3 This could be done either statically linked or dynamically linked, but I would lean toward dynamic linking since it is otherwise impossible to add more parsers to the system without forking it. But dynamic linking would compromise the current single-executable nature of Zola (not something that I'm bothered about, but I understand it is a design goal). |
I can't build your fork for some reason on rustc 1.64 or nightly. I'll have a deeper look when I get more time. Can you tell me how big is the generated Zola binary?
That's surprising. Sounds like something being instantiated too often? I'm expecting the tree-sitter parsers themselves to be faster than tons of regexes from syntect.
It's the issue yes. If the generated parsers size was manageable, we could just add everything to the library. Of course that's not going to work for home-made languages but hey... |
With shiki you get all the syntaxes/themes from VSCode for free, which is the main draw. Otherwise a port of something like pygments, prism, highlight.js would work but it's less interesting. |
Shiki definitely looks like the best option if an effort to port it will happen! |
There's some very promising work on improving tree-sitter start time: tree-sitter/tree-sitter#2374 |
I took a look at Shiki to determine how much work a Rust port would be. It doesn't seem too hard, but there's one hitch: TM grammars use Oniguruma regex, and there's no Rust port of that either, just FFI bindings. Porting that would be much more difficult than porting Shiki, since Oniguruma is 85,000 lines of C vs Shiki's 7,000 lines of TS. The FFI bindings could work, but only if @Keats is okay with having Oniguruma be a build-time dependency statically linked into zola. The above is all moot of course if someone can show the Oniguruma regex syntax to be close enough to |
That's what we do with syntect already, through https://crates.io/crates/onig |
My bad, I didn't see it mentioned in the guide for installing from source. An easy starting point for porting shiki seems to be handling TM grammars. I couldn't find a Rust implementation of a TM grammar deserializer, so I started one here: https://github.com/mwcz/textmate-grammar-rs I could use some help finishing it up. Or, if there is a crate out there that I missed, please correct me. 😅 |
There's no textmate parser in Rust afaik, I had a look before :/ I did something slightly similar with a WIP pygments parser but didn't get very far. In a world where loading tree-sitter is fast (< 50ms) and could be improved (eg a Zola user could list the languages they use in Config.toml so we only load those), which one would we prefer between tree-sitter and shiki? Advantages of tree-sitter:
Cons of tree-sitter:
Pros of shiki-like:
Cons of shiki-like:
|
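The idea above of loading only the languages a site lists in its config could be sketched as follows. Everything here is hypothetical: `Grammar` is a stand-in for whatever a loaded tree-sitter parser and its queries would be, and `registry` and `load_configured` are illustrative names, not Zola or tree-sitter APIs.

```rust
use std::collections::BTreeMap;

/// Stand-in for a loaded grammar; a real one would hold a parser and
/// its highlight queries.
#[derive(Debug, PartialEq)]
struct Grammar {
    name: &'static str,
}

/// A loader is only called for languages that were actually requested.
type Loader = fn() -> Grammar;

fn load_rust() -> Grammar {
    Grammar { name: "rust" }
}

fn load_toml() -> Grammar {
    Grammar { name: "toml" }
}

/// All grammars compiled into the binary, keyed by language name.
fn registry() -> BTreeMap<&'static str, Loader> {
    let mut map: BTreeMap<&'static str, Loader> = BTreeMap::new();
    map.insert("rust", load_rust);
    map.insert("toml", load_toml);
    map
}

/// Load only the languages listed in the site config. Returns the loaded
/// grammars and the requested names that have no grammar available.
fn load_configured(
    registry: &BTreeMap<&'static str, Loader>,
    wanted: &[&str],
) -> (Vec<Grammar>, Vec<String>) {
    let mut loaded = Vec::new();
    let mut unknown = Vec::new();
    for &name in wanted {
        match registry.get(name) {
            Some(loader) => loaded.push(loader()),
            None => unknown.push(name.to_string()),
        }
    }
    (loaded, unknown)
}
```

Returning the unknown names separately lets the caller decide whether a missing grammar is a build error or just falls back to unhighlighted output. This only addresses start-up cost, not binary size, since every grammar is still compiled in.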
I am curious how long it would take to port shiki to Rust... days? weeks? months? A Rust port of shiki could be a really fun project if it's not too large a project to get something up and running in a reasonable amount of time. (a couple hours to implement tree-sitter in Zola is certainly fast!) |
The TextMate grammar parser is about all I have time for, but I can continue to improve it if someone else (@Jieiku??? 😁) is interested in doing the rest of Shiki. I really don't want to clutter this tree-sitter issue with updates about TM grammars, so here's a last update, unless things do start moving strongly towards a shiki port. textmate-grammar-rs update: The TM grammar description is extremely loose and thus raises a lot of questions when implementing a parser. I chose to only capture fields that are defined in the official TM grammar description, which leaves many many non-standard fields uncaptured. Still, I got it to a point where it can parse a good chunk of the grammars included with Shiki. 3 of the failures are minor and can be fixed with a little more serde tweaking. The remaining 38 failures are from grammars with regex patterns incompatible with Oniguruma (eg,
At Sourcegraph, we've switched from syntect to tree-sitter for major languages because of performance. I did some benchmarking in Dec 2021, here's the performance report. We haven't been able to drop syntect because of the long tail of languages not supported by tree-sitter. Some differences between the Sourcegraph and Zola use cases:
For small snippets, the highlighting performance probably doesn't matter as much, and syntect's typical speed of about 50k SLOC/s per core should be good enough. That said some grammars like Scala and C# were, depending on the code, about an order of magnitude slower, and we'd not infrequently hit 10s timeouts. |
Thanks for the perf report! |
tree-sitter/tree-sitter#2594 (comment) so it should come eventually! |
For me personally, I'd rather have an (even significant) performance hit, but better syntax highlighting. There is also the option to implement both, make syntect the default (for performance), and treesitter an optional alternative through a config option. Current syntax highlighting is just a bit disappointing in most cases I have used so far. |
If anyone wants an (admittedly jank) solution for the time being, here's a .zip of the files I'm using for Rust currently - the sublime-syntax file is based off of Rust Enhanced, styled to look like One Dark in VSCode. The modifications aren't pretty, but it does the job. Below is an example screenshot from my website: |
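The "implement both, default to syntect" suggestion above could look roughly like this on the config side. This is a sketch only: the `HighlightBackend` enum, the accepted strings and `backend_from_config` are hypothetical and do not correspond to an existing Zola option.

```rust
use std::str::FromStr;

/// Which highlighter to use. Syntect stays the default so existing sites
/// keep their current behaviour and build speed.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
enum HighlightBackend {
    #[default]
    Syntect,
    TreeSitter,
}

impl FromStr for HighlightBackend {
    type Err = String;

    /// Parse the value of a (hypothetical) config key, ignoring case
    /// and surrounding whitespace.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.trim().to_ascii_lowercase().as_str() {
            "syntect" => Ok(Self::Syntect),
            "tree-sitter" | "treesitter" => Ok(Self::TreeSitter),
            other => Err(format!("unknown highlight backend: {other}")),
        }
    }
}

/// Resolve the backend from an optional config value; a missing value
/// falls back to the default.
fn backend_from_config(value: Option<&str>) -> Result<HighlightBackend, String> {
    match value {
        Some(v) => v.parse(),
        None => Ok(HighlightBackend::default()),
    }
}
```

An opt-in switch like this would not remove the binary-size cost of bundling parsers unless the tree-sitter side were also behind a compile-time feature.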
I was thinking of maybe looking into this considering how a lot more programs are migrating to using tree-sitter, and at least from what I see, the existing sublime packages are slowly growing out of date due to the fact that very few people use Sublime any more. Like, the thing that really pushed me to feel this way was the fact that the Java syntax just cannot recognise multiline comments formatted like this:
Instead of this:
And it just, completely breaks the syntax entirely if you have comments formatted this way. On the other hand, tree-sitter feels a lot more robust, although I do have some other apprehensions about the way it's built. I might poke around and see what a minimal version of |
I've already built the minimal version with tree-sitter, it's easy. The issues are on tree-sitter side:
Honestly I wish someone did a port of https://github.com/microsoft/vscode-textmate to Rust just to tap into the VSCode ecosystem. It's not ideal but there are still going to be more TextMate grammars than tree-sitter for the foreseeable future. |
Why not implement it, and let the user optionally pick that one? Speed might not be an issue for some people, and it's gonna get better in the future. |
I mean, it seems pretty clear why to not implement it for now; 100MB extra binary size for a program meant to be in one binary is a hard sell. |
Fair, though I personally wouldn't mind that at all. |
Look at tree-sitter/tree-sitter#1799, just the SQL parser got to 89MB. |
Me neither. |
Has anyone used it? The last time I looked at tree-sitter it didn't have many grammars but a quick look shows it's getting better. Our syntect syntaxes are stuck on old versions of the grammars because of new features in the Sublime grammar format not supported by Syntect.
See https://github.com/nvim-treesitter/nvim-treesitter#supported-languages for a list of supported languages.
An alternative would be a basic textmate highlighter using VSCode syntaxes/themes since that's what everyone seems to be using these days.