Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

re-implement Base Unicode features using utf8proc_jll #381

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

stevengj
Copy link
Member

@stevengj stevengj commented Nov 5, 2023

Closes #377.

@stevengj
Copy link
Member Author

stevengj commented Nov 5, 2023

(It's maybe 1% faster on tests/benchmark.jl compared to main.)

Copy link

codecov bot commented Nov 5, 2023

Codecov Report

Attention: 1 lines in your changes are missing coverage. Please review.

Comparison is base (cfdac5f) 96.57% compared to head (1506668) 96.64%.
Report is 8 commits behind head on main.

❗ Current head 1506668 differs from pull request most recent head c820724. Consider uploading reports for the commit c820724 to get more accurate results

Files Patch % Lines
src/unicode.jl 98.11% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #381      +/-   ##
==========================================
+ Coverage   96.57%   96.64%   +0.06%     
==========================================
  Files          14       15       +1     
  Lines        4179     4198      +19     
==========================================
+ Hits         4036     4057      +21     
+ Misses        143      141       -2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@stevengj
Copy link
Member Author

stevengj commented Nov 5, 2023

Looks like utf8proc_jll version 2.9.0 doesn't work with Julia < 1.6 due to changes in BinaryBuilder or something?

How badly do you want to support ancient Julia versions?

@c42f
Copy link
Member

c42f commented Nov 5, 2023

How badly do you want to support ancient Julia versions

Personally? I'd prefer to keep supporting them but I don't have a super strong opinion as long as we support 1.6 and above. But tooling maintainers did want it to work on older versions I think. @davidanthoff @pfitzseb ?

If we can't do it easily, can we continue to support older versions by calling into Base.is_id_char() for now?

@stevengj
Copy link
Member Author

stevengj commented Nov 5, 2023

If we can't do it easily, can we continue to support older versions by calling into Base.is_id_char() for now?

That will silently give different results on older versions of Julia, which also seems bad?

(If you want to go that route, it would be easier to simply lower the version requirement for utf8proc_jll?)

Though I think JLL's were not supported at all before Julia 1.3 or so? I really don't see how to maintain compatibility with Julia 1.0 without keeping a separate branch around with a different Project.toml that omits the utf8proc_jll requirement.

@c42f
Copy link
Member

c42f commented Nov 6, 2023

I really don't see how to maintain compatibility with Julia 1.0 without keeping a separate branch around with a different Project.toml that omits the utf8proc_jll requirement.

Oof. Yes of course - the lack of optional dependencies in early Julia versions makes this really awkward. Hmm.

Now I'm looking at porting the necessary parts of utf8proc.c to Julia. The library doesn't look large or scary and I'm tempted to give it a go.

@pfitzseb
Copy link
Member

pfitzseb commented Nov 6, 2023

For the VS Code extension, any non-Julia dependencies are a little annoying because we're trying to vendor everything. If porting utf8proc to Julia is feasible, then that'd be ideal I think.

@stevengj
Copy link
Member Author

stevengj commented Nov 6, 2023

Now I'm looking at porting the necessary parts of utf8proc.c to Julia. The library doesn't look large or scary and I'm tempted to give it a go.

Yes, it shouldn't be too bad if you write a little script to convert the data tables in utf8proc_data.c (from C arrays to Julia arrays), something like:

utf8proc_data = read("utf8proc_data.c", String)
jl_data = replace(utf8proc_data,
    # arrays
    ",\n"=> ", ", "};"=>"]",
    r"static const [A-Za-z0-9_]+ ([A-Za-z0-9_]+)\[\] = \{\n" => s"const \1 = [ ",

   # utf8proc_property constructors
   "{" => "Property(", "}," => "),",

    # constants
    "UINT16_MAX" => typemax(UInt16))
write("utf8proc_data.jl", jl_data)

along with copying the definitions of UTF8PROC_CATEGORY_CC etcetera.

Note that Julia does not support bitfield structs, so you'd have to emulate them for re-defining utf8proc_property_t unless you want to waste a lot of memory for full-width fields.

That being said, this seems like a lot of effort (and an ongoing maintenance headache since utf8proc has to be updated every year for new Unicode versions) for very little gain. You don't really gain much from Julia's flexibility since the unicode stuff is all going to be concretely typed, and it's not clear to me that there is much to be gained (for the functions we use) by e.g. inlining or constant propagation. It's an awful lot of hassle to go to in order to avoid "a little annoyance" for vscode, and to make it easier to support julia < 1.6 versions that fewer and fewer people care about.

@pfitzseb
Copy link
Member

pfitzseb commented Nov 6, 2023

It's an awful lot of hassle to go to in order to avoid "a little annoyance" for vscode, and to make it easier to support julia < 1.6 versions that fewer and fewer people care about.

That's fair.

"A little annoyance" was me understating the issue, because we don't currently have a good way to vendor binary dependencies. That doesn't mean that we can't figure it out or that anyone should go do the work of porting utf8proc, of course.

@stevengj
Copy link
Member Author

stevengj commented Nov 6, 2023

Maybe there is a straightforward way to make JuliaSyntax optionally use the version of utf8proc that is linked into julia itself?

This would presumably help the "vendoring" (whatever that is) situation with vscode, and it would also be good for the version of JuliaSyntax that is bundled with julia (so that it doesn't load two identical versions of utf8proc into memory).

Does some Pkg expert (e.g. @vchuravy or @IanButterworth) know a good way to do this that would be friendly with the way JuliaSyntax will be bundled with Julia? Preferences.jl?

@c42f
Copy link
Member

c42f commented Nov 7, 2023

"A little annoyance" was me understating the issue

Yea I've have a feeling it could be quite a problem for you. It's troubling that porting (parts of?) utf8proc might actually be the simplest way out of this situation. It seems like taking extreme measures.

Oh, but if you're vendoring JuliaSyntax and other libs, you're already messing with their Project.toml's, yes? So in principle one can modify those to not depend on utf8proc_jll on older Julia VERSION? Or is that not true? Is the vendored stuff for the VSCode plugin shared between Julia versions?

this seems like a lot of effort (and an ongoing maintenance headache since utf8proc has to be updated every year for new Unicode versions)

Mm it seems like buying into integration, testing, bugfixing and ongoing maintenance which is a chunk of duplicated effort :( The initial porting is a lot less scary


For vendoring into Base I think we're in an ok situation - in that case JuilaSyntax isn't a stdlib, it's actually a submodule of Base. So we don't depend on the Project.toml at all. We could branch on something - at toplevel I guess - to detect that situation and use Julia's internal utf8proc in that case.

@davidanthoff
Copy link
Contributor

if you're vendoring JuliaSyntax and other libs, you're already messing with their Project.toml's

No, we are not messing with anything inside the packages that we are using in the LS, we just ship them as part of the extension, and we have a different project/manifest combo for each Julia version for the entire project, but the packages themselves are left as-is. We keep some packages on older versions on older Julia versions, but that really only works if their public API doesn't change...

Also, just for background: this is clearly a major pain point... My hope is that at some point we will get support for statically compiling the LS in Julia. At that point we ship a binary version of the LS with the extension, and we could drop support for running the LS on older Julia versions. I would still want it to be able to analyze code that targets old Julia versions, but overall that is a much, much simpler problem. So my current plan is to continue to go through all these hoops to support running on old Julia versions and hope that the static compile story materializes at some point...

@c42f
Copy link
Member

c42f commented Nov 16, 2023

Just FYI, I've not forgotten this. I'm currently working on porting parts of libutf8proc. I've decided to start with porting the data_generation.rb script there from ruby to Julia. That will familiarize me with what's going on a bit in that library. If nothing else comes of this at least I hope the libutf8proc table generation will be cleaner and not depend on both Julia and ruby :)

@KristofferC
Copy link
Member

One thing I don't understand.

  • JuliaSyntax is in the sysimage
  • This PR will hence cause utf8proc_jll to be in the sysimage.
  • The version of utf8proc_jll used will thus no longer be independently upgradable and now be tied to the Julia version
  • So how is JuliaSyntax made independent of the julia version by this PR?

@c42f
Copy link
Member

c42f commented Nov 17, 2023

So how is JuliaSyntax made independent of the julia version by this PR?

JuliaSyntax is vendored into the sysimage as part of Base, it's not a stdlib. In that context, it's certainly dependent on the version of utf8proc shipped with Julia (and it wouldn't use utf8proc_jll - it'd use the builtin library).

However, it's also separately installable as a normal package for use by tooling. In this use case it'd be independent of Julia's VERSION.

As a non-stdlib, the vendored copy's only public API is Meta.parse()

@KristofferC
Copy link
Member

and it wouldn't use utf8proc_jll - it'd use the builtin library).

Okay, that answers it.

@c42f
Copy link
Member

c42f commented Jan 17, 2024

fyi I'm still plugging away at the upstream implications of a pure-Julia unicode library...

https://github.com/c42f/UnicodeNext.jl

I think this could be generally useful though it does feel a bit marginal in terms of payoff for the effort 🤷‍♀️

Here I've put tokenization-related functionality like
`is_id_start_char()` in the Tokenize module.
@c42f
Copy link
Member

c42f commented Jan 23, 2024

Ok, I've pushed an update here which makes this PR use my experimental UnicodeNext library.

Current issues:

  • This breaks CI because that library isn't registered yet. To get this to work you need to Pkg.develop https://github.com/c42f/UnicodeNext.jl
  • When vendored into Base it might be nice to instead depend on Base.Unicode and avoid the extra dependency and sysimage cost of UnicodeNext unless(/until?) we can figure out how to make UnicodeNext the implementation of unicode in Base.

Thoughts? This would

  • Solve any issues with dependencies way back to Julia 1.0 (sigh)
  • Allow JuliaSyntax to understand the latest unicode versions

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

eliminate dependency on Base.is_id_char and julia-specific utf8proc
5 participants