Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DRAFT: PoC implementing Wkt parser using Pidgin Parser Combinator lib #121

Draft
wants to merge 8 commits into
base: develop
Choose a base branch
from

Conversation

driekus77
Copy link
Contributor

@driekus77 driekus77 commented Jul 14, 2024

The corresponding Unit tests and beneath comments show how to use this parser.

I'm basing the parser on this document: Chapter 9 - Wkt 1.2.1
(Thanks to @airbreather)
Plus extras to get the Unit tests running. (E.g. ToWgs84, FittedCS and more are not in this spec)

Features:

  • Own WktObject tree with Traversal support
  • WktParser together with Pidgin library doing the Parsing
  • WktTextReader tied coupling with the WktParser to read from TextReader (StringReader in UnitTests)
  • ToString with IWktOutputFormatter support
  • WktToProjConverter converting the wkt tree to Proj components.
  • Leaving the existing Api be. Only adding new code.
  • ...

@inforithmics
Copy link

inforithmics commented Jul 14, 2024

I like the general idea of this pull request. I saw that pidgin library has codegen. So is this only an development dependency and at runtime pigdin isn‘t needed?

@driekus77
Copy link
Contributor Author

@inforithmics Don't know nothing about codegen. I'm using an older version of Pidgin (Version 2.5.0) that still supports netstandard 2.0.

This older Pidgin version assembly is needed runtime AFAIK.

@airbreather
Copy link
Member

This older Pidgin version assembly is needed runtime AFAIK.

I've looked this up a little bit.

Without doing something unreasonable like making it possible to bundle (up to) the entire runtime library as a source generator, parser combinators like Pidgin should theoretically always require a runtime dependency.

Parser generators like ANTLR already generate source code for their target language, so they don't technically need a runtime library, though I can't imagine (or find) a practically useful one that doesn't.


Am I using the right specification document?

That depends.

  • The term "WKT 1", to me, means the standard that's published as part of OGC SFA which is here: https://www.ogc.org/standard/sfa (latest version 1.2.1 is document 06-103r4 and contains this specification in section 9). Theoretically, that is the standard that this project is already supposed to implement.
  • The document you linked is the first version of a different standard called WKT-CRS, whose Wikipedia page claims that it is sometimes called "WKT 2" (and also agrees with me what "WKT 1" is).
  • https://www.ogc.org/standard/wkt-crs — which contains your link — also contains two revisions since then.

Is this a workable solution for this library?

I think so.

My biggest concern is that we must use an older version of the library in order to support netstandard2.0. This means that we should not expect any further improvements in the future, including bug fixes or performance improvements. "WKT-CRS 2" was adopted in 2018. It's reasonable to imagine that there will be a further revision at some point, so it's possible that this older version of Pidgin may be seen as a significant limitation if that happens (if not already when moving to "WKT-CRS 2").

With that said, I don't think that this concern is a dealbreaker. It really isn't reasonable in 2024 to implement parsers the hard way like this project did at first, and Pidgin 2.5.0 looks like one of the least intrusive ways to do that, so I think it's a good balance. I would have PERSONALLY chosen ANTLR if I were still contributing regularly (because of my personal familiarity, because the latest versions of the runtime on both versions still support netstandard2.0, and because someone's already written the grammar for it) but I'm not, so if this is something that someone else can take past the finish line, then I can't complain.

@inforithmics
Copy link

Runtime dependency is okay and I had a look a the project website so it seems well maintained.

@driekus77
Copy link
Contributor Author

This older Pidgin version assembly is needed runtime AFAIK.

I've looked this up a little bit.

Without doing something unreasonable like making it possible to bundle (up to) the entire runtime library as a source generator, parser combinators like Pidgin should theoretically always require a runtime dependency.

Parser generators like ANTLR already generate source code for their target language, so they don't technically need a runtime library, though I can't imagine (or find) a practically useful one that doesn't.

Am I using the right specification document?

That depends.

  • The term "WKT 1", to me, means the standard that's published as part of OGC SFA which is here: https://www.ogc.org/standard/sfa (latest version 1.2.1 is document 06-103r4 and contains this specification in section 9). Theoretically, that is the standard that this project is already supposed to implement.
  • The document you linked is the first version of a different standard called WKT-CRS, whose Wikipedia page claims that it is sometimes called "WKT 2" (and also agrees with me what "WKT 1" is).
  • https://www.ogc.org/standard/wkt-crs — which contains your link — also contains two revisions since then.

Is this a workable solution for this library?

I think so.

My biggest concern is that we must use an older version of the library in order to support netstandard2.0. This means that we should not expect any further improvements in the future, including bug fixes or performance improvements. "WKT-CRS 2" was adopted in 2018. It's reasonable to imagine that there will be a further revision at some point, so it's possible that this older version of Pidgin may be seen as a significant limitation if that happens (if not already when moving to "WKT-CRS 2").

With that said, I don't think that this concern is a dealbreaker. It really isn't reasonable in 2024 to implement parsers the hard way like this project did at first, and Pidgin 2.5.0 looks like one of the least intrusive ways to do that, so I think it's a good balance. I would have PERSONALLY chosen ANTLR if I were still contributing regularly (because of my personal familiarity, because the latest versions of the runtime on both versions still support netstandard2.0, and because someone's already written the grammar for it) but I'm not, so if this is something that someone else can take past the finish line, then I can't complain.

Thanks for the clarification concerning the specification. I'm gone switch now to the right one ;-) Hopefully I can reuse some parts.

I looked briefly at ANTLR and have some concerns about the Java part of the generator. And it looks that runtime you also are going to need some assemblies. The grammar is different than (E)BNF which is another thing to get my head around.

Parer combinator Lib is sort of in between Regular expressions and a full blown Parser generator like ANTLR. Hopefully we don't need the full blown....but I could be wrong.

@inforithmics
Copy link

inforithmics commented Jul 14, 2024

What I especially like about pidgin is that it is AOT compatible which is important for mobile and Wasm solutions.

It's a later version, maybe this can be supported by using multitargeting, but this is an improvement for later.
https://github.com/benjamin-hodgson/Pidgin/releases/tag/v3.2.2

- WktTextReader with underlying WktParser now working
- Followed spec WKT 1 and looked at existing Unit Tests.
- WKT now has its own AST for multiple reasons.
- Still need to integrate the new WKT AST with rest of PROJ4Net.
@driekus77
Copy link
Contributor Author

driekus77 commented Jul 19, 2024

  • WktTextReader with underlying WktParser now working
  • Followed spec WKT 1 and looked at existing Unit Tests.
  • WKT now has its own AST for multiple reasons:
    • Using WktObject classes as builder(s) for the Proj4net complex class with constructors with lots of params.
    • Extra Semantic Checks will become easier to handle.
    • Transformation from one to other e.g. wkt 1 vs wkt2, would be easier to handle.
  • TODO: Still need to integrate the new WKT AST with rest of PROJ4Net.

Feedback is welcome!

- Discovered Pidgin Parsers Map(...) function which makes WktParser a lot more readable!
- Made some performance improvements for ParseAllWKTs taken 4 seconds down to under 1 second.
- Cleaned some obsolete code.
WktToProjConverter is an implementation example of this interface.

ParseAllWKTs unittest succeeds.
@driekus77
Copy link
Contributor Author

IWktObject now has a Traverse method that accepts a IWktTraverseHandler.

WktToProjConverter is now a working example of such a TraverseHandler.

Code to read, parse and convert now looks like:

            // read
            using var sr01 = new StringReader(wkt.Wkt);
            using var wktReader01 = new WktTextReader(sr01);

            // parse
            var result01 = wktReader01.ReadToEnd();

            // convert
            var converter01 = new WktToProjConverter();
            converter01.Convert(result01.Value);

TODO: Wkt output handing, semantic checks, and replacing old wkt parser...?

Again feedback is welcome!

Also reparsing in unit test ParseAllWKTs from SRID csv => Works!
@driekus77
Copy link
Contributor Author

The Output formatting is now also working:

First Text from SRID CSV:

PROJCS["Anguilla 1957 / British West Indies Grid",GEOGCS["Anguilla 1957",DATUM["Anguilla_1957",SPHEROID["Clarke 1880 (RGS)",6378249.145,293.465,AUTHORITY["EPSG","7012"]],AUTHORITY["EPSG","6600"]],PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AUTHORITY["EPSG","4600"]],PROJECTION["Transverse_Mercator"],PARAMETER["latitude_of_origin",0],PARAMETER["central_meridian",-62],PARAMETER["scale_factor",0.9995],PARAMETER["false_easting",400000],PARAMETER["false_northing",0],UNIT["metre",1,AUTHORITY["EPSG","9001"]],AXIS["Easting",EAST],AXIS["Northing",NORTH],AUTHORITY["EPSG","2000"]]

Can now be formatted to:

PROJCS("Anguilla 1957 / British West Indies Grid",
 	GEOGCS("Anguilla 1957",
 		DATUM("Anguilla_1957",
 			SPHEROID("Clarke 1880 (RGS)", 6378249.145, 293.465, AUTHORITY("EPSG", "7012")),
			AUTHORITY("EPSG", "6600")),
 		PRIMEM("Greenwich", 0, AUTHORITY("EPSG", "8901")),
 		UNIT("degree", 0.0174532925199433, AUTHORITY("EPSG", "9122")),
		AUTHORITY("EPSG", "4600")),
 	PROJECTION("Transverse_Mercator"),
 	PARAMETER("latitude_of_origin", 0),
 	PARAMETER("central_meridian", -62),
 	PARAMETER("scale_factor", 0.9995),
 	PARAMETER("false_easting", 400000),
 	PARAMETER("false_northing", 0),
 	UNIT("metre", 1, AUTHORITY("EPSG", "9001")),
 	AXIS("Easting", East),
 	AXIS("Northing", North),
	AUTHORITY("EPSG", "2000"))

DefaultWktOutputFormatter with default constructor settings leaves the output compact on one line.
Above can be done using following C#:

            // Parse in from CSV...
            using var sr = new StringReader(wkt.Wkt);
            using var reader = new WktTextReader(sr);
            var result01 = reader.ReadToEnd();

            var cs01 = result01.Value;

            // Generate WKT string from WktObject...
            var formatter = new DefaultWktOutputFormatter(
                newline: Environment.NewLine,
                leftDelimiter: '(', rightDelimiter: ')',
                indent: "\t", extraWhitespace: " ");

            string wkt01 = cs01.ToString(formatter);

Again feed back is welcome....

- Fixed multiple errors
- Fixed support for FittedCoordinateSystem
- All unit tests except one now working. The one needs partial parsing which is not supported yet.
@driekus77 driekus77 changed the title [WIP|Draft|Spike] First try using Pidgin Parser lib for WKT1Parser. Replace WktParser by new Parser Combinator: Pidgin. Jul 21, 2024
@driekus77
Copy link
Contributor Author

I'm now ready with this WktParser 1.0 replacement. All unit tests, except one, run fine using the new WktTextReader/WktParser under CreateFromWkt.

When this work is coming through I will look into Wkt 2.0 support.

Again feedback and/or reviews are welcome!

@driekus77
Copy link
Contributor Author

I sleept a night over it and I'm not happy with a few things:

  • Full parsing from Wkt text to Proj4net CoordinateSystem now takes almost double the amount of time (ParseAllWKTs unit test)
  • The (creation of) the dedicated WktObject tree isn't really needed for Proj4Net lib itself.

So I'm thinking of adding a dedicated ProjNet.IO.WKT project to this PR with the new WKT stuff. The WktParser will than write to an IWktBuilder interface with its own implementation in Proj4Net: WktToProjTextReader.

Again feedback and Idea's are welcome!

@inforithmics
Copy link

inforithmics commented Jul 23, 2024

Concerning the Performance: I think there a lot of Performance improvements in newer Pidgin Versions. Because of AOT Compatibility I think MultiTargeting needs to be added so in essance something like this should be added in the csproj

<TargetFramework>netstandard2.0</TargetFramework> => <TargetFrameworks>netstandard2.0;net6.0;net8.0</TargetFrameworks>

<ItemGroup Condition="'$(TargetFramework)' == 'net8.0' Or '$(TargetFramework)' == 'net6.0'">
     <PackageReference Include="Pidgin" Version="3.2.3" />
</ItemGroup>

<ItemGroup Condition="'$(TargetFramework)' == 'netstandard2.0'">
     <PackageReference Include="Pidgin" Version="2.5.0" />
</ItemGroup>

So this means for .netstandard2.0 it is using pidgin 2.5 and for .Net 8 it is using pidgin 3.2.3.

The Performance Tests should be done on net8.0 Framework project.

@inforithmics
Copy link

Another Topic: Have tried parsing these https://github.com/maptiler/epsg.io/tree/master/extra_codes_proj4_4.8.0.2

@inforithmics
Copy link

I like the idea of having a new and old/legacy implementation of the wktparser so if something doesn‘t work the old implementation can be used If it is in the same project or in a different one I don‘t have a strong opinion on that. But normally it is good practice to have its own nuget package per project.

@driekus77
Copy link
Contributor Author

Concerning the Performance: I think there a lot of Performance improvements in newer Pidgin Versions. Because of AOT Compatibility I think MultiTargeting needs to be added so in essance something like this should be added in the csproj

<TargetFramework>netstandard2.0</TargetFramework> => <TargetFrameworks>netstandard2.0;net6.0;net8.0</TargetFrameworks>

<ItemGroup Condition="'$(TargetFramework)' == 'net8.0' Or '$(TargetFramework)' == 'net6.0'">
     <PackageReference Include="Pidgin" Version="3.2.3" />
</ItemGroup>

<ItemGroup Condition="'$(TargetFramework)' == 'netstandard2.0'">
     <PackageReference Include="Pidgin" Version="2.5.0" />
</ItemGroup>

So this means for .netstandard2.0 it is using pidgin 2.5 and for .Net 8 it is using pidgin 3.2.3.

The Performance Tests should be done on net8.0 Framework project.

Thanks for your feedback and help!

When I'm done with the basic design of the new WktParser I will look into this.

@driekus77
Copy link
Contributor Author

driekus77 commented Jul 23, 2024

Another Topic: Have tried parsing these https://github.com/maptiler/epsg.io/tree/master/extra_codes_proj4_4.8.0.2

Is this something the current (old) parser also supports? I noticed the Wkt Extension having the same proj content. But this content is not supported in the new parser...

@driekus77
Copy link
Contributor Author

I like the idea of having a new and old/legacy implementation of the wktparser so if something doesn‘t work the old implementation can be used If it is in the same project or in a different one I don‘t have a strong opinion on that. But normally it is good practice to have its own nuget package per project.

Yes thats also something that I had in my head. Maybe leaving the implementation of CreateFromWKT alone and only create a new WktToProjTextReader using the new WktParser.

@airbreather
Copy link
Member

airbreather commented Jul 23, 2024

I know that you're primarily looking for feedback on the more substantive parts of this PR; I don't have the capacity to review it very much right now, but I did start to see some e-mail notifications for comments in this thread that I wanted to address, so that's what pulled me back here.


Concerning the Performance: I think there a lot of Performance improvements in newer Pidgin Versions. Because of AOT Compatibility I think MultiTargeting needs to be added so in essance something like this should be added in the csproj

If we can get away with it, I'd prefer not to. Is there any breaking change in higher versions of the Pidgin library that would stop this package from working as-is if a net5.0-or-higher downstream consumer directly referenced a newer version of the Pidgin package themselves?

  • If there is no breaking change, then we shouldn't need to multitarget. Downstream can restore a higher version of the package and get those performance enhancements, even if we only reference the lower version.
  • If there is a breaking change, and we would need to change the shape of our public API in order to accommodate, then we should not multitarget, and we should instead strongly declare an upper-bound on our package reference to force NuGet to throw an error if someone tries to directly reference a newer version.
  • If there is a breaking change, but we can accommodate it with changes to internal APIs and method bodies, then multitargeting is probably the best solution, for reasons that haven't even been brought up in this thread.

Because there seems to be revived interest in using multitargeting in order to benefit from the newer version of Pidgin, I'd like to take a moment to respond directly to the following points:

I looked briefly at ANTLR and have some concerns about the Java part of the generator.

In the .NET world, there are two viable ways to use an ANTLR parser generator: the official ANTLR v4 project which does require a JRE at build-time (not at runtime), and Sam Harwell's antlr4cs project which has a Java-free mode.

I must concede that I would prefer using the official ANTLR v4 project these days, which does require Java. I'm assuming that your concerns are similar to mine: it sucks to require a JRE to be installed in order to build a .NET application. I've seen worse, but it still sucks, and it can limit this project's accessibility to new contributors.

And it looks that runtime you also are going to need some assemblies.

Whereas all newer versions of Pidgin have already switched to targeting only net5.0 and higher, the latest versions of Antlr4.Runtime.Standard support netstandard2.0 targets, and I expect that to continue for a very long time. In fact, it still has a net45 target, even though all supported versions of the Windows operating system have AT LEAST .NET Framework 4.6 pre-installed.

The grammar is different than (E)BNF which is another thing to get my head around.

I see two things here:

  1. While ANTLR's syntax for describing grammars is not compatible with ISO EBNF, I do note that sequences such as this in Pidgin are quite a bit further away than ANTLR's syntax. At a glance, if you're more familiar with ISO EBNF or ABNF, then this table looks like it covers pretty much all the differences that would commonly come up.
  2. Specifically regarding WKT-CRS v1, an ANTLR grammar has already been written, BSD-licensed. I haven't evaluated the correctness of this grammar myself, but at a glance it looks fine, and in the worst-case it seems like a solid starting point.

I want to re-emphasize that if you are familiar with Pidgin, then it's probably not worth redoing everything just to use my favorite tool. Again, I wasn't going to respond to these points until multitargeting came up as a possible consequence of using Pidgin (which is the context where I have brought up ANTLR before), since that's when it starts getting relevant.

@driekus77
Copy link
Contributor Author

driekus77 commented Jul 23, 2024

I know that you're primarily looking for feedback on the more substantive parts of this PR; I don't have the capacity to review it very much right now, but I did start to see some e-mail notifications for comments in this thread that I wanted to address, so that's what pulled me back here.

Concerning the Performance: I think there a lot of Performance improvements in newer Pidgin Versions. Because of AOT Compatibility I think MultiTargeting needs to be added so in essance something like this should be added in the csproj

If we can get away with it, I'd prefer not to. Is there any breaking change in higher versions of the Pidgin library that would stop this package from working as-is if a net5.0-or-higher downstream consumer directly referenced a newer version of the Pidgin package themselves?

  • If there is no breaking change, then we shouldn't need to multitarget. Downstream can restore a higher version of the package and get those performance enhancements, even if we only reference the lower version.
  • If there is a breaking change, and we would need to change the shape of our public API in order to accommodate, then we should not multitarget, and we should instead strongly declare an upper-bound on our package reference to force NuGet to throw an error if someone tries to directly reference a newer version.
  • If there is a breaking change, but we can accommodate it with changes to internal APIs and method bodies, then multitargeting is probably the best solution, for reasons that haven't even been brought up in this thread.

Because there seems to be revived interest in using multitargeting in order to benefit from the newer version of Pidgin, I'd like to take a moment to respond directly to the following points:

I looked briefly at ANTLR and have some concerns about the Java part of the generator.

In the .NET world, there are two viable ways to use an ANTLR parser generator: the official ANTLR v4 project which does require a JRE at build-time (not at runtime), and Sam Harwell's antlr4cs project which has a Java-free mode.

I must concede that I would prefer using the official ANTLR v4 project these days, which does require Java. I'm assuming that your concerns are similar to mine: it sucks to require a JRE to be installed in order to build a .NET application. I've seen worse, but it still sucks, and it can limit this project's accessibility to new contributors.

And it looks that runtime you also are going to need some assemblies.

Whereas all newer versions of Pidgin have already switched to targeting only net5.0 and higher, the latest versions of Antlr4.Runtime.Standard support netstandard2.0 targets, and I expect that to continue for a very long time. In fact, it still has a net45 target, even though all supported versions of the Windows operating system have AT LEAST .NET Framework 4.6 pre-installed.

The grammar is different than (E)BNF which is another thing to get my head around.

I see two things here:

  1. While ANTLR's syntax for describing grammars is not compatible with ISO EBNF, I do note that sequences such as this in Pidgin are quite a bit further away than ANTLR's syntax. At a glance, if you're more familiar with ISO EBNF or ABNF, then this table looks like it covers pretty much all the differences that would commonly come up.
  2. Specifically regarding WKT-CRS v1, an ANTLR grammar has already been written, BSD-licensed. I haven't evaluated the correctness of this grammar myself, but at a glance it looks fine, and in the worst-case it seems like a solid starting point.

I want to re-emphasize that if you are familiar with Pidgin, then it's probably not worth redoing everything just to use my favorite tool. Again, I wasn't going to respond to these points until multitargeting came up as a possible consequence of using Pidgin (which is the context where I have brought up ANTLR before), since that's when it starts getting relevant.

Thanks @airbreather for your feedback!

You are totaly right: Lets get the right tool for the job and not the other way arround!

For me this PR was and is an exercise to proof that a parser combinator lib like Pidgin CAN be used to implement WKT. And for now it looks like it can.
A step back: There are also other heavily used C# Parser combinator libs arround that still support netstandard2.0: https://www.nuget.org/packages/Superpower/3.0.0 I can have a look at this one if you like?

About ANTLR: I'm still not convinced its the right tool for this job. Using the JRE is one thing but its so totally different than Parser Combinators which, indeed, I like very much. My reason for choosing Parser combinator libs above something like ANTLR is because you are already working in the target programming language where the parsing happends.

If Superpower lib, or any other lib, is an option let me know. Else I will look into ANTLR when I have some time.

EDIT:
The reason I looked at Pidgin is:

  • Probably faster than Superpower
  • Has the possiblity to parse binaries (e.g. Wkb...?)

@airbreather
Copy link
Member

airbreather commented Jul 23, 2024

I know that you're primarily looking for feedback on the more substantive parts of this PR; I don't have the capacity to review it very much right now

Looking at the description again, I didn't realize that this PR had been so substantially modified since my first comment on this thread. Is it fair to say that this PR now adds a second implementation of the same specification that the project has already (ostensibly) implemented for a very long time? If so, why? I would at least want to see it linked to one specific problem that it solves, so that I can feel more OK about the way that this is going.

Even if such problem(s) on their own probably wouldn't justify such a major rewrite, I can actually live with it because I expect that using an actual parsing library will be easier for normal people to maintain than everything in this directory, and so it would be a huge benefit to be able to say "see what happens when we stop trying to implement complicated parsing on our own? these issues automatically go away".

But if we can't even point to a single issue that it solves, then it should probably not go forward as currently designed. Adding a second way to achieve the exact same goal means more code that must be maintained (even if we deprecate one of the two), and it becomes much more difficult to justify adding a new dependency on an older version of a library that someone might incidentally want to use a newer version of: it turns the question of breaking changes from my above comment into a new concern that is not counterbalanced by anything beneficial.

This isn't all bad news, I think: IIRC, the existing WKT parser did have some sort of issues, though I can't remember what they were right now, so maybe they were resolved... worst-case, even if we can't justify a second implementation of the same spec, I fully support implementing WKT-CRS using something more standard.


A step back: There are also other heavily used C# Parser combinator libs arround that still support netstandard2.0: https://www.nuget.org/packages/Superpower/3.0.0 I can have a look at this one if you like?
[...]
If Superpower lib, ore any other lib, is an option let me know. Else I will look into ANTLR when I have some time.

It's Apache-2.0 licensed, and it supports netstandard2.0 targets, so I see no reason to block it.

I try to say this every time, so to make sure I say it again... as someone without much time to spend on this project (see #99), I can't justify standing in the way of improvements or additions just because they're not done in my favorite way. I can only justify blocking something because it makes something worse, or has a high risk or making something worse. So if you can make a substantial improvement to something using Superpower, then go for it.


About ANTLR: I'm still not convinced its the right tool for this job. Using the JRE is one thing but its so totally different than Parser Combinators which, indeed, I like very much. My reason for choosing Parser combinator libs above something like ANTLR is because you are already working in the target programming language where the parsing happends.

I understand that you prefer parser combinators. This is fine, and I do not think that the approach is fatally flawed or anything. They are simple to use for what they bring to the table, extremely straightforward, and they automatically benefit tremendously from IntelliSense. Compared to working with parser generators, combinator libraries like this will tend to minimize the magical black-box bits that you "just have to trust" do the right thing. And compared to ANTLR in particular, these sorts of things will be much easier to integrate: you just add a package dependency like any other, no need to punch ANTLR-shaped holes in your toolchain to squeeze it in.

That being said, I believe that this is more a matter of individual preference rather than a "should I use a screwdriver or a hammer to install a screw?" style issue. Putting my personal preferences aside, I think it's easy to say that ANTLR is a viable option based on its merits (at the very least), and it has some advantages:

  1. The C# language doesn't have good built-in syntax for clearly specifying formal grammars, unlike the BNF family which ANTLR uses. I think you agree, and I think this because right above this block is a comment describing the same thing using BNF-like syntax. Presumably, if the C# were more plainly obvious, then there wouldn't be a need for the comment, and if BNF-like syntax hadn't been so much clearly better, then I presume that the comment would read something very different.
  2. An ANTLR grammar file can (typically, by default) be used as-is to parse the same grammar into the same sorts of objects in any language that ANTLR can target. This isn't always useful by itself for a single-language project like this one, but it does mean that you can often find pre-existing grammar files for many popular languages that are ready to go, even if the grammar was written by someone who only uses Go or something. https://github.com/antlr/grammars-v4 has many, including the one that I've linked a couple of times now.
  3. ANTLR is incredibly more popular. Tool selection isn't just a popularity contest, but popularity means that you will find plenty of resources and documentation out there for ANTLR if you get stuck on some aspect or another. And if those problems are with the grammar, then you can get useful information from someone who might not even use C#.
    • On the NuGet gallery alone, the "Antlr" package has more than 3x as many lifetime total downloads as Superpower, Pidgin, and Parlot combined.
    • This is the older package whose last release was in 2013, but even just filtering to the last six weeks, it's still got almost 2x as many downloads as those three combined.
    • If I want to only look at the last six weeks of downloads of Antlr4.Runtime.Standard to see who's actively using the absolute latest version of ANTLR in .NET, it's still above Pidgin and Superpower, though Parlot by itself does have slightly more downloads...
    • ...but this is the .NET side alone, with all the aforementioned toolchain difficulties. In other words, ANTLR is so popular that even though it's a headache to get it working in .NET, the community at large still generally chooses it over these other solutions which only work with (and are therefore specifically designed for) .NET.

I'm definitely spending too much time on these comments. The short version is that I think ANTLR is absolutely viable and has some advantages and disadvantages relative to any parser combinator library, and I also think that parser combinator libraries are perfectly viable as well, so the main decision point should be to use what you're most comfortable with.


[Pidgin] Has the possiblity to parse binaries (e.g. Wkb...?)

I expect that to be irrelevant for this project.

@driekus77
Copy link
Contributor Author

I know that you're primarily looking for feedback on the more substantive parts of this PR; I don't have the capacity to review it very much right now

Looking at the description again, I didn't realize that this PR had been so substantially modified since my first comment on this thread. Is it fair to say that this PR now adds a second implementation of the same specification that the project has already (ostensibly) implemented for a very long time? If so, why? I would at least want to see it linked to one specific problem that it solves, so that I can feel more OK about the way that this is going.

Yes the PR description is maybe bad. I can convert it to a draft and leave it lying arround as a Pidgin PoC?
My personal issues with this lib/project are:

  • It needs an update in code and more understandable (at least for me)
  • I needed to know more about CoordinateSystems Meridians etc and found this lib :-)
  • This project needed help and parsing is the way in which I can help (I think....:-)
  • Some issues pointed out that WKT 2.+ support is not there yet: That was my reason to do a rewrite. So I can build further on an implementation that I fully understand.

This PR became indeed to much work and should probably seen as a proof of concept. The whole Wkt(Base)Object tree is nice for separation of concerns but too much for this library which already has its components.

Even if such problem(s) on their own probably wouldn't justify such a major rewrite, I can actually live with it because I expect that using an actual parsing library will be easier for normal people to maintain than everything in this directory, and so it would be a huge benefit to be able to say "see what happens when we stop trying to implement complicated parsing on our own? these issues automatically go away".

But if we can't even point to a single issue that it solves, then it should probably not go forward as currently designed. Adding a second way to achieve the exact same goal means more code that must be maintained (even if we deprecate one of the two), and it becomes much more difficult to justify adding a new dependency on an older version of a library that someone might incidentally want to use a newer version of: it turns the question of breaking changes from my above comment into a new concern that is not counterbalanced by anything beneficial.

Yes you are right about pointing out a problem to solve: They are only developer wise and maybe not directly for users of this lib. But improving code quality and better support of WKT 2+ can become a indirect plus feature and not a problem solver.

This isn't all bad news, I think: IIRC, the existing WKT parser did have some sort of issues, though I can't remember what they were right now, so maybe they were resolved... worst-case, even if we can't justify a second implementation of the same spec, I fully support implementing WKT-CRS using something more standard.

To me two implementations are only temporarly: The best implementations wins and remains.

A step back: There are also other heavily used C# Parser combinator libs arround that still support netstandard2.0: https://www.nuget.org/packages/Superpower/3.0.0 I can have a look at this one if you like?
[...]
If Superpower lib, ore any other lib, is an option let me know. Else I will look into ANTLR when I have some time.

It's Apache-2.0 licensed, and it supports netstandard2.0 targets, so I see no reason to block it.

I try to say this every time, so to make sure I say it again... as someone without much time to spend on this project (see #99), I can't justify standing in the way of improvements or additions just because they're not done in my favorite way. I can only justify blocking something because it makes something worse, or has a high risk or making something worse. So if you can make a substantial improvement to something using Superpower, then go for it.

I am hesitating to jump onto Superpower right away. Some Pidgin PoC work todo before.

About ANTLR: I'm still not convinced its the right tool for this job. Using the JRE is one thing but its so totally different than Parser Combinators which, indeed, I like very much. My reason for choosing Parser combinator libs above something like ANTLR is because you are already working in the target programming language where the parsing happends.

I understand that you prefer parser combinators. This is fine, and I do not think that the approach is fatally flawed or anything. They are simple to use for what they bring to the table, extremely straightforward, and they automatically benefit tremendously from IntelliSense. Compared to working with parser generators, combinator libraries like this will tend to minimize the magical black-box bits that you "just have to trust" do the right thing. And compared to ANTLR in particular, these sorts of things will be much easier to integrate: you just add a package dependency like any other, no need to punch ANTLR-shaped holes in your toolchain to squeeze it in.

That being said, I believe that this is more a matter of individual preference rather than a "should I use a screwdriver or a hammer to install a screw?" style issue. Putting my personal preferences aside, I think it's easy to say that ANTLR is a viable option based on its merits (at the very least), and it has some advantages:

  1. The C# language doesn't have good built-in syntax for clearly specifying formal grammars, unlike the BNF family which ANTLR uses. I think you agree, and I think this because right above this block is a comment describing the same thing using BNF-like syntax. Presumably, if the C# were more plainly obvious, then there wouldn't be a need for the comment, and if BNF-like syntax hadn't been so much clearly better, then I presume that the comment would read something very different.

  2. An ANTLR grammar file can (typically, by default) be used as-is to parse the same grammar into the same sorts of objects in any language that ANTLR can target. This isn't always useful by itself for a single-language project like this one, but it does mean that you can often find pre-existing grammar files for many popular languages that are ready to go, even if the grammar was written by someone who only uses Go or something. https://github.com/antlr/grammars-v4 has many, including the one that I've linked a couple of times now.

  3. ANTLR is incredibly more popular. Tool selection isn't just a popularity contest, but popularity means that you will find plenty of resources and documentation out there for ANTLR if you get stuck on some aspect or another. And if those problems are with the grammar, then you can get useful information from someone who might not even use C#.

    • On the NuGet gallery alone, the "Antlr" package has more than 3x as many lifetime total downloads as Superpower, Pidgin, and Parlot combined.
    • This is the older package whose last release was in 2013, but even just filtering to the last six weeks, it's still got almost 2x as many downloads as those three combined.
    • If I want to only look at the last six weeks of downloads of Antlr4.Runtime.Standard to see who's actively using the absolute latest version of ANTLR in .NET, it's still above Pidgin and Superpower, though Parlot by itself does have slightly more downloads...
    • ...but this is the .NET side alone, with all the aforementioned toolchain difficulties. In other words, ANTLR is so popular that even though it's a headache to get it working in .NET, the community at large still generally chooses it over these other solutions which only work with (and are therefore specifically designed for) .NET.

I'm definitely spending too much time on these comments. The short version is that I think ANTLR is absolutely viable and has some advantages and disadvantages relative to any parser combinator library, and I also think that parser combinator libraries are perfectly viable as well, so the main decision point should be to use what you're most comfortable with.

[Pidgin] Has the possiblity to parse binaries (e.g. Wkb...?)

I expect that to be irrelevant for this project.

Again I will convert this PR to a draft and rename it to a PoC.

Then will finish this PoC PR with code I still have lying arround.

After that I will look into Superpower: Should not take to much time because It looks a lot like Pidgin code => New PR.

After that I will try to dive into ANTLR in another PoC => Another PR

I'm willing todo so because I like parsing and all about it. And I'm willing to help out.

Thanks for this discussion and your valuable time!

@driekus77 driekus77 marked this pull request as draft July 23, 2024 12:08
@driekus77 driekus77 changed the title Replace WktParser by new Parser Combinator: Pidgin. DRAFT: PoC implementing Wkt parser using Pidgin Parser Combinator lib Jul 23, 2024
@driekus77
Copy link
Contributor Author

Saved the last code changes for this PoC.

Unfortunately the outcome is not satisfying:

image

So I'm going to dive into ANTLR first. Because I think Superpower won't be faster.
With a dedicated Lexer/Tokenizer Pidgin could be faster and maybe some other optimization's as wel needed. But I don't see them at this point.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants