Decompiler no longer works for en-US & en-GB #10708

Open
milekpl opened this issue Jul 7, 2024 · 4 comments
@milekpl
Member

milekpl commented Jul 7, 2024

The documentation at

https://dev.languagetool.org/hunspell-support

is outdated: it does not mention that the English morfologik dictionaries are now kept in a separate jar, english-pos-dict.jar (for a reason that is obscure to me, given how small these files are). Decompiling the files extracted from that jar fails as well:

An unhandled exception occurred. Stack trace below.
java.lang.IndexOutOfBoundsException
    at java.nio.Buffer.checkBounds(Unknown Source)
    at java.nio.HeapByteBuffer.put(Unknown Source)
    at morfologik.stemming.TrimSuffixEncoder.decode(TrimSuffixEncoder.java:86)
    at morfologik.stemming.DictionaryIterator.next(DictionaryIterator.java:86)
    at morfologik.stemming.DictionaryIterator.next(DictionaryIterator.java:12)
    at morfologik.tools.DictDecompile.call(DictDecompile.java:80)
    at morfologik.tools.DictDecompile.call(DictDecompile.java:20)
    at morfologik.tools.CliTool.main(CliTool.java:133)
    at morfologik.tools.DictDecompile.main(DictDecompile.java:132)
    at org.languagetool.tools.DictionaryExporter.build(DictionaryExporter.java:82)
    at org.languagetool.tools.DictionaryExporter.main(DictionaryExporter.java:59)
Done. The dictionary export has been written to en-US.txt

I did not dig any deeper, but the Polish dictionaries decompile fine. Any ideas, @jaumeortola?

@jaumeortola
Member

jaumeortola commented Jul 7, 2024

Hi @milekpl
We prefer to put dictionaries in external dependencies because, even if the files are small (most are under 1 MB, though some are larger), every time we update them we add a substantial amount of data to the git repo.

When you export spelling binary dictionaries, make sure that the path contains "hunspell" or "spelling". See:

if (inputPath.contains("hunspell") || inputPath.contains("spelling")) {
We are using that to distinguish spelling and tagger or synthesizer dictionaries.
I know that this is confusing. If we remove it, we'll need a new input parameter to specify the kind of dictionary. But we'll also need to modify all the scripts that use this class.
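
To make the effect of that check concrete, here is a minimal, self-contained sketch. The class `DictionaryKindSketch`, the enum, and the sample paths are made up for illustration; only the condition itself is taken from the line quoted above, and the real exporter may do more than this.

```java
/**
 * Hypothetical helper that mirrors the path check quoted above.
 * This is NOT the actual LanguageTool class, only an illustration.
 */
final class DictionaryKindSketch {

  /** The two cases the path check tells apart. */
  enum Kind { SPELLING, TAGGER_OR_SYNTHESIZER }

  /** Same condition as in the quoted line: a plain, case-sensitive substring match. */
  static Kind fromPath(String inputPath) {
    if (inputPath.contains("hunspell") || inputPath.contains("spelling")) {
      return Kind.SPELLING;
    }
    return Kind.TAGGER_OR_SYNTHESIZER;
  }

  public static void main(String[] args) {
    // Made-up paths: the same file is classified differently
    // depending only on the directory it was extracted to.
    String[] samples = {
      "work/hunspell/en_US.dict",
      "work/extracted/en_US.dict"
    };
    for (String path : samples) {
      System.out.println(path + " -> " + fromPath(path));
    }
  }
}
```

So a spelling dictionary extracted from a jar into an arbitrary directory falls into the tagger/synthesizer branch unless the path happens to contain one of the two substrings.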

@milekpl
Member Author

milekpl commented Jul 7, 2024

Hi @jaumeortola, thanks for the explanation. Indeed, it does work when the dictionary is stored under a hunspell directory.

Right now I have no time to work on this, but it seems to me it would be much easier just to use the existing logic of LT and require the user to provide the language code and the explicit flag -spell. Tagging and synthesis should work the same way as before. LT is able to locate its resources, so we could simply instantiate a language and get the resource path this way, so that the user won't need to decompile a jar etc. Alternatively, provide -i with a full path and the explicit flag (-spell).
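
A rough sketch of what the argument handling for this proposal could look like, just to illustrate the idea. Nothing here exists in LT: the class `ExportOptionsSketch` is hypothetical, and `-l` is an assumed name for the language-code option (only `-spell` and `-i` are mentioned above). Looking up the resource path from the language is left out on purpose.

```java
/**
 * Hypothetical option holder for the proposal above: the dictionary kind
 * comes from an explicit -spell flag instead of being guessed from the path.
 * The -l option name is an assumption made for this sketch.
 */
final class ExportOptionsSketch {
  final boolean spelling;     // true when -spell was given
  final String languageCode;  // value of -l, or null
  final String inputPath;     // value of -i, or null

  private ExportOptionsSketch(boolean spelling, String languageCode, String inputPath) {
    this.spelling = spelling;
    this.languageCode = languageCode;
    this.inputPath = inputPath;
  }

  static ExportOptionsSketch parse(String... args) {
    boolean spelling = false;
    String languageCode = null;
    String inputPath = null;
    for (int i = 0; i < args.length; i++) {
      switch (args[i]) {
        case "-spell":
          spelling = true;
          break;
        case "-l":
          languageCode = valueAfter(args, i++); // i++ skips the consumed value
          break;
        case "-i":
          inputPath = valueAfter(args, i++);
          break;
        default:
          throw new IllegalArgumentException("Unknown argument: " + args[i]);
      }
    }
    if (languageCode == null && inputPath == null) {
      throw new IllegalArgumentException("Provide a language code (-l) or an input path (-i)");
    }
    return new ExportOptionsSketch(spelling, languageCode, inputPath);
  }

  /** Returns the value following the flag at flagIndex, or fails if it is missing. */
  private static String valueAfter(String[] args, int flagIndex) {
    if (flagIndex + 1 >= args.length) {
      throw new IllegalArgumentException(args[flagIndex] + " needs a value");
    }
    return args[flagIndex + 1];
  }

  public static void main(String[] args) {
    ExportOptionsSketch options = parse("-l", "en-US", "-spell");
    System.out.println("spelling=" + options.spelling + ", language=" + options.languageCode);
  }
}
```

Without -spell the tool would treat the input as a tagger or synthesizer dictionary, as before, so the directory name no longer matters.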

@jaumeortola
Member

jaumeortola commented Jul 8, 2024

> LT is able to locate its resources, so we could simply instantiate a language and get the resource path this way, so that the user won't need to decompile a jar etc.

We could do that, yes, keeping the current methods for backward compatibility.
Anyway, what is your goal with the English dictionary? Usually, developers decompile a binary dictionary when they want to update the dictionary and need to see the contents of the old dict.

@milekpl
Member Author

milekpl commented Jul 8, 2024 via email
