Decompiler no longer works for en-US & en-GB #10708

milekpl · 2024-07-07T14:29:14Z

The documentation at

https://dev.languagetool.org/hunspell-support

is outdated, as it does not specify that English morfologik dictionaries are now, for some reason (which is obscure to me, given how small these files are), kept in a separate jar: english-pos-dict.jar. However, decompiling the files from the jar fails as well:

An unhandled exception occurred. Stack trace below.
java.lang.IndexOutOfBoundsException
    at java.nio.Buffer.checkBounds(Unknown Source)
    at java.nio.HeapByteBuffer.put(Unknown Source)
    at morfologik.stemming.TrimSuffixEncoder.decode(TrimSuffixEncoder.java:86)
    at morfologik.stemming.DictionaryIterator.next(DictionaryIterator.java:86)
    at morfologik.stemming.DictionaryIterator.next(DictionaryIterator.java:12)
    at morfologik.tools.DictDecompile.call(DictDecompile.java:80)
    at morfologik.tools.DictDecompile.call(DictDecompile.java:20)
    at morfologik.tools.CliTool.main(CliTool.java:133)
    at morfologik.tools.DictDecompile.main(DictDecompile.java:132)
    at org.languagetool.tools.DictionaryExporter.build(DictionaryExporter.java:82)
    at org.languagetool.tools.DictionaryExporter.main(DictionaryExporter.java:59)
Done. The dictionary export has been written to en-US.txt

I did not delve deeper into it, but Polish dictionaries decompile fine. Any ideas @jaumeortola ?

jaumeortola · 2024-07-07T18:30:40Z

Hi @milekpl
We prefer to put dictionaries in external dependencies because, even if the files are small (under 1 MB, though some are larger), every update adds a substantial amount of data to the git repo.

When you export spelling binary dictionaries, make sure that the path contains "hunspell" or "spelling". See:

languagetool/languagetool-tools/src/main/java/org/languagetool/tools/DictionaryExporter.java

Line 68 in 2446a07

if (inputPath.contains("hunspell") || inputPath.contains("spelling")) {

We are using that to distinguish spelling and tagger or synthesizer dictionaries.
I know that this is confusing. If we remove it, we'll need a new input parameter to specify the kind of dictionary. But we'll also need to modify all the scripts that use this class.
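
To illustrate why the location of the file matters, here is a minimal sketch of that path check in isolation. Only the `contains("hunspell") || contains("spelling")` condition comes from the line quoted above; the class name, the enum, and the example paths are hypothetical and are not part of LanguageTool.

```java
class DictionaryKindSketch {

  /** The two families of binary dictionaries discussed in this thread. */
  enum Kind { SPELLING, TAGGER_OR_SYNTHESIZER }

  // Same condition as the quoted line from DictionaryExporter:
  // the kind of dictionary is inferred from the input path alone.
  static Kind guessFromPath(String inputPath) {
    if (inputPath.contains("hunspell") || inputPath.contains("spelling")) {
      return Kind.SPELLING;
    }
    return Kind.TAGGER_OR_SYNTHESIZER;
  }

  public static void main(String[] args) {
    // Hypothetical paths: the same file extracted to two different places.
    System.out.println(guessFromPath("/tmp/extracted/en-US.dict"));
    System.out.println(guessFromPath("/tmp/extracted/hunspell/en-US.dict"));
  }
}
```

With this kind of check, a spelling dictionary extracted to a directory whose path contains neither keyword is treated as a tagger or synthesizer dictionary, which would be consistent with the decoding failure reported above.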

milekpl · 2024-07-07T19:57:52Z

Hi @jaumeortola, thanks for the explanation. Indeed, it does work when the dictionary is stored under a hunspell directory.

Right now I have no time to work on this, but it seems to me it would be much easier just to use the existing logic of LT, and require the user to provide the language code and the explicit flag -spell. Tagging and synthesis should work the same way as before. LT is able to locate its resources, so we could simply instantiate a language and get the resource path this way, so that the user won't need to decompile a jar etc. Alternatively, provide -i with a full path and the explicit flag (-spell).
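
A rough sketch of what the option handling for this proposal might look like, assuming the exporter accepted either a language code or a full input path plus an explicit `-spell` flag. The `-i` and `-spell` names come from the comment above; the `-l` option, the class, and its fields are hypothetical, and nothing here reflects the current `DictionaryExporter` code.

```java
class ExportOptionsSketch {
  final String languageCode; // e.g. "en-US"; null when -i is used instead
  final String inputPath;    // full path to a binary dictionary; null when -l is used
  final boolean spelling;    // true only when -spell is given explicitly

  ExportOptionsSketch(String languageCode, String inputPath, boolean spelling) {
    this.languageCode = languageCode;
    this.inputPath = inputPath;
    this.spelling = spelling;
  }

  static ExportOptionsSketch parse(String[] args) {
    String lang = null;
    String input = null;
    boolean spell = false;
    for (int i = 0; i < args.length; i++) {
      switch (args[i]) {
        case "-spell": spell = true; break;
        case "-l": lang = valueAfter(args, i++); break;  // hypothetical option name
        case "-i": input = valueAfter(args, i++); break;
        default: throw new IllegalArgumentException("Unknown option: " + args[i]);
      }
    }
    // Exactly one source must be given: a language code or an explicit path.
    if ((lang == null) == (input == null)) {
      throw new IllegalArgumentException("Provide exactly one of -l or -i");
    }
    return new ExportOptionsSketch(lang, input, spell);
  }

  private static String valueAfter(String[] args, int i) {
    if (i + 1 >= args.length) {
      throw new IllegalArgumentException("Missing value after " + args[i]);
    }
    return args[i + 1];
  }

  public static void main(String[] args) {
    ExportOptionsSketch o = parse(new String[] {"-l", "en-US", "-spell"});
    System.out.println(o.languageCode + " spelling=" + o.spelling);
  }
}
```

The point of the explicit flag is that the dictionary kind no longer depends on where the file happens to be stored; the existing path-based behaviour could stay as a fallback for scripts that already rely on it.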

jaumeortola · 2024-07-08T06:51:54Z

> LT is able to locate its resources, so we could simply instantiate a language and get the resource path this way, so that the user won't need to decompile a jar etc.

We could do that, yes, keeping the current methods for backward compatibility.
Anyway, what is your goal with the English dictionary? Usually, developers decompile a binary dictionary when they want to update the dictionary and need to see the contents of the old dict.

milekpl · 2024-07-08T07:39:02Z

Ah, I needed a modern word list for English, and ours is nicely curated.

milekpl assigned jaumeortola Jul 7, 2024

milekpl added bug easy fix labels Jul 7, 2024
