Overhaul the search infrastructure, add stress underlining, and add RN support #133
Draft: Willem3141 wants to merge 59 commits into master from refactor-search
This was linked to two open issues on Apr 26, 2024.
Willem3141 force-pushed the refactor-search branch from 4a3b230 to d19214f on May 19, 2024 14:57.
Word data that is independent of the user query is now processed when the dictionary file is loaded, on Reykunyu startup, instead of every time the user does a query. Fixes #138.
Move word data processing out of the postprocessing step
Willem3141 changed the title from "Overhaul the search infrastructure" to "Overhaul the search infrastructure, add stress underlining, and add RN support" on May 27, 2024.
And show only the pronunciation IPA of the selected dialect.
This (ridiculously complicated) code implements the 'awkx → 'awgìl pattern in RN.
For example, tì'i'a; originally only one of the two 's would be removed.
Alternative title: Wllìm figures out how to use lookahead.
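The lookahead remark can be illustrated with a small sketch. It assumes a simplified rule (a glottal stop between two vowels is dropped), which is not necessarily the exact RN rule Reykunyu implements; the function names and the vowel list are hypothetical. It shows why a pattern that consumes the following vowel removes only one of the two 's in tì'i'a, while a lookahead removes both.

```javascript
// Assumed vowel inventory for this sketch only.
const VOWELS = "aäeiìouù";

// Consuming version: the vowel after the ' is part of the match, so it
// cannot also serve as the vowel before the next '.
function dropGlottalStopsNaive(word) {
  const pattern = new RegExp(`([${VOWELS}])'([${VOWELS}])`, "g");
  return word.replace(pattern, "$1$2");
}

// Lookahead version: the following vowel is only checked, not consumed,
// so the next match can start at that vowel.
function dropGlottalStops(word) {
  const pattern = new RegExp(`([${VOWELS}])'(?=[${VOWELS}])`, "g");
  return word.replace(pattern, "$1");
}

console.log(dropGlottalStopsNaive("tì'i'a")); // second ' survives
console.log(dropGlottalStops("tì'i'a"));      // both ' removed
```

The only difference between the two patterns is `(?=...)`, which asserts that a vowel follows without making it part of the match.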
New plan: it would take a very long time to implement all intricacies of RN, so let's move that out of this PR. Instead, let's just merge the current limited RN support, and show a warning when the user selects RN mode.
New to-do list:
Because the dialect was only set after the on-page-load search was triggered, the search would always use the combined dialect.
This way, clicking them doesn't result in a page reload.
Fixes #148. The proper noun doesn't properly get capitalized yet, but that's another issue.
The problem was an overzealous parsing rule. If a noun ends in -g (after other affixes, such as -ìl, have already been removed), the rule replaced the -g by -kx, and because I assumed that was the only possible option, I didn't even make it check the variant with -g. That doesn't work for nouns ending in -ng, however, because they would end in -nkx, which obviously doesn't make sense. Fixed this by also checking the unchanged variant (exactly how all the other rules work).
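As a minimal sketch of the fix described above: the rule now proposes both the unchanged stem and the -g → -kx variant, and lets the dictionary decide which one exists. The function names and the tiny word set are hypothetical and are not taken from the Reykunyu code.

```javascript
// Stand-in for the real dictionary (assumption for this sketch).
const knownNouns = new Set(["'awkx", "tseng"]);

// Return every stem the parser should try for a form whose other
// affixes (such as -ìl) have already been stripped.
function stemCandidates(stem) {
  const candidates = [stem]; // always keep the unchanged variant
  if (stem.endsWith("g")) {
    candidates.push(stem.slice(0, -1) + "kx"); // undo -kx → -g
  }
  return candidates;
}

// Keep only the candidates that are actual dictionary entries.
function resolveStem(stem) {
  return stemCandidates(stem).filter((c) => knownNouns.has(c));
}

console.log(resolveStem("'awg"));  // only the -kx variant is a known noun
console.log(resolveStem("tseng")); // only the unchanged variant is a known noun
```

With the old behaviour only the -kx variant would be tried, so a stem such as tseng would be turned into tsenkx and never found.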
Tatlam ke lengu oe tsùlfätu lì'fyaye wione.
Until now, searching was done using a single JS object indexed by word:type strings. That means that to search for a noun we'd do two lookups into this object: query:n and query:n:pr. To search for a verb, we'd do many lookups (query:v:in, query:v:tr, ...). Worse, to search for anything else, we'd just do a linear search through the list. This is clearly suboptimal, but it grew this way for historical reasons.
This PR is a big refactor to improve this. Now, the word database is simply an array. A word can be referenced in two ways: (1) by its index in the array, and (2) by a word/type pair like before, but now a preprocessing step creates an object that maps words (independently of their type) to a list of array indexes. This way we can find a word in essentially O(1) time instead of O(n).
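A rough sketch of this structure, under a few assumptions: the field names `word` and `type`, the sample entries, and the function names are all made up for illustration, and a `Map` stands in for the plain object mentioned above.

```javascript
// Hypothetical word database: just an array, addressed by index.
const words = [
  { word: "toruk", type: "n" },
  { word: "taron", type: "v:tr" },
  { word: "toruk", type: "n:pr" }, // made-up entry sharing a spelling
];

// Preprocessing step, run once at startup: map each word (regardless of
// its type) to the list of array indexes where it occurs.
function buildIndex(words) {
  const index = new Map();
  words.forEach((entry, i) => {
    const key = entry.word.toLowerCase();
    if (!index.has(key)) index.set(key, []);
    index.get(key).push(i);
  });
  return index;
}

// Look up by word alone, or by a word/type pair like before.
function lookup(words, index, query, type) {
  const ids = index.get(query.toLowerCase()) ?? [];
  return ids
    .map((i) => words[i])
    .filter((entry) => type === undefined || entry.type === type);
}

const index = buildIndex(words);
console.log(lookup(words, index, "toruk"));      // all entries spelled "toruk"
console.log(lookup(words, index, "toruk", "n")); // only the one with type "n"
```

A single lookup now covers every type, and filtering by type only touches the handful of entries that share the spelling.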
A secondary benefit (actually the one that finally made me start this big refactor) is that, now that we are no longer using the "na'vi" field for searching, it becomes possible to put some extra data in that field. Specifically, Reykunyu has pronunciation and infix data, but these are stored in fields separate from the "na'vi" field, so it is not (easily) possible to show infix dots or underline the stressed syllable in the lemma itself. This PR allows indicating the syllable boundaries with "/" and the stressed syllable with "[...]"; this is then shown in the lemma display, without affecting searches. Also, we can now put ù in the field so that we can automatically create Reef Na'vi (RN) forms if the user wants to see RN. As such, this PR also adds an (as yet non-functional) setting to the UI to switch between dialect modes: Forest Na'vi (FN), RN, or a combined mode.
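To make the markup concrete, here is a hedged sketch of how such a field could be turned into a searchable form and a display form. The function names, the use of `<u>` for underlining, and the rule that FN spelling simply writes ù as u are assumptions for this example; infix dots and HTML escaping are left out.

```javascript
// Split markup such as "[to]/rùk" into syllables and find the stressed one.
function parseLemma(markup) {
  const syllables = [];
  let stressed = -1;
  markup.normalize("NFC").split("/").forEach((part, i) => {
    const m = part.match(/^\[(.*)\]$/);
    if (m) stressed = i;
    syllables.push(m ? m[1] : part);
  });
  return { syllables, stressed };
}

// Assumption: FN writes ù as plain u, RN keeps ù.
function applyDialect(text, dialect) {
  return dialect === "FN" ? text.replaceAll("ù", "u") : text;
}

// Plain spelling without any markup, e.g. for building the search index.
function plainForm(markup, dialect = "FN") {
  return applyDialect(parseLemma(markup).syllables.join(""), dialect);
}

// Display form with the stressed syllable underlined.
function lemmaHtml(markup, dialect = "FN") {
  const { syllables, stressed } = parseLemma(markup);
  return syllables
    .map((s, i) => {
      const text = applyDialect(s, dialect);
      return i === stressed ? `<u>${text}</u>` : text;
    })
    .join("");
}

console.log(plainForm("[to]/rùk"));       // FN spelling without markup
console.log(plainForm("[to]/rùk", "RN")); // RN spelling keeps ù
console.log(lemmaHtml("[to]/rùk"));       // stressed syllable wrapped in <u>
```

Because the plain form is derived from the markup, the search index never sees the "/" or "[...]" characters.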
Right now all of this is very experimental. Many features (e.g., the all words page and the editor) don't work. I still have to reconsider how exactly to deal with the FN/RN distinction. For example, which dialect should be used in word links? I don't want to have to write [[to]/rùk:n] for a word link to [toruk:n]. So, should word links just use FN spelling?
This PR also adds help and API pages, and custom HTTP error pages.
When this all works, this should address #18, #70, #77, #85, #105, and #135. Eventually, it could also enable addressing #36, #42, and #57.