Overhaul the search infrastructure, add stress underlining, and add RN support #133

Willem3141 · 2024-03-24T23:54:12Z

Until now, searching was done using a single JS object keyed by word:type strings. To search for a noun, we would do two lookups in this object: query:n and query:n:pr. To search for a verb, we would do many lookups (query:v:in, query:v:tr, ...). Worse, to search for anything else, we would just do a linear search through the list. This is clearly suboptimal, but it grew this way for historical reasons.

This PR is a big refactor to improve this. The word database is now simply an array. A word can be referenced in two ways: (1) by its index in the array, and (2) by a word/type pair like before, except that a preprocessing step now builds an object that maps each word (independently of its type) to a list of array indexes. This way, finding a word takes essentially O(1) time instead of O(n).
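A minimal sketch of that preprocessing step, not the actual Reykunyu code. The entry shape (`"na'vi"` and `type` fields), the sample entries, and the function names `buildIndex` and `lookup` are assumptions made for illustration; a `Map` is used here where the description mentions a plain object.

```javascript
// Illustrative sample data only; the types are not checked against the real dictionary.
const words = [
  { "na'vi": "toruk", type: "n" },
  { "na'vi": "taron", type: "v:tr" },
  { "na'vi": "taron", type: "n" }, // made-up second entry to show a shared key
];

// Build the word -> [array indexes] mapping once, independent of word type.
function buildIndex(words) {
  const index = new Map();
  words.forEach((word, i) => {
    const key = word["na'vi"].toLowerCase();
    if (!index.has(key)) {
      index.set(key, []);
    }
    index.get(key).push(i);
  });
  return index;
}

// One hash lookup finds every entry for a word; the type filter is optional.
function lookup(words, index, query, type) {
  const ids = index.get(query.toLowerCase()) ?? [];
  return ids
    .map((i) => words[i])
    .filter((word) => type === undefined || word.type === type);
}

const index = buildIndex(words);
console.log(lookup(words, index, "taron"));         // both "taron" entries
console.log(lookup(words, index, "taron", "v:tr")); // only the verb
```

With this layout, a word/type pair no longer needs one lookup per possible type: the word is looked up once and the (usually very short) result list is filtered.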

A secondary benefit (actually the one that finally made me start this big refactor) is that, now that the "na'vi" field is no longer used for searching, it becomes possible to put some extra data in it. Specifically, Reykunyu has pronunciation and infix data, but these are stored in fields separate from the "na'vi" field, so it is not (easily) possible to show infix dots or underline the stressed syllable in the lemma itself. This PR allows indicating the syllable boundaries with "/" and the stressed syllable with "[...]"; this is then shown in the lemma display, without affecting searches. Also, we can now put ù in the field so that Reef Na'vi forms can be created automatically if the user wants to see RN. As such, this PR also adds an (as yet non-functional) setting to the UI to switch between dialect modes: FN, RN, or a combined mode.
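As a rough illustration of how such an annotated field could be turned into a searchable form and a display form, here is a sketch. The function name `parseLemma`, the returned shape, the use of `<u>` for underlining, and the assumption that the only FN/RN spelling difference handled is ù (written as u in FN) are all mine, not taken from the Reykunyu code.

```javascript
// Parse an annotated lemma such as "[to]/rùk":
//   "/"     separates syllables
//   "[...]" marks the stressed syllable
// dialect is "FN" or "RN" (the combined mode is not covered by this sketch).
function parseLemma(raw, dialect) {
  // Assumption: FN spells ù as plain u, RN keeps ù.
  const spelled = dialect === "RN" ? raw : raw.replace(/ù/g, "u");
  const syllables = spelled.split("/").map((syllable) => {
    const stressed = syllable.startsWith("[") && syllable.endsWith("]");
    return { text: stressed ? syllable.slice(1, -1) : syllable, stressed };
  });
  return {
    // Markup-free form, usable as a search key.
    plain: syllables.map((s) => s.text).join(""),
    // Display form with the stressed syllable underlined.
    html: syllables
      .map((s) => (s.stressed ? `<u>${s.text}</u>` : s.text))
      .join(""),
  };
}

console.log(parseLemma("[to]/rùk", "FN")); // { plain: 'toruk', html: '<u>to</u>ruk' }
console.log(parseLemma("[to]/rùk", "RN")); // { plain: 'torùk', html: '<u>to</u>rùk' }
```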

Right now all of this is very experimental. Many features (e.g., the all-words page and the editor) don't work. I still have to reconsider how exactly to deal with the FN/RN distinction. For example, which dialect should be used in word links? I don't want to have to write [[to]/rùk:n] for a word link to [toruk:n]. So, should word links just use FN spelling?

This PR also adds help and API pages, and custom HTTP error pages.

When this all works, this should address #18, #70, #77, #85, #105, and #135. Eventually, it could also enable addressing #36, #42, and #57.



Willem3141 · 2024-06-09T22:05:42Z

New plan: implementing all the intricacies of RN would take a very long time, so let's move that out of this PR. Instead, let's just merge the current limited RN support, and show a warning when the user selects RN mode.

Willem3141 · 2024-06-09T22:06:22Z

New to-do list:

Add warning to the RN mode
Make the rhyme search work again
Make the editor work again
Make the source and etymology editors work again
Ask people to do beta testing
Figure out why nouns ending in -ng aren't found
Figure out why Aonungit isn't found
Figure out why proper names don't have a conjugation table anymore (#148: Conjugation tables don't get shown for proper nouns in the web frontend)
Make sure the Discord bot doesn't completely break


The problem with nouns ending in -ng not being found was an overzealous parsing rule. If a noun ends in -g (after other affixes, such as -ìl, have already been removed), the rule replaced the -g with -kx, and because I assumed that was the only possible option, I did not even have it check the variant with -g. That doesn't work for nouns ending in -ng, however, because the ending would become -nkx, which obviously doesn't make sense. Fixed this by also checking the unchanged variant (exactly how all the other rules work).
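The fix can be sketched roughly as follows. This is a simplified stand-in for the real noun parser: the names `stemCandidates` and `findNoun` and the tiny set of known nouns are assumptions for illustration, and affix stripping is assumed to have happened already.

```javascript
// Hypothetical mini-dictionary of noun stems.
const knownNouns = new Set(["'awkx", "nantang"]);

// Given a stem with its affixes already removed, list the stems it may come from.
function stemCandidates(stem) {
  // Always keep the unchanged variant, like the other parsing rules do.
  const candidates = [stem];
  if (stem.endsWith("g")) {
    // A final -g may be a voiced -kx ('awkx -> 'awgìl).
    candidates.push(stem.slice(0, -1) + "kx");
  }
  return candidates;
}

// Return the first candidate that is a known noun, or null if none is.
function findNoun(stem) {
  return stemCandidates(stem).find((c) => knownNouns.has(c)) ?? null;
}

console.log(findNoun("'awg"));    // 'awkx
console.log(findNoun("nantang")); // nantang (the old rule only tried "nantankx")
```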


Willem3141 added 5 commits March 25, 2024 00:47

Merge branch 'master' into refactor-search

9885c5f

Convert pronunciation data to the new format

cdbbdcb

Add help and API documentation pages

d1b68c1

Add 403, 404, and 500 error pages

a481bf7

Fixes #135.

This was linked to issues Apr 26, 2024

Add RN support #70

Open

Refactor search, don't use word:type keys for searching #77

Open

Write API documentation #85

Open

Underline the stressed syllable in lemmas #105

Open

Add a proper 404 page #135

Open

Add help page #18

Open

Merge branch 'master' into refactor-search

d19214f

Willem3141 force-pushed the refactor-search branch from 4a3b230 to d19214f Compare May 19, 2024 14:57

Willem3141 added 2 commits May 20, 2024 18:04

Fix interrupted stress underline and a translation

f4ff377

Merge branch 'master' into refactor-search

c51741d

Willem3141 mentioned this pull request May 20, 2024

Move query-independent tasks out of the postprocessing step #138

Open

Willem3141 and others added 7 commits May 24, 2024 23:27

Preprocess instead of postprocess word data

bd91529

Word data that is independent of the user query is now processed when the dictionary file is loaded, on Reykunyu startup, instead of every time the user does a query. Fixes #138.
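A small sketch of the difference, under assumptions: the derived `plain` field, the function names, and the counter are hypothetical and only serve to show that the query-independent work runs once per word at load time rather than once per query.

```javascript
let derivations = 0; // counts how often the query-independent work runs

// Query-independent processing: here, stripping the "/" and "[...]" markup.
function derive(word) {
  derivations++;
  return { ...word, plain: word["na'vi"].replace(/[\[\]\/]/g, "") };
}

// Done once, when the dictionary file is loaded on startup.
function loadDictionary(rawWords) {
  return rawWords.map(derive);
}

// Per-query work only reads the precomputed field.
function search(dictionary, query) {
  return dictionary.filter((word) => word.plain === query);
}

const dictionary = loadDictionary([
  { "na'vi": "[to]/ruk" },
  { "na'vi": "[ta]/ron" },
]);
search(dictionary, "toruk");
search(dictionary, "taron");
console.log(derivations); // 2: once per word, regardless of the number of queries
```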

Merge pull request #140 from Willem3141/postprocess-less-preprocess-more

851d98e

Move word data processing out of the postprocessing step

Add FN and RN spelling conversions

e077e0e

Add unit tests for the spelling conversions

81a158b

Use word and word_raw on the client side

80fcde6

Add missing ù to vowel list

91390e6

Make the dialect radioboxes mutually exclusive

7fbb083

Willem3141 changed the title ~~Overhaul the search infrastructure~~ Overhaul the search infrastructure, add stress underlining, and add RN support May 27, 2024

Willem3141 added 2 commits May 28, 2024 00:20

Make the dialect setting actually functional

b0d3a3f

And show only the pronunciation IPA of the selected dialect.

Re-search when saving the settings

a1ca7af

Fixes #137.

Willem3141 linked an issue May 27, 2024 that may be closed by this pull request

Re-search when saving the settings #137

Open

Willem3141 added 3 commits May 28, 2024 00:29

Add ù to another vowel list

73300c2

Take the dialect into account when searching

f1414b8

Remember the dialect setting on the website

9f6f973

Willem3141 added 15 commits June 2, 2024 00:35

Fix IPA conversion of ng

75b271e

Fix proper nouns not being findable

fccee79

Simplify the code that applies lenition to nouns

24a613e

Implement voicing the last noun consonant in RN

8b05884

This (ridiculously complicated) code implements the 'awkx → 'awgìl pattern in RN.

Take the dialect into account when noun parsing

ed300f4

Add candidate for the RN genitive ending

f14e9a9

Fix parsing of nì + adjective

b09a8c4

Add candidate for be+

2ea8586

Add support for parsing RN adpositions as affixes

b4e6572

Fix errors in the affix list generation

eeb6bbe

Implement parsing ejectives made into voiced stops

c24f71b

Fix display of combined infixes

d062e2c

Fix RN conversion for words with more than one '

4805e20

For example, tì'i'a; originally only one of the two 's would be removed.

Improve the regexes used for RN conversion

5f34490

Alternative title: Wllìm figures out how to use lookahead.
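To illustrate why lookahead helps with words such as tì'i'a, here is a sketch. It assumes the rule being implemented is "drop a glottal stop between two different vowels"; the vowel list, regexes, and function names are illustrative and are not the ones used in Reykunyu.

```javascript
const VOWEL = "[aäeiìouù]";

// Naive version: the match consumes the vowel after the ', so that vowel can
// no longer serve as the vowel *before* the next '.
function dropGlottalStopsNaive(word) {
  return word.replace(new RegExp(`(${VOWEL})'(${VOWEL})`, "g"), "$1$2");
}

// Lookahead version: the following vowel is only inspected, not consumed,
// so consecutive vowel-'-vowel sequences are all handled.
function dropGlottalStops(word) {
  return word.replace(
    new RegExp(`(${VOWEL})'(?=(${VOWEL}))`, "g"),
    // Assumption: the ' is kept between two identical vowels.
    (match, before, after) => (before === after ? match : before)
  );
}

console.log(dropGlottalStopsNaive("tì'i'a")); // tìi'a (second ' is missed)
console.log(dropGlottalStops("tì'i'a"));      // tìia
console.log(dropGlottalStops("rä'ä"));        // rä'ä
```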

Add (currently failing) testcase for rä'ä

a088d35

Willem3141 added 6 commits June 10, 2024 00:30

Show a warning that RN mode is still experimental

530908b

Check the 'combined' dialect option by default

d0ef50c

Fix the /api/fwew endpoint

d9b09d0

Fix searches on page load using the wrong dialect

926f4f0

Because the dialect was only set after the on-page-load search was triggered, the search would always use the combined dialect.

Make "did you mean" suggestions word links

372672e

This way, clicking them doesn't result in a page reload.

Show conjugation tables for proper nouns

2e8110c

Fixes #148. The proper noun doesn't properly get capitalized yet, but that's another issue.

Willem3141 linked an issue Jun 15, 2024 that may be closed by this pull request

Conjugation tables don't get shown for proper nouns in the web frontend #148

Open

Willem3141 added 6 commits June 15, 2024 23:36

Make the rhyme search work again

7881486

Disable the RN mode for now

9032af1

Add temporary hack to enable RN mode

65367cf

Fix checkbox label "for" targets

50f93b3

Fix the RN easter egg trigger

020b840

Tatlam ke lengu oe tsùlfätu lì'fyaye wione.
