Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(web): provide lexicon probabilities directly on the search path 📚 #11868

Merged
merged 6 commits into from
Jul 8, 2024

Conversation

jahorton
Copy link
Contributor

@jahorton jahorton commented Jun 25, 2024

This PR was originally part of #10973.

In order to efficiently traverse a full lexical efficiently for dictionary-based wordbreaking, it's best to directly provide relevant probability data as efficiently as possible. Fortunately, it's easily possible to make this O(1) on the lexical model's internal iterator - the LexiconTraversal type. It would take O(log(N)) time to recompute it via the model's .predict method instead.

Note that this provides two different probability value types:

  1. The probability of each reached entry.
  2. The probability of the highest-frequency entry either represented by the current node or by any of its descendants.

There are uses for this outside of dictionary-based wordbreaking, too. The latter 'probability' listed above can be useful for optimizing the correction-search - if a path only produces low-frequency words, we should consider other paths that could yield higher-frequency words first.

There's also notable potential for being able to merge / blend two different models together via their LexiconTraversal iterators in this manner. Noting our upcoming push toward #11872, this would facilitate a fantastic way to achieve that goal - to create a stand-in model for the OS's dictionary and blend that with the loaded lexical-model via traversals.

@keymanapp-test-bot skip

@jahorton jahorton requested a review from mcdurdin as a code owner June 25, 2024 05:46
@keymanapp-test-bot keymanapp-test-bot bot added the user-test-missing User tests have not yet been defined for the PR label Jun 25, 2024
Copy link
Member

@mcdurdin mcdurdin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM -- nits only

common/models/templates/src/trie-model.ts Show resolved Hide resolved
}
};
return;
}
}

get entries(): string[] {
get entries() {
const totalWeight = this.totalWeight;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

common/models/templates/src/trie-model.ts Outdated Show resolved Hide resolved
@@ -35,6 +35,10 @@ describe('Trie traversal abstractions', function() {
it('traversal with simple internal nodes', function() {
var model = new TrieModel(jsonFixture('tries/english-1000'));

// Prob: entry weight / total weight
// "the" is the highest-weighted word in the fixture.
const PROB_OF_THE = 1000 / 500500;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move this and PROB_OF_TROUBLE outside the functions so we can use them throughout?

common/models/templates/test/test-trie-traversal.js Outdated Show resolved Hide resolved
common/models/templates/test/test-trie-traversal.js Outdated Show resolved Hide resolved
@darcywong00 darcywong00 modified the milestones: A18S5, A18S6 Jul 5, 2024
@github-actions github-actions bot added web/ and removed web/ labels Jul 8, 2024
@jahorton jahorton merged commit 6ecb461 into master Jul 8, 2024
15 checks passed
@jahorton jahorton deleted the feat/common/models/lexicon-traversal-probs branch July 8, 2024 07:37
@keyman-server
Copy link
Collaborator

Changes in this pull request will be available for download in Keyman version 18.0.70-alpha

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

None yet

4 participants