Language Expansion Visitor #2411
base: integration
apmoriarty
commented
Jun 10, 2024
- Added a visitor to support language-based expansion of query terms.
- LanguageAwareAnalyzer extends Lucene's Analyzer class.
- Default implementations are provided for most Lucene analyzers.
- QueryParameters and QueryOptions can enable language analysis and disable stemming, lemmatization, bigramming, or unigramming on a per-language basis.
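The per-language toggles described above could be modeled roughly as follows. This is a minimal sketch only: `LanguageOptionsSketch`, `Feature`, `disable`, and `isEnabled` are hypothetical names chosen for illustration and are not taken from this PR or from Datawave's QueryParameters/QueryOptions. It assumes every feature is on by default and is only switched off per language short code.

```java
import java.util.EnumSet;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; none of these names come from the Datawave code base.
final class LanguageOptionsSketch {

    /** The four expansion features named in the PR description. */
    enum Feature { STEMMING, LEMMATIZATION, UNIGRAMS, BIGRAMS }

    // Features explicitly disabled per language short code, e.g. "en" -> {STEMMING}.
    private final Map<String, Set<Feature>> disabled = new HashMap<>();

    /** Disable one feature for one language; all other features stay enabled. */
    void disable(String languageCode, Feature feature) {
        disabled.computeIfAbsent(normalize(languageCode), k -> EnumSet.noneOf(Feature.class))
                .add(feature);
    }

    /** A feature is enabled unless it was explicitly disabled for that language. */
    boolean isEnabled(String languageCode, Feature feature) {
        Set<Feature> off = disabled.get(normalize(languageCode));
        return off == null || !off.contains(feature);
    }

    // Treat "EN", " en " and "en" as the same short code.
    private static String normalize(String code) {
        return code.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        LanguageOptionsSketch options = new LanguageOptionsSketch();
        options.disable("en", Feature.STEMMING);
        System.out.println(options.isEnabled("en", Feature.STEMMING));      // false
        System.out.println(options.isEnabled("en", Feature.LEMMATIZATION)); // true
        System.out.println(options.isEnabled("de", Feature.STEMMING));      // true
    }
}
```

Tracking only the disabled features keeps the "everything enabled by default" rule implicit, so a language that was never configured needs no entry at all.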
<p class="left">Datawave provides limited support for Lucene-based language expansion. A user may specify a set of language short codes that determine which
analyzers are applied to a query.</p>
<p class="left">Language analyzers may support stemming, lemmatization, unigramming and bigramming. All options are enabled by default, but Lucene only
supports certain features by default for certain languages. Stemming is supported for most languages, and bigrams are supported or CJK languages.</p>
- supports certain features by default for certain languages. Stemming is supported for most languages, and bigrams are supported or CJK languages.</p>
+ supports certain features by default for certain languages. Stemming is supported for most languages, and bigrams are supported for CJK languages.</p>
Seems like this should be 'for'?
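For readers unfamiliar with the bigramming mentioned in the documentation text, the following plain-Java illustration shows what overlapping bigrams of a CJK string look like. It is a sketch for explanation only: `BigramSketch` and `bigrams` are hypothetical names, and this is not how Lucene or this PR implements bigram generation, which is done inside Lucene's own analysis chain.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only; not Lucene code and not part of this PR.
final class BigramSketch {

    /** Returns every pair of adjacent code points in the input, in order. */
    static List<String> bigrams(String text) {
        // Work on code points so characters outside the BMP are not split.
        int[] codePoints = text.codePoints().toArray();
        List<String> result = new ArrayList<>();
        for (int i = 0; i + 1 < codePoints.length; i++) {
            result.add(new String(codePoints, i, 2));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("日本語")); // prints [日本, 本語]
    }
}
```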
/**
 * Base class for lucene-based language analyzers
 */
public abstract class LuceneAnalyzer extends LanguageAnalyzer {
LuceneLanguageAnalyzer?
}

@Test
public void testDisableLemmas() throws ParseException {
Can we get some query tests that validate the Plans when disabling these pieces in addition to just making sure the parameter is set? If they already exist somewhere I haven't found yet, disregard.