Language Expansion Visitor #2411

Open
wants to merge 2 commits into
base: integration

Conversation

apmoriarty
Collaborator

  • Added a visitor to support language-based expansion of query terms.
  • LanguageAwareAnalyzer extends Lucene's Analyzer class.
  • Default implementations are provided for most Lucene analyzers.
  • QueryParameters and QueryOptions can enable language analysis and disable stemming, lemmatization, bigramming, or unigramming on a per-language basis.
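
For readers skimming the description, the per-language toggles can be pictured roughly as follows. This is only a minimal sketch under stated assumptions: the class `LanguageExpansionOptions`, the `Feature` enum, and the `disable`/`isEnabled` methods are hypothetical names made up for illustration and are not the QueryParameters or QueryOptions API added in this PR. It only models the stated behavior that every feature is on by default and can be switched off per language short code.

```java
import java.util.EnumSet;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical model of per-language expansion toggles; not the PR's actual classes.
public class LanguageExpansionOptions {

    /** Features a language analyzer may apply; all are on unless disabled. */
    public enum Feature { STEMMING, LEMMATIZATION, UNIGRAMS, BIGRAMS }

    // Language short code -> features disabled for that language.
    private final Map<String, Set<Feature>> disabled = new HashMap<>();

    /** Turns one feature off for one language and returns this for chaining. */
    public LanguageExpansionOptions disable(String language, Feature feature) {
        disabled.computeIfAbsent(language, k -> EnumSet.noneOf(Feature.class)).add(feature);
        return this;
    }

    /** A feature is enabled unless it was explicitly disabled for that language. */
    public boolean isEnabled(String language, Feature feature) {
        Set<Feature> off = disabled.get(language);
        return off == null || !off.contains(feature);
    }

    public static void main(String[] args) {
        LanguageExpansionOptions options = new LanguageExpansionOptions()
                .disable("en", Feature.STEMMING)
                .disable("zh", Feature.UNIGRAMS);

        System.out.println(options.isEnabled("en", Feature.STEMMING));      // false
        System.out.println(options.isEnabled("en", Feature.LEMMATIZATION)); // true
        System.out.println(options.isEnabled("zh", Feature.BIGRAMS));       // true
    }
}
```

Storing only the disabled set keeps the "enabled by default" rule implicit, so a language that was never mentioned needs no entry at all.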

<p class="left">Datawave provides limited support for Lucene-based language expansion. A user may specify a set of language short codes that determine which
analyzers are applied to a query.</p>
<p class="left">Language analyzers may support stemming, lemmatization, unigramming and bigramming. All options are enabled by default, but Lucene only
supports certain features by default for certain languages. Stemming is supported for most languages, and bigrams are supported or CJK languages.</p>
Collaborator

Suggested change
- supports certain features by default for certain languages. Stemming is supported for most languages, and bigrams are supported or CJK languages.</p>
+ supports certain features by default for certain languages. Stemming is supported for most languages, and bigrams are supported for CJK languages.</p>

Seems like this should be 'for'?
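
Since the sentence under review mentions bigrams for CJK languages, a small illustration of what bigramming produces may help. This is a minimal standard-library sketch, not the analyzer code in this PR: the class `BigramSketch` and its `bigrams` method are hypothetical names. In a real Lucene analysis chain this job is typically done by `CJKBigramFilter`, which also handles script detection and other cases this sketch ignores.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of character bigramming; not the PR's analyzer code.
public class BigramSketch {

    /** Returns every overlapping window of two code points in the text. */
    public static List<String> bigrams(String text) {
        // Work on code points so supplementary characters are not split in half.
        int[] codePoints = text.codePoints().toArray();
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < codePoints.length; i++) {
            out.add(new String(codePoints, i, 2));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("東京都庁")); // [東京, 京都, 都庁]
    }
}
```

Overlapping windows are used because CJK text has no whitespace between words, so adjacent character pairs serve as an approximation of word boundaries.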

/**
* Base class for lucene-based language analyzers
*/
public abstract class LuceneAnalyzer extends LanguageAnalyzer {
Collaborator

LuceneLanguageAnalyzer?

}

@Test
public void testDisableLemmas() throws ParseException {
Collaborator

Can we get some query tests that validate the Plans when disabling these pieces in addition to just making sure the parameter is set? If they already exist somewhere I haven't found yet, disregard.
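
To make the request concrete, the kind of plan-level check being asked for might look roughly like the sketch below. Everything in it is an assumption for illustration: `PlanAssertionSketch`, `toyStem`, `plan`, and `assertPlan` are hypothetical names, the stemmer is a toy that strips one trailing "s", and the plan string format is made up rather than taken from Datawave's actual query plans or test harness. The point is only the shape of the test: assert on the resulting plan with the feature enabled and again with it disabled.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical, self-contained illustration of a plan-level assertion.
public class PlanAssertionSketch {

    // Toy stand-in for a stemmer: strips one trailing "s". Not a real Lucene stemmer.
    public static String toyStem(String term) {
        return term.length() > 1 && term.endsWith("s")
                ? term.substring(0, term.length() - 1)
                : term;
    }

    // Builds an assumed plan string for one field/term pair.
    public static String plan(String field, String term, boolean stemmingEnabled) {
        Set<String> terms = new LinkedHashSet<>();
        terms.add(term);
        if (stemmingEnabled) {
            terms.add(toyStem(term)); // the set drops the stem if it equals the term
        }
        String joined = terms.stream()
                .map(t -> field + " == '" + t + "'")
                .collect(Collectors.joining(" || "));
        return terms.size() > 1 ? "(" + joined + ")" : joined;
    }

    public static void assertPlan(String expected, String actual) {
        if (!expected.equals(actual)) {
            throw new AssertionError("expected <" + expected + "> but was <" + actual + ">");
        }
    }

    public static void main(String[] args) {
        // With stemming on, the plan should contain the original and the stemmed term.
        assertPlan("(TEXT == 'cats' || TEXT == 'cat')", plan("TEXT", "cats", true));
        // With stemming off, only the original term should remain.
        assertPlan("TEXT == 'cats'", plan("TEXT", "cats", false));
        System.out.println("plans match");
    }
}
```

A check of this shape would catch the case where the parameter is set correctly but the visitor ignores it.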

3 participants