集成到ElasticSearch

基于 jieba 的 elasticsearch 中文分词插件。

集成到ElasticSearch

git clone [email protected]:hongfuli/elasticsearch-analysis-jieba.git
cd elasticsearch-analysis-jieba
mvn package

把release/elasticsearch-analysis-jieba-{version}.zip文件解压到 elasticsearch 的 plugins 目录下，重启elasticsearch即可。

创建字段：

curl -XPOST http://localhost:9200/index/type/_mapping -d'
{
        "properties": {
            "content": {
                "type": "text",
                "analyzer": "jieba",
                "search_analyzer": "jieba"
            }
        }
    }
}'

直接使用Tokenizer分词

可直接使用 com.github.hongfuli.jieba.Tokenizer 对文本字符进行分词，方法参数完全和 jieba python 一致。

imort com.github.hongfuli.jieba.Tokenizer

Tokenizer t = new Tokenizer();
t.cut("这是一个伸手不见五指的黑夜。我叫孙悟空，我爱北京，我爱Python和C++。", false, true);

集成到Lucene

import com.github.hongfuli.jieba.lucene.JiebaAnalyzer;

Analyzer analyzer = new JiebaAnalyzer();
try(TokenStream ts = analyzer.tokenStream("field", "这是一个伸手不见五指的黑夜。我叫孙悟空，我爱北京，我爱Python和C++。")) {
      StringBuilder b = new StringBuilder();
      CharTermAttribute termAtt = ts.getAttribute(CharTermAttribute.class);
      PositionIncrementAttribute posIncAtt = ts.getAttribute(PositionIncrementAttribute.class);
      PositionLengthAttribute posLengthAtt = ts.getAttribute(PositionLengthAttribute.class);
      OffsetAttribute offsetAtt = ts.getAttribute(OffsetAttribute.class);
      assertNotNull(offsetAtt);
      ts.reset();
      int pos = -1;
      while (ts.incrementToken()) {
        pos += posIncAtt.getPositionIncrement();
        b.append(termAtt);
        b.append(" at pos=");
        b.append(pos);
        if (posLengthAtt != null) {
          b.append(" to pos=");
          b.append(pos + posLengthAtt.getPositionLength());
        }
        b.append(" offsets=");
        b.append(offsetAtt.startOffset());
        b.append('-');
        b.append(offsetAtt.endOffset());
        b.append('\n');
      }
      ts.end();
      return b.toString();
    }

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
config		config
src		src
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
pom.xml		pom.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

集成到ElasticSearch

直接使用Tokenizer分词

集成到Lucene

About

Uh oh!

Releases

Packages

Uh oh!

Languages

License

mty2015/elasticsearch-analysis-jieba

Folders and files

Latest commit

History

Repository files navigation

集成到ElasticSearch

直接使用Tokenizer分词

集成到Lucene

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages