This is a package for Chinese text segmentation, keyword extraction, and part-of-speech tagging. Jieba.jl supports four segmentation modes: Maximum Probability, Hidden Markov Model, Query Segment, and Mix Segment.
- Supports Windows, Linux, and Mac.
- Supports Chinese text segmentation, keyword extraction, part-of-speech tagging, and simhash computation.
- Custom dictionary path.
- Supports simplified Chinese and traditional Chinese.
- New word identification.
- Auto encoding detection.
- Fast text segmentation.
- Easy installation.
- MIT license.
Install the latest development version from GitHub:
using Pkg
Pkg.add(url="https://github.com/mobtgzhang/Jieba.jl.git")
Pkg.build("Jieba")
using Jieba
There are four segmentation models. Use worker() to initialize a worker, then use <= or segment() to do the segmentation.
using Jieba
## Initialize a worker with the default arguments.
cutter = worker()
cutter <= "江州市长江大桥参加了长江大桥的通车仪式"
9-element Array{UTF8String,1}:
"江州"
"市长"
"江大桥"
"参加"
"了"
"长江大桥"
"的"
"通车"
"仪式"
You can also pass a file path to segment the contents of a file.
cutter <= "./temp.dat"
The package uses initialized engines for word segmentation. You can initialize multiple engines simultaneously.
worker(worker_type = "mix", encoding = "UTF-8", lines = 100000, output = " ", detect = true, symbol = false,
       write_file = true, topn = 5, dict = DICTPATH, hmm = HMMPATH, user = USERPATH, qmax = 20,
       stop_words = STOPPATH, idf = IDFPATH)
The public settings of a worker can be read and modified with dot syntax, such as WorkerName.symbol = true. Some private settings are fixed when the engine is initialized, and you can read them through WorkerName.private.
cutter.encoding
cutter.detect = false
Users can specify their own custom dictionary to be included in the jiebaR default dictionary. jiebaR is able to identify new words, but adding your own new words can ensure higher accuracy. imewlconverter is a good tool for dictionary construction.
There are three columns in the system dictionary. The first column is the word, the second column is the frequency of the word, and the third column is the part-of-speech tag, using labels compatible with ICTCLAS.
There are two columns in the user dictionary. The first column is the word, and the second column is the part-of-speech tag, using labels compatible with ICTCLAS. Every word in the user dictionary is assigned the maximum frequency found in the system dictionary. If you want to provide a specific frequency for a new word, put it in the system dictionary instead.
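The two dictionary layouts described above can be illustrated with a small parsing sketch. This is not how Jieba.jl reads its dictionaries; it assumes whitespace-separated columns (the text does not state the delimiter), the function names parse_system_entry and parse_user_entry are hypothetical, and the sample entries and frequencies are made up.

```julia
# Sketch only: assumes whitespace-separated columns, which the text above does not specify.

# System dictionary line: word, frequency, part-of-speech tag.
function parse_system_entry(line::AbstractString)
    cols = split(strip(line))
    length(cols) == 3 || error("expected 3 columns, got $(length(cols))")
    return (word = String(cols[1]), freq = parse(Int, cols[2]), tag = String(cols[3]))
end

# User dictionary line: word, part-of-speech tag (no frequency column).
function parse_user_entry(line::AbstractString)
    cols = split(strip(line))
    length(cols) == 2 || error("expected 2 columns, got $(length(cols))")
    return (word = String(cols[1]), tag = String(cols[2]))
end

# Hypothetical entries; the frequencies are invented for illustration.
system_entries = [parse_system_entry(l) for l in ["北京 34488 ns", "天安门 1070 ns"]]
user_entry = parse_user_entry("云计算 n")

# A user word takes the maximum frequency found in the system dictionary.
max_freq = maximum(e.freq for e in system_entries)
merged = (word = user_entry.word, freq = max_freq, tag = user_entry.tag)
println(merged)
```

Giving user words the maximum frequency makes the segmenter prefer them over competing splits, which is why a word that needs a lower, specific frequency belongs in the system dictionary.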
The part-of-speech tagging function, <= or tagger(), uses a tagging worker to segment the text and then tag each word, using labels compatible with ICTCLAS. dict, hmm, and user should be provided when initializing the Jieba.jl worker.
words = "我爱北京天安门"
tagworker = worker("tag")
tagworker <= words
4x2 Array{UTF8String,2}:
"我" "r"
"爱" "v"
"北京" "ns"
"天安门" "ns"
The keyword extraction worker uses the MixSegment model to segment the text and the TF-IDF algorithm to find the keywords. dict, hmm, idf, and stop_words should be provided when initializing the Jieba.jl worker.
keys = worker("keywords")
keys.topn = 2
julia> keys.topn
2
keys <= "我爱北京天安门"
# keys <= "a_file_path.txt"
2x2 Array{Any,2}:
"天安门" 8.9954
"北京" 4.6674
The Simhash worker extracts keywords from an input and computes a simhash fingerprint from them. distance() does this for two inputs and then computes the Hamming distance between their fingerprints.
(
2x2 Array{Any,2}:
"长江大桥" 22.3853
"江州" 8.69667,
0xb2c6a622481d8eb2)
distance("江州市长江大桥参加了长江大桥的通车仪式" , "hello world!", simhasher)
((
2x2 Array{Any,2}:
"长江大桥" 22.3853
"江州" 8.69667,
0xb2c6a622481d8eb2),(
2x2 Array{Any,2}:
"hello" 11.7392
"world" 11.7392,
0x2482942840042428),0x0000000000000017)
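The last value in the result is the Hamming distance between the two fingerprints, that is, the number of bit positions in which they differ. A minimal check using only Base functions (the helper name hamming is hypothetical and not part of Jieba.jl):

```julia
# Hamming distance between two 64-bit fingerprints: count the differing bits.
hamming(a::UInt64, b::UInt64) = count_ones(xor(a, b))

# Fingerprints copied from the output above.
h1 = 0xb2c6a622481d8eb2
h2 = 0x2482942840042428

d = hamming(h1, h2)
println(d)   # 23, which is 0x17 in hexadecimal
```

A small distance means the two texts share most of their weighted keywords; 23 differing bits out of 64 is what one would expect from unrelated inputs such as these.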
- Support parallel programming on Windows, Linux, and Mac.
- Simple Natural Language Processing features.
https://github.com/qinwf/jiebaR