
Commit 133131a

qinwf committed
init
0 parents  commit 133131a

16 files changed, +1658 -0 lines changed

.Rbuildignore

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
1+
^.*\.Rproj$
2+
^\.Rproj\.user$

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
1+
.Rproj.user
2+
.Rhistory
3+
.RData

DESCRIPTION

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
1+
Package: jiebaRD
2+
Type: Package
3+
Title: Chinese Text Segmentation Data for jiebaR Package
4+
Description: jiebaR is a package for Chinese text segmentation, keyword extraction
5+
and part-of-speech tagging. This package provides the data files required by jiebaR.
6+
Version: 0.1
7+
Date: 2015-01-03
8+
Author: Qin Wenfeng
9+
Maintainer: Qin Wenfeng <[email protected]>
10+
License: MIT + file LICENSE
11+
Suggests:
12+
jiebaR
13+
URL: https://github.com/qinwf/jiebaRD/
14+
BugReports: https://github.com/qinwf/jiebaRD/issues
15+
NeedsCompilation: no

LICENSE

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
1+
YEAR: 2014-2015
2+
COPYRIGHT HOLDER: Qin Wenfeng

NAMESPACE

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
1+
exportPattern("^[[:alpha:]]+")
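The `exportPattern` directive above exports every object whose name matches the regular expression, which here means every name that starts with a letter. A minimal sketch of which names would match, using hypothetical object names that are not taken from this package:

```r
# Hypothetical object names; none of these are confirmed to exist in jiebaRD.
object_names <- c("segment", "DICTPATH", ".onLoad", "x1")

# Same regular expression as in the NAMESPACE file.
exported <- grepl("^[[:alpha:]]+", object_names)

# Names starting with a letter match; names starting with a dot do not.
print(setNames(exported, object_names))
```

Under this pattern, dot-prefixed helpers such as `.onLoad` stay internal without needing to be listed one by one.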

R/jiebaRD-package.r

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
1+
#' A package for Chinese text segmentation
2+
#'
3+
#' jiebaR is a package for Chinese text segmentation, keyword extraction
4+
#' and part-of-speech tagging. This package provides the data files required by jiebaR.
5+
#' jiebaR supports four segmentation modes: Maximum Probability, Hidden Markov Model,
6+
#' Query Segment and Mix Segment.
7+
#'
8+
#' You can add a custom dictionary to the jiebaR default dictionary.
9+
#' jiebaR can also identify new words, but adding your own new words will ensure higher
10+
#' accuracy.
11+
#'
12+
#' @docType package
13+
#' @name jiebaRD
14+
#' @author Qin Wenfeng <\url{http://qinwenfeng.com}>
15+
#' @references CppJieba \url{https://github.com/aszxqw/cppjieba};
16+
#' @seealso JiebaR \url{https://github.com/qinwf/jiebaR};
17+
NULL

inst/dict/README.md

Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
1+
# CppJieba Dictionaries
2+
3+
The file extension indicates the encoding of the dictionary.
4+
For example, filename.utf8 is UTF-8 encoded and filename.gbk is GBK encoded.
5+
6+
7+
## Word segmentation
8+
9+
### jieba.dict.utf8/gbk
10+
11+
The dictionary used for segmentation by the maximum probability method (MPSegment: Max Probability).
12+
13+
### hmm_model.utf8/gbk
14+
15+
The dictionary used for segmentation by the hidden Markov model (HMMSegment: Hidden Markov Model).
16+
17+
__MixSegment (which combines MPSegment and HMMSegment) uses both of the dictionaries above.__
18+
19+
20+
## Keyword extraction
21+
22+
### idf.utf8
23+
24+
IDF (Inverse Document Frequency)
25+
KeywordExtractor uses the classic TF-IDF algorithm, so it needs a dictionary like this one to supply the IDF values.
26+
27+
### stop_words.utf8
28+
29+
Stop-word dictionary.
30+
31+
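
The README above says keyword extraction relies on TF-IDF with an IDF dictionary and a stop-word list. The following is a minimal base-R sketch of that idea, not the CppJieba or jiebaR implementation. The IDF values, the stop words, the `tfidf_keywords` helper, and the fallback of using the mean IDF for unknown words are all assumptions made for illustration; the real data ships in `idf.utf8` and `stop_words.utf8`, whose formats are not shown in this commit.

```r
# Hypothetical IDF table and stop-word list (illustrative values only).
idf <- c(segmentation = 6.2, dictionary = 5.1, chinese = 4.3)
stop_words <- c("the", "a", "of")

# Assumed fallback weight for words that are missing from the IDF table.
default_idf <- mean(idf)

# Hypothetical helper: score already-segmented words by TF * IDF.
tfidf_keywords <- function(words, idf, stop_words, default_idf, top_n = 2) {
  words <- words[!(words %in% stop_words)]   # drop stop words first
  tf <- table(words)                         # term frequency per word
  terms <- names(tf)
  # Look up each term's IDF, falling back to the default when absent.
  weights <- ifelse(terms %in% names(idf), idf[terms], default_idf)
  scores <- as.numeric(tf) * weights
  names(scores) <- terms
  head(sort(scores, decreasing = TRUE), top_n)
}

tokens <- c("the", "chinese", "segmentation", "of", "chinese", "dictionary")
keywords <- tfidf_keywords(tokens, idf, stop_words, default_idf)
print(keywords)
```

In this sketch a word that appears often in the input but rarely across documents (high IDF) ranks highest, while stop words are never scored.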

inst/dict/backup.rda

210 Bytes
Binary file not shown.

inst/dict/hmm_model.zip

193 KB
Binary file not shown.

inst/dict/idf.zip

1.9 MB
Binary file not shown.
