New Word Detection: Graduation Project

Papers

Google Scholar

Query: unknown OR new word detection OR identification

CNKI

Query: 新词发现 (new word discovery), 新词识别 (new word recognition), 新词检测 (new word detection)

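Possible improvements

Rework the left/right entropy algorithm in StringFreq.
Stop computing PMI: it takes a long time and the result is never used.
Filtering of letter strings is currently disabled, and letter strings joined by hyphens are not recognized well.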
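
For readers unfamiliar with the left/right entropy mentioned above, here is a minimal sketch of the usual definition: the Shannon entropy of the characters seen immediately to the left and to the right of a candidate string. It is not the StringFreq implementation from this repo. The class and method names are made up for illustration, and the naive scan over the text is an assumption chosen for clarity, not speed.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of the repo: illustrates left/right (branch) entropy.
public class BranchEntropy {

    /** Shannon entropy (natural log) of a table of neighbor-character counts. */
    static double entropy(Map<Character, Integer> counts) {
        int total = 0;
        for (int c : counts.values()) {
            total += c;
        }
        if (total == 0) {
            return 0.0; // no neighbors observed
        }
        double h = 0.0;
        for (int c : counts.values()) {
            double p = (double) c / total;
            h -= p * Math.log(p);
        }
        return h;
    }

    /** Returns {leftEntropy, rightEntropy} of candidate over all its occurrences in text. */
    static double[] leftRightEntropy(String text, String candidate) {
        if (candidate.isEmpty()) {
            throw new IllegalArgumentException("candidate must not be empty");
        }
        Map<Character, Integer> left = new HashMap<>();
        Map<Character, Integer> right = new HashMap<>();
        int from = 0;
        int idx;
        // Overlapping occurrences are counted too (the scan advances by one char).
        while ((idx = text.indexOf(candidate, from)) >= 0) {
            if (idx > 0) {
                left.merge(text.charAt(idx - 1), 1, Integer::sum);
            }
            int end = idx + candidate.length();
            if (end < text.length()) {
                right.merge(text.charAt(end), 1, Integer::sum);
            }
            from = idx + 1;
        }
        return new double[] {entropy(left), entropy(right)};
    }

    public static void main(String[] args) {
        String text = "吃葡萄不吐葡萄皮，不吃葡萄倒吐葡萄皮";
        double[] e = leftRightEntropy(text, "葡萄");
        System.out.printf("left=%.3f right=%.3f%n", e[0], e[1]);
    }
}
```

A candidate with varied neighbors on both sides gets high entropy on both sides, which is the usual signal that it stands alone as a word. The sketch works on Java chars, so characters outside the Basic Multilingual Plane would need code-point handling.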

Basic usage (the repo already includes the model files, so no retraining is needed)

git clone https://github.com/chiyang10000/newWordDetection
cd newWordDetection
mvn package
./tar.sh
cd tar
./init.sh # install crfpp
java -cp target/detect.jar main.Main -i <input file>
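This generates four files in the current directory: per.txt, loc.txt, org.txt, and new.txt.
They hold the person names, place names, organization names, and new words found in the input file, respectively.
Here a new word means a word that does not appear in the People's Daily corpus for the first three months of 2000. The word list is in data/corpus/wordlist/renminribao.txt.wordlist. Its first line holds the words that occur and its second line holds their frequencies.
Edit this file to shrink or enlarge the base word list.
In each output file, the first line holds the matching person names, place names, organization names, or new words, the second line holds the context they appear in, and all other lines are debug information.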
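
To make the definition of a new word concrete, the sketch below keeps only the words that are missing from a base vocabulary. It illustrates the definition only: the repo itself detects new words with CRF models, and this class, its method, and the in-memory vocabulary are assumptions made for the example. Loading the vocabulary from renminribao.txt.wordlist is left out on purpose.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of the repo: "new" means "absent from the base word list".
public class NewWordFilter {

    /** Returns the words not found in baseVocabulary, deduplicated, in first-seen order. */
    static List<String> newWords(List<String> segmented, Set<String> baseVocabulary) {
        Set<String> found = new LinkedHashSet<>();
        for (String word : segmented) {
            if (!baseVocabulary.contains(word)) {
                found.add(word);
            }
        }
        return new ArrayList<>(found);
    }

    public static void main(String[] args) {
        // A tiny stand-in for the base word list.
        Set<String> base = new HashSet<>(Arrays.asList("我", "喜欢", "看"));
        List<String> segmented = Arrays.asList("我", "喜欢", "看", "短视频", "短视频");
        System.out.println(newWords(segmented, base));
    }
}
```

Under this definition, removing entries from the base word list makes more words count as new, and adding entries makes fewer count as new.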

1. Running

1.1 IDEA

Right-click the iml file to import it, then right-click pom.xml to import it.

1.2 Terminal

git clone https://github.com/chiyang10000/newWordDetection
cd newWordDetection
mvn package
./init.sh # install crfpp
java -server -cp target/*with-dependencies.jar <main.class>

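Replace <main.class> with one of the following:

dataProcess.Corpus

Generates the data.

crfModel.charBased

Trains the named entity recognition model.

crfModel.wordBased

Trains the out-of-vocabulary word recognition model.

evaluate.Test

Runs the tests.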

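2. File organization

data/

Raw data and cached data.
1. data/model/

  Trained model files.
2. data/raw/

  Raw data files.
3. data/crf-template

  crfpp template files.
4. data/corpus/

  Cached word list information.
5. data/jupyter

  Generates reports from info/.
6. data/test

  Test files generated by running dataProcess.Corpus.

library/

ansj dictionary files, used to correct some word segmentation errors.

tmp/

Temporary files created at runtime.

info/

Statistics from runs.

target/

The jar packages built by Maven.

tar.sh

Packages the files needed at runtime into the tar directory.

config.properties

Runtime parameters.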
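
As a rough illustration of how runtime parameters in a .properties file can be read from Java, the sketch below uses the standard java.util.Properties class. The key max_word_length is made up for the example and is not taken from this repo's config.properties. The class and method names are hypothetical as well.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper, not part of the repo: reads values in Java .properties syntax.
public class ConfigSketch {

    /** Parses text written in the standard key=value properties syntax. */
    static Properties parse(String content) {
        Properties props = new Properties();
        try (StringReader reader = new StringReader(content)) {
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Returns the integer stored under key, or fallback when the key is absent. */
    static int intOrDefault(Properties props, String key, int fallback) {
        String raw = props.getProperty(key);
        return raw == null ? fallback : Integer.parseInt(raw.trim());
    }

    public static void main(String[] args) {
        // "max_word_length" is an invented key used only for this example.
        Properties props = parse("max_word_length=6\n");
        System.out.println(intOrDefault(props, "max_word_length", 4));
        System.out.println(intOrDefault(props, "missing_key", 4));
    }
}
```

To read an actual file instead of a string, a reader from java.nio.file.Files.newBufferedReader can be passed to Properties.load in the same way.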
