Mine the data for the corpus
$ python weibomining.py
Extract the words from the corpus as a word list
$ python oovfinder.py
Compare the word list with the dictionary and extract the OOV words as a list
$ python isoov.py
Filter person names, organization names, and place names out of the OOV list to obtain the cleaned OOV list
$ python namefinder.py
$ python placefinder.py
$ python orgfinder.py
Mine an additional corpus from Weibo using the OOVs as keywords
$ python keywordcorpuscrawl.py
Merge the keyword corpus with the original corpus and split the text into words with jieba
$ python splitsystem.py
Train the model and calculate the similarity of each OOV
$ python modeltraining.py
As an additional experiment, input an OOV directly to obtain its semantic interpretation
$ python modeltraining.py
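The dictionary comparison and name filtering steps can be pictured with a minimal sketch like the one below. It is an illustration only: the function name `find_oov` and the sample words are hypothetical and are not taken from isoov.py or the name, place, and organization finder scripts, whose actual logic may differ.

```python
def find_oov(word_list, dictionary, named_entities=()):
    """Return words missing from the dictionary and not flagged as names.

    Hypothetical helper: keeps the first occurrence of each OOV word,
    in the order it appears in word_list.
    """
    known = set(dictionary)
    excluded = set(named_entities)
    seen = set()
    oov = []
    for word in word_list:
        if word in known or word in excluded or word in seen:
            continue
        seen.add(word)
        oov.append(word)
    return oov


# Toy data (assumed for illustration, not from the real corpus).
words = ["我", "新冠", "凤凰网", "肺炎", "新冠", "天才病"]
dictionary = {"我", "肺炎"}
organizations = {"凤凰网"}

cleaned_oov = find_oov(words, dictionary, organizations)
print(cleaned_oov)
```

Using sets for the dictionary and the named-entity lists keeps each membership check constant-time, which matters when the word list is large.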
Mutual information (MI)
The higher the mutual information between X and Y, the more strongly they are correlated and the more likely they are to form a word. The lower the mutual information, the weaker the correlation and the more likely there is a word boundary between X and Y.
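To make this concrete, here is a minimal sketch that estimates mutual information for adjacent units from raw counts. It assumes the common pointwise form MI(X, Y) = log2(P(XY) / (P(X) P(Y))); the exact formula and smoothing used in this project may differ, and the function name `pointwise_mi` and the toy string are hypothetical.

```python
import math
from collections import Counter


def pointwise_mi(tokens, x, y):
    """Estimate log2(P(xy) / (P(x) * P(y))) for the adjacent pair (x, y).

    Assumed formula (pointwise mutual information); returns -inf when
    the pair never occurs next to each other.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    pair_count = bigrams[(x, y)]
    if pair_count == 0:
        return float("-inf")
    p_xy = pair_count / (len(tokens) - 1)
    p_x = unigrams[x] / len(tokens)
    p_y = unigrams[y] / len(tokens)
    return math.log2(p_xy / (p_x * p_y))


# Toy character sequence (made up for illustration).
chars = list("新冠我新冠我新冠他我他")

mi_inside = pointwise_mi(chars, "新", "冠")    # characters inside a word
mi_boundary = pointwise_mi(chars, "冠", "我")  # characters across a boundary
print(mi_inside, mi_boundary)
```

In this toy sequence the pair inside '新冠' scores higher than the pair that straddles a boundary, which is the behaviour described above.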
Left and right entropy
W: a candidate word after N-gram segmentation.
A: a collection of all words appearing on the left of a candidate.
a: a word appearing on the left.
B: a collection of all words appearing on the right of a candidate.
b: a word appearing on the right.
The more varied the words that appear to the left and right of the candidate W (that is, the higher its left and right entropy), the more likely it is that W is a word.
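A minimal sketch of left and right entropy follows, assuming the usual Shannon entropy over the neighbours of W, i.e. E_L(W) = -Σ_{a∈A} P(a|W) log2 P(a|W) and likewise for B on the right. The function names and the toy token list are hypothetical and are not taken from the project scripts.

```python
import math
from collections import Counter


def neighbor_entropy(neighbors):
    """Shannon entropy (base 2) of the distribution of neighbouring words."""
    counts = Counter(neighbors)
    total = sum(counts.values())
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def left_right_entropy(tokens, candidate):
    """Return (left entropy, right entropy) of candidate in a token list."""
    left = [tokens[i - 1] for i in range(1, len(tokens)) if tokens[i] == candidate]
    right = [tokens[i + 1] for i in range(len(tokens) - 1) if tokens[i] == candidate]
    return neighbor_entropy(left), neighbor_entropy(right)


# Toy segmented text (made up for illustration).
tokens = ["感染", "新冠", "病毒", "预防", "新冠", "肺炎", "感染", "新冠", "疫情"]

left_h, right_h = left_right_entropy(tokens, "新冠")
print(left_h, right_h)
```

Here '新冠' has three different right neighbours but only two distinct left neighbours, so its right entropy is higher than its left entropy. A candidate that always appears with the same neighbour gets an entropy of 0.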
Class | OOV | Similar Words of OOV |
---|---|---|
A | 天才病 (Genius Disease) | 阿兹伯格综合症 (Asperger's Syndrome) |
B | 新冠 (COVID-19) | 感染 (Infection), 病毒 (Virus), 肺炎 (Pneumonia) |
C | 凤凰网 (Media Organization) | 应该 (Should), 讨论 (Discuss), 看法 (View) |
Example of '凤凰网' (a media organization) on the left and '新冠' (COVID-19) on the right. Because '凤凰网' usually appears at the end of news posts, its context carries little information and a lot of noise, so its meaning is difficult to predict. In contrast, '新冠' has rich contextual information, so its predicted similar words are relatively accurate.
This example shows how the CBOW and Skip-gram models understand '耗子尾汁'. Both models find semantically appropriate words, but the CBOW model assigns a higher similarity between the two words.
Model | A | B | C | Accuracy |
---|---|---|---|---|
CBOW | 21 | 13 | 1 | 97.10% |
Skip-gram | 17 | 14 | 4 | 88.57% |
Results for the OOV '耗子尾汁'
Word | Translation | Similarity |
---|---|---|
好自为之 | Take care of yourself | 0.99997896 |
吗 | particle (in Chinese) | 0.99997878 |
我 | I | 0.99997693 |
马保国 | Baoguo Ma | 0.99997658 |
又 | Also | 0.99997264 |
和 | and | 0.99997222 |
呢 | particle (in Chinese) | 0.99997193 |
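Similarity scores like those in the table are typically cosine similarities between word vectors. The sketch below shows how such a ranking could be computed, assuming cosine similarity is the measure used here. The three-dimensional vectors are invented toy values, not the trained embeddings, and `most_similar` is a hypothetical helper rather than a function from modeltraining.py.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(target, vectors, topn=3):
    """Rank every other word by cosine similarity to the target word."""
    scores = [
        (word, cosine_similarity(vectors[target], vec))
        for word, vec in vectors.items()
        if word != target
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Invented toy vectors, for illustration only.
toy_vectors = {
    "耗子尾汁": [0.9, 0.1, 0.3],
    "好自为之": [0.8, 0.2, 0.3],
    "马保国": [0.5, 0.5, 0.1],
    "天气": [-0.2, 0.9, -0.4],
}

ranked = most_similar("耗子尾汁", toy_vectors)
for word, score in ranked:
    print(word, round(score, 4))
```

With real embeddings the same ranking is applied over the whole vocabulary, and the top entries form a table like the one above.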
JiaKai Gu
E-mail: [email protected]
Jason J. Jung
Department of Computer Engineering, Chung-Ang University 84, Heukseok-ro, Dongjak-gu, Seoul, Republic of Korea 06974
Tel.: +82-2-820-5136
Fax: +82-2-820-5301
E-mail: [email protected]
@article{gu2022contextual,
author = {Gu, JiaKai and Li, Gen and Vo, Nam D. and Jung, Jason J.},
title = {Contextual Word2Vec Model for Understanding Chinese Out of Vocabularies on Online Social Media},
journal = {International Journal on Semantic Web and Information Systems (IJSWIS)},
volume = {18},
number = {1},
pages = {1-14},
ISSN = {1552-6283},
DOI = {10.4018/IJSWIS.309428},
url = {https://services.igi-global.com/resolvedoi/resolve.aspx?doi=10.4018/IJSWIS.309428},
year = {2022},
type = {Journal Article}
}