
In this project, the authors propose a contextual Word2Vec model for understanding out-of-vocabulary (OOV) words. OOV words are extracted using left-right entropy and mutual information. Word2Vec is used to construct the word vector space, and CBOW (continuous bag of words) is used to capture the contextual information of each word.





Chinese OOV recognition and understanding by contextual Word2Vec model



Project Introduction


How to Run

Mine data for the corpus.
Extract the words from the corpus into a word list.
Compare the word list with a dictionary and extract the OOV words into a list.
Filter person names, organization names, and place names out of the OOV list to obtain the cleaned OOV list.
Mine additional corpus data from Weibo using the OOV words as keywords.
Merge the keyword corpus with the original corpus and segment the words with jieba.
Train the model and calculate the similarity for each OOV word.
As an additional experiment, input an OOV word directly for semantic understanding.
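
The dictionary comparison and named-entity filtering steps above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `extract_oov`, the toy word lists, and the idea that named entities arrive as a ready-made set are all hypothetical and not taken from the project code.

```python
def extract_oov(corpus_words, dictionary, named_entities):
    """Return corpus words that are not in the dictionary and are not named entities.

    The order of first appearance is preserved and duplicates are dropped.
    """
    seen = set()
    oov = []
    for word in corpus_words:
        if word in dictionary or word in named_entities or word in seen:
            continue
        seen.add(word)
        oov.append(word)
    return oov


# Hypothetical toy data; real inputs would come from the mined corpus.
corpus_words = ["新冠", "病毒", "马保国", "新冠", "天才病", "感染"]
dictionary = {"病毒", "感染"}
named_entities = {"马保国"}  # assumed to be produced by some name-recognition step

cleaned_oov = extract_oov(corpus_words, dictionary, named_entities)
print(cleaned_oov)
```

Using sets for the dictionary and the named-entity list keeps each membership check cheap, which matters once the word list grows to corpus size.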

Word Extraction

Mutual information (MI)
The higher the mutual information between X and Y, the stronger their correlation and the more likely it is that X and Y form a word. The lower the mutual information, the weaker the correlation and the more likely it is that there is a boundary between X and Y.
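
As a rough illustration of this idea, the sketch below scores adjacent character pairs with pointwise mutual information, assuming the common definition PMI(X, Y) = log2(p(X, Y) / (p(X) p(Y))) with probabilities estimated from raw counts. The function name `pmi_table` and the toy string are hypothetical, and the project may estimate the probabilities differently.

```python
import math
from collections import Counter


def pmi_table(tokens):
    """Map each adjacent pair (x, y) to log2(p(x, y) / (p(x) * p(y)))."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    table = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi          # relative frequency of the pair
        p_x = unigrams[x] / n_uni    # relative frequency of each part
        p_y = unigrams[y] / n_uni
        table[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return table


# Hypothetical toy text, split into single characters.
tokens = list("新冠病毒新冠肺炎新冠感染")
scores = pmi_table(tokens)

# A pair inside a word ("病毒") scores higher than a pair that spans a boundary ("毒" + "新").
print(scores[("病", "毒")], scores[("毒", "新")])
```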
Left and right entropy

W: a candidate word after N-gram segmentation.
A: the collection of all words appearing to the left of the candidate.
a: a word appearing on the left.
B: the collection of all words appearing to the right of the candidate.
b: a word appearing on the right.
The more varied the words that appear around the candidate word W, the more likely it is that W is a word.
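
A minimal sketch of this measure, assuming the usual Shannon form E_L(W) = -Σ_{a∈A} P(a|W) log2 P(a|W) for the left side and the analogous sum over b in B for the right side. For simplicity it treats single neighboring characters as the left and right "words"; the function names and the toy string are hypothetical.

```python
import math
from collections import Counter


def neighbor_entropy(neighbors):
    """Shannon entropy (in bits) of the empirical neighbor distribution."""
    counts = Counter(neighbors)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def left_right_entropy(text, candidate):
    """Return (left entropy, right entropy) of a candidate string in text."""
    left, right = [], []
    start = text.find(candidate)
    while start != -1:
        end = start + len(candidate)
        if start > 0:
            left.append(text[start - 1])   # character just before the candidate
        if end < len(text):
            right.append(text[end])        # character just after the candidate
        start = text.find(candidate, start + 1)
    return neighbor_entropy(left), neighbor_entropy(right)


# Hypothetical toy string in which "新冠" appears with three different neighbors on each side.
text = "吃新冠药得新冠病防新冠疫"
print(left_right_entropy(text, "新冠"))
print(left_right_entropy(text, "冠"))
```

In this toy string the fragment "冠" is always preceded by "新", so its left entropy is zero, which marks it as an unlikely standalone word, while "新冠" has varied neighbors on both sides.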

Some Results

| Class | OOV | Similar Words of OOV |
|---|---|---|
| A | 天才病 (Genius Disease) | 阿兹伯格综合症 (Asperger's Syndrome) |
| B | 新冠 (COVID-19) | 感染 (infection), 病毒 (virus), 肺炎 (pneumonia) |
| C | 凤凰网 (Media Organization) | 应该 (should be), 讨论 (discuss), 看法 (view) |

The example compares '凤凰网' (a media organization, left) with '新冠' (COVID-19, right). Because '凤凰网' usually appears at the end of news items, its context carries little information and a lot of noise, so the meaning of the word is difficult to predict. In contrast, '新冠' is rich in contextual information, so its predicted similar words are relatively accurate.

A second example shows how the CBOW and Skip-gram models understand '耗子尾汁'. Both models find semantically appropriate words, but the similarity scores produced by the CBOW model are higher.

| Model | A | B | C | Accuracy |
|---|---|---|---|---|
| CBOW | 21 | 13 | 1 | 97.10% |
| Skip-gram | 17 | 14 | 4 | 88.57% |
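
The two models compared above differ in how they turn a sentence into training examples: CBOW predicts a word from its surrounding context, while Skip-gram predicts each context word from the center word. The sketch below only illustrates that difference; the function names, the window size, and the toy sentence are hypothetical and are not the project's training code.

```python
def cbow_examples(tokens, window=2):
    """Yield (context_words, target) pairs: CBOW predicts the target from its context."""
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            yield context, target


def skipgram_examples(tokens, window=2):
    """Yield (target, context_word) pairs: Skip-gram predicts each context word."""
    for context, target in cbow_examples(tokens, window):
        for word in context:
            yield target, word


# Hypothetical segmented sentence.
tokens = ["新冠", "病毒", "感染", "肺炎"]
cbow = list(cbow_examples(tokens, window=1))
skipgram = list(skipgram_examples(tokens, window=1))
print(cbow)
print(skipgram)
```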

Results for the OOV '耗子尾汁'

| Word | Translation | Similarity |
|---|---|---|
| 好自为之 | Take care of yourself | 0.99997896 |
| | particle (in Chinese) | 0.99997878 |
| | I | 0.99997693 |
| 马保国 | Baoguo Ma | 0.99997658 |
| | also | 0.99997264 |
| | and | 0.99997222 |
| | particle (in Chinese) | 0.99997193 |
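
Similarity scores like these are usually cosine similarities between word vectors. The source does not state the exact measure, so the sketch below is an assumption, and the three-dimensional vectors in it are made up purely for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def most_similar(query, vectors, topn=3):
    """Rank all other words by cosine similarity to the query word."""
    scores = [
        (word, cosine_similarity(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical toy vectors; a trained model would supply the real ones.
vectors = {
    "耗子尾汁": [1.0, 2.0, 0.5],
    "好自为之": [1.0, 2.1, 0.5],
    "病毒": [-1.0, 0.2, 3.0],
}
ranking = most_similar("耗子尾汁", vectors, topn=2)
print(ranking)
```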

About the Authors

JiaKai Gu
E-mail: [email protected]
Jason J. Jung
Department of Computer Engineering, Chung-Ang University 84, Heukseok-ro, Dongjak-gu, Seoul, Republic of Korea 06974
Tel.: +82-2-820-5136
Fax: +82-2-820-5301
E-mail: [email protected]

Cite this project

   author = {Gu, JiaKai and Li, Gen and Vo, Nam D. and Jung, Jason J.},
   title = {Contextual Word2Vec Model for Understanding Chinese Out of Vocabularies on Online Social Media},
   journal = {International Journal on Semantic Web and Information Systems (IJSWIS)},
   volume = {18},
   number = {1},
   pages = {1-14},
   ISSN = {1552-6283},
   DOI = {10.4018/IJSWIS.309428},
   url = { },
   year = {2022},
   type = {Journal Article}

Data source








