CRF-Cut: Sentence Segmentation

The objective of CRF-Cut (Conditional Random Fields - Cut) is to split text into sentences so that those sentences can be used in downstream processing.

Training takes sentences, tokenizes them into words, and assigns each word one of two labels: I (inside of sentence) or E (end of sentence).
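The labeling step can be sketched roughly as follows. This is a minimal illustration, not the repository's training script: the function name `label_sentences` is hypothetical, and the input is assumed to be already tokenized (in practice a tokenizer such as pythainlp's `word_tokenize` would produce the token lists).

```python
def label_sentences(sentences):
    """Flatten tokenized sentences into parallel token and label lists.

    Every token is labeled "I" (inside of sentence) except the last token
    of each sentence, which is labeled "E" (end of sentence).
    """
    tokens, labels = [], []
    for sent in sentences:
        if not sent:
            continue  # skip empty sentences
        tokens.extend(sent)
        labels.extend(["I"] * (len(sent) - 1) + ["E"])
    return tokens, labels


# Two pre-tokenized sentences (hypothetical example data)
sentences = [["this", "is", "one"], ["and", "another"]]
tokens, labels = label_sentences(sentences)
for tok, lab in zip(tokens, labels):
    print(tok, lab)
```

The flattened token sequence with its I/E labels is what a sequence labeler such as a CRF would be trained on.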

The results of CRF-Cut trained on different datasets are as follows:

dataset-train	dataset-validate	I-precision	I-recall	I-fscore	E-precision	E-recall	E-fscore	space-correct
Ted	Ted	0.99	0.99	0.99	0.74	0.70	0.72	0.82
Ted	Orchid	0.95	0.99	0.97	0.73	0.24	0.36	0.73
Ted	Fake review	0.98	0.99	0.98	0.86	0.70	0.77	0.78
Orchid	Ted	0.98	0.98	0.98	0.56	0.59	0.58	0.71
Orchid	Orchid	0.98	0.99	0.99	0.85	0.71	0.77	0.87
Orchid	Fake review	0.97	0.99	0.98	0.77	0.63	0.69	0.70
Fake review	Ted	0.99	0.95	0.97	0.42	0.85	0.56	0.56
Fake review	Orchid	0.97	0.96	0.96	0.48	0.59	0.53	0.67
Fake review	Fake review	1	1	1	0.98	0.96	0.97	0.97
Ted + Orchid + Fake review	Ted	0.99	0.98	0.99	0.66	0.77	0.71	0.78
Ted + Orchid + Fake review	Orchid	0.98	0.98	0.98	0.73	0.66	0.69	0.82
Ted + Orchid + Fake review	Fake review	1	1	1	0.98	0.95	0.96	0.96
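As a reading aid, the f-score columns appear consistent with the usual F1 score, the harmonic mean of precision and recall. That the table uses F1 is an assumption here; the sketch below only recomputes a few E-label rows from the table to check it. Because the reported precision and recall are already rounded, a recomputed value may differ from the reported one in the last digit for other rows.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (E-precision, E-recall) copied from the table, keyed by train/validate
rows = {
    "Ted/Ted": (0.74, 0.70),
    "Ted/Orchid": (0.73, 0.24),
    "Orchid/Orchid": (0.85, 0.71),
}
for name, (p, r) in rows.items():
    print(name, round(f1(p, r), 2))
```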

Google Colab:

Train 1 dataset: https://colab.research.google.com/drive/12nszk-N5LwpHzitlYvhNWVUDSBj30Z1Y
Train 3 datasets: https://colab.research.google.com/drive/1qPEuLZdNNsxhURn8HK7DLi7gbA84coWZ

Sentence Breaking Journal

What doesn't work

POS-perceptron
Larger features than window = 2, max_n_gram = 3
Number of verbs to the left and right
Rule-based override
L2 regularization - also not practical
POS-artagger - not really; too slow
ORCHID - different domains get totally different results

What to try

TNC

What worked

Fake "convolutions" of window = 2, max_n_gram = 3
L1 regularization of 1
Predict end of sentence (space) instead of beginning of sentence
Custom POS - only faster convergence
Try with ORCHID to compare performance more fairly - 87% vs 95% SOTA
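The "fake convolutions" item can be read as taking every word n-gram (up to `max_n_gram` words long) that falls inside a window of `window` tokens on each side of the current token. The sketch below is one possible interpretation under that assumption, not the repository's actual feature extractor; the function name, the feature-string format, and the `<pad>` token are hypothetical.

```python
def extract_features(tokens, window=2, max_n_gram=3, pad="<pad>"):
    """Build n-gram context features for each token.

    For every token, emit all n-grams (n = 1..max_n_gram) that lie fully
    inside the span [i - window, i + window], tagged with n and the
    n-gram's start offset relative to the current token.
    """
    padded = [pad] * window + list(tokens) + [pad] * window
    features = []
    for i in range(window, len(padded) - window):
        feats = ["bias"]
        for n in range(1, max_n_gram + 1):
            # last valid start keeps the n-gram inside the window
            for start in range(i - window, i + window - n + 2):
                gram = "|".join(padded[start:start + n])
                feats.append(f"word_{n}_{start - i}={gram}")
        features.append(feats)
    return features


feats = extract_features(["a", "b", "c"])
print(len(feats), len(feats[0]))
print(feats[1][:6])
```

With window = 2 and max_n_gram = 3, each token gets 5 unigram, 4 bigram, and 3 trigram features plus a bias term. The per-token feature lists are intended as input for a CRF trainer, where the L1 regularization noted above would be set.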

Requirements

pythainlp
python-crfsuite
pandas
numpy
scikit-learn
tqdm

vistec-AI/crfcut

Folders and files

Latest commit

History

Repository files navigation

CRF-Cut: Sentence Segmentation

Sentence Breaking Journal

What doesn't work

What to try

What worked

Requirements

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Languages

Packages