Putting an end to “It's all Greek to me.”
This is a classifier that identifies Greek text as Cypriot Greek (CG) or Standard Modern Greek (SMG).
For more information, you can read my thesis: A Classifier to Distinguish Between Cypriot Greek and Standard Modern Greek.
| Index of Jupyter Notebooks | |
|---|---|
| 1. Obtaining CG and SMG tweets | Code used to collect tweets |
| 2. Data Analysis | Analyzing the corpus |
| 3. Building the Classifier | Building the CG-SMG classifier |
The corpus can be found in the `Data` directory. I collected it myself and labeled it as CG or SMG by separating the text into files.
| Index of files in corpus | |
|---|---|
| CG Facebook | CG text collected from Facebook posts and comments |
| CG Twitter | CG text collected from tweets |
| CG Other | CG text collected from forum posts, blog and news article comments |
| SMG Facebook | SMG text collected from Facebook posts and comments |
| SMG Twitter | SMG text collected from tweets |
| SMG Other | SMG text collected from forum posts, blog and news article comments |
Feel free to use the corpus or a subset of it in any kind of project as long as you provide a link to this repository.
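
Since the corpus is labeled by file rather than by line, a loader has to derive each sample's label from the file it came from. The sketch below is one possible way to do that, not the code used in the notebooks. It assumes that file names in `Data` begin with `CG` or `SMG`, that files are UTF-8 encoded, and that each non-empty line is one sample. The function names `label_for` and `load_corpus` are hypothetical.

```python
from pathlib import Path


def label_for(path):
    """Return 'SMG' or 'CG' from the file name prefix, or None if neither matches.

    Assumption: corpus file names start with their label.
    """
    name = path.name.upper()
    for label in ("SMG", "CG"):
        if name.startswith(label):
            return label
    return None


def load_corpus(data_dir="Data"):
    """Return a list of (text, label) pairs from every labeled file in data_dir.

    Assumption: one sample per non-empty line, UTF-8 encoded.
    """
    samples = []
    for path in sorted(Path(data_dir).iterdir()):
        label = label_for(path) if path.is_file() else None
        if label is None:
            continue  # skip directories and files without a CG/SMG prefix
        for line in path.read_text(encoding="utf-8").splitlines():
            line = line.strip()
            if line:
                samples.append((line, label))
    return samples
```

Calling `load_corpus("Data")` from the repository root would then return the labeled samples, provided the file layout matches the assumptions above. Adjust `label_for` if the actual file names differ.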
To run the code, either clone the repository and run the Jupyter Notebooks locally, or click the Binder badge at the top of this readme to run the notebooks on a remote server. If you choose the latter option, you still need to call `nltk.download()` to download the required NLTK modules.
If you are only interested in running the classifier with your own text as input, go to the last section of 3. Building the Classifier.
H. Z. Sababa — hb20007 — [email protected]
Distributed under the MIT license. See `LICENSE` for more information.