NaiveBayesClassifier

A manual Naive Bayes Classifier for classifying spam and ham emails. Written in Python.

Made by Christos Kormaris

Programming Language: Python 3

Unzip the compressed file LingspamDataset.zip in the same directory where the Python files are. 700 train documents and 260 test documents will reside inside the uncompressed folder.

Feature Selection with Information Gain

Let's begin by denoting the variable C, which takes the value C=1 for spam documents and C=0 for ham documents. First, run the Python file feature_selection_using_ig.py to generate the feature tokens that we'll use:

```shell
python feature_selection_using_ig.py
```

Feature selection of the most useful feature tokens, using Information Gain (IG), has been implemented. The feature tokens are boolean: they take the value 0 if the token does not appear in a text and 1 if it does. The boolean values are assigned while generating the feature vector of each text.

At the start of the program, all the train files of the corpus are parsed and we count in how many spam documents, how many ham documents and how many documents overall each word appears. The results are saved in dictionaries named feature_spam_frequency, feature_ham_frequency and feature_frequency respectively, with the feature tokens as keys and the frequencies as values. These dictionaries are used to calculate the probabilities needed by the Information Gain algorithm. We calculate the entropy H(C) and print it. We then add each candidate feature token to the IG dictionary, calculating its IG score. Finally, we select the top m feature tokens, in terms of the highest Information Gain (IG) scores. The Information Gain score of each feature token is calculated using the formula:

$$IG(C, X) = H(C) - \sum_{i \in \{0, 1\}} P(X = i) \, H(C \mid X = i)$$

where i=0 and i=1 are the boolean values that a feature token may take, indicating whether or not it appears in a text. Concretely, the feature X that reduces the entropy the most (i.e. has the highest IG score) is the most desirable candidate feature, because it discriminates the category of a document more effectively. The number of feature tokens that the feature selection algorithm returns is set to 1000. The number of feature tokens to use depends on the classification task we want to execute; for our Naive-Bayes spam-ham classifier, 1000 feature tokens is a good choice.
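As a rough illustration of this step, here is a minimal sketch of how the IG score could be computed, assuming the frequency dictionaries described above (feature_frequency, feature_spam_frequency) and the numbers of spam and ham train documents; the function names and the remaining variable names are illustrative, not taken from the repository:

```python
import math


def entropy(probabilities):
    """Shannon entropy of a discrete distribution, in bits."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def information_gain(token, feature_frequency, feature_spam_frequency,
                     num_spam_docs, num_ham_docs):
    """IG(C, X) = H(C) - sum over i in {0, 1} of P(X=i) * H(C | X=i)."""
    total_docs = num_spam_docs + num_ham_docs
    h_c = entropy([num_spam_docs / total_docs, num_ham_docs / total_docs])

    docs_with_token = feature_frequency.get(token, 0)        # documents where X = 1
    spam_with_token = feature_spam_frequency.get(token, 0)
    docs_without_token = total_docs - docs_with_token         # documents where X = 0
    spam_without_token = num_spam_docs - spam_with_token

    ig = h_c
    for docs, spam in ((docs_with_token, spam_with_token),
                       (docs_without_token, spam_without_token)):
        if docs == 0:
            continue
        p_x = docs / total_docs
        h_c_given_x = entropy([spam / docs, (docs - spam) / docs])
        ig -= p_x * h_c_given_x
    return ig


# Keep the m = 1000 tokens with the highest IG scores (names are illustrative):
# ig_scores = {t: information_gain(t, feature_frequency, feature_spam_frequency,
#                                  num_spam_docs, num_ham_docs)
#              for t in feature_frequency}
# selected_tokens = sorted(ig_scores, key=ig_scores.get, reverse=True)[:1000]
```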

In addition, inside block comments, there is an alternative way of calculating the Information Gain score, using the following formula:

$$score(X) = \left| P(X = 1 \mid C = 1) - P(X = 1 \mid C = 0) \right|$$

which is the absolute difference between the conditional probabilities of the token for the 2 classes (spam and ham). The tokens for which this absolute difference is bigger are selected as the feature tokens of the feature vectors, based on which we will classify our corpus files. Using the feature tokens selected by this formula, slightly worse accuracy has been observed. At the end of the program feature_selection_using_ig.py, two dictionaries are created, containing the feature tokens that will be used by the Naive-Bayes classifier; they are saved in the files spam_feature_dictionary.txt and ham_feature_dictionary.txt.
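For comparison, a minimal sketch of this alternative score, under the same assumptions about the frequency dictionaries and document counts (the function name is illustrative):

```python
def absolute_difference_score(token, feature_spam_frequency, feature_ham_frequency,
                              num_spam_docs, num_ham_docs):
    """|P(X=1 | C=1) - P(X=1 | C=0)|: how differently a token is distributed
    between the spam and the ham train documents."""
    p_token_given_spam = feature_spam_frequency.get(token, 0) / num_spam_docs
    p_token_given_ham = feature_ham_frequency.get(token, 0) / num_ham_docs
    return abs(p_token_given_spam - p_token_given_ham)
```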

Perform Naive-Bayes Classification

After the feature selection step, run naive_bayes_classifier.py to start the classification process:

```shell
python naive_bayes_classifier.py
```

First, the classifier counts the frequency of each feature token in the spam documents, and then in the ham documents. We use boolean features. The total frequency of each spam and ham feature token is saved in two dictionaries: token_frequencies_in_spam_class and token_frequencies_in_ham_class. We also count how many words the documents of the spam class contain and how many words the ham documents contain.

We use the acquired frequencies to estimate the Laplace probabilities that determine the correct category of the test data. The classifier calculates the conditional probability of each test document's feature vector for each of the 2 categories (spam or ham) and assigns the document to the class for which the conditional probability is higher, i.e. for the feature vector Xi of the test document i we calculate the probabilities p(Xi|C=1) and p(Xi|C=0) (reminder: C=1 for spam, C=0 for ham). The class with the higher probability will be the predicted class of the test document. For the calculation of the probabilities we also use Laplace smoothing, adding 1 to the numerator and |V| to the denominator (where |V| is the number of distinct words in the corpus, i.e. the vocabulary size of the corpus). This is the formula used for calculating the probability of feature token i belonging to a spam document:

$$P(X_i \mid C = 1) = \frac{\text{frequency of token } i \text{ in spam documents} + 1}{\text{total words in spam documents} + |V|}$$
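A small sketch of how this smoothed probability could be computed in code, assuming the token_frequencies_in_spam_class / token_frequencies_in_ham_class dictionaries mentioned above; the function and the remaining variable names are illustrative:

```python
def laplace_token_probability(token, token_frequencies_in_class,
                              total_words_in_class, vocabulary_size):
    """Laplace-smoothed P(token | class):
    (frequency of the token in the class + 1) / (total words in the class + |V|)."""
    count = token_frequencies_in_class.get(token, 0)
    return (count + 1) / (total_words_in_class + vocabulary_size)


# For the spam class (names other than the dictionary are illustrative):
# p_token_spam = laplace_token_probability(token, token_frequencies_in_spam_class,
#                                          spam_class_word_count, vocabulary_size)
```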

To calculate the probability of the entire feature vector belonging to the spam class we multiply the probability of each separate feature token belonging to the spam class. The exact formula is:

$$P(C = 1 \mid \text{featureVector}) = \frac{P(C = 1) \, \prod_{i} P(X_i \mid C = 1)}{P(\text{featureVector})}$$

We do the same for the ham class. We can omit the term P(featureVector), since it is the same for both classes. Whichever of the 2 probabilities is bigger indicates the category that the given test document (and its feature vector) is more likely to belong to. Also, for numerical stability it is a good idea to work in log space, summing the logarithms of the probabilities instead of multiplying the probabilities directly (the logsumexp trick can be used when a sum of such terms is needed), which avoids floating-point underflow.
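To make the decision step concrete, here is a hedged sketch of how a test document could be classified in log space; the function name, the probability dictionaries and the priors are assumptions for illustration, not the repository's actual code:

```python
import math


def classify(feature_vector, selected_tokens, spam_token_probs, ham_token_probs,
             p_spam, p_ham):
    """Predict 1 (spam) or 0 (ham) by comparing log posteriors.

    feature_vector maps tokens to 0/1; spam_token_probs and ham_token_probs map
    tokens to their Laplace-smoothed P(token | class); p_spam and p_ham are the
    class priors. Only tokens present in the document contribute here, which is
    one possible simplification of the boolean model.
    """
    log_p_spam = math.log(p_spam)
    log_p_ham = math.log(p_ham)
    for token in selected_tokens:
        if feature_vector.get(token, 0) == 1:
            log_p_spam += math.log(spam_token_probs[token])
            log_p_ham += math.log(ham_token_probs[token])
    return 1 if log_p_spam > log_p_ham else 0
```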

The execution results of the classifier delivered an accuracy of 96.92 %, using 1000 feature tokens, and also the same accuracy using 100 feature tokens.

Notes

  • The execution results indicate that the Naive-Bayes classifier achieves higher accuracy on a small corpus than on a big corpus.
  • Console execution results are included in the console_outputs folder.
