Terrence Li (Langer81) and Hunter S. DiCicco (dicicch)
Under the direction of Dongwon Lee, Ph.D., and S. Shyam Sundar, Ph.D., with the SysFake team in Penn State's Department of Journalism
For information about how this research is supported, please visit the homepage linked above.
In an online news environment where conflicting interests vie for readers' attention, it is increasingly difficult for the average consumer to distinguish factual news from content that is sensational, fake, or otherwise misleading. We envision a capability (say, a browser extension) that applies statistical learning to the linguistic and metadata features of an article in question to give users an estimate of whether the article is genuine.
The goal of this project is to train and validate a multinomial C-Support Vector Machine for the purpose of classifying vectorized news articles under the following categories:
- 1: Real
- 2: Fake
- 3: Opinion
- 5: Polarized
- 7: Satire
- 9: Promotional Content
- 11: Corrections
In doing so, we hope to produce a characteristic model built from only the significant contributions of the most representative features.
For more details on how these labels are defined, see the project's companion paper, *"Fake News" is not simply False Information: A Concept Explication and Taxonomy of Online Content*.
The resulting model will then be tested against humans in an experiment in which both will be asked to classify the same set of news articles.
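A minimal sketch of the multinomial C-SVM described above, assuming scikit-learn. The feature vectors and labels here are synthetic stand-ins for the real article vectors produced by feature_extraction.ArticleVector; the feature count and hyperparameters are illustrative only.

```python
import numpy as np
from sklearn.svm import SVC

# The numeric taxonomy labels used by this project:
# real, fake, opinion, polarized, satire, promotional, corrections
LABELS = [1, 2, 3, 5, 7, 9, 11]

rng = np.random.default_rng(0)
X = rng.normal(size=(70, 17))   # 70 toy "articles", 17 illustrative features
y = rng.choice(LABELS, size=70)

# C-Support Vector Classification; scikit-learn handles the multiclass
# case internally (one-vs-one voting between the label pairs).
clf = SVC(kernel="rbf", C=1.0)
clf.fit(X, y)
preds = clf.predict(X)
```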
feature_extraction.py
- feature_extraction.ArticleVector is the class that handles vectorization. The definitions of the individual features (vector elements) are in its fill_vector() method.
classifier.py
- Handles SVM initialization.
validator.py
- Contains useful validation routines.
data
- Contains final vectors, some raw data and some intermediate data. This will be reorganized in the near future.
Features from the companion explication paper that have been implemented so far:
- whether the URL ending is reputable (taken from "reputable_news_sources.txt") | boolean
- whether the URL is from a reputable news source | boolean
- number of times "Today" is written / total number of words | float
- number of grammar mistakes | int
- number of quotations / total number of words | float
- number of past-tense instances / total number of words | float
- number of present-tense instances / total number of words | float
- number of times "should" is written / total number of words | float
- whether "opinion" appears in the URL | boolean
- number of all-caps words / total number of words | float
- whether the URL is from a satire news source | boolean
- number of APA style errors | int
- number of proper nouns / total number of words | float
- number of interjections / total number of words | float
- number of times "you" occurs / total number of words | float
- whether the URL has a ".gov" ending | boolean
- whether the URL is from an unreputable site (taken from "unreputable_news_sources.txt") | boolean
Important features that have not been implemented:
- Presence of fact-checking
- Verification of impartial reporting
- Narrative conflict
- Human-centric writing
- Prominence
- Written by named, publicly known news staff
- Presence of an About Us section
- Presence of emotionally charged words
- Metadata
- Un/verified sources listed
metric | real | fake | opinion | polarized | satire
---|---|---|---|---|---
recall | 0.70 | 0.96 | 0.03 | 0.30 | 0.90
precision | 0.83 | 0.88 | 0.50 | 0.25 | 0.51
f1 | 0.76 | 0.91 | 0.06 | 0.27 | 0.65
misclassified | 68 | 10 | 217 | 156 | 24
57.59 percent correct overall (645 of 1120), averaged over 10 trials of 5-fold cross-validation with randomized parameters.
These results are from the officially supported model: a logistic (log-loss) stochastic gradient descent learner applied over a Nyström approximation of the RBF kernel on the taxonomy dataset.
Recall: tp / (tp + fn)
Precision: tp / (tp + fp)
F1: 2 * (precision * recall) / (precision + recall)
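The metric definitions above, written out as functions. The counts in the examples are arbitrary and are not taken from the tables in this README.

```python
def recall(tp, fn):
    """True positives over all actual positives."""
    return tp / (tp + fn) if tp + fn else 0.0

def precision(tp, fp):
    """True positives over all predicted positives."""
    return tp / (tp + fp) if tp + fp else 0.0

def f1(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if p + r else 0.0
```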
class | precision | recall | f1-score | support
---|---|---|---|---
real | 0.55 | 0.98 | 0.71 | 319
fake | 0.53 | 0.53 | 0.53 | 528
opinion | 0.88 | 0.16 | 0.27 | 313
polarized | 0.52 | 0.39 | 0.44 | 383
satire | 0.49 | 0.79 | 0.61 | 204
promotional | 0.00 | 0.00 | 0.00 | 22
accuracy | | | 0.54 | 1769
macro avg | 0.50 | 0.48 | 0.43 | 1769
weighted avg | 0.58 | 0.54 | 0.50 | 1769
And the associated confusion matrix:
true \ predicted | real | fake | opinion | polarized | satire | promotional
---|---|---|---|---|---|---
real | 312 | 4 | 0 | 3 | 0 | 0
fake | 86 | 282 | 0 | 114 | 46 | 0
opinion | 132 | 42 | 51 | 9 | 79 | 0
polarized | 25 | 183 | 5 | 148 | 22 | 0
satire | 11 | 20 | 2 | 9 | 162 | 0
promotional | 0 | 0 | 0 | 0 | 22 | 0
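Reports and confusion matrices like the ones above can be produced with scikit-learn's metrics module. The labels here are a toy stand-in for the taxonomy test set, not the project's data.

```python
from sklearn.metrics import classification_report, confusion_matrix

y_true = [1, 1, 2, 2, 3, 3, 5, 7]   # toy ground-truth taxonomy labels
y_pred = [1, 2, 2, 2, 3, 1, 5, 7]   # toy model predictions

# Rows are true labels, columns are predicted labels
cm = confusion_matrix(y_true, y_pred, labels=[1, 2, 3, 5, 7])

report = classification_report(
    y_true, y_pred, labels=[1, 2, 3, 5, 7],
    target_names=["real", "fake", "opinion", "polarized", "satire"],
    zero_division=0,
)
print(report)
```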
To use the classifier, first collect data with the prepare_data() method from classifier.py. Its input is a dictionary mapping data text files (keys) to their corresponding labels; see training_file_dict for an example.
```python
support_vector_machine = classifier.svm_classifier(train_X_uncombined, train_Y_uncombined)
svm_predictions = classifier.run_predictions(support_vector_machine, test_X_uncombined, test_Y_uncombined)
get_statistics(test_Y_uncombined, svm_predictions)
validate(support_vector_machine, test_X_uncombined, test_Y_uncombined)
```
These lines run the classifier for validation.
Important note: data is separated into URLs and vectors, each split into training and testing sets. There is currently no centralized collection of data (this will be rectified soon). For example, the "Fake News" data has 5 files:
- fake_news_urls-testing.txt - text file with fake news urls separated by spaces for testing
- fake_news_urls-training.txt - text file with fake news urls separated by spaces for training
- fake_news_urls.txt - All fake news URLs compiled into one text file.
- fake_news_vectors-testing.txt - The fake news testing URLs from fake_news_urls-testing.txt, vectorized into their respective features.
- fake_news_vectors-training.txt - The fake news training URLs from fake_news_urls-training.txt, vectorized into their respective features.
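A minimal loader for the space-separated URL files described above. The loader itself is a hedged sketch, not code from the repo; the demo writes a throwaway file rather than assuming a real data path.

```python
import os
import tempfile

def load_urls(path):
    """Read a whitespace-separated URL file into a list of URLs."""
    with open(path) as fh:
        return fh.read().split()

# Demo with a temporary file standing in for, e.g., fake_news_urls-training.txt
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as fh:
    fh.write("https://a.example/one https://b.example/two")
    demo_path = fh.name

urls = load_urls(demo_path)
os.unlink(demo_path)  # clean up the throwaway file
```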
name | cell | email
---|---|---
Terence G. Li | 814-308-4495 | [email protected] |
Hunter S. DiCicco | 609-815-5122 | [email protected] |