Terrence Li (Langer81) and Hunter S. DiCicco (dicicch)
Under the direction of Dongwon Lee, Ph.D., and S. Shyam Sundar, Ph.D., with the SysFake team in Penn State's Department of Journalism
For information about how this research is supported, please visit the homepage linked above.
In an online news environment where conflicting interests vie for readers' attention, it is increasingly difficult for the average consumer to distinguish factual news from content that is sensational, fake, or otherwise misleading. We envision a capability (say, a browser extension) that applies statistical learning to the linguistic and metadata features of an article in question to give users an estimate of whether the article is genuine.
The goal of this project is to train and validate a multinomial C-Support Vector Machine for the purpose of classifying vectorized news articles under the following categories:
- 1: Real
- 2: Fake
- 3: Opinion
- 5: Polarized
- 7: Satire
- 9: Promotional Content
- 11: Corrections
In doing so, we hope to produce a characteristic model built from only the significant contributions of the most representative features.
For more details on how these labels are defined, see the project's companion paper, *"Fake News" is not simply False Information: A Concept Explication and Taxonomy of Online Content*.
The resulting model will then be tested against humans in an experiment in which both will be asked to classify the same set of news articles.
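A minimal sketch of the multinomial C-SVM described above, assuming scikit-learn. The feature vectors and labels here are synthetic stand-ins for the real article vectors produced by feature_extraction.ArticleVector; the feature count and hyperparameters are illustrative only.

```python
import numpy as np
from sklearn.svm import SVC

# The numeric taxonomy labels used by this project:
# real, fake, opinion, polarized, satire, promotional, corrections
LABELS = [1, 2, 3, 5, 7, 9, 11]

rng = np.random.default_rng(0)
X = rng.normal(size=(70, 17))   # 70 toy "articles", 17 illustrative features
y = rng.choice(LABELS, size=70)

# C-Support Vector Classification; scikit-learn handles the multiclass
# case internally (one-vs-one voting between the label pairs).
clf = SVC(kernel="rbf", C=1.0)
clf.fit(X, y)
preds = clf.predict(X)
```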
feature_extraction.py
- feature_extraction.ArticleVector is the class that handles vectorization. The definitions of the individual features (vector elements) are in its fill_vector() method.
classifier.py
- Handles SVM initialization.
validator.py
- Contains useful validation routines.
data
- Contains final vectors, some raw data and some intermediate data. This will be reorganized in the near future.
Features from the companion explication paper that have been implemented so far:
- whether the URL ending is reputable (taken from "reputable_news_sources.txt") | boolean
- whether the URL is from a reputable news source | boolean
- number of times "Today" is written / total number of words | float
- number of grammar mistakes | int
- number of quotations / total number of words | float
- number of past-tense instances / total number of words | float
- number of present-tense instances / total number of words | float
- number of times "should" is written / total number of words | float
- whether "opinion" appears in the URL | boolean
- number of all-caps words / total number of words | float
- whether the URL is from a satire news source | boolean
- number of APA style errors | int
- number of proper nouns / total number of words | float
- number of interjections / total number of words | float
- number of times "you" occurs / total number of words | float
- whether the URL has a ".gov" ending | boolean
- whether the URL is from an unreputable site (taken from "unreputable_news_sources.txt") | boolean
Important features that have not been implemented:
- Presence of fact-checking
- Verification of impartial reporting
- Narrative conflict
- Human-centric writing
- Prominence
- Written by named, publicly known news staff
- Presence of an About Us section
- Presence of emotionally charged words
- Metadata
- Un/verified sources listed
metric | real | fake | opinion | polarized | satire
---|---|---|---|---|---
recall | 0.70 | 0.96 | 0.03 | 0.30 | 0.90
precision | 0.83 | 0.88 | 0.50 | 0.25 | 0.51
f1 | 0.76 | 0.91 | 0.06 | 0.27 | 0.65
misclassified | 68 | 10 | 217 | 156 | 24
57.59 percent correct overall (645 of 1120), averaged over 10 trials of 5-fold cross-validation with randomized parameters.
These results are from the officially supported model: a logistic (log-loss) stochastic gradient descent learner applied over a Nyström approximation of the RBF kernel on the taxonomy dataset.
Recall: tp / (tp + fn)
Precision: tp / (tp + fp)
F1: 2 * (precision * recall) / (precision + recall)
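The metric definitions above, written out as functions. The counts in the examples are arbitrary and are not taken from the tables in this README.

```python
def recall(tp, fn):
    """True positives over all actual positives."""
    return tp / (tp + fn) if tp + fn else 0.0

def precision(tp, fp):
    """True positives over all predicted positives."""
    return tp / (tp + fp) if tp + fp else 0.0

def f1(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if p + r else 0.0
```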
class | precision | recall | f1-score | support
---|---|---|---|---
real | 0.55 | 0.98 | 0.71 | 319
fake | 0.53 | 0.53 | 0.53 | 528
opinion | 0.88 | 0.16 | 0.27 | 313
polarized | 0.52 | 0.39 | 0.44 | 383
satire | 0.49 | 0.79 | 0.61 | 204
promotional | 0.00 | 0.00 | 0.00 | 22
accuracy | | | 0.54 | 1769
macro avg | 0.50 | 0.48 | 0.43 | 1769
weighted avg | 0.58 | 0.54 | 0.50 | 1769
And the associated confusion matrix:
true \ predicted | real | fake | opinion | polarized | satire | promotional
---|---|---|---|---|---|---
real | 312 | 4 | 0 | 3 | 0 | 0
fake | 86 | 282 | 0 | 114 | 46 | 0
opinion | 132 | 42 | 51 | 9 | 79 | 0
polarized | 25 | 183 | 5 | 148 | 22 | 0
satire | 11 | 20 | 2 | 9 | 162 | 0
promotional | 0 | 0 | 0 | 0 | 22 | 0
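Reports and confusion matrices like the ones above can be produced with scikit-learn's metrics module. The labels here are a toy stand-in for the taxonomy test set, not the project's data.

```python
from sklearn.metrics import classification_report, confusion_matrix

y_true = [1, 1, 2, 2, 3, 3, 5, 7]   # toy ground-truth taxonomy labels
y_pred = [1, 2, 2, 2, 3, 1, 5, 7]   # toy model predictions

# Rows are true labels, columns are predicted labels
cm = confusion_matrix(y_true, y_pred, labels=[1, 2, 3, 5, 7])

report = classification_report(
    y_true, y_pred, labels=[1, 2, 3, 5, 7],
    target_names=["real", "fake", "opinion", "polarized", "satire"],
    zero_division=0,
)
print(report)
```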
To use the classifier, first collect data with the prepare_data() method from classifier.py. Its input is a dictionary mapping data text files (keys) to their corresponding labels; see training_file_dict for an example.
```python
support_vector_machine = classifier.svm_classifier(train_X_uncombined, train_Y_uncombined)
svm_predictions = classifier.run_predictions(support_vector_machine, test_X_uncombined, test_Y_uncombined)
get_statistics(test_Y_uncombined, svm_predictions)
validate(support_vector_machine, test_X_uncombined, test_Y_uncombined)
```
These lines run the classifier for validation.
Important note: data is separated into URLs and vectors, each split into training and testing sets. There is currently no centralized collection of data (this will be rectified soon). For example, the "Fake News" data has 5 files:
- fake_news_urls-testing.txt - text file with fake news urls separated by spaces for testing
- fake_news_urls-training.txt - text file with fake news urls separated by spaces for training
- fake_news_urls.txt - All fake news URLs compiled into one text file.
- fake_news_vectors-testing.txt - The fake news testing URLs from fake_news_urls-testing.txt, vectorized into their respective features.
- fake_news_vectors-training.txt - The fake news training URLs from fake_news_urls-training.txt, vectorized into their respective features.
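A minimal loader for the space-separated URL files described above. The loader itself is a hedged sketch, not code from the repo; the demo writes a throwaway file rather than assuming a real data path.

```python
import os
import tempfile

def load_urls(path):
    """Read a whitespace-separated URL file into a list of URLs."""
    with open(path) as fh:
        return fh.read().split()

# Demo with a temporary file standing in for, e.g., fake_news_urls-training.txt
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as fh:
    fh.write("https://a.example/one https://b.example/two")
    demo_path = fh.name

urls = load_urls(demo_path)
os.unlink(demo_path)  # clean up the throwaway file
```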
name | cell | email
---|---|---
Terence G. Li | 814-308-4495 | [email protected] |
Hunter S. DiCicco | 609-815-5122 | [email protected] |