Supplementary Materials for "Detecting Harmful Medical Advice by Analyzing the Characteristics of Retweeters"
This repo is home to the supplementary materials for my final project for CS8396: Data Privacy in Biomedicine (Spring 2020) under Dr. Bradley Malin at Vanderbilt University. From the abstract:
I study the ability of a model to discern, for tweets about the COVID-19 crisis and based on the characteristics of users who retweet them, whether or not a given article or tweet provides beneficial medical advice or could lead to a harmful outcome by promoting harmful medical practices or providing incomplete information that could lead to panicked action against the current medical wisdom. The model analyzes the characteristics of people who retweet the article, the pattern of how the article is retweeted, and what Twitter users say when retweeting the article. The study aims to support future work to identify and reduce the spread of panic-inducing misinformation and disinformation in an effort to help authorities better respond to health-threatening epidemics and pandemics.
The full paper can be found in this repo.
The code in this repo contains the Jupyter notebook used to train and test the model as well as an additional notebook used to split the dataset into train and test segments. The tweet-uploader folder contains the code used to upload the tweets (contained in line-delimited JSON files downloaded using twarc) to MongoDB, and the tweet-uploader/viewer folder contains the simple web application used to rate these tweets.
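To give a feel for the data handling described above, here is a minimal sketch in Python, not the code in this repo. It reads line-delimited JSON (one tweet object per line, as twarc writes it) and splits the result into train and test segments. The function names, the 80/20 split, the fixed seed, and the sample fields are all assumptions made for illustration.

```python
import io
import json
import random


def read_jsonl(stream):
    """Yield one tweet dict per non-empty line of a line-delimited JSON stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def split_train_test(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of items with a fixed seed and cut it into (train, test)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in for a twarc output file; the fields and values are placeholders.
sample = io.StringIO(
    '{"id_str": "1", "full_text": "first"}\n'
    '\n'
    '{"id_str": "2", "full_text": "second"}\n'
    '{"id_str": "3", "full_text": "third"}\n'
    '{"id_str": "4", "full_text": "fourth"}\n'
    '{"id_str": "5", "full_text": "fifth"}\n'
)
tweets = list(read_jsonl(sample))
train, test = split_train_test(tweets, test_fraction=0.2, seed=0)
print(len(tweets), len(train), len(test))
```

In a Python setting, the parsed dicts could then be handed to a MongoDB client (for example pymongo's `insert_many`), though the uploader in this repo may do this differently. Fixing the seed keeps the split reproducible between runs.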
This study does not reveal groundbreaking findings, but it did allow me to investigate a side of computer science that I had not explored before: machine learning. While developing the model used in this paper, I trudged through deep learning models in an attempt to emulate the work done by Y. Liu and Y.-F. Wu. Without any prior experience in this field, I was ultimately unable to get a meaningful result from PyTorch, but I later discovered simpler machine learning techniques (logistic regression and random forests) that led to a more meaningful result. The actual tweets used to train this model cannot be shared due to the terms of service for Twitter's API, but the tweet IDs, along with my rating for each tweet, can be found in the tweets.json file.
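For readers as new to this as I was, the simpler of those two techniques fits in a few lines. The sketch below is a from-scratch logistic regression in plain Python, trained by batch gradient descent; it is not the notebook's code, which may well rely on a library instead. The two features per tweet, the labels, the learning rate, and the epoch count are all invented for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the mean log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            # Prediction error for this example: p(harmful) minus the label.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Return 1 if the predicted probability is at least 0.5, else 0."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0


# Invented toy data: two made-up, pre-scaled features per tweet.
X = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.7, 0.8], [0.8, 0.7], [0.9, 0.9]]
y = [0, 0, 0, 1, 1, 1]  # 1 = rated harmful, 0 = not (illustrative labels)

w, b = train_logistic(X, y)
predictions = [predict(w, b, x) for x in X]
print(predictions)
```

The toy data is linearly separable, so the fitted model should recover the labels it was trained on; real retweeter features would be noisier and would call for a held-out test set.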
My hope at the conclusion of this project is to continue investigating machine learning technologies. After using these models, I understand quite a bit more about the statistics behind them, and they not only make a lot more sense but also seem quite a bit more useful and less "magic" to me.