# 👑 Coronada: Tweets about COVID-19 👑

## 👑 About this repository 👑

I got curious about what people were saying about this virus. So, I scraped a bunch of data off Twitter using a variety of search terms suggested to me in this Twitter thread. This package is designed to work in Python 3, because Python 2 is deprecated. Update your Python and join the future, friends!

To install this package, take the following steps:

  1. Clone this repository wherever you clone your repositories: `git clone https://github.com/BayesForDays/coronada.git`
  2. Go to the cloned directory: `cd ./coronada/`
  3. Create a new virtual environment in the repository directory and activate it.
  4. Install this package with pip: `pip install -e . --upgrade`
  5. Enjoy! (Do not profit! Don't be a jerk.)

## 👑 Scraping process 👑

I have provided code that allows you to initiate your own scraping process, since I only gathered a sample of ~2.2 million tweets over the course of about 12 hours on March 13-14, 2020 (Central Daylight Time). The script `corona_swabber.py` (get it? 🤒) takes your Twitter API credentials as arguments and prints the keyword search results as they stream in.

You will need the following:

  - Consumer key
  - Consumer key secret
  - Access token
  - Access token secret
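Passing those four credentials on the command line can be sketched with `argparse`; the flag names below are illustrative, not necessarily the ones `corona_swabber.py` actually uses, so check the script itself:

```python
# Sketch of a credential-argument interface like the one described above.
# Flag names are hypothetical; consult corona_swabber.py for the real ones.
import argparse

def build_parser():
    parser = argparse.ArgumentParser(description="Stream COVID-19 tweets")
    parser.add_argument("--consumer-key", required=True)
    parser.add_argument("--consumer-secret", required=True)
    parser.add_argument("--access-token", required=True)
    parser.add_argument("--access-token-secret", required=True)
    return parser

args = build_parser().parse_args([
    "--consumer-key", "CK", "--consumer-secret", "CS",
    "--access-token", "AT", "--access-token-secret", "ATS",
])
print(args.access_token)
```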

I piped the output to another file:

```
python corona_swabber.py [arguments] > path/to/jsons_out.txt
```

If you are fancy, you can add features such as different stopping criteria. However, I don't get paid to be fancy anymore. Feel free to submit a PR if you have improvements.

The sky's the limit! There are some tricky components to the Twitter API, namely how it handles quote tweets, retweets, and tweets over a certain length. More information about Tweepy's capabilities can be found in the Tweepy documentation.
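As one illustration of those tricky components: in the streaming JSON, retweets truncate their text, and tweets over 140 characters carry their full text in a nested `extended_tweet` object. A minimal sketch of digging the full text out (key names follow Twitter's streaming payload; the unwrapping logic is my own, not from this repo):

```python
# Sketch: recover the full text of a raw streamed tweet, handling
# retweets and extended (>140 char) tweets.
def full_text(tweet):
    # Retweets truncate the text; prefer the original status if present.
    if "retweeted_status" in tweet:
        tweet = tweet["retweeted_status"]
    # Longer tweets put their text under extended_tweet.full_text.
    if "extended_tweet" in tweet:
        return tweet["extended_tweet"]["full_text"]
    return tweet.get("full_text", tweet.get("text", ""))

print(full_text({"text": "short tweet"}))
```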

With `reagent.py`, you can extract tweet data by JSON path; results are printed to standard output.

```
python reagent.py --jsonfile [json file] --extractpaths tag,tag2:subtag1,... > /path/to/output.txt
```

## 👑 Reproducibility and Data Privacy 👑

For privacy reasons (e.g. a user may lock their account after the fact), I am only able to provide tweet ids to you, in `mar14-tweets.txt`. If you want to scrape these specific tweets, the `tweepy` interface allows you to do so fairly trivially.
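Re-hydrating those ids can be sketched as follows. This assumes tweepy 3.x, where `API.statuses_lookup` accepts up to 100 ids per call and silently skips deleted or protected tweets; the helper names and the `api` argument (an authenticated `tweepy.API` instance) are illustrative, not part of this repo:

```python
# Sketch: re-hydrate the tweet ids in mar14-tweets.txt in batches of 100,
# the per-call limit of the statuses/lookup endpoint.
import json

def chunks(ids, size=100):
    """Split a list of tweet ids into API-sized batches."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]

def hydrate(api, id_file="mar14-tweets.txt"):
    """Print one JSON object per recoverable tweet, given a tweepy.API."""
    with open(id_file) as f:
        tweet_ids = [line.strip() for line in f if line.strip()]
    for batch in chunks(tweet_ids):
        # tweepy 3.x: statuses_lookup skips ids that are no longer visible.
        for status in api.statuses_lookup(batch, tweet_mode="extended"):
            print(json.dumps(status._json))
```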