
MedSpider: Health Forums Data Collector

License: AGPL v3

MedSpider is a collection of scripts for web scraping tasks that gather online conversations from health forums. MedSpider targets the online forums listed below, categorized by the type of interaction between Patients (P) and Medics (M).

General Requirements

  • Python 2.7
  • lxml, installed via pip install lxml==4.1.0
  • pandas, needed by some of the scrapers

Scrape the BMJ Doc2Doc Forum in 3 Steps [M2M]

  1. Please note that the BMJ's Doc2Doc forum has been discontinued; the scraper therefore uses cached web pages from the Wayback Machine (Internet Archive)
  2. Specify the output directory to write results to by editing the doc2doc.py file's main entry point, e.g. Spidey().crawl('doc2doc') (default is doc2doc if not specified)
  3. Run the script from the command line or terminal via python doc2doc.py, which will create tab-separated output files in the output directory you specified
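The entry-point edit described above can be pictured with a minimal sketch. Spidey and its crawl() call come from the steps; the class body and the discussions.tsv file name here are illustrative stand-ins, not the repository's actual implementation.

```python
import os
import csv

class Spidey(object):
    """Illustrative stand-in for the scraper class; the real doc2doc.py
    implements the actual crawling of the cached Doc2Doc pages."""

    def crawl(self, output_dir='doc2doc'):
        # Create the output directory if it does not already exist.
        if not os.path.isdir(output_dir):
            os.makedirs(output_dir)
        # A real scraper would fetch pages and write one row per post;
        # this sketch writes only a hypothetical header row.
        path = os.path.join(output_dir, 'discussions.tsv')
        with open(path, 'w') as fh:
            writer = csv.writer(fh, delimiter='\t')
            writer.writerow(['thread_id', 'author', 'body'])
        return path

if __name__ == '__main__':
    # Changing the argument here redirects the output, as in step 2.
    Spidey().crawl('doc2doc')
```

Passing a different directory name to crawl() is all that step 2 asks for; everything else stays unchanged.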

Scrape the DocCheck Forums in 3 Steps [M2M]

  1. This scraper requires registering a medic-related account on DocCheck
  2. Specify the output directory to write results to by editing the doccheck.py file's main entry point, e.g. Spidey().crawl('doccheck') (default is doccheck if not specified)
  3. Run the script from the command line or terminal via python doccheck.py, which will create tab-separated output files in the specified directory: blogs.tsv, comments.tsv, and topics.tsv
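The resulting tab-separated files can be loaded with pandas (one of the requirements above). The snippet below reads an in-memory sample shaped like a comments.tsv; the column names are illustrative, not the scraper's actual header.

```python
import io
import pandas as pd

# A tiny stand-in for doccheck's comments.tsv; in practice you would
# pass the file path, e.g. pd.read_csv('doccheck/comments.tsv', sep='\t').
sample = io.StringIO("comment_id\tauthor\tbody\n1\tdoc42\tAgreed.\n")
comments = pd.read_csv(sample, sep='\t')
print(comments.shape)
```

The sep='\t' argument is the only thing that differs from reading an ordinary CSV.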

Scrape the eHealth Forum in 3 Steps

  1. Specify the output directory to write results to by editing the ehealthforum.py file's main entry point, e.g. Spidey().crawl('ehealthforum') (default is ehealthforum if not specified)
  2. Run the script from the command line or terminal via python ehealthforum.py, which will create a tab-separated output file called chats.tsv in the specified directory
  3. To run the unit tests, use pytest -q ehealthforum.py
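Since step 3 runs pytest directly against the scraper file, the unit tests are plain test_* functions living alongside the scraper code. A minimal sketch of that pattern, with a hypothetical helper (the real ehealthforum.py defines its own helpers and tests):

```python
# Hypothetical helper of the kind a scraper might use, plus the sort of
# test function that pytest -q would collect from the same file.

def clean_text(raw):
    """Collapse runs of whitespace in a scraped forum post."""
    return ' '.join(raw.split())

def test_clean_text():
    assert clean_text('  hello \n world ') == 'hello world'
```

Keeping the tests in the same file means no separate test module is needed; pytest discovers any function whose name starts with test_.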

Scrape the Doctors Lounge Forum in 3 Steps [P2M]

  1. Specify the output directory to write results to by editing the doctorslounge.py file's main entry point, e.g. Spidey().crawl('doctorslounge') (default is doctorslounge if not specified)
  2. Run the script from the command line or terminal via python doctorslounge.py, which will create a tab-separated output file called discussions.tsv in the specified directory
  3. To run the unit tests, use pytest -q doctorslounge.py
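The scrapers rely on lxml (the pinned requirement above) to pull posts out of forum pages. A hedged sketch of that extraction step; the markup and XPath expressions here are invented for illustration and do not reflect the Doctors Lounge site's real structure:

```python
from lxml import html

# A made-up fragment standing in for a downloaded forum page.
page = """
<html><body>
  <div class="post"><span class="author">patient1</span>
    <p class="body">Is this rash serious?</p></div>
  <div class="post"><span class="author">dr_smith</span>
    <p class="body">Please see a dermatologist.</p></div>
</body></html>
"""

tree = html.fromstring(page)
rows = []
for post in tree.xpath('//div[@class="post"]'):
    # Pull the author and body text out of each post element.
    author = post.xpath('.//span[@class="author"]/text()')[0]
    body = post.xpath('.//p[@class="body"]/text()')[0]
    rows.append((author, body))
print(rows)
```

Each (author, body) tuple would then become one tab-separated row in discussions.tsv.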

Scrape the OHN Forum in 3 Steps

  1. Specify the output directory to write results to by editing the ohn.py file's main entry point, e.g. Spidey().crawl('ohn') (default is ohn if not specified)
  2. Run the script from the command line or terminal via python ohn.py, which will create a tab-separated output file called chats.tsv in the specified directory
  3. To run the unit tests, use pytest -q ohn.py

Scrape the Hopkins Forum in 3 Steps

  1. Specify the output directory to write results to by editing the hopkins.py file's main entry point, e.g. Spidey().crawl('hopkins') (default is hopkins if not specified)
  2. Run the script from the command line or terminal via python hopkins.py, which will create a tab-separated output file called discussions.tsv in the specified directory
  3. To run the unit tests, use pytest -q hopkins.py
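All of the scrapers above emit tab-separated files, and forum posts routinely contain tabs and newlines themselves. One way to keep such files parseable, sketched here with csv.writer, is to let minimal quoting wrap any field that contains the delimiter (the field values are illustrative):

```python
import csv
import io

# Write one tab-separated row; QUOTE_MINIMAL (the default) quotes only
# the field that itself contains a tab, so the file stays parseable.
buf = io.StringIO()
writer = csv.writer(buf, delimiter='\t', lineterminator='\n')
writer.writerow(['q1', 'What dose?', 'Take 5 mg\tdaily'])
print(buf.getvalue())
```

Reading the file back with csv.reader (or pandas with sep='\t') undoes the quoting transparently.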

Scrape the Health Stack Exchange Q&A Forums in 3 Steps [P2P]

  1. Specify the output directory (which must already exist) to write results to by editing the healthse.py file's main entry point, e.g. Spidey().crawl('healthse') (default is healthse if not specified)
  2. Run the script from the command line or terminal via python healthse.py, which will create a collection of tab-separated output files (please note that Stack Exchange enforces rate limits): questions.tsv, answers.tsv, question_comments.tsv, and answer_comments.tsv
  3. To run the unit tests, use pytest -q healthse.py
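Because Stack Exchange rate-limits requests, long crawls benefit from retrying with an increasing delay. A hedged sketch of that backoff pattern; fetch_with_backoff and the stub fetcher are hypothetical names, not functions from healthse.py:

```python
import time

def fetch_with_backoff(fetch_page, url, retries=3, base_delay=1.0):
    """Retry fetch_page(url), doubling the delay after each failure."""
    for attempt in range(retries):
        try:
            return fetch_page(url)
        except IOError:  # e.g. an HTTP 429 surfaced as an error
            time.sleep(base_delay * (2 ** attempt))
    raise IOError('gave up after %d retries: %s' % (retries, url))

# Usage with a stub that fails once and then succeeds (a tiny
# base_delay keeps the demonstration fast).
calls = {'n': 0}
def flaky(url):
    calls['n'] += 1
    if calls['n'] < 2:
        raise IOError('429 Too Many Requests')
    return 'ok'

print(fetch_with_backoff(flaky, 'https://example.invalid/q', base_delay=0.01))
```

The exponential delay (base_delay, 2x, 4x, ...) spaces retries out so a temporarily throttled crawl recovers instead of failing outright.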

Parse the Health Stack Exchange Q&A Archives in 3 Steps [P2P]

  1. Download the health.stackexchange.com.7z archive file and extract it using 7-Zip, which is available for both Ubuntu and Windows
  2. Note the dataset folder where the extracted XML files are located
  3. The SEParse.py script creates question pairs from the XML files via python SEParse.py dataset-folder, for example python SEParse.py SEparse. It saves the results to a CSV file within the dataset folder (in this example, the file will be called SEparse.csv). The script can be modified to perform other extraction and parsing tasks on the XML files
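The pairing step can be sketched against the Posts.xml layout used by Stack Exchange data dumps, where PostTypeId 1 marks a question and PostTypeId 2 marks an answer whose ParentId points at its question. The sample rows below are invented, and this is only an illustration of the idea, not SEParse.py's actual code:

```python
import xml.etree.ElementTree as ET

# Two made-up rows in the Posts.xml dump format.
sample = """<posts>
  <row Id="1" PostTypeId="1" Body="Is coffee dehydrating?" />
  <row Id="2" PostTypeId="2" ParentId="1" Body="Only mildly." />
</posts>"""

root = ET.fromstring(sample)
# Index questions by Id, then join each answer to its parent question.
questions = {r.get('Id'): r.get('Body')
             for r in root if r.get('PostTypeId') == '1'}
pairs = [(questions[r.get('ParentId')], r.get('Body'))
         for r in root if r.get('PostTypeId') == '2']
print(pairs)
```

For a full dump, iterating with ET.iterparse instead of fromstring avoids loading the whole Posts.xml into memory at once.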
