
MedSpider: Health Forums Data Collector

License: AGPL v3

MedSpider is a collection of scripts for web scraping tasks that gather online conversations from health forums. MedSpider targets the online forums listed below, categorized by the type of interaction between Patients (P) and Medics (M).

General Requirements

  • Python 2.7
  • lxml, installed via pip install lxml==4.1.0
  • pandas, needed by some of the scrapers

Scrape the BMJ Doc2Doc Forum in 3 Steps [M2M]

  1. Please note that the BMJ's Doc2Doc forum has been discontinued; the scraper therefore uses cached web pages from the Wayback Machine (Internet Archive)
  2. Specify the output directory to write results to by editing the doc2doc.py file's main entry point, e.g. Spidey().crawl('doc2doc') (default is doc2doc if not specified)
  3. Run the script from the command line or terminal via python doc2doc.py, which will create tab-separated output files in the output directory you specified
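The entry-point edit described above can be pictured with a minimal sketch. Spidey and its crawl() call come from the steps; the class body and the discussions.tsv file name here are illustrative stand-ins, not the repository's actual implementation.

```python
import os
import csv

class Spidey(object):
    """Illustrative stand-in for the scraper class; the real doc2doc.py
    implements the actual crawling of the cached Doc2Doc pages."""

    def crawl(self, output_dir='doc2doc'):
        # Create the output directory if it does not already exist.
        if not os.path.isdir(output_dir):
            os.makedirs(output_dir)
        # A real scraper would fetch pages and write one row per post;
        # this sketch writes only a hypothetical header row.
        path = os.path.join(output_dir, 'discussions.tsv')
        with open(path, 'w') as fh:
            writer = csv.writer(fh, delimiter='\t')
            writer.writerow(['thread_id', 'author', 'body'])
        return path

if __name__ == '__main__':
    # Changing the argument here redirects the output, as in step 2.
    Spidey().crawl('doc2doc')
```

Passing a different directory name to crawl() is all that step 2 asks for; everything else stays unchanged.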

Scrape the DocCheck Forums in 3 Steps [M2M]

  1. This scraper requires registering a medic-related account on DocCheck
  2. Specify the output directory to write results to by editing the doccheck.py file's main entry point, e.g. Spidey().crawl('doccheck') (default is doccheck if not specified)
  3. Run the script from the command line or terminal via python doccheck.py, which will create tab-separated output files in the specified directory: blogs.tsv, comments.tsv, and topics.tsv
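The resulting tab-separated files can be loaded with pandas (one of the requirements above). The snippet below reads an in-memory sample shaped like a comments.tsv; the column names are illustrative, not the scraper's actual header.

```python
import io
import pandas as pd

# A tiny stand-in for doccheck's comments.tsv; in practice you would
# pass the file path, e.g. pd.read_csv('doccheck/comments.tsv', sep='\t').
sample = io.StringIO("comment_id\tauthor\tbody\n1\tdoc42\tAgreed.\n")
comments = pd.read_csv(sample, sep='\t')
print(comments.shape)
```

The sep='\t' argument is the only thing that differs from reading an ordinary CSV.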

Scrape the eHealth Forum in 3 Steps

  1. Specify the output directory to write results to by editing the ehealthforum.py file's main entry point, e.g. Spidey().crawl('ehealthforum') (default is ehealthforum if not specified)
  2. Run the script from the command line or terminal via python ehealthforum.py, which will create a tab-separated output file called chats.tsv in the specified directory
  3. To run the unit tests, use pytest -q ehealthforum.py
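Since step 3 runs pytest directly against the scraper file, the unit tests are plain test_* functions living alongside the scraper code. A minimal sketch of that pattern, with a hypothetical helper (the real ehealthforum.py defines its own helpers and tests):

```python
# Hypothetical helper of the kind a scraper might use, plus the sort of
# test function that pytest -q would collect from the same file.

def clean_text(raw):
    """Collapse runs of whitespace in a scraped forum post."""
    return ' '.join(raw.split())

def test_clean_text():
    assert clean_text('  hello \n world ') == 'hello world'
```

Keeping the tests in the same file means no separate test module is needed; pytest discovers any function whose name starts with test_.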

Scrape the Doctors Lounge Forum in 3 Steps [P2M]

  1. Specify the output directory to write results to by editing the doctorslounge.py file's main entry point, e.g. Spidey().crawl('doctorslounge') (default is doctorslounge if not specified)
  2. Run the script from the command line or terminal via python doctorslounge.py, which will create a tab-separated output file called discussions.tsv in the specified directory
  3. To run the unit tests, use pytest -q doctorslounge.py
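The scrapers rely on lxml (the pinned requirement above) to pull posts out of forum pages. A hedged sketch of that extraction step; the markup and XPath expressions here are invented for illustration and do not reflect the Doctors Lounge site's real structure:

```python
from lxml import html

# A made-up fragment standing in for a downloaded forum page.
page = """
<html><body>
  <div class="post"><span class="author">patient1</span>
    <p class="body">Is this rash serious?</p></div>
  <div class="post"><span class="author">dr_smith</span>
    <p class="body">Please see a dermatologist.</p></div>
</body></html>
"""

tree = html.fromstring(page)
rows = []
for post in tree.xpath('//div[@class="post"]'):
    # Pull the author and body text out of each post element.
    author = post.xpath('.//span[@class="author"]/text()')[0]
    body = post.xpath('.//p[@class="body"]/text()')[0]
    rows.append((author, body))
print(rows)
```

Each (author, body) tuple would then become one tab-separated row in discussions.tsv.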

Scrape the OHN Forum in 3 Steps

  1. Specify the output directory to write results to by editing the ohn.py file's main entry point, e.g. Spidey().crawl('ohn') (default is ohn if not specified)
  2. Run the script from the command line or terminal via python ohn.py, which will create a tab-separated output file called chats.tsv in the specified directory
  3. To run the unit tests, use pytest -q ohn.py

Scrape the Hopkins Forum in 3 Steps

  1. Specify the output directory to write results to by editing the hopkins.py file's main entry point, e.g. Spidey().crawl('hopkins') (default is hopkins if not specified)
  2. Run the script from the command line or terminal via python hopkins.py, which will create a tab-separated output file called discussions.tsv in the specified directory
  3. To run the unit tests, use pytest -q hopkins.py
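All of the scrapers above emit tab-separated files, and forum posts routinely contain tabs and newlines themselves. One way to keep such files parseable, sketched here with csv.writer, is to let minimal quoting wrap any field that contains the delimiter (the field values are illustrative):

```python
import csv
import io

# Write one tab-separated row; QUOTE_MINIMAL (the default) quotes only
# the field that itself contains a tab, so the file stays parseable.
buf = io.StringIO()
writer = csv.writer(buf, delimiter='\t', lineterminator='\n')
writer.writerow(['q1', 'What dose?', 'Take 5 mg\tdaily'])
print(buf.getvalue())
```

Reading the file back with csv.reader (or pandas with sep='\t') undoes the quoting transparently.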

Scrape the Health Stack Exchange Q&A Forums in 3 Steps [P2P]

  1. Specify the output directory (which must already exist) to write results to by editing the healthse.py file's main entry point, e.g. Spidey().crawl('healthse') (default is healthse if not specified)
  2. Run the script from the command line or terminal via python healthse.py, which will create a collection of tab-separated output files (please note that Stack Exchange enforces rate limits): questions.tsv, answers.tsv, question_comments.tsv, and answer_comments.tsv
  3. To run the unit tests, use pytest -q healthse.py
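Because Stack Exchange rate-limits requests, long crawls benefit from retrying with an increasing delay. A hedged sketch of that backoff pattern; fetch_with_backoff and the stub fetcher are hypothetical names, not functions from healthse.py:

```python
import time

def fetch_with_backoff(fetch_page, url, retries=3, base_delay=1.0):
    """Retry fetch_page(url), doubling the delay after each failure."""
    for attempt in range(retries):
        try:
            return fetch_page(url)
        except IOError:  # e.g. an HTTP 429 surfaced as an error
            time.sleep(base_delay * (2 ** attempt))
    raise IOError('gave up after %d retries: %s' % (retries, url))

# Usage with a stub that fails once and then succeeds (a tiny
# base_delay keeps the demonstration fast).
calls = {'n': 0}
def flaky(url):
    calls['n'] += 1
    if calls['n'] < 2:
        raise IOError('429 Too Many Requests')
    return 'ok'

print(fetch_with_backoff(flaky, 'https://example.invalid/q', base_delay=0.01))
```

The exponential delay (base_delay, 2x, 4x, ...) spaces retries out so a temporarily throttled crawl recovers instead of failing outright.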

Parse the Health Stack Exchange Q&A Archives in 3 Steps [P2P]

  1. Download the health.stackexchange.com.7z archive file and extract it using 7-Zip, which is available for both Ubuntu and Windows
  2. Note the dataset folder where the extracted XML files are located
  3. The SEParse.py script creates question pairs from the XML files via python SEParse.py dataset-folder, for example python SEParse.py SEparse. It saves the results to a CSV file within the dataset folder (in this example, the file will be called SEparse.csv). The script can be modified to perform other extraction and parsing tasks on the XML files
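The pairing step can be sketched against the Posts.xml layout used by Stack Exchange data dumps, where PostTypeId 1 marks a question and PostTypeId 2 marks an answer whose ParentId points at its question. The sample rows below are invented, and this is only an illustration of the idea, not SEParse.py's actual code:

```python
import xml.etree.ElementTree as ET

# Two made-up rows in the Posts.xml dump format.
sample = """<posts>
  <row Id="1" PostTypeId="1" Body="Is coffee dehydrating?" />
  <row Id="2" PostTypeId="2" ParentId="1" Body="Only mildly." />
</posts>"""

root = ET.fromstring(sample)
# Index questions by Id, then join each answer to its parent question.
questions = {r.get('Id'): r.get('Body')
             for r in root if r.get('PostTypeId') == '1'}
pairs = [(questions[r.get('ParentId')], r.get('Body'))
         for r in root if r.get('PostTypeId') == '2']
print(pairs)
```

For a full dump, iterating with ET.iterparse instead of fromstring avoids loading the whole Posts.xml into memory at once.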
