BitcoinTalk Scraper

Requirements:

Install requirements

| LIB | DOC | IMPORT |
| --- | --- | --- |
| bs4 | BeautifulSoup | `from bs4 import BeautifulSoup as bs` |
| cursor | Cursor | `import cursor` |
| spacy | Spacy | `import spacy` |
| matplotlib | Matplotlib | `import matplotlib.pyplot as plt` |
| mplcursors | Mplcursors | `import mplcursors` |

python -m pip install -r requirements.txt

Features:

DownloadHTML-v2

Check if pages of the BitcoinTalk forum are missing and download them if necessary:
RUN: python DownloadHTML-v2.py

Download all pages of the BitcoinTalk forum:
RUN: python DownloadHTML-v2.py [ -u | --update ]

It creates a directory containing every HTML page of the forum, laid out as follows:

├── BitcoinTalk-Forum
│   ├── Hardware (child board name)
│   │   ├── 0a6b038fb2ee35ae4941dd86fcad7c1858cef8df (SHA-1 hash of the topic title)
│   │   │   ├── 1.html (HTML page of the topic)
│   │   │   ├── 2.html
│   │   │   └── 3.html
│   │   ├── 7dc40b9c0c457b9986b4a710705b67d0b035f9e8
│   │   │   └── 1.html
│   │   ├── 85ea1f75700b12789b153d1ea1e952738b84efb1
│   │   │   ├── 1.html
│   │   │   ├── 2.html
│   │   │   ├── 3.html
│   │   │   ├── 4.html
│   │   │   ├── 5.html
│   │   │   ├── [...]
│   │   ├── 93fc59b432ec70f3b7d72bedac883df6ce1af86a
│   │   ├── [...]
│   ├── Mining Software (miners) [...]
│   ├── Mining speculation [...]
│   ├── Mining support [...]
│   └── Pools [...]
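The layout above can be reproduced with a few lines of standard-library Python. This is only a sketch: the helper names `topic_dir` and `page_path` are hypothetical, and it assumes the topic title is UTF-8 encoded before hashing, which the README does not state.

```python
import hashlib
from pathlib import Path

# Root directory name taken from the tree above.
ROOT = Path("BitcoinTalk-Forum")


def topic_dir(board: str, topic_title: str) -> Path:
    """Return <root>/<board>/<SHA-1 hex digest of the topic title>."""
    # Assumption: the title is encoded as UTF-8 before hashing.
    digest = hashlib.sha1(topic_title.encode("utf-8")).hexdigest()
    return ROOT / board / digest


def page_path(board: str, topic_title: str, page_number: int) -> Path:
    """Return the path of one saved page; pages are numbered from 1."""
    return topic_dir(board, topic_title) / f"{page_number}.html"
```

Hashing the title gives every topic a fixed-length directory name that is safe on any file system, whatever characters the title contains.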

Scraper

Iterate over the downloaded files and extract the information they contain:
RUN: python Scraper.py

The scraper creates a file named BitcoinTalk-data.json.
It is a large file containing the information extracted from every scraped topic.
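As an illustration of the iteration step, the downloaded tree could be walked as sketched below. The function name `iter_topic_pages` is hypothetical and this is not necessarily how Scraper.py does it. The sketch assumes the directory layout produced by DownloadHTML-v2.py and leaves the HTML parsing itself out.

```python
from pathlib import Path


def iter_topic_pages(root):
    """Yield (board, topic_hash, pages) for every topic directory under root."""
    root = Path(root)
    for board_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for topic in sorted(p for p in board_dir.iterdir() if p.is_dir()):
            # Keep only files named <number>.html.
            numbered = [p for p in topic.glob("*.html") if p.stem.isdigit()]
            # Sort numerically so that 10.html comes after 9.html.
            pages = sorted(numbered, key=lambda p: int(p.stem))
            yield board_dir.name, topic.name, pages
```

Each yielded page path can then be opened and handed to BeautifulSoup for parsing.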

TextAnalysis

Iterate over all topics, build a bag-of-words for each, and compute the TF-IDF (term frequency-inverse document frequency) of each word, excluding stop words and punctuation.

Before running this program, make sure you have already run DownloadHTML-v2.py and Scraper.py so that the BitcoinTalk-data.json file exists:
RUN: python TextAnalysis.py

It analyzes the data in BitcoinTalk-data.json and creates an analysis_results.json file containing the results of the analysis.
For each word of each post of each topic:

- number of occurrences
- term frequency
- inverse document frequency
- term frequency * inverse document frequency (tf-idf)
- metadata relative to the topic
  - total number of words
  - number of documents (posts)
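The quantities listed above can be computed as in the following sketch. It assumes one common TF-IDF variant (term frequency as count divided by document length, inverse document frequency as the natural log of the number of documents over the number of documents containing the word). TextAnalysis.py may use a different weighting, and the function name and dictionary keys are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: list of token lists, one per post. Returns one dict per post."""
    n_docs = len(documents)
    # Document frequency: number of posts in which each word appears.
    doc_freq = Counter(word for doc in documents for word in set(doc))
    results = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        stats = {}
        for word, n in counts.items():
            tf = n / total
            # Assumed variant: unsmoothed natural-log IDF.
            idf = math.log(n_docs / doc_freq[word])
            stats[word] = {
                "occurrences": n,
                "tf": tf,
                "idf": idf,
                "tf_idf": tf * idf,
            }
        results.append(stats)
    return results
```

With this variant, a word that appears in every post gets an IDF of zero and therefore a TF-IDF of zero, which is why very common words drop out of the ranking.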

To exclude specific words from the analysis, add them to ./WORDS/ignore.json:

{
    "ignore": [
        "pool",
        "pools",
        "hi",
        "ve",
        "hey",
        "test"
    ]
}
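A file in this format could be loaded and applied as sketched here. The helper names are hypothetical, and matching words case-insensitively is an assumption rather than documented behavior of TextAnalysis.py.

```python
import json
from pathlib import Path


def load_ignored_words(path="./WORDS/ignore.json"):
    """Read the "ignore" list from the JSON file and return it as a lowercase set."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    return {word.lower() for word in data.get("ignore", [])}


def drop_ignored(tokens, ignored):
    """Keep only the tokens whose lowercase form is not in the ignore set."""
    # Assumption: comparison is case-insensitive.
    return [t for t in tokens if t.lower() not in ignored]
```

Filtering the tokens before counting keeps the ignored words out of both the occurrence counts and the total word count of each post.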

Documentation about TF-IDF
Thomas Péan
