BitcoinTalk Scraper

Requirements:

Install requirements

LIB	DOC	IMPORT
bs4	BeautifulSoup	`from bs4 import BeautifulSoup as bs`
cursor	Cursor	`import cursor`
spacy	Spacy	`import spacy`
matplotlib	Matplotlib	`import matplotlib.pyplot as plt`
mplcursors	Mplcursors	`import mplcursors`

python -m pip install -r requirements.txt

Features:

DownloadHTML-v2

Check if pages of the BitcoinTalk forum are missing and download them if necessary:
RUN: python DownloadHTML-v2.py

Download all pages of the BitcoinTalk forum:
RUN: python DownloadHTML-v2.py [ -u | --update ]

It creates a directory containing all HTML pages of the forum, with the following structure:

├── BitcoinTalk-Forum
│   ├── Hardware (child board name)
│   │   ├── 0a6b038fb2ee35ae4941dd86fcad7c1858cef8df (SHA-1 hash of the topic title)
│   │   │   ├── 1.html (HTML page of the topic)
│   │   │   ├── 2.html
│   │   │   └── 3.html
│   │   ├── 7dc40b9c0c457b9986b4a710705b67d0b035f9e8
│   │   │   └── 1.html
│   │   ├── 85ea1f75700b12789b153d1ea1e952738b84efb1
│   │   │   ├── 1.html
│   │   │   ├── 2.html
│   │   │   ├── 3.html
│   │   │   ├── 4.html
│   │   │   ├── 5.html
│   │   │   ├── [...]
│   │   ├── 93fc59b432ec70f3b7d72bedac883df6ce1af86a
│   │   ├── [...]
│   ├── Mining Software (miners) [...]
│   ├── Mining speculation [...]
│   ├── Mining support [...]
│   └── Pools [...]
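As an illustration of the layout above, here is a minimal sketch of how a topic directory could be derived and how missing pages could be detected. The function names `topic_dir` and `missing_pages` are hypothetical, and hashing the UTF-8 encoded title is an assumption; the actual code in DownloadHTML-v2.py may differ.

```python
import hashlib
from pathlib import Path


def topic_dir(root: Path, board: str, topic_title: str) -> Path:
    """Return the directory for a topic: <root>/<board>/<sha1 of title>."""
    # Assumption: the title is encoded as UTF-8 before hashing.
    digest = hashlib.sha1(topic_title.encode("utf-8")).hexdigest()
    return root / board / digest


def missing_pages(topic: Path, page_count: int) -> list[int]:
    """List page numbers in 1..page_count that have no <n>.html file yet."""
    return [
        n for n in range(1, page_count + 1)
        if not (topic / f"{n}.html").exists()
    ]
```

A SHA-1 hex digest is always 40 characters long, which matches the directory names shown in the tree and avoids characters in topic titles that are not valid in file names.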

Scraper

Iterate over the downloaded files and extract all information:
RUN: python Scraper.py

The scraper will create a file named BitcoinTalk-data.json.
It is a large file containing all information about every scraped topic.
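The iteration over downloaded files might look roughly like the sketch below. It only walks the directory layout described earlier and does not parse any HTML; `iter_topic_pages` is a hypothetical helper, not a function taken from Scraper.py.

```python
from pathlib import Path


def iter_topic_pages(root: Path):
    """Yield (board, topic_hash, pages) with pages sorted by page number."""
    for board in sorted(p for p in root.iterdir() if p.is_dir()):
        for topic in sorted(p for p in board.iterdir() if p.is_dir()):
            # Sort numerically so that 10.html comes after 9.html, not after 1.html.
            pages = sorted(
                (f for f in topic.glob("*.html") if f.stem.isdigit()),
                key=lambda f: int(f.stem),
            )
            yield board.name, topic.name, pages
```

Each page path could then be opened and passed to BeautifulSoup to extract the posts.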

TextAnalysis

Iterate over all topics, build bags of words (BoW), and compute the TF-IDF (term frequency-inverse document frequency) of each word (excluding stop words and punctuation).

Before running this program, make sure you have already run DownloadHTML-v2.py and Scraper.py so that the BitcoinTalk-data.json file exists:
RUN: python TextAnalysis.py

It analyses the data in BitcoinTalk-data.json and creates an analysis_results.json file containing the results of the analysis.
For each word of each post of each topic:

- number of occurrences
- term frequency
- inverse document frequency
- term frequency * inverse document frequency (tf-idf)
- metadata relative to the topic
  - total number of words
  - number of documents (posts)
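The quantities listed above can be sketched in plain Python as follows. This uses one common TF-IDF variant (tf = count / words in the post, idf = ln(number of posts / posts containing the word)); the exact formula used by TextAnalysis.py is not stated here, so treat both the formula and the `tf_idf` function name as assumptions.

```python
import math
from collections import Counter


def tf_idf(documents: list[list[str]]) -> list[dict[str, dict[str, float]]]:
    """Compute per-post statistics for each word.

    documents: one token list per post (already lower-cased and filtered).
    """
    n_docs = len(documents)
    # Document frequency: in how many posts each word appears at least once.
    doc_freq = Counter(word for doc in documents for word in set(doc))

    results = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        stats = {}
        for word, count in counts.items():
            tf = count / total
            idf = math.log(n_docs / doc_freq[word])
            stats[word] = {
                "occurrences": count,
                "tf": tf,
                "idf": idf,
                "tf_idf": tf * idf,
            }
        results.append(stats)
    return results
```

With this variant, a word that appears in every post of a topic gets an idf of 0, so its tf-idf is 0 regardless of how often it occurs.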

To ignore specific words during the analysis, add them to ./WORDS/ignore.json:

{
    "ignore": [
        "pool",
        "pools",
        "hi",
        "ve",
        "hey",
        "test"
    ]
}
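A file in this format could be loaded and applied along the following lines. The helpers `load_ignored_words` and `drop_ignored` are hypothetical, and the case-insensitive comparison is an assumption rather than documented behavior of TextAnalysis.py.

```python
import json
from pathlib import Path


def load_ignored_words(path: Path) -> set[str]:
    """Read the "ignore" array from the JSON file as a lower-cased set."""
    with path.open(encoding="utf-8") as handle:
        data = json.load(handle)
    return {word.lower() for word in data.get("ignore", [])}


def drop_ignored(tokens: list[str], ignored: set[str]) -> list[str]:
    """Keep only tokens whose lower-cased form is not in the ignore set."""
    return [t for t in tokens if t.lower() not in ignored]
```

For example, `drop_ignored(tokens, load_ignored_words(Path("WORDS/ignore.json")))` would filter a token list before the word counts are computed.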

Documentation about TF-IDF
Thomas Péan
