LIB | DOC | IMPORT |
---|---|---|
bs4 | BeautifulSoup | from bs4 import BeautifulSoup as bs |
cursor | Cursor | import cursor |
spacy | Spacy | import spacy |
matplotlib | Matplotlib | import matplotlib.pyplot as plt |
mplcursors | Mplcursors | import mplcursors |
python -m pip install -r requirements.txt
Check if pages of the BitcoinTalk forum are missing and download them if necessary:
RUN:python DownloadHTML-v2.py
Download all pages of the BitcoinTalk forum:
RUN:python DownloadHTML-v2.py [ -u | --update ]
It will create a directory containing all html pages of the forum, with the following architecture:
├── BitcoinTalk-Forum
│ ├── Hardware (childboard name)
│ │ ├── 0a6b038fb2ee35ae4941dd86fcad7c1858cef8df (hashed topic title using sha1 encoding)
│ │ │ ├── 1.html (html page of the topic)
│ │ │ ├── 2.html
│ │ │ └── 3.html
│ │ ├── 7dc40b9c0c457b9986b4a710705b67d0b035f9e8
│ │ │ └── 1.html
│ │ ├── 85ea1f75700b12789b153d1ea1e952738b84efb1
│ │ │ ├── 1.html
│ │ │ ├── 2.html
│ │ │ ├── 3.html
│ │ │ ├── 4.html
│ │ │ ├── 5.html
│ │ │ ├── [...]
│ │ ├── 93fc59b432ec70f3b7d72bedac883df6ce1af86a
│ │ ├── [...]
│ ├── Mining Software (miners) [...]
│ ├── Mining speculation [...]
│ ├── Mining support [...]
│ └── Pools [...]
Iterate over downloaded files and retrieve all informations:
RUN:python Scraper.py
The scraper will create a file named BitcoinTalk-data.json
.
It is a large file containing all informations about all scraped topics.
Iterate over all topics, create B-O-W (bags-of-words) and compute TF-IDF (term frequency-inverse document frequency) on each word (excluding stop-words & punctuation).
Before running this program, please make sure you already ran DownloadHTML-v2.py and Scraper.py in order to create a
BitcoinTalk-data.json
file:
RUN:python TextAnalysis.py
It will analyse data inside BitcoinTalk-data.json
, and will create a analysis_results.json
file containing (as its name says) the results of the analysis:
For each word, of each post, of each topic:
- number of occurrences
- term frequency
- inverse document frequency
- term frequency * inverse document frequency (tf-idf)
- metadata relative to the topic
- total number of words
- number of documents (posts)
If you need words to be ignored while analyzing all topics, you can add them inside ./WORDS/ignore.json
:
{
"ignore": [
"pool",
"pools",
"hi",
"ve",
"hey",
"test"
]
}
Documentation about TF-IDF
Thomas Péan