This repository contains the scripts and auxiliary files used in the HSE master's thesis "Representation of minor languages of Russia on the Internet: quantitative description and data analysis".
Counts and prints to stdout various statistics about internet text collections: the number of all web-sites, the number of fully downloaded web-sites, the number of web-pages, the number of tokens, the median number of tokens per page, and the ratio of fully downloaded web-sites to all web-sites.
Receives options: -l to process a single folder with JSON files; -m to process all language folders with JSON files; -u for the folder with lists of fully downloaded web-sites.
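The statistics listed above can be sketched as follows. This is a minimal illustration, not the script itself: the function name `collection_stats`, the input layout (a dict mapping each web-site to a dict of page URL to page text) and whitespace tokenization are all assumptions made for the example.

```python
from statistics import median


def collection_stats(sites, whole_sites):
    """Compute collection statistics.

    sites: dict mapping a web-site name to a dict {page_url: page_text}
           (hypothetical layout, assumed for this sketch).
    whole_sites: set of web-site names that were downloaded in full.
    """
    # Whitespace tokenization is an assumption; the real script may differ.
    tokens_per_page = [
        len(text.split())
        for pages in sites.values()
        for text in pages.values()
    ]
    n_sites = len(sites)
    n_whole = sum(1 for site in sites if site in whole_sites)
    return {
        "sites": n_sites,
        "whole_sites": n_whole,
        "pages": len(tokens_per_page),
        "tokens": sum(tokens_per_page),
        "token_median_per_page": median(tokens_per_page) if tokens_per_page else 0,
        "whole_ratio": n_whole / n_sites if n_sites else 0.0,
    }
```

Keeping the computation separate from file reading makes it easy to apply the same function to one language folder or to every language folder in turn.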
For background on the project, see its repository: (in Russian). Works with text collections similar to the web-site collections:
The calculated data, merged with other data, are presented here:
Uses whois-api and ip-api to retrieve domain registration info. Probably works only on Linux systems.
Options: -u for the list of web-sites and languages; -a for the list of ambiguous web-sites and languages; -g to add geo IP information to the whois file.
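As a rough illustration of the geo IP step, the sketch below queries the public ip-api.com JSON endpoint and keeps a few fields. Whether the script uses this exact endpoint or these fields is an assumption; the helper names `geo_url`, `parse_geo` and `lookup_geo` are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Public JSON endpoint of ip-api.com; assumed here, the script may call it differently.
IP_API_URL = "http://ip-api.com/json/{}"


def geo_url(host):
    """Build the request URL for an IP address or domain name."""
    return IP_API_URL.format(quote(host, safe=""))


def parse_geo(payload):
    """Keep only the location fields; return None when the lookup failed."""
    if payload.get("status") != "success":
        return None
    return {
        "country": payload.get("country"),
        "country_code": payload.get("countryCode"),
        "city": payload.get("city"),
    }


def lookup_geo(host, timeout=10):
    """Fetch and parse geo information for one host (performs a network request)."""
    with urlopen(geo_url(host), timeout=timeout) as response:
        return parse_geo(json.load(response))
```

The parsed dict could then be merged into the whois record for the same domain, which is roughly what the -g option is described as doing.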
You can find all the gathered whois information here:
Drawing script for all registration statistics.
The script works in two stages:
Stage 1 gets all links from all HTML files and saves them into a JSON file. Options: -l or -w to pass folders with HTML files. Outputs a folder with JSON: links_from_htmls.
Stage 2 gets all connections between different languages. Options: -c (connection mode) and -u (files with url_types from web-site: ). Output:
- files with links info to tsv file (see
- files with graph info to stdout (see
- creates folder link_graphs with plenty of .dot files
NB! Implicitly uses langs_all_info_merged.tsv from data folder in this repo.
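The two stages described above can be sketched with the standard library alone. This is an illustrative outline under stated assumptions, not the script's actual code: the names `LinkCollector`, `extract_links`, `language_edges` and `to_dot`, and the `site_lang` mapping from host name to language, are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collects the href values of all <a> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Stage 1: return all link targets found in one HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


def language_edges(pages, site_lang):
    """Stage 2: count links that lead from one language to another.

    pages: iterable of (source_language, html) pairs.
    site_lang: dict mapping a host name to its language (assumed input).
    """
    edges = Counter()
    for src_lang, html in pages:
        for href in extract_links(html):
            host = urlparse(href).netloc.lower()
            dst_lang = site_lang.get(host)
            # Relative links and unknown hosts have no language and are skipped.
            if dst_lang and dst_lang != src_lang:
                edges[(src_lang, dst_lang)] += 1
    return edges


def to_dot(edges, name="langs"):
    """Render the language graph as DOT text, one weighted edge per line."""
    lines = [f"digraph {name} {{"]
    for (src, dst), count in sorted(edges.items()):
        lines.append(f'  "{src}" -> "{dst}" [label="{count}"];')
    lines.append("}")
    return "\n".join(lines)
```

Writing the string returned by `to_dot` to a .dot file gives input that Graphviz can render.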
Miniscript to draw all dot files saved from
Auxiliary script to parse the Wikipedia page with the table of Wikipedias in all languages and all the information about them
Script with some regularly used functions, such as file reading
There are two ipynb files with Russian comments for data analysis
Includes data preparation, analysis of correlations, regression analysis and hierarchical clustering.
Draws graphs of the token median per page and a scatter plot of the token median per page against the number of web-pages. Includes comments in Russian.