This code is pretty old (2014) and could use some cleanup. I don't expect that anyone will ever use it. Most learning progress is made when you try your hand at stuff that you have absolutely no business attempting; this is one of my attempts.
This is the code that I wrote for the task at http://www.haskellers.com/jobs/61 . The goal is to use the methods from a recent paper by Mikolov et al. to generate a graph describing word similarity based on context, by turning the words into vector representations and running PCA on the result.
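As a rough illustration of that last step, here is a minimal, dependency-free sketch of projecting word vectors onto their first two principal components. It is not the code used by the plot command: the names center, covariance, powerIteration, deflate and pca2 are made up for this sketch, and power iteration with deflation is only a simple stand-in for a proper eigendecomposition. It assumes the two largest eigenvalues of the covariance matrix are distinct and non-zero.

```haskell
import Data.List (transpose)

type Vector = [Double]

dot :: Vector -> Vector -> Double
dot xs ys = sum (zipWith (*) xs ys)

normalize :: Vector -> Vector
normalize v = map (/ sqrt (dot v v)) v

-- Subtract the per-dimension mean from every row.
center :: [Vector] -> [Vector]
center rows = map (\r -> zipWith (-) r means) rows
  where
    n     = fromIntegral (length rows)
    means = map ((/ n) . sum) (transpose rows)

-- Sample covariance matrix of rows that are already centered.
covariance :: [Vector] -> [Vector]
covariance rows = [[dot ci cj / n | cj <- cols] | ci <- cols]
  where
    cols = transpose rows
    n    = fromIntegral (length rows - 1)

matVec :: [Vector] -> Vector -> Vector
matVec m v = map (`dot` v) m

-- Approximate the dominant eigenvector with a fixed number of
-- power-iteration steps, starting from an arbitrary non-zero vector.
powerIteration :: Int -> [Vector] -> Vector
powerIteration steps m = iterate (normalize . matVec m) start !! steps
  where
    start = normalize (map fromIntegral [1 .. length m])

-- Remove the component along unit eigenvector v, so that the next
-- power iteration finds the second principal direction.
deflate :: [Vector] -> Vector -> [Vector]
deflate m v =
  zipWith (\row vi -> zipWith (\mij vj -> mij - lambda * vi * vj) row v) m v
  where
    lambda = dot v (matVec m v)

-- Project every row onto the first two principal components.
pca2 :: [Vector] -> [(Double, Double)]
pca2 rows = [(dot r pc1, dot r pc2) | r <- centered]
  where
    centered = center rows
    cov      = covariance centered
    pc1      = powerIteration 200 cov
    pc2      = powerIteration 200 (deflate cov pc1)
```

For example, pca2 [[2,0],[-2,0],[0,1],[0,-1]] should give points close to (2,0), (-2,0), (0,1) and (0,-1), because that data already varies most along its first axis. The resulting pairs are the kind of x/y coordinates one would put next to each word in a 2D plot.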
cabal install --only-dependencies
cabal build
cabal run -- scrape http://tinyurl.com/mczfmz9
cabal run makeCorpus
# optional: shuffling the corpus improves quality and gives better progress reports
mv corpus.txt corpus_unshuffled.txt
sort --random-sort corpus_unshuffled.txt > corpus.txt
cabal run train
# the --filter value (2.5) might need to be larger or smaller
cabal run -- plot outwords.txt --limit 110 --filter 2.5
# to view the result: feh pca.png
# to plot manually instead, run gnuplot and enter:
# plot "plot1.dat" using 2:3:1 with labels
pca.png: the main output file, a 1024x1024 graph containing the number of words given by --limit
plot1.dat: the file used as input for gnuplot; see the instructions above for a way to run gnuplot manually
error.png: a graph of the error function over the number of iterations
outwords.txt: the vector representations of all the words (sorted by descending frequency)
search.txt: a cache of the results from the arXiv search
corpus.txt: written by makeCorpus; each line contains one sentence
pdfs/*.pdf: the PDFs downloaded by scrape
Sadly, the amount of data doesn't seem to be enough to generate a word representation quite as nice as the one obtained by running plot --binary on the (gunzipped) sample vector file from the paper's authors.
For more design rationale, please see observations.txt.