We built a corpus for the Kazakh language from the Wikipedia dump (https://dumps.wikimedia.org/kkwiki/). We used a tool by Evan Jones (http://www.evanjones.ca/software/wikipedia2text.html) to parse the data, and NLTK to build the n-grams.
A total of 20 million words were collected, with almost 600 thousand distinct words of different derivations.
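The n-gram counting step can be illustrated with a minimal sketch. It is not the exact pipeline used here: the tokenizer (a Unicode `\w+` regular expression with lowercasing), the helper names `tokenize` and `count_ngrams`, and the two-sentence sample text are all assumptions made for illustration. The sketch uses only the Python standard library; `nltk.util.ngrams(tokens, n)` produces the same tuples.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into runs of Unicode word characters.

    This is an assumed, simplistic tokenizer; it ignores sentence boundaries.
    """
    return re.findall(r"\w+", text.lower())


def count_ngrams(tokens, n):
    """Count contiguous n-grams as tuples of length n.

    zip over n shifted copies of the token list yields the same tuples
    as nltk.util.ngrams(tokens, n).
    """
    return Counter(zip(*(tokens[i:] for i in range(n))))


# Hypothetical sample text standing in for the parsed Wikipedia dump.
sample = (
    "Қазақ тілі түркі тілдеріне жатады. "
    "Қазақ тілі Қазақстанның мемлекеттік тілі."
)

tokens = tokenize(sample)
unigrams = count_ngrams(tokens, 1)  # keys are 1-tuples, e.g. ("тілі",)
bigrams = count_ngrams(tokens, 2)   # keys are 2-tuples

print("total words:", len(tokens))
print("distinct words:", len(unigrams))
print("most common bigram:", bigrams.most_common(1))
```

On the full corpus, `len(tokens)` and `len(unigrams)` would correspond to the total and distinct word counts reported above, although the exact figures depend on the tokenization used.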