This is a Python package that processes Wikimedia dump files for Wiktionary and produces language data in JSON format for use with the Dictum Finnish dictionary project.
python3 -m venv .venv
source .venv/bin/activate
python -m pip install -U pip
python -m pip install -e .
NOTE: This is currently only known to work with older Wiktionary dumps, e.g. enwiktionary-20230101-pages-articles.xml.bz2. There are incompatibilities with the Lua code in more recent dumps (2025).
. .venv/bin/activate
./build.sh
Extract data from Wiktionary dumps, expand the macros and output JSON data.
Extract all Finnish language data:
./wiktionarymunge.py --lang=fi -e output/all/dict data/fiwiktionary-20230101-pages-articles.xml.bz2
This script is an extra utility that takes a Wiktionary inflection template and returns the inflection data in JSON format.
./fi-inflect.py 'koira|rou|t|d|a'
./fi-inflect.py 'tulla|rii|d|t|el|ä'
Extract files from the cache. This is only useful after successfully running wiktionarymunge.py.
./cache-get.py -l en Module:string/char
Reads dict/*.json files and creates a map.json.gz file containing a map of all words from their inflected forms to their root form, as well as separate four-letter map files in maps/????.json.gz, which are grouped by their first four letters.
cd output/all
../../map-forms.py
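The grouping idea can be sketched roughly as follows. This is not the code of map-forms.py; the function names and the sample data are hypothetical, and the sketch assumes the map is a flat JSON object of inflected form to root form.

```python
import gzip
import json
from collections import defaultdict
from pathlib import Path


def group_by_prefix(form_to_root, prefix_len=4):
    """Group a form -> root mapping by the first `prefix_len` characters of each form."""
    groups = defaultdict(dict)
    for form, root in form_to_root.items():
        # Slicing keeps forms shorter than prefix_len whole.
        groups[form[:prefix_len]][form] = root
    return dict(groups)


def write_gzip_json(path, data):
    """Write `data` as UTF-8 JSON, gzip-compressed (assumed layout, e.g. maps/koir.json.gz)."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        json.dump(data, fh, ensure_ascii=False)


# Hypothetical sample data; the real map is built from dict/*.json.
sample = {"koiran": "koira", "koirat": "koira", "tulen": "tulla"}
grouped = group_by_prefix(sample)
```

Each key of `grouped` would then correspond to one maps/????.json.gz file, so a client only needs to fetch the small file matching the first four letters of the word it is looking up.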
Combines the JSON dictionary entries into four-letter files, each containing all the words that begin with those four letters.
cd output/all
../../compress-dict.py
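A minimal sketch of the bucketing step, for illustration only. It assumes each dictionary entry is a JSON object with a "word" key, which is a guess about the entry format, and `bucket_entries` is a hypothetical name rather than a function from compress-dict.py.

```python
from collections import defaultdict


def bucket_entries(entries, prefix_len=4):
    """Collect entries into lists keyed by the first `prefix_len` characters of entry["word"]."""
    buckets = defaultdict(list)
    for entry in entries:
        # Words shorter than prefix_len end up in a bucket named after the whole word.
        buckets[entry["word"][:prefix_len]].append(entry)
    return dict(buckets)


# Hypothetical entries; real entries carry the full extracted Wiktionary data.
sample_entries = [{"word": "koira"}, {"word": "koiras"}, {"word": "ja"}]
buckets = bucket_entries(sample_entries)
```

Here "koira" and "koiras" share the bucket "koir", while "ja" gets a bucket of its own because it is shorter than four letters.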
Read a word frequency file and add frequency counts to dictionary entries.
cd output/all
../../add-frequencies.py word_freq.txt
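The merge can be pictured with the sketch below. The format of word_freq.txt is not documented here, so this assumes one "word count" pair per line separated by whitespace; the "word" and "frequency" keys and both function names are likewise assumptions, not the behaviour of add-frequencies.py.

```python
def parse_frequencies(lines):
    """Parse lines of the assumed form 'word count' into a dict of word -> int."""
    freqs = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        word, count = parts
        try:
            freqs[word] = int(count)
        except ValueError:
            continue  # skip lines whose count is not an integer
    return freqs


def add_frequencies(entries, freqs, default=0):
    """Set a 'frequency' field (hypothetical key) on each entry, in place."""
    for entry in entries:
        entry["frequency"] = freqs.get(entry["word"], default)
    return entries


# Hypothetical sample input.
sample_lines = ["ja 1000", "koira 42", "", "bad line here"]
freqs = parse_frequencies(sample_lines)
entries = add_frequencies([{"word": "koira"}, {"word": "tulla"}], freqs)
```

Words missing from the frequency file fall back to `default`, so every entry ends up with a frequency field and consumers do not need to check for its presence.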