CauldronDevelopmentLLC/wiktionarymunge

wiktionarymunge

A Python package that processes Wikimedia dump files for Wiktionary and produces language data in JSON format for use with the Dictum Finnish dictionary project.

Quick start

Setup environment

python3 -m venv .venv
source .venv/bin/activate
python -m pip install -U pip
python -m pip install -e .

Get the data

NOTE: This is currently only known to work with older Wiktionary dumps, e.g. enwiktionary-20230101-pages-articles.xml.bz2. The Lua code in more recent dumps (2025) is incompatible.
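Before starting a long build it can help to confirm that a downloaded dump is readable. The following is a minimal standalone sketch, not the code wiktionarymunge.py uses: it streams a bz2-compressed MediaWiki XML dump and yields page titles. The function name `iter_page_titles` is hypothetical, and the sketch assumes only that the dump wraps each page in a `page` element with a `title` child.

```python
import bz2
import xml.etree.ElementTree as ET


def iter_page_titles(path):
    """Yield page titles from a bz2-compressed MediaWiki XML dump.

    Hypothetical helper for sanity-checking a dump file; it is not part
    of this package.
    """
    with bz2.open(path, "rb") as stream:
        for _event, elem in ET.iterparse(stream, events=("end",)):
            # Tags carry an XML namespace such as "{...}page"; strip it so
            # the check does not depend on the export schema version.
            tag = elem.tag.rsplit("}", 1)[-1]
            if tag != "page":
                continue
            for child in elem:
                if child.tag.rsplit("}", 1)[-1] == "title":
                    yield child.text
                    break
            # Drop the page's children once its title has been read.
            elem.clear()
```

Calling `next(iter_page_titles(path))` on the dump path reads only as much of the file as is needed to reach the first page, so it works as a quick check even on large dumps.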

Build dictionary JSON

. .venv/bin/activate
./build.sh

Tools

wiktionarymunge.py

Extract data from Wiktionary dumps, expand the macros and output JSON data.

Extract all Finnish language data:

./wiktionarymunge.py --lang=fi -e output/all/dict data/fiwiktionary-20230101-pages-articles.xml.bz2

fi-inflect.py

This script is an extra utility that takes a Wiktionary inflection template and returns the inflection data in JSON format.

./fi-inflect.py 'koira|rou|t|d|a'
./fi-inflect.py 'tulla|rii|d|t|el|ä'

cache-get.py

Extract files from the cache. This is only useful after successfully running wiktionarymunge.py.

./cache-get.py -l en Module:string/char

map-forms.py

Reads the dict/*.json files and creates a map.json.gz file that maps every inflected form to its root form. It also writes separate four-letter map files, maps/????.json.gz, in which the forms are grouped by their first four letters.

cd output/all
../../map-forms.py
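The idea of a form-to-root map can be sketched as an inversion of a root-to-forms table. This is an illustration only, not the logic of map-forms.py: the input shape (a dict from root word to a list of inflected forms) and the name `build_form_map` are assumptions, and the real dict/*.json schema may differ.

```python
from collections import defaultdict


def build_form_map(entries):
    """Invert {root: [forms]} into {form: [roots]}.

    The input shape is an assumption made for illustration; it is not
    the schema of the dict/*.json files.
    """
    form_map = defaultdict(set)
    for root, forms in entries.items():
        for form in forms:
            form_map[form].add(root)
    # Sort the roots so the output is deterministic.
    return {form: sorted(roots) for form, roots in form_map.items()}


if __name__ == "__main__":
    entries = {
        "tulla": ["tulen", "tuli"],  # verb "to come"
        "tuli": ["tuli", "tulen"],   # noun "fire"
    }
    print(build_form_map(entries))
```

A form can belong to more than one root, as "tulen" does here, which is why the sketch maps each form to a list of roots rather than to a single root.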

compress-dict.py

Combines the JSON dictionary entries into four-letter files, each of which contains all the words that begin with those four letters.

cd output/all
../../compress-dict.py
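Grouping entries by their first four letters might look roughly like the sketch below. It is not the implementation of compress-dict.py: the function name `write_prefix_files`, the `<prefix>.json.gz` file naming, the use of gzip, and the entry shape are all assumptions made for illustration.

```python
import gzip
import json
from collections import defaultdict
from pathlib import Path


def write_prefix_files(entries, out_dir, width=4):
    """Write one gzipped JSON file per leading-letter group.

    Hypothetical sketch: file naming and compression are assumptions.
    Words shorter than `width` are grouped under the whole word.
    """
    buckets = defaultdict(dict)
    for word, entry in entries.items():
        buckets[word[:width]][word] = entry

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for prefix, group in buckets.items():
        target = out_dir / f"{prefix}.json.gz"
        with gzip.open(target, "wt", encoding="utf-8") as handle:
            # ensure_ascii=False keeps letters such as "ä" readable.
            json.dump(group, handle, ensure_ascii=False)
    return sorted(buckets)
```

With this layout a lookup only needs to load the one small file that matches the first four letters of the query, rather than the whole dictionary.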

add-frequencies.py

Read a word frequency file and add frequency counts to dictionary entries.

cd output/all
../../add-frequencies.py word_freq.txt
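The frequency step can be pictured as a simple join between a word list and the dictionary entries. The sketch below is an assumption-laden illustration, not the behavior of add-frequencies.py: it assumes each line of the frequency file holds a word and an integer count separated by whitespace, and it stores the count under a hypothetical `frequency` key.

```python
def parse_frequencies(lines):
    """Parse "word count" lines into a dict (assumed file format)."""
    freqs = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        word, count = parts
        try:
            freqs[word] = int(count)
        except ValueError:
            continue  # skip lines whose count is not an integer
    return freqs


def add_frequencies(entries, freqs):
    """Attach a count to each entry that has one.

    The "frequency" key is a hypothetical name chosen for this sketch.
    """
    for word, entry in entries.items():
        if word in freqs:
            entry["frequency"] = freqs[word]
    return entries


if __name__ == "__main__":
    freqs = parse_frequencies(["koira 1200", "tulla 5400", ""])
    print(add_frequencies({"koira": {}, "kissa": {}}, freqs))
```

Entries without a match in the frequency file are left unchanged, so a missing count can be distinguished from a count of zero.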
