wiktionarymunge

This is a Python package that processes Wikimedia dump files of Wiktionary and produces language data in JSON format for use with the Dictum Finnish dictionary project.

Quick start

Set up environment

python3 -m venv .venv
source .venv/bin/activate
python -m pip install -U pip
python -m pip install -e .

Get the data

NOTE: This is currently known to work only with older Wiktionary dumps, e.g. enwiktionary-20230101-pages-articles.xml.bz2. The Lua code in more recent dumps (2025) is incompatible.
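
Since dump age matters, it can help to check the date embedded in a dump filename before starting a long build. The following is a minimal sketch, not part of this package: the helper names are hypothetical, and the cutoff simply reuses the 20230101 example above, so a later date only means "untested", not "known to fail".

```python
import re
from datetime import date

# Assumption: the only date confirmed to work is the 20230101 example above.
KNOWN_GOOD = date(2023, 1, 1)


def dump_date(filename):
    """Return the date embedded in a dump filename, or None if there is none."""
    match = re.search(r"wiktionary-(\d{4})(\d{2})(\d{2})-", filename)
    if match is None:
        return None
    year, month, day = (int(group) for group in match.groups())
    return date(year, month, day)


def is_known_good(filename):
    """True if the dump is dated on or before the known-good example."""
    found = dump_date(filename)
    return found is not None and found <= KNOWN_GOOD


for name in ("enwiktionary-20230101-pages-articles.xml.bz2",
             "enwiktionary-20250101-pages-articles.xml.bz2"):
    status = "known good" if is_known_good(name) else "untested"
    print(f"{name}: {status}")
```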

Build dictionary JSON

. .venv/bin/activate
./build.sh

Tools

wiktionarymunge.py

Extracts data from Wiktionary dumps, expands the macros, and outputs JSON data.

Extract all Finnish language data:

./wiktionarymunge.py --lang=fi -e output/all/dict data/fiwiktionary-20230101-pages-articles.xml.bz2

fi-inflect.py

This script is an extra utility that takes a Wiktionary inflection template and returns the inflection data in JSON format.

./fi-inflect.py 'koira|rou|t|d|a'
./fi-inflect.py 'tulla|rii|d|t|el|ä'

cache-get.py

Extracts files from the cache. This is only useful after wiktionarymunge.py has run successfully.

./cache-get.py -l en Module:string/char

map-forms.py

Reads dict/*.json files and creates map.json.gz, which maps every inflected form to its root form. It also writes separate four-letter map files, maps/????.json.gz, grouped by the first four letters of the words.

cd output/all
../../map-forms.py
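
The form-to-root inversion can be sketched as follows. This is an illustration only, not the code in map-forms.py: the "word" and "forms" field names are assumptions about the entry layout, and each form maps to a list of roots here because one form can belong to several words.

```python
import gzip
import json


def build_form_map(entries):
    """Map each inflected form to the sorted list of roots that produce it."""
    form_map = {}
    for entry in entries:
        root = entry["word"]                 # hypothetical field name
        for form in entry.get("forms", []):  # hypothetical field name
            form_map.setdefault(form, set()).add(root)
    return {form: sorted(roots) for form, roots in form_map.items()}


def dumps_gz(obj):
    """Serialize an object to gzip-compressed UTF-8 JSON bytes."""
    return gzip.compress(json.dumps(obj, ensure_ascii=False).encode("utf-8"))


# Made-up sample entries; "tulen" is a form of both "tulla" and "tuli".
entries = [
    {"word": "tulla", "forms": ["tulen", "tuli"]},
    {"word": "tuli", "forms": ["tulen", "tulta"]},
]
form_map = build_form_map(entries)
print(form_map["tulen"])
print(len(dumps_gz(form_map)), "compressed bytes")
```

Keeping a list per form preserves ambiguous cases, and the lookup side can still take the first root when only one is wanted.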

compress-dict.py

Combines the JSON dictionary entries into four-letter files, each containing all the words that begin with those four letters.

cd output/all
../../compress-dict.py
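
Grouping by a four-letter prefix amounts to bucketing on the first four characters of each word. The sketch below is a hypothetical illustration rather than the code in compress-dict.py. It assumes entries arrive as a word-to-entry mapping and that words shorter than four letters are bucketed under the whole word; the real script may handle those differently.

```python
from collections import defaultdict


def bucket_by_prefix(entries, length=4):
    """Group a word -> entry mapping by the first `length` characters of each word."""
    buckets = defaultdict(dict)
    for word, entry in entries.items():
        # Slicing never fails on short words; "yö"[:4] is just "yö".
        buckets[word[:length]][word] = entry
    return dict(buckets)


# Made-up sample entries for illustration.
entries = {
    "koira": {"pos": "noun"},
    "koiras": {"pos": "noun"},
    "kissa": {"pos": "noun"},
    "yö": {"pos": "noun"},
}
for prefix, words in sorted(bucket_by_prefix(entries).items()):
    print(prefix, sorted(words))
```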

add-frequencies.py

Reads a word frequency file and adds frequency counts to dictionary entries.

cd output/all
../../add-frequencies.py word_freq.txt
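
The merge step can be sketched as below. The frequency file format is not documented here, so this assumes one "word count" pair per line separated by whitespace. The "frequency" field name and both helper functions are likewise hypothetical, not the script's actual interface.

```python
def parse_frequencies(lines):
    """Parse 'word count' lines into a dict, skipping malformed lines."""
    freqs = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # blank line or unexpected column count
        try:
            freqs[parts[0]] = int(parts[1])
        except ValueError:
            continue  # count is not an integer
    return freqs


def add_frequencies(entries, freqs):
    """Attach a frequency count to each entry; unseen words get 0."""
    for word, entry in entries.items():
        entry["frequency"] = freqs.get(word, 0)  # hypothetical field name
    return entries


# Made-up sample data for illustration.
lines = ["koira 1200", "kissa 950", "", "bad line here", "talo many"]
entries = {"koira": {}, "kissa": {}, "talo": {}}
print(add_frequencies(entries, parse_frequencies(lines)))
```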

CauldronDevelopmentLLC/wiktionarymunge
