wiktionary-tools

Wiktionary scraping, parsing, and annotation tools from the UniMorph Project

Warning

These scripts are provided for reference purposes. Support is not guaranteed.

Prerequisites

ZIMply (requires Python3)
Wiktionary ZIM archive
(OPTIONAL) Sample raw and annotated table templates

Usage

Extract a list of lemma words from Wiktionary ZIM archive (nopic files preferred for smaller file sizes) for each language listed in languages file (one per line). Each lemma is associated with a possible part of speech. Results appear as LANGUAGE_lemma_list_fast.txt files.

python zim_extract_lemmaList.py -zimfile wiktionary_en_all_nopic_2017-08.zim -langfile languages.txt

Extract html candidate pages form Wiktionary ZIM archive for languages listed in languages file (one per line). These pages are likely to contain morphological inflection tables. Results appear in a candidate_pages directory.

python zim_extract_all.py -zimfile wiktionary_en_all_nopic_2017-08.zim -langfile languages.txt

Create annotation templates from candidate pages. This generates raw_tables and annotated_tables directories.

python extract_example_tables.py -candidates_dir candidate_pages

After annotation (see guide below, and examples in wiktionary-annotation repo for reference), create morphologically annotated wordlists for all lemmas in each language. Annotation directory must contain both raw_tables and annotated_tables sub-directories. Results are generated in a tabular_results directory in the form of LANGUAGE_tabular_paradigms.txt files. fixParadigms.py removes paradigms that generated exceptions during parsing, resulting in LANGUAGE_tabular_paradigms_norm.txt output.

python extract_tabular_data.py -candidates_dir candidate_pages -annotation_dir . -language Ukrainian
python fixParadigms.py -language Ukrainian

Annotation Guide

Select a language and table to annotate from the raw tables directory. For example, raw_tables/Adyghe/N_001427_(6,3)_example.csv

The file names follow the convention: POS_LEMMACOUNT_SHAPE_example.csv

For example, looking at N_001427_(6,3)_example.csv:
- This is a table type associated with noun lemmas.
- There are 1427 lemmas that share this table template.
- The shape of the extracted table is 6 rows by 3 columns.
Copy the selected file to the annotated tables directory. This will be the version that you will annotate. For example, annotated_tables/Adyghe/N_001427_(6,3)_example.csv

DO NOT MODIFY THE ORIGINAL RAW FILE. The annotation pipeline relies on finding the differences between the unannotated and annotated files.
Open the copy in an editor. You will be greeted with a table of forms and grammatical descriptors as shown below.

AVOID USING EXCEL. IT MANGLES UNICODE CHARACTERS BY DEFAULT. LIBREOFFICE IS RECOMMENDED INSTEAD.
In order to cleanly extract all the inflected forms associated with a lemma with a particular table template, you just need to replace each inflected cell in the example template file with the correct UniMorph tags that correspond to it. It can help to look up the example lemma in Wiktionary online as additional reference. See below for an example.

DO NOT MODIFY ANY OF THE TABLE CELLS THAT DON’T CORRESPOND TO INFLECTED FORMS (e.g., any cells that correspond to grammatical descriptors like nominative, genitive, plural, etc.).
Save the modified file, and that’s it!

Here are some additional guidelines.

Avoid adding labels for innate lexical properties of words, especially if these aren’t EXPLICITLY written somewhere in the table you are annotating. These may include lexical gender or animacy for nouns, or aspect for Slavic verb systems. Focus instead on non-innate inflectional features that are marked in the table (e.g., noun case, verb tense, aspect for non-Slavic verbs).

POSSIBLE EXCEPTION: IF a lexical property IS marked in the table you are annotating, AND you are reasonably sure that the same table structure isn’t shared by words with different lexical properties, it probably doesn’t hurt to include the lexical property if UniMorph has a tag for it.
If you spot any cells that are wrong or shouldn't be extracted because they don't involve inflected forms of the correct part of speech, simply don't assign any UniMorph tags to them, and leave them as is. This could mean ignoring entire files, but that's fine.
Many table templates have a low count (~1), meaning only one lemma has that table type associated with it. It's up to you if you think it's worth annotating those.

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
images		images
.gitignore		.gitignore
AUTHORS		AUTHORS
LICENSE		LICENSE
README.md		README.md
extract_example_tables.py		extract_example_tables.py
extract_tabular_data.py		extract_tabular_data.py
fixParadigms.py		fixParadigms.py
languages.txt		languages.txt
zim_extract_all.py		zim_extract_all.py
zim_extract_lemmaList.py		zim_extract_lemmaList.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

wiktionary-tools

Warning

Prerequisites

Usage

Annotation Guide

About

Releases

Packages

Contributors 2

Languages

License

unimorph/wiktionary-tools

Folders and files

Latest commit

History

Repository files navigation

wiktionary-tools

Warning

Prerequisites

Usage

Annotation Guide

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages