This repository contains the code used to extract surface names from the Wikipedia XML dump, created for my Bachelor's thesis project. The code is largely based on this and this. This code, which contains my previous analysis of surface names, might also be helpful.
- To use this repository, first create and activate the virtual environment, then install the requirements.
- Before running the code, ensure that the paths in all the files are correct.
- The feeder script splits the Wikipedia dump into chunks to feed to the Python scripts (a sketch of the chunking idea appears after this list).
- To run the id-extractor, index-extractor, or text-extractor, simply change the corresponding line in the feeder script.
- The index-extractor outputs the list of indices which will be used for further extraction.
- The id-extractor outputs the article IDs and names for both articles and redirects.
- The text-extractor outputs the sentences and surface names.
- The surface-extractor outputs the frequency of each surface name for every article. Its input must be sorted by the destination entity corresponding to each surface name; since the CSV file can be very large, you may need external sorting (see the sketches after this list).
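
The actual feeder script lives in this repository, but as a rough illustration of the chunking idea it implements, here is a minimal Python sketch. The dump path, chunk size, and output naming are assumptions for illustration, not taken from the real script, which may stream chunks to the extractors differently:

```python
# Hypothetical sketch of splitting the dump on <page> boundaries.
# Paths, chunk size, and output naming are assumptions.
import bz2

DUMP_PATH = "enwiki-latest-pages-articles.xml.bz2"
PAGES_PER_CHUNK = 10_000

def chunk_pages(dump_path, pages_per_chunk):
    """Yield lists of raw <page>...</page> blocks from the dump."""
    chunk, page_lines, in_page = [], [], False
    with bz2.open(dump_path, "rt", encoding="utf-8") as f:
        for line in f:
            if "<page>" in line:
                in_page, page_lines = True, []
            if in_page:
                page_lines.append(line)
            if "</page>" in line and in_page:
                in_page = False
                chunk.append("".join(page_lines))
                if len(chunk) == pages_per_chunk:
                    yield chunk
                    chunk = []
    if chunk:
        yield chunk

for i, chunk in enumerate(chunk_pages(DUMP_PATH, PAGES_PER_CHUNK)):
    # Each chunk is a bare sequence of <page> blocks, not a complete
    # XML document; the downstream scripts are assumed to handle that.
    with open(f"chunk_{i:05d}.xml", "w", encoding="utf-8") as out:
        out.writelines(chunk)
```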
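
One way to handle the external sorting is to delegate to GNU `sort`, which spills to temporary files on disk instead of loading the whole CSV into memory. The two-column `surface,entity` layout and the file names below are assumptions:

```python
# Sort the surface-name CSV by its second column (the destination
# entity) using GNU sort. Column index and file names are assumptions.
import subprocess

subprocess.run(
    ["sort", "-t", ",", "-k", "2,2",
     "-o", "surfaces_sorted.csv", "surfaces.csv"],
    check=True,
)
# Caveat: sort splits on raw commas, so this only works if the fields
# contain no quoted, comma-bearing values.
```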
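
The sorting requirement exists because a single streaming pass can then aggregate the frequencies with `itertools.groupby`, which only merges adjacent rows. A minimal sketch, assuming two-column rows of `(surface_name, destination_entity)`; the column order is an assumption:

```python
# Hypothetical sketch of counting surface-name frequencies per
# destination entity from a pre-sorted CSV.
import csv
from collections import Counter
from itertools import groupby

def surface_frequencies(sorted_csv_path):
    """Yield (entity, Counter of surface names) from a sorted CSV."""
    with open(sorted_csv_path, newline="", encoding="utf-8") as f:
        rows = csv.reader(f)
        # groupby only groups adjacent rows, which is why all rows for
        # one entity must be contiguous, i.e. the file must be sorted.
        for entity, group in groupby(rows, key=lambda row: row[1]):
            yield entity, Counter(row[0] for row in group)
```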