turkish-academic-text-harvest

This repository contains scripts for downloading articles from DergiPark, a Turkish academic journal platform, and theses from the Turkish National Thesis Center. It also converts the downloaded PDF files to text and filters the result to produce a dataset for further analysis and research.

The repository is organized into the following directories:

- scrapers/: Scripts for scraping content from DergiPark and the Turkish National Thesis Center.
- extractors/: Tools for extracting text from PDF documents.
  - parallel_parser.py: Extracts text from PDFs concurrently to speed up processing.
  - extractor.py: Extracts and filters text from either PDF files or pre-parsed texts, preparing it for further analysis.
  - kenlm_score.py: Uses a KenLM language model to score the sentences in each document, which helps evaluate their linguistic quality.
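
The two ideas behind the extractors, concurrent extraction and score-based filtering, can be sketched as follows. This is a minimal illustration and not the repository's actual code: the names `extract_all`, `keep_fluent`, `extract_one`, `score_fn`, and `min_score_per_word` are hypothetical, and the PDF parser and language-model scorer are passed in as plain callables so the sketch makes no assumption about which libraries the scripts use.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, List, Optional


def extract_all(
    paths: Iterable[str],
    extract_one: Callable[[str], str],
    max_workers: int = 4,
) -> Dict[str, Optional[str]]:
    """Run extract_one over every path concurrently; failed files map to None."""
    results: Dict[str, Optional[str]] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the path it was submitted for.
        futures = {pool.submit(extract_one, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception:
                # One unreadable PDF should not abort the whole batch.
                results[path] = None
    return results


def keep_fluent(
    sentences: Iterable[str],
    score_fn: Callable[[str], float],
    min_score_per_word: float,
) -> List[str]:
    """Keep sentences whose length-normalized score reaches the threshold."""
    kept = []
    for sentence in sentences:
        # Normalize by word count so long sentences are not penalized
        # merely for being long (assumes score_fn returns a log-probability).
        n_words = max(len(sentence.split()), 1)
        if score_fn(sentence) / n_words >= min_score_per_word:
            kept.append(sentence)
    return kept
```

In practice `extract_one` would wrap whichever PDF parser is installed, and `score_fn` could be the `score` method of a loaded `kenlm.Model`, which returns a log10 probability for a sentence. Because PDF parsing is usually CPU-bound, swapping `ThreadPoolExecutor` for `ProcessPoolExecutor` may give better throughput, provided `extract_one` is a picklable top-level function.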

boun-tabi-LMG/turkish-academic-text-harvest
