text-scraper

Fast Python scraper for downloading texts from Project Gutenberg or from Wikisource (Romanian only).

Clone the repository:

git clone https://github.com/tudord14/text-scraper.git
cd text-scraper

Install requirements:
```
pip install -r requirements.txt
```

Main App =>

Fully working CLI app: you choose between Gutenberg and Wikisource (RO), then type the name of the author you want... easy! Be careful when typing author names, because the program is very sensitive to diacritics, especially for Romanian authors. Even for Swedish authors, for example, it is safest to copy the name from the internet. The project includes a txt file (see author_names.txt) with a lot of Romanian authors' names. For example:

Ion Creanga ❌
Ion Creangă ✅
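
To see why the two spellings above are treated as different names, here is a minimal sketch, not code from this repository. It shows that a plain string comparison fails when the diacritics are missing, and how a typed name could be matched against a list of known spellings. The helper names `strip_diacritics` and `suggest_exact_name` and the small `known` list are hypothetical and only there for illustration.

```python
import unicodedata
from typing import List, Optional


def strip_diacritics(name: str) -> str:
    """Return the name with combining marks removed, e.g. 'Creangă' -> 'Creanga'."""
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def suggest_exact_name(typed: str, known_names: List[str]) -> Optional[str]:
    """Return the known spelling that matches `typed` once diacritics and case are ignored."""
    key = strip_diacritics(typed).casefold().strip()
    for candidate in known_names:
        if strip_diacritics(candidate).casefold().strip() == key:
            return unicodedata.normalize("NFC", candidate)
    return None


# Hypothetical list; in practice the names could be read from a text file.
known = ["Ion Creangă", "Mihai Eminescu", "August Strindberg", "Selma Lagerlöf"]

print("Ion Creanga" == "Ion Creangă")                # False: the strings differ
print(suggest_exact_name("Ion Creanga", known))      # Ion Creangă
print(suggest_exact_name("selma lagerlof", known))   # Selma Lagerlöf
```

The suggested spelling can then be copied into the prompt as-is, which has the same effect as copying the name from the internet.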

Once you have noted this, run:

python main.py

Individual scripts =>

If (for some reason) you want to run the Wikisource or Gutenberg process independently, you can:

Wikisource(RO)

python wikisource_RO_texts_extraction.py

Gutenberg

python author_texts_extraction_epub_to_txt.py

Then clean up:

python author_directory_cleanup.py

Disclaimer

This tool scrapes publicly available text directly from the respective websites; it does not use any official API. By using this scraper, you assume full responsibility for your actions and for compliance with all relevant legal and ethical obligations. The author of this repository (me!) bears no liability for any misuse or resulting issues.

Have fun!
