Fast Python scraper for downloading texts from Project Gutenberg or Wikisource (Romanian only).
- Clone the repository:
git clone https://github.com/tudord14/text-scraper.git
cd text-scraper
- Install requirements:
pip install -r requirements.txt
Fully working CLI app: the user chooses between Gutenberg and Wikisource (RO), then types the desired author's name. Easy! Be careful when typing author names, because the program is sensitive to diacritics, especially for Romanian authors. The same goes for other languages: for a Swedish author, for example, it is safest to copy the name from the internet. I will include a txt file in the project with many Romanian authors' names. For example:
Ion Creanga ❌
Ion Creangă ✅
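To illustrate why the two spellings above are not interchangeable, here is a minimal Python sketch. It is not part of the repository: `strip_diacritics`, `suggest_author`, and the sample `authors` list are hypothetical names made up for this example. It shows that a plain string comparison treats "Ion Creanga" and "Ion Creangă" as different, and how a diacritic-free lookup against a list of known names could suggest the correct spelling.

```python
import unicodedata
from difflib import get_close_matches


def strip_diacritics(text: str) -> str:
    """Return text with combining marks removed (e.g. 'ă' becomes 'a')."""
    # NFD splits an accented letter into base letter + combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def suggest_author(typed: str, known_authors: list[str]) -> list[str]:
    """Suggest exact spellings for a name typed with or without diacritics."""
    # Map the diacritic-free, lowercase form back to the exact spelling.
    by_plain = {strip_diacritics(name).lower(): name for name in known_authors}
    plain_typed = strip_diacritics(typed).lower()
    matches = get_close_matches(plain_typed, list(by_plain), n=3, cutoff=0.8)
    return [by_plain[m] for m in matches]


# Hypothetical sample list; real names would come from the project's txt file.
authors = ["Ion Creangă", "Mihai Eminescu", "Selma Lagerlöf"]

print("Ion Creanga" == "Ion Creangă")          # False
print(suggest_author("Ion Creanga", authors))  # ['Ion Creangă']
```

A helper like this only suggests a spelling; the name passed to the scraper still has to be the exact one with diacritics.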
With that in mind, run:
python main.py
To run the Wikisource or Gutenberg process on its own, use the commands below.
- Wikisource (RO)
python wikisource_RO_texts_extraction.py
- Gutenberg
Then clean up:
python author_texts_extraction_epub_to_txt.py
python author_directory_cleanup.py
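The two cleanup commands above can also be chained from a small Python helper that stops at the first failing step. This is only a sketch, not something shipped with the repository: `run_in_order` and `CLEANUP_STEPS` are hypothetical names, and it assumes the scripts take no arguments and are run from the repository root, as in the commands above.

```python
import subprocess
import sys

# Script names are taken from the commands above; running them without
# arguments from the repository root is an assumption.
CLEANUP_STEPS = [
    "author_texts_extraction_epub_to_txt.py",
    "author_directory_cleanup.py",
]


def run_in_order(scripts: list[str]) -> int:
    """Run each script with the current interpreter; stop at the first failure.

    Returns 0 if every script succeeded, otherwise the failing exit code.
    """
    for script in scripts:
        result = subprocess.run([sys.executable, script])
        if result.returncode != 0:
            print(f"{script} exited with code {result.returncode}; stopping.")
            return result.returncode
    return 0
```

Calling `run_in_order(CLEANUP_STEPS)` from the repository root would run the EPUB-to-txt conversion first and the directory cleanup only if that conversion exits successfully.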
This tool scrapes publicly available text directly from the respective websites; it does not use any official API. By using this scraper, you assume full responsibility for your actions and for compliance with all relevant legal and ethical obligations. The author of this repository (me!) bears no liability for any misuse or resulting issues.
Have fun!