
crawly

Crawly is a Python tool for downloading scientific papers using Google Scholar and Sci-Hub. It is split into two scripts: crawly.py downloads a list of papers from Google Scholar and saves it to a CSV file (Authors, Paper Title, Year, Publication, URL), while scihub_downloader.py searches Sci-Hub by paper title and downloads the PDFs to the user's system.
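
For reference, here is a minimal sketch of reading the generated CSV with Python's csv module; the exact column-header spelling is an assumption based on the description above:

    import csv

    # Read the CSV produced by crawly.py; column names are assumed
    # from the README (Authors, Paper Title, Year, Publication, URL).
    with open("demo_1-5.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            print(row["Year"], row["Paper Title"])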

It uses Selenium to automate the process of downloading papers from Sci-Hub.
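
Roughly, the automated flow looks like the sketch below; the mirror URL (https://sci-hub.se) and the name of the search form field ("request") are assumptions, so adjust them to the mirror you use:

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()  # needs ChromeDriver, see the note below
    driver.get("https://sci-hub.se")               # assumed mirror URL
    box = driver.find_element(By.NAME, "request")  # assumed field name
    box.send_keys("Some Paper Title")
    box.submit()
    # ...locate the PDF link on the result page and download it...
    driver.quit()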

How to use

  • Install the required packages using requirements.txt (I recommend creating a virtual Python/conda environment)

    pip install -r requirements.txt
    
  • Run the crawly.py script to scrape Google Scholar for a given query

    python crawly.py
    
    • The script first checks whether an internet connection is available (see the connectivity-check sketch after this list); if connected, it asks for user inputs:
      • Keywords to search on Google Scholar (example: object detection using machine learning)
      • File name for the CSV file to be created (example: demo)
      • Page range to scrape, such as 1-3 or 4-8 (example: 1-5)
      • Optionally, whether to sort the papers by year (asc for ascending order, desc for descending) (example: asc)
    • After all inputs are given, crawly starts scraping the Google Scholar pages.
    • Once done, it saves the file as "file_name_page_range".csv (for example: demo_1-5.csv, if the file name given was demo)
  • Run scihub_downloader.py to search for and download papers from Sci-Hub using the created CSV file (NOTE: this script uses Selenium to automate the download process; it pauses for user input after every 10 files).

    python scihub_downloader.py
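
As noted in the first step above, crawly.py verifies the internet connection before prompting for input. A minimal sketch of one common way to do this (the host, port, and timeout below are illustrative assumptions, not crawly's own values):

    import socket

    # A simple connectivity check: try to open a socket to a
    # well-known DNS server and report success or failure.
    def internet_available(host="8.8.8.8", port=53, timeout=3):
        try:
            socket.create_connection((host, port), timeout=timeout).close()
            return True
        except OSError:
            return False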
    

NOTE: To run scihub_downloader.py, ChromeDriver is required. Once ChromeDriver is downloaded, extract the archive and add its location to your PATH environment variable.
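
Alternatively, instead of editing PATH, you can point Selenium at the driver binary directly; the path below is a placeholder for wherever you extracted ChromeDriver:

    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    # Placeholder path; replace with your extracted ChromeDriver location.
    driver = webdriver.Chrome(service=Service("/path/to/chromedriver"))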

Contributions

Feel free to contribute to this project by proposing any change, fix, or enhancement on the dev branch.

To do

  • Testing
  • Code documentation
  • General improvements
    • Add other sources for pdf download
    • Add summarization tool(script) for downloaded pdfs
    • Make scripts more robust against request blocks and other errors
    • Create proper pip package for easy use

Disclaimer

This application is for educational purposes only. I do not take responsibility for what you choose to do with this application.

Credits

Donation

If you like this project, you can buy me a cup of tea via PayPal :)
