The goal was to scrape over 320k wine reviews for use in a recommender project. First, the scraper package samssimplescraper was developed. Because there were so many links to scrape, four servers were used. The servers were set up on cloud.google.com. miniconda and samssimplescraper were installed on each server. The data folder on each server was mounted to a single Google Cloud Storage bucket, and all of the scraped HTML files were saved there.
The scraped data will be cleaned and formatted for use in a machine learning project, which will be linked here soon.
Follow the installation instructions. The docstrings contain detailed explanations for use.
- Access to servers via Google Cloud, AWS, or other means. If the scrape job is small, it can be done from a home computer.
- Install miniconda or some other Python support on the server(s).
- Install gcsfuse on the server to mount the bucket for data storage. For large jobs, mounted storage of some sort is recommended.
- To replicate the code here, use samssimplescraper.
Once the server has Python running, the code can be run with only the following package installed:
Install the pip package:
pip install samssimplescraper==0.1.3
The code can be used as is for learning purposes, or it can be adapted to your own goal and run in the server shell or locally. Follow the Roadmap and feel free to get in touch with any and all questions or comments.
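To illustrate the basic pattern the scrapers follow (fetch a page, store the raw HTML for later cleaning), here is a minimal standard-library sketch. This is not the samssimplescraper API; the filename scheme and example URL are assumptions for illustration only:

```python
import urllib.request
from pathlib import Path

def slug_for(url: str) -> str:
    """Derive a filename slug from the last path segment of a URL."""
    return url.rstrip("/").rsplit("/", 1)[-1]

def save_html(url: str, out_dir: str = "data/scraped_html") -> Path:
    """Fetch one page and save the raw HTML into the data folder."""
    out_path = Path(out_dir) / f"{slug_for(url)}.html"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as resp:
        out_path.write_bytes(resp.read())
    return out_path
```

With the data folder mounted to a bucket via gcsfuse, files written this way land directly in cloud storage.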
- Create server instance(s) and a bucket on Google Cloud or another provider.
- Clone this repository onto the server:
git clone https://github.com/SamuelAdamsMcGuire/remote_data_collection
- Create the following folder structure:
├── data
│ ├── pickled_lists
│ └── scraped_html
├── logs
├── config.py
├── links_scraper.py
├── sitemap_scraper.py
└── status_checker.py
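If you prefer to script it, the folder layout above can be created with a few lines of Python (folder names taken directly from the tree above):

```python
from pathlib import Path

# Folders used by the scraper scripts
for folder in ("data/pickled_lists", "data/scraped_html", "logs"):
    Path(folder).mkdir(parents=True, exist_ok=True)
```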
- The config.py is only necessary if you wish to use logging to receive an email. There is an example of the needed credentials in config_example.py. Some tweaks to your email settings may be necessary in order to receive the emails. Note: logs are also saved on the server.
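As a sketch of how email alerts can be wired into Python logging with the standard library (the host, addresses, and password below are placeholders, not the actual contents of config_example.py):

```python
import logging
from logging.handlers import SMTPHandler

logger = logging.getLogger("scraper")
logger.setLevel(logging.INFO)

# Placeholder credentials -- in this project they would live in config.py
mail_handler = SMTPHandler(
    mailhost=("smtp.example.com", 587),
    fromaddr="[email protected]",
    toaddrs=["[email protected]"],
    subject="Scrape job status",
    credentials=("[email protected]", "app-password"),
    secure=(),  # upgrade the connection with STARTTLS
)
mail_handler.setLevel(logging.ERROR)  # only email on errors
logger.addHandler(mail_handler)
```

Nothing is sent until an ERROR-level record is logged, so the handler can sit quietly alongside the file logs saved on the server.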
- If a large number of links is being scraped (here there are over 320k), make sure to first mount the data folder to a bucket. On Google Cloud, first create the bucket, then mount it in the server shell using the following command:
gcsfuse example-bucket /path/to/mount
- Adapt the code to your project or replicate this one.
- Scrape the sitemap(s):
python sitemap_scraper.py
- Scrape the links. When running large scrape jobs, it is also wise to run the process in the background:
nohup python links_scraper.py &
- Check on progress using the status checker:
python status_checker.py
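A status checker for a job like this can be as simple as comparing how many HTML files have landed against how many links were queued. The sketch below assumes the queued links were pickled to a file such as links.pkl (a hypothetical name; the repository's status_checker.py may differ):

```python
import pickle
from pathlib import Path

def report_progress(links_pickle="data/pickled_lists/links.pkl",
                    html_dir="data/scraped_html"):
    """Return (scraped, total) counts for a running scrape job."""
    with open(links_pickle, "rb") as f:
        links = pickle.load(f)
    scraped = len(list(Path(html_dir).glob("*.html")))
    return scraped, len(links)
```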
See the open issues for a full list of proposed features (and known issues).
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!
- Fork the Project
- Create your Feature Branch (git checkout -b feature/AmazingFeature)
- Commit your Changes (git commit -m 'Add some AmazingFeature')
- Push to the Branch (git push origin feature/AmazingFeature)
- Open a Pull Request
Distributed under the MIT License. See LICENSE for more information.
Samuel Adams McGuire - [email protected]
Linkedin: LinkedIn
Project Link: https://github.com/SamuelAdamsMcGuire/wine_data_collection
Pypi Link for samssimplescraper: https://pypi.org/project/samssimplescraper/0.1.3/