<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta name="description" content="Learn how to scrape a large set of products using Python. Step-by-step guide with tools, technologies, and code examples.">
<title>Scraping a Large Set of Products - Page 1</title>
<meta property="og:image" content="https://seijmonsbergen.com/wp-content/uploads/2021/10/9618C311-AA12-42FA-85BE-204F41A2CF04-removebg.png">
<meta property="og:url" content="https://diego9621.github.io/">
<link rel="apple-touch-icon" sizes="180x180" href="/apple-touch-icon.png">
<link rel="icon" type="image/png" sizes="32x32" href="/favicon-32x32.png">
<link rel="icon" type="image/png" sizes="16x16" href="/favicon-16x16.png">
<link rel="manifest" href="/site.webmanifest">
<link rel="stylesheet" href="styles.css">
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.0.0/css/all.min.css">
<link rel="canonical" href="https://diego9621.github.io/cornerstone-webscraping-1.html">
<!-- Google site verification -->
<meta name="google-site-verification" content="pxUf802VTAASjqVZvlySS0cyPYUlPphLGaAmWqLu3V8">
<!-- Google tag (gtag.js) -->
<script async src="https://www.googletagmanager.com/gtag/js?id=G-N2CPCFLGWB"></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'G-N2CPCFLGWB');
</script>
</head>
<body>
<header>
<nav>
<ul>
<li><a href="index.html#home">Home</a></li>
<li><a href="index.html#about">About</a></li>
<li><a href="index.html#projects">Projects</a></li>
<li><a href="index.html#blog">Blog</a></li>
<li><a href="index.html#contact">Contact</a></li>
</ul>
</nav>
</header>
<main>
<div class="banner">
<h1>Scraping a Large Set of Products</h1>
<p>Automated Data Collection from Mayesh’s Online Flower Shop</p>
</div>
<article class="project-article">
<p>This project scrapes the online shop of <a href="https://www.mayesh.com/shop?perPage=100&sortBy=Name-ASC&pageNumb=1&date=&is_sales_rep=0&is_e_sales=0&criteria={}&criteriaInt={}&search=&s_search=" target="_blank">Mayesh</a>, a wholesale flower company. The shop lists thousands of flowers, and our goal is to collect comprehensive data for each product, including images, prices, and descriptions.</p>
<h2>1. Project Overview</h2>
<p>The objective of this project is to build an automated web scraping tool that extracts all relevant product information from Mayesh's online shop and stores it in a structured format for further analysis.</p>
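<p>The shop link in the introduction carries its paging state in query parameters (<code>perPage=100</code>, <code>pageNumb=1</code>), which suggests that listing pages can be requested one by one. The following is a minimal sketch of building those URLs with the standard library. The function name <code>build_page_url</code> is hypothetical, only the paging and sorting parameters from the link are kept, and whether the site requires the remaining parameters has not been verified.</p>

```python
from urllib.parse import urlencode

# Base address taken from the shop link in the introduction.
SHOP_URL = "https://www.mayesh.com/shop"


def build_page_url(page_number, per_page=100):
    """Return the listing URL for one page of results (hypothetical helper)."""
    # Assumption: these three parameters are enough to select a page.
    # The original link carries more parameters that are omitted here.
    params = {
        "perPage": per_page,
        "sortBy": "Name-ASC",
        "pageNumb": page_number,
    }
    return f"{SHOP_URL}?{urlencode(params)}"


# URLs for the first three listing pages.
first_three = [build_page_url(n) for n in range(1, 4)]
```

<p>A scraper could iterate over such URLs until a page returns no products. The stopping condition depends on how the site responds past the last page, which is not covered here.</p>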
<h2>2. Tools & Technologies</h2>
<ul>
<li>
<strong>Programming Language:</strong>
<a href="https://www.python.org/" target="_blank">Python</a>
</li>
<li>
<strong>Web Scraping Libraries:</strong>
<ul>
<li>
<a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/" target="_blank">Beautiful Soup</a> - For parsing HTML content
</li>
<li>
<a href="https://scrapy.org/" target="_blank">Scrapy</a> - For large-scale scraping with a built-in request scheduler and feed exports
</li>
<li>
<a href="https://selenium.dev/" target="_blank">Selenium</a> - For dynamic content handling and web automation
</li>
</ul>
</li>
<li>
<strong>Data Storage:</strong>
<ul>
<li>
<a href="https://pandas.pydata.org/" target="_blank">Pandas</a> - For data manipulation and exporting to CSV
</li>
<li>
<a href="https://www.mongodb.com/" target="_blank">MongoDB</a> - For storing data in a NoSQL database
</li>
</ul>
</li>
<li>
<strong>Development Environment:</strong>
<a href="https://code.visualstudio.com/" target="_blank">Visual Studio Code</a>
</li>
<li>
<strong>Version Control:</strong>
<a href="https://github.com/" target="_blank">GitHub</a>
</li>
</ul>
<h2>3. Installation and Setup</h2>
<p>Before collecting any data, make sure Python is installed and your environment has the necessary libraries. The steps below cover the setup:</p>
<h3>3.1 Install Python</h3>
<p>If Python is not yet installed on your system, download it from <a href="https://www.python.org/downloads/" target="_blank">python.org</a>. After installation, check the version by running the command below (on some systems the command is <code>python3</code> rather than <code>python</code>):</p>
<pre><code>python --version</code></pre>
<h3>3.2 Setting Up a Virtual Environment</h3>
<p>It is good practice to use a virtual environment so that the project's dependencies stay isolated from the system-wide Python installation. Create and activate one as follows:</p>
<pre><code>python -m venv myenv
source myenv/bin/activate  # On Windows, run myenv\Scripts\activate instead</code></pre>
<h3>3.3 Install Required Libraries</h3>
<p>Install the main libraries used in this project with a single pip command:</p>
<pre><code>pip install scrapy beautifulsoup4 selenium pandas pymongo</code></pre>
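<p>After installing, it can help to confirm that the libraries are visible from the active environment. The sketch below uses only the standard library. The helper name <code>missing_modules</code> is hypothetical, and the list holds the import names, which differ from the pip names in one case (<code>beautifulsoup4</code> is imported as <code>bs4</code>).</p>

```python
from importlib.util import find_spec

# Import names for the libraries installed above.
# Note: the pip package beautifulsoup4 is imported as bs4.
REQUIRED_MODULES = ["scrapy", "bs4", "selenium", "pandas", "pymongo"]


def missing_modules(module_names):
    """Return the names that cannot be found in the current environment."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in module_names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("All required libraries are importable.")
```

<p>If the script reports missing libraries, the usual cause is that the virtual environment was not activated before running pip.</p>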
<h2>4. Data Collection</h2>
<p>For data collection, we used a combination of <a href="https://scrapy.org/" target="_blank">Scrapy</a> and <a href="https://selenium.dev/" target="_blank">Selenium</a>. Scrapy was chosen for its speed and efficiency on static content, while Selenium handled content that is loaded dynamically. The collected data includes:</p>
<ul>
<li>Product Name</li>
<li>Price</li>
<li>Image URL</li>
<li>Description</li>
</ul>
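<p>The four fields listed above map naturally onto a small record type. The following sketch shows one possible shape for a scraped product and how a list of such records could be written as CSV. The class name <code>Product</code>, the field names, and the sample values are all hypothetical. The article's tool list names Pandas for CSV export; this sketch uses the standard library's <code>csv</code> module instead so that it runs without extra dependencies.</p>

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class Product:
    # Hypothetical field names that mirror the list above.
    name: str
    price: str  # kept as text because the price format is not specified
    image_url: str
    description: str


def products_to_csv(products):
    """Serialize Product records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=[f.name for f in fields(Product)],
        lineterminator="\n",
    )
    writer.writeheader()
    for product in products:
        writer.writerow(asdict(product))
    return buffer.getvalue()


# Made-up sample values for illustration, not real shop data.
sample = Product(
    name="Example Rose",
    price="0.00",
    image_url="https://example.com/rose.jpg",
    description="Placeholder description",
)
csv_text = products_to_csv([sample])
```

<p>Keeping the record definition separate from the scraping code means the same records could later be handed to Pandas or MongoDB without changing how they are collected.</p>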
<p>The next page covers the data collection process in more detail.</p>
<nav class="pagination">
<a href="cornerstone-webscraping-2.html">Next »</a>
</nav>
</article>
</main>
<footer>
<ul class="social-links">
<li><a href="https://github.com/your-profile" target="_blank"><i class="fab fa-github"></i> GitHub</a></li>
<li><a href="https://linkedin.com/in/your-profile" target="_blank"><i class="fab fa-linkedin"></i> LinkedIn</a></li>
<li><a href="https://twitter.com/your-profile" target="_blank"><i class="fab fa-twitter"></i> Twitter</a></li>
</ul>
<p>© [Your Name] 2024 | All Rights Reserved</p>
</footer>
</body>
</html>