This repository contains code and resources for automated web scraping of dynamic content using Selenium. The project demonstrates how to scrape data from web pages that render their content dynamically with JavaScript.
```
Automated-Web-Scraping-of-Dynamic-Content-with-Selenium/
│
├── .gitattributes
├── Web Scraping Code File.ipynb
└── quotes.csv
```
- `.gitattributes`: Configuration file to ensure consistent handling of files across different operating systems.
- `Web Scraping Code File.ipynb`: Jupyter Notebook containing the code for web scraping using Selenium.
- `quotes.csv`: CSV file containing the scraped data.
This project focuses on using Selenium for web scraping to handle dynamic content that is loaded via JavaScript. The project involves the following key steps:
- Set up Selenium WebDriver for automated browsing.
- Navigate to the target website and interact with web elements to load dynamic content.
- Extract the required data and save it to a CSV file.
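A minimal, self-contained sketch of these steps is shown below; the headless option and the explicit wait are assumptions for the example rather than part of the notebook's code:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Launch a headless Chrome session driven by Selenium.
options = Options()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)

# Navigate to a page whose content is rendered by JavaScript.
driver.get("http://quotes.toscrape.com/js/")

# Wait until the dynamically loaded quotes are present in the DOM.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "quote"))
)

# Extract the rendered elements.
quotes = driver.find_elements(By.CLASS_NAME, "quote")
print(f"Found {len(quotes)} quotes on the first page")

driver.quit()
```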
- Web Scraping: Selenium, BeautifulSoup
- Libraries: Pandas, NumPy, Selenium
- Data Storage: CSV
To get started with this project, follow the steps below:
Make sure you have the following installed:
- Python 3.x
- Jupyter Notebook
- Required Python libraries (listed in `requirements.txt`)
- Selenium WebDriver for your browser (e.g., ChromeDriver for Google Chrome)
- Clone this repository to your local machine: `git clone https://github.com/Harshraj1301/Automated-Web-Scraping-of-Dynamic-Content-with-Selenium.git`
- Navigate to the project directory: `cd Automated-Web-Scraping-of-Dynamic-Content-with-Selenium`
- Install the required Python libraries: `pip install -r requirements.txt`
- Download the appropriate Selenium WebDriver for your browser and ensure it is in your system's PATH.
- Open the Jupyter Notebook: `jupyter notebook "Web Scraping Code File.ipynb"`
- Follow the instructions in the notebook to run the code cells and scrape dynamic content using Selenium.
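As a quick smoke test of the WebDriver setup (a sketch, not part of the notebook): if the following runs without raising, Selenium can find and drive your browser. Note that Selenium 4.6+ can also download a matching driver automatically via Selenium Manager.

```python
from selenium import webdriver

# If this opens and closes a Chrome window without an exception,
# ChromeDriver is installed correctly and reachable from PATH.
driver = webdriver.Chrome()
print("WebDriver session started:", driver.session_id is not None)
driver.quit()
```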
The notebook `Web Scraping Code File.ipynb` includes the following steps:
- Setting Up Selenium WebDriver: Instructions for setting up Selenium WebDriver and navigating to the target website.
- Interacting with Web Elements: Code for interacting with web elements to load dynamic content.
- Data Extraction: Code for extracting the required data using Selenium and BeautifulSoup.
- Data Storage: Saving the scraped data to a CSV file.
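The extraction step in the notebook below uses Selenium locators directly; where BeautifulSoup is preferred for parsing, a sketch along these lines (assuming the site's standard quote markup and an already-open `driver`) would also work:

```python
from bs4 import BeautifulSoup

# Parse the HTML that Selenium has already rendered, including JS content.
soup = BeautifulSoup(driver.page_source, "html.parser")

for quote in soup.select("div.quote"):
    text = quote.select_one("span.text").get_text(strip=True)
    author = quote.select_one("small.author").get_text(strip=True)
    tags = [t.get_text(strip=True) for t in quote.select("a.tag")]
    print(author, text, tags)
```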
Here are the contents of the notebook:
```
pwd
```

```
pip install selenium
```
```python
import csv
import time

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Recording the start time
start_time = time.time()

# Initializing Chrome WebDriver
driver = webdriver.Chrome()

# Opening the website
driver.get("http://quotes.toscrape.com/js/")

# Initializing a list to store data
quotes_data = []

# Creating a WebDriverWait object with a timeout of 10 seconds
wait = WebDriverWait(driver, 10)

# Iterating through all pages
while True:
    # Waiting until the JavaScript-rendered quotes are present on the page
    wait.until(EC.presence_of_element_located((By.CLASS_NAME, "quote")))

    # Finding all the quotes on the current page
    quotes = driver.find_elements(By.CLASS_NAME, "quote")

    # Extracting data for each quote on the current page
    for quote in quotes:
        quote_text = quote.find_element(By.CLASS_NAME, "text").text
        author = quote.find_element(By.CLASS_NAME, "author").text
        tags = "|".join(tag.text for tag in quote.find_elements(By.CLASS_NAME, "tag"))

        # Appending data to the list
        quotes_data.append([author, quote_text, tags])

    # Checking if there is a next page
    try:
        next_page = driver.find_element(By.XPATH, "//li[@class='next']/a")
        if 'disabled' in (next_page.get_attribute("class") or ""):
            print("Reached the last page.")
            break
    except NoSuchElementException:
        print("Next page button not found. Exiting.")
        break

    # Scrolling to the next page element
    print("Scrolling to the next page.")
    ActionChains(driver).move_to_element(next_page).perform()

    # Clicking the "Next" link
    print("Clicking the next page.")
    next_page.click()

# Closing the WebDriver
driver.quit()

# Recording the end time
end_time = time.time()

# Calculating the total run time
total_run_time = end_time - start_time
print(f"Total run time: {total_run_time} seconds")

# Saving the data to a CSV file
with open("quotes.csv", "w", newline="", encoding="utf-8") as csv_file:
    csv_writer = csv.writer(csv_file)

    # Writing header row
    csv_writer.writerow(["Author", "Quote", "Tags"])

    # Writing quote data
    csv_writer.writerows(quotes_data)

print("Quotes have been scraped and saved to quotes.csv.")
```
The notebook includes the results of the web scraping process, showcasing the extracted data saved in the `quotes.csv` file.
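Since Pandas is listed among the project's libraries, a quick way to inspect the output (a sketch, assuming `quotes.csv` is in the working directory):

```python
import pandas as pd

# Load the scraped quotes and take a quick look at the result.
df = pd.read_csv("quotes.csv")
print(df.shape)   # one row per quote; columns: Author, Quote, Tags
print(df.head())
```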
If you'd like to contribute to this project, please follow these steps:
- Fork the repository.
- Create a new branch: `git checkout -b feature-branch-name`
- Make your changes and commit them: `git commit -m 'Add some feature'`
- Push to the branch: `git push origin feature-branch-name`
- Submit a pull request.
This project is licensed under the MIT License - see the LICENSE file for details.
- This project was created as part of an assignment by Harshraj Jadeja.
- Thanks to the open-source community for providing valuable resources and libraries for web scraping.
Feel free to modify this `README.md` file as per your specific requirements and project details.