
External Link Checker

External Link Checker is a Python-based tool designed to help website administrators maintain the integrity of their external links. It scrapes a website recursively from the homepage, identifies all external links, and periodically checks their validity to ensure they do not lead to malicious or inactive sites. This project aims to prevent potential security risks and protect brand reputation by ensuring that all external links remain safe and relevant over time.

Features

  • Recursively scrape links from a given base URL
  • Identify and collect external links
  • Save external links, their source pages, and safety status to a CSV file
  • Disable SSL verification to handle sites with SSL/TLS issues
  • Proxy support for network configurations
  • Exclude non-http/https links from safety checks but include them in the CSV output
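
As a rough illustration of the link identification and SSL/proxy handling listed above, the sketch below fetches a single page and separates external links from internal ones by comparing hostnames. The function name and the exact request options are assumptions for illustration and may not match the actual code in external_link_checker.py.

    # Illustrative sketch only -- the real script may be structured differently.
    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin, urlparse

    def find_external_links(page_url, base_netloc, proxies=None):
        # verify=False mirrors the "disable SSL verification" feature above
        response = requests.get(page_url, verify=False, proxies=proxies, timeout=10)
        soup = BeautifulSoup(response.text, "html.parser")
        internal, external = [], []
        for anchor in soup.find_all("a", href=True):
            link = urljoin(page_url, anchor["href"])
            netloc = urlparse(link).netloc
            if netloc == base_netloc:
                internal.append(link)   # followed recursively by the crawler
            elif netloc:
                external.append(link)   # recorded in the CSV output
        return internal, external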

Requirements

  • Python 3.x
  • requests
  • beautifulsoup4
  • urllib3
  • python-dotenv
  • google-cloud-webrisk
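
These dependencies are listed in the repository's requirements.txt. If you run the tool outside the devcontainer described below, they can be installed with pip:

    pip install -r requirements.txt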

Getting Started

Prerequisites

  • Docker and Docker Compose installed
  • Visual Studio Code with Remote - Containers extension installed
  • WSL (Windows Subsystem for Linux) set up

Setup

  1. Clone the repository:

    git clone https://github.com/ShutoYamada/external-link-checker.git
    cd external-link-checker
  2. Create a .env file in the project root and add your Google API key:

    GOOGLE_API_KEY=your_api_key_here
    
  3. Open the project in Visual Studio Code.

  4. When prompted, select "Reopen in Container" to open the project in a devcontainer.

Usage

  1. Ensure you are in the root directory of the project.

  2. Run the following command to scrape external links and save the results to a CSV file:

    python external_link_checker.py https://yourcompanywebsite.com output.csv

    Replace https://yourcompanywebsite.com with the URL of the website you want to scrape, and output.csv with the desired output file name.
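
For orientation, here is a minimal sketch of what such a command-line entry point could look like; the argument handling and the crawl helper shown here are assumptions and may differ from the actual implementation:

    # Hypothetical sketch; `crawl` is an assumed helper, not necessarily the real function name.
    import csv
    import sys

    def crawl(base_url):
        # Placeholder: in the real script this would recursively visit internal pages
        # and return (external_url, source_page, safety_status) rows.
        return []

    def main():
        base_url, output_path = sys.argv[1], sys.argv[2]
        rows = crawl(base_url)
        with open(output_path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["external_url", "source_page", "safety_status"])
            writer.writerows(rows)

    if __name__ == "__main__":
        main()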

Environment Variables

  • GOOGLE_API_KEY: Your Google API key for accessing the Web Risk API.
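
As a minimal sketch, assuming the key is used against the Web Risk uris.search REST endpoint (the project lists google-cloud-webrisk, so the actual client code may differ), loading and using it could look like this:

    # Illustration only: reads GOOGLE_API_KEY from .env and queries the Web Risk REST API.
    import os
    import requests
    from dotenv import load_dotenv

    load_dotenv()
    API_KEY = os.getenv("GOOGLE_API_KEY")

    def is_flagged(url):
        resp = requests.get(
            "https://webrisk.googleapis.com/v1/uris:search",
            params={
                "key": API_KEY,
                "uri": url,
                "threatTypes": ["MALWARE", "SOCIAL_ENGINEERING", "UNWANTED_SOFTWARE"],
            },
            timeout=10,
        )
        resp.raise_for_status()
        # The response contains a "threat" field only when the URI matches a threat list.
        return "threat" in resp.json()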

Project Structure

root/
├── .devcontainer/
│   ├── Dockerfile
│   └── devcontainer.json
├── requirements.txt
├── external_link_checker.py
├── .env (not included in version control)
└── .env.example

Contributions

Contributions are welcome! Please fork the repository and submit a pull request with your improvements.

License

This project is licensed under the MIT License.
