
External Link Checker

External Link Checker is a Python-based tool designed to help website administrators maintain the integrity of their external links. It scrapes a website recursively from the homepage, identifies all external links, and periodically checks their validity to ensure they do not lead to malicious or inactive sites. This project aims to prevent potential security risks and protect brand reputation by ensuring that all external links remain safe and relevant over time.

Features

  • Recursively scrape links from a given base URL
  • Identify and collect external links
  • Save external links, their source pages, and safety status to a CSV file
  • Disable SSL verification to handle sites with SSL/TLS issues
  • Proxy support for network configurations
  • Exclude non-http/https links from safety checks but include them in the CSV output
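
As a rough illustration of the link identification and SSL/proxy handling listed above, the sketch below fetches a single page and separates external links from internal ones by comparing hostnames. The function name and the exact request options are assumptions for illustration and may not match the actual code in external_link_checker.py.

    # Illustrative sketch only -- the real script may be structured differently.
    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin, urlparse

    def find_external_links(page_url, base_netloc, proxies=None):
        # verify=False mirrors the "disable SSL verification" feature above
        response = requests.get(page_url, verify=False, proxies=proxies, timeout=10)
        soup = BeautifulSoup(response.text, "html.parser")
        internal, external = [], []
        for anchor in soup.find_all("a", href=True):
            link = urljoin(page_url, anchor["href"])
            netloc = urlparse(link).netloc
            if netloc == base_netloc:
                internal.append(link)   # followed recursively by the crawler
            elif netloc:
                external.append(link)   # recorded in the CSV output
        return internal, external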

Requirements

  • Python 3.x
  • requests
  • beautifulsoup4
  • urllib3
  • python-dotenv
  • google-cloud-webrisk
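
These dependencies are listed in the repository's requirements.txt. If you run the tool outside the devcontainer described below, they can be installed with pip:

    pip install -r requirements.txt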

Getting Started

Prerequisites

  • Docker and Docker Compose installed
  • Visual Studio Code with Remote - Containers extension installed
  • WSL (Windows Subsystem for Linux) set up

Setup

  1. Clone the repository:

    git clone https://github.com/ShutoYamada/external-link-checker.git
    cd external-link-checker
  2. Create a .env file in the project root and add your Google API key:

    GOOGLE_API_KEY=your_api_key_here
    
  3. Open the project in Visual Studio Code.

  4. When prompted, select "Reopen in Container" to open the project in a devcontainer.

Usage

  1. Ensure you are in the root directory of the project.

  2. Run the following command to scrape external links and save the results to a CSV file:

    python external_link_checker.py https://yourcompanywebsite.com output.csv

    Replace https://yourcompanywebsite.com with the URL of the website you want to scrape, and output.csv with the desired output file name.
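
For orientation, here is a minimal sketch of what such a command-line entry point could look like; the argument handling and the crawl helper shown here are assumptions and may differ from the actual implementation:

    # Hypothetical sketch; `crawl` is an assumed helper, not necessarily the real function name.
    import csv
    import sys

    def crawl(base_url):
        # Placeholder: in the real script this would recursively visit internal pages
        # and return (external_url, source_page, safety_status) rows.
        return []

    def main():
        base_url, output_path = sys.argv[1], sys.argv[2]
        rows = crawl(base_url)
        with open(output_path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["external_url", "source_page", "safety_status"])
            writer.writerows(rows)

    if __name__ == "__main__":
        main()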

Environment Variables

  • GOOGLE_API_KEY: Your Google API key for accessing the Web Risk API.
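
As a minimal sketch, assuming the key is used against the Web Risk uris.search REST endpoint (the project lists google-cloud-webrisk, so the actual client code may differ), loading and using it could look like this:

    # Illustration only: reads GOOGLE_API_KEY from .env and queries the Web Risk REST API.
    import os
    import requests
    from dotenv import load_dotenv

    load_dotenv()
    API_KEY = os.getenv("GOOGLE_API_KEY")

    def is_flagged(url):
        resp = requests.get(
            "https://webrisk.googleapis.com/v1/uris:search",
            params={
                "key": API_KEY,
                "uri": url,
                "threatTypes": ["MALWARE", "SOCIAL_ENGINEERING", "UNWANTED_SOFTWARE"],
            },
            timeout=10,
        )
        resp.raise_for_status()
        # The response contains a "threat" field only when the URI matches a threat list.
        return "threat" in resp.json()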

Project Structure

root/
├── .devcontainer/
│   ├── Dockerfile
│   └── devcontainer.json
├── requirements.txt
├── external_link_checker.py
├── .env (not included in version control)
└── .env.example

Contributions

Contributions are welcome! Please fork the repository and submit a pull request with your improvements.

License

This project is licensed under the MIT License.
