A small, typed Python CLI that extracts unique, absolute http(s) links from a single webpage and writes them to a file or stdout. It handles relative URLs, honors `<base href>`, removes fragments (`#...`), filters non-web schemes, and includes robust request handling (timeouts, retries, custom User-Agent).
Features:

- Converts relative → absolute URLs (`urljoin`), respects `<base href>`
- Filters out `mailto:`, `javascript:`, `tel:`, and `data:` links
- Deduplicates and optionally `--sort`s output
- Retries transient HTTP errors (429/5xx) and uses a reusable `requests.Session`
- Verifies HTML content-type before parsing
- CLI flags for output path, timeout, and User-Agent
- Clear exit codes and helpful error messages
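Taken together, the fetch-and-extract flow described above could look like the following minimal sketch. The `extract_links` helper, its signature, and the retry settings are assumptions for illustration, not the script's actual internals:

```python
import requests
from bs4 import BeautifulSoup
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from urllib.parse import urldefrag, urljoin, urlparse

def extract_links(url: str, timeout: float = 10.0,
                  user_agent: str | None = None) -> set[str]:
    """Fetch one page and return its unique, absolute http(s) links."""
    session = requests.Session()
    # Retry transient errors (429/5xx) with exponential backoff.
    retry = Retry(total=3, backoff_factor=0.5,
                  status_forcelist=(429, 500, 502, 503, 504))
    session.mount("https://", HTTPAdapter(max_retries=retry))
    session.mount("http://", HTTPAdapter(max_retries=retry))
    if user_agent:
        session.headers["User-Agent"] = user_agent

    resp = session.get(url, timeout=timeout)
    resp.raise_for_status()
    content_type = resp.headers.get("Content-Type", "")
    if "html" not in content_type:
        raise RuntimeError(f"Expected HTML, got Content-Type: {content_type}")

    soup = BeautifulSoup(resp.text, "html.parser")
    # <base href> (possibly itself relative) overrides the final page URL
    # as the base for resolving relative links.
    base_tag = soup.find("base", href=True)
    base = urljoin(resp.url, base_tag["href"]) if base_tag else resp.url

    links: set[str] = set()
    for anchor in soup.find_all("a", href=True):
        absolute = urljoin(base, anchor["href"])
        absolute, _fragment = urldefrag(absolute)  # drop #fragments
        if urlparse(absolute).scheme in ("http", "https"):  # skips mailto:, tel:, ...
            links.add(absolute)
    return links
```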
Requirements:

- Python 3.10+
- Packages: `requests`, `beautifulsoup4`
Install:

```bash
pip install -U requests beautifulsoup4
```

Tip: Use a virtual environment (`python -m venv .venv && source .venv/bin/activate` on macOS/Linux, or `.venv\Scripts\Activate.ps1` on Windows).
Assuming the file is named `link_extractor.py` (replace with your filename if different):
```bash
# Print links to stdout
python link_extractor.py https://example.com

# Save to a file
python link_extractor.py https://example.com -o links.txt

# Sort output alphabetically
python link_extractor.py https://example.com -o links.txt --sort

# Increase/adjust timeout (seconds)
python link_extractor.py https://example.com --timeout 15

# Use a custom User-Agent (may help avoid 403s)
python link_extractor.py https://example.com --user-agent "MyCrawler/1.0 ([email protected])"
```
Options:

```text
positional arguments:
  url                  The page URL to scrape (e.g., https://example.com)

optional arguments:
  -o, --output PATH    Write links to file (default: stdout)
  --timeout SECONDS    Request timeout (default: 10)
  --user-agent STRING  Custom User-Agent header
  --sort               Sort links before output
```
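These flags map naturally onto `argparse`; a minimal sketch that would reproduce the interface above (the parser wiring is an assumption, with defaults mirroring the help text):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Flag names and defaults mirror the help text above.
    parser = argparse.ArgumentParser(
        description="Extract unique, absolute http(s) links from a single webpage."
    )
    parser.add_argument("url",
                        help="The page URL to scrape (e.g., https://example.com)")
    parser.add_argument("-o", "--output", metavar="PATH",
                        help="Write links to file (default: stdout)")
    parser.add_argument("--timeout", type=float, default=10.0, metavar="SECONDS",
                        help="Request timeout (default: 10)")
    parser.add_argument("--user-agent", metavar="STRING",
                        help="Custom User-Agent header")
    parser.add_argument("--sort", action="store_true",
                        help="Sort links before output")
    return parser
```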
Output format:

- One URL per line
- Only `http`/`https` links
- Fragments removed (e.g., `https://site/page#section` → `https://site/page`)
- Unique links (set semantics)
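The fragment-stripping behavior matches what `urllib.parse.urldefrag` from the standard library provides; a quick illustration:

```python
from urllib.parse import urldefrag

url, fragment = urldefrag("https://site/page#section")
print(url)       # https://site/page
print(fragment)  # section
```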
Exit codes:

- `0` – Success
- `1` – Runtime error (network/HTTP, non-HTML content, write failure)
- `2` – Invalid input (e.g., URL doesn’t start with `http://` or `https://`)
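One plausible way to map failures to these codes, reusing the hypothetical `build_parser` and `extract_links` helpers from the sketches above:

```python
import sys
import requests

def main() -> int:
    # build_parser() and extract_links() come from the sketches above.
    args = build_parser().parse_args()
    if not args.url.startswith(("http://", "https://")):
        print("error: URL must start with http:// or https://", file=sys.stderr)
        return 2  # invalid input
    try:
        links = extract_links(args.url, timeout=args.timeout,
                              user_agent=args.user_agent)
        out = sorted(links) if args.sort else list(links)
        text = "\n".join(out) + "\n" if out else ""
        if args.output:
            with open(args.output, "w", encoding="utf-8") as fh:
                fh.write(text)
        else:
            sys.stdout.write(text)
    except (requests.RequestException, RuntimeError, OSError) as exc:
        print(f"error: {exc}", file=sys.stderr)
        return 1  # runtime error (network/HTTP, non-HTML content, write failure)
    return 0

if __name__ == "__main__":
    sys.exit(main())
```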
More examples:

```bash
# Extract and pipe to grep
python link_extractor.py https://example.com | grep "/blog/"

# Save, sort, and then count
python link_extractor.py https://example.com -o links.txt --sort && wc -l links.txt
```
Notes:

- Be polite: respect the site’s terms of use, `robots.txt`, and rate limits.
- This tool processes a single page. It does not crawl multiple pages.
- Some sites build links dynamically with JavaScript; this script parses server-returned HTML only.
- If you encounter `403 Forbidden` or similar, try a realistic `--user-agent`. Some sites block unknown clients.
- If the server responds with non-HTML content, the script will exit with an error indicating the `Content-Type`.
Troubleshooting:

- 403/429 errors: set a custom `--user-agent`; try again later (server throttling); ensure you’re allowed to fetch the page.
- Timeouts: raise `--timeout` or check your network connection.
- SSL/certificates: ensure your system certificates are up to date; avoid disabling verification.
- Empty output: the page may not contain anchor tags with usable `href` values, or links may be built via client-side JS.
Planned features:

- `--same-domain-only` filtering (see the sketch below)
- JSON/CSV output formats
- Simple crawler mode with politeness (robots, delay, concurrency limits)
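As a sketch of the first item, `--same-domain-only` could reduce to a hostname comparison via `urlparse` (the flag semantics and helper name are hypothetical):

```python
from urllib.parse import urlparse

def same_domain(link: str, page_url: str) -> bool:
    # Hypothetical helper for a future --same-domain-only flag.
    # urlparse(...).hostname is already lowercased; subdomains count as different hosts.
    return urlparse(link).hostname == urlparse(page_url).hostname

# Usage: keep only links on the same host as the fetched page
# links = {u for u in links if same_domain(u, args.url)}
```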
License: MIT