Pyrolysate is a Python library and CLI tool for parsing and validating URLs and email addresses. It breaks down URLs and emails into their component parts, validates against IANA's official TLD list, and outputs structured data in JSON, CSV, or text format.
The library offers both a programmer-friendly API and a command-line interface, making it suitable for development integration as well as quick data-processing tasks. It handles single entries or large datasets efficiently using Python generators, and provides flexible input and output options, including file processing with custom delimiters.
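For a quick feel for the library, the short sketch below parses one URL and one email address and prints the results as JSON. It only uses the url and email helpers documented later in this README; the email address is an illustrative placeholder, and to_json is assumed to return the JSON text as a string, as the examples further down suggest:
from pyrolysate import url, email
# Break a URL into its components and print them as prettified JSON
print(url.to_json("https://www.example.com:443/blog/post?q=test#section1"))
# Do the same for an email address
print(email.to_json("user+tag@example.com"))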
- Extract scheme, subdomain, domain, TLD, port, path, query, and fragment components
- Support for complex URL patterns including ports, queries, and fragments
- Support for IP addresses in URLs
- Support for both direct input and file processing via CLI or API
- Output as JSON, CSV, or text format through CLI or API
- Extract username, mail server, and domain components
- Support for plus addressing (e.g., user+tag@example.com)
- Support for both direct input and file processing via CLI or API
- Output as JSON, CSV, or text format through CLI or API
- Automatic updates from IANA's official TLD list
- Local TLD file caching for offline use
- Fallback to common TLDs if both online and local sources fail (see the TLD helper sketch after the URL parsing examples below)
- Process single or multiple entries
- Support for government domain emails (.gov.tld)
- Custom delimiters for file input
- Multiple output formats (JSON, CSV, text), with plain text as the default
- Pretty-printed or minified JSON output
- Console output or file saving options
- Memory-efficient processing of large datasets using Python generators
- Support for compressed input files:
  - ZIP archives (processes all text files within .zip)
  - GZIP (.gz)
  - BZIP2 (.bz2)
  - LZMA (.xz, .lzma)
- Type hints for better IDE support
- Comprehensive docstrings
- Modular design for easy integration
- Command-line interface for quick testing
pip install pyrolysate
- Clone the repository
git clone https://github.com/dawnandrew100/pyrolysate.git
cd pyrolysate
- Create and activate a virtual environment
# Using hatch (recommended)
hatch env create
# Or using venv
python -m venv .venv
# Windows
.venv\Scripts\activate
# Unix/MacOS
source .venv/bin/activate
- Install in development mode
# Using hatch
hatch run dev
# Or using pip
pip install -e .
# Using hatch (recommended)
hatch run pyro -u example.com
# Or using the CLI directly
pyro -u example.com
The CLI command pyro will be available after installation. If the command isn't found, ensure Python's Scripts directory is in your PATH.
from pyrolysate import parse_input_file
urls = parse_input_file("urls.txt")
emails = parse_input_file("emails.csv", delimiter=",")
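The resulting list can be handed straight to the array parsers; a minimal sketch, with illustrative file names:
from pyrolysate import parse_input_file, url
# Read newline-delimited URLs from a file into a Python list
urls = parse_input_file("urls.txt")
# Parse them all in one call...
parsed = url.parse_url_array(urls)
# ...or write the structured result straight to a JSON file
url.to_json_file("parsed_urls", urls)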
- JSON (prettified or minified)
- CSV
- Text (default)
- File output with custom naming
- Console output
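Both the url and email parsers accept the same output options; for example, the documented prettify flag switches between pretty-printed and minified JSON (a short sketch):
from pyrolysate import url
# Minified JSON string instead of the default pretty-printed output
compact = url.to_json(["example.com", "test.org"], prettify=False)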
from pyrolysate import email
# Parse a single email address
result = email.parse_email("user@example.com")
# Plus addressing is supported
result = email.parse_email("user+tag@example.com")
# Parse multiple email addresses at once
emails = ["user1@example.com", "user2@example.org"]
result = email.parse_email_array(emails)
# Convert to JSON (returned as a string)
json_output = email.to_json("user@example.com")
json_output = email.to_json(["user1@example.com", "user2@example.org"])
# Write JSON straight to a file
email.to_json_file("output", "user@example.com")
email.to_json_file("output", ["user1@example.com", "user2@example.org"])
# Convert to CSV (returned as a string)
csv_output = email.to_csv("user@example.com")
csv_output = email.to_csv(["user1@example.com", "user2@example.org"])
# Write CSV straight to a file
email.to_csv_file("output", "user@example.com")
email.to_csv_file("output", ["user1@example.com", "user2@example.org"])
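Assuming parse_email returns a dictionary keyed by the input address, mirroring the JSON output structure shown later in this README (an assumption rather than documented behaviour), individual components can be read directly:
from pyrolysate import email
result = email.parse_email("user+tag@example.com")
# Assumption: parsed fields sit under the original address, as in the JSON examples below
parts = result["user+tag@example.com"]
print(parts["username"])      # "user"
print(parts["plus_address"])  # "tag"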
from pyrolysate import url
# Parse a single URL
result = url.parse_url("https://www.example.com/path?q=test#fragment")
# Parse multiple URLs at once
urls = ["example.com", "https://www.test.org"]
result = url.parse_url_array(urls)
# Convert to JSON (returned as a string)
json_output = url.to_json("example.com")
json_output = url.to_json(["example.com", "test.org"])
# Write JSON straight to a file
url.to_json_file("output", "example.com")
url.to_json_file("output", ["example.com", "test.org"])
# Convert to CSV (returned as a string)
csv_output = url.to_csv("example.com")
csv_output = url.to_csv(["example.com", "test.org"])
# Write CSV straight to a file
url.to_csv_file("output", "example.com")
url.to_csv_file("output", ["example.com", "test.org"])
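The TLD helpers from the API reference below can be combined with the parsers. A sketch, assuming both helpers are exposed on the url object as the reference tables suggest, and that get_tld() returns the TLD list in a form accepted by the tlds parameter:
from pyrolysate import url
# Cache IANA's TLD list locally for offline use (file name is illustrative)
url.local_tld_file("tld")
# Fetch the current TLD list (per the features above, local and built-in fallbacks apply)
tlds = url.get_tld("tld.txt")
# Pass the list explicitly when parsing
result = url.parse_url_array(["example.com", "agency.gov.uk"], tlds=tlds)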
# Show CLI help
pyro -h
# Parse a single URL
pyro -u example.com
# Parse multiple URLs
pyro -u example1.com example2.com
# Parse URLs from a file
pyro -u -i urls.txt
# Parse URLs from a comma-delimited file
pyro -u -i urls.csv -d ","
# Parse a single email address
pyro -e user@example.com
# Parse multiple email addresses and save the output as JSON
pyro -e user1@example.com user2@example.org -j -o output
# Parse URLs from a file and save the output as CSV
pyro -u -i urls.txt -c -o parsed_urls
# Parse comma-delimited email addresses from a file and save the output
pyro -e -i emails.txt -d "," -o output
# Parse a single email address and output minified JSON
pyro -e user@example.com -j -np
# Parse log file
pyro -u -i server.log
# Parse compressed log file
pyro -u -i server.log.gz
# Parse BZIP2 compressed file
pyro -e -i emails.txt.bz2
# Parse ZIP archive containing logs and text files
pyro -u -i archive.zip
Methods on the email parser:
Method | Parameters | Description |
---|---|---|
parse_email(email_str) | email_str: str | Parses a single email address |
parse_email_array(emails) | emails: list[str] | Parses a list of email addresses |
to_json(emails, prettify=True) | emails: str \| list[str], prettify: bool | Converts to JSON format |
to_json_file(file_name, emails, prettify=True) | file_name: str, emails: list[str], prettify: bool | Converts and saves JSON to a file |
to_csv(emails) | emails: str \| list[str] | Converts to CSV format |
to_csv_file(file_name, emails) | file_name: str, emails: list[str] | Converts and saves CSV to a file |
Methods on the url parser:
Method | Parameters | Description |
---|---|---|
parse_url(url_str, tlds=[]) | url_str: str, tlds: list[str] | Parses a single URL |
parse_url_array(urls, tlds=[]) | urls: list[str], tlds: list[str] | Parses a list of URLs |
to_json(urls, prettify=True) | urls: str \| list[str], prettify: bool | Converts to JSON format |
to_json_file(file_name, urls, prettify=True) | file_name: str, urls: list[str], prettify: bool | Converts and saves JSON to a file |
to_csv(urls) | urls: str \| list[str] | Converts to CSV format |
to_csv_file(file_name, urls) | file_name: str, urls: list[str] | Converts and saves CSV to a file |
get_tld(path_to_tlds_file='tld.txt') | path_to_tlds_file: str = 'tld.txt' | Fetches the current TLD list from IANA |
local_tld_file(file_name) | file_name: str | Fetches and stores get_tld() output as a local text file |
General helper functions:
Method | Parameters | Description |
---|---|---|
parse_input_file(input_file_name, delimiter='\n') | input_file_name: str, delimiter: str | Parses an input file into a Python list, split by the given delimiter |
Argument | Type | Value when argument is omitted | Description |
---|---|---|---|
target | str | None | Email or URL string(s) to process |
-u, --url | flag | False | Specify URL input |
-e, --email | flag | False | Specify email input |
-i, --input_file | str | None | Input file name with extension |
-o, --output_file | str | None | Output file name without extension |
-c, --csv | flag | False | Save output in CSV format |
-j, --json | flag | False | Save output in JSON format |
-np, --no_prettify | flag | True | Turn off prettified JSON output |
-d, --delimiter | str | '\n' | Delimiter for input file parsing |
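As an illustration of how these flags combine (file names are placeholders), the following reads comma-delimited URLs from a file and writes minified JSON to an output file named parsed_urls:
pyro -u -i urls.csv -d "," -j -np -o parsed_urls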
Format | Extension | Description |
---|---|---|
Text | .txt | Plain text files |
Log | .log | Plain text log files |
CSV | .csv | Comma-separated values |
ZIP | .zip | Archives containing text files |
GZIP | .gz | GZIP compressed files |
BZIP2 | .bz2 | BZIP2 compressed files |
LZMA | .xz, .lzma | LZMA compressed files |
Field | Description | Example |
---|---|---|
input | Full email | user+tag@gmail.com
username | Part before + or @ symbol | user |
plus_address | Optional part between + and @ | tag |
mail_server | Domain before TLD | gmail |
domain | Top-level domain | com |
Example output:
{"[email protected]":
{
"username": "user",
"plus_address": "tag",
"mail_server": "gmail",
"domain": "com"
}
}
email,username,plus_address,mail_server,domain
user+tag@gmail.com,user,tag,gmail,com
Field | Description | Example |
---|---|---|
scheme | Protocol | https |
subdomain | Domain prefix | www |
second_level_domain | Main domain | example |
top_level_domain | Domain suffix | com |
port | Port number | 443 |
path | URL path | blog/post |
query | Query parameters | q=test |
fragment | URL fragment | section1 |
Example output:
{"https://www.example.com:443/blog/post?q=test#section1":
{
"scheme": "https",
"subdomain": "www",
"second_level_domain": "example",
"top_level_domain": "com",
"port": "443",
"path": "blog/post",
"query": "q=test",
"fragment": "section1"
}
}
url,scheme,subdomain,second_level_domain,top_level_domain,port,path,query,fragment
https://www.example.com:443/blog/post?q=test#section1,https,www,example,com,443,blog/post,q=test,section1
- Standard: user@example.com
- Plus addresses: user+tag@example.com
- Government: user@agency.gov.uk
- Basic: example.com
- With subdomain: www.example.com
- With scheme: https://example.com
- With path: example.com/path/to/file.txt
- With port: example.com:8080
- With query: example.com/search?q=test
- With fragment: example.com#section1
- IP addresses: 192.168.1.1:8080
- Government domains: agency.gov.uk
- Full complex URLs: https://www.example.gov.uk:8080/path?q=test#section1
- Plain text files (.txt)
- Plain text log files (.log)
- Comma-separated values (.csv)
- ZIP archives containing text files (.zip)
- GZIP compressed files (.gz)
- BZIP2 compressed files (.bz2)
- LZMA compressed files (.xz, .lzma)
- For ZIP archives, processes all text files within the archive (.txt, .csv, .log)
- Handles nested directories
- Continues processing if some files are corrupted
- UTF-8 encoding expected for text files
- Text file (default)
- JSON file (prettified or minified)
- CSV file
- Console output