Pyrolysate

Pyrolysate is a Python library and CLI tool for parsing and validating URLs and email addresses. It breaks down URLs and emails into their component parts, validates against IANA's official TLD list, and outputs structured data in JSON, CSV, or text format.

The library offers both a programmer-friendly API and a command-line interface, so it works for integration into other projects as well as for quick data processing tasks. It handles single entries as well as large datasets, using Python generators to keep memory use low, and supports flexible input and output, including file processing with custom delimiters.

Features

URL Parsing

Extract scheme, subdomain, domain, TLD, port, path, query, and fragment components
Support for complex URL patterns including ports, queries, and fragments
Support for IP addresses in URLs
Support for both direct input and file processing via CLI or API
Output as JSON, CSV, or text format through CLI or API

Email Parsing

Extract username, mail server, and domain components
Support for plus addressing (e.g., [email protected])
Support for both direct input and file processing via CLI or API
Output as JSON, CSV, or text format through CLI or API

Top Level Domain Validation

Automatic updates from IANA's official TLD list
Local TLD file caching for offline use
Fallback to common TLDs if both online and local sources fail
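
The three-tier lookup listed above (online list, local cache, built-in fallback) could be sketched roughly as follows. This is an illustration only, not Pyrolysate's actual implementation: `load_tlds`, `fetch_iana_tlds`, `parse_tld_text`, and the contents of `FALLBACK_TLDS` are hypothetical names and values, and the cache file name simply reuses the `tld.txt` default shown in the API reference.

```python
from pathlib import Path
from typing import Callable
from urllib.request import urlopen

# Public location of IANA's TLD list (one TLD per line, '#' comment header).
IANA_TLD_URL = "https://data.iana.org/TLD/tlds-alpha-by-domain.txt"

# Hypothetical last-resort list; the library's real fallback set may differ.
FALLBACK_TLDS = ["com", "org", "net", "edu", "gov"]


def parse_tld_text(text: str) -> list[str]:
    """Lower-case each TLD, skipping blank lines and '#' comment lines."""
    tlds = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            tlds.append(line.lower())
    return tlds


def fetch_iana_tlds(timeout: float = 5.0) -> list[str]:
    """Download the current list from IANA (raises OSError when offline)."""
    with urlopen(IANA_TLD_URL, timeout=timeout) as response:
        return parse_tld_text(response.read().decode("utf-8"))


def load_tlds(
    cache_path: str = "tld.txt",
    fetch: Callable[[], list[str]] = fetch_iana_tlds,
) -> list[str]:
    """Try the online list, then the local cache, then the built-in fallback."""
    try:
        tlds = fetch()
        if tlds:
            return tlds
    except OSError:
        pass  # network unavailable; fall through to the local cache
    cache = Path(cache_path)
    if cache.is_file():
        cached = parse_tld_text(cache.read_text(encoding="utf-8"))
        if cached:
            return cached
    return FALLBACK_TLDS
```

Passing the fetcher as a parameter keeps the fallback order easy to exercise without network access.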

Flexible Input/Output

Process single or multiple entries
Support for government domain emails (.gov.tld)
Custom delimiters for file input
Multiple output formats (JSON, CSV, text), with text (.txt) as the default
Pretty-printed or minified JSON output
Console output or file saving options
Memory-efficient processing of large datasets using Python generators
Support for compressed input files:
- ZIP archives (processes all text files within .zip)
- GZIP (.gz)
- BZIP2 (.bz2)
- LZMA (.xz, .lzma)
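
As a rough illustration of how generator-based reading of compressed input can work, the sketch below picks a standard-library opener by file extension and yields one entry at a time. It is not taken from Pyrolysate's source: `iter_lines` and `OPENERS` are hypothetical names, and ZIP archives are left out here because they need per-member handling.

```python
import bz2
import gzip
import lzma
from pathlib import Path
from typing import Iterator

# Map file suffixes to the stdlib opener that can decompress them.
OPENERS = {
    ".gz": gzip.open,
    ".bz2": bz2.open,
    ".xz": lzma.open,
    ".lzma": lzma.open,
}


def iter_lines(path: str, encoding: str = "utf-8") -> Iterator[str]:
    """Yield stripped, non-empty lines lazily, decompressing when needed."""
    # Unknown suffixes (.txt, .log, .csv, ...) fall back to the built-in open.
    opener = OPENERS.get(Path(path).suffix.lower(), open)
    # Text mode ("rt") makes the compressed openers decode bytes to str.
    with opener(path, "rt", encoding=encoding) as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield line
```

Because the function is a generator, a caller can loop over a large log file without holding all of its lines in memory at once.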

Developer Friendly

Type hints for better IDE support
Comprehensive docstrings
Modular design for easy integration
Command-line interface for quick testing

🚀 Installation

From PyPI

pip install pyrolysate

For Development

Clone the repository

git clone https://github.com/dawnandrew100/pyrolysate.git
cd pyrolysate

Create and activate a virtual environment

# Using hatch (recommended)
hatch env create

# Or using venv
python -m venv .venv
# Windows
.venv\Scripts\activate
# Unix/MacOS
source .venv/bin/activate

Install in development mode

# Using hatch
hatch run dev

# Or using pip
pip install -e .

Verify Installation

# Using hatch (recommended)
hatch run pyro -u example.com

# Or using the CLI directly
pyro -u example.com

The CLI command pyro will be available after installation. If the command isn't found, ensure Python's Scripts directory is in your PATH.

Usage

Input File Parsing

from pyrolysate import parse_input_file

Parse file with default newline delimiter

urls = parse_input_file("urls.txt")

Parse file with custom delimiter

emails = parse_input_file("emails.csv", delimiter=",")
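
To make the delimiter behaviour concrete, here is a small stand-alone sketch of splitting file contents into entries. It is an assumption about the general approach, not the library's code: `split_entries` and `read_entries` are hypothetical helpers, and trimming whitespace and dropping empty entries are choices made for this example.

```python
from typing import Iterator


def split_entries(text: str, delimiter: str = "\n") -> Iterator[str]:
    """Yield trimmed, non-empty entries from text separated by delimiter."""
    for chunk in text.split(delimiter):
        entry = chunk.strip()
        if entry:
            yield entry


def read_entries(path: str, delimiter: str = "\n") -> list[str]:
    """Read a UTF-8 text file and return its entries as a list."""
    with open(path, encoding="utf-8") as handle:
        return list(split_entries(handle.read(), delimiter))
```

With a comma delimiter, `split_entries("a.com, b.org ,\n", ",")` yields `a.com` and `b.org`; the trailing empty chunk is discarded.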

Supported Outputs

JSON (prettified or minified)
CSV
Text (default)
File output with custom naming
Console output

Email Parsing

from pyrolysate import email

Parse single email

result = email.parse_email("[email protected]")

Parse plus addressed email

result = email.parse_email("[email protected]")

Parse multiple emails

emails = ["[email protected]", "[email protected]"]
result = email.parse_email_array(emails)

Convert to JSON

json_output = email.to_json("[email protected]")
json_output = email.to_json(["[email protected]", "[email protected]"])

Save to JSON file

email.to_json_file("output", "[email protected]")
email.to_json_file("output", ["[email protected]", "[email protected]"])

Convert to CSV

csv_output = email.to_csv("[email protected]")
csv_output = email.to_csv(["[email protected]", "[email protected]"])

Save to CSV file

email.to_csv_file("output", "[email protected]")
email.to_csv_file("output", ["[email protected]", "[email protected]"])

URL Parsing

from pyrolysate import url

Parse single URL

result = url.parse_url("https://www.example.com/path?q=test#fragment")

Parse multiple URLs

urls = ["example.com", "https://www.test.org"]
result = url.parse_url_array(urls)

Convert to JSON

json_output = url.to_json("example.com")
json_output = url.to_json(["example.com", "test.org"])

Save to JSON file

url.to_json_file("output", "example.com")
url.to_json_file("output", ["example.com", "test.org"])

Convert to CSV

csv_output = url.to_csv("example.com")
csv_output = url.to_csv(["example.com", "test.org"])

Save to CSV file

url.to_csv_file("output", "example.com")
url.to_csv_file("output", ["example.com", "test.org"])

Command Line Interface

CLI help

pyro -h

Parse single URL

pyro -u example.com

Parse multiple URLs

pyro -u example1.com example2.com

Parse URLs from file (one per line by default)

pyro -u -i urls.txt

Parse URLs from CSV file with comma delimiter

pyro -u -i urls.csv -d ","

Parse email with plus addressing

pyro -e [email protected]

Parse multiple emails and save as JSON

pyro -e [email protected] [email protected] -j -o output

Parse URLs from file and save as CSV

pyro -u -i urls.txt -c -o parsed_urls

Parse emails from file with comma delimiter

pyro -e -i emails.txt -d "," -o output

Parse emails with non-prettified JSON output

pyro -e [email protected] -j -np

Parse different file types

# Parse log file
pyro -u -i server.log

# Parse compressed log file
pyro -u -i server.log.gz

# Parse BZIP2 compressed file
pyro -e -i emails.txt.bz2

# Parse ZIP archive containing logs and text files
pyro -u -i archive.zip

API Reference

Email Class

Method	Parameters	Description
`parse_email(email_str)`	`email_str: str`	Parses single email address
`parse_email_array(emails)`	`emails: list[str]`	Parses list of email addresses
`to_json(emails, prettify=True)`	`emails: str\|list[str]`, `prettify: bool`	Converts to JSON format
`to_json_file(file_name, emails, prettify=True)`	`file_name: str`, `emails: str\|list[str]`, `prettify: bool`	Converts and saves JSON to file
`to_csv(emails)`	`emails: str\|list[str]`	Converts to CSV format
`to_csv_file(file_name, emails)`	`file_name: str`, `emails: str\|list[str]`	Converts and saves CSV to file

URL Class

Method	Parameters	Description
`parse_url(url_str, tlds=[])`	`url_str: str`, `tlds: list[str]`	Parses single URL
`parse_url_array(urls, tlds=[])`	`urls: list[str]`, `tlds: list[str]`	Parses list of URLs
`to_json(urls, prettify=True)`	`urls: str\|list[str]`, `prettify: bool`	Converts to JSON format
`to_json_file(file_name, urls, prettify=True)`	`file_name: str`, `urls: str\|list[str]`, `prettify: bool`	Converts and saves JSON to file
`to_csv(urls)`	`urls: str\|list[str]`	Converts to CSV format
`to_csv_file(file_name, urls)`	`file_name: str`, `urls: str\|list[str]`	Converts and saves CSV to file
`get_tld(path_to_tlds_file='tld.txt')`	`path_to_tlds_file: str = 'tld.txt'`	Fetches current TLD list from IANA
`local_tld_file(file_name)`	`file_name: str`	Fetches and stores `get_tld()` output as a local txt file

Miscellaneous

Method	Parameters	Description
`parse_input_file(input_file_name, delimiter='\n')`	`input_file_name: str`, `delimiter: str`	Parses input file into a Python list, split by delimiter

CLI Reference

Argument	Type	Value when argument is omitted	Description
`target`	`str`	`None`	Email or URL string(s) to process
`-u`, `--url`	`flag`	`False`	Specify URL input
`-e`, `--email`	`flag`	`False`	Specify email input
`-i`, `--input_file`	`str`	`None`	Input file name with extension
`-o`, `--output_file`	`str`	`None`	Output file name without extension
`-c`, `--csv`	`flag`	`False`	Save output in CSV format
`-j`, `--json`	`flag`	`False`	Save output in JSON format
`-np`, `--no_prettify`	`flag`	`True`	Turn off prettified JSON output
`-d`, `--delimiter`	`str`	`'\n'`	Delimiter for input file parsing
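
For readers who want to see how the arguments in this table fit together, the following `argparse` sketch mirrors them. It is a reconstruction from the table only, not the actual `pyro` entry point; `build_parser` is a hypothetical name, and modelling `--no_prettify` with `store_false` is an assumption chosen so that the stored value is `True` when the flag is omitted, as the table states.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the CLI reference table (sketch only)."""
    parser = argparse.ArgumentParser(prog="pyro")
    parser.add_argument("target", nargs="*",
                        help="Email or URL string(s) to process")
    parser.add_argument("-u", "--url", action="store_true",
                        help="Specify URL input")
    parser.add_argument("-e", "--email", action="store_true",
                        help="Specify email input")
    parser.add_argument("-i", "--input_file",
                        help="Input file name with extension")
    parser.add_argument("-o", "--output_file",
                        help="Output file name without extension")
    parser.add_argument("-c", "--csv", action="store_true",
                        help="Save output in CSV format")
    parser.add_argument("-j", "--json", action="store_true",
                        help="Save output in JSON format")
    # Assumption: value is True when omitted and becomes False when passed.
    parser.add_argument("-np", "--no_prettify", action="store_false",
                        help="Turn off prettified JSON output")
    parser.add_argument("-d", "--delimiter", default="\n",
                        help="Delimiter for input file parsing")
    return parser


if __name__ == "__main__":
    # Example: parse a command line equivalent to `pyro -u -i urls.csv -d ","`.
    print(build_parser().parse_args(["-u", "-i", "urls.csv", "-d", ","]))
```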

Input File Support

Format	Extension	Description
Text	.txt	Plain text files
Log	.log	Plain text log files
CSV	.csv	Comma-separated values
ZIP	.zip	Archives containing text files
GZIP	.gz	GZIP compressed files
BZIP2	.bz2	BZIP2 compressed files
LZMA	.xz, .lzma	LZMA compressed files

Output Formats

Email Parse Output

Field	Description	Example
input	Full email	user+tag@gmail.com
username	Part before + or @ symbol	user
plus_address	Optional part between + and @	tag
mail_server	Domain before TLD	gmail
domain	Top-level domain	com

Example output:

{"user+tag@gmail.com":
    {
    "username": "user",
    "plus_address": "tag",
    "mail_server": "gmail",
    "domain": "com"
    }
}

email,username,plus_address,mail_server,domain
user+tag@gmail.com,user,tag,gmail,com
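
The field breakdown above can be reproduced with a few string operations. The sketch below is a simplified illustration and not the library's parser: `split_email` is a hypothetical function, it treats the last dot-separated label as the TLD, and it does not consult a TLD list, so multi-label suffixes such as `.gov.uk` are not handled.

```python
def split_email(address: str) -> dict[str, dict[str, str]]:
    """Split an address into username, plus_address, mail_server, domain."""
    local, sep, host = address.rpartition("@")
    if not sep or not local or "." not in host:
        raise ValueError(f"not a parseable email address: {address!r}")
    # Everything after the first '+' in the local part is the plus address.
    username, _, plus_address = local.partition("+")
    # Simplification: the final label is taken as the top-level domain.
    mail_server, _, domain = host.rpartition(".")
    return {
        address: {
            "username": username,
            "plus_address": plus_address,
            "mail_server": mail_server,
            "domain": domain,
        }
    }
```

For an address such as `user+tag@gmail.com`, this returns the same field values as the example table: `user`, `tag`, `gmail`, and `com`.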

URL Parse Output

Field	Description	Example
scheme	Protocol	https
subdomain	Domain prefix	www
second_level_domain	Main domain	example
top_level_domain	Domain suffix	com
port	Port number	443
path	URL path	blog/post
query	Query parameters	q=test
fragment	URL fragment	section1

Example output:

{"https://www.example.com:443/blog/post?q=test#section1": 
    {
    "scheme": "https",
    "subdomain": "www",
    "second_level_domain": "example",
    "top_level_domain": "com",
    "port": "443",
    "path": "blog/post",
    "query": "q=test",
    "fragment": "section1"
    }
}

url,scheme,subdomain,second_level_domain,top_level_domain,port,path,query,fragment
https://www.example.com:443/blog/post?q=test#section1,https,www,example,com,443,blog/post,q=test,section1
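
The JSON and CSV layouts shown above can be produced from a parsed mapping with the standard `json` and `csv` modules. The helpers below are a sketch under assumptions, not Pyrolysate's `to_json` or `to_csv`: `to_json_text`, `to_csv_text`, and `URL_FIELDS` are hypothetical names, and the four-space indent and compact separators are guesses at what "prettified" and "minified" mean here.

```python
import csv
import io
import json

# Column order taken from the CSV header in the example output.
URL_FIELDS = [
    "scheme", "subdomain", "second_level_domain", "top_level_domain",
    "port", "path", "query", "fragment",
]


def to_json_text(parsed: dict, prettify: bool = True) -> str:
    """Serialize a parsed mapping as indented or compact JSON."""
    if prettify:
        return json.dumps(parsed, indent=4)
    return json.dumps(parsed, separators=(",", ":"))


def to_csv_text(parsed: dict, key_header: str = "url") -> str:
    """Serialize a parsed mapping as CSV, one row per input string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([key_header, *URL_FIELDS])
    for key, fields in parsed.items():
        writer.writerow([key, *(fields.get(name, "") for name in URL_FIELDS)])
    return buffer.getvalue()
```

Writing to an in-memory buffer first keeps the same function usable for both console output and file saving.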

Supported Formats

Email Formats

Standard: [email protected]
Plus Addresses: [email protected]
Government: [email protected]

URL Formats

Basic: example.com
With subdomain: www.example.com
With scheme: https://example.com
With path: example.com/path/to/file.txt
With port: example.com:8080
With query: example.com/search?q=test
With fragment: example.com#section1
IP addresses: 192.168.1.1:8080
Government domains: agency.gov.uk
Full complex URLs: https://www.example.gov.uk:8080/path?q=test#section1
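
To show how several of these formats can be broken into the documented fields, here is a sketch built on `urllib.parse.urlsplit`. It is not the library's algorithm: `split_url` is a hypothetical function, empty strings for missing components are an assumption, and the naive "last label is the TLD" rule means IP addresses and multi-label suffixes such as `gov.uk` are not handled (the real tool validates against IANA's TLD list).

```python
from urllib.parse import urlsplit


def split_url(url: str) -> dict[str, dict[str, str]]:
    """Split a URL into the fields used in the URL parse output table."""
    has_scheme = "://" in url
    # urlsplit only recognises the host when a scheme or leading "//" exists.
    parts = urlsplit(url if has_scheme else "//" + url)
    labels = (parts.hostname or "").split(".")
    if len(labels) < 2:
        raise ValueError(f"no domain found in {url!r}")
    # Simplification: last label is the TLD, the one before it the domain.
    *subdomain, second_level, top_level = labels
    return {
        url: {
            "scheme": parts.scheme if has_scheme else "",
            "subdomain": ".".join(subdomain),
            "second_level_domain": second_level,
            "top_level_domain": top_level,
            "port": str(parts.port) if parts.port is not None else "",
            "path": parts.path.lstrip("/"),
            "query": parts.query,
            "fragment": parts.fragment,
        }
    }
```

Prefixing scheme-less input with `//` is what lets inputs like `example.com:8080` be read as host and port rather than as a scheme and path.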

Input File Support

Plain text files (.txt)
Plain text log files (.log)
Comma-separated values (.csv)
ZIP archives containing text files (.zip)
GZIP compressed files (.gz)
BZIP2 compressed files (.bz2)
LZMA compressed files (.xz, .lzma)

ZIP Archive Support

Processes all text files within the archive (.txt, .csv, .log)
Handles nested directories
Continues processing if some files are corrupted
UTF-8 encoding expected for text files
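
The ZIP behaviour described above (text members only, nested directories, skipping unreadable files, UTF-8 decoding) might look roughly like this with the standard `zipfile` module. This is a sketch under those stated assumptions rather than Pyrolysate's code; `iter_zip_lines` and `TEXT_SUFFIXES` are hypothetical names.

```python
import zipfile
import zlib
from pathlib import PurePosixPath
from typing import Iterator

# Extensions treated as text, as listed above.
TEXT_SUFFIXES = {".txt", ".csv", ".log"}


def iter_zip_lines(archive_path: str) -> Iterator[str]:
    """Yield non-empty lines from every text member of a ZIP archive."""
    with zipfile.ZipFile(archive_path) as archive:
        # infolist() includes members in nested directories.
        for info in archive.infolist():
            if info.is_dir():
                continue
            suffix = PurePosixPath(info.filename).suffix.lower()
            if suffix not in TEXT_SUFFIXES:
                continue
            try:
                text = archive.read(info).decode("utf-8")
            except (zipfile.BadZipFile, zlib.error, UnicodeDecodeError, OSError):
                continue  # skip a corrupted or non-UTF-8 member, keep going
            for line in text.splitlines():
                line = line.strip()
                if line:
                    yield line
```

Catching errors per member, rather than around the whole loop, is what allows processing to continue when a single file in the archive is damaged.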

Outputs

Text file (default)
JSON file (prettified or minified)
CSV file
Console output
