lignum-vitae/pyrolysate

Pyrolysate

Pyrolysate is a Python library and CLI tool for parsing and validating URLs and email addresses. It breaks down URLs and emails into their component parts, validates against IANA's official TLD list, and outputs structured data in JSON, CSV, or text format.

The library offers both a Python API and a command-line interface, so it suits both integration into larger programs and quick one-off data processing. It handles single entries as well as large datasets, which it processes lazily with Python generators, and it accepts file input with custom delimiters.

Features

URL Parsing

  • Extract scheme, subdomain, domain, TLD, port, path, query, and fragment components
  • Support for complex URL patterns including ports, queries, and fragments
  • Support for IP addresses in URLs
  • Support for both direct input and file processing via CLI or API
  • Output as JSON, CSV, or text format through CLI or API
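
To make the component breakdown above concrete, here is a minimal sketch built on the standard library's urllib.parse.urlsplit rather than on Pyrolysate itself. The helper name split_url_sketch and the naive label-based split of the host are assumptions for illustration only: the sketch does not consult a TLD list, so it would mis-split multi-part suffixes such as gov.uk and does not treat IP addresses specially.

```python
from urllib.parse import urlsplit


def split_url_sketch(raw: str) -> dict[str, str]:
    """Split a URL into the components listed above (illustrative only)."""
    # urlsplit only recognises the host when a scheme or "//" is present,
    # so prepend "//" for bare inputs such as "example.com/path".
    has_scheme = "://" in raw
    parts = urlsplit(raw if has_scheme else "//" + raw)
    labels = (parts.hostname or "").split(".")
    return {
        "scheme": parts.scheme,
        # Naive assumption: last label is the TLD, the one before it the main domain.
        "subdomain": ".".join(labels[:-2]),
        "second_level_domain": labels[-2] if len(labels) >= 2 else "",
        "top_level_domain": labels[-1] if len(labels) >= 2 else "",
        "port": str(parts.port) if parts.port is not None else "",
        "path": parts.path.lstrip("/"),
        "query": parts.query,
        "fragment": parts.fragment,
    }


result = split_url_sketch("https://www.example.com:443/blog/post?q=test#section1")
```

The field names mirror those in the URL Parse Output table so the result can be compared with the library's documented output.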

Email Parsing

  • Extract username, mail server, and domain components
  • Support for plus addressing (e.g., user+tag@gmail.com)
  • Support for both direct input and file processing via CLI or API
  • Output as JSON, CSV, or text format through CLI or API
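
The split into username, plus address, mail server, and domain can be pictured with a few string partitions. This is a sketch under stated assumptions, not the library's implementation: split_email_sketch is a hypothetical helper, it performs no validation, and it treats everything after the last dot as the domain.

```python
def split_email_sketch(address: str) -> dict[str, str]:
    """Split an email address into the fields described above (illustrative only)."""
    # Everything after the last "@" is the host; the rest is the local part.
    local, _, host = address.rpartition("@")
    # Anything after the first "+" in the local part is the plus address.
    username, _, plus_address = local.partition("+")
    # Naive assumption: the last label is the domain (TLD), the rest the mail server.
    mail_server, _, domain = host.rpartition(".")
    return {
        "username": username,
        "plus_address": plus_address,
        "mail_server": mail_server,
        "domain": domain,
    }


fields = split_email_sketch("user+tag@gmail.com")
```

An address without a plus sign simply yields an empty plus_address field in this sketch.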

Top Level Domain Validation

  • Automatic updates from IANA's official TLD list
  • Local TLD file caching for offline use
  • Fallback to common TLDs if both online and local sources fail
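
The three-level lookup described above (online list, then local cache, then a built-in fallback) might be sketched as follows. All names here are hypothetical: load_tlds_sketch, the injected fetch_online callable, and the contents of FALLBACK_TLDS are assumptions, and refreshing the cache on every successful fetch is a design choice of this sketch rather than documented library behavior.

```python
from pathlib import Path
from typing import Callable

# Hypothetical fallback list; the library's actual set of common TLDs is not shown here.
FALLBACK_TLDS = ["com", "org", "net", "edu", "gov"]


def load_tlds_sketch(fetch_online: Callable[[], list[str]], cache_file: Path) -> list[str]:
    """Try the online source, then the local cache, then a built-in fallback."""
    try:
        tlds = fetch_online()
        # Refresh the cache so later offline runs can use it.
        cache_file.write_text("\n".join(tlds), encoding="utf-8")
        return tlds
    except OSError:
        pass  # network (or cache write) failed; fall through to the cache
    try:
        cached = cache_file.read_text(encoding="utf-8").split()
        if cached:
            return cached
    except OSError:
        pass  # no usable cache either
    return list(FALLBACK_TLDS)
```

Passing the fetch function in as an argument keeps the sketch testable without network access; urllib's URLError is a subclass of OSError, so a urllib-based fetcher would be caught by the first handler.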

Flexible Input/Output

  • Process single or multiple entries
  • Support for government domain emails (.gov.tld)
  • Custom delimiters for file input
  • Multiple output formats (JSON, CSV, text), with text (.txt) as the default
  • Pretty-printed or minified JSON output
  • Console output or file saving options
  • Memory-efficient processing of large datasets using Python generators
  • Support for compressed input files:
    • ZIP archives (processes all text files within .zip)
    • GZIP (.gz)
    • BZIP2 (.bz2)
    • LZMA (.xz, .lzma)
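
As an illustration of how generator-based reading and transparent decompression can fit together, the sketch below picks a standard-library opener by file extension and yields one entry at a time. The names OPENERS and iter_entries_sketch are hypothetical, and the sketch assumes UTF-8 text with one entry per line; ZIP archives need different handling and are left out here.

```python
import bz2
import gzip
import lzma
from pathlib import Path
from typing import Iterator

# Map file suffixes to the stdlib opener that can read them as text.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open, ".lzma": lzma.open}


def iter_entries_sketch(path: str) -> Iterator[str]:
    """Yield non-empty lines one at a time, decompressing on the fly if needed."""
    # Fall back to the built-in open() for uncompressed files.
    opener = OPENERS.get(Path(path).suffix.lower(), open)
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            entry = line.strip()
            if entry:
                yield entry
```

Because the function is a generator, only the current line is held in memory, which is the property that makes this style suitable for large inputs.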

Developer Friendly

  • Type hints for better IDE support
  • Comprehensive docstrings
  • Modular design for easy integration
  • Command-line interface for quick testing

🚀 Installation

From PyPI

pip install pyrolysate

For Development

  1. Clone the repository
git clone https://github.com/dawnandrew100/pyrolysate.git
cd pyrolysate
  2. Create and activate a virtual environment
# Using hatch (recommended)
hatch env create

# Or using venv
python -m venv .venv
# Windows
.venv\Scripts\activate
# Unix/MacOS
source .venv/bin/activate
  3. Install in development mode
# Using hatch
hatch run dev

# Or using pip
pip install -e .

Verify Installation

# Using hatch (recommended)
hatch run pyro -u example.com

# Or using the CLI directly
pyro -u example.com

The CLI command pyro will be available after installation. If the command isn't found, ensure Python's Scripts directory is in your PATH.

Usage

Input File Parsing

from pyrolysate import parse_input_file

Parse file with default newline delimiter

urls = parse_input_file("urls.txt")

Parse file with custom delimiter

emails = parse_input_file("emails.csv", delimiter=",")
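
Conceptually, the delimiter argument controls how the file's contents are split into individual entries. The sketch below shows that idea on an in-memory string; split_entries_sketch is a hypothetical helper, and trimming whitespace and dropping blank entries are assumptions of this sketch, not confirmed behavior of parse_input_file.

```python
def split_entries_sketch(text: str, delimiter: str = "\n") -> list[str]:
    """Split raw file contents on a delimiter and drop blank entries."""
    return [entry.strip() for entry in text.split(delimiter) if entry.strip()]


by_line = split_entries_sketch("example.com\ntest.org\n")
by_comma = split_entries_sketch("example.com, test.org", delimiter=",")
```

Both calls produce the same two-element list, which is why the same downstream parsing code can be used regardless of how the input file is laid out.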

Supported Outputs

  • JSON (prettified or minified)
  • CSV
  • Text (default)
  • File output with custom naming
  • Console output

Email Parsing

from pyrolysate import email

Parse single email

result = email.parse_email("[email protected]")

Parse plus addressed email

result = email.parse_email("[email protected]")

Parse multiple emails

emails = ["[email protected]", "[email protected]"]
result = email.parse_email_array(emails)

Convert to JSON

json_output = email.to_json("[email protected]")
json_output = email.to_json(["[email protected]", "[email protected]"])

Save to JSON file

email.to_json_file("output", "[email protected]")
email.to_json_file("output", ["[email protected]", "[email protected]"])

Convert to CSV

csv_output = email.to_csv("[email protected]")
csv_output = email.to_csv(["[email protected]", "[email protected]"])

Save to CSV file

email.to_csv_file("output", "[email protected]")
email.to_csv_file("output", ["[email protected]", "[email protected]"])

URL Parsing

from pyrolysate import url

Parse single URL

result = url.parse_url("https://www.example.com/path?q=test#fragment")

Parse multiple URLs

urls = ["example.com", "https://www.test.org"]
result = url.parse_url_array(urls)

Convert to JSON

json_output = url.to_json("example.com")
json_output = url.to_json(["example.com", "test.org"])

Save to JSON file

url.to_json_file("output", "example.com")
url.to_json_file("output", ["example.com", "test.org"])

Convert to CSV

csv_output = url.to_csv("example.com")
csv_output = url.to_csv(["example.com", "test.org"])

Save to CSV file

url.to_csv_file("output", "example.com")
url.to_csv_file("output", ["example.com", "test.org"])

Command Line Interface

CLI help

pyro -h

Parse single URL

pyro -u example.com

Parse multiple URLs

pyro -u example1.com example2.com

Parse URLs from file (one per line by default)

pyro -u -i urls.txt

Parse URLs from CSV file with comma delimiter

pyro -u -i urls.csv -d ","

Parse email with plus addressing

Parse multiple emails and save as JSON

Parse URLs from file and save as CSV

pyro -u -i urls.txt -c -o parsed_urls

Parse emails from file with comma delimiter

pyro -e -i emails.txt -d "," -o output

Parse emails with non-prettified JSON output

pyro -e [email protected] -j -np

Parse different file types

# Parse log file
pyro -u -i server.log

# Parse compressed log file
pyro -u -i server.log.gz

# Parse BZIP2 compressed file
pyro -e -i emails.txt.bz2

# Parse ZIP archive containing logs and text files
pyro -u -i archive.zip

API Reference

Email Class

| Method | Parameters | Description |
| --- | --- | --- |
| parse_email(email_str) | email_str: str | Parses single email address |
| parse_email_array(emails) | emails: list[str] | Parses list of email addresses |
| to_json(emails, prettify=True) | emails: str \| list[str], prettify: bool | Converts to JSON format |
| to_json_file(file_name, emails, prettify=True) | file_name: str, emails: list[str], prettify: bool | Converts and saves JSON to file |
| to_csv(emails) | emails: str \| list[str] | Converts to CSV format |
| to_csv_file(file_name, emails) | file_name: str, emails: list[str] | Converts and saves CSV to file |

URL Class

| Method | Parameters | Description |
| --- | --- | --- |
| parse_url(url_str, tlds=[]) | url_str: str, tlds: list[str] | Parses single URL |
| parse_url_array(urls, tlds=[]) | urls: list[str], tlds: list[str] | Parses list of URLs |
| to_json(urls, prettify=True) | urls: str \| list[str], prettify: bool | Converts to JSON format |
| to_json_file(file_name, urls, prettify=True) | file_name: str, urls: list[str], prettify: bool | Converts and saves JSON to file |
| to_csv(urls) | urls: str \| list[str] | Converts to CSV format |
| to_csv_file(file_name, urls) | file_name: str, urls: list[str] | Converts and saves CSV to file |
| get_tld(path_to_tlds_file='tld.txt') | path_to_tlds_file: str = 'tld.txt' | Fetches current TLD list from IANA |
| local_tld_file(file_name) | file_name: str | Fetches and stores get_tld() output as a local txt file |

Miscellaneous

| Method | Parameters | Description |
| --- | --- | --- |
| parse_input_file(input_file_name, delimiter='\n') | input_file_name: str, delimiter: str | Parses input file into a Python list by delimiter |

CLI Reference

| Argument | Type | Value when argument is omitted | Description |
| --- | --- | --- | --- |
| target | str | None | Email or URL string(s) to process |
| -u, --url | flag | False | Specify URL input |
| -e, --email | flag | False | Specify Email input |
| -i, --input_file | str | None | Input file name with extension |
| -o, --output_file | str | None | Output file name without extension |
| -c, --csv | flag | False | Save output as CSV format |
| -j, --json | flag | False | Save output as JSON format |
| -np, --no_prettify | flag | True | Turn off prettified JSON output |
| -d, --delimiter | str | '\n' | Delimiter for input file parsing |
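
One way to read the table above is as an argparse configuration. The following is a hypothetical reconstruction, not the project's actual CLI source: build_parser_sketch is an invented name, and modelling -np as a store_false action is an assumption that happens to explain why its value is True when the flag is omitted.

```python
import argparse


def build_parser_sketch() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(prog="pyro")
    parser.add_argument("target", nargs="*", default=None,
                        help="Email or URL string(s) to process")
    parser.add_argument("-u", "--url", action="store_true", help="Specify URL input")
    parser.add_argument("-e", "--email", action="store_true", help="Specify Email input")
    parser.add_argument("-i", "--input_file", default=None)
    parser.add_argument("-o", "--output_file", default=None)
    parser.add_argument("-c", "--csv", action="store_true")
    parser.add_argument("-j", "--json", action="store_true")
    # store_false: the value stays True (prettified) unless -np is passed.
    parser.add_argument("-np", "--no_prettify", action="store_false")
    parser.add_argument("-d", "--delimiter", default="\n")
    return parser


# Equivalent to: pyro -u -i urls.csv -d ","
args = build_parser_sketch().parse_args(["-u", "-i", "urls.csv", "-d", ","])
```

With this reading, args.no_prettify acts as the "prettify" switch: it is True by default and becomes False only when -np appears on the command line.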

Input File Support

| Format | Extension | Description |
| --- | --- | --- |
| Text | .txt | Plain text files |
| Log | .log | Plain text log files |
| CSV | .csv | Comma-separated values |
| ZIP | .zip | Archives containing text files |
| GZIP | .gz | GZIP compressed files |
| BZIP2 | .bz2 | BZIP2 compressed files |
| LZMA | .xz, .lzma | LZMA compressed files |

Output Formats

Email Parse Output

| Field | Description | Example |
| --- | --- | --- |
| input | Full email | user+tag@gmail.com |
| username | Part before + or @ symbol | user |
| plus_address | Optional part between + and @ | tag |
| mail_server | Domain before TLD | gmail |
| domain | Top-level domain | com |

Example output:

JSON:

{"user+tag@gmail.com": 
    {
    "username": "user",
    "plus_address": "tag",
    "mail_server": "gmail",
    "domain": "com"
    }
}

CSV:

email,username,plus_address,mail_server,domain
user+tag@gmail.com,user,tag,gmail,com
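
For readers who want to reproduce these output shapes outside the library, the sketch below serializes an already-parsed record with the standard json and csv modules. It is an illustration only: the parsed dictionary is typed in by hand from the example above, and the exact whitespace that Pyrolysate emits for prettified JSON may differ from json.dumps with indent=4.

```python
import csv
import io
import json

# Hand-written record matching the example output above.
parsed = {
    "user+tag@gmail.com": {
        "username": "user",
        "plus_address": "tag",
        "mail_server": "gmail",
        "domain": "com",
    }
}

# Prettified versus minified JSON.
pretty = json.dumps(parsed, indent=4)
minified = json.dumps(parsed, separators=(",", ":"))

# CSV: one header row, then one row per parsed address.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["email", "username", "plus_address", "mail_server", "domain"])
for address, fields in parsed.items():
    writer.writerow([address, *fields.values()])
csv_text = buffer.getvalue()
```

The minified form drops all optional whitespace, which is the kind of output the -np flag is described as selecting.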

URL Parse Output

| Field | Description | Example |
| --- | --- | --- |
| scheme | Protocol | https |
| subdomain | Domain prefix | www |
| second_level_domain | Main domain | example |
| top_level_domain | Domain suffix | com |
| port | Port number | 443 |
| path | URL path | blog/post |
| query | Query parameters | q=test |
| fragment | URL fragment | section1 |

Example output:

JSON:

{"https://www.example.com:443/blog/post?q=test#section1": 
    {
    "scheme": "https",
    "subdomain": "www",
    "second_level_domain": "example",
    "top_level_domain": "com",
    "port": "443",
    "path": "blog/post",
    "query": "q=test",
    "fragment": "section1"
    }
}

CSV:

url,scheme,subdomain,second_level_domain,top_level_domain,port,path,query,fragment
https://www.example.com:443/blog/post?q=test#section1,https,www,example,com,443,blog/post,q=test,section1

Supported Formats

Email Formats

URL Formats

  • Basic: example.com
  • With subdomain: www.example.com
  • With scheme: https://example.com
  • With path: example.com/path/to/file.txt
  • With port: example.com:8080
  • With query: example.com/search?q=test
  • With fragment: example.com#section1
  • IP addresses: 192.168.1.1:8080
  • Government domains: agency.gov.uk
  • Full complex URLs: https://www.example.gov.uk:8080/path?q=test#section1

Input File Support

  • Plain text files (.txt)
  • Plain text log files (.log)
  • Comma-separated values (.csv)
  • ZIP archives containing text files (.zip)
  • GZIP compressed files (.gz)
  • BZIP2 compressed files (.bz2)
  • LZMA compressed files (.xz, .lzma)

ZIP Archive Support

  • Processes all text files within the archive (.txt, .csv, .log)
  • Handles nested directories
  • Continues processing if some files are corrupted
  • UTF-8 encoding expected for text files
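
The behavior listed above can be approximated with the standard zipfile module. This is a sketch under stated assumptions rather than the library's code: iter_zip_lines_sketch and TEXT_SUFFIXES are hypothetical names, each member is read fully into memory before being split into lines, and members that fail to decompress or decode are silently skipped.

```python
import zipfile
import zlib
from typing import Iterator

TEXT_SUFFIXES = (".txt", ".csv", ".log")


def iter_zip_lines_sketch(archive_path: str) -> Iterator[str]:
    """Yield stripped, non-empty lines from every text member of a ZIP archive."""
    with zipfile.ZipFile(archive_path) as archive:
        # namelist() includes members in nested directories, e.g. "logs/a.txt".
        for name in archive.namelist():
            if not name.lower().endswith(TEXT_SUFFIXES):
                continue
            try:
                text = archive.read(name).decode("utf-8")
            except (zipfile.BadZipFile, zlib.error, UnicodeDecodeError):
                continue  # skip unreadable members and keep going
            for line in text.splitlines():
                if line.strip():
                    yield line.strip()
```

Skipping a bad member instead of raising is what lets processing continue when only part of an archive is damaged.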

Outputs

  • Text file (default)
  • JSON file (prettified or minified)
  • CSV file
  • Console output
