Skip to content

"Sixerl" is a fixed-length, sliceable URL identifier system for efficient database storage and querying.

License

Notifications You must be signed in to change notification settings

valyu-network/sxsurl

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

13 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

SXURL: Slice eXact URL Identifier ("sixerl")

Crates.io Documentation License Build Status

SXURL (pronounced "Sixerl") is a fixed-length, sliceable URL identifier system designed for efficient database storage and querying. It converts URLs into deterministic 256-bit identifiers where each URL component occupies a fixed position, enabling fast substring-based filtering and indexing.

Features

  • 🔒 Fixed-length: All SXURL identifiers are exactly 256 bits (64 hex characters)
  • 📏 Sliceable: Each URL component has a fixed position for substring filtering
  • 🔄 Deterministic: Same input always produces the same output
  • 🛡️ Collision-resistant: Uses SHA-256 hashing for component fingerprinting
  • 🌐 Standards-compliant: Supports IDNA, Public Suffix List, and standard URL schemes
  • ⚡ Zero-copy: Efficient parsing and encoding with minimal allocations
  • 🧪 Thoroughly tested: 100+ comprehensive tests covering edge cases

Quick Start

Add this to your Cargo.toml:

[dependencies]
sxurl = "0.1"

Basic Usage

use sxurl::{encode_url_to_hex, decode_hex, matches_component};

// Encode a URL to SXURL
let sxurl_hex = encode_url_to_hex("https://docs.rs/serde")?;
println!("SXURL: {}", sxurl_hex); // 64 hex characters

// Decode and inspect components
let decoded = decode_hex(&sxurl_hex)?;
println!("Scheme: {}", decoded.header.scheme);
println!("Has subdomain: {}", decoded.header.subdomain_present);

// Fast component matching for filtering
let is_docs_rs = matches_component(&sxurl_hex, "domain", "docs")?;
assert!(is_docs_rs);

let is_rs_tld = matches_component(&sxurl_hex, "tld", "rs")?;
assert!(is_rs_tld);

URL Parsing Utilities

use sxurl::{split_url, parse_query, get_anchor, strip_anchor, join_url_path};

// Parse URL into all components at once
let parts = split_url("https://api.github.com/repos?page=1#readme")?;
println!("Domain: {}", parts.domain);        // "github"
println!("Subdomain: {:?}", parts.subdomain); // Some("api")
println!("Anchor: {:?}", parts.anchor);       // Some("readme")

// Work with query parameters
let params = parse_query("https://example.com?foo=bar&page=2")?;
println!("Page: {:?}", params.get("page"));   // Some("2")

// Handle anchors (fragments)
let anchor = get_anchor("https://docs.rs#search")?; // Some("search")
let clean_url = strip_anchor("https://example.com#top")?; // "https://example.com"

// Join URLs properly
let api_url = join_url_path("https://api.example.com/v1", "users")?;
// Result: "https://api.example.com/v1/users"

Database Integration Example

use sxurl::encode_url_to_hex;

// Store URLs efficiently in your database
let urls = vec![
    "https://docs.rs/serde",
    "https://crates.io/crates/tokio",
    "https://github.com/rust-lang/rust",
];

for url in urls {
    let sxurl = encode_url_to_hex(url)?;
    // Store sxurl as CHAR(64) or BINARY(32) in your database
    println!("INSERT INTO urls (sxurl, original_url) VALUES ('{}', '{}')", sxurl, url);
}

// Query by domain efficiently using substring matching
// SELECT * FROM urls WHERE sxurl LIKE '_______________example_______________'
//                                       ^domain slice position (chars 7-22)

SXURL Format

The 256-bit SXURL has this fixed layout:

Component Hex Range Bits Description
header [0..3) 12 Version, scheme, flags
tld_hash [3..7) 16 Top-level domain hash
domain [7..22) 60 Domain name hash
subdomain [22..30) 32 Subdomain hash
port [30..34) 16 Port number
path [34..49) 60 Path hash
params [49..58) 36 Query parameters hash
fragment [58..64) 24 Fragment hash

Header Format (12 bits)

  • Version (4 bits): Currently always 1
  • Scheme (3 bits): 0=https, 1=http, 2=ftp
  • Flags (5 bits): Component presence indicators
    • Bit 4: Subdomain present
    • Bit 3: Query parameters present
    • Bit 2: Fragment present
    • Bit 1: Non-default port present
    • Bit 0: Reserved (always 0)

Advanced Usage

Custom Encoder

use sxurl::SxurlEncoder;

let encoder = SxurlEncoder::new();

// Encode to bytes
let sxurl_bytes = encoder.encode("https://example.com")?;
assert_eq!(sxurl_bytes.len(), 32); // Always 32 bytes

// Encode to hex string
let sxurl_hex = encoder.encode_to_hex("https://example.com")?;
assert_eq!(sxurl_hex.len(), 64); // Always 64 hex chars

Component Filtering

use sxurl::{encode_url_to_hex, matches_component};

let urls = vec![
    "https://api.github.com/repos/rust-lang/rust",
    "https://docs.github.com/en/rest",
    "https://github.blog/2023-01-01/announcement",
];

// Find all GitHub URLs
for url in &urls {
    let sxurl = encode_url_to_hex(url)?;
    if matches_component(&sxurl, "domain", "github")? {
        println!("GitHub URL: {}", url);
    }
}

// Find all API endpoints
for url in &urls {
    let sxurl = encode_url_to_hex(url)?;
    if matches_component(&sxurl, "subdomain", "api")? {
        println!("API URL: {}", url);
    }
}

Hash Function Access

use sxurl::ComponentHasher;

// Access individual hash functions
let tld_hash = ComponentHasher::hash_tld("com")?;
let domain_hash = ComponentHasher::hash_domain("example")?;
let path_hash = ComponentHasher::hash_path("/api/v1/users")?;

println!("TLD hash: 0x{:04x}", tld_hash);
println!("Domain hash: 0x{:015x}", domain_hash);
println!("Path hash: 0x{:015x}", path_hash);

Supported URL Schemes

  • https (scheme code 0) - Default port 443
  • http (scheme code 1) - Default port 80
  • ftp (scheme code 2) - Default port 21

Use Cases

Database Indexing

Store URLs as fixed-length identifiers with efficient B-tree indexing:

CREATE TABLE url_index (
    sxurl CHAR(64) PRIMARY KEY,
    original_url TEXT NOT NULL,
    created_at TIMESTAMP DEFAULT NOW()
);

-- Index by domain (characters 7-22)
CREATE INDEX idx_domain ON url_index (SUBSTRING(sxurl, 8, 15));

-- Index by TLD (characters 3-7)
CREATE INDEX idx_tld ON url_index (SUBSTRING(sxurl, 4, 4));

URL Deduplication

Quickly identify duplicate URLs across different formats:

use sxurl::encode_url_to_hex;
use std::collections::HashSet;

let mut seen_urls = HashSet::new();

let urls = vec![
    "https://Example.Com/Path",
    "https://example.com/path",    // Same after normalization
    "https://example.com/path/",   // Different path
];

for url in urls {
    let sxurl = encode_url_to_hex(url)?;
    if seen_urls.insert(sxurl) {
        println!("New URL: {}", url);
    } else {
        println!("Duplicate URL: {}", url);
    }
}

Fast Domain Filtering

Filter large URL datasets by domain without parsing:

// Filter millions of URLs by domain using simple string operations
let domain_filter = "1a2b3c4d5e6f7890abc"; // Hash of target domain
let urls_in_domain: Vec<_> = all_sxurls
    .iter()
    .filter(|sxurl| &sxurl[7..22] == domain_filter)
    .collect();

Error Handling

All functions return Result<T, SxurlError>. Common error cases:

use sxurl::{encode_url_to_hex, SxurlError};

match encode_url_to_hex("invalid-url") {
    Ok(sxurl) => println!("Success: {}", sxurl),
    Err(SxurlError::InvalidScheme) => println!("Unsupported URL scheme"),
    Err(SxurlError::HostNotDns) => println!("Host must be a DNS name, not IP"),
    Err(SxurlError::InvalidLength) => println!("Invalid SXURL format"),
    Err(e) => println!("Other error: {}", e),
}

Performance

SXURL is designed for high-performance applications:

  • Encoding: ~1-5 μs per URL (depending on complexity)
  • Decoding: ~100-500 ns per SXURL
  • Component matching: ~50-100 ns per comparison
  • Memory usage: Fixed 32 bytes per SXURL, minimal temporary allocations

Benchmarks on a modern CPU:

encode_simple_url      time: 1.2 μs
encode_complex_url     time: 4.8 μs
decode_sxurl          time: 245 ns
component_match       time: 67 ns

Technical Details

Hash Function

SXURL uses labeled SHA-256 hashing: H_n(label, data) = lower_n(SHA256(label || 0x00 || data))

  • Collision resistance: ~2^(n/2) where n is the bit width
  • Domain separation: Different labels produce different hashes for same data
  • Deterministic: Same input always produces same output

URL Normalization

  • Scheme and host converted to lowercase
  • IDNA (Internationalized Domain Names) support
  • Public Suffix List (PSL) for proper domain/TLD splitting
  • IP addresses rejected (DNS names only)

Component Handling

  • Empty components stored as zero (not hashed)
  • Default ports (80, 443, 21) stored explicitly
  • Query parameters and fragments preserved as-is
  • Path normalization preserves original structure

Testing

Run the comprehensive test suite:

cargo test                    # Run all tests
cargo test --doc              # Test documentation examples
cargo test --release          # Test optimized builds
cargo bench                   # Run performance benchmarks

The library includes 100+ tests covering:

  • Specification compliance
  • Edge cases and error conditions
  • Round-trip consistency
  • Hash collision resistance
  • Performance regression testing

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Ensure all tests pass
  5. Submit a pull request

Specification

SXURL follows a formal specification available in SXURL-SPEC.md. The implementation is designed to be compatible with other SXURL implementations in different languages.

Changelog

See CHANGELOG.md for detailed release history.

License

Licensed under the Apache License, Version 2.0 (LICENSE or http://www.apache.org/licenses/LICENSE-2.0).

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be licensed under the Apache License, Version 2.0, without any additional terms or conditions.


SXURL - Efficient, deterministic URL identifiers for modern applications.

About

"Sixerl" is a fixed-length, sliceable URL identifier system for efficient database storage and querying.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages