feat: Add checksum-only API to ByteStorage (decouple integrity from compression) #13

@27Bslash6

Summary

ByteStorage currently couples LZ4 compression with xxHash3-64 integrity checking. All store()/retrieve() operations require both features enabled together. This prevents using the Rust xxHash3 implementation for integrity-only use cases.

Current State

// StorageEnvelope::new() requires ALL features
#[cfg(all(feature = "compression", feature = "checksum"))]
pub fn new(data: Vec<u8>, format: String) -> Result<Self, ByteStorageError>

// No checksum-only path exists

The Python Arrow/Orjson serializers bypass ByteStorage entirely and use their own Blake3 checksums because:

  1. LZ4 compression is ineffective on Arrow IPC (columnar) and JSON (already compact)
  2. There is no way to obtain only the xxHash3 checksum without also paying the compression overhead

Proposed Change

Add a checksum-only API that provides xxHash3-64 integrity without compression:

// Option A: New feature-gated methods
#[cfg(feature = "checksum")]
impl ByteStorage {
    pub fn checksum(&self, data: &[u8]) -> [u8; 8];
    pub fn verify_checksum(&self, data: &[u8], expected: &[u8; 8]) -> bool;
}

// Option B: Separate IntegrityChecker struct
pub struct IntegrityChecker;
impl IntegrityChecker {
    pub fn compute(data: &[u8]) -> [u8; 8];
    pub fn verify(data: &[u8], expected: &[u8; 8]) -> bool;
}
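As a rough illustration of Option B, the signatures above could be given bodies along these lines. This is a minimal sketch, not the crate's implementation: `DefaultHasher` from the standard library stands in for xxHash3-64 so the example builds without extra dependencies, and the little-endian byte order of the digest is an assumption.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Hypothetical integrity-only checker following Option B.
pub struct IntegrityChecker;

impl IntegrityChecker {
    /// Returns an 8-byte digest of `data`.
    ///
    /// ASSUMPTION: `DefaultHasher` is only a placeholder. Its algorithm is
    /// unspecified and may change between Rust releases, so it must not be
    /// used for persisted checksums. A real implementation would call the
    /// xxHash3-64 routine that ByteStorage already uses.
    pub fn compute(data: &[u8]) -> [u8; 8] {
        let mut hasher = DefaultHasher::new();
        hasher.write(data);
        // ASSUMPTION: digest is serialized little-endian.
        hasher.finish().to_le_bytes()
    }

    /// Recomputes the digest of `data` and compares it with `expected`.
    pub fn verify(data: &[u8], expected: &[u8; 8]) -> bool {
        Self::compute(data) == *expected
    }
}
```

A caller would store the result of `IntegrityChecker::compute(payload)` next to the payload and call `IntegrityChecker::verify(payload, &stored)` on read. Neither function touches LZ4, which is the point of the proposal.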

Benefits

  1. Consistency: All serializers use same xxHash3-64 algorithm via Rust FFI
  2. Performance: Arrow/Orjson get roughly 18x faster checksums (36 GB/s vs Blake3's 2 GB/s)
  3. Space: 8-byte checksums vs 32-byte Blake3 (24 bytes saved per item)
  4. No wasted cycles: Skip LZ4 where compression is ineffective
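The space figure in item 3 scales linearly with the number of stored items. A small sketch of that arithmetic, using only the digest sizes stated above (the function name is hypothetical):

```rust
/// Digest sizes in bytes, as stated in this issue.
const XXH3_64_LEN: u64 = 8;
const BLAKE3_LEN: u64 = 32;

/// Bytes saved across `items` stored items when each carries an
/// 8-byte xxHash3-64 digest instead of a 32-byte Blake3 digest.
pub fn checksum_bytes_saved(items: u64) -> u64 {
    items * (BLAKE3_LEN - XXH3_64_LEN)
}
```

For one million cached items this comes to 24,000,000 bytes, or about 24 MB of checksum overhead removed.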

Current Workaround

Use the xxhash Python package directly in the Arrow/Orjson serializers (the option labelled B in the earlier discussion, distinct from Option B above). This gives algorithm consistency without Rust changes, but adds a Python dependency.

Context

  • Related discussion: xxHash3 migration in ByteStorage (2025-12-05)
  • Affected files: arrow_serializer.py, orjson_serializer.py currently use Blake3
  • Architecture doc: strategy/saas-protocol-v1.0.md

Acceptance Criteria

  • Checksum-only API available without enabling compression feature
  • PyO3 bindings expose checksum functions
  • Documentation updated
  • Benchmark comparing Python xxhash vs Rust FFI overhead
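For the benchmark criterion, the Rust-side baseline could be measured with a helper like the one below. This is a sketch under assumptions: the function name and signature are hypothetical, it times only the native call (not the PyO3 boundary or the Python xxhash package), and a proper benchmark would add warm-up runs and repeated samples.

```rust
use std::hint::black_box;
use std::time::Instant;

/// Runs `checksum` over `data` for `iterations` passes and returns the
/// observed throughput in GB/s (10^9 bytes per second).
pub fn throughput_gb_per_s<F>(data: &[u8], iterations: u32, mut checksum: F) -> f64
where
    F: FnMut(&[u8]) -> [u8; 8],
{
    let start = Instant::now();
    for _ in 0..iterations {
        // black_box keeps the optimizer from eliding the call or its input.
        black_box(checksum(black_box(data)));
    }
    let secs = start.elapsed().as_secs_f64();
    let bytes = data.len() as f64 * f64::from(iterations);
    bytes / secs / 1e9
}
```

Passing the checksum-only function as the closure gives a number to set against the same payload hashed through the Python bindings. The gap between the two is the FFI overhead the criterion asks about.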

Metadata

Labels: enhancement (New feature or request)