[![CI](https://github.com/dahlem/torchcachex/actions/workflows/ci.yml/badge.svg)](https://github.com/dahlem/torchcachex/actions)
[![codecov](https://codecov.io/gh/dahlem/torchcachex/branch/main/graph/badge.svg)](https://codecov.io/gh/dahlem/torchcachex)

- **Drop-in PyTorch module caching with Arrow IPC + SQLite backend**
+ **Drop-in PyTorch module caching with Arrow IPC + in-memory index backend**

`torchcachex` provides transparent, per-sample caching for non-trainable PyTorch modules with:
- ✅ **O(1) append-only writes** via incremental Arrow IPC segments
- - ✅ **O(1) batched lookups** via SQLite index + Arrow memory-mapping
+ - ✅ **O(1) batched lookups** via in-memory index + Arrow memory-mapping
- ✅ **Native tensor storage** with automatic dtype preservation
- ✅ **LRU hot cache** for in-process hits
- ✅ **Async writes** (non-blocking forward pass)
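The LRU hot cache in the feature list is, conceptually, an ordered map that evicts the least recently used entry when full. A minimal, dependency-free sketch (the class name and capacity are illustrative, not torchcachex's actual implementation):

```python
from collections import OrderedDict


class HotCache:
    """Illustrative LRU cache: recently used keys stay, the oldest is evicted."""

    def __init__(self, capacity: int = 2):
        self.capacity = capacity
        self._data: OrderedDict = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict least recently used


cache = HotCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")         # touch "a" so it becomes most recent
cache.put("c", 3)      # evicts "b", the least recently used
print(cache.get("b"))  # → None
print(cache.get("a"))  # → 1
```

`OrderedDict.move_to_end` keeps both lookup and eviction at O(1), which is why this pattern is the standard stdlib way to build an in-process hot cache.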
@@ -422,7 +422,7 @@ Wraps a PyTorch module to add transparent per-sample caching.

### `ArrowIPCCacheBackend`

- Persistent cache using Arrow IPC segments with SQLite index for O(1) operations.
+ Persistent cache using Arrow IPC segments with in-memory index for O(1) operations.

**Storage Format:**
```
@@ -431,7 +431,7 @@ cache_dir/module_id/
  segment_000000.arrow   # Incremental Arrow IPC files
  segment_000001.arrow
  ...
-   index.db             # SQLite with WAL mode
+   index.pkl            # Pickled dict: key → (segment_id, row_offset)
  schema.json            # Auto-inferred Arrow schema
```
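The `index.pkl` entry above is a plain pickled dict mapping each cache key to a `(segment_id, row_offset)` pair. A sketch of how such an index can be persisted atomically via a temp-file swap (function names are illustrative, not the library's API):

```python
import os
import pickle
import tempfile


def persist_index(index: dict, path: str) -> None:
    """Write the index to a temp file in the same directory, then atomically rename."""
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(index, f)
        # Atomic swap: readers see either the old or the new file, never a partial one.
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise


def load_index(path: str) -> dict:
    with open(path, "rb") as f:
        return pickle.load(f)


with tempfile.TemporaryDirectory() as cache_dir:
    path = os.path.join(cache_dir, "index.pkl")
    # key → (segment_id, row_offset), as in the layout above
    persist_index({"sample-0001": (0, 0), "sample-0002": (0, 1)}, path)
    persist_index({"sample-0001": (0, 0), "sample-0002": (0, 1), "sample-0003": (1, 0)}, path)
    assert load_index(path)["sample-0003"] == (1, 0)
```

Writing the temp file in the same directory as the target matters: `os.replace` is only atomic when source and destination are on the same filesystem.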

@@ -446,22 +446,23 @@ cache_dir/module_id/
- `current_rank` (Optional[int]): Current process rank (default: None)

**Methods:**
- - `get_batch(keys, map_location="cpu")`: O(1) batch lookup via SQLite index + memory-mapped Arrow
+ - `get_batch(keys, map_location="cpu")`: O(1) batch lookup via in-memory index + memory-mapped Arrow
- `put_batch(items)`: O(1) append-only write to pending buffer
- `flush()`: Force flush pending writes to new Arrow segment

**Features:**
- **O(1) writes**: New data appended to incremental segments, no rewrites
- - **O(1) reads**: SQLite index points directly to (segment_id, row_offset)
+ - **O(1) reads**: In-memory dict index points directly to (segment_id, row_offset)
- **Native tensors**: Automatic dtype preservation via Arrow's type system
- **Schema inference**: Automatically detects structure on first write
- - **Crash safety**: Atomic commits via SQLite WAL + temp file approach
+ - **Crash safety**: Automatic index rebuild from segments on corruption
+ - **No database dependencies**: Simple pickle-based index persistence
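To make the method contract above concrete, here is a schematic in-memory stand-in for the backend interface. This is not the real `ArrowIPCCacheBackend` (actual segments are Arrow IPC files on disk and values are tensors); it only mirrors the `put_batch`/`flush`/`get_batch` semantics:

```python
class ToyCacheBackend:
    """Schematic stand-in: pending buffer + append-only segments + dict index."""

    def __init__(self):
        self._pending = []   # items awaiting flush
        self._segments = []  # each segment is an immutable list of values
        self._index = {}     # key → (segment_id, row_offset)

    def put_batch(self, items):
        """O(1) amortized append to the pending buffer; nothing is sealed yet."""
        self._pending.extend(items)  # items are (key, value) pairs

    def flush(self):
        """Seal the pending buffer into a new append-only segment."""
        if not self._pending:
            return
        segment_id = len(self._segments)
        self._segments.append([value for _, value in self._pending])
        for row_offset, (key, _) in enumerate(self._pending):
            self._index[key] = (segment_id, row_offset)
        self._pending = []

    def get_batch(self, keys):
        """O(1) per key: dict lookup, then direct (segment, row) access."""
        results = []
        for key in keys:
            location = self._index.get(key)
            if location is None:
                results.append(None)  # cache miss
            else:
                segment_id, row_offset = location
                results.append(self._segments[segment_id][row_offset])
        return results


backend = ToyCacheBackend()
backend.put_batch([("a", 1), ("b", 2)])
backend.flush()
print(backend.get_batch(["a", "missing"]))  # → [1, None]
```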

## Architecture

### Storage Design

- torchcachex uses a hybrid Arrow IPC + SQLite architecture optimized for billion-scale caching:
+ torchcachex uses a hybrid Arrow IPC + in-memory index architecture optimized for billion-scale caching:

**Components:**

@@ -471,11 +472,12 @@ torchcachex uses a hybrid Arrow IPC + SQLite architecture optimized for billion-scale caching:
   - Memory-mapped for zero-copy reads
   - Each segment contains a batch of cached samples

- 2. **SQLite Index** (`index.db`)
-    - WAL (Write-Ahead Logging) mode for concurrent reads
+ 2. **Pickle Index** (`index.pkl`)
+    - In-memory Python dict backed by pickle persistence
   - Maps cache keys to (segment_id, row_offset)
-    - O(1) lookups via primary key index
-    - Tracks segment metadata (file paths, row counts)
+    - O(1) lookups via dict access
+    - Atomic persistence with temp file swap
+    - Auto-rebuilds from segments on corruption

3. **Schema File** (`schema.json`)
   - Auto-inferred from first forward pass
@@ -488,8 +490,9 @@ torchcachex uses a hybrid Arrow IPC + SQLite architecture optimized for billion-scale caching:
put_batch() → pending buffer → flush() → {
  1. Create Arrow RecordBatch
  2. Write to temp segment file
-   3. Update SQLite index (atomic transaction)
+   3. Update in-memory index dict
  4. Atomic rename temp → final
+   5. Persist index.pkl (atomic)
}
```

@@ -498,7 +501,7 @@ put_batch() → pending buffer → flush() → {
```
get_batch() → {
  1. Check LRU cache (in-memory)
-   2. Query SQLite for (segment_id, row_offset)
+   2. Query in-memory index for (segment_id, row_offset)
  3. Memory-map Arrow segment
  4. Extract rows (zero-copy)
  5. Reconstruct tensors with correct dtype
@@ -508,10 +511,10 @@ get_batch() → {
**Scalability Properties:**

- **Writes**: O(1) - append new segment, update index
- - **Reads**: O(1) - direct index lookup + memory-map
+ - **Reads**: O(1) - direct dict lookup + memory-map
- **Memory**: O(working set) - only LRU + current segment in memory
- **Disk**: O(N) - one entry per sample across segments
- - **Crash Recovery**: Atomic - incomplete segments ignored, SQLite WAL ensures consistency
+ - **Crash Recovery**: Atomic - incomplete segments ignored, index auto-rebuilds from segments if corrupted

### Schema Inference
