Description
A note for the community
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
Problem
The Fingerprinter struct holds a buffer: Vec<u8> for the lifetime of each source, but fingerprinting only happens briefly during glob scans (every 1-10 seconds). Between scans, the buffer sits idle, so with N sources there are N buffers held permanently. Each buffer is sized to max_line_bytes (up to 2 MiB in our config), which makes this a major memory problem: in our case, 2 MiB × 100 sources = ~200 MiB just for fingerprinter buffers.
Option A: Make buffer a local variable (less intrusive)
Instead of holding the buffer in the Fingerprinter struct, allocate it as a local variable inside fingerprint() and drop it when done:
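As a rough illustration of the idea, here is a minimal, self-contained sketch. It is not Vector's actual implementation: the struct is reduced to a single field, the `fingerprint` signature is hypothetical, and `checksum` is a placeholder FNV-1a hash standing in for whatever the real strategy computes. The point is only that the scratch buffer is allocated inside the call and freed on return, which lets the method take `&self`.

```rust
use std::io::{self, Read};

/// Simplified stand-in for the real struct (assumption: only the field
/// that sizes the scratch buffer is kept, and there is no `buffer` field).
pub struct Fingerprinter {
    max_line_length: usize,
}

impl Fingerprinter {
    pub fn new(max_line_length: usize) -> Self {
        Self { max_line_length }
    }

    /// Hypothetical signature: fingerprints the first line of `reader`.
    /// Takes `&self` because no internal state is mutated.
    pub fn fingerprint<R: Read>(&self, mut reader: R) -> io::Result<u64> {
        // Scratch buffer lives only for this call and is dropped on return.
        let mut buffer = vec![0u8; self.max_line_length];

        // Fill the buffer until it is full or the reader is exhausted.
        let mut filled = 0;
        while filled < buffer.len() {
            let n = reader.read(&mut buffer[filled..])?;
            if n == 0 {
                break;
            }
            filled += n;
        }

        // Hash everything up to (not including) the first newline, if any.
        let end = buffer[..filled]
            .iter()
            .position(|&b| b == b'\n')
            .unwrap_or(filled);
        Ok(checksum(&buffer[..end]))
    }
}

/// Placeholder hash (64-bit FNV-1a); not the checksum Vector uses.
fn checksum(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}
```

The trade-off is one allocation of up to max_line_bytes per fingerprint call instead of one persistent allocation per source. Since fingerprinting only runs during glob scans, that cost is likely small, but it would need measuring.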
// Before (current):
pub struct Fingerprinter {
    strategy: FingerprintStrategy,
    max_line_length: usize,
    max_fingerprint_bytes: usize,
    ignore_not_found: bool,
    buffer: Vec<u8>, // held for entire process lifetime
}
// After (proposed):
pub struct Fingerprinter {
    strategy: FingerprintStrategy,
    max_line_length: usize,
    max_fingerprint_bytes: usize,
    ignore_not_found: bool,
    // no buffer field
}
Option B: Share Fingerprinter across sources (more impactful at scale)
With fingerprint() taking &self (after Option A), a single Fingerprinter can be shared across all sources via Arc:
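The sharing pattern can be sketched in isolation as follows. This is an assumption-laden toy, not Vector's code: `FileServer` is reduced to one field, `build_file_servers` is a hypothetical helper standing in for the topology builder or source factory, and `fingerprint` is a placeholder that only needs `&self`.

```rust
use std::sync::Arc;

/// Simplified stand-in: no mutable state, so `&self` methods are enough.
pub struct Fingerprinter {
    max_line_length: usize,
}

impl Fingerprinter {
    pub fn new(max_line_length: usize) -> Self {
        Self { max_line_length }
    }

    /// Placeholder fingerprint: sums at most `max_line_length` bytes.
    pub fn fingerprint(&self, data: &[u8]) -> u64 {
        data.iter()
            .take(self.max_line_length)
            .map(|&b| u64::from(b))
            .sum()
    }
}

/// Hypothetical, reduced `FileServer` holding a shared handle.
pub struct FileServer {
    pub fingerprinter: Arc<Fingerprinter>,
}

/// Hypothetical factory: builds `count` servers that all point at the
/// same `Fingerprinter` allocation.
pub fn build_file_servers(count: usize, max_line_length: usize) -> Vec<FileServer> {
    let shared = Arc::new(Fingerprinter::new(max_line_length));
    (0..count)
        .map(|_| FileServer {
            // Cloning the Arc copies a pointer, not the Fingerprinter.
            fingerprinter: Arc::clone(&shared),
        })
        .collect()
}
```

With this shape, the number of Fingerprinter instances stays at one regardless of how many sources exist. It only works if all sources agree on the fingerprinter configuration; sources with different settings would still need their own instance.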
// In topology builder or source factory:
let shared_fingerprinter = Arc::new(Fingerprinter::new(...));
// Each FileServer gets a clone of the Arc
file_server.fingerprinter = Arc::clone(&shared_fingerprinter);
Version
vector 0.53.0
Debug Output
Example Data
From profiling with tikv-jemalloc:
| Allocator | Growth | % of Total | Description |
|---|---|---|---|
| Fingerprinter::new | 396 MB | 43.5% | Per-source buffer sized to max_line_bytes |
| file_source total | 428 MB | 47.0% | Includes Fingerprinter + FileWatcher |
| S3Sink disk_v2::BufferReader | 256 MB | 28.1% | Deserializing events from disk buffer |
Additional Context
No response
References
No response