Reduce Fingerprinter memory overhead #24848

@hage1005

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

The Fingerprinter struct holds a buffer: Vec<u8> for the lifetime of each source, but fingerprinting only runs briefly during glob scans (every 1-10 seconds). Between scans the buffer sits idle, so N sources means N buffers held permanently. Each buffer is sized to max_line_bytes (up to 2 MiB in our config), which adds up quickly: in our case, 2 MiB × 100 sources ≈ 200 MiB just for fingerprinter buffers.

Option A: Make buffer a local variable (less intrusive)

Instead of holding the buffer in the Fingerprinter struct, allocate it as a local variable inside fingerprint() and drop it when done:

// Before (current):
pub struct Fingerprinter {
    strategy: FingerprintStrategy,
    max_line_length: usize,
    max_fingerprint_bytes: usize,
    ignore_not_found: bool,
    buffer: Vec<u8>,  // held for the lifetime of each source
}

// After (proposed):
pub struct Fingerprinter {
    strategy: FingerprintStrategy,
    max_line_length: usize,
    max_fingerprint_bytes: usize,
    ignore_not_found: bool,
    // no buffer field
}
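
A minimal sketch of what Option A could look like, assuming a simplified stand-in for the real struct. The single-field `Fingerprinter`, the `fingerprint` signature, and the use of `DefaultHasher` as the checksum are all hypothetical and chosen only to keep the example self-contained; the actual strategy and checksum in Vector differ. The point is that the scratch buffer is allocated inside the call and freed when it returns, so the method can take `&self`:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;
use std::io::{self, Read};

/// Hypothetical, simplified stand-in for the real struct: no `buffer` field.
pub struct Fingerprinter {
    max_line_length: usize,
}

impl Fingerprinter {
    pub fn new(max_line_length: usize) -> Self {
        Self { max_line_length }
    }

    /// Takes `&self` because no per-instance state is mutated.
    pub fn fingerprint<R: Read>(&self, mut reader: R) -> io::Result<u64> {
        // Scratch buffer lives only for the duration of this call.
        let mut buffer = vec![0u8; self.max_line_length];

        // Fill the buffer until it is full or the reader hits EOF.
        let mut filled = 0;
        while filled < buffer.len() {
            let n = reader.read(&mut buffer[filled..])?;
            if n == 0 {
                break;
            }
            filled += n;
        }

        // Hash only the first line (or everything read, if no newline was seen).
        let end = buffer[..filled]
            .iter()
            .position(|&b| b == b'\n')
            .unwrap_or(filled);
        let mut hasher = DefaultHasher::new();
        hasher.write(&buffer[..end]);
        Ok(hasher.finish())
        // `buffer` is dropped here, returning the memory to the allocator.
    }
}
```

The trade-off is one allocation per fingerprint call instead of one per source. Whether that cost matters would need measuring, but it is only paid during scans rather than held between them.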

Option B: Share Fingerprinter across sources (more impactful at scale)

Once fingerprint() takes &self (possible after Option A, since there is no buffer left to mutate), a single Fingerprinter can be shared across all sources via Arc:

  // In topology builder or source factory:
  let shared_fingerprinter = Arc::new(Fingerprinter::new(...));

  // Each FileServer gets a clone of the Arc
  file_server.fingerprinter = Arc::clone(&shared_fingerprinter);
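
The sharing pattern can be sketched as follows, again with hypothetical stand-ins: `FileServer`, `build_servers`, and `scan_all` are illustrative names, not the real topology code, and the sketch assumes every source uses the same fingerprint settings. It shows that an `Arc<Fingerprinter>` with a `&self` method can be used from several threads without locking:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;
use std::sync::Arc;
use std::thread;

/// Hypothetical, simplified fingerprinter with no mutable state.
pub struct Fingerprinter {
    max_line_length: usize,
}

impl Fingerprinter {
    pub fn new(max_line_length: usize) -> Self {
        Self { max_line_length }
    }

    /// `&self` is enough, so the value can sit behind an `Arc`.
    pub fn fingerprint(&self, data: &[u8]) -> u64 {
        let end = data.len().min(self.max_line_length);
        let mut hasher = DefaultHasher::new();
        hasher.write(&data[..end]);
        hasher.finish()
    }
}

/// Hypothetical stand-in for a per-source file server.
pub struct FileServer {
    fingerprinter: Arc<Fingerprinter>,
}

/// Builds `n` servers that all point at one shared fingerprinter.
pub fn build_servers(n: usize, max_line_length: usize) -> Vec<FileServer> {
    let shared = Arc::new(Fingerprinter::new(max_line_length));
    (0..n)
        .map(|_| FileServer {
            fingerprinter: Arc::clone(&shared),
        })
        .collect()
}

/// Runs one fingerprint per server, each on its own thread.
pub fn scan_all(servers: Vec<FileServer>, data: &'static [u8]) -> Vec<u64> {
    let handles: Vec<_> = servers
        .into_iter()
        .map(|server| thread::spawn(move || server.fingerprinter.fingerprint(data)))
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().expect("worker thread panicked"))
        .collect()
}
```

Cloning the `Arc` only bumps a reference count, so each additional source costs a pointer rather than another `Fingerprinter`.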

Version

vector 0.53.0

Debug Output


Example Data

From heap profiling with tikv-jemalloc:

Allocator                     Growth   % of Total   Description
Fingerprinter::new            396 MB   43.5%        Per-source buffer sized to max_line_bytes
file_source total             428 MB   47.0%        Includes Fingerprinter + FileWatcher
S3Sink disk_v2::BufferReader  256 MB   28.1%        Deserializing events from disk buffer

Additional Context

No response

References

No response
