dnstap TCP source leaks CLOSE_WAIT sockets on remote connection close, exhausting RequestLimiter permit pool over ~48h #24838

@vinuthna-m

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

The dnstap source in TCP mode leaks sockets in the CLOSE_WAIT state whenever the DNS client closes its side of the connection. Vector acknowledges the FIN but never calls close() on its end, so the socket stays open indefinitely. Over approximately 48 hours these sockets accumulate until the RequestLimiter permit pool (controlled by max_frame_handling_tasks, default 1000) is exhausted. After that, no new dnstap frames can be processed, and "TCP socket error" / connection_failed errors flood the logs, with millions of occurrences. Restarting Vector resets the socket count to zero, and the cycle repeats.

Configuration

sources:
  source_dnstap:
    address: 0.0.0.0:9001
    mode: tcp
    type: dnstap

Version

v0.50.0

Debug Output


Example Data

ss -tn state close-wait 'sport = :9001' | wc -l on each host after ~48h uptime:

CLOSE_WAIT count:

DNS server targeted hosts:
prd-01 -> 1873
prd-03 -> 1945
prd-04 -> 1980

Non-targeted hosts:
prd-05 -> 1
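
For anyone who wants to track this count over time without shelling out to ss, here is a minimal sketch that counts CLOSE_WAIT sockets on a given local port by parsing the text of /proc/net/tcp (Linux, IPv4 only). The function name count_close_wait is made up for illustration, and the sample rows in the usage note are synthetic. State code 08 is CLOSE_WAIT in that file, and ports are stored in hexadecimal (9001 is 2329).

```python
CLOSE_WAIT = "08"  # state code for CLOSE_WAIT in /proc/net/tcp


def count_close_wait(proc_net_tcp: str, port: int) -> int:
    """Count CLOSE_WAIT sockets whose local port equals `port`.

    `proc_net_tcp` is the full text of /proc/net/tcp (hypothetical helper,
    not part of Vector or any existing tool).
    """
    count = 0
    for line in proc_net_tcp.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        # fields[1] is "HEXIP:HEXPORT" for the local end, fields[3] is the state
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        if fields[3] == CLOSE_WAIT and local_port == port:
            count += 1
    return count
```

On a live host this could be called as count_close_wait(open("/proc/net/tcp").read(), 9001). Note that ss normally prints a header line unless run with -H, so a count piped through wc -l may be one higher than the real number of sockets; this sketch skips the header explicitly.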

Additional Context

Root cause hypothesis

Vector's FrameStream reader in the dnstap TCP source does not appear to handle EOF on the TCP connection. When the DNS client sends FIN, the expected close sequence is:

DNS client ──FIN──► Vector (client initiates close)
DNS client ◄──ACK── Vector (Vector ACKs)
DNS client ◄──FIN── Vector (Vector closes its end) ← does not happen

Vector acknowledges the FIN but the FrameStream reader never subsequently calls close() on the socket, leaving it in CLOSE_WAIT permanently.
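
To illustrate the behavior I would expect, here is a minimal sketch of a reader loop that treats a zero-length read as EOF and always closes its own side. This is not Vector's code and not the full Frame Streams protocol: it assumes a simplified framing of a 4-byte big-endian length followed by the payload, and the names recv_exact and serve_connection are hypothetical.

```python
import socket
import struct
from typing import List, Optional


def recv_exact(conn: socket.socket, n: int) -> Optional[bytes]:
    """Read exactly n bytes, or return None if the peer closed first."""
    buf = bytearray()
    while len(buf) < n:
        chunk = conn.recv(n - len(buf))
        if not chunk:  # recv() returning b"" means the peer sent FIN (EOF)
            return None
        buf.extend(chunk)
    return bytes(buf)


def serve_connection(conn: socket.socket) -> List[bytes]:
    """Read length-prefixed frames until EOF, then close our side."""
    frames = []
    try:
        while True:
            header = recv_exact(conn, 4)
            if header is None:
                break  # clean EOF between frames
            (length,) = struct.unpack(">I", header)
            payload = recv_exact(conn, length)
            if payload is None:
                break  # peer closed mid-frame; drop the partial frame
            frames.append(payload)
    finally:
        # Closing here sends our FIN, so the socket leaves CLOSE_WAIT
        # instead of lingering until the process restarts.
        conn.close()
    return frames
```

The important part is the finally block: whatever path ends the loop (EOF, a truncated frame, or an exception), the local socket is closed, and any per-connection resource such as a limiter permit could be released at the same point.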

Workaround

Periodic restart of the Vector process (every ~36h) before the permit pool is exhausted. Not viable long-term.

References

#23392 and PR #23448. PR #23448 addressed a throughput regression in the RequestLimiter by raising the permit ceiling to max_frame_handling_tasks (default 1000). However, it did not fix the underlying socket leak: raising the ceiling only extends the failure cycle proportionally rather than eliminating it.
