Description
A note for the community
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
Problem
The dnstap source in TCP mode leaks sockets in the CLOSE_WAIT state whenever the DNS client closes its side of the connection. Vector acknowledges the FIN but never calls close() on its end, so the socket remains open indefinitely. Over approximately 48 hours these sockets accumulate until the RequestLimiter permit pool (controlled by max_frame_handling_tasks, default 1000) is exhausted. After that, no new dnstap frames can be processed, and the logs flood with millions of TCP socket error / connection_failed errors. Restarting Vector resets the socket count to zero, and the cycle repeats.
Configuration
sources:
  source_dnstap:
    address: 0.0.0.0:9001
    mode: tcp
    type: dnstap
Version
v0.50.0
Debug Output
Example Data
ss -tn state close-wait 'sport = :9001' | wc -l on each host after ~48h uptime:
CLOSE_WAIT count:
DNS Server targeted hosts:
prd-01 -> 1873
prd-03 -> 1945
prd-04 -> 1980
Non-targeted hosts:
prd-05 -> 1
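For anyone who wants to track this over time, here is a minimal Python sketch that counts CLOSE_WAIT entries from captured ss output. It assumes the output comes from the same ss command as above, and that ss prints a header line beginning with "Recv-Q" unless -H / --no-header is passed. If that holds, a raw wc -l may overcount by one. The function name and the sample addresses are made up for illustration.

```python
def count_close_wait(ss_output: str) -> int:
    """Count socket rows in `ss -tn state close-wait ...` output.

    Assumption: when a header is present, it starts with "Recv-Q".
    """
    # Drop blank lines so a trailing newline does not inflate the count.
    lines = [line for line in ss_output.splitlines() if line.strip()]
    # Skip the header row if ss printed one (it is omitted with -H).
    if lines and lines[0].lstrip().startswith("Recv-Q"):
        lines = lines[1:]
    return len(lines)


# Illustrative sample only; these addresses are invented.
sample = (
    "Recv-Q Send-Q Local Address:Port Peer Address:Port\n"
    "1      0      10.0.0.5:9001      10.0.0.9:51514\n"
    "1      0      10.0.0.5:9001      10.0.0.9:51530\n"
)
print(count_close_wait(sample))
```

Feeding the count into a metric or alert would make the growth rate visible well before the permit pool runs out.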
Additional Context
Root cause hypothesis
Vector's FrameStream reader in the dnstap TCP source does not handle EOF on the TCP connection. When the DNS client sends FIN, the expected close sequence is:
DNS client ──FIN──► Vector (client initiates close)
DNS client ◄──ACK── Vector (Vector ACKs)
DNS client ◄──FIN── Vector (Vector closes its end) ← does not happen
Vector acknowledges the FIN but the FrameStream reader never subsequently calls close() on the socket, leaving it in CLOSE_WAIT permanently.
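To illustrate the behaviour I would expect from the reader, here is a minimal sketch in Python (not Vector's Rust code, and not a claim about how FrameStream is implemented). A zero-length read signals that the peer has sent FIN, and the server side must then close its own socket for the connection to leave CLOSE_WAIT. The function name drain_until_eof is hypothetical.

```python
import socket


def drain_until_eof(conn: socket.socket, bufsize: int = 4096) -> int:
    """Read until the peer closes, then close our side as well.

    Returns the number of bytes received before EOF.
    """
    total = 0
    try:
        while True:
            chunk = conn.recv(bufsize)
            if not chunk:
                # recv() returning b"" means the peer sent FIN (EOF).
                break
            total += len(chunk)
    finally:
        # Without this close(), the socket would sit in CLOSE_WAIT.
        conn.close()
    return total


# Usage: a connected pair stands in for the DNS client and Vector.
client, server = socket.socketpair()
client.sendall(b"frame-bytes")
client.close()  # client initiates the close
received = drain_until_eof(server)
print(received, server.fileno())
```

The try/finally is the important design choice: the socket is closed on EOF and also on any read error, so no exit path leaves the descriptor open.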
Workaround
Periodically restarting the Vector process (every ~36h) before the permit pool is exhausted. This is not viable long-term.
References
#23392 and PR #23448. PR #23448 addressed a throughput regression in the RequestLimiter by raising the permit ceiling to max_frame_handling_tasks (default 1000). However, it did not fix the underlying socket leak; raising the ceiling only extends the failure cycle proportionally rather than eliminating it.
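A toy model of why a higher ceiling only delays the failure, assuming each leaked connection holds one permit that is never returned. This is a plain Python semaphore, not Vector's RequestLimiter, and the function name is hypothetical.

```python
import threading


def connections_until_exhaustion(permit_ceiling: int) -> int:
    """Count how many leaked connections fit before permits run out."""
    permits = threading.Semaphore(permit_ceiling)
    accepted = 0
    # acquire(blocking=False) returns False once no permits remain.
    while permits.acquire(blocking=False):
        # The permit is never released, mimicking a leaked socket.
        accepted += 1
    return accepted


for ceiling in (1000, 2000):
    print(ceiling, connections_until_exhaustion(ceiling))
```

In this model the number of connections served before exhaustion always equals the ceiling, so doubling the ceiling doubles the time to failure at a constant leak rate but never removes it.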