[BUG] KafkaError{code=_TRANSPORT,val=-195,str="Failed to get metadata: Local: Broker transport failure"} #17914

Open
froque opened this issue Jun 24, 2024 · 7 comments

@froque

froque commented Jun 24, 2024

Agent Environment

$ sudo datadog-agent version 
Agent 7.54.1 - Commit: 44d1992 - Serialization version: v5.0.114 - Go version: go1.21.9

Describe what happened:

After upgrading to v7.54.0, the Kafka consumer lag checks started to fail.

Describe what you expected:

Expected Datadog Agent to continue to get Kafka consumer lag offsets from Kafka cluster.

Steps to reproduce the issue:

  • Upgrade to v7.54.0 or v7.54.1
  • Configure Datadog to check Kafka consumer offsets
$ sudo cat /etc/datadog-agent/conf.d/kafka_consumer.d/conf.yaml
init_config:

instances:
  - kafka_connect_str:
      - <redacted>
    security_protocol: SASL_SSL
    sasl_mechanism: PLAIN
    sasl_plain_username: <redacted>
    sasl_plain_password: <redacted>
    kafka_consumer_offsets: true
    monitor_unlisted_consumer_groups: true
  • Perform a check
$ sudo datadog-agent check kafka_consumer


  Running Checks
  ==============
    
    kafka_consumer (4.3.0)
    ----------------------
      Instance ID: kafka_consumer:24b8757764ea1a30 [ERROR]
      Configuration Source: file:/etc/datadog-agent/conf.d/kafka_consumer.d/conf.yaml
      Total Runs: 1
      Metric Samples: Last Run: 0, Total: 0
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 5.099s
      Last Execution Date : 2024-06-24 09:11:07 WEST / 2024-06-24 08:11:07 UTC (1719216667000)
      Last Successful Execution Date : Never
      Error: Unable to connect to the AdminClient. This is likely due to an error in the configuration.
      Traceback (most recent call last):
        File "/opt/datadog-agent/embedded/lib/python3.11/site-packages/datadog_checks/kafka_consumer/kafka_consumer.py", line 34, in check
          self.client.request_metadata_update()
        File "/opt/datadog-agent/embedded/lib/python3.11/site-packages/datadog_checks/kafka_consumer/client.py", line 180, in request_metadata_update
          self.kafka_client.list_topics(None, timeout=self.config._request_timeout)
        File "/opt/datadog-agent/embedded/lib/python3.11/site-packages/confluent_kafka/admin/__init__.py", line 603, in list_topics
          return super(AdminClient, self).list_topics(*args, **kwargs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      cimpl.KafkaException: KafkaError{code=_TRANSPORT,val=-195,str="Failed to get metadata: Local: Broker transport failure"}
      
      During handling of the above exception, another exception occurred:
      
      Traceback (most recent call last):
        File "/opt/datadog-agent/embedded/lib/python3.11/site-packages/datadog_checks/base/checks/base.py", line 1224, in run
          self.check(instance)
        File "/opt/datadog-agent/embedded/lib/python3.11/site-packages/datadog_checks/kafka_consumer/kafka_consumer.py", line 36, in check
          raise Exception(
      Exception: Unable to connect to the AdminClient. This is likely due to an error in the configuration.

  Metadata
  ========
    config.hash: kafka_consumer:24b8757764ea1a30
    config.provider: file

Additional environment details (Operating System, Cloud provider, etc):

@froque
Author

froque commented Jun 24, 2024

As a workaround, either disabling tls_verify or setting tls_ca_cert explicitly works:

$ tail -n2 /etc/datadog-agent/conf.d/kafka_consumer.d/conf.yaml
    tls_verify: false
    tls_ca_cert: /opt/datadog-agent/embedded/ssl/certs/cacert.pem
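For context, these two options presumably reach librdkafka as client properties. The following is a minimal sketch, assuming tls_verify maps to enable.ssl.certificate.verification and tls_ca_cert maps to ssl.ca.location. Both are real librdkafka properties, but the mapping and the helper name build_tls_properties are illustrative assumptions, not the integration's actual code.

```python
def build_tls_properties(instance):
    """Translate check-style TLS options into librdkafka-style properties.

    Hypothetical helper for illustration only: the option-to-property
    mapping is an assumption, not taken from the kafka_consumer check.
    """
    props = {}
    # Verification stays on unless it is explicitly disabled.
    verify = instance.get("tls_verify", True)
    props["enable.ssl.certificate.verification"] = "true" if verify else "false"
    # With an explicit CA bundle, the client does not have to fall back to
    # the default certificate paths compiled into its OpenSSL build.
    ca_cert = instance.get("tls_ca_cert")
    if ca_cert:
        props["ssl.ca.location"] = ca_cert
    return props


if __name__ == "__main__":
    print(build_tls_properties(
        {"tls_ca_cert": "/opt/datadog-agent/embedded/ssl/certs/cacert.pem"}
    ))
```

Of the two, pointing at the CA bundle keeps certificate verification enabled, which is usually preferable to turning verification off.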

@FlorentClarret
Member

FlorentClarret commented Jun 24, 2024

Hello @froque! Thanks for opening this issue and for the workaround. I'm going to transfer the issue to integrations-core because that is where the integrations live. I'll let the team know so they can take care of this.

FlorentClarret transferred this issue from DataDog/datadog-agent on Jun 24, 2024
@HadhemiDD
Contributor

@froque can you open a support case? Also, you can use the script in tests/python_client/script.py to run a barebones connection directly to the cluster for debugging. This script will attempt a connection and then fetch all of the consumer groups for that configuration. Please include it with the support case along with a Debug flare.

@froque
Author

froque commented Jun 24, 2024

$ /opt/datadog-agent/embedded/bin/python script.py 
bootstrap.servers=<redacted>
socket.timeout.ms=5000
client.id=dd-agent
security.protocol=sasl_ssl
ssl.endpoint.identification.algorithm=none
enable.ssl.certificate.verification=true
sasl.mechanism=PLAIN
sasl.username=<redacted>
sasl.password=*****
Connecting to AdminClient
%3|1719239854.080|SSL|dd-agent#producer-1| [thrd:sasl_ssl://<redacted>:9092/bootstr]: sasl_ssl://<redacted>:9092/bootstrap: error:16000069:STORE routines::unregistered scheme: scheme=file
%3|1719239854.080|SSL|dd-agent#producer-1| [thrd:sasl_ssl://<redacted>:9092/bootstr]: sasl_ssl://<redacted>:9092/bootstrap: error:80000002:system library::No such file or directory: calling stat(/usr/local/ssl/certs)
%3|1719239854.080|SSL|dd-agent#producer-1| [thrd:sasl_ssl://<redacted>:9092/bootstr]: sasl_ssl://<redacted>:9092/bootstrap: error:16000069:STORE routines::unregistered scheme: scheme=file
%3|1719239854.080|SSL|dd-agent#producer-1| [thrd:sasl_ssl://<redacted>:9092/bootstr]: sasl_ssl://<redacted>:9092/bootstrap: error:80000002:system library::No such file or directory: calling stat(/usr/local/ssl/certs)
%3|1719239854.080|SSL|dd-agent#producer-1| [thrd:sasl_ssl://<redacted>:9092/bootstr]: sasl_ssl://<redacted>:9092/bootstrap: error:16000069:STORE routines::unregistered scheme: scheme=file
%3|1719239854.081|SSL|dd-agent#producer-1| [thrd:sasl_ssl://<redacted>:9092/bootstr]: sasl_ssl://<redacted>:9092/bootstrap: error:80000002:system library::No such file or directory: calling stat(/usr/local/ssl/certs)
%3|1719239854.081|FAIL|dd-agent#producer-1| [thrd:sasl_ssl://<redacted>:9092/bootstr]: sasl_ssl://<redacted>:9092/bootstrap: SSL handshake failed: error:0A000086:SSL routines::certificate verify failed: broker certificate could not be verified, verify that ssl.ca.location is correctly configured or root CA certificates are installed (install ca-certificates package) (after 34ms in state SSL_HANDSHAKE)
%3|1719239855.009|SSL|dd-agent#producer-1| [thrd:sasl_ssl://<redacted>:9092/bootstr]: sasl_ssl://<redacted>:9092/bootstrap: error:16000069:STORE routines::unregistered scheme: scheme=file
%3|1719239855.009|SSL|dd-agent#producer-1| [thrd:sasl_ssl://<redacted>:9092/bootstr]: sasl_ssl://<redacted>:9092/bootstrap: error:80000002:system library::No such file or directory: calling stat(/usr/local/ssl/certs)
%3|1719239855.010|SSL|dd-agent#producer-1| [thrd:sasl_ssl://<redacted>:9092/bootstr]: sasl_ssl://<redacted>:9092/bootstrap: error:16000069:STORE routines::unregistered scheme: scheme=file
%3|1719239855.010|SSL|dd-agent#producer-1| [thrd:sasl_ssl://<redacted>:9092/bootstr]: sasl_ssl://<redacted>:9092/bootstrap: error:80000002:system library::No such file or directory: calling stat(/usr/local/ssl/certs)
%3|1719239855.010|SSL|dd-agent#producer-1| [thrd:sasl_ssl://<redacted>:9092/bootstr]: sasl_ssl://<redacted>:9092/bootstrap: error:16000069:STORE routines::unregistered scheme: scheme=file
%3|1719239855.010|SSL|dd-agent#producer-1| [thrd:sasl_ssl://<redacted>:9092/bootstr]: sasl_ssl://<redacted>:9092/bootstrap: error:80000002:system library::No such file or directory: calling stat(/usr/local/ssl/certs)
%3|1719239855.010|FAIL|dd-agent#producer-1| [thrd:sasl_ssl://<redacted>:9092/bootstr]: sasl_ssl://<redacted>:9092/bootstrap: SSL handshake failed: error:0A000086:SSL routines::certificate verify failed: broker certificate could not be verified, verify that ssl.ca.location is correctly configured or root CA certificates are installed (install ca-certificates package) (after 32ms in state SSL_HANDSHAKE, 1 identical error(s) suppressed)
^CTraceback (most recent call last):
  File "/home/pminds/script.py", line 87, in <module>
    main()
  File "/home/pminds/script.py", line 80, in main
    results = future.result()
              ^^^^^^^^^^^^^^^
  File "/opt/datadog-agent/embedded/lib/python3.11/concurrent/futures/_base.py", line 451, in result
    self._condition.wait(timeout)
  File "/opt/datadog-agent/embedded/lib/python3.11/threading.py", line 327, in wait
    waiter.acquire()
KeyboardInterrupt

From what I have already explored, it seems that in version v7.54.0 it expects a file in /usr/local/ssl/certs and not in /opt/datadog-agent/embedded/ssl/certs/ like in v7.53.0.
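One way to check such default locations is to ask an OpenSSL build where it looks for CA certificates. The sketch below uses the standard library's ssl.get_default_verify_paths(). Note the assumption: it reports the paths of the OpenSSL that the running Python's ssl module is linked against, which may not be the same library that librdkafka loads, so treat the result as a baseline for comparison rather than a diagnosis. The function name describe_default_ca_paths is made up for this example.

```python
import os
import ssl


def describe_default_ca_paths():
    """Report where this interpreter's OpenSSL looks for CA certificates."""
    paths = ssl.get_default_verify_paths()
    report = {}
    for label, location in (
        ("cafile", paths.openssl_cafile),
        ("capath", paths.openssl_capath),
    ):
        # "exists" is False when the compiled-in path is missing on this host.
        report[label] = {"path": location, "exists": os.path.exists(location)}
    return report


if __name__ == "__main__":
    for label, info in describe_default_ca_paths().items():
        print(f"{label}: {info['path']} (exists: {info['exists']})")
```

Running it with /opt/datadog-agent/embedded/bin/python would show the embedded interpreter's defaults, which can then be compared with the /usr/local/ssl/certs path in the log above.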

@froque
Author

froque commented Jun 26, 2024

Your logs were successfully uploaded. For future reference, your internal case id is 1751844

@HadhemiDD
Contributor

HadhemiDD commented Jun 27, 2024

From what I have already explored, it seems that in version v7.54.0 it expects a file in /usr/local/ssl/certs and not in /opt/datadog-agent/embedded/ssl/certs/ like in v7.53.0.

@froque Can you elaborate on where you found this change?
Also, can you try port 9091 for the Kafka broker (update the config on the Kafka side), set the same port on the Datadog side (in script.py), and then run the script again to see if it works?

@froque
Author

froque commented Jun 27, 2024

@HadhemiDD I dug into the differences between the v7.53 and v7.54 Debian packages.

❯ wget --quiet https://apt.datadoghq.com/pool/d/da/datadog-agent_7.53.0-1_amd64.deb
❯ wget --quiet https://apt.datadoghq.com/pool/d/da/datadog-agent_7.54.0-1_amd64.deb
❯ mkdir v7.53 v7.54
❯ ar --output v7.53 x datadog-agent_7.53.0-1_amd64.deb 
❯ ar --output v7.54 x datadog-agent_7.54.0-1_amd64.deb 
❯ tar --directory=v7.53 -Jxf v7.53/data.tar.xz
❯ tar --directory=v7.54 -Jxf v7.54/data.tar.xz

I noticed that librdkafka is no longer in the same path

❯ find -name \*librdkafka\*so\* -type f
./v7.53/opt/datadog-agent/embedded/lib/librdkafka++.so.1
./v7.53/opt/datadog-agent/embedded/lib/librdkafka.so.1
./v7.54/opt/datadog-agent/embedded/lib/python3.11/site-packages/confluent_kafka.libs/librdkafka-27145264.so.1

And new bundled copies of libcrypto exist in v7.54

❯ find -name \*libcrypto\*so\* -type f| sort                 
./v7.53/opt/datadog-agent/embedded/lib/libcrypto.so.3
./v7.53/opt/datadog-agent/embedded/lib/python3.11/site-packages/psycopg2_binary.libs/libcrypto-7d0e8add.so.1.1
./v7.54/opt/datadog-agent/embedded/lib/libcrypto.so.3
./v7.54/opt/datadog-agent/embedded/lib/python3.11/site-packages/aerospike.libs/libcrypto-e31f2095.so.3
./v7.54/opt/datadog-agent/embedded/lib/python3.11/site-packages/confluent_kafka.libs/libcrypto-b840c11b.so.3
./v7.54/opt/datadog-agent/embedded/lib/python3.11/site-packages/psycopg2_binary.libs/libcrypto-7d0e8add.so.1.1
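The find commands above can be approximated in Python when comparing extracted packages on a machine without GNU find. This is a minimal sketch; find_shared_libs is a hypothetical helper name, and the glob pattern mirrors the -name argument used above.

```python
from fnmatch import fnmatchcase
from pathlib import Path


def find_shared_libs(root, name_fragment):
    """Return sorted regular files under root named like *<fragment>*so*.

    Roughly equivalent to: find <root> -name '*<fragment>*so*' -type f
    """
    pattern = f"*{name_fragment}*so*"
    matches = []
    for path in Path(root).rglob("*"):
        # Skip symlinks so only real files are listed, as with -type f.
        if path.is_symlink() or not path.is_file():
            continue
        if fnmatchcase(path.name, pattern):
            matches.append(path)
    return sorted(matches)


if __name__ == "__main__":
    for lib in find_shared_libs(".", "libcrypto"):
        print(lib)
```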

Searching the extracted files for the two certificate paths

❯ rgrep '/opt/datadog-agent/embedded/ssl/certs' v7* 
grep: v7.53/opt/datadog-agent/embedded/lib/libcrypto.so.3: binary file matches
grep: v7.54/opt/datadog-agent/embedded/lib/libcrypto.so.3: binary file matches
❯ rgrep '/usr/local/ssl/certs' v7*
grep: v7.54/opt/datadog-agent/embedded/lib/python3.11/site-packages/confluent_kafka.libs/libcrypto-b840c11b.so.3: binary file matches
grep: v7.54/opt/datadog-agent/embedded/lib/python3.11/site-packages/aerospike.libs/libcrypto-e31f2095.so.3: binary file matches
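The same string search can be sketched in Python, which makes it easy to repeat across several package versions. The helper name files_containing and the chunk size are arbitrary choices for this example; it reads each file in chunks and keeps a small overlap so that a match spanning a chunk boundary is not missed.

```python
from pathlib import Path


def files_containing(root, needle, chunk_size=1 << 20):
    """Yield regular files under root whose raw bytes contain needle."""
    needle_bytes = needle.encode()
    overlap = len(needle_bytes) - 1
    for path in sorted(Path(root).rglob("*")):
        if path.is_symlink() or not path.is_file():
            continue
        tail = b""
        with path.open("rb") as handle:
            while chunk := handle.read(chunk_size):
                # Prepend the end of the previous chunk so a match that
                # straddles a chunk boundary is still found.
                window = tail + chunk
                if needle_bytes in window:
                    yield path
                    break
                tail = window[-overlap:] if overlap > 0 else b""


if __name__ == "__main__":
    for hit in files_containing(".", "/usr/local/ssl/certs"):
        print(hit)
```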
