Skip to content

Conversation

bouwkast
Copy link
Contributor

@bouwkast bouwkast commented Oct 16, 2025

What does this PR do?
Removes the pii_safe parameter introduced in PR #4728 to ensure telemetry error logs comply with constant message requirements. We are not supposed to be sending dynamic message content to the logs intake for various reasons (that is what we got agreement to send, we have deduplication logic to limit number of logs, and we don't want any risk whatsoever of sending up PII).

Motivation:

Logs have been blocked from being ingested for many months now with exclusion filters.
We'd like to get logs back in ASAP as we want to start proactively resolving errors in dd-trace- libraries that are surfaced from these logs as they go into our Error Tracking product.

Change log entry

None.

Additional Notes:

I am unfamiliar with dd-trace-rb and I asked Claude Code to help with the revert of pii_safe, please review this PR as such.

This is based off of my document for my findings for Ruby here https://docs.google.com/document/d/16SFeOp7yxzT6dM1O0TKtWkgDsiq6SeV8P_MAYboCdOI/edit?tab=t.o3m3y2eqq0gx

There are a few instances of string interpolation used, I've looked into the flow for the values that they'd be and I think they map to a small subset of known strings from what I can tell, so I am fine overlooking them, but there is risk that we interpolate other future messages:

How to test the change?

🤷
I removed / altered tests associated with this parameter so if the current test suite passes then hopefully good.

@github-actions github-actions bot added core Involves Datadog core libraries profiling Involves Datadog profiling labels Oct 16, 2025
@github-actions
Copy link

github-actions bot commented Oct 16, 2025

Thank you for updating Change log entry section 👏

Visited at: 2025-10-16 20:54:50 UTC

@github-actions
Copy link

github-actions bot commented Oct 16, 2025

Typing analysis

This PR does not change typing compared to the base branch.

@pr-commenter
Copy link

pr-commenter bot commented Oct 16, 2025

Benchmarks

Benchmark execution time: 2025-10-21 14:25:15

Comparing candidate commit fb3cbec in PR branch steven/error-logs-remediation with baseline commit 408c712 in branch master.

Found 1 performance improvements and 0 performance regressions! Performance is the same for 43 metrics, 2 unstable metrics.

scenario:profiling - Allocations (baseline)

  • 🟩 throughput [+251559.156op/s; +265044.525op/s] or [+5.019%; +5.288%]

@datadog-datadog-prod-us1
Copy link
Contributor

datadog-datadog-prod-us1 bot commented Oct 16, 2025

✅ Tests

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

🎯 Code Coverage
Patch Coverage: 90.91%
Total Coverage: 98.55% (+0.16%)

View detailed report

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: fb3cbec | Docs | Was this helpful? Give us feedback!

@bouwkast bouwkast marked this pull request as ready for review October 16, 2025 21:52
@bouwkast bouwkast requested review from a team as code owners October 16, 2025 21:52
@bouwkast
Copy link
Contributor Author

bouwkast commented Oct 16, 2025

I'm unsure how to get Check Pull Request CI Status / all-jobs-are-green (pull_request)Failing after 32m to pass

Edit: it has now seemingly passed 🤷

@lloeki lloeki force-pushed the steven/error-logs-remediation branch from a8ae180 to 8984258 Compare October 17, 2025 13:36
@lloeki
Copy link
Member

lloeki commented Oct 17, 2025

Restating the important part of the document's content:

  • Messages: MUST be constant templates only - no dynamic parameter replacement or string interpolation.
  • Stack Traces: MUST be redacted to show only:
    • Datadog library code frames
    • Base library/runtime code frames
    • Known 3rd party integration frames
    • Customer code frames MUST be completely redacted
  • Exception: MUST be the exception type only with the exception message removed.
  • Static Code Analysis: MUST be used to help ensure that dynamic / formatted messages are not sent to the logs backend.

@lloeki
Copy link
Member

lloeki commented Oct 17, 2025

I removed / altered tests associated with this parameter so if the current test suite passes then hopefully good.

I've added a negative test to ensure exception message content is absent.

@lloeki
Copy link
Member

lloeki commented Oct 17, 2025

probe_notifier appears to be two possibilities: probe status and snapshot

Correct; while there is interpolation, it is essentially from static content:

      [
        [:status, 'probe status'],
        [:snapshot, 'snapshot'],
      ].each do |(event_type, event_name)|

propagation appears to be values such as Datadog::Tracing::Distributed::Propagation::HTTP (our propagators)

I've had a hard time tracing it back, but it appears to come from a configuration setting that then maps to specific classes. In all cases it's .class.name so again, a fairly controlled String.

patcher appears to be our integration patchers within dd-trace-rb so something like Datadog::Tracing::Contrib::Rails::Patcher -> there are many of these but seems reasonable IMO

Correct; unless the below is overridden in a particular patch implementation, they are essentially static strings from each of the patcher class names:

def patch_name
(self.class != Class && self.class != Module) ? self.class.name : name
end

To kinda make it explicit and guarantee it via type checking, this signature should probably be:

          def patch_name: () -> String

def patch_name: () -> untyped

But that's a bit of rabbit holing for this PR.

@lloeki
Copy link
Member

lloeki commented Oct 17, 2025

I'm unsure how to get all-jobs-are-green failing after 32m to pass

Screenshot 2025-10-17 at 16 23 35

Some jobs are unexpectedly killed due to a timeout limit that should not be hit; we're looking into it.

@p-datadog p-datadog requested a review from ivoanjo October 17, 2025 15:54
@ivoanjo
Copy link
Member

ivoanjo commented Oct 17, 2025

So I believe we discussed introducing new exception types to be emitted by profiling with the fixed strings from C that we could use for this, but this PR doesn't do that. Any plans on still doing that?

@lloeki
Copy link
Member

lloeki commented Oct 20, 2025

Any plans on still doing that?

@ivoanjo As far as I'm aware there's:

MY understanding is that this work is out of scope of this PR, which is to be able to restore the currently disabled ingestion stream, and then have work on capturing better details via static strings in subsequent PRs.

@lloeki
Copy link
Member

lloeki commented Oct 21, 2025

After much effort I managed to reproduce the remaining failure:

docker-compose run tracer-jruby-9.4 /bin/bash
bundle install
bundle exec rake dependency:install
BUNDLE_GEMFILE="${PWD}/gemfiles/jruby_9.4_elasticsearch_latest.gemfile" bundle exec rake spec:elasticsearch['--seed 34542','spec/datadog/tracing/contrib/elasticsearch/patcher_spec.rb:168']

This can give us errors like:

Failures:

  1) Datadog::Tracing::Contrib::Elasticsearch::Patcher indexing request creates a span
     Failure/Error: response = super
     
     Elastic::Transport::Transport::Error:
       Net::ReadTimeout with #<TCPSocket:(closed)>
     # /usr/local/bundle/gems/elastic-transport-8.4.0/lib/elastic/transport/transport/base.rb:324:in `perform_request'
     # /usr/local/bundle/gems/elastic-transport-8.4.0/lib/elastic/transport/transport/http/faraday.rb:36:in `perform_request'
     # /usr/local/bundle/gems/elastic-transport-8.4.0/lib/elastic/transport/client.rb:192:in `perform_request'
     # ./lib/datadog/tracing/contrib/elasticsearch/patcher.rb:105:in `block in perform_request'
     # ./lib/datadog/tracing/trace_operation.rb:243:in `block in measure'
     # ./lib/datadog/tracing/span_operation.rb:167:in `measure'
     # ./lib/datadog/tracing/trace_operation.rb:243:in `measure'
     # ./lib/datadog/tracing/tracer.rb:425:in `start_span'
     # ./lib/datadog/tracing/tracer.rb:166:in `block in trace'
     # ./lib/datadog/tracing/context.rb:45:in `activate!'
     # ./lib/datadog/tracing/tracer.rb:165:in `trace'
     # ./lib/datadog/tracing.rb:29:in `trace'
     # ./lib/datadog/tracing/contrib/elasticsearch/patcher.rb:53:in `perform_request'
     # /usr/local/bundle/gems/elasticsearch-9.1.2/lib/elasticsearch.rb:71:in `method_missing'
     # ./spec/datadog/tracing/contrib/elasticsearch/patcher_spec.rb:166:in `block in <main>'
     # ./spec/datadog/tracing/contrib/elasticsearch/patcher_spec.rb:169:in `block in <main>'
     # ./spec/datadog/tracing/contrib/elasticsearch/patcher_spec.rb:169:in `block in <main>'
     # ./spec/datadog/tracing/contrib/elasticsearch/patcher_spec.rb:48:in `block in <main>'
     # ./spec/datadog/tracing/contrib/support/tracer_helpers.rb:96:in `block in TracerHelpers'
     # ./spec/spec_helper.rb:272:in `block in <main>'
     # ./spec/spec_helper.rb:154:in `block in <main>'
     # /usr/local/bundle/gems/webmock-3.23.1/lib/webmock/rspec.rb:39:in `block in <main>'
     # /usr/local/bundle/gems/rspec-wait-0.0.10/lib/rspec/wait.rb:47:in `block in <main>'
     # ------------------
     # --- Caused by: ---
     # Net::ReadTimeout:
     #   Net::ReadTimeout with #<TCPSocket:(closed)>
     #   /usr/local/bundle/gems/net-http-0.6.0/lib/net/http/response.rb:158:in `read_status_line'

Finished in 1 minute 2.17 seconds (files took 3.92 seconds to load)
26 examples, 1 failure

Failed examples:

rspec ./spec/datadog/tracing/contrib/elasticsearch/patcher_spec.rb:168 # Datadog::Tracing::Contrib::Elasticsearch::Patcher indexing request creates a span

Randomized with seed 34542
Datadog::Tracing::Contrib::Elasticsearch::Patcher
  indexing request
    creates a span (FAILED - 1)

Failures:

  1) Datadog::Tracing::Contrib::Elasticsearch::Patcher indexing request creates a span
     Failure/Error: response = super
     
     Elastic::Transport::Transport::Errors::ServiceUnavailable:
       [503] {"error":{"root_cause":[{"type":"unavailable_shards_exception","reason":"[some_index][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[some_index][0]] containing [index {[some_index][1], source[{\"field\":\"Test\",\"nested_object\":{\"value\":\"x\"},\"nested_array\":[\"a\",\"b\"],\"nested_object_array\":[{\"a\":\"a\"},{\"b\":\"b\"}]}]}]]"}],"type":"unavailable_shards_exception","reason":"[some_index][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[some_index][0]] containing [index {[some_index][1], source[{\"field\":\"Test\",\"nested_object\":{\"value\":\"x\"},\"nested_array\":[\"a\",\"b\"],\"nested_object_array\":[{\"a\":\"a\"},{\"b\":\"b\"}]}]}]]"},"status":503}
     # /usr/local/bundle/gems/elastic-transport-8.4.0/lib/elastic/transport/transport/base.rb:228:in `__raise_transport_error'
     # /usr/local/bundle/gems/elastic-transport-8.4.0/lib/elastic/transport/transport/base.rb:346:in `perform_request'
     # /usr/local/bundle/gems/elastic-transport-8.4.0/lib/elastic/transport/transport/http/faraday.rb:36:in `perform_request'
     # /usr/local/bundle/gems/elastic-transport-8.4.0/lib/elastic/transport/client.rb:192:in `perform_request'
     # ./lib/datadog/tracing/contrib/elasticsearch/patcher.rb:105:in `block in perform_request'
     # ./lib/datadog/tracing/trace_operation.rb:243:in `block in measure'
     # ./lib/datadog/tracing/span_operation.rb:167:in `measure'
     # ./lib/datadog/tracing/trace_operation.rb:243:in `measure'
     # ./lib/datadog/tracing/tracer.rb:425:in `start_span'
     # ./lib/datadog/tracing/tracer.rb:166:in `block in trace'
     # ./lib/datadog/tracing/context.rb:45:in `activate!'
     # ./lib/datadog/tracing/tracer.rb:165:in `trace'
     # ./lib/datadog/tracing.rb:29:in `trace'
     # ./lib/datadog/tracing/contrib/elasticsearch/patcher.rb:53:in `perform_request'
     # /usr/local/bundle/gems/elasticsearch-9.1.2/lib/elasticsearch.rb:71:in `method_missing'
     # ./spec/datadog/tracing/contrib/elasticsearch/patcher_spec.rb:166:in `block in <main>'
     # ./spec/datadog/tracing/contrib/elasticsearch/patcher_spec.rb:169:in `block in <main>'
     # ./spec/datadog/tracing/contrib/elasticsearch/patcher_spec.rb:169:in `block in <main>'
     # ./spec/datadog/tracing/contrib/elasticsearch/patcher_spec.rb:48:in `block in <main>'
     # ./spec/datadog/tracing/contrib/support/tracer_helpers.rb:96:in `block in TracerHelpers'
     # ./spec/spec_helper.rb:272:in `block in <main>'
     # ./spec/spec_helper.rb:154:in `block in <main>'
     # /usr/local/bundle/gems/webmock-3.23.1/lib/webmock/rspec.rb:39:in `block in <main>'
     # /usr/local/bundle/gems/rspec-wait-0.0.10/lib/rspec/wait.rb:47:in `block in <main>'

Finished in 1 minute 0.28 seconds (files took 3.72 seconds to load)
1 example, 1 failure

Failed examples:

rspec ./spec/datadog/tracing/contrib/elasticsearch/patcher_spec.rb:168 # Datadog::Tracing::Contrib::Elasticsearch::Patcher indexing request creates a span

Randomized with seed 34542

Copy link
Member

@ivoanjo ivoanjo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Left some notes + looking forward to the refactoring that gets us back detailed exception messages from the profiler

@lloeki
Copy link
Member

lloeki commented Oct 21, 2025

This times out:

client.perform_request 'PUT', "some_index/_doc/1", {}, {}

but this works fine:

client.perform_request 'GET', '_cluster/health'

I mean, as long as it returns a 200... but there's a status: red in there

#<Elastic::Transport::Transport::Response:0x26c09a0f
 @body=
  {"cluster_name"=>"docker-cluster",
   "status"=>"red",
   "timed_out"=>false,
   "number_of_nodes"=>1,
   "number_of_data_nodes"=>1,
   "active_primary_shards"=>0,
   "active_shards"=>0,
   "relocating_shards"=>0,
   "initializing_shards"=>0,
   "unassigned_shards"=>3,
   "delayed_unassigned_shards"=>0,
   "number_of_pending_tasks"=>0,
   "number_of_in_flight_fetch"=>0,
   "task_max_waiting_in_queue_millis"=>0,
   "active_shards_percent_as_number"=>0.0},
 @headers={"x-elastic-product"=>"Elasticsearch", "content-type"=>"application/vnd.elasticsearch+json;compatible-with=8", "content-length"=>"386"},
 @status=200>

@lloeki
Copy link
Member

lloeki commented Oct 21, 2025

Locally I had these docker logs dd-trace-rb-elasticsearch-1:

{"@timestamp":"2025-10-21T10:29:16.176Z", "log.level": "WARN", "message":"high disk watermark [90%] exceeded on [eesBpn7cTzqTgrmbXYr11A][7d7c63ff75a8][/usr/share/elasticsearch/data] free: 12.6gb[5.1%], shards will be relocated away from this node; currently relocating away shards totalling [0] bytes; the node is expected to continue to exceed the high disk watermark when these relocations are complete", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[7d7c63ff75a8][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.routing.allocation.DiskThresholdMonitor","elasticsearch.cluster.uuid":"K7tuqIZGSHeY4k41yPLF5g","elasticsearch.node.id":"eesBpn7cTzqTgrmbXYr11A","elasticsearch.node.name":"7d7c63ff75a8","elasticsearch.cluster.name":"docker-cluster"}

Cleaned up some space, restarted the container with docker-compose restart elasticsearch, status moved to yellow, and now PUT do pass correctly.

I presume we might be hitting some GHA space limit depending on test ordering (thus seed).

@lloeki lloeki requested review from a team as code owners October 21, 2025 10:41
@lloeki lloeki requested a review from mabdinur October 21, 2025 10:41
bouwkast and others added 4 commits October 21, 2025 15:48
Messages sent to our backend from the tracing
libraries must be constant messages.
This will prevent failing tests from consuming allotted CI
budget thus preventing test results to beproperly reported.
Runs after tests have failed (for diagnosis) or succeeded (for comparison)

Only free disk space for now, but we could add more.
@lloeki lloeki force-pushed the steven/error-logs-remediation branch from f40e845 to c8670d6 Compare October 21, 2025 13:49
@lloeki lloeki merged commit 6bb4e4c into master Oct 21, 2025
406 checks passed
@lloeki lloeki deleted the steven/error-logs-remediation branch October 21, 2025 14:40
@github-actions github-actions bot added this to the 2.23.0 milestone Oct 21, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Involves Datadog core libraries profiling Involves Datadog profiling

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants