Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SUPPORT] Dynamic configuration updates cause metrics reporting to stop #12282

Open
mstahlswrx opened this issue Nov 18, 2024 · 5 comments
Open
Labels

Comments

@mstahlswrx
Copy link

mstahlswrx commented Nov 18, 2024

Describe the problem you faced

When running the Hudi streamer with the https://hudi.apache.org/releases/release-0.14.0#dynamic-configuration-updates feature configured, the streamer restarted after a configuration change as expected; however, metrics reporting stopped.

I believe the sequence of events are as follows:

Hudi calls streamSync.close() which ends up shutting down metrics reporting. Since the metrics instances are all cached in a static field, when a new StreamSync instance is created, it creates a new HoodieMetrics instance which reuses the stopped Metrics instance.

#10132 is a fix for a similar issue, but not related to the streamer. (That patch is already in hudi 0.14.)

To Reproduce

Steps to reproduce the behavior:

  1. Create an implementation of the ConfigurationHotUpdateStrategy class
  2. Start the hudi streamer with the --config-hot-update-strategy-class option configure with your hot update strategy class
  3. Verify metrics reporting is working
  4. Change a configuration property so that your ConfigurationHotUpdateStrategy.updateProperties method returns updated properties
  5. Wait for the streamer to re-initialize with the updated properties
  6. Observe that metrics reporting no longer works

Expected behavior

After the streamer re-initializes metrics reporting continues to work.

Environment Description

  • Hudi version : 0.14.1
  • Spark version : 3.4.x
  • Storage (HDFS/S3/GCS..) : S3
  • Running on Docker? (yes/no) : yes
@rangareddy
Copy link

rangareddy commented Dec 3, 2024

Hi @mstahlswrx

Can we coordinate to reproduce this issue?

@mstahlswrx
Copy link
Author

@rangareddy I'm going to throw together a junit test that will reproduce the issue.

@mstahlswrx
Copy link
Author

I created a test that reproduces the problem. You'll notice in the test output that metrics are reported every 5s (1st column in the log4j output in millis) until the streamer re-inits as a result of the hot-swap. As part of re-init the reporter is stopped, shutting down the scheduled executor and closing the http client. No further metrics can then be reported which is evident in the output by the lack of send() log messages.

hudi-12282-unit-test.patch
test.txt

19636 [main] INFO  org.apache.hudi.utilities.deltastreamer.TestHoodieDeltaStreamer$TestHotUpdateStrategy [] - Created.
19637 [main] INFO  org.apache.hudi.utilities.streamer.HoodieStreamer [] - Creating Hudi Streamer with configs:
hoodie.metrics.reporter.class: org.apache.hudi.utilities.deltastreamer.TestHoodieDeltaStreamer$TestDatadogMetricsReporter
20393 [main] INFO  org.apache.hudi.utilities.deltastreamer.TestHoodieDeltaStreamer$TestDatadogMetricsReporter [] - Created.
20393 [main] INFO  org.apache.hudi.utilities.deltastreamer.TestHoodieDeltaStreamer$TestDatadogMetricsReporter [] - start() called. isStarted: false -> true
20412 [pool-42-thread-1] INFO  org.apache.hudi.utilities.streamer.HoodieStreamer [] - Setting Spark Pool name for delta-sync to hoodiedeltasync
20415 [pool-42-thread-1] INFO  org.apache.hudi.utilities.deltastreamer.TestHoodieDeltaStreamer$TestHotUpdateStrategy [] - updateProperties() was called 1 times - not causing a hot swap.
25423 [metrics-hudi-datadog-reporter-1-thread-1] INFO  org.apache.hudi.utilities.deltastreamer.TestHoodieDeltaStreamer$TestDatadogMetricsReporter [] - send() - Simulating sending payload = {"series":[{"metric":"hoodie_trips.deltastreamer.heartbeatTimestampMs","points":[[1733928771,1733928766367]],"tags":[]}]} to datadog.
28409 [pool-42-thread-1] INFO  org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor [] - Committed 00000000000000010
30394 [metrics-hudi-datadog-reporter-1-thread-1] INFO  org.apache.hudi.utilities.deltastreamer.TestHoodieDeltaStreamer$TestDatadogMetricsReporter [] - send() - Simulating sending payload = {"series":[{"metric":"hoodie_trips.deltastreamer.heartbeatTimestampMs","points":[[1733928776,1733928766367]],"tags":[]}]} to datadog.
35393 [metrics-hudi-datadog-reporter-1-thread-1] INFO  org.apache.hudi.utilities.deltastreamer.TestHoodieDeltaStreamer$TestDatadogMetricsReporter [] - send() - Simulating sending payload = {"series":[{"metric":"hoodie_trips.deltastreamer.heartbeatTimestampMs","points":[[1733928781,1733928766367]],"tags":[]},{"metric":"hoodie_trips.finalize.duration","points":[[1733928781,12]],"tags":[]},{"metric":"hoodie_trips.finalize.numFilesFinalized","points":[[1733928781,3]],"tags":[]},{"metric":"hoodie_trips.index.UPSERT.duration","points":[[1733928781,0]],"tags":[]},{"metric":"hoodie_trips.index.lookup.duration","points":[[1733928781,1767]],"tags":[]}]} to datadog.
36139 [pool-42-thread-1] INFO  org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor [] - Committed 20241211095246387
36352 [pool-42-thread-1] INFO  org.apache.hudi.client.BaseHoodieWriteClient [] - Committed 20241211095246387
36400 [pool-42-thread-1] INFO  org.apache.hudi.utilities.deltastreamer.TestHoodieDeltaStreamer$TestHotUpdateStrategy [] - updateProperties() was called 2 times - causing a hot swap.
36400 [pool-42-thread-1] INFO  org.apache.hudi.utilities.streamer.HoodieStreamer [] - Re-init delta sync with new config properties:
36400 [pool-42-thread-1] INFO  org.apache.hudi.utilities.streamer.HoodieStreamer [] - Creating Hudi Streamer with configs:
hoodie.metrics.reporter.class: org.apache.hudi.utilities.deltastreamer.TestHoodieDeltaStreamer$TestDatadogMetricsReporter
36417 [pool-42-thread-1] INFO  org.apache.hudi.utilities.deltastreamer.TestHoodieDeltaStreamer$TestDatadogMetricsReporter [] - report() called. isStarted = true
36419 [pool-42-thread-1] INFO  org.apache.hudi.utilities.deltastreamer.TestHoodieDeltaStreamer$TestDatadogMetricsReporter [] - send() - Simulating sending payload = {"series":[{"metric":"hoodie_trips.HoodieMetadata.files.baseFileCount","points":[[1733928782,1]],"tags":[]},{"metric":"hoodie_trips.HoodieMetadata.files.logFileCount","points":[[1733928782,2]],"tags":[]},{"metric":"hoodie_trips.HoodieMetadata.files.totalBaseFileSizeInBytes","points":[[1733928782,6807]],"tags":[]},{"metric":"hoodie_trips.HoodieMetadata.files.totalLogFileSizeInBytes","points":[[1733928782,13552]],"tags":[]},{"metric":"hoodie_trips.HoodieMetadata.initialize.count","points":[[1733928782,1]],"tags":[]},{"metric":"hoodie_trips.HoodieMetadata.initialize.totalDuration","points":[[1733928782,2393]],"tags":[]},{"metric":"hoodie_trips.HoodieMetadata.table_service_execution_duration.count","points":[[1733928782,1]],"tags":[]},{"metric":"hoodie_trips.HoodieMetadata.table_service_execution_duration.totalDuration","points":[[1733928782,82]],"tags":[]},{"metric":"hoodie_trips.HoodieMetadata.table_service_execution_status","points":[[1733928782,1]],"tags":[]},{"metric":"hoodie_trips.HoodieWrapperFileSystem.create","points":[[1733928782,3]],"tags":[]},{"metric":"hoodie_trips.HoodieWrapperFileSystem.create.totalDuration","points":[[1733928782,160]],"tags":[]},{"metric":"hoodie_trips.HoodieWrapperFileSystem.getFileStatus","points":[[1733928782,3]],"tags":[]},{"metric":"hoodie_trips.HoodieWrapperFileSystem.getFileStatus.totalDuration","points":[[1733928782,0]],"tags":[]},{"metric":"hoodie_trips.HoodieWrapperFileSystem.mkdirs","points":[[1733928782,3]],"tags":[]},{"metric":"hoodie_trips.HoodieWrapperFileSystem.mkdirs.totalDuration","points":[[1733928782,41]],"tags":[]},{"metric":"hoodie_trips.HoodieWrapperFileSystem.write","points":[[1733928782,10611]],"tags":[]},{"metric":"hoodie_trips.HoodieWrapperFileSystem.write.totalBytes","points":[[1733928782,1581258]],"tags":[]},{"metric":"hoodie_trips.HoodieWrapperFileSystem.write.totalDuration","points":[[1733928782,28]],"tags":[]},{"metric":"hoodie_trips.HoodieWrapperFileSystemMetaFolder.create","points":[[1733928782,11]],"tags":[]},{"metric":"hoodie_trips.HoodieWrapperFileSystemMetaFolder.create.totalDuration","points":[[1733928782,492]],"tags":[]},{"metric":"hoodie_trips.HoodieWrapperFileSystemMetaFolder.delete","points":[[1733928782,5]],"tags":[]},{"metric":"hoodie_trips.HoodieWrapperFileSystemMetaFolder.delete.totalDuration","points":[[1733928782,2]],"tags":[]},{"metric":"hoodie_trips.HoodieWrapperFileSystemMetaFolder.getFileStatus","points":[[1733928782,54]],"tags":[]},{"metric":"hoodie_trips.HoodieWrapperFileSystemMetaFolder.getFileStatus.totalDuration","points":[[1733928782,7]],"tags":[]},{"metric":"hoodie_trips.HoodieWrapperFileSystemMetaFolder.listStatus","points":[[1733928782,65]],"tags":[]},{"metric":"hoodie_trips.HoodieWrapperFileSystemMetaFolder.listStatus.totalDuration","points":[[1733928782,51]],"tags":[]},{"metric":"hoodie_trips.HoodieWrapperFileSystemMetaFolder.mkdirs","points":[[1733928782,2]],"tags":[]},{"metric":"hoodie_trips.HoodieWrapperFileSystemMetaFolder.mkdirs.totalDuration","points":[[1733928782,1]],"tags":[]},{"metric":"hoodie_trips.HoodieWrapperFileSystemMetaFolder.rename","points":[[1733928782,4]],"tags":[]},{"metric":"hoodie_trips.HoodieWrapperFileSystemMetaFolder.rename.totalDuration","points":[[1733928782,42]],"tags":[]},{"metric":"hoodie_trips.HoodieWrapperFileSystemMetaFolder.write","points":[[1733928782,36]],"tags":[]},{"metric":"hoodie_trips.HoodieWrapperFileSystemMetaFolder.write.totalBytes","points":[[1733928782,22763]],"tags":[]},{"metric":"hoodie_trips.HoodieWrapperFileSystemMetaFolder.write.totalDuration","points":[[1733928782,2]],"tags":[]},{"metric":"hoodie_trips.TimelineService.CREATE_AND_MERGE_MARKERS","points":[[1733928782,1]],"tags":[]},{"metric":"hoodie_trips.TimelineService.CREATE_MARKER","points":[[1733928782,3]],"tags":[]},{"metric":"hoodie_trips.TimelineService.DELETE_MARKER_DIR","points":[[1733928782,1]],"tags":[]},{"metric":"hoodie_trips.TimelineService.MARKERS_DIR_EXISTS","points":[[1733928782,1]],"tags":[]},{"metric":"hoodie_trips.TimelineService.REFRESH_TABLE","points":[[1733928782,1]],"tags":[]},{"metric":"hoodie_trips.TimelineService.TOTAL_API_CALLS","points":[[1733928782,7]],"tags":[]},{"metric":"hoodie_trips.TimelineService.TOTAL_API_TIME","points":[[1733928782,109]],"tags":[]},{"metric":"hoodie_trips.TimelineService.TOTAL_CHECK_TIME","points":[[1733928782,0]],"tags":[]},{"metric":"hoodie_trips.TimelineService.TOTAL_HANDLE_TIME","points":[[1733928782,109]],"tags":[]},{"metric":"hoodie_trips.TimelineService.TOTAL_REFRESH_TIME","points":[[1733928782,0]],"tags":[]},{"metric":"hoodie_trips.TimelineService.WRITE_VALUE_CNT","points":[[1733928782,7]],"tags":[]},{"metric":"hoodie_trips.TimelineService.WRITE_VALUE_TIME","points":[[1733928782,1]],"tags":[]},{"metric":"hoodie_trips.TimelineService.numEmbeddedTimelineServers","points":[[1733928782,0]],"tags":[]},{"metric":"hoodie_trips.deltacommit.commitTime","points":[[1733928782,1733928766387]],"tags":[]},{"metric":"hoodie_trips.deltacommit.duration","points":[[1733928782,7669]],"tags":[]},{"metric":"hoodie_trips.deltacommit.totalBytesWritten","points":[[1733928782,1452015]],"tags":[]},{"metric":"hoodie_trips.deltacommit.totalCompactedRecordsUpdated","points":[[1733928782,0]],"tags":[]},{"metric":"hoodie_trips.deltacommit.totalCreateTime","points":[[1733928782,2543]],"tags":[]},{"metric":"hoodie_trips.deltacommit.totalFilesInsert","points":[[1733928782,3]],"tags":[]},{"metric":"hoodie_trips.deltacommit.totalFilesUpdate","points":[[1733928782,0]],"tags":[]},{"metric":"hoodie_trips.deltacommit.totalInsertRecordsWritten","points":[[1733928782,1000]],"tags":[]},{"metric":"hoodie_trips.deltacommit.totalLogFilesCompacted","points":[[1733928782,0]],"tags":[]},{"metric":"hoodie_trips.deltacommit.totalLogFilesSize","points":[[1733928782,0]],"tags":[]},{"metric":"hoodie_trips.deltacommit.totalPartitionsWritten","points":[[1733928782,3]],"tags":[]},{"metric":"hoodie_trips.deltacommit.totalRecordsDeleted","points":[[1733928782,0]],"tags":[]},{"metric":"hoodie_trips.deltacommit.totalRecordsWritten","points":[[1733928782,1000]],"tags":[]},{"metric":"hoodie_trips.deltacommit.totalScanTime","points":[[1733928782,0]],"tags":[]},{"metric":"hoodie_trips.deltacommit.totalUpdateRecordsWritten","points":[[1733928782,0]],"tags":[]},{"metric":"hoodie_trips.deltacommit.totalUpsertTime","points":[[1733928782,0]],"tags":[]},{"metric":"hoodie_trips.deltastreamer.duration","points":[[1733928782,15986]],"tags":[]},{"metric":"hoodie_trips.deltastreamer.heartbeatTimestampMs","points":[[1733928782,1733928782354]],"tags":[]},{"metric":"hoodie_trips.deltastreamer.lastSync","points":[[1733928782,1733928782354]],"tags":[]},{"metric":"hoodie_trips.finalize.duration","points":[[1733928782,12]],"tags":[]},{"metric":"hoodie_trips.finalize.numFilesFinalized","points":[[1733928782,3]],"tags":[]},{"metric":"hoodie_trips.index.UPSERT.duration","points":[[1733928782,0]],"tags":[]},{"metric":"hoodie_trips.index.lookup.duration","points":[[1733928782,1767]],"tags":[]}]} to datadog.
36420 [pool-42-thread-1] INFO  org.apache.hudi.utilities.deltastreamer.TestHoodieDeltaStreamer$TestDatadogMetricsReporter [] - stop() called. isStarted: true -> false
40537 [pool-42-thread-1] INFO  org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor [] - Committed 20241211095302386
40757 [pool-42-thread-1] INFO  org.apache.hudi.client.BaseHoodieWriteClient [] - Committed 20241211095302386
40787 [pool-42-thread-1] INFO  org.apache.hudi.utilities.deltastreamer.TestHoodieDeltaStreamer$TestHotUpdateStrategy [] - updateProperties() was called 3 times - not causing a hot swap.
40810 [pool-42-thread-1] INFO  org.apache.hudi.utilities.deltastreamer.TestHoodieDeltaStreamer$TestHotUpdateStrategy [] - updateProperties() was called 4 times - not causing a hot swap.
40820 [pool-42-thread-1] INFO  org.apache.hudi.utilities.deltastreamer.TestHoodieDeltaStreamer$TestHotUpdateStrategy [] - updateProperties() was called 5 times - not causing a hot swap.
40830 [pool-42-thread-1] WARN  org.apache.hudi.utilities.streamer.HoodieStreamer [] - Closing and shutting down ingestion service
40830 [pool-42-thread-1] INFO  org.apache.hudi.utilities.streamer.HoodieStreamer [] - Ingestion completed. Has error: false
40833 [pool-42-thread-1] INFO  org.apache.hudi.utilities.deltastreamer.TestHoodieDeltaStreamer$TestDatadogMetricsReporter [] - report() called. isStarted = false
40834 [pool-42-thread-1] WARN  org.apache.hudi.utilities.deltastreamer.TestHoodieDeltaStreamer$TestDatadogMetricsReporter [] - send() - Called on a closed DatadogHttpClient... metrics will never get sent to Datadog!
40834 [pool-42-thread-1] INFO  org.apache.hudi.utilities.deltastreamer.TestHoodieDeltaStreamer$TestDatadogMetricsReporter [] - stop() called. isStarted: false -> false
40834 [pool-42-thread-1] INFO  org.apache.hudi.utilities.streamer.HoodieStreamer [] - Delta Sync shutdown. Error ?true
40835 [pool-42-thread-1] WARN  org.apache.hudi.utilities.streamer.HoodieStreamer [] - Gracefully shutting down compactor
	at org.apache.hudi.utilities.streamer.HoodieStreamer$StreamSyncService.shutdownAsyncServices(HoodieStreamer.java:839) ~[classes/:?]
	at org.apache.hudi.utilities.streamer.HoodieStreamer$StreamSyncService.lambda$startService$1(HoodieStreamer.java:810) ~[classes/:?]
40835 [pool-42-thread-1] INFO  org.apache.hudi.utilities.streamer.HoodieStreamer [] - Ingestion completed. Has error: false
40835 [pool-42-thread-1] INFO  org.apache.hudi.utilities.deltastreamer.TestHoodieDeltaStreamer$TestDatadogMetricsReporter [] - report() called. isStarted = false
40836 [pool-42-thread-1] WARN  org.apache.hudi.utilities.deltastreamer.TestHoodieDeltaStreamer$TestDatadogMetricsReporter [] - send() - Called on a closed DatadogHttpClient... metrics will never get sent to Datadog!
40836 [pool-42-thread-1] INFO  org.apache.hudi.utilities.deltastreamer.TestHoodieDeltaStreamer$TestDatadogMetricsReporter [] - stop() called. isStarted: false -> false
	at org.apache.hudi.utilities.deltastreamer.TestHoodieDeltaStreamer.testMetricsWithHotSwap(TestHoodieDeltaStreamer.java:911)
47311 [Thread-97] INFO  org.apache.hudi.utilities.deltastreamer.TestHoodieDeltaStreamer$TestDatadogMetricsReporter [] - report() called. isStarted = false
47312 [Thread-97] WARN  org.apache.hudi.utilities.deltastreamer.TestHoodieDeltaStreamer$TestDatadogMetricsReporter [] - send() - Called on a closed DatadogHttpClient... metrics will never get sent to Datadog!
47312 [Thread-97] INFO  org.apache.hudi.utilities.deltastreamer.TestHoodieDeltaStreamer$TestDatadogMetricsReporter [] - stop() called. isStarted: false -> false
[ERROR]   TestHoodieDeltaStreamer.testMetricsWithHotSwap:911 Tried to send metrics using a closed DatadogHttpClient ==> expected: <false> but was: <true>

@mstahlswrx
Copy link
Author

patch < /tmp/hudi-12282-unit-test.patch
clear; mvn -pl hudi-utilities -Dtest=TestHoodieDeltaStreamer#testMetricsWithHotSwap test 2>&1 \
 | tee /tmp/build.txt \
 | egrep 'TestHotUpdateStrategy|TestDatadogMetricsReporter|HoodieStreamer|TestHoodieDeltaStreamer\.testMetricsWithHotSwap|Committed' \
| tee /tmp/test.txt

@rangareddy
Copy link

Created a Hudi Upstream Jira - https://issues.apache.org/jira/browse/HUDI-8778

@ad1happy2go ad1happy2go moved this from ⏳ Awaiting Triage to 🏁 Triaged in Hudi Issue Support Dec 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: 🏁 Triaged
Development

No branches or pull requests

3 participants