-
Notifications
You must be signed in to change notification settings - Fork 82
telemetry with batching, persistence, and improved logging #4014
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: v8/develop
Are you sure you want to change the base?
Conversation
|
||
// Graceful shutdown handling | ||
process.on('SIGINT', () => this.gracefulShutdown()); | ||
process.on('SIGTERM', () => this.gracefulShutdown()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: Signal Handler Duplication Causes Shutdown Race Condition
The QuestTelemetry
module registers redundant SIGINT
and SIGTERM
signal handlers. The main OTNode
class already handles these signals, and its handleExit()
method correctly initiates telemetry cleanup via the module manager. This duplication creates a race condition where both QuestTelemetry.gracefulShutdown()
and OTNode.handleExit()
execute on process termination. Although gracefulShutdown()
has a guard, the cleanup()
method, which can be called by both paths, lacks this protection, potentially causing attempts to flush or close already closed connections and leading to unpredictable shutdown behavior.
const telemetryModuleManager = this.container.resolve('telemetryModuleManager'); | ||
if (telemetryModuleManager && telemetryModuleManager.getImplementation()) { | ||
this.logger.info('Cleaning up telemetry...'); | ||
await telemetryModuleManager.getImplementation().module.cleanup(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: Telemetry Cleanup Method Access Error
The telemetry cleanup call in handleExit
incorrectly attempts to access telemetryModuleManager.getImplementation().module.cleanup()
. The cleanup()
method is directly available on the object returned by getImplementation()
, not nested under a .module
property, which will cause a runtime error.
Locations (1)
constructor() { | ||
this.localSender = null; | ||
this.lastErrorLogTime = 0; | ||
this.errorLogInterval = 60000; // 1 minute between error logs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move all this constants to constants/constants.js
this.lastSuccessfulSend = Date.now(); | ||
this.maxTimeWithoutSuccess = 5 * 60 * 1000; // 5 minutes | ||
|
||
// Event persistence for when QuestDB is down |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be stored in memory, if nodes restarts, it's lost, and increases the memory footprint of the node?
try { | ||
this.localSender = Sender.fromConfig(this.config.localEndpoint); | ||
this.retryAttempts = 0; | ||
this.telemetryStats.connectionStatus = 'connected'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
connected should be constants
this.lastSuccessfulSend = Date.now(); | ||
this.logger.debug('QuestDB local sender created successfully'); | ||
} catch (error) { | ||
this.telemetryStats.connectionStatus = 'failed'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
failed should be constants
startHealthMonitoring() { | ||
this.healthCheckTimer = setInterval(() => { | ||
this.logHealthStatus(); | ||
}, this.healthCheckInterval); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
healthCheckInterval
shouldn't be field it should be constant
|
||
table.symbol('operationId', event.operationId || 'NULL'); | ||
table.symbol('blockchainId', event.blockchainId || 'NULL'); | ||
table.symbol('name', event.name || 'NULL'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be 'NULL' as string, or it should just be null value?
|
||
// Log failed proof events (important to know) | ||
if (proofEvents.length > 0) { | ||
this.logger.error(`[TELEMETRY] Failed to send ${proofEvents.length} proof events: ${proofEvents.map(e => e.operationId).join(', ')}`); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there is a lot of failed events this log will be huge
|
||
for (const event of events) { | ||
try { | ||
await this.retryWithBackoff(async () => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to this sequantaly, can we rewrite this so it retries at same time for all of them by launching promises at same time and awaits for all of them
// Calculate queue status | ||
const queueStatus = this.persistentEventQueue.length > 0 | ||
? `Queue: ${this.persistentEventQueue.length}/${this.maxPersistentQueueSize}` | ||
: 'Queue: empty'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it shouldn't be empty but 0% let's have consistants logs
} | ||
} | ||
|
||
logLifetimeProofSummary() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove dead code
Did partial review, can you address these comments, also address claued raised issues and liter errors |
Description