Skip to content

Conversation

Niks988
Copy link
Collaborator

@Niks988 Niks988 commented Jul 25, 2025

Description

  • Add event batching (50 events, 5s timeout) for better performance
  • Implement persistent queue (10K events) to prevent event loss during QuestDB downtime
  • Add connection health monitoring and auto-reconnection
  • Improve logging with clear batch and queue status
  • Add graceful shutdown with event flushing
  • Eliminate ECONNRESET spam with error rate limiting
  • Add fallback mechanisms and memory leak protection


// Graceful shutdown handling
process.on('SIGINT', () => this.gracefulShutdown());
process.on('SIGTERM', () => this.gracefulShutdown());
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Signal Handler Duplication Causes Shutdown Race Condition

The QuestTelemetry module registers redundant SIGINT and SIGTERM signal handlers. The main OTNode class already handles these signals, and its handleExit() method correctly initiates telemetry cleanup via the module manager. This duplication creates a race condition where both QuestTelemetry.gracefulShutdown() and OTNode.handleExit() execute on process termination. Although gracefulShutdown() has a guard, the cleanup() method, which can be called by both paths, lacks this protection, potentially causing attempts to flush or close already closed connections and leading to unpredictable shutdown behavior.

Locations (1)

Fix in CursorFix in Web

const telemetryModuleManager = this.container.resolve('telemetryModuleManager');
if (telemetryModuleManager && telemetryModuleManager.getImplementation()) {
this.logger.info('Cleaning up telemetry...');
await telemetryModuleManager.getImplementation().module.cleanup();
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Telemetry Cleanup Method Access Error

The telemetry cleanup call in handleExit incorrectly attempts to access telemetryModuleManager.getImplementation().module.cleanup(). The cleanup() method is directly available on the object returned by getImplementation(), not nested under a .module property, which will cause a runtime error.

Locations (1)

Fix in Cursor Fix in Web

constructor() {
this.localSender = null;
this.lastErrorLogTime = 0;
this.errorLogInterval = 60000; // 1 minute between error logs
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move all this constants to constants/constants.js

this.lastSuccessfulSend = Date.now();
this.maxTimeWithoutSuccess = 5 * 60 * 1000; // 5 minutes

// Event persistence for when QuestDB is down
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be stored in memory, if nodes restarts, it's lost, and increases the memory footprint of the node?

try {
this.localSender = Sender.fromConfig(this.config.localEndpoint);
this.retryAttempts = 0;
this.telemetryStats.connectionStatus = 'connected';
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

connected should be constants

this.lastSuccessfulSend = Date.now();
this.logger.debug('QuestDB local sender created successfully');
} catch (error) {
this.telemetryStats.connectionStatus = 'failed';
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

failed should be constants

startHealthMonitoring() {
this.healthCheckTimer = setInterval(() => {
this.logHealthStatus();
}, this.healthCheckInterval);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

healthCheckInterval shouldn't be field it should be constant


table.symbol('operationId', event.operationId || 'NULL');
table.symbol('blockchainId', event.blockchainId || 'NULL');
table.symbol('name', event.name || 'NULL');
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be 'NULL' as string, or it should just be null value?


// Log failed proof events (important to know)
if (proofEvents.length > 0) {
this.logger.error(`[TELEMETRY] Failed to send ${proofEvents.length} proof events: ${proofEvents.map(e => e.operationId).join(', ')}`);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is a lot of failed events this log will be huge


for (const event of events) {
try {
await this.retryWithBackoff(async () => {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to this sequantaly, can we rewrite this so it retries at same time for all of them by launching promises at same time and awaits for all of them

// Calculate queue status
const queueStatus = this.persistentEventQueue.length > 0
? `Queue: ${this.persistentEventQueue.length}/${this.maxPersistentQueueSize}`
: 'Queue: empty';
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it shouldn't be empty but 0% let's have consistants logs

}
}

logLifetimeProofSummary() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove dead code

@Mihajlo-Pavlovic
Copy link
Collaborator

Did partial review, can you address these comments, also address claued raised issues and liter errors

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants