Avoid a race condition that causes 100% usage of a CPU core #300


Merged
merged 7 commits into master from dev_fix_race_condition_high_cpu_usage
May 16, 2025

Conversation

emasab
Contributor

@emasab emasab commented Apr 28, 2025

when consuming with partitionsConsumedConcurrently > 1 and all messages are consumed.

Closes #195

What

When no messages are available, it's possible for both workers to alternate between fetching and waiting on #fetchInProgress, never reaching the state where they await #queueNonEmptyCb. This is solved by using a separate promise for the non-empty event and by not resolving it while a fetch is in progress, so that a double callback cannot cause one of the two events to be lost.
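The fix relies on an externally resolvable promise for the non-empty event. A minimal sketch of that pattern (hypothetical names, not the library's actual implementation):

```javascript
// Minimal sketch of an externally resolvable ("deferred") promise,
// the pattern used to signal a queue-non-empty event to waiting workers.
class DeferredPromise {
  #resolve;
  promise = new Promise((resolve) => { this.#resolve = resolve; });

  resolve(value) { this.#resolve(value); }
}

// A worker awaits the deferred promise; the queue-non-empty callback
// resolves it, waking the worker exactly once.
async function demo() {
  const nonEmpty = new DeferredPromise();
  let state = 'waiting';
  const worker = nonEmpty.promise.then(() => { state = 'woken'; });
  nonEmpty.resolve(); // simulate the queue-non-empty callback firing
  await worker;
  return state;
}
```

Because a promise can only settle once, a fresh DeferredPromise must be created each time a new non-empty event is needed, which is what distinguishes this approach from reusing a single callback.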

Checklist

  • Contains customer facing changes? Including API/behavior changes
  • Did you add sufficient unit test and/or integration test coverage for this PR?

References

JIRA:

Test & Review

To reproduce the issue before the fix, or to test it, use the example code provided in #195.

Open questions / Follow-ups


Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
…uming with `partitionsConsumedConcurrently > 1` and all messages are consumed.

Closes #195
@Copilot Copilot AI review requested due to automatic review settings April 28, 2025 11:45
@emasab emasab requested review from a team as code owners April 28, 2025 11:45

@Copilot Copilot AI left a comment


Pull Request Overview

This PR fixes a race condition that was causing 100% CPU usage during message consumption when partitionsConsumedConcurrently > 1. The changes introduce improved synchronization via DeferredPromises, update test configurations to better stress the fix, and adjust worker termination logic.

  • Increased the volume of test messages and added flush calls to ensure proper synchronization.
  • Modified consumer internals to notify and await available partitions, and improved worker termination handling.

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Summary per file:

  • test/promisified/producer/flush.spec.js: Increased the test message count to stress the condition.
  • test/promisified/consumer/seek.spec.js: Added flush calls to avoid out-of-range errors during seeking.
  • test/promisified/consumer/consumeMessages.spec.js: Adjusted message volume in tests to align with the new race condition fix.
  • lib/kafkajs/_consumer_cache.js: Introduced DeferredPromise for available partitions and notification logic.
  • lib/kafkajs/_consumer.js: Refactored consumer flow; updated error handling and worker termination logic.
  • CHANGELOG.md: Updated changelog to document the race condition fix.
Comments suppressed due to low confidence (2)

test/promisified/producer/flush.spec.js:78

  • Increasing the number of messages from 100 to 1000 can impact test execution time; please ensure test timeouts and resource usage are adjusted accordingly.
producer.send({ topic: topicName, messages: Array(1000).fill(message) }).then(() => {

lib/kafkajs/_consumer.js:1678

  • Replacing direct resolution with a dedicated method improves consistency; please double-check that all race conditions related to worker termination are adequately handled.
this.#resolveWorkerTerminationScheduled();
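Centralizing resolution in one dedicated method is a common way to keep a deferred promise's lifecycle consistent across call sites. A hypothetical sketch (names assumed, not the actual consumer code):

```javascript
// Sketch: a termination signal exposed as a deferred promise, with a
// single dedicated method responsible for resolving it.
class Worker {
  #resolveTermination;
  #terminationScheduled = new Promise((resolve) => {
    this.#resolveTermination = resolve;
  });

  // The only place the promise is resolved, so all call sites behave alike.
  scheduleTermination() { this.#resolveTermination(); }

  awaitTermination() { return this.#terminationScheduled; }
}

async function demoTermination() {
  const worker = new Worker();
  let terminated = false;
  const waiter = worker.awaitTermination().then(() => { terminated = true; });
  worker.scheduleTermination();
  await waiter;
  return terminated;
}
```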

@@ -1316,7 +1321,7 @@ class Consumer {

/* If any message is unprocessed, either due to an error or due to the user not marking it processed, we must seek
* back to get it so it can be reprocessed. */
if (lastOffsetProcessed.offset !== lastOffset) {
if (!payload._seeked && lastOffsetProcessed.offset !== lastOffset) {

Copilot AI Apr 28, 2025


The condition now checks for the _seeked flag to avoid redundant seeks; please ensure this logic meets the intended behavior for reprocessing messages.


@@ -348,6 +352,11 @@ describe('Consumer seek >', () => {
});

describe('batch staleness >', () => {
beforeEach(async () => {
// Theses tests expect a single partititon


Typo: Theses

emasab added 3 commits April 29, 2025 21:27
it's not possible that we await for a different reason
than the one that caused the return of null in the first place
@airlock-confluentinc airlock-confluentinc bot force-pushed the dev_fix_race_condition_high_cpu_usage branch from dd9e69c to 5dd12ee on April 30, 2025 16:23

@emasab emasab marked this pull request as ready for review May 1, 2025 08:20
@sonarqube-confluent

Passed

Analysis Details

3 Issues

  • Bugs: 0
  • Vulnerabilities: 0
  • Code Smells: 3

Coverage and Duplications

  • Coverage: 93.80% (47.80% estimated after merge)
  • Duplications: no duplication information (2.10% estimated after merge)

Project ID: confluent-kafka-javascript


Contributor

@milindl milindl left a comment


Left some review comments, doing another pass

@@ -410,6 +422,13 @@ describe('Consumer seek >', () => {

consumer.run({
eachBatch: async ({ batch, isStale, resolveOffset }) => {
if (offsetsConsumed.length == 0 &&


Use ===
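The suggestion reflects JavaScript's loose-equality coercion rules; a quick illustration of why `===` is the safer comparison:

```javascript
// Loose equality (==) applies type coercion before comparing;
// strict equality (===) compares type and value directly.
const checks = [
  0 == '',          // true: '' coerces to the number 0
  0 === '',         // false: number vs string
  [].length == 0,   // true
  [].length === 0,  // true, with no coercion involved
];
console.log(checks);
```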

@@ -4,6 +4,7 @@ const {
secureRandom,
createTopic,
createAdmin,
sleep,


Remove as it is unused

* Promise that resolves when there are available partitions to take.
*/
async availablePartitions() {
await this.#availablePartitionsPromise;

nit: can just return instead of await for identical behaviour
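The nit rests on the fact that, inside an async function, `return promise` and `await promise` are observably equivalent for the caller when the promise resolves to undefined, aside from microtask timing. A small sketch with hypothetical names:

```javascript
// Both functions' returned promises settle when `p` settles,
// resolving to undefined either way.
async function availableViaAwait(p) { await p; }
async function availableViaReturn(p) { return p; }

async function demo() {
  const p = Promise.resolve();
  const a = await availableViaAwait(p);
  const b = await availableViaReturn(p);
  return a === b; // both undefined
}
```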

@emasab emasab requested a review from milindl May 13, 2025 14:11
Contributor

@milindl milindl left a comment


Thanks for the fix! I checked additionally to make sure we're not leaking promises etc.

@emasab emasab merged commit af20c7c into master May 16, 2025
1 of 2 checks passed
@emasab emasab deleted the dev_fix_race_condition_high_cpu_usage branch May 16, 2025 08:42
Development

Successfully merging this pull request may close these issues.

🔥 Critical Performance Issue: partitionsConsumedConcurrently > 1 Causes CPU Overload Without Consumer Lag
3 participants