
Conversation

@Abdulkbk
Contributor

fixes #10337
continues #10356

Change Description

From the issue description:

CPU profiling reveals that IsPublicNode queries consume 62.36% of total CPU time (21.77s out of 34.91s) during gossip message processing. This represents a critical performance bottleneck in LND's gossip subsystem that significantly impacts node synchronization and network message processing throughput.

In this PR, we add caching to the IsPublicNode method in the SQLStore. Since the IsPublicNode function is called for every node announcement received from the Lightning Network gossip protocol, adding a cache significantly reduces database overhead and accelerates gossip message processing.
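
For orientation, here is a minimal sketch of the pieces this change adds. The cachedPublicNode struct and the SQLStore fields mirror the diff hunks quoted later in this thread; the use of the neutrino lru package is inferred from the cache.ErrElementNotFound reference in the review below, and the package name plus the rest of the SQLStore fields are assumptions:

	package graphdb // assumption: package name of the graph store

	import (
		"sync"

		"github.com/lightninglabs/neutrino/cache/lru"
	)

	// cachedPublicNode wraps a node's public status so it can be stored
	// in the LRU cache.
	type cachedPublicNode struct {
		isPublic bool
	}

	// Size implements the cache.Value interface required by the lru
	// package; each entry counts as one unit of capacity.
	func (c *cachedPublicNode) Size() (uint64, error) {
		return 1, nil
	}

	// SQLStore carries the new cache alongside its existing caches
	// (sketch; the real struct has many more fields).
	type SQLStore struct {
		cacheMu         sync.RWMutex
		publicNodeCache *lru.Cache[[33]byte, *cachedPublicNode]
	}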

Steps to Test


go test -tags=test_db_sqlite -run TestNodeIsPublic # to ensure we still return the correct status for nodes

go test -tags=test_db_sqlite -run TestNodeIsPublicCacheInvalidation -v # to test that we're correctly invalidating the cache

@gemini-code-assist

Summary of Changes

Hello @Abdulkbk, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a caching mechanism for the IsPublicNode query, which was identified as a major CPU bottleneck during Lightning Network gossip message processing. By caching the public status of nodes, the system can avoid repeated database queries, thereby significantly improving performance, reducing database overhead, and accelerating the overall gossip subsystem. The changes include the cache implementation, robust invalidation logic, and new configuration options.

Highlights

  • Performance Improvement: Implemented an LRU cache for the IsPublicNode query within the SQLStore to significantly reduce CPU consumption during gossip message processing, addressing a critical performance bottleneck.
  • Cache Integration: The IsPublicNode method now checks the cache first, retrieving the public status of a node if available, or querying the database and storing the result for future use.
  • Cache Invalidation: Comprehensive cache invalidation logic has been added to ensure data consistency. The publicNodeCache is cleared for affected nodes whenever channel edges are added, deleted, marked as zombie, or when nodes are deleted, or during graph pruning and block disconnections.
  • Configuration Options: Introduced a new DefaultPublicNodeCacheSize constant and a PublicNodeCacheSize field in StoreOptions to allow configuration of the cache's capacity, defaulting to 15,000 entries (a sketch follows this list).
  • New Test Coverage: A new test, TestNodeIsPublicCacheInvalidation, was added to thoroughly verify that the cache is correctly invalidated across various graph modification operations.
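
A minimal sketch of what those options could look like, going only by the names and the 15,000-entry default mentioned in this summary (the field type and the rest of StoreOptions are assumptions):

	// DefaultPublicNodeCacheSize is the default number of entries the
	// public node cache can hold.
	const DefaultPublicNodeCacheSize = 15000

	// StoreOptions holds configurable options for the SQLStore; only
	// the new field is shown here.
	type StoreOptions struct {
		// PublicNodeCacheSize is the maximum number of entries kept
		// in the IsPublicNode LRU cache.
		PublicNodeCacheSize int
	}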

@gemini-code-assist bot left a comment

Code Review

This pull request introduces caching for the IsPublicNode query to address a significant performance bottleneck in gossip message processing. The changes are well-structured, including the addition of a new cache, configuration options, and cache invalidation logic at various points where a node's public status might change. I've identified a potential race condition in one of the cache invalidation paths and an opportunity to optimize the cache-aside pattern to prevent thundering herds. Overall, this is a valuable improvement.

Comment on lines 2353 to 2360
	// Store the result in cache.
	s.cacheMu.Lock()
	_, err = s.publicNodeCache.Put(pubKey, &cachedPublicNode{
		isPublic: isPublic,
	})
	if err != nil {
		log.Warnf("unable to store node %x in cache: %v", pubKey, err)
	}

	s.cacheMu.Unlock()

	return isPublic, nil


Severity: medium

There's a potential for a thundering herd problem here. If multiple goroutines call IsPublicNode for the same key that is not in the cache, they will all miss the cache, query the database, and then attempt to write to the cache. To optimize this and prevent redundant database queries, you can re-check the cache after acquiring the write lock. This ensures that only the first goroutine populates the cache, and subsequent ones will use the cached value.

	// Store the result in cache. We use a double-checked locking pattern
	// here to avoid a thundering herd problem where multiple goroutines
	// query the DB for the same key on a cache miss.
	s.cacheMu.Lock()
	defer s.cacheMu.Unlock()

	// Re-check the cache to avoid a race where another goroutine populated
	// it between our read-lock release and write-lock acquisition.
	if cached, err := s.publicNodeCache.Get(pubKey); err == nil && cached != nil {
		return cached.isPublic, nil
	}

	_, err = s.publicNodeCache.Put(pubKey, &cachedPublicNode{
		isPublic: isPublic,
	})
	if err != nil {
		log.Warnf("unable to store node %x in cache: %v", pubKey, err)
	}

	return isPublic, nil
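
One design note on the suggestion above: as the NOTE on removePublicNodeCache further down in this thread points out, the lru cache is itself thread safe, so cacheMu is not protecting the cache's internals. What the explicit lock buys here is atomicity of the get-then-put sequence, which is exactly what the re-check after acquiring the write lock relies on.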

@Abdulkbk force-pushed the ispub-cache branch 2 times, most recently from 0a81125 to 85b45d5 on November 12, 2025 16:12
@lightninglabs-deploy

@Abdulkbk, remember to re-request review from reviewers when ready

@Roasbeef added this to the v0.20.1 milestone Dec 10, 2025
@saubyk added this to lnd v0.20 Dec 10, 2025
@saubyk moved this to In progress in lnd v0.20 Dec 10, 2025
@saubyk moved this from In progress to In review in lnd v0.20 Dec 10, 2025
@ziggie1984 self-requested a review December 10, 2025 07:31
@ziggie1984 (Collaborator) left a comment

Good idea to cache the call; let's also add a benchmark here.

	s.cacheMu.RLock()
	cached, err := s.publicNodeCache.Get(pubKey)

	switch {
Collaborator

I find this logic hard to follow; can we instead do:

	// Cache hit - return immediately.
	if err == nil && cached != nil {
		return cached.isPublic, nil
	}

	// Log unexpected errors (anything other than "not found").
	if err != nil && !errors.Is(err, cache.ErrElementNotFound) {
		log.Warnf("unexpected error checking node cache: %v", err)
	}

This commit adds the struct we'll use to cache the node. It
also adds the required `Size` method for the lru package.
@Abdulkbk
Contributor Author

I cherry-picked f9078e5 from #10356, which adds the benchmark for the IsPublicNode call, and below is how it compares:

go test -tags=test_db_sqlite -bench=BenchmarkIsPublicNode -v

Scenario    Iterations (b.N)    Time per op
Cache       10422774            109.9 ns/op
No cache    390                 2991256 ns/op (~3 ms per op)

The difference is dramatic: with the cache, each lookup is roughly 27,000x faster.
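
The benchmark itself is not quoted in this thread, but a minimal sketch of what such a BenchmarkIsPublicNode could look like is below. newTestSQLStore and testNodePubKey are hypothetical helpers, and the IsPublicNode(pubKey [33]byte) (bool, error) signature is an assumption based on the hunks above:

	package graphdb_test // assumption

	import "testing"

	// BenchmarkIsPublicNode measures repeated IsPublicNode lookups for a
	// single node, which hit the cache on every iteration after the
	// first when caching is enabled.
	func BenchmarkIsPublicNode(b *testing.B) {
		store := newTestSQLStore(b) // hypothetical: store backed by a test DB
		pubKey := testNodePubKey    // hypothetical: a [33]byte fixture

		b.ResetTimer()
		for i := 0; i < b.N; i++ {
			if _, err := store.IsPublicNode(pubKey); err != nil {
				b.Fatal(err)
			}
		}
	}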

@Abdulkbk
Contributor Author

I noticed the itests are all failing. Let me check why...

@ziggie1984
Collaborator

Can you add a release note entry for v0.20.1?

@ziggie1984 added the labels back port candidate (pr which should be back ported to last major release) and backport-v0.20.x-branch (used to trigger the creation of a backport PR to the branch `v0.20.x-branch`) on Dec 15, 2025
-type cachedPublicNode struct {
-	isPublic bool
-}
+type cachedPublicNode struct{}
Collaborator

I don't understand why you remove the isPublic field again?

Contributor Author

Ohh, I was supposed to remove it in the commit that added the field, so that each commit compiles and the linter stays happy. I will fix this.

	default:
		s.rejectCache.remove(edge.ChannelID)
		s.chanCache.remove(edge.ChannelID)
		s.removePublicNodeCache(
Collaborator

I don't think we can remove the entries here just because we fail to add the edge; we are dealing with nodes here, not channels. I don't actually think we need to delete them here.


	s.rejectCache.remove(chanID)
	s.chanCache.remove(chanID)
	s.removePublicNodeCache(pubKey1, pubKey2)
Collaborator

Same here, we cannot do this: the nodes can still have other channels.

Contributor Author

I'm not sure. At the MarkEdgeZombie call site, we also remove the channel from the graph cache. It makes sense to remove the cache entries here too: even if the node has other channels, we can't be sure whether they're private or public, so it's safer to get that info from the DB after this call.

@ziggie1984 (Collaborator) Dec 16, 2025

If the node was ever public, that train has passed and we should keep it in the cache. A node either remains private the entire time or, once public, stays public. It does not really make sense to switch from public to private.

		s.chanCache.remove(chanID)
	}

	var pubkeys [][33]byte
Collaborator

No need to delete here; it's an LRU cache. If a node was public we don't bother, because the pubkey was already announced.

	for _, channel := range closedChans {
		s.rejectCache.remove(channel.ChannelID)
		s.chanCache.remove(channel.ChannelID)
		s.removePublicNodeCache(
Collaborator

Not able to delete here either.

	for _, channel := range removedChans {
		s.rejectCache.remove(channel.ChannelID)
		s.chanCache.remove(channel.ChannelID)
		s.removePublicNodeCache(
Collaborator

Same here, this cannot be deleted.

//
// NOTE: This can safely be called without holding a lock since the lru is
// thread safe.
func (s *SQLStore) removePublicNodeCache(pubkeys ...[33]byte) {
Collaborator

I don't think we need this if we remove all the call sites.

		return fmt.Errorf("unable to delete node: %w", err)
	}

	s.removePublicNodeCache(pubKey)
Collaborator

Even here, I'm actually not sure we should remove it from the cache. It's an LRU cache, so it cycles unused values out; we might still get some info for this node if the gossip is delayed?

Contributor Author

Shouldn't we be as cautious as possible? Assume the node is initially public. After a DeleteNode call, calling IsPublicNode without a cache returns false, but with a cache that wasn't invalidated you get true. Wouldn't that be a discrepancy?

In this commit, we add publicNodeCache to the SQLStore. We also
add the necessary config for initializing the cache.

Additionally, we introduce a new config `public-node-cache-size`
which lets us set the cache size.
Signed-off-by: Abdullahi Yunus <[email protected]>

In this commit, we first check for the node in our cache before
querying the database when determining whether a node is public.

In this commit, we remove nodes from the node cache at various db
method call sites whose execution could affect the public status of
the nodes (a sketch of the helper follows).
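
Going by the signature and NOTE quoted in the review hunks above, the removal helper plausibly looks like the following; the Delete call on the neutrino lru cache is an assumption:

	// removePublicNodeCache removes the given pubkeys from the public
	// node cache.
	//
	// NOTE: This can safely be called without holding a lock since the
	// lru is thread safe.
	func (s *SQLStore) removePublicNodeCache(pubkeys ...[33]byte) {
		for _, pubKey := range pubkeys {
			s.publicNodeCache.Delete(pubKey)
		}
	}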

In this commit, we add a benchmark to test the performance of the
IsPublicNode query.

Labels

back port candidate — pr which should be back ported to last major release
backport-v0.20.x-branch — This label is used to trigger the creation of a backport PR to the branch `v0.20.x-branch`.

Projects

Status: In review

Development

Successfully merging this pull request may close these issues.

[bug]: Graph SQL implementation results in some performance issues
