
SOLR-17519: CloudSolrClient with HTTP ClusterState can forget live nodes and then fail #2935

Merged
merged 19 commits
Jan 24, 2025

Conversation

mlbiscoc
Contributor

@mlbiscoc mlbiscoc commented Jan 3, 2025

https://issues.apache.org/jira/browse/SOLR-17519

Description

In BaseHttpClusterStateProvider, if all but one of the initially passed nodes go down, the CSP will only be aware of the one live node. If that node then goes down and any of the other initially passed nodes recovers, the CSP cannot connect to get the latest cluster state of live nodes, because it only holds the address of the now-downed node.

Solution

The CSP holds an immutable initialNodes set that is never removed from live nodes, keeping it resilient. If the case above occurs, the CSP can still get the cluster state and the latest set of live nodes, because live nodes always contains the initial set. Live nodes can also hold new nodes added to the cluster after CSP initialization, but those are removable, unlike the initial nodes.
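The retention idea described above can be sketched as a toy (the class and method names here, such as mergeLiveNodes, are illustrative, not the actual patch):

```java
import java.util.HashSet;
import java.util.Set;

public class InitialNodesUnionDemo {
    // Union of the last fetched nodes with the immutable initial set,
    // so the initial nodes are never dropped even while unreachable.
    static Set<String> mergeLiveNodes(Set<String> fetched, Set<String> initialNodes) {
        Set<String> liveNodes = new HashSet<>(fetched);
        liveNodes.addAll(initialNodes);
        return Set.copyOf(liveNodes);
    }

    public static void main(String[] args) {
        Set<String> initial = Set.of("node1:8983_solr", "node2:8983_solr");
        // node2 is currently down; node3 was added to the cluster later.
        Set<String> fetched = Set.of("node1:8983_solr", "node3:8983_solr");
        Set<String> live = mergeLiveNodes(fetched, initial);
        System.out.println(live.containsAll(initial)); // true: initial nodes always retained
    }
}
```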

Tests

testClusterStateProviderDownedInitialLiveNodes tests the above case and used to fail.
testClusterStateProviderLiveNodesWithNewHost tests live nodes with a third host added after the CSP is initialized; the new host is removable, but the initial nodes are always still present and reachable by the CSP.

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended, not available for branches on forks living under an organisation)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Reference Guide

Contributor

@dsmiley dsmiley left a comment

Thanks for the PR!

Comment on lines 55 to 57
private Set<String> initialNodes;
volatile Set<String> liveNodes;
volatile Set<String> knownNodes;
Contributor

Right away, I'm surprised. I can understand needing two sets, but three? IMO knownNodes doesn't need to exist; that's the same as liveNodes. I see below you've changed many references from liveNodes to knownNodes, but I recommend against that because the exact wording "liveNodes" is extremely pervasive in SolrCloud (a well-known concept), so let's not vary it in just this class or any class.

Contributor Author

I originally did it with just liveNodes, but liveNodes is returned to the client, and from its name I was thinking they would assume the nodes are "live". If we just place all the nodes into liveNodes, that's not necessarily true, right? If the current set holds all the initially passed nodes, it isn't guaranteed they're live, which is why I switched to knownNodes, while liveNodes is what was actually fetched from ZooKeeper.

Contributor

There's no guarantee that getLiveNodes is going to return a list of reachable nodes. A moment after returning it, a node could become unreachable. So it's "best effort" and the caller has to deal with failures by trying the other nodes in the list and/or getting a possibly refreshed list.

Contributor Author

That's fair. Removed knownNodes and put everything into liveNodes while always including the initial set.

@@ -65,6 +68,8 @@ public void init(List<String> solrUrls) throws Exception {
urlScheme = solrUrl.startsWith("https") ? "https" : "http";
try (SolrClient initialClient = getSolrClient(solrUrl)) {
this.liveNodes = fetchLiveNodes(initialClient);
this.initialNodes = Set.copyOf(liveNodes);
Contributor

I think the idea of this JIRA issue is that we'll take the Solr URLs as configured and use them as the initial / backup liveNodes. I think this is a very simple idea to document/understand/implement.

Comment on lines 46 to 47
private static JettySolrRunner jettyNode1;
private static JettySolrRunner jettyNode2;
Contributor

static fields in tests that refer to Solr should be null'ed out, so there's some burden to them. Here... I think you could avoid these altogether and simply call cluster.getJettySolrRunner(0) which isn't bad!

Contributor

could add a convenience method if you like.

}

private void waitForCSPCacheTimeout() throws InterruptedException {
Thread.sleep(6000);
Contributor

This test should set the system property solr.solrj.cache.timeout.sec to maybe 1ms and then you can merely sleep for like a 100 milliseconds.

Contributor Author

The cache timeout setting is in seconds and pulls an int, so it can't be 1ms. This probably should have had better granularity instead of increments of seconds.

I could change cacheTimeout to take milliseconds as the int instead, but that would change the meaning of the system property for people not using the default. So if they set the cache timeout to 5 seconds, it would turn into 5ms if they are not aware of this change.

@@ -61,6 +67,7 @@ public abstract class BaseHttpClusterStateProvider implements ClusterStateProvid
private int cacheTimeout = EnvUtils.getPropertyAsInteger("solr.solrj.cache.timeout.sec", 5);

public void init(List<String> solrUrls) throws Exception {
this.initialNodes = getNodeNamesFromSolrUrls(solrUrls);
Contributor Author

@mlbiscoc mlbiscoc Jan 10, 2025

I think the idea of this JIRA issue is that we'll take the Solr URLs as configured and use them as the initial / backup liveNodes. I think this is a very simple idea to document/understand/implement.

That makes sense. This leaves me with a few questions, then. Shouldn't this take a list of URL or URI Java objects to verify actual, non-malformed URLs instead of a list of strings? I created the functions below to convert these Strings into cluster-state node names for liveNodes.

Contributor

Great idea to do basic malformed-URL checks like this.

Comment on lines 428 to 444
public Set<String> getNodeNamesFromSolrUrls(List<String> urls)
    throws URISyntaxException, MalformedURLException {
  Set<String> set = new HashSet<>();
  for (String url : urls) {
    String nodeNameFromSolrUrl = getNodeNameFromSolrUrl(url);
    set.add(nodeNameFromSolrUrl);
  }
  return Collections.unmodifiableSet(set);
}

/** URL to cluster state node name (http://127.0.0.1:12345/solr to 127.0.0.1:12345_solr) */
public String getNodeNameFromSolrUrl(String solrUrl)
    throws MalformedURLException, URISyntaxException {
  URL url = new URI(solrUrl).toURL();
  return url.getAuthority() + url.getPath().replace('/', '_');
}

Contributor Author

Not sure if these methods belong here but it is required to convert the initial set of String urls into cluster state node names
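The conversion under discussion can be exercised stand-alone; this sketch mirrors the diff above (the wrapping class and the relaxed `throws Exception` are illustrative):

```java
import java.net.URI;
import java.net.URL;

public class NodeNameDemo {
    /** URL to cluster state node name, e.g. http://127.0.0.1:12345/solr -> 127.0.0.1:12345_solr */
    static String getNodeNameFromSolrUrl(String solrUrl) throws Exception {
        URL url = new URI(solrUrl).toURL();
        // Authority keeps host:port; each '/' in the path becomes '_'.
        return url.getAuthority() + url.getPath().replace('/', '_');
    }

    public static void main(String[] args) throws Exception {
        System.out.println(getNodeNameFromSolrUrl("http://127.0.0.1:12345/solr")); // 127.0.0.1:12345_solr
    }
}
```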

Comment on lines 423 to 425
Set<String> liveNodes = new HashSet<>(nodes);
liveNodes.addAll(this.initialNodes);
this.liveNodes = Set.copyOf(liveNodes);
Contributor

Why is this 3 lines of extra copying instead of nothing more than this.liveNodes = Set.copyOf(nodes);? That is, why are we touching / using initialNodes at all here?
Maybe an IllegalArgumentException if nodes.isEmpty.

Contributor Author

My idea was to avoid another loop through if liveNodes was exhausted; initialNodes would always exist in liveNodes with this set method. Let me refactor again based on your suggestion.

Comment on lines 430 to 435
Set<String> set = new HashSet<>();
for (String url : urls) {
String nodeNameFromSolrUrl = getNodeNameFromSolrUrl(url);
set.add(nodeNameFromSolrUrl);
}
return Collections.unmodifiableSet(set);
Contributor

Could easily be converted to a single expression with streams
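A possible streams version might look like the sketch below. One wrinkle: the checked MalformedURLException/URISyntaxException on the original method would block a plain method reference, so this sketch wraps the parse in an unchecked exception (the wrapper and class names are illustrative):

```java
import java.net.URI;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class StreamNodeNames {
    // Unchecked variant so it can be used as a method reference in a stream.
    static String nodeName(String solrUrl) {
        try {
            var url = new URI(solrUrl).toURL();
            return url.getAuthority() + url.getPath().replace('/', '_');
        } catch (Exception e) {
            throw new IllegalArgumentException("Failed to parse base Solr URL " + solrUrl, e);
        }
    }

    // Single expression replacing the explicit loop-and-add.
    static Set<String> nodeNamesFromSolrUrls(List<String> urls) {
        return urls.stream()
            .map(StreamNodeNames::nodeName)
            .collect(Collectors.toUnmodifiableSet());
    }

    public static void main(String[] args) {
        System.out.println(nodeNamesFromSolrUrls(List.of("http://127.0.0.1:8983/solr")));
    }
}
```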

Comment on lines 438 to 443
/** URL to cluster state node name (http://127.0.0.1:12345/solr to 127.0.0.1:12345_solr) */
public String getNodeNameFromSolrUrl(String solrUrl)
throws MalformedURLException, URISyntaxException {
URL url = new URI(solrUrl).toURL();
return url.getAuthority() + url.getPath().replace('/', '_');
}
Contributor

can you find other code in Solr doing this? Surely it's somewhere.

Contributor

should be a static method with a unit test

Contributor Author

Hmm, maybe... I found getBaseUrlForNodeName, but it's going the other way, nodeName->url. Let me look around more.

@@ -229,10 +236,9 @@ > getCacheTimeout()) {
for (String nodeName : liveNodes) {
Contributor

If we exhaust liveNodes, shouldn't we then try the initial configured nodes, and then only failing that, throw an exception?

@dsmiley
Contributor

dsmiley commented Jan 10, 2025

I've simplified my characterization of how I think this should be done:

  • on initialization, copy the configured URLs to a field for safe keeping. Convert to URL if you like (validates if malformed)
  • on initialization, liveNodes shall be empty; we haven't fetched it yet. Ideally we don't even try to on initialization.
  • when getLiveNodes is called and it's not out of date, return it. If it's out of date, loop live nodes and then the initial configured URLs for the first server that will respond to fetch the current live nodes.
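The lookup order in the bullets above can be sketched as a toy cache (class, field, and method names here are hypothetical; the real class is BaseHttpClusterStateProvider, and the injected fetch function stands in for an HTTP request to a node):

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

/** Toy model of the fallback order; structure is illustrative, not Solr's. */
public class LiveNodesCache {
    private final List<String> configuredUrls;         // configured URLs, kept for safe keeping
    private final Function<String, Set<String>> fetch; // fetches live nodes via one address; null if unreachable
    private final long timeoutMs;
    private Set<String> liveNodes = Set.of();          // empty until the first successful fetch
    private long fetchedAtMs = 0;

    public LiveNodesCache(List<String> configuredUrls, long timeoutMs, Function<String, Set<String>> fetch) {
        this.configuredUrls = List.copyOf(configuredUrls);
        this.timeoutMs = timeoutMs;
        this.fetch = fetch;
    }

    public Set<String> getLiveNodes() {
        if (!liveNodes.isEmpty() && System.currentTimeMillis() - fetchedAtMs < timeoutMs) {
            return liveNodes; // cached value is not out of date
        }
        // Try the last known live nodes first, then fall back to the configured URLs.
        Set<String> candidates = new LinkedHashSet<>(liveNodes);
        candidates.addAll(configuredUrls);
        for (String address : candidates) {
            Set<String> fetched = fetch.apply(address);
            if (fetched != null && !fetched.isEmpty()) {
                liveNodes = Set.copyOf(fetched);
                fetchedAtMs = System.currentTimeMillis();
                return liveNodes;
            }
        }
        throw new RuntimeException("Could not fetch live nodes from any live node or configured URL");
    }
}
```

Because the configured URLs are always appended as candidates, the client can recover even after every previously known live node has gone away.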

Contributor Author

@mlbiscoc mlbiscoc left a comment

I've simplified my characterization of how I think this should be done:
on initialization, copy the configured URLs to a field for safe keeping. Convert to URL if you like (validates if malformed)
on initialization, liveNodes shall be empty; we haven't fetched it yet. Ideally we don't even try to on initialization.
when getLiveNodes is called and it's not out of date, return it. If it's out of date, loop live nodes and then the initial configured URLs for the first server that will respond to fetch the current live nodes.

So I went back and basically implemented it the way you described here, and deferred the static-URL and opt-in node discovery option because it probably needs more discussion and implementation details beforehand.

The only thing I didn't do was remove the liveNodes fetch on initialization. Curious why the CSP shouldn't do that? Wouldn't we want to confirm that the URLs passed are live?

Comment on lines 236 to 242
if (updateLiveNodes(liveNodes)) return this.liveNodes;

log.warn(
"Attempt to fetch cluster state from all known live nodes {} failed. Trying backup nodes",
liveNodes);

if (updateLiveNodes(backupNodes)) return this.liveNodes;
Contributor Author

Backup is checked if liveNodes is exhausted. There are a few places where liveNodes is used for fetching but the backup isn't yet. I was thinking of adding it, but that requires a larger refactor than just this bug, plus writing additional tests. Happy to do it though.

Comment on lines 800 to 804
public static String getNodeNameFromSolrUrl(String solrUrl)
throws MalformedURLException, URISyntaxException {
URL url = new URI(solrUrl).toURL();
return url.getAuthority() + url.getPath().replace('/', '_');
}
Contributor Author

I couldn't find a function like this anywhere (maybe someone knows of one?). Regardless, I think if it exists somewhere, it should sit here, because the getBaseUrlForNodeName function right above it does the conversion nodeName->url, and putting url->nodeName right below it makes sense to me. Added unit tests, and also added unit tests for getBaseUrlForNodeName because they didn't exist before.

Contributor

Perfect. (never mind my redundant code review comment)

Contributor

@dsmiley dsmiley left a comment

See SOLR-17607 for why we shouldn't connect on creation.

There's Utils.getBaseUrlForNodeName for one direction so the reverse should go right next to it.

@@ -51,6 +55,7 @@ public abstract class BaseHttpClusterStateProvider implements ClusterStateProvid
private static final Logger log = LoggerFactory.getLogger(MethodHandles.lookup().lookupClass());

private String urlScheme;
private Set<String> backupNodes;
Contributor

Can we name this configuredNodes since that's what it is? Yes we use it as a backup but we could adapt how we use them.

Contributor

Also maybe leave it as a list, as it was provided. Why not use them in the order provided?

@@ -226,17 +232,15 @@ public Set<String> getLiveNodes() {
}
Contributor

Just above this, there's a liveNodes null check. That seems bogus now

Contributor

@dsmiley dsmiley left a comment

You know what... See URLUtil. Shouldn't that be where these methods go? The existing method can be deprecated and call the one at the better location.
Also has a test where your tests of this method can be added to.
(big code base; so many util classes)

This PR is pretty close to merging now.


try {
return getNodeNameFromSolrUrl(url);
} catch (MalformedURLException | URISyntaxException e) {
throw new RuntimeException("Failed to parse base Solr URL " + url, e);
Contributor

IllegalArgumentException seems appropriate


jetty.start(false);
jetty.start();
Contributor

That makes sense... I'll call it out in the commit log. I wonder if there's any down-side to this.

Contributor

I suspect the results of this could be subtle, and nobody would guess this JIRA/PR title would have anything to do with it; thus it should be merged separately. What did you find that led you to incorporate this change into this PR?

Contributor Author

In the tests I incorporated, where I take down and bring back nodes, the URLs changed because the port number changed. I looked at this method and noticed it was the cause. It didn't make sense to me to change the port when you are passing the same jetty back, and the tests didn't seem to be affected, so I didn't think there was an impact from changing it. I can move it out if you'd like; I'd just need to start jetty in the tests instead of from this function to keep the same ports.

Contributor

When faced with curiosities like this, it's useful to do a bit of code/JIRA archeology. git blame (via IntelliJ's margin, BTW) shows the current setting (to pick a random port) was deliberate and had its own JIRA -- https://issues.apache.org/jira/browse/SOLR-9474 and I found a related one: https://issues.apache.org/jira/browse/SOLR-9469 . I read the conversation there. Changing it is worthy of its own JIRA & CHANGES.txt entry, following a deeper attempt to understand the causes behind why it hasn't been reusing a port thus far.

Contributor Author

Thanks for finding that. That's fair. For now, I overloaded the function with a reusePort flag to not break anything; maybe it can be changed to just reuse the port later. Is it fine to do that for now, or should I just call jetty itself and reuse the port in the tests?

@@ -51,6 +54,15 @@ public static void setupCluster() throws Exception {
.resolve("streaming")
.resolve("conf"))
.configure();
cluster.waitForAllNodes(30);
System.setProperty("solr.solrj.cache.timeout.sec", "1");
Contributor

(am a random pull request browser/scanner only here) -- maybe have a matching System.clearProperty during test cleanup to avoid carry-over effects between tests.

Contributor

Christine, Solr's test infrastructure auto-restores all system properties to as they were from test suite to test suite thanks to SystemPropertiesRestoreRule. But not test-method to test-method, albeit that's not pertinent for the line you commented (as it's in a BeforeClass).

@mlbiscoc
Contributor Author

You know what... See URLUtil. Shouldn't that be where these methods go? The existing method can be deprecated and call the one at the better location. Also has a test where your tests of this method can be added to. (big code base; so many util classes)

This PR is pretty close to merging now.

I ended up moving all those functions into URLUtil, which resulted in a much larger code change. Not sure if you meant for me to use the Deprecated annotation; instead I just moved it and refactored the callers.

Also didn't add in SOLR-17607 because it broke a few more tests that needed more changes, so I just held those changes locally and will probably make them in another PR linked to that JIRA, unless you prefer me to do it here in one go.

Contributor

@HoustonPutman HoustonPutman left a comment

This looks good to me. And we can add a flag in the future (in a separate JIRA/PR) to not use the live nodes to fetch the cluster state. (This is to David's point in the JIRA)

Comment on lines +155 to +158
public static String getNodeNameForBaseUrl(String solrUrl)
throws MalformedURLException, URISyntaxException {
URL url = new URI(solrUrl).toURL();
return url.getAuthority() + url.getPath().replace('/', '_');
Contributor Author

After changing configured nodes to List, this function isn't used anymore except for tests.

Contributor

@dsmiley dsmiley left a comment

Keeping the new method that isn't actually used is fine with me; no strong opinion either way.
Just a couple trivial bits of feedback but I think this is ready otherwise. Can you add a CHANGES.txt entry please?

solr/CHANGES.txt Outdated
Comment on lines 179 to 180
* SOLR-17519: SolrJ fix to CloudSolrClient and HTTP ClusterState where it can fail to request cluster state
if its current live nodes list is stale. HTTP ClusterState now retains the initial configured list of passed URLs as backup
Contributor

I don't think Solr users know about "HTTP ClusterState"; it's not likely in their vocabulary. On the other hand, wording like "SolrJ CloudSolrClient configured with Solr URLs" is likely to be understood.

"Attempt to fetch cluster state from all known live nodes {} failed. Trying backup nodes",
liveNodes);

if (configuredNodes.stream().anyMatch((node) -> updateLiveNodes(node.toString())))
Contributor

Shouldn't we check configuredNodes first so as to give the user the possibility of some control, basically addressing the conversation points in JIRA?

Contributor Author

We could, but to me it makes sense to try live nodes first unless this class was explicitly created, as your JIRA proposal suggested, to only use the passed URLs and nothing else.

I think the configured list should be used as a backup: go for fetching from live nodes first and assume by its name it is "live". If we do the opt-in as a feature flag for dynamic node discovery, then that gives users control. I'm fine with flipping it to configured first, though, if you feel strongly about it.

@dsmiley
Contributor

dsmiley commented Jan 22, 2025

I'll merge this in about a day as-is.

@dsmiley dsmiley merged commit b1fe883 into apache:main Jan 24, 2025
4 checks passed
dsmiley pushed a commit that referenced this pull request Jan 27, 2025
…igured URLs as backup (#2935)

SolrJ CloudSolrClient configured with Solr URLs can fail to request cluster state if the nodes in its current live nodes list are all unavailable. The HttpClusterStateProvider now retains the initially configured list of passed URLs as backup.

Utils.getBaseUrlForNodeName moved to URLUtil.
MiniSolrCloudCluster.startJettySolrRunner is overloaded now to choose to reuse the port.

(cherry picked from commit b1fe883)