-
Notifications
You must be signed in to change notification settings - Fork 700
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SOLR-17519: CloudSolrClient with HTTP ClusterState can forget live nodes and then fail #2935
SOLR-17519: CloudSolrClient with HTTP ClusterState can forget live nodes and then fail #2935
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR!
private Set<String> initialNodes; | ||
volatile Set<String> liveNodes; | ||
volatile Set<String> knownNodes; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right away, I'm surprised. I can understand needing two sets, but 3? IMO knownNodes doesn't need to exists; that's the same as liveNodes. I see below you've changed many references from liveNodes to knownNodes but I recommend against doing that because the exact wording of "liveNodes" is extremely pervasive in SolrCloud (well known concept) so let's not have a vary it in just this class or any class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I originally did it with just liveNodes
but liveNodes is returned to the client which from it's name I was thinking they would assume they are "live". If we just place all the nodes into liveNodes thats not necessarily true, right? If the current set hold all the initially passed nodes, it isn't guaranteed it's live which is why I switched to known nodes while live is what was actually fetched from ZooKeeper.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no guarantee that getLiveNodes is going to return a list of reachable nodes. A moment after returning it, a node could become unreachable. So it's "best effort" and the caller has to deal with failures by trying the other nodes in the list and/or getting a possibly refreshed list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thats fair. Removed known nodes and put everything into live nodes while using the initial set inside all the time.
@@ -65,6 +68,8 @@ public void init(List<String> solrUrls) throws Exception { | |||
urlScheme = solrUrl.startsWith("https") ? "https" : "http"; | |||
try (SolrClient initialClient = getSolrClient(solrUrl)) { | |||
this.liveNodes = fetchLiveNodes(initialClient); | |||
this.initialNodes = Set.copyOf(liveNodes); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the idea of this JIRA issue is that we'll take the Solr URLs as configured and use this is the initial / backup liveNodes. I think this is a very simple idea to to document/understand/implement.
private static JettySolrRunner jettyNode1; | ||
private static JettySolrRunner jettyNode2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
static fields in tests that refer to Solr should be null'ed out, so there's some burden to them. Here... I think you could avoid these altogether and simply call cluster.getJettySolrRunner(0)
which isn't bad!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could add a convenience method if you like.
} | ||
|
||
private void waitForCSPCacheTimeout() throws InterruptedException { | ||
Thread.sleep(6000); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test should set the system property solr.solrj.cache.timeout.sec
to maybe 1ms and then you can merely sleep for like a 100 milliseconds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cache timeout setting is in seconds and pulls an int. Can't make it 1ms. This probably should have had better granularity instead of increments of seconds.
I could change cacheTimeout to take milliseconds as the int instead but that would change peoples system property not using the default. So if they set the cache timeout to 5 seconds, it now turns into 5ms if they are not aware of this change.
@@ -61,6 +67,7 @@ public abstract class BaseHttpClusterStateProvider implements ClusterStateProvid | |||
private int cacheTimeout = EnvUtils.getPropertyAsInteger("solr.solrj.cache.timeout.sec", 5); | |||
|
|||
public void init(List<String> solrUrls) throws Exception { | |||
this.initialNodes = getNodeNamesFromSolrUrls(solrUrls); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the idea of this JIRA issue is that we'll take the Solr URLs as configured and use this is the initial / backup liveNodes. I think this is a very simple idea to to document/understand/implement.
That makes sense. This leaves me with a few questions then. Shouldn't this take a list of URL
or URI
java object to verify actual non malformed URLs instead of a list of strings? I created the functions below to convert these Strings into cluster state nodeNames for liveNodes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great idea to do basic URL malformed checks like this
public Set<String> getNodeNamesFromSolrUrls(List<String> urls) | ||
throws URISyntaxException, MalformedURLException { | ||
Set<String> set = new HashSet<>(); | ||
for (String url : urls) { | ||
String nodeNameFromSolrUrl = getNodeNameFromSolrUrl(url); | ||
set.add(nodeNameFromSolrUrl); | ||
} | ||
return Collections.unmodifiableSet(set); | ||
} | ||
|
||
/** URL to cluster state node name (http://127.0.0.1:12345/solr to 127.0.0.1:12345_solr) */ | ||
public String getNodeNameFromSolrUrl(String solrUrl) | ||
throws MalformedURLException, URISyntaxException { | ||
URL url = new URI(solrUrl).toURL(); | ||
return url.getAuthority() + url.getPath().replace('/', '_'); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if these methods belong here but it is required to convert the initial set of String urls into cluster state node names
Set<String> liveNodes = new HashSet<>(nodes); | ||
liveNodes.addAll(this.initialNodes); | ||
this.liveNodes = Set.copyOf(liveNodes); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this 3 lines of extra copy-ing instead of nothing more than: this.liveNodes = Set.copyOf(nodes);
? That is, why are we touching / using initialNodes at all here?
Maybe an IllegalArgumentException if nodes.isEmpty.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My idea was to not do another loop through if liveNodes was exhausted. initalNodes
would always exist in liveNodes with this set method. Let me refactor again from your suggestion comment
Set<String> set = new HashSet<>(); | ||
for (String url : urls) { | ||
String nodeNameFromSolrUrl = getNodeNameFromSolrUrl(url); | ||
set.add(nodeNameFromSolrUrl); | ||
} | ||
return Collections.unmodifiableSet(set); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could easily be converted to a single expression with streams
/** URL to cluster state node name (http://127.0.0.1:12345/solr to 127.0.0.1:12345_solr) */ | ||
public String getNodeNameFromSolrUrl(String solrUrl) | ||
throws MalformedURLException, URISyntaxException { | ||
URL url = new URI(solrUrl).toURL(); | ||
return url.getAuthority() + url.getPath().replace('/', '_'); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you find other code in Solr doing this? Surely it's somewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be a static method that with a unit test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm maybe... I found getBaseUrlForNodeName
but it's going the other way nodeName->url. Let me look around more
@@ -229,10 +236,9 @@ > getCacheTimeout()) { | |||
for (String nodeName : liveNodes) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we exhaust liveNodes, shouldn't we then try the initial configured nodes, and then only failing that, throw an exception?
I simplified characterization of how I think this should be done:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I simplified characterization of how I think this should be done:
on initialization, copy the configured URLs to a field for safe keeping. Convert to URL if you like (validates if malformed)
on initialization, liveNodes shall be empty; we haven't fetched it yet. Ideally we don't even try to on initialization.
when getLiveNodes is called and it's not out of date, return it. If it's out of date, loop live nodes and then the initial configured URLs for the first server that will respond to fetch the current live nodes.
So I went back and basically implemented the way you put here and deferred the static URL and opt-in node discovery option because it probably needs more discussion and implementation details before hand.
Only thing I didn't do was remove the liveNode fetch on initialization. Curious why CSP shouldn't do that? Wouldn't we want to confirm that the URLs passed are live?
if (updateLiveNodes(liveNodes)) return this.liveNodes; | ||
|
||
log.warn( | ||
"Attempt to fetch cluster state from all known live nodes {} failed. Trying backup nodes", | ||
liveNodes); | ||
|
||
if (updateLiveNodes(backupNodes)) return this.liveNodes; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Backup is checked if liveNodes is exhausted. There are a few places that liveNodes is used for fetching but backup isn't now. Was thinking of adding it, but that requires a larger refactor from just this bug and writing additional tests. Happy to do it though.
public static String getNodeNameFromSolrUrl(String solrUrl) | ||
throws MalformedURLException, URISyntaxException { | ||
URL url = new URI(solrUrl).toURL(); | ||
return url.getAuthority() + url.getPath().replace('/', '_'); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't find anywhere where this is a function like this (Maybe you someone knows?). Regardless, I think if it exists somewhere, it should sit here because the getBaseUrlForNodeName
function right above it does the conversion nodeName->Url
and putting url->nodeName
right below it makes sense to me. Added unit tests and just added unit tests for getBaseUrlForNodeName
because it didn't exist before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perfect. (never mind my redundant code review comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See SOLR-17607 for why we shouldn't connect on creation.
There's Utils.getBaseUrlForNodeName
for one direction so the reverse should go right next to it.
@@ -51,6 +55,7 @@ public abstract class BaseHttpClusterStateProvider implements ClusterStateProvid | |||
private static final Logger log = LoggerFactory.getLogger(MethodHandles.lookup().lookupClass()); | |||
|
|||
private String urlScheme; | |||
private Set<String> backupNodes; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we name this configuredNodes since that's what it is? Yes we use it as a backup but we could adapt how we use them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also maybe leave as a list as it was provided. Why not use them in the order provided.
solr/solrj/src/java/org/apache/solr/client/solrj/impl/BaseHttpClusterStateProvider.java
Show resolved
Hide resolved
@@ -226,17 +232,15 @@ public Set<String> getLiveNodes() { | |||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just above this, there's a liveNodes null check. That seems bogus now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You know what... See URLUtil. Shouldn't that be where these methods go? The existing method can be deprecated and call the one at the better location.
Also has a test where your tests of this method can be added to.
(big code base; so many util classes)
This PR is pretty close to merging now.
@@ -51,6 +55,7 @@ public abstract class BaseHttpClusterStateProvider implements ClusterStateProvid | |||
private static final Logger log = LoggerFactory.getLogger(MethodHandles.lookup().lookupClass()); | |||
|
|||
private String urlScheme; | |||
private Set<String> backupNodes; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also maybe leave as a list as it was provided. Why not use them in the order provided.
try { | ||
return getNodeNameFromSolrUrl(url); | ||
} catch (MalformedURLException | URISyntaxException e) { | ||
throw new RuntimeException("Failed to parse base Solr URL " + url, e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IllegalArgumentException seems appropriate
public static String getNodeNameFromSolrUrl(String solrUrl) | ||
throws MalformedURLException, URISyntaxException { | ||
URL url = new URI(solrUrl).toURL(); | ||
return url.getAuthority() + url.getPath().replace('/', '_'); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perfect. (never mind my redundant code review comment)
jetty.start(false); | ||
jetty.start(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense... I'll call it out in the commit log. I wonder if there's any down-side to this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect the results of this could be subtle, and nobody would guess this JIRA/PR-title would have anything to do with this. Thus should be merged separately. What did you find that led you to incorporate this change into this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tests I incorporated where I take down and bring back nodes, the URLs changed because port number changed. So I looked at this method noticed this was the cause. It didn't make sense to me change the port when you are passing the same jetty back and tests didn't seem to be effected so I didn't think there was an impact to change. I can move it out if you'd like, I just need to then start jetty in the tests instead of from this function to keep the same ports.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When faced with curiosities like this, it's useful to do a bit of code/JIRA archeology. git blame
(via IntelliJ's margin, BTW) shows the current setting (to pick a random port) was deliberate and had its own JIRA -- https://issues.apache.org/jira/browse/SOLR-9474 and I found a related one: https://issues.apache.org/jira/browse/SOLR-9469 . I read the conversation there. Changing it is worthy of its own JIRA & CHANGES.txt entry, following a deeper attempt to understand the causes behind why it hasn't been reusing a port thus far.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for find that. Thats fair. For now, I overloaded the function with a reusePort
to not break anything and maybe change it to just reuse port later. Is it fine to do that for now, or should I just be calling jetty itself and reuse the port in the tests?
@@ -51,6 +54,15 @@ public static void setupCluster() throws Exception { | |||
.resolve("streaming") | |||
.resolve("conf")) | |||
.configure(); | |||
cluster.waitForAllNodes(30); | |||
System.setProperty("solr.solrj.cache.timeout.sec", "1"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(am random pull request browser/scanner only here) -- maybe have matching System.clearProperty
during test cleanup to avoid carry-ever effects between tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Christine, Solr's test infrastructure auto-restores all system properties to as they were from test suite to test suite thanks to SystemPropertiesRestoreRule. But not test-method to test-method, albeit that's not pertinent for the line you commented (as it's in a BeforeClass).
I ended up moving all those functions into Also didn't add in SOLR-17607 because it broke a few more tests that needed more changes so just held the changes locally and will probably just make it in another PR and link it to that JIRA unless you prefer me to do it here in one go. |
solr/solrj/src/java/org/apache/solr/client/solrj/impl/BaseHttpClusterStateProvider.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me. And we can add a flag in the future (in a separate JIRA/PR) to not use the live nodes to fetch the cluster state. (This is to David's point in the JIRA)
solr/solrj/src/java/org/apache/solr/client/solrj/impl/BaseHttpClusterStateProvider.java
Outdated
Show resolved
Hide resolved
public static String getNodeNameForBaseUrl(String solrUrl) | ||
throws MalformedURLException, URISyntaxException { | ||
URL url = new URI(solrUrl).toURL(); | ||
return url.getAuthority() + url.getPath().replace('/', '_'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After changing configured nodes to List, this function isn't used anymore except for tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keeping the new method that isn't actually used is fine with me; no strong opinion either way.
Just a couple trivial bits of feedback but I think this is ready otherwise. Can you add a CHANGES.txt entry please?
solr/solrj/src/test/org/apache/solr/client/solrj/impl/ClusterStateProviderTest.java
Show resolved
Hide resolved
solr/solrj/src/java/org/apache/solr/client/solrj/impl/BaseHttpClusterStateProvider.java
Outdated
Show resolved
Hide resolved
solr/CHANGES.txt
Outdated
* SOLR-17519: SolrJ fix to CloudSolrClient and HTTP ClusterState where it can fail to request cluster state | ||
if its current live nodes list is stale. HTTP ClusterState now retains the initial configured list of passed URLs as backup |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think Solr users know about "HTTP ClusterState"; it's not likely in their vocabulary. On the other hand, wording like "SolrJ CloudSolrClient configured with Solr URLs" is likely to be understood.
"Attempt to fetch cluster state from all known live nodes {} failed. Trying backup nodes", | ||
liveNodes); | ||
|
||
if (configuredNodes.stream().anyMatch((node) -> updateLiveNodes(node.toString()))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we check configuredNodes first so as to give the user the possibility of some control, basically addressing the conversation points in JIRA?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could but to me it makes sense to try live nodes first unless this class was explicitly created as your JIRA proposal pointed, in that it only uses the passed URLs and nothing else.
I think it should be used as backup and go for fetching from live nodes first and assume by its name it is "live". If we do the opt in as a feature flag for dynamic node discovery then that gives them control. I'm fine with flipping it with configured first though if you feel strongly about it.
I'll merge this in about a day as-is. |
…igured URLs as backup (#2935) SolrJ CloudSolrClient configured with Solr URLs can fail to request cluster state if its current live nodes list are all unavailable. The HttpClusterStateProvider now retains the initial configured list of passed URLs as backup. Utils.getBaseUrlForNodeName moved to URLUtil. MiniSolrCloudCluster.startJettySolrRunner is overloaded now to choose to reuse the port. (cherry picked from commit b1fe883)
https://issues.apache.org/jira/browse/SOLR-17519
Description
In
BaseHttpClusterStateProvider
if all initially passed nodes except for 1 go down, then CSP will only be aware of the 1 live node. If that node were to go down and any of the other initially passed nodes were to recover, CSP would not be able to connect to get latest cluster state of live nodes because it only holds the address of the now downed node.Solution
CSP holds immutable
initialNodes
which is used to never remove this set from live nodes to keep it resilient. Now if the case above were to occur, CSP would still be able to get cluster state and latest set of live nodes because live nodes always has the initial set. Live nodes can also hold new nodes that are added to the cluster after CSP initialization but those are removable unlike the initial nodes.Tests
testClusterStateProviderDownedInitialLiveNodes
tests the above test case and used to fail.testClusterStateProviderLiveNodesWithNewHost
tests live nodes with a 3rd added new host after CSP is initialized which is removable but initial nodes are always still present to be reachable by CSP.Checklist
Please review the following and check all that apply:
main
branch../gradlew check
.