Skip to content

Introduces a separate thread in the UserManager. #21857

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: devel
Choose a base branch
from

Conversation

johann-listunov
Copy link
Contributor

@johann-listunov johann-listunov commented Jul 15, 2025

Add a separate thread to the UserManger to handle the load of the user-cache asynchronously.
This pre-loads the user-cache as fast as possible (as soon as the global version changes)
Also, compared to the old version, this allows the read-only operation to work on the user-cache
for longer and not block each other while trying to load a new version. On a high load environment this also
means a more responsive UserManger.

Scope & Purpose

  • 💩 Bugfix
  • 🍕 New feature
  • 🔥 Performance improvement
  • 🔨 Refactoring/simplification

Checklist

  • Tests
    • Regression tests
    • C++ Unit tests
    • integration tests
    • resilience tests
  • 📖 CHANGELOG entry made
  • 📚 documentation written (release notes, API changes, ...)
  • Backports
    • Backport for 3.12.0: (Please link PR)
    • Backport for 3.11: (Please link PR)
    • Backport for 3.10: (Please link PR)

Related Information

(Please reference tickets / specification / other PRs etc)

  • Docs PR:
  • Enterprise PR:
  • GitHub issue / Jira ticket:
  • Design document:

@cla-bot cla-bot bot added the cla-signed label Jul 15, 2025

std::lock_guard guard{_loadFromDBLock}; // must be first
WRITE_LOCKER(writeGuard, _userCacheLock); // must be second
READ_LOCKER(readGuard, _userCacheLock); // must be second
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove comment.

@@ -379,27 +409,34 @@ VPackBuilder auth::UserManager::allUsers() {
}

void auth::UserManager::triggerCacheRevalidation() {
triggerLocalReload();
uint64_t versionToReloadTo = globalVersion() + 1;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can be const.

}
}
// This test is not really applicable anymore.
// Neither triggerGlobalReload or triggerLocal reload are incrementing the
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Neither triggerGlobalReload or triggerLocal reload are incrementing the
// Neither triggerGlobalReload or triggerLocalReload are incrementing the

@@ -229,7 +229,7 @@ function testSuite() {
}
let aliveStatus = waitForAlive(30, coordinator.url, { auth: { bearer: jwt } });
// note: this should actually work, but currently doesn't TODO
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove comment?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd think yes.

@@ -1,6 +1,12 @@
devel
-----

* Add a separate thread to the UserManger to handle the load of the user-cache asynchronously.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Add a separate thread to the UserManger to handle the load of the user-cache asynchronously.
* Add a separate thread to the UserManager to handle the load of the user-cache asynchronously.

This pre-loads the user-cache as fast as possible (as soon as the global version changes)
Also, compared to the old version, this allows the read-only operation to work on the user-cache
for longer and not block each other while trying to load a new version. On a high load environment this also
means a more responsive UserManger.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
means a more responsive UserManger.
means a more responsive UserManager.

Comment on lines +4 to +8
* Add a separate thread to the UserManger to handle the load of the user-cache asynchronously.
This pre-loads the user-cache as fast as possible (as soon as the global version changes)
Also, compared to the old version, this allows the read-only operation to work on the user-cache
for longer and not block each other while trying to load a new version. On a high load environment this also
means a more responsive UserManger.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's valuable to note the motivation for this, namely that this avoids the HTTP server threads getting blocked while the user-cache is being updated.

Comment on lines -183 to 194
void auth::UserManager::loadFromDB() {
TRI_ASSERT(ServerState::instance()->isSingleServerOrCoordinator());

if (_internalVersion.load(std::memory_order_acquire) == globalVersion()) {
void auth::UserManager::loadUserCacheAndStartUpdateThread() noexcept {
if (_usersInitialized.load(std::memory_order_relaxed) == true) {
return;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this method might be called multiple times, but not concurrently? Otherwise this would not work.

LOG_TOPIC("ef78c", INFO, Logger::AUTHENTICATION) << "Preloading user cache";
auto const start = chrono::system_clock::now();
int64_t lastLog = 0;
int64_t constexpr LogEverySeconds = 3;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
int64_t constexpr LogEverySeconds = 3;
int64_t constexpr logEverySeconds = 3;

// } catch (...) {
// FAIL();
// }
//}

TEST_F(UserManagerClusterTest, cacheRevalidationShouldKeepVersionsInLine) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes to the used failure point defeat the purpose of this test. It's arguable whether the failure point should be changed (e.g., move it to the user manager thread, but instead of throwing an exception set a special value that can be observed here); and/or whether the test should be rewritten instead, as it doesn't quite match up any longer with the way the user manager works now.

@@ -166,10 +172,6 @@ TEST_F(UserManagerClusterTest, cacheRevalidationShouldKeepVersionsInLine) {

TEST_F(UserManagerClusterTest,
triggerLocalReloadShouldNotUpdateClusterVersion) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, the changes to the failure point defeat the purpose of this test, at least of the second part. Again, it's arguable whether the test should be written differently.

@@ -203,10 +205,6 @@ TEST_F(UserManagerClusterTest,
}

TEST_F(UserManagerClusterTest, triggerGlobalReloadShouldUpdateClusterVersion) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, the second part of the test at least must be done differently now to still make sense.

@@ -229,7 +229,7 @@ function testSuite() {
}
let aliveStatus = waitForAlive(30, coordinator.url, { auth: { bearer: jwt } });
// note: this should actually work, but currently doesn't TODO
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd think yes.

@@ -468,22 +473,27 @@ class IResearchAnalyzerFeatureTest
auto* userManager = authFeature.userManager();
if (userManager != nullptr) {
userManager->removeAllUsers();
userManager->shutdown();
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might not need the user manager thread at all, and same for the failure points. Same for the other tests.

Copy link
Contributor

@jvolmer jvolmer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already discussed some stuff about trying to get rid of the failure points and possibly waiting for a version change in the tests, so here just some small comments I noticed.

Comment on lines +151 to +155
// this is only needed in unittest this
// will shutdown the running thread on demand
// its needed because the failure point can be deactivated before the thread
// is finished and can lead to calls on the server that are not initialized
// properly in the unit-test environment
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// this is only needed in unittest this
// will shutdown the running thread on demand
// its needed because the failure point can be deactivated before the thread
// is finished and can lead to calls on the server that are not initialized
// properly in the unit-test environment
// This is only needed in unittest:
// This will shutdown the running thread on demand. It's needed because the
// failure point can be deactivated before the thread is finished and can lead
// to calls on the server that are not initialized properly in the unit-test
// environment.

Comment on lines +168 to +169
// for test that was previously in the loadFromDB call, that is now in
// a seperate thread
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// for test that was previously in the loadFromDB call, that is now in
// a seperate thread
// for a test that was previously in the loadFromDB call which is now executed in
// a separate thread

}

TRI_ASSERT(ServerState::instance()->isSingleServerOrCoordinator());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems weird to me to do the assertion here, this feels random, would be move this directly to the start of the function. And I would also like to have this assertion at the start of loadUserCacheAndStartUpdateThread.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants