Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[#6566] improvement(core): Add the cache mechanism for metalake and use cache to load in-use information. #6569

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

yuqi1129
Copy link
Contributor

@yuqi1129 yuqi1129 commented Feb 27, 2025

What changes were proposed in this pull request?

  1. Add cache mechanism for MatalakeManage.
  2. Use cache to load in-use information for metalake and catalog.

This is the Flame diagram after optimization:
image

Why are the changes needed?

Loading in-use from backend storage everytime times when doing catalog operations is very time-comsuming, we'd to optimize it.

Fix: #6566

Does this PR introduce any user-facing change?

N/A.

How was this patch tested?

Add UT and existing tests.

@yuqi1129 yuqi1129 requested review from mchades and jerryshao and removed request for mchades February 27, 2025 12:04
@yuqi1129 yuqi1129 self-assigned this Feb 27, 2025
@@ -60,6 +67,21 @@ public class MetalakeManager implements MetalakeDispatcher {

private final IdGenerator idGenerator;

@VisibleForTesting
static final Cache<NameIdentifier, BaseMetalake> METALAKE_CACHE =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If our goal is to accelerate the acquisition of in-use, it seems that we only need to cache the corresponding in-use value, and do not need to cache BaseMetalake.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cacheing BaseMetalake will take only a little memory and can use it when loadingMetalake by the way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think caching metalake is better, because the amount of metalake is quite limited, with small memory size we can improve the performance a lot, it is worthy to cache the metalake.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you use uppercase for this variable? Typically, we only use uppercase letter for final variable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It indeed has final flag, see static final Cache<NameIdentifier, BaseMetalake> METALAKE_CACHE

@@ -260,7 +260,7 @@ private ModelCatalog asModels() {

private final Config config;

@VisibleForTesting final Cache<NameIdentifier, CatalogWrapper> catalogCache;
@VisibleForTesting static Cache<NameIdentifier, CatalogWrapper> catalogCache;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we make it static? I assume there will be only one CatalogManager, so there should be only one catalogCache, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we make it static?

The method check catalogInUse and metalakeInUse are all static. If we want to use cache for them, we need to change it to static

assume there will be only one CatalogManager, so there should be only one catalogCache, right?

Yes, there will be only one cache and all catalogs shares the same instance, It's not a big problem I think.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should named CATALOG_CACHE?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should it be final?

if (wrapper != null) {
catalogEntity = wrapper.catalog.entity();
} else {
catalogEntity = store.get(catalogIdent, EntityType.CATALOG, CatalogEntity.class);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we put this catalogEntity in cache? Besides, can we use loadcatalog directly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Loading catalogEntity and then transform it to standard CatalogEntity are NOT static method, so I omit it. in fact, in most cases, we can get catalogEntity from the cache as all calls are from catalog operations, and the catalog should be in the cache except the first time.

@jerryshao
Copy link
Contributor

What is the performance gaining after cache?

@yuqi1129
Copy link
Contributor Author

yuqi1129 commented Feb 28, 2025

What is the performance gaining after cache?

The picture attached to this PR is of the performance with the cache, about the picture without cache, please see the picture in corresponding issue

@jerqi jerqi changed the title [#6566] improvement(core): Add the cache mechanism for matalake and use cache to load in-use information. [#6566] improvement(core): Add the cache mechanism for metalake and use cache to load in-use information. Feb 28, 2025
@@ -260,7 +260,7 @@ private ModelCatalog asModels() {

private final Config config;

@VisibleForTesting final Cache<NameIdentifier, CatalogWrapper> catalogCache;
@VisibleForTesting static Cache<NameIdentifier, CatalogWrapper> catalogCache;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should named CATALOG_CACHE?

Comment on lines +135 to +138
BaseMetalake metalake = METALAKE_CACHE.getIfPresent(ident);
if (metalake == null) {
metalake = store.get(ident, EntityType.METALAKE, BaseMetalake.class);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not cache the result after getting it from the store?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see #6569 (comment)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after the user alters the metalake, then load schema/table directly without list/get the metalake, the cache never be hitting, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. alter a metalake is not operated frequently, so I guess it's acceptable

@@ -253,7 +289,7 @@ public BaseMetalake alterMetalake(NameIdentifier ident, MetalakeChange... change
throw new MetalakeNotInUseException(
"Metalake %s is not in use, please enable it first", ident);
}

METALAKE_CACHE.invalidate(ident);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can the result be cached after updating?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put it back to cache seems to be an optional if this optional is not frequently called. Anyway, this is an improvement, let me check if we can add it back.

@yuqi1129
Copy link
Contributor Author

should named CATALOG_CACHE?

The variable does not have a final flag and only has a static flag; accordingly to Java specs, it seems that we still need to use camel-case sytle.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Improvement] Get catalogInUse and metalakeInUse from cache instead of retrieve them from storage
3 participants