
feat(v2): metadata label query #3749

Open · kolesnikovae wants to merge 10 commits into main from feat/metadata-label-query

Conversation

kolesnikovae
Collaborator

@kolesnikovae kolesnikovae commented Dec 9, 2024

The PR introduces metadata labels and a query interface. Metadata labels are attached by the segment writer and are utilized during the read path to filter out blocks irrelevant to the query without needing to access the blocks' TSDB index (acting as a form of skip index).

Currently, the primary use case for these labels is profile-type queries. Instead of fetching all entries in the query frontend for post-filtering, labels such as profile_type and service_name are queried directly through the new interface. This approach avoids heavy processing and only retrieves a minimal set of data.
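
As a rough illustration of the skip-index behaviour described above (a sketch only: BlockMeta, LabelPair, and the functions below are hypothetical stand-ins, not the actual Pyroscope types):

```go
// Hypothetical sketch: prune block metadata entries by a label matcher
// before ever opening the blocks' TSDB indexes.
type LabelPair struct{ Name, Value string }

type BlockMeta struct {
    ID     string
    Labels []LabelPair // attached by the segment writer
}

// hasLabel reports whether a block carries the requested label pair.
func hasLabel(b BlockMeta, name, value string) bool {
    for _, l := range b.Labels {
        if l.Name == name && l.Value == value {
            return true
        }
    }
    return false
}

// pruneBlocks keeps only the blocks relevant to the query; the discarded
// blocks never have their TSDB index read.
func pruneBlocks(blocks []BlockMeta, name, value string) []BlockMeta {
    var kept []BlockMeta
    for _, b := range blocks {
        if hasLabel(b, name, value) {
            kept = append(kept, b)
        }
    }
    return kept
}
```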

In the future, we may consider implementing an in-memory inverted index at the metadata shard level to accelerate metadata queries and optimize memory usage.

@kolesnikovae kolesnikovae marked this pull request as ready for review December 9, 2024 12:59
@kolesnikovae kolesnikovae requested a review from a team as a code owner December 9, 2024 12:59
Base automatically changed from feat/metadata-string-interning to main December 10, 2024 10:48
@kolesnikovae kolesnikovae marked this pull request as draft December 10, 2024 12:53
@kolesnikovae kolesnikovae force-pushed the feat/metadata-label-query branch from a8dfdb7 to 43910d3 on December 11, 2024 04:00
@kolesnikovae kolesnikovae force-pushed the feat/metadata-label-query branch from 43910d3 to 5b2893c on December 11, 2024 04:03
@kolesnikovae kolesnikovae marked this pull request as ready for review December 11, 2024 09:11
@kolesnikovae kolesnikovae requested a review from aleks-p December 12, 2024 12:02
@aleks-p
Contributor

aleks-p commented Dec 12, 2024

I took a quick look and I think I am lacking some context.

The service name is already assigned to metastorev1.Dataset, so we filter datasets based on it today. The same goes for profile types: we store them as a compact array, but we are introducing a larger one containing combinations of service names (always 1) and profile types (a variable number), plus repeated label pair counts for each combination.

Is the idea that this becomes a more flexible system for storing and querying other labels in the future? If yes, have we thought about how the proposed storage will scale (aka, do we need to store all combinations, can we skip service name)? If not, it feels like these queries are already answered efficiently today and perhaps we can postpone doing this.

One benefit I see is moving some of the computation from the query frontend to the metastore, as well as sending less data around; however, I feel like this could be done without changing the storage.

Again, I am probably missing something so take this with a grain of salt :)

@kolesnikovae
Collaborator Author

kolesnikovae commented Dec 13, 2024

The service name is already assigned to metastorev1.Dataset, so we filter datasets based on it today. The same goes for profile types: we store them as a compact array

The implementation was intended for the prototype stage, not for production use. It was quick and dirty, designed just to make things work (somehow).

Is the idea that this becomes a more flexible system for storing and querying other labels in the future?

The change resolves a major performance issue that prevents using the metastore in production deployments:

Instead of fetching all entries in the query frontend for post-filtering, labels such as profile_type and service_name are queried directly through the new interface. This approach avoids heavy processing and only retrieves a minimal set of data.

This is the first PR in a series of improvements. Among other things, I plan to:

  • Implement follower reads. It is less critical if we block a follower's index (and raft log replication). We have everything in place, but the client does not support such a scenario yet.
  • Limit the metadata queries (something we discussed recently).
  • Implement fine-grained locking at the metadata index level.
  • Consider custom binary formats for metadata entries to avoid wasteful allocations.

If yes, have we thought about how the proposed storage will scale (aka, do we need to store all combinations, can we skip service name)?

It's not the primary goal of the change. Regarding the data model and the storage layer:

In the future, we may consider implementing an in-memory inverted index at the metadata shard level to accelerate metadata queries and optimize memory usage.

This will allow us to eliminate the full scan we currently have to do: we will only need to read a very limited piece of information on dimensions (like our TSDB index), instead of accessing each block metadata entry.
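
As a rough sketch of that direction (the names and shapes below are assumptions, not the planned implementation), a per-shard in-memory inverted index could map each label pair to the set of blocks carrying it:

```go
// Hypothetical per-shard inverted index: label pair -> block IDs.
type labelKey struct{ Name, Value string }

type invertedIndex struct {
    postings map[labelKey]map[string]struct{}
}

func newInvertedIndex() *invertedIndex {
    return &invertedIndex{postings: make(map[labelKey]map[string]struct{})}
}

// Add registers a block under each of its label pairs.
func (ix *invertedIndex) Add(blockID string, labels map[string]string) {
    for name, value := range labels {
        k := labelKey{name, value}
        if ix.postings[k] == nil {
            ix.postings[k] = make(map[string]struct{})
        }
        ix.postings[k][blockID] = struct{}{}
    }
}

// Lookup returns the blocks carrying the given label pair, replacing
// the full scan over every metadata entry in the shard.
func (ix *invertedIndex) Lookup(name, value string) []string {
    ids := make([]string, 0, len(ix.postings[labelKey{name, value}]))
    for id := range ix.postings[labelKey{name, value}] {
        ids = append(ids, id)
    }
    return ids
}
```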

The current implementation does not introduce any noticeable memory or disk overhead. Looking at the snapshot size chart, I can't even tell exactly when I deployed the new version.

[image: snapshot size chart]

If not, it feels like these queries are already answered efficiently today and perhaps we can postpone doing this.

I haven't found traces of the queries (actually, there were at least two queries), but the profiles are quite self-explanatory.

[images: profiles]

The recently implemented string interning has mitigated the issue partially, but it was still very inefficient.

The Pyroscope UI needs the list of services and their profile types to build the main view. Currently, to get this information, the query-frontend service fetches all the metadata entries of the tenant within the time range. What's worse, we preserve the full metadata entry body, because we're querying for all services. Even in a moderately small deployment, we'd need to transfer and process hundreds of megabytes of protobuf data from metastore to query frontend.

Note that the issue may not arise immediately but surfaces as more data accumulates. The problem gets much worse when compaction is delayed.

One benefit I see is moving some of the computation from the query frontend to the metastore, as well as sending less data around, however I feel like this could be done without changing the storage.

This is the only benefit the change is aiming at. The technique is called "predicate pushdown": it's less effective and efficient without the index I mentioned above, but it solves the problem for now and for the near future.

Of course, it could be implemented in a different way. I decided to implement it this way for a number of reasons. Among them:

  • Postponing the change will complicate the implementation due to compatibility guarantees.
  • The concept of labels/dimensions aligns perfectly with the inverted index model and can also be leveraged for skip indexes. This approach is natural for document-oriented stores, which is essentially what the metastore is.
  • The existing APIs and the data model we adhere to suggest using labels: our main client – the UI – accesses this data via the Series API, which is all about labels (ProfileTypes is still used in the Grafana Explore view, though; I hope we deprecate it one day).
  • It is extendable, indeed:
    • One option I see is service collocation: having a separate dataset for each individual service might not always be a) efficient or b) possible. For example, currently, OTel knows nothing about services – I hope this will be addressed, but I can see how a similar issue might arise in a different context.
    • Another example: storing build identifiers of mappings as service labels. This way, we can pre-fetch symbolic information for a query before accessing the data blocks to avoid synchronous symbolication.

@aleks-p
Contributor

aleks-p commented Jan 2, 2025

I should have been clearer; my doubts are about something else.

I understand that the purpose is to discard as many blocks as possible as early as possible. I have no doubt this is more efficient than the filtering we are doing currently in the query frontend. My point was that this idea seems doable without a change to the block meta storage format, and I am trying to gauge whether the added complexity in how labels are stored is worth it at this time. I expect the answer to be "yes", but I was hoping for some more reasoning around this.

Comment on lines +103 to +104
if i == skip {
    skip += int(v)*2 + 1
Contributor

This wasn't obvious to me on my first pass. I think it has to do with storing the same labels multiple times, but I am unsure why we do that. A comment here could help. Alternatively, an explanation here or elsewhere on why labels are stored in this particular way would also be helpful.

Collaborator Author

@kolesnikovae kolesnikovae Jan 3, 2025

I'll add a comment.

This is how multiple dimensions are handled: when an entity has multiple relations and the representation is denormalized, there is some redundancy (it should be at least 1NF – but this is not a relational DB, therefore we have a lot of freedom here).

For example, for now, each dataset is described by a collection of label sets:

service_name=A, profile_type=cpu
service_name=A, profile_type=memory

If we store multiple services in the same dataset, or extend the list attributes available at the query time, we can represent it as follows:

service_name=A, profile_type=cpu
service_name=A, profile_type=memory
service_name=B, namespace=N, profile_type=cpu
service_name=B, namespace=N, profile_type=memory
service_name=C, namespace=N, profile_type=cpu
service_name=C, namespace=N, profile_type=memory

Or, if we want to add arbitrary service attributes (they don't have to be present in the block's TSDB index):

service_name=A, attr_1=Foo
service_name=B, attr_1=Bar

For example, the approach above can be used for cost attribution or any similar breakdown analysis.

The length-prefixed encoding of KV pairs is the simplest way to represent that:

len(2) | k1 | v1 | k2 | v2 | len(3) | k1 | v3 | k2 | v4 | k3 | v5

It's not meant to be ultra-efficient in terms of space because, ultimately, we will be using an inverted index to store and query them (e.g., we could probably reuse the TSDB index, but we wouldn't). In addition, keep two things in mind: 1) varint encoding in protobuf and 2) page organization in BoltDB – these are the reasons why the impact on encoding/decoding and snapshot size is barely noticeable.
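
To make the skip arithmetic from the diff above concrete, here is a minimal decoding sketch based on the layout described here; the function name is hypothetical:

```go
// pairs is the flat encoding described above:
//   count | k1 | v1 | ... | kN | vN | count | ...
// where keys and values are indices into the string table.
// Skipping a whole set means advancing past its count plus 2*count
// references — hence skip += int(v)*2 + 1 in the diff.
func forEachLabelSet(pairs []int32, fn func(set []int32)) {
    for i := 0; i < len(pairs); {
        n := int(pairs[i]) // number of pairs in this set
        end := i + 1 + n*2 // the count itself + n key/value references
        if end > len(pairs) {
            return // malformed input; a real implementation would return an error
        }
        fn(pairs[i+1 : end]) // k1, v1, ..., kN, vN
        i = end
    }
}
```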

Contributor

Thanks for expanding on it. What are the advantages of this storage, compared to storing unique label pairs?

This:

service_name=A, profile_type=cpu
service_name=A, profile_type=memory
service_name=B, namespace=N, profile_type=cpu
service_name=B, namespace=N, profile_type=memory
service_name=C, namespace=N, profile_type=cpu
service_name=C, namespace=N, profile_type=memory

Becomes:

service_name=A, service_name=B, service_name=C, profile_type=cpu, profile_type=memory, ...

The only reason I am asking is the current usage, where multi-dimensionality doesn't seem required and queries are simple. Correct me if I am wrong, but we are currently either looking for a single service name or using an empty query expression (for profile types). From my understanding, however, we plan to use this differently later on, so let's keep it, with a comment providing some context.

type LabelBuilder struct {
    strings  *StringTable
    labels   []int32
    constant []int32
Contributor

nit: should this be constants?

Collaborator Author

@kolesnikovae kolesnikovae Jan 3, 2025

Can you explain why? This is the constant part of the label sets produced with the builder – see WithConstantPairs.
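
To illustrate what the constant part means here (a toy sketch only, not the actual LabelBuilder, which stores string-table references rather than raw strings):

```go
// Toy illustration of the idea only: constPairs (e.g. service_name=A) are
// shared by every emitted label set, while the varying part changes per set.
type toyBuilder struct {
    constPairs []string // k1, v1, k2, v2, ...
    sets       [][]string
}

func (b *toyBuilder) put(pairs ...string) {
    set := append(append([]string{}, b.constPairs...), pairs...)
    b.sets = append(b.sets, set)
}

// Usage:
//   b := &toyBuilder{constPairs: []string{"service_name", "A"}}
//   b.put("profile_type", "cpu")    // -> service_name=A, profile_type=cpu
//   b.put("profile_type", "memory") // -> service_name=A, profile_type=memory
```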

Contributor

We are using a singular form for something representing multiple values (label pairs). It is a confusing name for me, but if you think it is fine we can keep it.

    return resp, err
}

func (svc *MetadataQueryService) queryMetadataLabels(
Contributor

Not specific to this PR, and we don't have to do it now, but we should add tracing to the metastore APIs.

@kolesnikovae
Collaborator Author

Thanks for the review, Aleks!

My point was that this idea seems doable without a change to the block meta storage format

I believe you can find answers in:

  Of course, it could be implemented in a different way. I decided to implement it this way for a number of reasons. Among them:

  • Postponing the change will complicate the implementation due to compatibility guarantees.
  • The concept of labels/dimensions aligns perfectly with the inverted index model and can also be leveraged for skip indexes. This approach is natural for document-oriented stores, which is essentially what the metastore is.
  • The existing APIs and the data model we adhere to suggest using labels: our main client – the UI – accesses this data via the Series API, which is all about labels (ProfileTypes is still used in the Grafana Explore view, though; I hope we deprecate it one day).
  • It is extendable, indeed:
    • One option I see is service collocation: having a separate dataset for each individual service might not always be a) efficient or b) possible. For example, currently, OTel knows nothing about services – I hope this will be addressed, but I can see how a similar issue might arise in a different context.
    • Another example: storing build identifiers of mappings as service labels. This way, we can pre-fetch symbolic information for a query before accessing the data blocks to avoid synchronous symbolication.

If you disagree with any of the above or don't find it justified, I'm open to your input, but I ask that you be specific:

I am trying to gauge whether the added complexity in how labels are stored is worth it at this time. I expect the answer to be "yes", but I was hoping for some more reasoning around this.

This does not help me understand your concerns. Could you please clarify your expectations?

Contributor

@aleks-p aleks-p left a comment

This does not help me understand your concerns. Could you please clarify your expectations?

My concern was adding complexity for something we don't need right now but might need in the future. Future-proofing can be good, especially if we have a plan to take a certain direction or if we know the cost of doing the work later will be too high. It can also be bad if we never follow up on this and end up with code that can't be reasoned about.

From this conversation it looks like you are convinced both of those points are true, which is enough to wrap this discussion :)
