feat(v2): metadata label query #3749
base: main
Conversation
I took a quick look and I think I am lacking some context. The service name is already assigned to … Is the idea that this becomes a more flexible system for storing and querying other labels in the future? If yes, have we thought about how the proposed storage will scale (i.e., do we need to store all combinations? can we skip the service name)? If not, it feels like these queries are already answered efficiently today, and perhaps we can postpone doing this. One benefit I see is moving some of the computation from the query frontend to the metastore, as well as sending less data around; however, I feel this could be done without changing the storage. Again, I am probably missing something, so take this with a grain of salt :)
I should have been clearer; my doubts are about something else. I understand that the purpose is to discard as many blocks as possible, as early as possible. I have no doubt this is more efficient than the filtering we are doing currently in the query frontend. My point was that this idea seems doable without a change to the block meta storage format, and I am trying to gauge whether the added complexity in how labels are stored is worth it at this time. I expect the answer to be "yes", but I was hoping for some more reasoning around this.
if i == skip {
	skip += int(v)*2 + 1
This wasn't obvious to me at my first pass. I think it has to do with storing the same labels multiple times but I am unsure why we do that. A comment here could help. Alternatively, an explanation here or elsewhere on why labels are stored in this particular way could also be helpful.
I'll add a comment.
This is how multiple dimensions are handled: when an entity has multiple relations and the representation is denormalized, there is some redundancy (it should be at least 1NF – but this is not a relational DB, so we have a lot of freedom here).
For example, for now, each dataset is described by a collection of label sets:
service_name=A, profile_type=cpu
service_name=A, profile_type=memory
If we store multiple services in the same dataset, or extend the list of attributes available at query time, we can represent it as follows:
service_name=A, profile_type=cpu
service_name=A, profile_type=memory
service_name=B, namespace=N, profile_type=cpu
service_name=B, namespace=N, profile_type=memory
service_name=C, namespace=N, profile_type=cpu
service_name=C, namespace=N, profile_type=memory
Or, if we want to add arbitrary service attributes (don't have to be present in the block's tsdb index):
service_name=A, attr_1=Foo
service_name=B, attr_1=Bar
For example, the approach above can be used for cost attribution or any similar break-down analysis.
The length-prefixed encoding of KV pairs is the simplest way to represent that:
len(2) | k1 | v1 | k2 | v2 | len(3) | k1 | v3 | k2 | v4 | k3 | v5
It's not meant to be ultra-efficient in terms of space because, ultimately, we will be using an inverted index to store and query them (e.g., we could probably reuse the TSDB index, but wouldn't). In addition, keep in mind two things: 1) varint encoding in protobuf; 2) page organization in BoltDB – these are the reasons why the impact on encoding/decoding and snapshot size is barely noticeable.
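To make the length-prefixed layout concrete, here is a minimal sketch of encoding and decoding such a flat sequence. It assumes (as in the diff above) that keys and values are int32 references into a string table, and that each label set is prefixed with its pair count; the function names are illustrative, not the PR's actual API.

```go
package main

import "fmt"

// encodeSets flattens label sets into a single slice:
// pair count, then count key/value index pairs, repeated per set.
func encodeSets(sets [][][2]int32) []int32 {
	var out []int32
	for _, set := range sets {
		out = append(out, int32(len(set))) // length prefix: number of pairs
		for _, kv := range set {
			out = append(out, kv[0], kv[1])
		}
	}
	return out
}

// decodeSets walks from one length prefix to the next: each set occupies
// n*2+1 elements (the prefix plus n key/value pairs), which is the same
// arithmetic as the skip logic in the diff (skip += int(v)*2 + 1).
func decodeSets(enc []int32) [][][2]int32 {
	var sets [][][2]int32
	for i := 0; i < len(enc); {
		n := int(enc[i])
		set := make([][2]int32, 0, n)
		for j := 0; j < n; j++ {
			set = append(set, [2]int32{enc[i+1+j*2], enc[i+2+j*2]})
		}
		sets = append(sets, set)
		i += n*2 + 1
	}
	return sets
}

func main() {
	sets := [][][2]int32{
		{{0, 1}, {2, 3}}, // e.g. service_name=A, profile_type=cpu
		{{0, 1}, {2, 4}}, // e.g. service_name=A, profile_type=memory
	}
	enc := encodeSets(sets)
	fmt.Println(enc) // [2 0 1 2 3 2 0 1 2 4]
	fmt.Println(decodeSets(enc))
}
```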
Thanks for expanding on it. What are the advantages of this storage, compared to storing unique label pairs?
This:
service_name=A, profile_type=cpu
service_name=A, profile_type=memory
service_name=B, namespace=N, profile_type=cpu
service_name=B, namespace=N, profile_type=memory
service_name=C, namespace=N, profile_type=cpu
service_name=C, namespace=N, profile_type=memory
Becomes:
service_name=A, service_name=B, service_name=C, profile_type=cpu, profile_type=memory, ...
The only reason I am asking is because of the current usage, where multi-dimensionality doesn't seem required and queries are simple. Correct me if I am wrong, but we are currently either looking for a single service name or using an empty query expression (for profile types). From my understanding, however, we plan to use this differently later on, so let's keep it, with a comment providing some context.
type LabelBuilder struct {
	strings  *StringTable
	labels   []int32
	constant []int32
nit, should this be constants?
Can you explain why? This is the constant part of the label sets produced with the builder – see WithConstantPairs.
We are using a singular form for something representing multiple values (label pairs). It is a confusing name for me, but if you think it is fine we can keep it.
return resp, err
}

func (svc *MetadataQueryService) queryMetadataLabels(
not specific to this PR and we don't have to do it now, but we should add tracing to the metastore APIs
Thanks for the review, Aleks!
I believe you can find answers in:
If you disagree with any of the above or don't find it justified, I'm open to your input, but I ask that you be specific:
This does not help me understand your concerns. Could you please clarify your expectations?
This does not help me understand your concerns. Could you please clarify your expectations?
My concern was adding complexity for something we don't need right now but might need in the future. Future-proofing can be good, especially if we have a plan to take a certain direction, or if we know the cost of doing the work later will be too high. It can also be bad, if we never follow up on this and end up with code that can't be reasoned about.
From this conversation it looks like you are convinced both of those points are true, which is enough to wrap this discussion :)
The PR introduces metadata labels and a query interface. Metadata labels are attached by the segment writer and are utilized during the read path to filter out blocks irrelevant to the query without needing to access the blocks' TSDB index (acting as a form of skip index).
Currently, the primary use case for these labels is profile-type queries. Instead of fetching all entries in the query frontend for post-filtering, labels such as profile_type and service_name are queried directly through the new interface. This approach avoids heavy processing and retrieves only a minimal set of data. In the future, we may consider implementing an in-memory inverted index at the metadata shard level to accelerate metadata queries and optimize memory usage.