Skip indexing points for seq_no in tsdb and logsdb #128139


Merged: 6 commits merged into elastic:main on May 27, 2025

Conversation

@dnhatn (Member) commented May 19, 2025

This change skips indexing points for the seq_no field in tsdb and logsdb indices to reduce storage requirements and improve indexing throughput. Although this optimization could be applied to all new indices, it is limited to tsdb and logsdb, where seq_no usage is expected to be minimal and storage requirements are more critical.
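At the Lucene level, the difference amounts to whether a point is written alongside the doc values for each document. A minimal illustrative sketch (not the actual SeqNoFieldMapper code; field types and the hard-coded field name are simplifications):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.SortedNumericDocValuesField;

final class SeqNoFieldsSketch {
    // Doc values are always written; seq_no range queries can run against them
    // (using doc-value skippers where the index provides them). The BKD point is
    // only added when points are enabled, i.e. outside tsdb/logsdb with this change.
    static void addSeqNoFields(Document doc, long seqNo, boolean indexPoints) {
        doc.add(new SortedNumericDocValuesField("_seq_no", seqNo));
        if (indexPoints) {
            doc.add(new LongPoint("_seq_no", seqNo));
        }
    }
}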

@dnhatn (Member, Author) commented May 19, 2025

| Metric | Task | Baseline | Contender | Diff | Unit | Diff % |
|---|---|---|---|---|---|---|
| Cumulative indexing time of primary shards | | 320.29 | 333.012 | 12.7219 | min | +3.97% |
| Min cumulative indexing time across primary shard | | 320.29 | 333.012 | 12.7219 | min | +3.97% |
| Median cumulative indexing time across primary shard | | 320.29 | 333.012 | 12.7219 | min | +3.97% |
| Max cumulative indexing time across primary shard | | 320.29 | 333.012 | 12.7219 | min | +3.97% |
| Cumulative indexing throttle time of primary shards | | 0 | 0 | 0 | min | 0.00% |
| Min cumulative indexing throttle time across primary shard | | 0 | 0 | 0 | min | 0.00% |
| Median cumulative indexing throttle time across primary shard | | 0 | 0 | 0 | min | 0.00% |
| Max cumulative indexing throttle time across primary shard | | 0 | 0 | 0 | min | 0.00% |
| Cumulative merge time of primary shards | | 150.314 | 88.7911 | -61.5229 | min | -40.93% |
| Cumulative merge count of primary shards | | 33 | 35 | 2 | | +6.06% |
| Min cumulative merge time across primary shard | | 150.314 | 88.7911 | -61.5229 | min | -40.93% |
| Median cumulative merge time across primary shard | | 150.314 | 88.7911 | -61.5229 | min | -40.93% |
| Max cumulative merge time across primary shard | | 150.314 | 88.7911 | -61.5229 | min | -40.93% |
| Cumulative merge throttle time of primary shards | | 42.1572 | 13.2881 | -28.8691 | min | -68.48% |
| Min cumulative merge throttle time across primary shard | | 42.1572 | 13.2881 | -28.8691 | min | -68.48% |
| Median cumulative merge throttle time across primary shard | | 42.1572 | 13.2881 | -28.8691 | min | -68.48% |
| Max cumulative merge throttle time across primary shard | | 42.1572 | 13.2881 | -28.8691 | min | -68.48% |
| Cumulative refresh time of primary shards | | 2.61827 | 2.72112 | 0.10285 | min | +3.93% |
| Cumulative refresh count of primary shards | | 82 | 84 | 2 | | +2.44% |
| Min cumulative refresh time across primary shard | | 2.61827 | 2.72112 | 0.10285 | min | +3.93% |
| Median cumulative refresh time across primary shard | | 2.61827 | 2.72112 | 0.10285 | min | +3.93% |
| Max cumulative refresh time across primary shard | | 2.61827 | 2.72112 | 0.10285 | min | +3.93% |
| Cumulative flush time of primary shards | | 11.8072 | 12.3607 | 0.55348 | min | +4.69% |
| Cumulative flush count of primary shards | | 65 | 70 | 5 | | +7.69% |
| Min cumulative flush time across primary shard | | 11.8072 | 12.3607 | 0.55348 | min | +4.69% |
| Median cumulative flush time across primary shard | | 11.8072 | 12.3607 | 0.55348 | min | +4.69% |
| Max cumulative flush time across primary shard | | 11.8072 | 12.3607 | 0.55348 | min | +4.69% |
| Total Young Gen GC time | | 52.207 | 53.018 | 0.811 | s | +1.55% |
| Total Young Gen GC count | | 1061 | 1086 | 25 | | +2.36% |
| Total Old Gen GC time | | 0 | 0 | 0 | s | 0.00% |
| Total Old Gen GC count | | 0 | 0 | 0 | | 0.00% |
| Dataset size | | 24.3116 | 23.9251 | -0.3865 | GB | -1.59% |
| Store size | | 24.3116 | 23.9251 | -0.3865 | GB | -1.59% |
| Translog size | | 5.12227e-08 | 5.12227e-08 | 0 | GB | 0.00% |
| Heap used for segments | | 0 | 0 | 0 | MB | 0.00% |
| Heap used for doc values | | 0 | 0 | 0 | MB | 0.00% |
| Heap used for terms | | 0 | 0 | 0 | MB | 0.00% |
| Heap used for norms | | 0 | 0 | 0 | MB | 0.00% |
| Heap used for points | | 0 | 0 | 0 | MB | 0.00% |
| Heap used for stored fields | | 0 | 0 | 0 | MB | 0.00% |
| Segment count | | 30 | 42 | 12 | | +40.00% |
| Total Ingest Pipeline count | | 0 | 0 | 0 | | 0.00% |
| Total Ingest Pipeline time | | 0 | 0 | 0 | ms | 0.00% |
| Total Ingest Pipeline failed | | 0 | 0 | 0 | | 0.00% |
| Min Throughput | index | 37019.6 | 35354.7 | -1664.98 | docs/s | -4.50% |
| Mean Throughput | index | 38264.2 | 37020.7 | -1243.54 | docs/s | -3.25% |
| Median Throughput | index | 37945.6 | 36120 | -1825.64 | docs/s | -4.81% |
| Max Throughput | index | 42149.2 | 40447.1 | -1702.01 | docs/s | -4.04% |
| 50th percentile latency | index | 860.665 | 877.681 | 17.0164 | ms | +1.98% |
| 90th percentile latency | index | 1132.66 | 1208.98 | 76.3207 | ms | +6.74% |
| 99th percentile latency | index | 6565.73 | 6849.73 | 284 | ms | +4.33% |
| 99.9th percentile latency | index | 10157.1 | 10295.9 | 138.766 | ms | +1.37% |
| 99.99th percentile latency | index | 12905.1 | 14065.4 | 1160.25 | ms | +8.99% |
| 100th percentile latency | index | 13873.7 | 16058.7 | 2184.94 | ms | +15.75% |
| 50th percentile service time | index | 860.665 | 877.681 | 17.0164 | ms | +1.98% |
| 90th percentile service time | index | 1132.66 | 1208.98 | 76.3207 | ms | +6.74% |
| 99th percentile service time | index | 6565.73 | 6849.73 | 284 | ms | +4.33% |
| 99.9th percentile service time | index | 10157.1 | 10295.9 | 138.766 | ms | +1.37% |
| 99.99th percentile service time | index | 12905.1 | 14065.4 | 1160.25 | ms | +8.99% |
| 100th percentile service time | index | 13873.7 | 16058.7 | 2184.94 | ms | +15.75% |

@dnhatn (Member, Author) commented May 19, 2025

@martijnvg @kkrik-es

I benchmarked this change: indexing time increased from 320 minutes to 333 minutes, while merge time decreased from 150 minutes to 88 minutes (merge throttling dropped from 42 to 13 minutes). Store size decreased from 24.3116 GB to 23.9251 GB. I believe the retention query with doc_values is causing the slowness. Although SortedNumericDocValuesRangeQuery already uses DocValuesSkipper, it may not be fully leveraged. I think we should consider creating an optimized query to avoid indexing regressions before pushing this change.

The TSDB codec uses less storage for the doc_values of the _seq_no field. With the TSDB doc-values codec and points disabled (contender):

"_seq_no": {
    "total": "287.6mb",
    "total_in_bytes": 301624995,
    "inverted_index": {
        "total": "0b",
        "total_in_bytes": 0
    },
    "stored_fields": "0b",
    "stored_fields_in_bytes": 0,
    "doc_values": "287.6mb",
    "doc_values_in_bytes": 301624995,
    "points": "0b",
    "points_in_bytes": 0,
    "norms": "0b",
    "norms_in_bytes": 0,
    "term_vectors": "0b",
    "term_vectors_in_bytes": 0,
    "knn_vectors": "0b",
    "knn_vectors_in_bytes": 0
}

With the standard codec and points still indexed (baseline):

"_seq_no": {
    "total": "933.7mb",
    "total_in_bytes": 979066601,
    "inverted_index": {
        "total": "0b",
        "total_in_bytes": 0
    },
    "stored_fields": "0b",
    "stored_fields_in_bytes": 0,
    "doc_values": "383.8mb",
    "doc_values_in_bytes": 402479840,
    "points": "549.8mb",
    "points_in_bytes": 576586761,
    "norms": "0b",
    "norms_in_bytes": 0,
    "term_vectors": "0b",
    "term_vectors_in_bytes": 0,
    "knn_vectors": "0b",
    "knn_vectors_in_bytes": 0
}

@kkrik-es (Contributor) commented May 19, 2025

Not too bad! You may also check the following index in elastic/logs; _seq_no accounts for roughly 10% of the store size today (and _id accounts for another 10%):

$ cat  ~/test_data/patterned/rally18.log |grep .ds-logs-k8-application.log-default- | grep seq
|    .ds-logs-k8-application.log-default-2025.03.07-000001 _seq_no doc values  |   861.434         |     MB |
|   .ds-logs-k8-application.log-default-2025.03.07-000001 _seq_no points       |       1.26528     |     GB |
|   .ds-logs-k8-application.log-default-2025.03.07-000001 _seq_no total        |       2.10652     |     GB |

$ cat  ~/test_data/patterned/rally18.log |grep .ds-logs-k8-application.log-default- | grep ' _id '
|   .ds-logs-k8-application.log-default-2025.03.07-000001 _id inverted index  |           1.45893     |     GB |
|   .ds-logs-k8-application.log-default-2025.03.07-000001 _id stored fields   |      953.436         |     MB |
|   .ds-logs-k8-application.log-default-2025.03.07-000001 _id total           |        2.39002     |     GB |

@martijnvg (Member) commented:

Thanks @dnhatn for working on this!

> The TSDB codec uses less storage for the doc_values of the _seq_no field:

This is great; I initially assumed the tsdb doc values codec wouldn't buy us much for the _seq_no field. Turns out I was wrong :)

> Although SortedNumericDocValuesRangeQuery already uses DocValuesSkipper, it may not be fully leveraged. I think we should consider creating an optimized query to avoid indexing regressions before pushing this change.

What kind of optimizations are you thinking about? We don't sort by the _seq_no field, and I suspect the ordering is fairly random. SortedNumericDocValuesRangeQuery already has shortcuts for the cases where the requested range doesn't overlap a segment's min/max at all, or fully covers it. In the other cases, it delegates to DocValuesRangeIterator, which makes best-effort use of the doc values skipper.
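For context, a doc-values-only range over _seq_no would go through exactly that path. A rough sketch of how such a query is built, assuming _seq_no is stored as sorted numeric doc values and with the field name hard-coded for illustration:

import org.apache.lucene.document.SortedNumericDocValuesField;
import org.apache.lucene.search.Query;

final class SeqNoDocValuesQuerySketch {
    // Per segment, Lucene first checks the requested range against the field's min/max
    // (match-none / match-all shortcuts); otherwise it iterates via DocValuesRangeIterator,
    // which makes best-effort use of the doc values skipper where one is available.
    static Query seqNoRange(long fromSeqNo, long toSeqNo) {
        return SortedNumericDocValuesField.newSlowRangeQuery("_seq_no", fromSeqNo, toSeqNo);
    }
}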

@dnhatn (Member, Author) commented May 21, 2025

@kkrik-es @martijnvg Thanks for looking!

I may have reached a conclusion too quickly. Since there are no deletes or recovery_source in the track, the retention query should be a no-op in both cases. So I re-ran the benchmark using the standard codec for seq_no, and the results look better. I think we can proceed with disabling points for seq_no in tsdb and logsdb, but using the standard codec. We can switch the codec in a follow-up after further testing.

"_seq_no": {
    "total": "341.5mb",
    "total_in_bytes": 358145438,
    "inverted_index": {
        "total": "0b",
        "total_in_bytes": 0
    },
    "stored_fields": "0b",
    "stored_fields_in_bytes": 0,
    "doc_values": "341.5mb",
    "doc_values_in_bytes": 358145438,
    "points": "0b",
    "points_in_bytes": 0,
    "norms": "0b",
    "norms_in_bytes": 0,
    "term_vectors": "0b",
    "term_vectors_in_bytes": 0,
    "knn_vectors": "0b",
    "knn_vectors_in_bytes": 0
}

| Metric | Task | Baseline | Contender | Diff | Unit | Diff % |
|---|---|---|---|---|---|---|
| Cumulative indexing time of primary shards | | 320.29 | 295.458 | -24.8319 | min | -7.75% |
| Min cumulative indexing time across primary shard | | 320.29 | 295.458 | -24.8319 | min | -7.75% |
| Median cumulative indexing time across primary shard | | 320.29 | 295.458 | -24.8319 | min | -7.75% |
| Max cumulative indexing time across primary shard | | 320.29 | 295.458 | -24.8319 | min | -7.75% |
| Cumulative indexing throttle time of primary shards | | 0 | 0 | 0 | min | 0.00% |
| Min cumulative indexing throttle time across primary shard | | 0 | 0 | 0 | min | 0.00% |
| Median cumulative indexing throttle time across primary shard | | 0 | 0 | 0 | min | 0.00% |
| Max cumulative indexing throttle time across primary shard | | 0 | 0 | 0 | min | 0.00% |
| Cumulative merge time of primary shards | | 150.314 | 71.4849 | -78.829 | min | -52.44% |
| Cumulative merge count of primary shards | | 33 | 31 | -2 | | -6.06% |
| Min cumulative merge time across primary shard | | 150.314 | 71.4849 | -78.829 | min | -52.44% |
| Median cumulative merge time across primary shard | | 150.314 | 71.4849 | -78.829 | min | -52.44% |
| Max cumulative merge time across primary shard | | 150.314 | 71.4849 | -78.829 | min | -52.44% |
| Cumulative merge throttle time of primary shards | | 42.1572 | 16.649 | -25.5082 | min | -60.51% |
| Min cumulative merge throttle time across primary shard | | 42.1572 | 16.649 | -25.5082 | min | -60.51% |
| Median cumulative merge throttle time across primary shard | | 42.1572 | 16.649 | -25.5082 | min | -60.51% |
| Max cumulative merge throttle time across primary shard | | 42.1572 | 16.649 | -25.5082 | min | -60.51% |
| Cumulative refresh time of primary shards | | 2.61827 | 2.29685 | -0.32142 | min | -12.28% |
| Cumulative refresh count of primary shards | | 82 | 79 | -3 | | -3.66% |
| Min cumulative refresh time across primary shard | | 2.61827 | 2.29685 | -0.32142 | min | -12.28% |
| Median cumulative refresh time across primary shard | | 2.61827 | 2.29685 | -0.32142 | min | -12.28% |
| Max cumulative refresh time across primary shard | | 2.61827 | 2.29685 | -0.32142 | min | -12.28% |
| Cumulative flush time of primary shards | | 11.8072 | 11.7695 | -0.0377 | min | -0.32% |
| Cumulative flush count of primary shards | | 65 | 63 | -2 | | -3.08% |
| Min cumulative flush time across primary shard | | 11.8072 | 11.7695 | -0.0377 | min | -0.32% |
| Median cumulative flush time across primary shard | | 11.8072 | 11.7695 | -0.0377 | min | -0.32% |
| Max cumulative flush time across primary shard | | 11.8072 | 11.7695 | -0.0377 | min | -0.32% |
| Total Young Gen GC time | | 52.207 | 52.667 | 0.46 | s | +0.88% |
| Total Young Gen GC count | | 1061 | 1097 | 36 | | +3.39% |
| Total Old Gen GC time | | 0 | 0 | 0 | s | 0.00% |
| Total Old Gen GC count | | 0 | 0 | 0 | | 0.00% |
| Dataset size | | 24.3116 | 23.9603 | -0.35129 | GB | -1.44% |
| Store size | | 24.3116 | 23.9603 | -0.35129 | GB | -1.44% |
| Translog size | | 5.12227e-08 | 5.12227e-08 | 0 | GB | 0.00% |
| Heap used for segments | | 0 | 0 | 0 | MB | 0.00% |
| Heap used for doc values | | 0 | 0 | 0 | MB | 0.00% |
| Heap used for terms | | 0 | 0 | 0 | MB | 0.00% |
| Heap used for norms | | 0 | 0 | 0 | MB | 0.00% |
| Heap used for points | | 0 | 0 | 0 | MB | 0.00% |
| Heap used for stored fields | | 0 | 0 | 0 | MB | 0.00% |
| Segment count | | 30 | 37 | 7 | | +23.33% |
| Total Ingest Pipeline count | | 0 | 0 | 0 | | 0.00% |
| Total Ingest Pipeline time | | 0 | 0 | 0 | ms | 0.00% |
| Total Ingest Pipeline failed | | 0 | 0 | 0 | | 0.00% |
| Min Throughput | index | 37019.6 | 38891 | 1871.34 | docs/s | +5.06% |
| Mean Throughput | index | 38264.2 | 40125 | 1860.84 | docs/s | +4.86% |
| Median Throughput | index | 37945.6 | 39700.9 | 1755.35 | docs/s | +4.63% |
| Max Throughput | index | 42149.2 | 43492.8 | 1343.67 | docs/s | +3.19% |
| 50th percentile latency | index | 860.665 | 789.215 | -71.4501 | ms | -8.30% |
| 90th percentile latency | index | 1132.66 | 1073.33 | -59.3224 | ms | -5.24% |
| 99th percentile latency | index | 6565.73 | 6284.46 | -281.265 | ms | -4.28% |
| 99.9th percentile latency | index | 10157.1 | 10921.5 | 764.368 | ms | +7.53% |
| 99.99th percentile latency | index | 12905.1 | 14296.8 | 1391.62 | ms | +10.78% |
| 100th percentile latency | index | 13873.7 | 20993.3 | 7119.55 | ms | +51.32% |
| 50th percentile service time | index | 860.665 | 789.215 | -71.4501 | ms | -8.30% |
| 90th percentile service time | index | 1132.66 | 1073.33 | -59.3224 | ms | -5.24% |
| 99th percentile service time | index | 6565.73 | 6284.46 | -281.265 | ms | -4.28% |
| 99.9th percentile service time | index | 10157.1 | 10921.5 | 764.368 | ms | +7.53% |
| 99.99th percentile service time | index | 12905.1 | 14296.8 | 1391.62 | ms | +10.78% |
| 100th percentile service time | index | 13873.7 | 20993.3 | 7119.55 | ms | +51.32% |

@martijnvg (Member) left a comment

Thanks Nhat for iterating here!

> So I re-ran the benchmark using the standard codec for seq_no, and the results look better.

So the only change was that the _seq_no field uses the stock Lucene doc values codec? Did the benchmark run with the same track params as before? I don't think we ever have recovery source with logsdb anymore.

> I think we can proceed with disabling points for seq_no in tsdb and logsdb, but using the standard codec.

Should we maybe put this behind a feature flag, so that we can see how it affects other benchmarks? I'm particularly interested in how this change affects the elastic/logs CCR benchmark.

private static Query rangeQueryForSeqNo(boolean withPoints, long lowerValue, long upperValue) {
    if (withPoints) {
        // TODO: Use IndexOrDocValuesQuery
        return LongPoint.newRangeQuery(SeqNoFieldMapper.NAME, lowerValue, upperValue);

A reviewer (Member) commented:

Address the TODO here and wrap this query in IndexOrDocValuesQuery?

@dnhatn (Member, Author) replied:

This should be a small change, but it may impact performance. I prefer to make it in a separate PR to avoid introducing noise. We can do it either before or after this change.
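
For illustration, a sketch of what addressing that TODO could look like, combining the point query with a doc-values query so Lucene can pick the cheaper execution per segment (field name hard-coded here instead of SeqNoFieldMapper.NAME; not the actual follow-up change):

import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.SortedNumericDocValuesField;
import org.apache.lucene.search.IndexOrDocValuesQuery;
import org.apache.lucene.search.Query;

final class SeqNoRangeQuerySketch {
    static Query rangeQueryForSeqNo(boolean withPoints, long lowerValue, long upperValue) {
        // The doc-values query works whether or not points were indexed.
        Query dvQuery = SortedNumericDocValuesField.newSlowRangeQuery("_seq_no", lowerValue, upperValue);
        if (withPoints) {
            // When points exist, let Lucene choose between the BKD-backed query and doc values.
            Query pointQuery = LongPoint.newRangeQuery("_seq_no", lowerValue, upperValue);
            return new IndexOrDocValuesQuery(pointQuery, dvQuery);
        }
        return dvQuery;
    }
}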

@dnhatn (Member, Author) commented May 21, 2025

Thanks @martijnvg

> So the only change was that the _seq_no field uses the stock Lucene doc values codec? Did the benchmark run with the same track params as before? I don't think we ever have recovery source with logsdb anymore.

That's correct. The TSDB codec likely uses more CPU to produce a more compact encoding than the standard codec.

> Should we maybe put this behind a feature flag, so that we can see how it affects other benchmarks? I'm particularly interested in how this change affects the elastic/logs CCR benchmark.

I considered this as well, but CCR can break if the feature is enabled on one cluster and not on another. For example, if the leader cluster enables the feature and doesn't index points for seq_no, but the follower disables it and does index points, that leads to inconsistent field infos. Should we run the elastic/logs CCR benchmark before merging this change, instead of introducing a feature flag?

@martijnvg (Member) commented:

> That's correct. The TSDB codec likely uses more CPU to produce a more compact encoding than the standard codec.

Right, in a follow-up we could investigate applying only some of the encoding techniques that the tsdb doc values codec uses for the _tsid field.

> For example, if the leader cluster enables the feature and doesn't index points for seq_no, but the follower disables it and does index points, that leads to inconsistent field infos.

Right, that can get messy and lead to hard-to-debug bugs. However, at least in production that shouldn't be an issue, given that feature flags shouldn't be used in production. The same applies to snapshot builds.

> Should we run the elastic/logs CCR benchmark before merging this change, instead of introducing a feature flag?

👍 That also sounds good to me.

@martijnvg (Member) commented:

Rally compare with current main as baseline and this change as contender:
baseline4-vs-seqno-no-dv-3.txt

(Median indexing throughput is ~12% slower; however, this statistic is a bit noisy.)

CCR performance results between current main and this PR: https://esbench-metrics.kb.us-east-2.aws.elastic-cloud.com:9243/app/dashboards#/view/2eb4019c-95ac-4a05-9990-cd85b22bbaeb?_g=h@1dbd48d

CCR replication is a little slower (~19%) with this PR, but it is much better than it was before the decompression issue was fixed (#128473): https://esbench-metrics.kb.us-east-2.aws.elastic-cloud.com:9243/app/r/s/iAxiI

Given that #128473 improved performance significantly, this change should be a good trade-off.

@martijnvg added the :StorageEngine/Mapping (The storage related side of mappings) and >enhancement labels on May 27, 2025
@martijnvg marked this pull request as ready for review on May 27, 2025 17:40
@martijnvg requested a review from a team as a code owner on May 27, 2025 17:40
@elasticsearchmachine (Collaborator) commented:

Pinging @elastic/es-storage-engine (Team:StorageEngine)

@elasticsearchmachine (Collaborator) commented:

Hi @dnhatn, I've created a changelog YAML for you.

@martijnvg (Member) left a comment

LGTM

@elasticsearchmachine added the serverless-linked (Added by automation, don't add manually) label on May 27, 2025
@dnhatn merged commit 7f2e55f into elastic:main on May 27, 2025 (19 checks passed)
@dnhatn deleted the seq_no_dv branch on May 27, 2025 22:59
@dnhatn (Member, Author) commented May 27, 2025

@martijnvg @kkrik-es Thank you so much for your review and for helping to unblock this PR. Should we backport this change to 8.19?

@martijnvg (Member) commented:

> Thank you so much for your review and for helping to unblock this PR. Should we backport this change to 8.19?

I don't think that's possible, given that doc value skippers are not available in the 8.19 branch.

@felixbarny (Member) commented:

What's the storage reduction per document for this change?

@martijnvg (Member) commented Jun 4, 2025 via email

@felixbarny (Member) commented:

I'm specifically curious about the reduction in bytes per document. Do you know how many documents we have ingested in the track? The relative reduction in the total size of the data set depends on how many fields are stored per document. The reduction in bytes per doc should stay consistent, no matter the data set.

@martijnvg (Member) commented:

I don't have that number, since each document has a different number of metric fields. I think we need one run of metricsgen with the index setting this PR enables and another run with it disabled to get a more accurate view of the savings on a per-metric basis.

> The reduction in bytes per doc should stay consistent, no matter the data set.

This depends on how sequence numbers get encoded in the BKD tree. I don't know what compression schemes are used.
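
For what it's worth, the field stats posted earlier in this thread at least give the size of the component this change removes on that track; the per-document figure would follow from dividing by the ingested document count, which isn't reported here:

_seq_no points in the baseline: 576,586,761 bytes (549.8 MB)
per-document saving ≈ 576,586,761 / <number of indexed documents> bytes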

Labels: >enhancement, serverless-linked, :StorageEngine/Mapping, Team:StorageEngine, v9.1.0