Skip indexing points for seq_no in tsdb and logsdb #128139


Merged: 6 commits merged into elastic:main on May 27, 2025

Conversation

@dnhatn (Member) commented May 19, 2025

This change skips indexing points for the seq_no field in tsdb and logsdb indices to reduce storage requirements and improve indexing throughput. Although this optimization could be applied to all new indices, it is limited to tsdb and logsdb, where seq_no usage is expected to be minimal and storage requirements are more critical.
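At the Lucene level, the difference amounts to whether a point is written alongside the doc values for each document. A minimal illustrative sketch (not the actual SeqNoFieldMapper code; field types and the hard-coded field name are simplifications):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.SortedNumericDocValuesField;

final class SeqNoFieldsSketch {
    // Doc values are always written; seq_no range queries can run against them
    // (using doc-value skippers where the index provides them). The BKD point is
    // only added when points are enabled, i.e. outside tsdb/logsdb with this change.
    static void addSeqNoFields(Document doc, long seqNo, boolean indexPoints) {
        doc.add(new SortedNumericDocValuesField("_seq_no", seqNo));
        if (indexPoints) {
            doc.add(new LongPoint("_seq_no", seqNo));
        }
    }
}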

@dnhatn (Member, Author) commented May 19, 2025

| Metric | Task | Baseline | Contender | Diff | Unit | Diff % |
|---|---|---|---|---|---|---|
| Cumulative indexing time of primary shards | | 320.29 | 333.012 | 12.7219 | min | +3.97% |
| Min cumulative indexing time across primary shard | | 320.29 | 333.012 | 12.7219 | min | +3.97% |
| Median cumulative indexing time across primary shard | | 320.29 | 333.012 | 12.7219 | min | +3.97% |
| Max cumulative indexing time across primary shard | | 320.29 | 333.012 | 12.7219 | min | +3.97% |
| Cumulative indexing throttle time of primary shards | | 0 | 0 | 0 | min | 0.00% |
| Min cumulative indexing throttle time across primary shard | | 0 | 0 | 0 | min | 0.00% |
| Median cumulative indexing throttle time across primary shard | | 0 | 0 | 0 | min | 0.00% |
| Max cumulative indexing throttle time across primary shard | | 0 | 0 | 0 | min | 0.00% |
| Cumulative merge time of primary shards | | 150.314 | 88.7911 | -61.5229 | min | -40.93% |
| Cumulative merge count of primary shards | | 33 | 35 | 2 | | +6.06% |
| Min cumulative merge time across primary shard | | 150.314 | 88.7911 | -61.5229 | min | -40.93% |
| Median cumulative merge time across primary shard | | 150.314 | 88.7911 | -61.5229 | min | -40.93% |
| Max cumulative merge time across primary shard | | 150.314 | 88.7911 | -61.5229 | min | -40.93% |
| Cumulative merge throttle time of primary shards | | 42.1572 | 13.2881 | -28.8691 | min | -68.48% |
| Min cumulative merge throttle time across primary shard | | 42.1572 | 13.2881 | -28.8691 | min | -68.48% |
| Median cumulative merge throttle time across primary shard | | 42.1572 | 13.2881 | -28.8691 | min | -68.48% |
| Max cumulative merge throttle time across primary shard | | 42.1572 | 13.2881 | -28.8691 | min | -68.48% |
| Cumulative refresh time of primary shards | | 2.61827 | 2.72112 | 0.10285 | min | +3.93% |
| Cumulative refresh count of primary shards | | 82 | 84 | 2 | | +2.44% |
| Min cumulative refresh time across primary shard | | 2.61827 | 2.72112 | 0.10285 | min | +3.93% |
| Median cumulative refresh time across primary shard | | 2.61827 | 2.72112 | 0.10285 | min | +3.93% |
| Max cumulative refresh time across primary shard | | 2.61827 | 2.72112 | 0.10285 | min | +3.93% |
| Cumulative flush time of primary shards | | 11.8072 | 12.3607 | 0.55348 | min | +4.69% |
| Cumulative flush count of primary shards | | 65 | 70 | 5 | | +7.69% |
| Min cumulative flush time across primary shard | | 11.8072 | 12.3607 | 0.55348 | min | +4.69% |
| Median cumulative flush time across primary shard | | 11.8072 | 12.3607 | 0.55348 | min | +4.69% |
| Max cumulative flush time across primary shard | | 11.8072 | 12.3607 | 0.55348 | min | +4.69% |
| Total Young Gen GC time | | 52.207 | 53.018 | 0.811 | s | +1.55% |
| Total Young Gen GC count | | 1061 | 1086 | 25 | | +2.36% |
| Total Old Gen GC time | | 0 | 0 | 0 | s | 0.00% |
| Total Old Gen GC count | | 0 | 0 | 0 | | 0.00% |
| Dataset size | | 24.3116 | 23.9251 | -0.3865 | GB | -1.59% |
| Store size | | 24.3116 | 23.9251 | -0.3865 | GB | -1.59% |
| Translog size | | 5.12227e-08 | 5.12227e-08 | 0 | GB | 0.00% |
| Heap used for segments | | 0 | 0 | 0 | MB | 0.00% |
| Heap used for doc values | | 0 | 0 | 0 | MB | 0.00% |
| Heap used for terms | | 0 | 0 | 0 | MB | 0.00% |
| Heap used for norms | | 0 | 0 | 0 | MB | 0.00% |
| Heap used for points | | 0 | 0 | 0 | MB | 0.00% |
| Heap used for stored fields | | 0 | 0 | 0 | MB | 0.00% |
| Segment count | | 30 | 42 | 12 | | +40.00% |
| Total Ingest Pipeline count | | 0 | 0 | 0 | | 0.00% |
| Total Ingest Pipeline time | | 0 | 0 | 0 | ms | 0.00% |
| Total Ingest Pipeline failed | | 0 | 0 | 0 | | 0.00% |
| Min Throughput | index | 37019.6 | 35354.7 | -1664.98 | docs/s | -4.50% |
| Mean Throughput | index | 38264.2 | 37020.7 | -1243.54 | docs/s | -3.25% |
| Median Throughput | index | 37945.6 | 36120 | -1825.64 | docs/s | -4.81% |
| Max Throughput | index | 42149.2 | 40447.1 | -1702.01 | docs/s | -4.04% |
| 50th percentile latency | index | 860.665 | 877.681 | 17.0164 | ms | +1.98% |
| 90th percentile latency | index | 1132.66 | 1208.98 | 76.3207 | ms | +6.74% |
| 99th percentile latency | index | 6565.73 | 6849.73 | 284 | ms | +4.33% |
| 99.9th percentile latency | index | 10157.1 | 10295.9 | 138.766 | ms | +1.37% |
| 99.99th percentile latency | index | 12905.1 | 14065.4 | 1160.25 | ms | +8.99% |
| 100th percentile latency | index | 13873.7 | 16058.7 | 2184.94 | ms | +15.75% |
| 50th percentile service time | index | 860.665 | 877.681 | 17.0164 | ms | +1.98% |
| 90th percentile service time | index | 1132.66 | 1208.98 | 76.3207 | ms | +6.74% |
| 99th percentile service time | index | 6565.73 | 6849.73 | 284 | ms | +4.33% |
| 99.9th percentile service time | index | 10157.1 | 10295.9 | 138.766 | ms | +1.37% |
| 99.99th percentile service time | index | 12905.1 | 14065.4 | 1160.25 | ms | +8.99% |
| 100th percentile service time | index | 13873.7 | 16058.7 | 2184.94 | ms | +15.75% |

@dnhatn (Member, Author) commented May 19, 2025

@martijnvg @kkrik-es

I benchmarked this change: indexing time increased from 320 minutes to 333 minutes, while merge time decreased from 150 minutes to 88 minutes (merge throttling dropped from 42 to 13 minutes). Store size decreased from 24.3116 GB to 23.9251 GB. I believe the retention query with doc_values is causing the slowness. Although SortedNumericDocValuesRangeQuery already uses DocValuesSkipper, it may not be fully leveraged. I think we should consider creating an optimized query to avoid indexing regressions before pushing this change.

The TSDB codec uses less storage for the doc_values of the _seq_no field. With the TSDB doc-values codec and points disabled (contender):

"_seq_no": {
    "total": "287.6mb",
    "total_in_bytes": 301624995,
    "inverted_index": {
        "total": "0b",
        "total_in_bytes": 0
    },
    "stored_fields": "0b",
    "stored_fields_in_bytes": 0,
    "doc_values": "287.6mb",
    "doc_values_in_bytes": 301624995,
    "points": "0b",
    "points_in_bytes": 0,
    "norms": "0b",
    "norms_in_bytes": 0,
    "term_vectors": "0b",
    "term_vectors_in_bytes": 0,
    "knn_vectors": "0b",
    "knn_vectors_in_bytes": 0
}

With the standard codec and points still indexed (baseline):

"_seq_no": {
    "total": "933.7mb",
    "total_in_bytes": 979066601,
    "inverted_index": {
        "total": "0b",
        "total_in_bytes": 0
    },
    "stored_fields": "0b",
    "stored_fields_in_bytes": 0,
    "doc_values": "383.8mb",
    "doc_values_in_bytes": 402479840,
    "points": "549.8mb",
    "points_in_bytes": 576586761,
    "norms": "0b",
    "norms_in_bytes": 0,
    "term_vectors": "0b",
    "term_vectors_in_bytes": 0,
    "knn_vectors": "0b",
    "knn_vectors_in_bytes": 0
}

@kkrik-es (Contributor) commented May 19, 2025

Not too bad! You may also check the following index in elastic/logs; _seq_no accounts for roughly 10% of the store size today (and _id accounts for another 10%):

$ cat  ~/test_data/patterned/rally18.log |grep .ds-logs-k8-application.log-default- | grep seq
|    .ds-logs-k8-application.log-default-2025.03.07-000001 _seq_no doc values  |   861.434         |     MB |
|   .ds-logs-k8-application.log-default-2025.03.07-000001 _seq_no points       |       1.26528     |     GB |
|   .ds-logs-k8-application.log-default-2025.03.07-000001 _seq_no total        |       2.10652     |     GB |

$ cat  ~/test_data/patterned/rally18.log |grep .ds-logs-k8-application.log-default- | grep ' _id '
|   .ds-logs-k8-application.log-default-2025.03.07-000001 _id inverted index  |           1.45893     |     GB |
|   .ds-logs-k8-application.log-default-2025.03.07-000001 _id stored fields   |      953.436         |     MB |
|   .ds-logs-k8-application.log-default-2025.03.07-000001 _id total           |        2.39002     |     GB |

@martijnvg (Member) commented:

Thanks @dnhatn for working on this!

> The TSDB codec uses less storage for the doc_values of the _seq_no field:

This is great; I initially assumed the tsdb doc values codec wouldn't buy us much for the _seq_no field. Turns out I was wrong :)

> Although SortedNumericDocValuesRangeQuery already uses DocValuesSkipper, it may not be fully leveraged. I think we should consider creating an optimized query to avoid indexing regressions before pushing this change.

What kind of optimizations are you thinking about? We don't sort by the _seq_no field, and I suspect the ordering is fairly random. SortedNumericDocValuesRangeQuery already has shortcuts for the cases where the requested range doesn't overlap a segment's min/max at all, or fully covers it. In the other cases, it delegates to DocValuesRangeIterator, which makes best-effort use of the doc values skipper.
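For context, a doc-values-only range over _seq_no would go through exactly that path. A rough sketch of how such a query is built, assuming _seq_no is stored as sorted numeric doc values and with the field name hard-coded for illustration:

import org.apache.lucene.document.SortedNumericDocValuesField;
import org.apache.lucene.search.Query;

final class SeqNoDocValuesQuerySketch {
    // Per segment, Lucene first checks the requested range against the field's min/max
    // (match-none / match-all shortcuts); otherwise it iterates via DocValuesRangeIterator,
    // which makes best-effort use of the doc values skipper where one is available.
    static Query seqNoRange(long fromSeqNo, long toSeqNo) {
        return SortedNumericDocValuesField.newSlowRangeQuery("_seq_no", fromSeqNo, toSeqNo);
    }
}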

@dnhatn (Member, Author) commented May 21, 2025

@kkrik-es @martijnvg Thanks for looking!

I may have reached a conclusion too quickly. Since there are no deletes or recovery_source in the track, the retention query should be a no-op in both cases. So I re-ran the benchmark using the standard codec for seq_no, and the results look better. I think we can proceed with disabling points for seq_no in tsdb and logsdb, but using the standard codec. We can switch the codec in a follow-up after further testing.

"_seq_no": {
    "total": "341.5mb",
    "total_in_bytes": 358145438,
    "inverted_index": {
        "total": "0b",
        "total_in_bytes": 0
    },
    "stored_fields": "0b",
    "stored_fields_in_bytes": 0,
    "doc_values": "341.5mb",
    "doc_values_in_bytes": 358145438,
    "points": "0b",
    "points_in_bytes": 0,
    "norms": "0b",
    "norms_in_bytes": 0,
    "term_vectors": "0b",
    "term_vectors_in_bytes": 0,
    "knn_vectors": "0b",
    "knn_vectors_in_bytes": 0
}

| Metric | Task | Baseline | Contender | Diff | Unit | Diff % |
|---|---|---|---|---|---|---|
| Cumulative indexing time of primary shards | | 320.29 | 295.458 | -24.8319 | min | -7.75% |
| Min cumulative indexing time across primary shard | | 320.29 | 295.458 | -24.8319 | min | -7.75% |
| Median cumulative indexing time across primary shard | | 320.29 | 295.458 | -24.8319 | min | -7.75% |
| Max cumulative indexing time across primary shard | | 320.29 | 295.458 | -24.8319 | min | -7.75% |
| Cumulative indexing throttle time of primary shards | | 0 | 0 | 0 | min | 0.00% |
| Min cumulative indexing throttle time across primary shard | | 0 | 0 | 0 | min | 0.00% |
| Median cumulative indexing throttle time across primary shard | | 0 | 0 | 0 | min | 0.00% |
| Max cumulative indexing throttle time across primary shard | | 0 | 0 | 0 | min | 0.00% |
| Cumulative merge time of primary shards | | 150.314 | 71.4849 | -78.829 | min | -52.44% |
| Cumulative merge count of primary shards | | 33 | 31 | -2 | | -6.06% |
| Min cumulative merge time across primary shard | | 150.314 | 71.4849 | -78.829 | min | -52.44% |
| Median cumulative merge time across primary shard | | 150.314 | 71.4849 | -78.829 | min | -52.44% |
| Max cumulative merge time across primary shard | | 150.314 | 71.4849 | -78.829 | min | -52.44% |
| Cumulative merge throttle time of primary shards | | 42.1572 | 16.649 | -25.5082 | min | -60.51% |
| Min cumulative merge throttle time across primary shard | | 42.1572 | 16.649 | -25.5082 | min | -60.51% |
| Median cumulative merge throttle time across primary shard | | 42.1572 | 16.649 | -25.5082 | min | -60.51% |
| Max cumulative merge throttle time across primary shard | | 42.1572 | 16.649 | -25.5082 | min | -60.51% |
| Cumulative refresh time of primary shards | | 2.61827 | 2.29685 | -0.32142 | min | -12.28% |
| Cumulative refresh count of primary shards | | 82 | 79 | -3 | | -3.66% |
| Min cumulative refresh time across primary shard | | 2.61827 | 2.29685 | -0.32142 | min | -12.28% |
| Median cumulative refresh time across primary shard | | 2.61827 | 2.29685 | -0.32142 | min | -12.28% |
| Max cumulative refresh time across primary shard | | 2.61827 | 2.29685 | -0.32142 | min | -12.28% |
| Cumulative flush time of primary shards | | 11.8072 | 11.7695 | -0.0377 | min | -0.32% |
| Cumulative flush count of primary shards | | 65 | 63 | -2 | | -3.08% |
| Min cumulative flush time across primary shard | | 11.8072 | 11.7695 | -0.0377 | min | -0.32% |
| Median cumulative flush time across primary shard | | 11.8072 | 11.7695 | -0.0377 | min | -0.32% |
| Max cumulative flush time across primary shard | | 11.8072 | 11.7695 | -0.0377 | min | -0.32% |
| Total Young Gen GC time | | 52.207 | 52.667 | 0.46 | s | +0.88% |
| Total Young Gen GC count | | 1061 | 1097 | 36 | | +3.39% |
| Total Old Gen GC time | | 0 | 0 | 0 | s | 0.00% |
| Total Old Gen GC count | | 0 | 0 | 0 | | 0.00% |
| Dataset size | | 24.3116 | 23.9603 | -0.35129 | GB | -1.44% |
| Store size | | 24.3116 | 23.9603 | -0.35129 | GB | -1.44% |
| Translog size | | 5.12227e-08 | 5.12227e-08 | 0 | GB | 0.00% |
| Heap used for segments | | 0 | 0 | 0 | MB | 0.00% |
| Heap used for doc values | | 0 | 0 | 0 | MB | 0.00% |
| Heap used for terms | | 0 | 0 | 0 | MB | 0.00% |
| Heap used for norms | | 0 | 0 | 0 | MB | 0.00% |
| Heap used for points | | 0 | 0 | 0 | MB | 0.00% |
| Heap used for stored fields | | 0 | 0 | 0 | MB | 0.00% |
| Segment count | | 30 | 37 | 7 | | +23.33% |
| Total Ingest Pipeline count | | 0 | 0 | 0 | | 0.00% |
| Total Ingest Pipeline time | | 0 | 0 | 0 | ms | 0.00% |
| Total Ingest Pipeline failed | | 0 | 0 | 0 | | 0.00% |
| Min Throughput | index | 37019.6 | 38891 | 1871.34 | docs/s | +5.06% |
| Mean Throughput | index | 38264.2 | 40125 | 1860.84 | docs/s | +4.86% |
| Median Throughput | index | 37945.6 | 39700.9 | 1755.35 | docs/s | +4.63% |
| Max Throughput | index | 42149.2 | 43492.8 | 1343.67 | docs/s | +3.19% |
| 50th percentile latency | index | 860.665 | 789.215 | -71.4501 | ms | -8.30% |
| 90th percentile latency | index | 1132.66 | 1073.33 | -59.3224 | ms | -5.24% |
| 99th percentile latency | index | 6565.73 | 6284.46 | -281.265 | ms | -4.28% |
| 99.9th percentile latency | index | 10157.1 | 10921.5 | 764.368 | ms | +7.53% |
| 99.99th percentile latency | index | 12905.1 | 14296.8 | 1391.62 | ms | +10.78% |
| 100th percentile latency | index | 13873.7 | 20993.3 | 7119.55 | ms | +51.32% |
| 50th percentile service time | index | 860.665 | 789.215 | -71.4501 | ms | -8.30% |
| 90th percentile service time | index | 1132.66 | 1073.33 | -59.3224 | ms | -5.24% |
| 99th percentile service time | index | 6565.73 | 6284.46 | -281.265 | ms | -4.28% |
| 99.9th percentile service time | index | 10157.1 | 10921.5 | 764.368 | ms | +7.53% |
| 99.99th percentile service time | index | 12905.1 | 14296.8 | 1391.62 | ms | +10.78% |
| 100th percentile service time | index | 13873.7 | 20993.3 | 7119.55 | ms | +51.32% |

@martijnvg (Member) left a comment

Thanks Nhat for iterating here!

> So I re-ran the benchmark using the standard codec for seq_no, and the results look better.

So the only change was that the _seq_no field uses the stock Lucene doc values codec? Did the benchmark run with the same track params as before? I don't think we ever have recovery source with logsdb anymore.

> I think we can proceed with disabling points for seq_no in tsdb and logsdb, but using the standard codec.

Should we maybe put this behind a feature flag, so that we can see how it affects other benchmarks? I'm particularly interested in how this change affects the elastic/logs CCR benchmark.

private static Query rangeQueryForSeqNo(boolean withPoints, long lowerValue, long upperValue) {
    if (withPoints) {
        // TODO: Use IndexOrDocValuesQuery
        return LongPoint.newRangeQuery(SeqNoFieldMapper.NAME, lowerValue, upperValue);

A reviewer (Member) commented:

Address the TODO here and wrap this query in IndexOrDocValuesQuery?

@dnhatn (Member, Author) replied:

This should be a small change, but it may impact performance. I prefer to make it in a separate PR to avoid introducing noise. We can do it either before or after this change.
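
For illustration, a sketch of what addressing that TODO could look like, combining the point query with a doc-values query so Lucene can pick the cheaper execution per segment (field name hard-coded here instead of SeqNoFieldMapper.NAME; not the actual follow-up change):

import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.SortedNumericDocValuesField;
import org.apache.lucene.search.IndexOrDocValuesQuery;
import org.apache.lucene.search.Query;

final class SeqNoRangeQuerySketch {
    static Query rangeQueryForSeqNo(boolean withPoints, long lowerValue, long upperValue) {
        // The doc-values query works whether or not points were indexed.
        Query dvQuery = SortedNumericDocValuesField.newSlowRangeQuery("_seq_no", lowerValue, upperValue);
        if (withPoints) {
            // When points exist, let Lucene choose between the BKD-backed query and doc values.
            Query pointQuery = LongPoint.newRangeQuery("_seq_no", lowerValue, upperValue);
            return new IndexOrDocValuesQuery(pointQuery, dvQuery);
        }
        return dvQuery;
    }
}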

@dnhatn (Member, Author) commented May 21, 2025

Thanks @martijnvg

> So the only change was that the _seq_no field uses the stock Lucene doc values codec? Did the benchmark run with the same track params as before? I don't think we ever have recovery source with logsdb anymore.

That's correct. The TSDB codec likely uses more CPU to produce a more compact encoding than the standard codec.

> Should we maybe put this behind a feature flag, so that we can see how it affects other benchmarks? I'm particularly interested in how this change affects the elastic/logs CCR benchmark.

I considered this as well, but CCR can break if the feature is enabled on one cluster and not on another. For example, if the leader cluster enables the feature and doesn't index points for seq_no, but the follower disables it and does index points, that leads to inconsistent field infos. Should we run the elastic/logs CCR benchmark before merging this change, instead of introducing a feature flag?

@martijnvg (Member) commented:

> That's correct. The TSDB codec likely uses more CPU to produce a more compact encoding than the standard codec.

Right, in a follow-up we could investigate applying only some of the encoding techniques that the tsdb doc values codec uses for the _tsid field.

> For example, if the leader cluster enables the feature and doesn't index points for seq_no, but the follower disables it and does index points, that leads to inconsistent field infos.

Right, that can get messy and lead to hard-to-debug bugs. However, at least in production that shouldn't be an issue, given that feature flags shouldn't be used in production. The same applies to snapshot builds.

> Should we run the elastic/logs CCR benchmark before merging this change, instead of introducing a feature flag?

👍 That also sounds good to me.

@martijnvg (Member) commented:

Rally compare with current main as baseline and this change as contender:
baseline4-vs-seqno-no-dv-3.txt

(Median indexing throughput is ~12% slower; however, this statistic is a bit noisy.)

CCR performance results between current main and this PR: https://esbench-metrics.kb.us-east-2.aws.elastic-cloud.com:9243/app/dashboards#/view/2eb4019c-95ac-4a05-9990-cd85b22bbaeb?_g=h@1dbd48d

CCR replication is a little slower (~19%) with this PR, but it is much better than it was before the decompression issue was fixed (#128473): https://esbench-metrics.kb.us-east-2.aws.elastic-cloud.com:9243/app/r/s/iAxiI

Given that #128473 improved performance significantly, this change should be a good trade-off.

@martijnvg added the :StorageEngine/Mapping (The storage related side of mappings) and >enhancement labels on May 27, 2025
@martijnvg marked this pull request as ready for review on May 27, 2025 17:40
@martijnvg requested a review from a team as a code owner on May 27, 2025 17:40
@elasticsearchmachine (Collaborator) commented:

Pinging @elastic/es-storage-engine (Team:StorageEngine)

@elasticsearchmachine (Collaborator) commented:

Hi @dnhatn, I've created a changelog YAML for you.

@martijnvg (Member) left a comment

LGTM

@elasticsearchmachine added the serverless-linked (Added by automation, don't add manually) label on May 27, 2025
@dnhatn merged commit 7f2e55f into elastic:main on May 27, 2025 (19 checks passed)
@dnhatn deleted the seq_no_dv branch on May 27, 2025 22:59
@dnhatn (Member, Author) commented May 27, 2025

@martijnvg @kkrik-es Thank you so much for your review and for helping to unblock this PR. Should we backport this change to 8.19?

@martijnvg (Member) commented:

> Thank you so much for your review and for helping to unblock this PR. Should we backport this change to 8.19?

I don't think that's possible, given that doc value skippers are not available in the 8.19 branch.

@felixbarny (Member) commented:

What's the storage reduction per document for this change?

@martijnvg (Member) commented Jun 4, 2025 via email

@felixbarny (Member) commented:

I'm specifically curious about the reduction in bytes per document. Do you know how many documents we have ingested in the track? The relative reduction in the total size of the data set depends on how many fields are stored per document. The reduction in bytes per doc should stay consistent, no matter the data set.

@martijnvg (Member) commented:

I don't have that number, since each document has a different number of metric fields. I think we need one run of metricsgen with the index setting this PR enables and another run with it disabled to get a more accurate view of the savings on a per-metric basis.

> The reduction in bytes per doc should stay consistent, no matter the data set.

This depends on how sequence numbers get encoded in the BKD tree. I don't know what compression schemes are used.
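
For what it's worth, the field stats posted earlier in this thread at least give the size of the component this change removes on that track; the per-document figure would follow from dividing by the ingested document count, which isn't reported here:

_seq_no points in the baseline: 576,586,761 bytes (549.8 MB)
per-document saving ≈ 576,586,761 / <number of indexed documents> bytes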

Labels: >enhancement, serverless-linked, :StorageEngine/Mapping, Team:StorageEngine, v9.1.0