Use single ElasticSearch index to store dependencies #2143
Related to jaegertracing/spark-dependencies#68
@frittentheke would you like to submit a PR?
@pavolloffay Sure. What should that PR contain then? The change to the ES storage module (https://github.com/jaegertracing/jaeger/blob/master/plugin/storage/es/dependencystore/storage.go) to read and write to that single index, I suppose. The query already uses the timestamp field (https://github.com/jaegertracing/jaeger/blob/master/plugin/storage/es/dependencystore/storage.go#L111), so that would not even need changing.

Maybe a topic for a separate issue, but if I may ask: what are your plans going forward for producing those dependencies? Also, writing to the dependency storage is not done via the API but directly to Elasticsearch, hence the issue of "fixing" both ends of the equation. While all of Jaeger is Golang, running Java code and additionally pulling in the Spark framework seems overly complex, at least where Elasticsearch is concerned. See my comments regarding using the ES terms API (jaegertracing/spark-dependencies#68 (comment)) to keep all of the heavy lifting within the Elasticsearch cluster, with only minuscule amounts of data having to be transferred. But even keeping the current approach, using plain Golang and an Elasticsearch client to iterate over the data would at least keep the Jaeger components similar.
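To make the "keep the heavy lifting in the cluster" idea above concrete, here is a minimal sketch of one building block: a search body using an ES terms aggregation so that only aggregated buckets, not raw span documents, leave the cluster. The index field name `process.serviceName` is an assumption based on Jaeger's default span mapping; the full dependency computation would need more than this single aggregation.

```python
def build_service_terms_agg(size=1000):
    """Sketch: aggregate span documents by service name server-side.

    Returns an Elasticsearch search body; "process.serviceName" is an
    assumed field name from Jaeger's span mapping.
    """
    return {
        "size": 0,  # return aggregation buckets only, no raw span hits
        "aggs": {
            "services": {
                "terms": {"field": "process.serviceName", "size": size}
            }
        },
    }

body = build_service_terms_agg()
```

The body would be POSTed to the span indices' `_search` endpoint; the response then contains one bucket per service instead of the spans themselves.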
The UI does not have to be changed. We just need to change the writer (the writer is not used, though) and the reader. The dependency storage implementation should use the same index names as the span storage implementation. The index cleaner and rollover scripts will also have to be changed to support rollover.
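With dependencies kept in a single index, the reader change described above amounts to selecting a day's documents via a range query on the existing timestamp field rather than addressing a per-day index. A minimal sketch, assuming the dependency documents' field is named `timestamp` and takes epoch milliseconds (the exact field and unit follow the ES dependency store mapping, not anything defined here):

```python
def build_dependency_query(start_ms, end_ms):
    """Sketch: select dependency docs for a time window from one index.

    Assumes a "timestamp" field holding epoch milliseconds, mirroring
    the timestamp-based query the dependency reader already performs.
    """
    return {
        "query": {
            "range": {
                "timestamp": {"gte": start_ms, "lte": end_ms}
            }
        }
    }

# e.g. the window for one day: [0, 86_400_000) ms
query = build_dependency_query(0, 86_400_000)
```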
Any improvements to the ES query from the spark-dependencies job are welcome. Please create a separate issue.
There are no plans to rewrite the current jobs in Golang. The data aggregation jobs are memory-heavy, and in production systems with a lot of data they might require running a Spark/Flink cluster. The plan is to provide more aggregation jobs, hence frameworks like Spark are useful.
I was not suggesting / implying using rollover for storing dependencies, just a single index. There are so few documents holding dependencies (currently it's one per day) that it makes no sense to roll over. But thinking about it: using rollover in conjunction with ILM (Elasticsearch Index Lifecycle Management) might make sense just for the much easier housekeeping. Then no external job would be required to delete old indices / data; Elasticsearch would simply roll and expire indices to your liking, fully transparent to the application. We run this setup for the spans / services indices with great success.
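For illustration, the ILM-based housekeeping described above can be expressed as a policy with a hot phase that rolls indices over and a delete phase that expires them. This is a sketch of the standard ILM policy shape; the age thresholds are illustrative placeholders, not Jaeger defaults.

```python
def build_ilm_policy(rollover_age="1d", retention="30d"):
    """Sketch of an ILM policy: roll over daily, delete after retention.

    The thresholds are example values; a real deployment would tune
    them and PUT the policy to _ilm/policy/<name>.
    """
    return {
        "policy": {
            "phases": {
                "hot": {
                    "actions": {"rollover": {"max_age": rollover_age}}
                },
                "delete": {
                    "min_age": retention,
                    "actions": {"delete": {}}
                },
            }
        }
    }

policy = build_ilm_policy()
```

With such a policy attached to the write alias, Elasticsearch handles both rollover and expiry, so no external index-cleaner cron job is required.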
@pavolloffay I just pushed a PR: #2144
Slightly off-topic question: is ES ILM free to use? It's marked as an x-pack feature, which is a paid extension: https://www.elastic.co/guide/en/elasticsearch/reference/current/index-lifecycle-management.html
I am not sure how feasible it would be, given that the index can last for years and there is no way to remove old documents from it.
Yes @pavolloffay, but it is included in the free tier (no cost); see https://www.elastic.co/subscriptions
Any news?
…esolves #2143) (#2144)
* Add support for ES index aliases / rollover to the dependency store
* Give DependencyStore a params struct like the SpanStore to carry its configuration parameters
* Adapt and extend the tests accordingly
Signed-off-by: Christian Rohmann <[email protected]>
* Extend es-rollover and es-index-cleaner to support rolling dependencies indices
Signed-off-by: Christian Rohmann <[email protected]>
Co-authored-by: Christian Rohmann <[email protected]>
Co-authored-by: Albert <[email protected]>
Requirement - what kind of business use case are you trying to solve?
Using Elasticsearch as storage, and using it as efficiently as possible.
Problem - what in Jaeger blocks you from solving the requirement?
Currently the dependencies (System Architecture in the UI) are created per day and stored in a dedicated Elasticsearch index per day (see: https://github.com/jaegertracing/spark-dependencies/blob/master/jaeger-spark-dependencies-elasticsearch/src/main/java/io/jaegertracing/spark/dependencies/elastic/ElasticsearchDependenciesJob.java#L203).
The number of indices (actually the number of shards, but the two are closely related) used to store data in Elasticsearch should be kept low, as they are not "free" (see https://www.elastic.co/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster).
So, especially compared with the Jaeger span and service indices, for which Jaeger learned to use the rollover API in order to keep the number of shards low, creating a new index for each day of dependencies and then putting only a single document into it seems a little excessive.
Proposal - what do you suggest to solve the problem or improve the existing situation?
A coordinated switch in Jaeger, as well as in the referenced external (Spark) job creating the dependencies, to simply store them in a single index with a field marking which day they belong to.
As for housekeeping: it's one doc per day, so even if one never deletes any documents, that index would not explode in size. But if required / intended, this could be done in the Spark job as well, as in "keep for x days" and then delete docs with a timestamp older than that cutoff.
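The "keep for x days" housekeeping just described can be sketched as a `_delete_by_query` request body that removes dependency documents older than the retention window. This assumes the documents carry a date-typed `timestamp` field so ES date math (`now-30d`) applies; the retention value is an illustrative placeholder.

```python
def build_retention_delete(retention="30d"):
    """Sketch: _delete_by_query body removing docs older than retention.

    Assumes a date-typed "timestamp" field on the dependency documents;
    "now-30d" is Elasticsearch date-math syntax.
    """
    return {
        "query": {
            "range": {
                "timestamp": {"lt": f"now-{retention}"}
            }
        }
    }

body = build_retention_delete("30d")
```

A scheduler (or the Spark job itself) could POST this body to the dependency index's `_delete_by_query` endpoint once a day.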
Any open questions to address