Use single ElasticSearch index to store dependencies #2143
Related to jaegertracing/spark-dependencies#68
@frittentheke would you like to submit a PR?
@pavolloffay Sure. What should that PR contain then? The change to the ES storage module (https://github.com/jaegertracing/jaeger/blob/master/plugin/storage/es/dependencystore/storage.go) to read and write to that single index, I suppose. The query already uses the timestamp field (https://github.com/jaegertracing/jaeger/blob/master/plugin/storage/es/dependencystore/storage.go#L111), so that would not even need changing.

Maybe a topic for a separate issue, but if I may ask: what are your plans going forward for producing those dependencies? Also, writing to the dependency storage is not done via the API but directly to Elasticsearch, hence the issue of "fixing" both ends of the equation. While all of Jaeger is Golang, running Java code and additionally pulling in the Spark framework seems overly complex, at least where Elasticsearch is concerned. See my comments regarding using the ES terms API (jaegertracing/spark-dependencies#68 (comment)) to keep all of the heavy lifting within the Elasticsearch cluster, with only minuscule amounts of data having to be transferred. But even keeping the current approach, using plain Golang and an Elasticsearch client to iterate over the data would at least keep the Jaeger components similar.
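To make the "keep the heavy lifting in the cluster" idea above concrete, here is a minimal sketch of one building block: a search body using an ES terms aggregation so that only aggregated buckets, not raw span documents, leave the cluster. The index field name `process.serviceName` is an assumption based on Jaeger's default span mapping; the full dependency computation would need more than this single aggregation.

```python
def build_service_terms_agg(size=1000):
    """Sketch: aggregate span documents by service name server-side.

    Returns an Elasticsearch search body; "process.serviceName" is an
    assumed field name from Jaeger's span mapping.
    """
    return {
        "size": 0,  # return aggregation buckets only, no raw span hits
        "aggs": {
            "services": {
                "terms": {"field": "process.serviceName", "size": size}
            }
        },
    }

body = build_service_terms_agg()
```

The body would be POSTed to the span indices' `_search` endpoint; the response then contains one bucket per service instead of the spans themselves.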
The UI does not have to be changed. We just need to change the writer (the writer is not used, though) and the reader. The dependency storage implementation should use the same index names as the span storage implementation. The index cleaner and rollover scripts will also have to be changed to support rollover.
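With dependencies kept in a single index, the reader change described above amounts to selecting a day's documents via a range query on the existing timestamp field rather than addressing a per-day index. A minimal sketch, assuming the dependency documents' field is named `timestamp` and takes epoch milliseconds (the exact field and unit follow the ES dependency store mapping, not anything defined here):

```python
def build_dependency_query(start_ms, end_ms):
    """Sketch: select dependency docs for a time window from one index.

    Assumes a "timestamp" field holding epoch milliseconds, mirroring
    the timestamp-based query the dependency reader already performs.
    """
    return {
        "query": {
            "range": {
                "timestamp": {"gte": start_ms, "lte": end_ms}
            }
        }
    }

# e.g. the window for one day: [0, 86_400_000) ms
query = build_dependency_query(0, 86_400_000)
```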
Any improvements to the ES query from the spark-dependencies job are welcome. Please create a separate issue.
There are no plans to rewrite the current jobs in Golang. The data aggregation jobs are memory-heavy, and in production systems with a lot of data they might require running a Spark/Flink cluster. The plan is to provide more aggregation jobs, hence frameworks like Spark are useful.
I was not suggesting / implying using rollover for storing dependencies, just a single index. There are so few documents holding dependencies (currently it's one per day) that it makes no sense to roll over. But thinking about it: using rollover in conjunction with ILM (Elasticsearch Index Lifecycle Management) might make sense just for the much easier housekeeping. Then no external job would be required to delete old indices / data; Elasticsearch would simply roll and expire indices to your liking, fully transparent to the application. We run this setup for the spans / services indices with great success.
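For illustration, the ILM-based housekeeping described above can be expressed as a policy with a hot phase that rolls indices over and a delete phase that expires them. This is a sketch of the standard ILM policy shape; the age thresholds are illustrative placeholders, not Jaeger defaults.

```python
def build_ilm_policy(rollover_age="1d", retention="30d"):
    """Sketch of an ILM policy: roll over daily, delete after retention.

    The thresholds are example values; a real deployment would tune
    them and PUT the policy to _ilm/policy/<name>.
    """
    return {
        "policy": {
            "phases": {
                "hot": {
                    "actions": {"rollover": {"max_age": rollover_age}}
                },
                "delete": {
                    "min_age": retention,
                    "actions": {"delete": {}}
                },
            }
        }
    }

policy = build_ilm_policy()
```

With such a policy attached to the write alias, Elasticsearch handles both rollover and expiry, so no external index-cleaner cron job is required.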
@pavolloffay I just pushed a PR: #2144
Slightly off-topic question: is ES ILM free to use? It's marked as an x-pack feature, which is a paid extension: https://www.elastic.co/guide/en/elasticsearch/reference/current/index-lifecycle-management.html
I am not sure how feasible it would be, given that the index can last for years and there is no way to remove old documents from it.
Yes @pavolloffay, but it is included in the free tier (no cost); see https://www.elastic.co/subscriptions
Any news?
…esolves #2143) (#2144)
* Add support for ES index aliases / rollover to the dependency store
* Give DependencyStore a params struct like the SpanStore to carry its configuration parameters
* Adapt and extend the tests accordingly
Signed-off-by: Christian Rohmann <[email protected]>
* Extend es-rollover and es-index-cleaner to support rolling dependencies indices
Signed-off-by: Christian Rohmann <[email protected]>
Co-authored-by: Christian Rohmann <[email protected]>
Co-authored-by: Albert <[email protected]>
Requirement - what kind of business use case are you trying to solve?
Using Elasticsearch as storage, and using it as efficiently as possible.
Problem - what in Jaeger blocks you from solving the requirement?
Currently the dependencies (System Architecture in the UI) are created per day and stored in a dedicated Elasticsearch index per day (see: https://github.com/jaegertracing/spark-dependencies/blob/master/jaeger-spark-dependencies-elasticsearch/src/main/java/io/jaegertracing/spark/dependencies/elastic/ElasticsearchDependenciesJob.java#L203).
The number of indices (actually the number of shards, but the two are closely related) used to store data in Elasticsearch should be kept low, as they are not "free" (see https://www.elastic.co/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster).
So, especially compared with the Jaeger span and service indices, for which Jaeger learned to use the rollover API in order to keep the number of shards low, creating a new index for each day of dependencies and then putting only a single document into it seems a little excessive.
Proposal - what do you suggest to solve the problem or improve the existing situation?
A coordinated switch in Jaeger, as well as in the referenced external (Spark) job creating the dependencies, to simply store them in a single index with a field marking which day they belong to.
As for housekeeping: it's one doc per day, so even if one never deletes any documents, that index would not explode in size. But if required / intended, this could be done in the Spark job as well, as in "keep for x days" and then delete docs with a timestamp older than that cutoff.
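The "keep for x days" housekeeping just described can be sketched as a `_delete_by_query` request body that removes dependency documents older than the retention window. This assumes the documents carry a date-typed `timestamp` field so ES date math (`now-30d`) applies; the retention value is an illustrative placeholder.

```python
def build_retention_delete(retention="30d"):
    """Sketch: _delete_by_query body removing docs older than retention.

    Assumes a date-typed "timestamp" field on the dependency documents;
    "now-30d" is Elasticsearch date-math syntax.
    """
    return {
        "query": {
            "range": {
                "timestamp": {"lt": f"now-{retention}"}
            }
        }
    }

body = build_retention_delete("30d")
```

A scheduler (or the Spark job itself) could POST this body to the dependency index's `_delete_by_query` endpoint once a day.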
Any open questions to address