First draft of 0004-greenstand-search-engine #7

Open
wants to merge 1 commit into
base: main

Conversation

kparikh9


* ElasticSearch - can integrate well with other products of the Elastic Stack like Kibana, Logstash. Easiest to experiment with, since there are free trials available for Elastic Cloud (managed ElasticSearch deployment)

## Considered Options
Contributor

Did you consider Sphinx? https://stackshare.io/stackups/lucene-vs-sphinx I know it's great for docs search.

Contributor

Or maybe something from this list? https://www.educba.com/elasticsearch-alternatives/

Author

@kparikh9 kparikh9 May 24, 2022

Thanks for the links, @mckornfield! I'll check these out and see if any fit better than ES


* Steep learning curve?
* Requires more experimentation on what architecture is the best for Greenstand's use case (i.e. search over multiple indexes vs. one index)
* Heavy memory usage (requires 4.0 GB RAM just for ElasticSearch, probably more for Kibana and Logstash) - can be expensive since it requires larger compute servers and this would need to remain on at all times.
Contributor

It definitely costs a lot in terms of resources. Also, there's no good auth support in the free versions of ELK.

Author

For ELK, I believe we can set up service accounts to request and use tokens for authorization when passing requests to the Elastic cluster: https://www.elastic.co/guide/en/elasticsearch/reference/current/token-authentication-services.html. This doesn't seem to be limited to Elastic Cloud (which is just a managed deployment of the ELK stack).
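
To make the token idea concrete, here is a minimal sketch of how a client might attach such a token to a search request. It assumes the token has already been issued by the cluster and is sent as a `Bearer` credential in the `Authorization` header. The host `es.example.org`, the index name `planters`, and the token value are placeholders, not values from this PR. The request is only built, never sent.

```python
import urllib.request


def build_search_request(base_url: str, index: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a GET request to <base_url>/<index>/_search."""
    url = f"{base_url.rstrip('/')}/{index}/_search"
    return urllib.request.Request(
        url,
        # Assumption: the cluster accepts the token as a Bearer credential.
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


# Hypothetical host, index, and token, for illustration only.
req = build_search_request("https://es.example.org:9200", "planters", "PLACEHOLDER_TOKEN")
print(req.full_url)
print(req.get_header("Authorization"))
```

Whether this is available in a self-managed deployment depends on the license tier and security settings of the cluster, so it would need to be verified against the deployment we end up with.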


## Considered Options

* ElasticSearch
Contributor

Did you spike these with a sample dataset?

Author

Yes, I took about 20 rows from the public.planters and public.trees tables and all the rows from the public.organizations table in the treetracker database. I tested autocomplete/search hinting queries on three separate indexes (1 for each table) and on one single index that contained all three types of data rows (planters, trees, organizations).
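
The two layouts compared in the spike can be sketched roughly as below. This is not the exact query that was tested; it is an illustration that assumes a `multi_match` query of type `bool_prefix` is used for search hinting. The field names (`first_name`, `last_name`, `name`) and the combined index name `treetracker_all` are hypothetical. The code only builds the request path and JSON body, so it runs without a cluster.

```python
import json


def autocomplete_request(indexes, text, fields, size=5):
    """Return the (path, body) pair for a prefix-style autocomplete search.

    Several index names are joined with commas, which is how one search
    request is addressed to more than one index.
    """
    path = "/" + ",".join(indexes) + "/_search"
    body = {
        "size": size,
        "query": {
            "multi_match": {
                "query": text,
                # The last term of the input is matched as a prefix.
                "type": "bool_prefix",
                "fields": fields,
            }
        },
    }
    return path, body


# Hypothetical field names, for illustration only.
fields = ["first_name", "last_name", "name"]

# Layout A: one index per table, searched together in a single request.
multi_path, multi_body = autocomplete_request(
    ["planters", "trees", "organizations"], "gree", fields
)

# Layout B: one combined index holding all three kinds of rows.
single_path, single_body = autocomplete_request(["treetracker_all"], "gree", fields)

print(multi_path)
print(single_path)
print(json.dumps(multi_body, indent=2))
```

The query body is identical in both layouts; only the target path changes, which keeps the comparison between the two architectures cheap to repeat.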


## Decision Drivers

* ElasticSearch - can integrate well with other products of the Elastic Stack like Kibana, Logstash. Easiest to experiment with, since there are free trials available for Elastic Cloud (managed ElasticSearch deployment)
Contributor

So we just got rid of our ELK stack, which we were using for consolidated logging of microservices. It was very difficult for the current cloud team to manage and to keep deployed in our cluster. I presume we would not need the whole ELK stack to achieve what you are looking to do here? Kibana really stressed our cloud resources. However, maybe there is a more stripped-down deployment option that would meet your use case.


* ElasticSearch
* Apache Solr
* Apache Lucene
Contributor

Can you say more about why the Apache projects were not chosen? I don't have experience with either, but I do know that CKAN (our chosen data portal) uses Solr.

@ZavenArra
Contributor

@kparikh9 We generally prefer build before buy and self-management of our application platform. However, it seems like you have a quick solution here that adds some nice value. I am landing on cautious support for this plan, but I'd like to ask that we incorporate into this ADR some longer-range thinking about bringing the search engine into our own cloud in the future, without using Kibana. If both the philosophy at the start of this paragraph and the longer-term plan to in-house the solution are articulated in the ADR, I would be happy to support and accept this decision.

@dadiorchen
Contributor

@kparikh9 Sorry for the delay. Do you also want to try Solr a bit? I deployed a small node with Solr, and it seems pretty interesting: https://dev-k8s.treetracker.org/search/solr/#/mycoll/query?q=publisher_s:*am*&q.op=OR&indent=true
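
For anyone who wants to run the same wildcard query from a script instead of the admin UI, a rough sketch follows. It assumes the collection's standard `/select` request handler is reachable under the same base path as the admin UI link above, which has not been confirmed for this deployment. The collection name `mycoll` and the query parameters are taken from that link. The code only builds the URL and sends nothing.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def solr_select_url(base_url: str, collection: str, params: dict) -> str:
    """Build the URL for a query against a collection's /select handler."""
    return f"{base_url.rstrip('/')}/{collection}/select?{urlencode(params)}"


# Assumption: /select is exposed under the same base path as the admin UI.
url = solr_select_url(
    "https://dev-k8s.treetracker.org/search/solr",
    "mycoll",
    {"q": "publisher_s:*am*", "q.op": "OR", "indent": "true"},
)
print(url)

# Decode the query string again to confirm the parameters survive encoding.
print(parse_qs(urlsplit(url).query))
```

Here `publisher_s:*am*` matches documents whose `publisher_s` field contains "am" anywhere in the value, which is the same filter as in the linked admin UI query.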

@dadiorchen
Contributor

IMO, Solr is more suitable for our case, because:

  1. Our goal

Our main goal here is full-text search: searching planter info, species, orgs, and other content (and being able to search across fields), plus autocompletion. Both Solr and ES can do the job, but Solr is a more dedicated search engine with advanced features (ES is more focused on log analysis, I think), as the creator of ES admits:

> Solr is also a solution for exposing an indexing/search server over HTTP, but I would argue that ElasticSearch provides a much superior distributed model and ease of use (though currently lacking on some of the search features, but not for long, and in any case, the plan is to get all Compass features into ElasticSearch)

(source: https://stackoverflow.com/questions/2271600/elasticsearch-sphinx-lucene-solr-xapian-which-fits-for-which-usage)

Here is another opinion:

> Solr has more advantages when it comes to the static data, because of its caches and the ability to use an uninverted reader for faceting and sorting – for example, e-commerce. On the other hand, Elasticsearch is better suited – and much more frequently used – for timeseries data use cases, like log analysis use cases.

I think the two have different focuses and use cases.

  2. Our scale

Our goal is to index all Greenstand content, and I don't think the scale of that data is huge. So I don't think we need the highly scalable, distributed solution that ES is good at, which comes at the cost of maintenance and complexity.

  3. Open source

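Solr is more open source than ES: Solr is an Apache Software Foundation project released under the Apache License 2.0, whereas Elasticsearch moved from Apache 2.0 to the Elastic License and SSPL in version 7.11 (2021).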

4 participants