
Add support for Apache Atlas DSL search
* Add new proxy with Apache Atlas as backend

* Add document describing Apache Atlas search methods
Nanne Wielinga authored and whazor committed May 9, 2019
1 parent fc1f6e5 commit 68fcd84
Showing 7 changed files with 553 additions and 2 deletions.
4 changes: 4 additions & 0 deletions README.md
@@ -85,6 +85,10 @@ Proxy package contains proxy modules that talk to dependencies of the Search service.
##### [Elasticsearch proxy module](https://github.com/lyft/amundsensearchlibrary/blob/master/search_service/proxy/elasticsearch.py "Elasticsearch proxy module")
[Elasticsearch](https://www.elastic.co/products/elasticsearch "Elasticsearch") proxy module serves various use cases of searching metadata from Elasticsearch. It uses [Query DSL](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html "Query DSL") for each use case, executes the search query, and transforms the results into the [model](https://github.com/lyft/amundsensearchlibrary/tree/master/search_service/models "model").
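
As an illustration only, a Query DSL search could look like the sketch below. The index name matches the default `ELASTICSEARCH_INDEX` in the configuration, but the field list is a placeholder rather than the module's exact query:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()  # defaults to localhost:9200

# Simplified full-text query; the actual proxy builds a richer query than this.
body = {
    "query": {
        "multi_match": {
            "query": "test_table",
            "fields": ["name", "schema_name", "description", "column_names", "tags"]
        }
    }
}
response = es.search(index="table_search_index", body=body)
print(response["hits"]["total"])
```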

##### [Atlas proxy module](https://github.com/lyft/amundsensearchlibrary/blob/master/search_service/proxy/atlas.py "Atlas proxy module")
[Apache Atlas](https://atlas.apache.org/ "Apache Atlas") proxy module uses Apache Atlas to serve search requests. At the moment, the Search DSL REST API is used via the [Python client](https://atlasclient.readthedocs.io/ "Atlas Client").
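
A minimal sketch of how such a proxy can talk to Atlas through `atlasclient` (the host, credentials, and query string below are placeholders):

```python
from atlasclient.client import Atlas

# Placeholder connection details; real values come from the search service configuration.
client = Atlas('localhost', username='admin', password='admin')

# Issue a Search DSL query; each returned collection exposes the matched entities.
params = {'query': "from Table where name like '*orders*' limit 10"}
for collection in client.search_dsl(**params):
    if hasattr(collection, 'entities'):
        for entity in collection.entities:
            print(entity.guid, entity.attributes.get('name'))
```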


##### [Statsd utilities module](https://github.com/lyft/amundsensearchlibrary/blob/master/search_service/proxy/statsd_utilities.py "Statsd utilities module")
[Statsd](https://github.com/etsy/statsd/wiki "Statsd") utilities module provides methods / functions to publish metrics via statsd. By default, statsd integration is disabled and you can turn it on from [Search service configuration](https://github.com/lyft/amundsensearchlibrary/blob/master/search_service/config.py#L7 "Search service configuration").
For configuration specific to statsd, you can configure it through [environment variables.](https://statsd.readthedocs.io/en/latest/configure.html#from-the-environment "environment variable.")
35 changes: 35 additions & 0 deletions docs/atlas-search.md
@@ -0,0 +1,35 @@
# Atlas search investigation
There are several approaches to integrating search with [Apache Atlas](https://atlas.apache.org/ "Apache Atlas"); we describe the options below:

- Use REST APIs

Directly using the Atlas APIs is quick to implement and easy to set up for administrators. Atlas uses a search engine
under the hood (embedded Solr) to perform search queries, so in theory this method should scale. The disadvantage is that
we are limited to the REST API that Atlas offers, although we could potentially add functionality via pull requests and
extend the search capabilities. The [advanced search](https://atlas.apache.org/Search-Advanced.html "Apache Atlas Advanced Search")
provides a DSL which contains basic forms of aggregation and arithmetic; an example query is sketched below, after this list of options.

- Use Data Builder to fill Elasticsearch from Atlas

Adopting Atlas within the Data Builder to fill Elasticsearch is a relatively straightforward way of staying
compatible with the Neo4j database. Data could either be pulled from Atlas or pushed through Kafka. This method
requires a setup of Elasticsearch and Airflow, which increases the amount of infrastructure and maintenance.
Another disadvantage is that with a big inflow of metadata this method might not scale as well as the other methods.

- Use underlying Solr or Elasticsearch from Apache Atlas

Within Atlas there is the possibility to open up either Solr or the experimental Elasticsearch backend. Both are populated
by JanusGraph (the graph database working behind the scenes), so the resulting indexes would not be compatible with
the data builder setup. Adopting such a search engine would require either new queries, some kind of transformer
within the search engine, or changes within Atlas itself.
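
For illustration, the DSL mentioned in the REST API option supports free-text-style filtering, counting, and paging. The statements below mirror the kinds of queries the new proxy issues (the search term is a placeholder):

```python
# Free-text search over table name and description
sql = "Table from Table where name like '*orders*' or description like '*orders*' "

# The DSL's basic aggregation support, used here to count matching tables
count_sql = f"{sql} select count()"

# Paging is done by appending limit/offset to the same query
paged_sql = f"{sql} limit 10 offset 0"
```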

## Discussion
Both the REST API approach and the data builder approach can be implemented and made configurable. Each has
its own benefits: the data builder provides a more fine-tuned search, whereas the Atlas REST API comes out
of the box with Atlas. The last approach, using the underlying search engine from Atlas, provides direct access
to all the metadata with a decent search API. However, integration would be less straightforward, as the indexes would
differ from those produced by the data builder's search engine loader.


The initial focus is to implement the REST API approach, and afterwards to implement an Atlas data extractor and perhaps
an importer.
1 change: 1 addition & 0 deletions requirements.txt
Original file line number Diff line number Diff line change
@@ -20,3 +20,4 @@ typing==3.6.4
Werkzeug==0.14.1
wheel==0.31.1
mypy==0.660
atlasclient==0.1.6
15 changes: 15 additions & 0 deletions search_service/config.py
@@ -14,6 +14,7 @@
PROXY_CLIENT = 'PROXY_CLIENT'
PROXY_CLIENTS = {
'ELASTICSEARCH': 'search_service.proxy.elasticsearch.ElasticsearchProxy',
'ATLAS': 'search_service.proxy.atlas.AtlasProxy'
}


@@ -23,6 +24,20 @@ class Config:
LOG_DATE_FORMAT = '%Y-%m-%dT%H:%M:%S%z'
LOG_LEVEL = 'INFO'

# Used to differentiate tables from other entities in Atlas. For more details:
# https://github.com/lyft/amundsenmetadatalibrary/blob/master/docs/proxy/atlas_proxy.md
ATLAS_TABLE_ENTITY = 'Table'

# The relationshipAttribute name of an Atlas entity that identifies its database entity.
ATLAS_DB_ATTRIBUTE = 'db'

# Display name attribute of Atlas entities that we use for the Amundsen project.
# Atlas uses qualifiedName as the indexed attribute, but also supports the 'name' attribute.
ATLAS_NAME_ATTRIBUTE = 'name'

# Config used by Elasticsearch
ELASTICSEARCH_INDEX = 'table_search_index'


class LocalConfig(Config):
DEBUG = False
213 changes: 213 additions & 0 deletions search_service/proxy/atlas.py
@@ -0,0 +1,213 @@
import logging

from atlasclient.client import Atlas
from atlasclient.exceptions import BadRequest
from atlasclient.models import Entity, EntityCollection
from flask import current_app as app
from typing import List, Dict

from search_service.models.search_result import SearchResult
from search_service.models.table import Table
from search_service.proxy import BaseProxy
from search_service.proxy.statsd_utilities import timer_with_counter

DEFAULT_PAGE_SIZE = 10  # default search page size
LOGGING = logging.getLogger(__name__)


class AtlasProxy(BaseProxy):
"""
AtlasSearch connection handler
"""
TABLE_ENTITY = app.config['ATLAS_TABLE_ENTITY']
DB_ATTRIBUTE = app.config['ATLAS_DB_ATTRIBUTE']
NAME_ATTRIBUTE = app.config['ATLAS_NAME_ATTRIBUTE']
ATTRS_KEY = 'attributes'
REL_ATTRS_KEY = 'relationshipAttributes'

atlas: Atlas

def __init__(self, *,
host: str = None,
index: str = None,
user: str = '',
password: str = '',
page_size: int = DEFAULT_PAGE_SIZE) -> None:
self.atlas = Atlas(host, username=user, password=password)
self.index = index
self.page_size = page_size

@staticmethod
def _entities(collections: EntityCollection) -> List[Entity]:
"""
Helper method for flattening all collections from {collections}
:return: list of all entities
"""
entities: List[Entity] = []
for collection in collections:
entities.extend(collection.entities)
return entities

def _parse_results(self, response: EntityCollection) -> List[Table]:
"""
Based on an Atlas {response} with table entities, map the required information.
:return: list of tables
"""
table_results = []
ids = list()
for hit in response:
ids.append(hit.guid)
# receive all entities
entities = self._entities(self.atlas.entity_bulk(guid=ids))
db_ids = []
for entity in entities:
relations = entity.relationshipAttributes
database = relations.get(self.DB_ATTRIBUTE)
if database:
db_ids.append(database['guid'])

# request databases
dbs_list = self._entities(self.atlas.entity_bulk(guid=db_ids)) if len(db_ids) > 0 else []
dbs_dict: Dict[str, Entity] = {db.guid: db for db in dbs_list}
for entity in entities:
relations = entity.relationshipAttributes
attrs = entity.attributes
database = relations.get(self.DB_ATTRIBUTE)
if database and database['guid'] in dbs_dict:
db_entity = dbs_dict[database['guid']]
db_attrs = db_entity.attributes

db_name = db_attrs.get(self.NAME_ATTRIBUTE)
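# qualifiedName typically looks like '<db_name>@<cluster>'; the token after '@' is used as the cluster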
db_qualified_name = db_attrs.get('qualifiedName', '').split('@')
db_cluster = db_qualified_name[1] if len(db_qualified_name) > 1 else ''
else:
db_cluster = ''
db_name = ''

tags = []
# Use `or` in case the key 'classifications' exists but its value is None
for classification in attrs.get("classifications") or list():
tags.append(
classification.get('typeName')
)

# TODO: Implement columns
columns: List[str] = []
# for column in attrs.get('columns') or list():
# col_entity = entity.referredEntities[column['guid']]
# col_attrs = col_entity['attributes']
# columns.append(col_attrs.get(self.NAME_KEY))
table_name = attrs.get(self.NAME_ATTRIBUTE)
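# Build the Amundsen table key in the form '<entity type>://<cluster>.<db>/<table>'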
table = Table(name=table_name,
key=f"{entity.typeName or 'Table'}://{db_cluster}.{db_name}/{table_name}",
description=attrs.get('description'),
cluster=db_cluster,
database=entity.typeName or 'Table',
schema_name=db_name,
column_names=columns,
tags=tags,
last_updated_epoch=attrs.get('updateTime'))

table_results.append(table)

return table_results

@timer_with_counter
def fetch_search_results_with_field(self, *,
query_term: str,
field_name: str,
field_value: str,
page_index: int = 0) -> SearchResult:
"""
Query Atlas and return results as list of Table objects.
Per field name we have a count query and a query for the tables.
https://atlas.apache.org/Search-Advanced.html
:param query_term: search query term
:param field_name: field name to search on (e.g. schema_name, tag_names)
:param field_value: value of the field used for filtering
:param page_index: index of the search page the user is currently on
:return: SearchResult object
"""

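# Default to a query that matches nothing, so unsupported field names return empty results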
sql = f"Table from Table where false"
count_sql = f"{sql} select count()"
if field_name == 'tag':
sql = f"from Table where Table is '{field_value}'"
count_sql = f"{sql} select count()"
elif field_name == 'schema':
sql = f"from Table where db.name like '{field_value}'"
count_sql = f"{sql} select count()"
elif field_name == 'table':
sql = f"from Table where name like '{field_value}'"
count_sql = f"{sql} select count()"
elif field_name == 'column':
sql = f"hive_column where name like '{field_value}' select table"
# TODO nanne: count tables instead of columns
count_sql = f"hive_column where name like '{field_value}' select count()"

LOGGING.debug(f"Used following sql query: {sql}")
tables: List[Table] = []
count_value = 0
try:
# count results
count_params = {'query': count_sql}
count_results = list(self.atlas.search_dsl(**count_params))[0]
count_value = count_results._data['attributes']['values'][0][0]

params = {'query': f"{sql} limit {self.page_size} offset {page_index * self.page_size}"}
search_results = self.atlas.search_dsl(**params)
if count_value > 0 and page_index * self.page_size <= count_value:
# unpack all collections (usually just one collection though)
for collection in search_results:
if hasattr(collection, 'entities'):
tables.extend(self._parse_results(response=collection.entities))
except BadRequest:
LOGGING.error("Atlas Search DSL error with the following query:", sql)

return SearchResult(total_results=count_value, results=tables)

@timer_with_counter
def fetch_search_results(self, *,
query_term: str,
page_index: int = 0) -> SearchResult:
"""
Query Atlas and return results as list of Table objects
We use the Atlas DSL for querying the tables.
https://atlas.apache.org/Search-Advanced.html
:param query_term: search query term
:param page_index: index of the search page the user is currently on
:return: SearchResult Object
"""

if not query_term:
# return empty result for blank query term
return SearchResult(total_results=0, results=[])

# define query
sql = f"Table from Table " \
f"where name like '*{query_term}*' or " \
f"description like '*{query_term}*' "

# count amount of tables
count_params = {'query': f"{sql} select count()"}
count_results = list(self.atlas.search_dsl(**count_params))[0]
count_value = count_results._data['attributes']['values'][0][0]

# select tables
params = {
'query': f"{sql} "
f"limit {self.page_size} "
f"offset {page_index * self.page_size}"}
search_results = self.atlas.search_dsl(**params)

# retrieve results
tables = []
if count_value > 0 and page_index * self.page_size <= count_value:
for s in search_results:
tables.extend(self._parse_results(response=s.entities))

return SearchResult(total_results=count_value, results=tables)
2 changes: 0 additions & 2 deletions search_service/proxy/base.py
@@ -1,7 +1,5 @@
from abc import ABCMeta, abstractmethod

from typing import Union, List, Dict, Any

from search_service.models.search_result import SearchResult

