Merge pull request amundsen-io#5 in BI/amundsensearchlibrary from add_metrics_dashboards to test

* commit '6ffc57032ef83197b3d642ccad02b31a633c0d18':
  Add metrics/dashboards
  Fix #24, correct initialisation of elastic search (amundsen-io#27)
  [DPTOOLS-2252] Publish Docker image in CI (amundsen-io#26)
  Integrates Atlas DSL Search (amundsen-io#17)
  Update PULL_REQUEST_TEMPLATE.md (amundsen-io#23)
  Update README.md (amundsen-io#22)
  Add codecov based for search repo (amundsen-io#20)
  Update README.md (amundsen-io#19)
  Set the elasticsearch base (endpoint) from env variable  (amundsen-io#16)
  Adds the PR template for amundsen search service (amundsen-io#15)
  Doc fix: Docker pull the official image (amundsen-io#14)
  Changed the name of this file for consistency (amundsen-io#13)
  gitignore dist/ as in metadataservice PR #28 (amundsen-io#12)
Jakub Hettler authored and Jakub Hettler committed Jun 20, 2019
2 parents 1ffa15e + 6ffc570 commit 9a48ff2
Showing 21 changed files with 937 additions and 105 deletions.
22 changes: 22 additions & 0 deletions .github/PULL_REQUEST_TEMPLATE.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,22 @@
### Summary of Changes

_Include a summary of changes then remove this line_

### Tests

_What tests did you add or modify and why? If no tests were added or modified, explain why. Remove this line_

### Documentation

_What documentation did you add or modify and why? Add any relevant links then remove this line_

### Checklist
Make sure you have checked **all** steps below to ensure a timely review.
- [ ] PR title addresses the issue accurately and concisely. Example: "Updates the version of Flask to v1.0.2"
- In case you are adding a dependency, check if the license complies with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x).
- [ ] PR includes a summary of changes.
- [ ] PR adds unit tests, updates existing unit tests, __OR__ documents why no test additions or modifications are needed.
- [ ] In case of new functionality, my PR adds documentation that describes how to use it.
- All the public functions and classes in the PR contain docstrings that explain what they do
- [ ] PR passes `make test`
- [ ] I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)"
3 changes: 2 additions & 1 deletion .gitignore
@@ -6,10 +6,11 @@
*.egg-info
.*.swp
.DS_Store
build/
dist/
venv/
venv3/
.cache/
build/
.idea/
.coverage
*coverage.xml
13 changes: 12 additions & 1 deletion .travis.yml
@@ -4,14 +4,25 @@ python:
install:
- pip3 install -r requirements.txt
- pip3 install mypy
- pip3 install codecov
script:
- make test
after_success:
- codecov
deploy:
provider: pypi
- provider: pypi
user: amundsen-dev
password:
secure: ZGfmHZUxxwWVfzKam4R0tzeb80SvopPkkCq1qQw5VIeKhVNHoE0CVyNa85Yo5xsvZSAYHOfGbCP29c6Y6c4+VlXmJQ+1majOFiwzM/kFau2dIaa965N5OXMU/AMtYlw68SpXo/jNlxNSDQwJtd8xEUyaHAS5fc5VvmXD05r4/2SvoodN9TrmkbwzZ8ujJTXG6U3vC3rR5fEIL5WDsdWrvk/Q6b6lvIDqJZ/RXYotauQeG1EXmyB4VroSjYO4+l9Zry+S46gYHWYEidQH1W1UrBRePXwqIH6VAH97PbsCFw6mvUeSkQYJHcSkB+wLUeN+HMcQOk6F3ke9NRkolx4LgD6N+AlXSz0wLta3IUgOAvbUZdxKnBKhjCftWoZx1SWYLtbqIJXEwC51dEQBUd/7GQJuIvXFnNDVz1yKB4zuYqx8v1a5D5uLT69152K14nTB6zu23eRvgPpXUlewAdfyqntOWwFBgK8KMeY54L3IglP+MiebP6smvhm0k5Fk1m7muqzmz3/1/if9UGXCnabG2Yml4MwutRR4PXuN8g9pOgRTQmHd3SV7Y8O9RIL+b8tbagwbpI3CLojO+9b52T7gqrV0d64I4g0WB2M7FsNZBhRuC9Wu7o8a+R9Ft5aK/wF2tRDd+VdurUbBlhhVgdJGWWxkmgk87kqGJjk25Hjm5Oo=
on:
tags: true
distributions: sdist bdist_wheel
repo: lyft/amundsensearchlibrary
- provider: script
script: docker login -u amundsendev -p $DOCKER_LOGIN_PASSWORD && make build-push-image
on:
branch: master
tags: true
env:
global:
secure: beSR8ZRLfnRGfO3xTZo24ETvkPMRmrejS8toR1TfCmcAUZRYrgo7qQ2WaNhCESqyVTY0VcGguamhb8kA8BZ9/WgnCXXqGUfoRhEcViqgpTY0oTnB0Jc0EN4JqTVDAGXio0JdejGQa2LiDfjGZSFCsQkg1GlmN+EzmhliA0VKKpeYJKJS4uCLCEULRkUG7pVvYCEPaVkLrVuj0Jv9nLhfPrgLsIxSQVySRklB6raooEWhFAgH2TMWL6XD/5Hb4kbtx7Diookmlz/lzTxxq8wdOlRAC9tysofkGYIw+KCndi2zVztBvmvewMYMaY4H+MbF5sI1ld7a0nc3xy0QWjamBOgreqalDay0Cj1I2NHcuHW4G4ynHR6uzo8GZInT8owQiKblWlgSLh+DvmYj6EaoIBzEBZz8ULw++0ICJhQjntj+c+qNiSvqK5GTsuCK6eMEPCuf/izDQzJQccV11N5f6qkXKn54XrjvHXHuc+Cq4Np9oObKzQeQlUYugWudWI1GJw5eKhcgSJoDYNNWbjXJhXWpBjyiXvhlwtvVIz7umqhd2B7JihSHR5HOY9Ublk1XAGCuhMXwo5HywFAIiaYoNu0BWsXZL93MaVL98+JkVTpino7IHi3TX0CcpU5Qbt4sHwoerHYbcqA/7XiItH27pV7vO6VH84nZIQ3lNk1SdG8=
16 changes: 16 additions & 0 deletions Makefile
@@ -1,3 +1,6 @@
IMAGE := amundsendev/amundsen-search
VERSION:= $(shell grep -m 1 '__version__' setup.py | cut -d '=' -f 2 | tr -d "'" | tr -d '[:space:]')

.PHONY: test
clean:
find . -name \*.pyc -delete
@@ -18,3 +21,16 @@ mypy:

.PHONY: test
test: test_unit lint mypy

.PHONY: image
image:
docker build -f public.Dockerfile -t ${IMAGE}:${VERSION} .
docker tag ${IMAGE}:${VERSION} ${IMAGE}:latest

.PHONY: push-image
push-image:
docker push ${IMAGE}:${VERSION}
docker push ${IMAGE}:latest

.PHONY: build-push-image
build-push-image: image push-image
17 changes: 14 additions & 3 deletions README.md
@@ -1,4 +1,11 @@
# Amundsen Search service
[![PyPI version](https://badge.fury.io/py/amundsen-search.svg)](https://badge.fury.io/py/amundsen-search)
[![Build Status](https://api.travis-ci.com/lyft/amundsensearchlibrary.svg?branch=master)](https://travis-ci.com/lyft/amundsensearchlibrary)
[![Coverage Status](https://img.shields.io/codecov/c/github/lyft/amundsensearchlibrary/master.svg)](https://codecov.io/github/lyft/amundsensearchlibrary?branch=master)
[![License](http://img.shields.io/:license-Apache%202-blue.svg)](LICENSE)
[![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg)](https://img.shields.io/badge/PRs-welcome-brightgreen.svg)
[![Slack Status](https://img.shields.io/badge/slack-join_chat-white.svg?logo=slack&style=social)](https://bit.ly/2FVq37z)

Amundsen Search service serves a RESTful API and is responsible for searching metadata. The service leverages [Elasticsearch](https://www.elastic.co/products/elasticsearch "Elasticsearch") for most of its search capabilities.

## Instructions to start the Search service from distribution
@@ -34,8 +41,8 @@ $ curl -v http://localhost:5000/healthcheck

## Instructions to start the service from the Docker
```bash
$ docker pull amundsen-search
$ docker run -p 5000:5000 amundsen-search
$ docker pull amundsendev/amundsen-search:latest
$ docker run -p 5000:5000 amundsendev/amundsen-search
```

In a different terminal, verify the service is up by running
@@ -85,10 +92,14 @@ Proxy package contains proxy modules that talk to the dependencies of the Search service.
##### [Elasticsearch proxy module](https://github.com/lyft/amundsensearchlibrary/blob/master/search_service/proxy/elasticsearch.py "Elasticsearch proxy module")
[Elasticsearch](https://www.elastic.co/products/elasticsearch "Elasticsearch") proxy module serves various use cases for searching metadata from Elasticsearch. It uses [Query DSL](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html "Query DSL") to build each search query, executes it, and transforms the response into [model](https://github.com/lyft/amundsensearchlibrary/tree/master/search_service/models "model") objects.
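The actual query construction lives in `search_service/proxy/elasticsearch.py`; as an illustration only, here is a minimal sketch of the kind of Query DSL body such a proxy might send. The field names and pagination scheme are invented for the example, not taken from the repository:

```python
def build_query(query_term: str, page_index: int, page_size: int = 10) -> dict:
    """Sketch of a paginated Query DSL body for a metadata search.

    `multi_match` searches the term across several (hypothetical) fields;
    `from`/`size` implement page-based offsets.
    """
    return {
        "from": page_index * page_size,
        "size": page_size,
        "query": {
            "multi_match": {
                "query": query_term,
                # Illustrative field names only.
                "fields": ["name", "description", "tags"],
            }
        },
    }
```

A proxy would pass a body like this to the Elasticsearch client's search call and then map each hit onto a model class.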

##### [Atlas proxy module](https://github.com/lyft/amundsensearchlibrary/blob/master/search_service/proxy/atlas.py "Atlas proxy module")
[Apache Atlas](https://atlas.apache.org/ "Apache Atlas") proxy module uses Atlas to serve search requests. At the moment, the search DSL REST API is used via the [Python Client](https://atlasclient.readthedocs.io/ "Atlas Client").


##### [Statsd utilities module](https://github.com/lyft/amundsensearchlibrary/blob/master/search_service/proxy/statsd_utilities.py "Statsd utilities module")
[Statsd](https://github.com/etsy/statsd/wiki "Statsd") utilities module provides helper functions for publishing metrics via statsd. By default, statsd integration is disabled; you can turn it on from the [Search service configuration](https://github.com/lyft/amundsensearchlibrary/blob/master/search_service/config.py#L7 "Search service configuration").
For statsd-specific configuration, you can use [environment variables](https://statsd.readthedocs.io/en/latest/configure.html#from-the-environment "environment variables").
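As a small illustration of the environment-based configuration the statsd docs describe, the sketch below collects the documented `STATSD_*` variables into a settings dict. The values shown are placeholders, and the helper is hypothetical (the real client is constructed by the statsd library itself when installed):

```python
import os

# Placeholder defaults; in a real deployment these would come from the
# environment of the service process.
os.environ.setdefault("STATSD_HOST", "localhost")
os.environ.setdefault("STATSD_PORT", "8125")
os.environ.setdefault("STATSD_PREFIX", "amundsen.search")


def statsd_settings() -> dict:
    """Collect statsd connection settings from the environment."""
    return {
        "host": os.environ["STATSD_HOST"],
        "port": int(os.environ["STATSD_PORT"]),
        "prefix": os.environ.get("STATSD_PREFIX"),
    }
```

With the statsd package installed, the same variables drive its environment-based default client.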

### [Models package](https://github.com/lyft/amundsensearchlibrary/tree/master/search_service/models "Models package")
Models package contains many modules where each module has many Python classes in it. These Python classes are being used as a schema and a data holder. All data exchange within Amundsen Search service use classes in Models to ensure validity of itself and improve readability and mainatability.
Models package contains many modules where each module has many Python classes in it. These Python classes are being used as a schema and a data holder. All data exchange within Amundsen Search service use classes in Models to ensure validity of itself and improve readability and maintainability.

35 changes: 35 additions & 0 deletions docs/atlas-search.md
@@ -0,0 +1,35 @@
# Atlas search investigation
There are several approaches to integrating search within [Apache Atlas](https://atlas.apache.org/ "Apache Atlas"); we describe the options below:

- Use REST API's

Directly using the Atlas APIs is quick to implement and easy to set up for administrators. Atlas uses a search engine
under the hood (embedded Solr) to perform search queries, so in theory this method should scale. The disadvantage is that
we are limited to the REST API that Atlas offers, though we could potentially add functionality via pull requests and extend
the search capabilities. The [advanced search](https://atlas.apache.org/Search-Advanced.html "Apache Atlas Advanced Search")
provides a DSL which contains basic forms of aggregation and arithmetic.

- Use Data Builder to fill Elasticsearch from Atlas

Adopting Atlas within the Data Builder to fill Elasticsearch is a relatively straightforward way of staying
compatible with the Neo4j database. The Data Builder could either pull data from Atlas or receive it via Kafka. This method
requires a setup of Elasticsearch and Airflow, which increases the amount of infrastructure and maintenance.
Another disadvantage is that with a big inflow of metadata this method might not scale as well as the other methods.

- Use underlying Solr or Elasticsearch from Apache Atlas

Within Atlas there is the possibility to open up either Solr or the experimental Elasticsearch. The search engine is
populated by JanusGraph (the behind-the-scenes graph database), so it would not be compatible with
the Data Builder setup. Adoption of such a search engine would require either new queries, some kind of transformer
within the search engine, or changes within Atlas itself.

## Discussion
Both the REST API approach and the Data Builder approach can be implemented and made configurable. Each has
its own benefits: the Data Builder provides a more fine-tuned search, whereas the Atlas REST API works out
of the box with Atlas. The last approach, using the underlying search engine from Atlas, provides direct access
to all the metadata with a decent search API. However, integration would be less straightforward, as the indexes would
differ from those of the Data Builder's search engine loader.


The focus is initially to implement the REST API approach and afterwards potentially to implement an Atlas data extractor
and importer within the Amundsen Data Builder, so that administrators have more flexibility in combining data sources.
File renamed without changes.
57 changes: 50 additions & 7 deletions search_service/api/search.py
@@ -1,8 +1,8 @@
from typing import Iterable, Any

from flask_restful import Resource, fields, marshal_with, reqparse
from flask_restful import Resource, fields, marshal_with, reqparse, marshal

from search_service.proxy import elasticsearch
from search_service.proxy import get_proxy_client

table_fields = {
"name": fields.String,
@@ -19,9 +19,52 @@
"last_updated_epoch": fields.Integer,
}

dashboard_fields = {
"dashboard_group": fields.String,
"dashboard_name": fields.String,
# description can be empty, if no description is present in DB
"description": fields.String,
"last_reload_time": fields.String,
"user_id": fields.String,
"user_name": fields.String,
"tags": fields.List(fields.String)
}

metric_fields = {
"dashboard_group": fields.String,
"dashboard_name": fields.String,
"metric_name": fields.String,
"metric_function": fields.String,
# description can be empty, if no description is present in DB
"metric_description": fields.String,
"metric_type": fields.String,
"metric_group": fields.String
}

table_result_fields = {
"result_count": fields.Integer,
"results": fields.List(fields.Nested(table_fields), default=[])
}

dashboard_result_fields = {
"result_count": fields.Integer,
"results": fields.List(fields.Nested(dashboard_fields), default=[])
}

metric_result_fields = {
"result_count": fields.Integer,
"results": fields.List(fields.Nested(metric_fields), default=[])
}

result_fields = {
"dashboards": fields.Nested(dashboard_result_fields),
"tables": fields.Nested(table_result_fields),
"metrics": fields.Nested(metric_result_fields),
}

search_results = {
"total_results": fields.Integer,
"results": fields.Nested(table_fields, default=[])
"results": fields.Nested(result_fields)
}


@@ -30,7 +73,7 @@ class SearchAPI(Resource):
Search API
"""
def __init__(self) -> None:
self.elasticsearch = elasticsearch.get_elasticsearch_proxy()
self.proxy = get_proxy_client()

self.parser = reqparse.RequestParser(bundle_errors=True)

@@ -50,7 +93,7 @@ def get(self) -> Iterable[Any]:

try:

results = self.elasticsearch.fetch_search_results(
results = self.proxy.fetch_search_results(
query_term=args['query_term'],
page_index=args['page_index']
)
@@ -68,7 +111,7 @@ class SearchFieldAPI(Resource):
Search API with explicit field
"""
def __init__(self) -> None:
self.elasticsearch = elasticsearch.get_elasticsearch_proxy()
self.proxy = get_proxy_client()

self.parser = reqparse.RequestParser(bundle_errors=True)

@@ -91,7 +134,7 @@ def get(self, *, field_name: str,
args = self.parser.parse_args(strict=True)

try:
results = self.elasticsearch.fetch_search_results_with_field(
results = self.proxy.fetch_search_results_with_field(
query_term=args.get('query_term'),
field_name=field_name,
field_value=field_value,
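To make the shape change concrete, here is a hypothetical sample of the grouped response implied by the new `result_fields`/`search_results` definitions: results are now nested per entity type instead of being a flat table list. All field values below are invented, and the `total_result_count` helper is illustrative only:

```python
# Invented sample matching the nested marshalling structure:
# total_results at the top, then per-entity groups with their own counts.
sample_response = {
    "total_results": 2,
    "results": {
        "tables": {"result_count": 1, "results": [{"name": "fact_orders"}]},
        "dashboards": {"result_count": 1,
                       "results": [{"dashboard_name": "Weekly Sales"}]},
        "metrics": {"result_count": 0, "results": []},
    },
}


def total_result_count(response: dict) -> int:
    """Sum the per-entity result counts in a grouped search response."""
    return sum(group["result_count"]
               for group in response["results"].values())
```

Clients consuming the API need to switch from iterating a flat `results` list to reading the per-type groups shown above.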
45 changes: 35 additions & 10 deletions search_service/config.py
@@ -1,28 +1,53 @@
import os

ELASTICSEARCH_ENDPOINT_KEY = 'ELASTICSEARCH_ENDPOINT'
ELASTICSEARCH_INDEX_KEY = 'ELASTICSEARCH_INDEX'
ELASTICSEARCH_AUTH_USER_KEY = 'ELASTICSEARCH_AUTH_USER'
ELASTICSEARCH_AUTH_PW_KEY = 'ELASTICSEARCH_AUTH_PW'
ELASTICSEARCH_CLIENT_KEY = 'ELASTICSEARCH_CLIENT'
SEARCH_PAGE_SIZE_KEY = 'SEARCH_PAGE_SIZE'
STATS_FEATURE_KEY = 'STATS'

PROXY_ENDPOINT = 'PROXY_ENDPOINT'
PROXY_USER = 'PROXY_USER'
PROXY_PASSWORD = 'PROXY_PASSWORD'
PROXY_CLIENT = 'PROXY_CLIENT'
PROXY_CLIENT_KEY = 'PROXY_CLIENT_KEY'
PROXY_CLIENTS = {
'ELASTICSEARCH': 'search_service.proxy.elasticsearch.ElasticsearchProxy',
'ATLAS': 'search_service.proxy.atlas.AtlasProxy'
}


class Config:
LOG_FORMAT = '%(asctime)s.%(msecs)03d [%(levelname)s] %(module)s.%(funcName)s:%(lineno)d (%(process)d:'\
'%(threadName)s) - %(message)s'
LOG_DATE_FORMAT = '%Y-%m-%dT%H:%M:%S%z'
LOG_LEVEL = 'INFO'

# Used to differentiate tables with other entities in Atlas. For more details:
# https://github.com/lyft/amundsenmetadatalibrary/blob/master/docs/proxy/atlas_proxy.md
ATLAS_TABLE_ENTITY = 'Table'

# The relationalAttribute name of Atlas Entity that identifies the database entity.
ATLAS_DB_ATTRIBUTE = 'db'

# Display name of Atlas Entities that we use for amundsen project.
# Atlas uses qualifiedName as the indexed attribute, but also supports the 'name' attribute.
ATLAS_NAME_ATTRIBUTE = 'qualifiedName'

# Config used by Elasticsearch
ELASTICSEARCH_INDEX = '_all'


class LocalConfig(Config):
DEBUG = False
TESTING = False
STATS = True
STATS = False
LOCAL_HOST = '0.0.0.0'
ELASTICSEARCH_ENDPOINT = os.environ.get('ELASTICSEARCHSERVICE',
'http://{LOCAL_HOST}:9200'.format(LOCAL_HOST=LOCAL_HOST))
ELASTICSEARCH_INDEX = 'tables_alias'
ELASTICSEARCH_AUTH_USER = 'elastic'
ELASTICSEARCH_AUTH_PW = 'elastic'
PROXY_PORT = '9200'
PROXY_ENDPOINT = os.environ.get('PROXY_ENDPOINT',
'http://{LOCAL_HOST}:{PORT}'.format(
LOCAL_HOST=LOCAL_HOST,
PORT=PROXY_PORT)
)
PROXY_CLIENT = PROXY_CLIENTS[os.environ.get('PROXY_CLIENT', 'ELASTICSEARCH')]
PROXY_CLIENT_KEY = os.environ.get('PROXY_CLIENT_KEY')
PROXY_USER = os.environ.get('CREDENTIALS_PROXY_USER', 'elastic')
PROXY_PASSWORD = os.environ.get('CREDENTIALS_PROXY_PASSWORD', 'elastic')
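The diff does not show how `get_proxy_client` turns a `PROXY_CLIENTS` path string like `'search_service.proxy.atlas.AtlasProxy'` into a class, but a common pattern for such string-keyed configuration is dynamic import. The `resolve_class` helper below is a hypothetical sketch of that pattern, demonstrated with a stdlib class since `search_service` is not importable here:

```python
import importlib


def resolve_class(path: str):
    """Import 'pkg.module.ClassName' and return the class object.

    Splits the dotted path into module and class name, imports the
    module, and looks the class up by attribute.
    """
    module_path, _, class_name = path.rpartition(".")
    return getattr(importlib.import_module(module_path), class_name)


# Demonstration with a stdlib path; the real PROXY_CLIENTS values point
# at search_service modules.
proxy_cls = resolve_class("collections.OrderedDict")
```

This keeps the config a plain string (easy to set via the `PROXY_CLIENT` environment variable) while deferring the import until a client is actually needed.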
31 changes: 31 additions & 0 deletions search_service/models/dashboard.py
@@ -0,0 +1,31 @@
from typing import Iterable


class Dashboard:
def __init__(self, *,
dashboard_group: str,
dashboard_name: str,
description: str,
last_reload_time: str,
user_id: str,
user_name: str,
tags: Iterable[str]) -> None:
self.dashboard_group = dashboard_group
self.dashboard_name = dashboard_name
self.description = description
self.last_reload_time = last_reload_time
self.user_id = user_id
self.user_name = user_name
self.tags = tags

def __repr__(self) -> str:
return 'Dashboard(dashboard_group={!r}, dashboard_name={!r}, ' \
'description={!r}, last_reload_time={!r}, user_id={!r},' \
'user_name={!r}, tags={!r})' \
.format(self.dashboard_group,
self.dashboard_name,
self.description,
self.last_reload_time,
self.user_id,
self.user_name,
self.tags)
31 changes: 31 additions & 0 deletions search_service/models/metric.py
@@ -0,0 +1,31 @@
from typing import Iterable


class Metric:
def __init__(self, *,
dashboard_group: str,
dashboard_name: str,
metric_name: str,
metric_function: str,
metric_description: str,
metric_type: str,
metric_group: str) -> None:
self.dashboard_group = dashboard_group
self.dashboard_name = dashboard_name
self.metric_name = metric_name
self.metric_function = metric_function
self.metric_description = metric_description
self.metric_type = metric_type
self.metric_group = metric_group

def __repr__(self) -> str:
return 'Metric(dashboard_group={!r}, dashboard_name={!r}, ' \
'metric_name={!r}, metric_function={!r}, metric_description={!r},' \
'metric_type={!r}, metric_group={!r})' \
.format(self.dashboard_group,
self.dashboard_name,
self.metric_name,
self.metric_function,
self.metric_description,
self.metric_type,
self.metric_group)