Initial commit for Data Hub
Kerem Sahin committed Sep 1, 2019
1 parent cfcbb56 commit 23339df
Showing 3,828 changed files with 131,600 additions and 124,321 deletions.
42 changes: 18 additions & 24 deletions .gitignore
@@ -1,29 +1,23 @@
build/
target/
repos/
tmp/
bin/
.gradle/
.settings
# Gradle & Avro
.project
.settings
.classpath
*.swp
*.jar
*.idea
.gradle
.idea
*.iml
*.class
*.ipr
*.iws
/RUNNING_PID
wherehows-etl/src/main/resources/application.properties
**/test/resources/*.properties
logs/
.DS_Store
# See https://help.github.com/ignore-files/ for more about ignoring files.
*.ipr
**/mxe

# Pegasus & Avro
**/src/mainGenerated*
**/src/testGenerated*

# Added by mp-maker
**/build
/config
*/i18n
/out

# compiled output
dist/
out/
/commit
/.vscode/
*/src/generated/
# Mac OS
**/.DS_Store
43 changes: 2 additions & 41 deletions .travis.yml
@@ -1,47 +1,8 @@
dist: trusty

sudo: required

language: java

jdk:
- oraclejdk8

env:
- DOCKER_COMPOSE_VERSION=1.22.0

services:
- docker
- elasticsearch

before_install:
# elasticsearch
- curl -O https://download.elastic.co/elasticsearch/release/org/elasticsearch/distribution/deb/elasticsearch/2.3.5/elasticsearch-2.3.5.deb && sudo dpkg -i --force-confnew elasticsearch-2.3.5.deb && sudo service elasticsearch restart

# extralibs
- wget https://github.com/ericsun2/sandbox/raw/master/extralibs/extralibs.zip
- mkdir -p wherehows-etl/extralibs; unzip extralibs.zip -d wherehows-etl/extralibs

# docker-compose
- sudo rm /usr/local/bin/docker-compose
- curl -L https://github.com/docker/compose/releases/download/${DOCKER_COMPOSE_VERSION}/docker-compose-`uname -s`-`uname -m` > docker-compose
- chmod +x docker-compose
- sudo mv docker-compose /usr/local/bin

# permanently increase upper limit for possible watches created per uid, build step exceeds default on travis
- echo fs.inotify.max_user_watches=524288 | sudo tee -a /etc/sysctl.conf && sudo sysctl -p


cache:
directories:
- $HOME/.gradle/caches/
- $HOME/.gradle/wrapper/

script:
- ./gradlew check assemble
- ./gradlew jacocoFullReport coveralls && ./gradlew emberCoverage
- (cd wherehows-docker && ./build.sh latest)
- (cd wherehows-docker && docker-compose config)

after_script:
- rm -rf $WHEREHOWS_DIR/coverage
- ./gradlew check assemble
- ./gradlew emberCoverage
64 changes: 0 additions & 64 deletions NOTICE

This file was deleted.

195 changes: 138 additions & 57 deletions README.md
@@ -1,87 +1,168 @@
# WhereHows [![Build Status](https://travis-ci.org/linkedin/WhereHows.svg?branch=master)](https://travis-ci.org/linkedin/WhereHows) [![latest](https://img.shields.io/badge/latest-1.0.0-blue.svg)](https://github.com/linkedin/WhereHows/releases) [![Gitter](https://img.shields.io/gitter/room/nwjs/nw.js.svg)](https://gitter.im/wherehows) [![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg)](https://github.com/LinkedIn/Wherehows/wiki/Contributing)
## Pre-requisites
Be sure to have a JDK installed on your machine.

```
sudo yum install java-1.8.0-openjdk-devel
```

WhereHows is a data discovery and lineage tool built at LinkedIn. It integrates with all the major data processing systems and collects both catalog and operational metadata from them.
Install docker and docker-compose.
```
Check https://www.docker.com/get-started for instructions on how to install docker-ce
```

Within the central metadata repository, WhereHows curates, associates, and surfaces the metadata information through two interfaces:
* a web application that enables data & lineage discovery, and community collaboration
* an API endpoint that empowers automation of data processes/applications
Install Chrome web browser.
```
https://www.google.com/chrome/
```

WhereHows serves as the single platform that:
* links data objects with people and processes
* enables crowdsourcing for data knowledge
* provides data governance and provenance based on ownership and lineage
## Quickstart
To start all Docker images at once, follow the instructions below.

```
cd docker/quickstart
docker-compose up
cd ../elasticsearch && bash init.sh
```

## Who Uses WhereHows?
Here is a list of companies known to use WhereHows. Let us know if we missed your company!
## Starting Kafka
Kafka, ZooKeeper and Schema Registry run in individual Docker containers, using Confluent images with their default configurations.

* [LinkedIn](http://www.linkedin.com)
* [Overstock.com](http://www.overstock.com)
* [Fitbit](http://www.fitbit.com)
* [Carbonite](https://www.carbonite.com)
```
cd docker/kafka
docker-compose up
```

## Starting MySQL
MySQL Server runs in its own Docker container. Run the commands below to start the MySQL container.
```
cd docker/mysql
docker-compose up
```
To connect to the MySQL server, use the command below:
```
docker exec -it mysql mysql -u datahub -pdatahub datahub
```

## How Is WhereHows Used?
How WhereHows is used inside LinkedIn, and other potential [use cases][USE].
## Starting ElasticSearch and Kibana
ElasticSearch and Kibana run in their own Docker containers. Run the commands below to start both containers.
```
cd docker/elasticsearch
docker-compose up
```
After the containers are initialized, create the search index by running:
```
bash init.sh
```
You can reach Kibana in your web browser at the link below:
```
http://localhost:5601
```

## Starting GMS
```
./gradlew build
./gradlew :gms:war:JettyRunWar
```

## Documentation
The detailed information can be found in the [Wiki][wiki]

## Examples in VM (Deprecated)
There is a pre-built VMware image (about 11GB) to quickly demonstrate the functionality of WhereHows. Check out the [VM Guide][VM]

### Example GMS Curl Calls

#### Create
```
curl 'http://localhost:8080/corpUsers/($params:(),name:fbar)/snapshot' -X POST -H 'X-RestLi-Method: create' -H 'X-RestLi-Protocol-Version:2.0.0' --data '{"aspects": [{"com.linkedin.identity.CorpUserInfo":{"active": true, "fullName": "Foo Bar", "email": "[email protected]"}}, {"com.linkedin.identity.CorpUserEditableInfo":{}}], "urn": "urn:li:corpuser:fbar"}' -v
curl 'http://localhost:8080/datasets/($params:(),name:x.y,origin:PROD,platform:urn%3Ali%3AdataPlatform%3Afoo)/snapshot' -X POST -H 'X-RestLi-Method: create' -H 'X-RestLi-Protocol-Version:2.0.0' --data '{"aspects":[{"com.linkedin.common.Ownership":{"owners":[{"owner":"urn:li:corpuser:ksahin","type":"DATAOWNER"}],"lastModified":{"time":0,"actor":"urn:li:corpuser:ksahin"}}},{"com.linkedin.dataset.UpstreamLineage":{"upstreams":[{"auditStamp":{"time":0,"actor":"urn:li:corpuser:ksahin"},"dataset":"urn:li:dataset:(urn:li:dataPlatform:foo,barUp,PROD)","type":"TRANSFORMED"}]}},{"com.linkedin.common.InstitutionalMemory":{"elements":[{"url":"https://www.linkedin.com","description":"Sample doc","createStamp":{"time":0,"actor":"urn:li:corpuser:ksahin"}}]}},{"com.linkedin.schema.SchemaMetadata":{"schemaName":"FooEvent","platform":"urn:li:dataPlatform:foo","version":0,"created":{"time":0,"actor":"urn:li:corpuser:ksahin"},"lastModified":{"time":0,"actor":"urn:li:corpuser:ksahin"},"hash":"","platformSchema":{"com.linkedin.schema.KafkaSchema":{"documentSchema":"{\"type\":\"record\",\"name\":\"MetadataChangeEvent\",\"namespace\":\"com.linkedin.mxe\",\"doc\":\"Kafka event for proposing a metadata change for an entity.\",\"fields\":[{\"name\":\"auditHeader\",\"type\":{\"type\":\"record\",\"name\":\"KafkaAuditHeader\",\"namespace\":\"com.linkedin.avro2pegasus.events\",\"doc\":\"Header\"}}]}"}},"fields":[{"fieldPath":"foo","description":"Bar","nativeDataType":"string","type":{"type":{"com.linkedin.schema.StringType":{}}}}]}}],"urn":"urn:li:dataset:(urn:li:dataPlatform:foo,bar,PROD)"}' -v
```
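Note that the dataset keys in these calls percent-encode the platform URN (`urn:li:dataPlatform:foo` becomes `urn%3Ali%3AdataPlatform%3Afoo`). A minimal sketch of producing that encoding with Python's standard library (the key layout is copied from the curl examples above):

```python
from urllib.parse import quote

# Percent-encode the platform URN for use inside a Rest.li complex key,
# as seen in the curl examples above.
platform_urn = "urn:li:dataPlatform:foo"
encoded = quote(platform_urn, safe="")  # every ':' becomes '%3A'

# Assemble the dataset key used in the snapshot URL (illustrative only).
dataset_key = f"($params:(),name:x.y,origin:PROD,platform:{encoded})"
print(dataset_key)
```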

## WhereHows Docker
Docker can quickly provide a configuration-free dev/production setup; check out the [Docker Getting Started Guide](https://github.com/linkedin/WhereHows/tree/master/wherehows-docker/README.md)
#### Get
```
curl -H 'X-RestLi-Protocol-Version:2.0.0' -H 'X-RestLi-Method: get' 'http://localhost:8080/corpUsers/($params:(),name:fbar)/snapshot/($params:(),aspectVersions:List((aspect:com.linkedin.identity.CorpUserInfo,version:0)))' | jq
curl -H 'X-RestLi-Protocol-Version:2.0.0' -H 'X-RestLi-Method: get' 'http://localhost:8080/datasets/($params:(),name:x.y,origin:PROD,platform:urn%3Ali%3AdataPlatform%3Afoo)/snapshot/($params:(),aspectVersions:List((aspect:com.linkedin.common.Ownership,version:0)))' | jq
```

### Get all
```
curl -H 'X-RestLi-Protocol-Version:2.0.0' -H 'X-RestLi-Method: get_all' 'http://localhost:8080/corpUsers' | jq
```

## Getting Started
New to WhereHows? Check out the [Getting Started Guide][GS]

### Preparation
First, please [setup the metadata repository][DB] in MySQL.
```
CREATE DATABASE wherehows
DEFAULT CHARACTER SET utf8
DEFAULT COLLATE utf8_general_ci;

CREATE USER 'wherehows';
SET PASSWORD FOR 'wherehows' = PASSWORD('wherehows');
GRANT ALL ON wherehows.* TO 'wherehows';
```

### Browse
```
curl "http://localhost:8080/datasets?action=browse" -d '{"path": "", "start": 0, "limit": 10}' -X POST -H 'X-RestLi-Protocol-Version: 2.0.0' | jq
```

### Search
```
curl "http://localhost:8080/corpUsers?q=search&input=foo&" -X GET -H 'X-RestLi-Protocol-Version: 2.0.0' -H 'X-RestLi-Method: finder' | jq
curl "http://localhost:8080/datasets?q=search&input=foo&" -X GET -H 'X-RestLi-Protocol-Version: 2.0.0' -H 'X-RestLi-Method: finder' | jq
```

Execute the [DDL files][DDL] to create the required repository tables in the **wherehows** database.

### Build
1. Get the source code: ```git clone https://github.com/linkedin/WhereHows.git```
2. Put a few 3rd-party jar files into the **wherehows-etl/extralibs** directory. Some of these jar files may not be available in Maven Central or Artifactory. See [the download instructions][EXJAR] for more detail. ```cd WhereHows/wherehows-etl/extralibs```
3. From the **WhereHows** root directory, build all the modules: ```./gradlew build```
4. Start the metadata ETL and API service: ```./gradlew wherehows-backend:runPlayBinary```
5. In a new terminal, start the web front-end: ```./gradlew wherehows-frontend:runPlayBinary```. The WhereHows UI is available at http://localhost:9001 by default. You can change the port number by editing the value of ```project.ext.httpPort``` in ```wherehows-frontend/build.gradle```.

### Autocomplete
```
curl "http://localhost:8080/datasets?action=autocomplete" -d '{"query": "foo", "field": "name", "limit": 10}' -X POST -H 'X-RestLi-Protocol-Version: 2.0.0' | jq
```

### Ownership
```
curl -H 'X-RestLi-Protocol-Version:2.0.0' -H 'X-RestLi-Method: get' 'http://localhost:8080/datasets/($params:(),name:x.y,origin:PROD,platform:urn%3Ali%3AdataPlatform%3Afoo)/rawOwnership/0' | jq
```

## Roadmap
Check out the current [roadmap][RM] for WhereHows.

### Schema
```
curl -H 'X-RestLi-Protocol-Version:2.0.0' -H 'X-RestLi-Method: get' 'http://localhost:8080/datasets/($params:(),name:x.y,origin:PROD,platform:urn%3Ali%3AdataPlatform%3Afoo)/schema/0' | jq
```

## Contribute
Want to contribute? Check out the [Contributors Guide][CON]

## Debugging Kafka
GMS fires a MetadataAuditEvent after a new record is created through the snapshot endpoint. You can check whether this message is fired correctly using kafkacat.
```
Install kafkacat through this link https://github.com/edenhill/kafkacat
```
To consume messages on the MetadataAuditEvent topic, run the command below. kafkacat doesn't support Avro deserialization just yet, but there is ongoing [work](https://github.com/edenhill/kafkacat/pull/151) on it.
```
kafkacat -b localhost:9092 -t MetadataAuditEvent
```

## Community
Want help? Check out the [Gitter chat room][GITTER] and [Google Groups][LIST]

## Starting Elasticsearch Indexing Job
Run the command below to start the Elasticsearch indexing job.
```
./gradlew :metadata-jobs:elasticsearch-index-job:run
```
To test the job, you should have already started Kafka, GMS, MySQL and ElasticSearch/Kibana.
After starting all the services, you can create a record in GMS through the snapshot endpoint as below.
```
curl 'http://localhost:8080/metrics/($params:(),name:a.b.c01,type:UMP)/snapshot' -X POST -H 'X-RestLi-Method: create' -H 'X-RestLi-Protocol-Version:2.0.0' --data '{"aspects": [{"com.linkedin.common.Ownership":{"owners":[{"owner":"urn:li:corpuser:ksahin","type":"DATAOWNER"}]}}], "urn": "urn:li:metric:(UMP,a.b.c01)"}' -v
```
This will fire an MAE; the indexing job reads it from Kafka and updates the search index.
You can then check whether the document is populated in the ES index with the command below.
```
curl localhost:9200/metricdocument/_search -d '{"query":{"match":{"urn":"urn:li:metric:(UMP,a.b.c01)"}}}' | jq
```
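The body passed to `_search` above is plain JSON; a small sketch of building the same match query programmatically (a hypothetical helper using only the Python standard library):

```python
import json

def urn_match_query(urn: str) -> str:
    """Build the same Elasticsearch match query used in the curl example above."""
    return json.dumps({"query": {"match": {"urn": urn}}})

body = urn_match_query("urn:li:metric:(UMP,a.b.c01)")
print(body)
```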

## Starting MetadataChangeEvent Consuming Job
Run the command below to start the MCE consuming job.
```
./gradlew :metadata-jobs:mce-consumer-job:run
```
Create your own MCEs aligned with the models in sample_MCE.dat.
Tip: one line per MCE, using Python syntax.

Then you can produce MCEs to feed your GMS.
```
cd metadata-ingestion/src
python avro_cli.py produce
```

[wiki]: https://github.com/LinkedIn/Wherehows/wiki
[GS]: https://github.com/linkedin/WhereHows/blob/master/wherehows-docs/getting-started.md
[CON]: https://github.com/linkedin/WhereHows/blob/master/wherehows-docs/contributing.md
[USE]: https://github.com/linkedin/WhereHows/blob/master/wherehows-docs/use-cases.md
[RM]: https://github.com/linkedin/WhereHows/blob/master/wherehows-docs/roadmap.md
[VM]: https://github.com/LinkedIn/Wherehows/wiki/Quick-Start-With-VM
[EXJAR]: https://github.com/linkedin/WhereHows/tree/master/wherehows-etl/extralibs
[DDL]: https://github.com/linkedin/WhereHows/tree/master/wherehows-data-model/DDL
[DB]: https://github.com/linkedin/WhereHows/blob/master/wherehows-docs/getting-started.md#database-setup
[LIST]: https://groups.google.com/forum/#!forum/wherehows
[GITTER]: https://gitter.im/wherehows
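The one-line-per-MCE format can be sketched as below. This is a hypothetical illustration: the aspect and field names are borrowed from the GMS curl examples above, not taken from the actual sample_MCE.dat schema, and the email address is a placeholder.

```python
import json

# Hypothetical sketch: emit one MCE per line, mirroring the aspect shape
# used in the GMS curl examples above (not the authoritative schema).
mces = [
    {
        "urn": "urn:li:corpuser:fbar",
        "aspects": [
            {"com.linkedin.identity.CorpUserInfo": {
                "active": True,
                "fullName": "Foo Bar",
                "email": "fbar@example.com",  # placeholder address
            }},
        ],
    },
]

with open("my_MCE.dat", "w") as f:
    for mce in mces:
        f.write(json.dumps(mce) + "\n")  # exactly one line per MCE
```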

## Starting Datahub Frontend
Run the commands below to start the datahub-frontend Play server.
```
cd datahub-frontend/run
./run-local-frontend
```
Then you can reach DataHub in your web browser at the link below:
```
http://localhost:9001
```
