# Incremental-Data-Ingestion

A customer approaches you and wants this public dataset (https://data.medicare.gov/Home-Health-Compare/Home-Health-Care-Agencies/6jpm-sxkc) periodically loaded into Hive and indexed into Elasticsearch, so they can run further analytics on top of it.

What you need to do:

- Use http://hortonworks.com/products/sandbox/
- Use https://www.elastic.co/products/elasticsearch or the whole ELK stack
- Set up some way of scheduling the ingest
- Provide automation for the ingest itself (using a YARN queue)
- As a result, the customer wants to:
  1. see a simple query in Hive showing a date histogram (e.g. daily) with the number of new entries
  2. query the Elasticsearch index on analyzed as well as non-analyzed fields
- Add a new user with admin permissions and ban connecting to the server as root; only admins can use sudo
- Allow access to the ingested data only for the "hive-ingest-users" group
- Show us cluster performance stats
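The daily date histogram in requirement 1 boils down to a GROUP BY on the ingest date. The aggregation can be sketched in Python as follows; the `ingest_date` column, the sample rows, and the `agencies` table name in the comment are assumptions, not taken from the dataset:

```python
from collections import Counter
from datetime import date

# Hypothetical sample of ingested rows; "ingest_date" is an assumed
# column recording when each entry landed in Hive.
rows = [
    {"provider_id": "017000", "ingest_date": date(2016, 5, 1)},
    {"provider_id": "017001", "ingest_date": date(2016, 5, 1)},
    {"provider_id": "017009", "ingest_date": date(2016, 5, 2)},
]

# Equivalent of the HiveQL:
#   SELECT ingest_date, COUNT(*) AS new_entries
#   FROM agencies GROUP BY ingest_date ORDER BY ingest_date;
histogram = Counter(row["ingest_date"] for row in rows)
for day in sorted(histogram):
    print(day.isoformat(), histogram[day])
```

Running the equivalent query in Hive over the partitioned ingest table produces the same per-day counts.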

OPTIONAL

- Provide a way of doing transformations after the ingest into Hive
- Provide a way of doing data retention, i.e. how to move older data to lower-redundancy (cheaper) storage and eventually delete it
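The retention item can be treated as a partition-ageing policy: partitions past one cutoff get a cheaper HDFS storage policy (e.g. via `hdfs storagepolicies -setStoragePolicy`), and partitions past a second cutoff are dropped with `ALTER TABLE ... DROP PARTITION`. A minimal sketch of the selection logic, with both thresholds purely illustrative:

```python
from datetime import date, timedelta

def classify_partitions(partition_dates, today, archive_after=30, delete_after=365):
    """Split daily partitions into keep / archive / delete buckets.

    "archive" partitions would be moved to lower-redundancy storage
    (e.g. set to the COLD HDFS storage policy); "delete" partitions
    would be dropped entirely. Thresholds are assumptions.
    """
    archive_cutoff = today - timedelta(days=archive_after)
    delete_cutoff = today - timedelta(days=delete_after)
    keep = [d for d in partition_dates if d >= archive_cutoff]
    archive = [d for d in partition_dates if delete_cutoff <= d < archive_cutoff]
    delete = [d for d in partition_dates if d < delete_cutoff]
    return keep, archive, delete

keep, archive, delete = classify_partitions(
    [date(2016, 5, 31), date(2016, 4, 1), date(2015, 1, 1)],
    today=date(2016, 6, 1),
)
```

A scheduled job would then apply the storage-policy change to the `archive` bucket and drop the `delete` bucket.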

## Solution Screenshots

### Architecture

Solution Architecture

### Upsert Algorithm Used for Incremental Updates on the Hive Table - Four-Step Strategy for Incremental Updates in Apache Hive on Hadoop
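The four-step strategy (Ingest, Reconcile, Compact, Purge) reduces to one idea: incremental rows shadow existing rows with the same key. A minimal Python sketch of the reconcile step, with `provider_id` assumed as the primary key and the sample values invented for illustration:

```python
def reconcile(base_rows, incremental_rows, key="provider_id"):
    """Step 2 of the four-step Hive upsert (Ingest, Reconcile,
    Compact, Purge): union the base table with the incremental
    batch, keeping the incremental row wherever keys collide.
    Steps 3-4 would materialize this result as the new base
    table and drop the staging data."""
    merged = {row[key]: row for row in base_rows}
    merged.update({row[key]: row for row in incremental_rows})
    return sorted(merged.values(), key=lambda row: row[key])

base = [
    {"provider_id": "017000", "quality_rating": 3.0},
    {"provider_id": "017001", "quality_rating": 4.0},
]
incremental = [
    {"provider_id": "017001", "quality_rating": 4.5},  # updated row
    {"provider_id": "017002", "quality_rating": 2.5},  # new row
]
result = reconcile(base, incremental)
```

In Hive the same effect is achieved with a view over the base and staging tables that picks the newest row per key, which is then compacted into the final table.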

### Data Ingestion - Azkaban - Scheduling Jobs

Azkaban Project / Flows

Executing Ingestion Process

Time Statistics - Ingestion Process
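An Azkaban flow needs little more than one job-properties file per step, with the schedule itself configured in the Azkaban web UI. A minimal sketch of a two-step flow; the file names, script names, and dependency layout are assumptions:

```
# ingest.job -- hypothetical first step: pull the dataset into Hive
type=command
command=bash ingest.sh

# index.job -- hypothetical second step: runs after ingest completes
type=command
command=bash index_to_elasticsearch.sh
dependencies=ingest
```

The `dependencies` property is what turns the individual jobs into an ordered flow.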

### Elasticsearch - Kibana

Elasticsearch Table
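The two query styles from requirement 2 differ only in the field type they target: `match` runs the query text through the field's analyzer, while `term` does an exact lookup and fits non-analyzed (keyword) fields. Minimal query bodies, with the `agencies` index and both field names assumed for illustration:

```python
import json

# Analyzed field: full-text search; the query string is analyzed the
# same way the assumed "provider_name" field was at index time.
analyzed_query = {"query": {"match": {"provider_name": "home health"}}}

# Non-analyzed field: exact term lookup against an assumed unanalyzed
# "state" field -- no analysis is applied to the query value.
non_analyzed_query = {"query": {"term": {"state": "AL"}}}

# Either body would be sent to the index's _search endpoint, e.g.
#   curl -XPOST 'localhost:9200/agencies/_search' -d '<body>'
print(json.dumps(analyzed_query))
print(json.dumps(non_analyzed_query))
```

The same distinction is visible in Kibana: analyzed fields match on individual tokens, non-analyzed fields only on the full stored value.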

### Splunk - Records Logs

Splunk

Splunk

### Cluster Performance Metrics

Ambari - Cluster Performance

Ambari - HDFS Performance

Ambari - YARN Performance