
Example of continuous ingestion of public dataset snapshots into Hive and indexing into Elasticsearch


sylviaxxy/Incremental-Data-Ingestion

 
 


Incremental-Data-Ingestion

A customer approaches you and wants this public dataset (https://data.medicare.gov/Home-Health-Compare/Home-Health-Care-Agencies/6jpm-sxkc) periodically loaded into Hive and indexed into Elasticsearch, so that they can run further analytics on top of it.

What you need to do:

  • Use the Hortonworks Sandbox (http://hortonworks.com/products/sandbox/)

  • Use Elasticsearch (https://www.elastic.co/products/elasticsearch) or the whole ELK stack

  • Set up a way to schedule the ingest

  • Provide automation for the ingest itself (using a YARN queue)

  • As a result, the customer wants to

    1. see a simple Hive query showing a date histogram (e.g. daily) with the number of new entries

    2. query the Elasticsearch index on analyzed as well as non-analyzed fields

  • Add a new user with admin permissions and disallow connecting to the server as root; only admins can use sudo

  • Allow access to the ingested data only for the “hive-ingest-users” group

  • Show us cluster performance stats
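The daily histogram requirement above could be met with a query along the following lines. This is a minimal sketch, not the repository's actual script: the table name `home_health_agencies`, the column `first_seen_ts` (assumed to record when a row was first ingested), and the `RUN_HIVE` switch are hypothetical placeholders.

```shell
#!/usr/bin/env bash
# Sketch only: table name, column name, and RUN_HIVE are hypothetical.

# Print a HiveQL query that counts newly ingested rows per day.
# $1 = table name, $2 = timestamp column set when a row is first ingested.
daily_histogram_query() {
  local table="$1" ts_col="$2"
  cat <<SQL
SELECT to_date(${ts_col}) AS ingest_day,
       COUNT(*) AS new_entries
FROM ${table}
GROUP BY to_date(${ts_col})
ORDER BY ingest_day;
SQL
}

# Execute the query only when explicitly requested and the Hive CLI exists.
if [ "${RUN_HIVE:-0}" = "1" ] && command -v hive >/dev/null 2>&1; then
  hive -e "$(daily_histogram_query home_health_agencies first_seen_ts)"
fi
```

Keeping the query in a function makes it easy to reuse from a scheduled job and to inspect without a running cluster.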

OPTIONAL

  • Provide a way to transform the data after it is ingested into Hive
  • Provide a way to handle data retention, i.e. how to move older data to lower-redundancy (cheaper) storage and eventually delete it
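One possible approach to the retention item is to lower the replication factor of older HDFS directories, move them to an archival storage policy, and delete them after a second threshold. The sketch below assumes that data is laid out in per-day directories and that the cluster has DataNodes with ARCHIVE storage configured (required by the `COLD` policy); the function names, thresholds, and replication factor are illustrative and do not come from the repository.

```shell
#!/usr/bin/env bash
# Sketch only: thresholds, replication factor, and function names are hypothetical.

# Decide what to do with a directory based on its age in days.
# $1 = age in days, $2 = archive threshold, $3 = delete threshold.
# Prints one of: keep, archive, delete.
retention_action() {
  local age_days="$1" archive_after="$2" delete_after="$3"
  if [ "$age_days" -ge "$delete_after" ]; then
    echo delete
  elif [ "$age_days" -ge "$archive_after" ]; then
    echo archive
  else
    echo keep
  fi
}

# Apply the chosen action to one HDFS directory.
# $1 = HDFS path, $2 = action printed by retention_action.
apply_retention() {
  local path="$1" action="$2"
  case "$action" in
    archive)
      hdfs dfs -setrep 2 "$path"                                   # fewer replicas
      hdfs storagepolicies -setStoragePolicy -path "$path" -policy COLD
      hdfs mover -p "$path"                                        # migrate existing blocks
      ;;
    delete)
      hdfs dfs -rm -r -skipTrash "$path"
      ;;
    keep)
      : # nothing to do
      ;;
  esac
}
```

For example, `apply_retention /data/ingest/dt=2016-01-01 "$(retention_action 45 30 90)"` would archive a 45-day-old directory under a 30/90-day policy; the path shown is made up.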

Solution Screenshots

Architecture

Solution Architecture

Upsert Algorithm Used for Incremental Updates on the Hive Table - Four-Step Strategy for Incremental Updates in Apache Hive on Hadoop
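The four-step strategy named above is commonly described as ingest, reconcile, compact, and purge. The sketch below illustrates only the reconcile and compact steps, under stated assumptions: a base table and an incremental table with identical schemas, a key column, and a last-modified column. All table and column names here are hypothetical, and the repository's own scripts may differ.

```shell
#!/usr/bin/env bash
# Sketch only: all table, view, and column names are hypothetical.

# Print HiveQL that keeps the newest version of each record (reconcile)
# and materializes the result into a reporting table (compact).
# $1 = base table, $2 = incremental table, $3 = key column, $4 = last-modified column.
reconcile_and_compact_sql() {
  local base="$1" incr="$2" key="$3" ts="$4"
  cat <<SQL
DROP VIEW IF EXISTS reconcile_view;
CREATE VIEW reconcile_view AS
SELECT t1.*
FROM (SELECT * FROM ${base} UNION ALL SELECT * FROM ${incr}) t1
JOIN (
  SELECT ${key}, MAX(${ts}) AS max_ts
  FROM (SELECT * FROM ${base} UNION ALL SELECT * FROM ${incr}) t2
  GROUP BY ${key}
) s
ON t1.${key} = s.${key} AND t1.${ts} = s.max_ts;

DROP TABLE IF EXISTS reporting_table;
CREATE TABLE reporting_table AS SELECT * FROM reconcile_view;
SQL
}

# Execute against Hive only when explicitly requested and the CLI exists.
if [ "${RUN_HIVE:-0}" = "1" ] && command -v hive >/dev/null 2>&1; then
  hive -e "$(reconcile_and_compact_sql base_table incremental_table agency_id modified_ts)"
fi
```

After compaction, the purge step would typically replace the base table's contents with the reporting table and clear the incremental table so the next run starts from a clean delta.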

Data Ingestion - Azkaban - Scheduling Jobs

Azkaban Project / Flows

Executing Ingestion Process

Time Statistics - Ingestion Process

Elasticsearch - Kibana

Elasticsearch Table

Splunk - Records Logs

Splunk


Cluster Performance Metrics

Ambari - Cluster Performance

Ambari - HDFS Performance

Ambari - YARN Performance
