A customer approaches you and wants this public dataset (https://data.medicare.gov/Home-Health-Compare/Home-Health-Care-Agencies/6jpm-sxkc) periodically uploaded into Hive and indexed into Elasticsearch, so that they can run further analytics on top of it.
What you need to do:
- Use Elasticsearch (https://www.elastic.co/products/elasticsearch) or the whole ELK stack
- Set up some way of scheduling the ingest
- Provide automation for the ingest itself (using a YARN queue)
- As a result, the customer wants to:
  - see a simple Hive query showing a date histogram (e.g. daily) of the number of new entries
  - query the Elasticsearch index on analyzed as well as non-analyzed fields
- Add a new user with admin permissions and ban connecting to the server as root; only admins may use sudo
- Allow access to the ingested data only to the “hive-ingest-users” group
- Show us cluster performance stats
- Provide a way of doing transformations after the ingest into Hive
- Provide a way of handling data retention, i.e. how to move older data to lower-redundancy (cheaper) storage and eventually delete it
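The scheduling and YARN-queue requirements above can be covered by an ordinary cron entry that submits the ingest job to a dedicated queue. A minimal sketch; the script path, log location, queue name, and schedule are all assumptions, not part of the task:

```shell
# Crontab entry: run the ingest nightly at 02:00 (path and log file are assumptions)
0 2 * * * /opt/ingest/run_ingest.sh >> /var/log/hhc_ingest.log 2>&1

# Inside run_ingest.sh, submit the job to a dedicated YARN queue,
# e.g. when importing with Sqoop (queue name "ingest" is an assumption):
#   sqoop import -Dmapreduce.job.queuename=ingest ... --hive-import
```

Any scheduler works here (cron, Oozie, Airflow); the essential point is that the job is pinned to its own YARN queue so ingest load can be capped independently of other cluster work.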
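The daily-histogram requirement is a plain GROUP BY in Hive. A sketch assuming the data lands in a table named hhc_agencies with an ingest_date column stamped at load time (both names are assumptions):

```sql
-- Daily count of newly ingested entries (table/column names are assumptions)
SELECT ingest_date,
       COUNT(*) AS new_entries
FROM hhc_agencies
GROUP BY ingest_date
ORDER BY ingest_date;
```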
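For the Elasticsearch side, an analyzed field is queried with a full-text query and a non-analyzed field with an exact-match term query. A sketch assuming Elasticsearch 5+ default dynamic mappings (where each text field gets a non-analyzed .keyword sub-field); the index and field names are assumptions:

```shell
# Full-text (analyzed) query on a text field
curl -s 'localhost:9200/hhc/_search' -H 'Content-Type: application/json' -d '
{ "query": { "match": { "agency_name": "home health" } } }'

# Exact match on the non-analyzed keyword sub-field
curl -s 'localhost:9200/hhc/_search' -H 'Content-Type: application/json' -d '
{ "query": { "term": { "state.keyword": "TX" } } }'
```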
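The user-hardening requirement (admin user, no root logins, sudo restricted to admins) comes down to a few standard steps on a RHEL-family host. A sketch; the user name "hiveadmin" is an assumption:

```shell
# Create an admin user and add it to the sudo-capable group ("wheel" on RHEL-family systems)
useradd -m hiveadmin && passwd hiveadmin
usermod -aG wheel hiveadmin

# /etc/ssh/sshd_config — forbid direct root logins, then restart sshd:
#   PermitRootLogin no

# /etc/sudoers (edit with visudo) — only members of wheel may use sudo:
#   %wheel ALL=(ALL) ALL
```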
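Restricting the ingested data to the “hive-ingest-users” group can be done at the HDFS layer with plain POSIX permissions (or centrally with a tool such as Apache Ranger). A sketch; the warehouse path is an assumption:

```shell
# Group-only access to the ingest database directory; no world access
hdfs dfs -chgrp -R hive-ingest-users /apps/hive/warehouse/hhc.db
hdfs dfs -chmod -R 770 /apps/hive/warehouse/hhc.db
```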
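Post-ingest transformation can be as simple as a CREATE TABLE AS SELECT run after each load. A sketch with assumed table and column names:

```sql
-- Derived table built from the raw ingest (names are assumptions)
CREATE TABLE hhc_by_state AS
SELECT state, COUNT(*) AS agencies
FROM hhc_agencies
GROUP BY state;
```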
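The retention policy itself is just date arithmetic over daily partitions: keep recent ones hot, move middle-aged ones to lower-redundancy storage, delete the oldest. A minimal Python sketch of that decision; the 90/365-day thresholds are assumptions:

```python
from datetime import date

def classify_partition(ingest_date: date, today: date,
                       archive_after_days: int = 90,
                       delete_after_days: int = 365) -> str:
    """Decide what to do with a daily partition: keep it on hot storage,
    move it to lower-redundancy (COLD) storage, or delete it outright.
    Thresholds are illustrative assumptions."""
    age_days = (today - ingest_date).days
    if age_days >= delete_after_days:
        return "delete"
    if age_days >= archive_after_days:
        return "archive"
    return "keep"
```

On HDFS the "archive" action maps to `hdfs storagepolicies -setStoragePolicy -path <partition> -policy COLD` followed by `hdfs mover -p <partition>`, and "delete" to dropping the old Hive partitions (e.g. `ALTER TABLE ... DROP PARTITION (...)`).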
Upsert algorithm used for incremental updates of the Hive table: the four-step strategy for incremental updates in Apache Hive on Hadoop.
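The four steps of that strategy (Ingest, Reconcile, Compact, Purge) can be sketched in HiveQL as follows; the table and column names (base_table, incremental_table, id, modified_date) are assumptions:

```sql
-- Step 1 (Ingest): load the new extract into incremental_table (e.g. via Sqoop).

-- Step 2 (Reconcile): keep the newest row per key across base + incremental.
DROP VIEW IF EXISTS reconcile_view;
CREATE VIEW reconcile_view AS
SELECT t1.*
FROM (SELECT * FROM base_table
      UNION ALL
      SELECT * FROM incremental_table) t1
JOIN (SELECT id, MAX(modified_date) AS max_modified
      FROM (SELECT * FROM base_table
            UNION ALL
            SELECT * FROM incremental_table) t2
      GROUP BY id) s
  ON t1.id = s.id AND t1.modified_date = s.max_modified;

-- Step 3 (Compact): materialize the reconciled view.
DROP TABLE IF EXISTS reporting_table;
CREATE TABLE reporting_table AS SELECT * FROM reconcile_view;

-- Step 4 (Purge): overwrite the base table, then empty the incremental table
-- so the next cycle starts clean.
INSERT OVERWRITE TABLE base_table SELECT * FROM reporting_table;
```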