# Car demo data infrastructure and analytics

For analytical purposes, the data generated by the microservices is brought into the Databricks Lakehouse platform.

The microservices publish data to Pub/Sub topics. A streaming data pipeline receives the streaming events from Cloud Pub/Sub and writes them to several Databricks Delta Lake tables. The pipeline follows a medallion architecture: a Bronze table holds the raw data; basic transformations (mainly encryption of the PII columns) are applied to the raw data and written to the Silver table; aggregations over the Silver table are written to the Gold table, which is used for visualization and reporting.

![img.png](img.png)
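The sketch below illustrates the Silver and Gold steps of this medallion flow, assuming a Databricks notebook where `spark` is the built-in SparkSession. The table names match the Result section below, but the PII column names (`customer_name`, `address`), the aggregation key (`shipment_status`), the checkpoint paths, and the use of SHA-256 hashing as a stand-in for PII encryption are all hypothetical and not taken from the actual notebook.

```python
from pyspark.sql import functions as F

# Sketch only -- table names come from the Result section; column names,
# checkpoint paths, and the hashing stand-in for PII encryption are assumptions.
BRONZE = "main.car_demo_data_lake.shipping_bronze"
SILVER = "main.car_demo_data_lake.shipping_sliver"
GOLD = "main.car_demo_all_brands.shipping_data_gold"

# Bronze -> Silver: protect assumed PII columns before persisting.
silver_stream = (
    spark.readStream.table(BRONZE)
    .withColumn("customer_name", F.sha2(F.col("customer_name"), 256))
    .withColumn("address", F.sha2(F.col("address"), 256))
)
(silver_stream.writeStream
    .option("checkpointLocation", "/tmp/checkpoints/shipping_silver")  # hypothetical path
    .toTable(SILVER))

# Silver -> Gold: aggregate shipments per status for reporting.
gold_stream = (
    spark.readStream.table(SILVER)
    .groupBy("shipment_status")  # hypothetical column
    .count()
)
(gold_stream.writeStream
    .outputMode("complete")
    .option("checkpointLocation", "/tmp/checkpoints/shipping_gold")  # hypothetical path
    .toTable(GOLD))
```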

A streaming pipeline (PubSub-shipping-ingestion.py) built with Spark Structured Streaming reads shipment events from the Pub/Sub topic and writes them to the different Delta Lake tables.
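A minimal sketch of what such an ingestion query can look like, assuming the Databricks Pub/Sub connector (format `"pubsub"`, available on recent Databricks Runtime versions) and the secret scope described in the prerequisites; the subscription name, project ID, and checkpoint path are placeholders, and the real logic lives in PubSub-shipping-ingestion.py:

```python
# Sketch only; `spark` and `dbutils` are the Databricks notebook built-ins.
# Credentials come from the "gcp-pubsub" secret scope (see prerequisites).
auth_options = {
    "clientId": dbutils.secrets.get(scope="gcp-pubsub", key="client_id"),
    "clientEmail": dbutils.secrets.get(scope="gcp-pubsub", key="client_email"),
    "privateKey": dbutils.secrets.get(scope="gcp-pubsub", key="private_key"),
    "privateKeyId": dbutils.secrets.get(scope="gcp-pubsub", key="private_key_id"),
}

raw_events = (
    spark.readStream.format("pubsub")
    .option("subscriptionId", "shipping-notification-sub")  # placeholder subscription name
    .option("topicId", "shipping-notification")
    .option("projectId", "my-gcp-project")                  # placeholder project ID
    .options(**auth_options)
    .load()
)

# Land the raw events in the Bronze table.
(raw_events.writeStream
    .option("checkpointLocation", "/tmp/checkpoints/shipping_bronze")  # hypothetical path
    .toTable("main.car_demo_data_lake.shipping_bronze"))
```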

## Prerequisites

  1. A Databricks workspace with its data plane residing in the GCP account that runs Cloud Pub/Sub and the other services. Follow the (document) to create a workspace on GCP.
  2. Create a Unity Catalog metastore. Note: Unity Catalog provides centralized access control, auditing, and data discovery capabilities across Databricks workspaces, so it is used here for data governance. ![img_1.png](img_1.png)
  3. All the microservices are deployed on GKE and publish events to a Pub/Sub topic. The streaming pipeline (PubSub-shipping-ingestion.py) reads events from the shipping-notification topic through a subscription, so the topic and subscription must be created (see the sketch after this list).
  4. To connect Databricks to GCP Pub/Sub, GCP credentials are needed. Store them in a Databricks secret scope and read them as follows:
    client_id_secret = dbutils.secrets.get(scope="gcp-pubsub", key="client_id")
    client_email_secret = dbutils.secrets.get(scope="gcp-pubsub", key="client_email")
    private_key_secret = dbutils.secrets.get(scope="gcp-pubsub", key="private_key")
    private_key_id_secret = dbutils.secrets.get(scope="gcp-pubsub", key="private_key_id")
    
    These secrets come from the GCP service account JSON key.
  5. Create a small compute cluster on Databricks.
  6. For visualization, install Power BI Desktop on your machine. To connect Power BI Desktop to the Databricks cluster, follow the document.
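For prerequisite 3, the topic and subscription can also be created programmatically. A sketch using the google-cloud-pubsub client library, with a placeholder project ID and subscription name (they may already be provisioned in your environment):

```python
# Sketch: create the Pub/Sub topic and subscription used by the pipeline.
# Project ID and subscription name are placeholders.
from google.cloud import pubsub_v1

project_id = "my-gcp-project"                   # placeholder
topic_id = "shipping-notification"
subscription_id = "shipping-notification-sub"   # placeholder

publisher = pubsub_v1.PublisherClient()
subscriber = pubsub_v1.SubscriberClient()

topic_path = publisher.topic_path(project_id, topic_id)
subscription_path = subscriber.subscription_path(project_id, subscription_id)

publisher.create_topic(request={"name": topic_path})
subscriber.create_subscription(
    request={"name": subscription_path, "topic": topic_path}
)
```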

## How to run

  1. Import the PubSub-shipping-ingestion.py file into a Databricks notebook (follow the document on how to import).
  2. Attach the created cluster to the notebook.
  3. Run the notebook.

## Result

The pipeline will create three tables in Unity Catalog:

  a) main.car_demo_data_lake.shipping_bronze
  b) main.car_demo_data_lake.shipping_sliver
  c) main.car_demo_all_brands.shipping_data_gold

Note: these Delta Lake tables are created automatically.

![img_2.png](img_2.png)
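A quick way to check that the tables were populated is to query them from a notebook cell, for example:

```python
# Verify that the pipeline populated the Unity Catalog tables.
for table in [
    "main.car_demo_data_lake.shipping_bronze",
    "main.car_demo_data_lake.shipping_sliver",
    "main.car_demo_all_brands.shipping_data_gold",
]:
    print(table, spark.table(table).count())

# Peek at the aggregated rows that feed the Power BI report.
spark.table("main.car_demo_all_brands.shipping_data_gold").show(5)
```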

The Gold table is connected to Power BI Desktop for visualization. ![img_3.png](img_3.png)

## Data governance

All data assets (tables, views, and databases) are stored in Unity Catalog, so granting and revoking permissions for users and service principals is done through Unity Catalog, and data discovery is done in Unity Catalog as well. ![img_4.png](img_4.png)
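For example, read access on the Gold table could be granted to a user or service principal with a Unity Catalog GRANT statement; the principal below is a placeholder:

```python
# Grant read access on the Gold table to a placeholder principal.
spark.sql("""
    GRANT SELECT
    ON TABLE main.car_demo_all_brands.shipping_data_gold
    TO `analyst@example.com`
""")
```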