Near Real-Time End-to-End Data Warehouses (nrt-e2e-data-warehouses)

This project demonstrates a near real-time data pipeline that uses Apache Pinot, Apache Superset, Apache Airflow, and Redpanda (a Kafka-compatible streaming platform) to ingest and query dimension and fact data.


Getting Started

Prerequisites

Ensure you have the following installed:

  • Docker and Docker Compose
  • A web browser for accessing the services

Setup Instructions

Step 1: Start the Environment

Run the following command to build and start all services:

docker compose up --build -d
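
Once the containers are up, it can help to confirm the web UIs respond before moving on. A minimal sketch in Python, assuming the ports used later in this guide (Superset on 8088, the Pinot controller on 9000) and a typical Airflow port of 8080; check docker-compose.yml for the actual mappings:

  import time
  import urllib.request

  # Assumed host ports; adjust to match docker-compose.yml.
  SERVICES = {
      "Airflow": "http://localhost:8080/health",
      "Pinot Controller": "http://localhost:9000/",
      "Superset": "http://localhost:8088/health",
  }

  for name, url in SERVICES.items():
      for _ in range(10):
          try:
              with urllib.request.urlopen(url, timeout=5) as resp:
                  print(f"{name}: HTTP {resp.status}")
                  break
          except OSError:
              time.sleep(5)  # the container may still be starting
      else:
          print(f"{name}: not reachable at {url}")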

Access Services


Steps to Run the Pipeline

Step 1: Generate Dimension Data

  1. In Airflow, run the following DAGs once to generate dimension data:
    • account_dim_generator
    • branch_dim_generator
    • customer_dim_generator
    • date_dim_generator
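
These generator DAGs write synthetic dimension rows to CSV files that later steps load into Pinot. As a rough illustration only (none of these names are taken from the project's code), a dimension generator might look like the following, assuming Airflow 2.4+, the Faker library, and a shared /opt/airflow/data volume:

  # Hypothetical sketch of a dimension-generator DAG; all names illustrative.
  import csv
  from datetime import datetime

  from airflow.decorators import dag, task
  from faker import Faker

  @dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
  def customer_dim_sketch():
      @task
      def generate_customers(n: int = 1000) -> str:
          fake = Faker()
          path = "/opt/airflow/data/customer_dim.csv"  # assumed shared volume
          with open(path, "w", newline="") as f:
              writer = csv.writer(f)
              writer.writerow(["customer_id", "first_name", "last_name",
                               "email", "phone_number", "registration_date"])
              for i in range(n):
                  writer.writerow([i, fake.first_name(), fake.last_name(),
                                   fake.email(), fake.phone_number(),
                                   fake.date_this_decade().isoformat()])
          return path

      generate_customers()

  customer_dim_sketch()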

Step 2: Generate Transaction Facts

  1. Run the transaction_facts_generator DAG once.
  2. Navigate to Redpanda Console and confirm the transaction_facts topic has been created.
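
Under the hood, the generator publishes fact events to Redpanda over the Kafka API. A minimal sketch of that idea with kafka-python, assuming the broker is advertised at localhost:9092 (the actual address depends on docker-compose.yml) and illustrative field names matching the query in Step 5:

  # Hypothetical sketch of producing events to the transaction_facts topic.
  import json
  import random
  import uuid
  from datetime import datetime, timezone

  from kafka import KafkaProducer  # pip install kafka-python

  producer = KafkaProducer(
      bootstrap_servers="localhost:9092",  # assumed advertised Kafka address
      value_serializer=lambda v: json.dumps(v).encode("utf-8"),
  )

  for _ in range(100):
      event = {
          "transaction_id": str(uuid.uuid4()),
          "customer_id": random.randint(0, 999),
          "account_id": random.randint(0, 999),
          "branch_id": random.randint(0, 49),
          "transaction_amount": round(random.uniform(5, 5000), 2),
          "currency": random.choice(["USD", "EUR", "NIS"]),
          "transaction_time": datetime.now(timezone.utc).isoformat(),
      }
      producer.send("transaction_facts", event)

  producer.flush()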

Step 3: Configure Pinot

  1. Run the following Airflow DAGs in sequence:
    • schema_dag
    • table_dag
    • load_dag
  2. Navigate to Apache Pinot, click on Tables and Query Console, and verify that data has been consumed from Kafka.
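
Creating schemas and tables in Pinot is typically done through the controller's REST API (POST /schemas and POST /tables), which is presumably what these DAGs wrap. A minimal sketch, assuming the controller is reachable at localhost:9000 and that the JSON definitions live in local files with illustrative names:

  # Hypothetical sketch of registering a Pinot schema and table via REST.
  import json

  import requests  # pip install requests

  CONTROLLER = "http://localhost:9000"  # assumed controller address

  with open("transaction_facts_schema.json") as f:  # illustrative file name
      schema = json.load(f)
  requests.post(f"{CONTROLLER}/schemas", json=schema).raise_for_status()

  with open("transaction_facts_table.json") as f:  # illustrative file name
      table = json.load(f)
  requests.post(f"{CONTROLLER}/tables", json=table).raise_for_status()
  print("Schema and table registered")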

Step 4: Configure Superset

  1. Navigate to Apache Superset at http://localhost:8088.
    • Log in with: admin / admin
  2. Connect a database:
    • Go to + > Data > Connect a database.
    • Select Apache Pinot as the database.
    • Fill in the SQLAlchemy URI:
      pinot://pinot-broker:8099/query?server=http://pinot-controller:9000
      
    • Click TEST CONNECTION and ensure "Looks good".
    • Press Connect.
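
The URI above relies on the pinotdb SQLAlchemy driver, so the same connection can be sanity-checked outside Superset. A minimal sketch, assuming the broker port 8099 is published to the host:

  # Hypothetical sketch of querying Pinot directly with the pinotdb driver.
  from pinotdb import connect  # pip install pinotdb

  conn = connect(host="localhost", port=8099, path="/query/sql", scheme="http")
  cur = conn.cursor()
  cur.execute("SELECT COUNT(*) FROM transaction_facts")
  print(cur.fetchone())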

Step 5: Query the Data

  1. Go to SQL Lab in the top menu.

    • Choose a table and view its schema with SEE TABLE SCHEMA.
  2. Run the following query in SQL Lab:

    SELECT
      tf.*,
      CONCAT(cd.first_name, ' ', cd.last_name) AS full_name,
      email,
      phone_number,
      registration_date,
      branch_name,
      branch_address,
      city,
      state,
      zipcode,
      account_type,
      status,
      balance
    FROM
      transaction_facts tf
      LEFT JOIN account_dim ad ON tf.account_id = ad.account_id
      LEFT JOIN customer_dim cd ON tf.customer_id = cd.customer_id
      LEFT JOIN branch_dim bd ON tf.branch_id = bd.branch_id;
  3. Save the query results as a dataset:

    • Click the Save dropdown > Save as new.
    • Name the dataset: transaction_fact_combined.
    • Click SAVE & EXPLORE.

Step 6: Create a Superset Dashboard

Charts to Add:

  1. Bar Chart:

    • X-Axis: branch_name
    • Metrics: transaction_amount (SUM)
    • Row Limit: 10
    • Name the chart: Top 10 Profitable Branches.
    • Save it to the dashboard.
  2. Big Number Chart:

    • Metric: Count
    • Name: Total Records.
    • Save it to the dashboard.
  3. Pie Chart (Currency Distribution):

    • Dimensions: currency
    • Metrics: transaction_amount
    • Name: Currency Distribution.
    • Save it to the dashboard.
  4. Pie Chart (Account Type Distribution):

    • Dimensions: account_type
    • Metrics: transaction_amount
    • Name: Account Type Distribution.
    • Under CUSTOMIZE, enable:
      • SHOW TOTAL
      • Currency Format: NIS
    • Save it to the dashboard.

Dashboard Settings:

  1. Go to the dashboard, click ... > Edit.
  2. Set the dashboard refresh rate to 10 seconds.

Step 7: Watch Real-Time Updates

  1. In Airflow, run the transaction_facts_generator DAG to simulate new transactions.
  2. Navigate to Superset > Dashboard.
    • Watch as the values update dynamically with the new data from Kafka!
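
To keep new data flowing without clicking through the Airflow UI, the DAG can also be re-triggered from Airflow's stable REST API. A minimal sketch, assuming the webserver at localhost:8080 with basic auth enabled and admin / admin credentials (both assumptions about this particular setup):

  # Hypothetical sketch of re-triggering the generator DAG on a timer.
  import time

  import requests  # pip install requests

  AIRFLOW = "http://localhost:8080"  # assumed webserver address
  AUTH = ("admin", "admin")          # assumed credentials

  for _ in range(5):
      resp = requests.post(
          f"{AIRFLOW}/api/v1/dags/transaction_facts_generator/dagRuns",
          auth=AUTH,
          json={"conf": {}},
      )
      resp.raise_for_status()
      print("Triggered run:", resp.json()["dag_run_id"])
      time.sleep(60)  # give the 10-second dashboard refresh time to show new rows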

Production Notes

  • For production, scale Kafka with more brokers to ensure reliability and high throughput.
  • Find a more robust way to move the CSVs from Airflow into Pinot (for example, a proper batch-ingestion job rather than ad-hoc file loads).

Services Overview