This project demonstrates a near real-time data pipeline that uses Apache Pinot, Apache Superset, Apache Airflow, and Redpanda (a Kafka-compatible broker) to ingest and query dimension and fact data.
Ensure you have the following installed:
- Docker and Docker Compose
- A web browser for accessing the services
Run the following command to build and start all services:
docker compose up --build -d
- Airflow: http://localhost:8180 (login: `airflow` / `airflow`)
- Redpanda Console: http://localhost:8080
- Apache Pinot: http://localhost:9091
- Apache Superset: http://localhost:8088 (login: `admin` / `admin`)
- In Airflow, run the following DAGs once to generate dimension data:
  - `account_dim_generator`
  - `branch_dim_generator`
  - `customer_dim_generator`
  - `date_dim_generator`
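The dimension DAGs above presumably emit synthetic CSV rows for their respective tables. A minimal sketch of what a task inside `customer_dim_generator` might do — the column names and values here are assumptions for illustration, not taken from the project:

```python
import csv
import io
import random


def generate_customer_dim(n_rows=5, seed=42):
    """Generate synthetic customer dimension rows (hypothetical schema)."""
    random.seed(seed)
    first_names = ["Alice", "Bob", "Carol", "Dan"]
    last_names = ["Smith", "Jones", "Lee", "Patel"]
    rows = []
    for i in range(n_rows):
        first = random.choice(first_names)
        last = random.choice(last_names)
        rows.append({
            "customer_id": f"C{i:05d}",
            "first_name": first,
            "last_name": last,
            "email": f"{first.lower()}.{last.lower()}@example.com",
            "registration_date": f"2024-01-{(i % 28) + 1:02d}",
        })
    return rows


def rows_to_csv(rows):
    """Serialize rows to CSV text, as a generator task might before handoff."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

In the real project this logic would run inside an Airflow task (e.g. a `PythonOperator`) and write the file wherever the downstream load step expects it.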
- Run the `transaction_facts_generator` DAG once.
- Navigate to Redpanda Console and confirm the `transaction_facts` topic has been created.
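The `transaction_facts_generator` DAG presumably publishes one JSON event per transaction to the `transaction_facts` topic. A hedged sketch of building such a payload — the field names are assumptions, and the kafka-python producer call is commented out so the sketch runs without a broker:

```python
import json
import random
import uuid


def make_transaction_event(seed=None):
    """Build a single transaction-fact payload (hypothetical schema)."""
    rng = random.Random(seed)
    return {
        "transaction_id": str(uuid.UUID(int=rng.getrandbits(128))),
        "customer_id": f"C{rng.randrange(100000):05d}",
        "account_id": f"A{rng.randrange(100000):05d}",
        "branch_id": f"B{rng.randrange(100):03d}",
        "transaction_amount": round(rng.uniform(1.0, 5000.0), 2),
        "currency": rng.choice(["USD", "EUR", "NIS"]),
    }


# Kafka messages are bytes, so the event is JSON-encoded before sending.
event_bytes = json.dumps(make_transaction_event(seed=7)).encode("utf-8")

# With a broker reachable, the DAG could send it roughly like this (kafka-python):
# from kafka import KafkaProducer
# producer = KafkaProducer(bootstrap_servers="localhost:9092")
# producer.send("transaction_facts", event_bytes)
```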
- Run the following Airflow DAGs in sequence:
  - `schema_dag`
  - `table_dag`
  - `load_dag`
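`schema_dag` and `table_dag` presumably POST a schema and a table config to the Pinot controller's REST API. A minimal sketch of assembling a schema payload — the column list is an assumption mirroring the fact data, and the HTTP call is commented out so the sketch runs offline:

```python
import json


def build_pinot_schema(name="transaction_facts"):
    """Build a Pinot schema document (hypothetical columns)."""
    return {
        "schemaName": name,
        # Dimension columns: identifiers and categorical attributes.
        "dimensionFieldSpecs": [
            {"name": "transaction_id", "dataType": "STRING"},
            {"name": "customer_id", "dataType": "STRING"},
            {"name": "currency", "dataType": "STRING"},
        ],
        # Metric columns: numeric values that get aggregated.
        "metricFieldSpecs": [
            {"name": "transaction_amount", "dataType": "DOUBLE"},
        ],
    }


schema_json = json.dumps(build_pinot_schema(), indent=2)

# The DAG could then submit it to the controller, e.g. with requests:
# import requests
# requests.post("http://pinot-controller:9000/schemas",
#               data=schema_json,
#               headers={"Content-Type": "application/json"})
```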
- Navigate to Apache Pinot, click on `Tables` and `Query Console`, and verify that data has been consumed from Kafka.
- Navigate to Apache Superset at http://localhost:8088 and log in with `admin` / `admin`.
- Connect a database:
  - Go to `+ -> Data -> Connect a database`.
  - Select `Apache Pinot` as the database.
  - Fill in the SQLAlchemy URI: `pinot://pinot-broker:8099/query?server=http://pinot-controller:9000`
  - Click `TEST CONNECTION` and make sure it reports "Looks good".
  - Press `Connect`.
- Go to `SQL Lab`.
- Choose a table and view its schema with `SEE TABLE SCHEMA`.
- Run the following query in SQL Lab:

  SELECT
      tf.*,
      CONCAT(cd.first_name, ' ', cd.last_name) AS full_name,
      email, phone_number, registration_date,
      branch_name, branch_address, city, state, zipcode,
      account_type, status, balance
  FROM transaction_facts tf
  LEFT JOIN account_dim ad ON tf.account_id = ad.account_id
  LEFT JOIN customer_dim cd ON tf.customer_id = cd.customer_id
  LEFT JOIN branch_dim bd ON tf.branch_id = bd.branch_id;
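The query enriches each fact row with customer, account, and branch attributes; facts with no matching dimension row keep NULLs thanks to the LEFT JOINs. A tiny in-memory illustration of that semantics (the sample data is made up):

```python
# Hypothetical sample rows mirroring the fact and customer dimension tables.
facts = [
    {"transaction_id": "T1", "customer_id": "C1", "transaction_amount": 100.0},
    {"transaction_id": "T2", "customer_id": "C9", "transaction_amount": 50.0},  # no matching customer
]
customers = {"C1": {"first_name": "Alice", "last_name": "Smith"}}


def left_join_full_name(facts, customers):
    """Mimic LEFT JOIN customer_dim + CONCAT(first_name, ' ', last_name) AS full_name."""
    out = []
    for tf in facts:
        cd = customers.get(tf["customer_id"])  # None plays the role of a NULL dimension row
        full_name = f"{cd['first_name']} {cd['last_name']}" if cd else None
        out.append({**tf, "full_name": full_name})
    return out


joined = left_join_full_name(facts, customers)
```

Note that both fact rows survive the join; only the unmatched one carries a NULL `full_name`, which is exactly what an inner join would have dropped.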
- Save the query results as a dataset:
  - Click the `Save` dropdown > `Save as New`.
  - Name the dataset `transaction_fact_combined`.
  - Click `SAVE & EXPLORE`.
- Bar Chart:
  - X-Axis: `branch_name`
  - Metric: `transaction_amount` (SUM)
  - Row Limit: 10
  - Name the chart `Top 10 Profitable Branches` and save it to the dashboard.
- Big Number Chart:
  - Metric: `Count`
  - Name the chart `Total Records` and save it to the dashboard.
- Pie Chart (Currency Distribution):
  - Dimensions: `currency`
  - Metric: `transaction_amount`
  - Name the chart `Currency Distribution` and save it to the dashboard.
- Pie Chart (Account Type Distribution):
  - Dimensions: `account_type`
  - Metric: `transaction_amount`
  - Under `CUSTOMIZE`, enable `SHOW TOTAL` and set the currency format to `NIS`.
  - Name the chart `Account Type Distribution` and save it to the dashboard.
- Go to the dashboard, click `...` > `Edit`, and set the dashboard refresh rate to 10 seconds.
- In Airflow, run the `transaction_facts_generator` DAG to simulate new transactions.
- Navigate to Superset > Dashboard.
- Watch the values update dynamically with the new data from Kafka!
- For production, scale Kafka with more brokers to ensure reliability and high throughput.
- Find a better way to move the CSVs from Airflow to Pinot.
- Airflow: Orchestrates data pipelines (http://localhost:8180).
- Redpanda (Kafka): Real-time message broker (http://localhost:8080).
- Pinot: OLAP datastore for real-time analytics (http://localhost:9091).
- Superset: Data visualization and dashboarding (http://localhost:8088).