Building an end-to-end ETL data streaming pipeline with Apache Kafka, Spark and MongoDB

This project builds an end-to-end data streaming pipeline using an ingestion API, Apache Kafka, Spark, and MongoDB as a document store, with the results visualised in a Streamlit dashboard. Everything is packaged in a Docker environment, with each service running in its own container.

Introduction

A dataset from [Kaggle](https://www.kaggle.com/datasets/carrie1/ecommerce-data) was used. The dataset, in CSV format, was first transformed into JSON. A Python script (the input client) was used to ingest the data into the API, and Postman was used for testing. Kafka, Spark and MongoDB were each hosted in Docker containers. The API takes the data and writes it into a Kafka topic, where messages are buffered and then processed with Spark. A Jupyter notebook was created for Spark to listen to Kafka, take data out of Kafka, process it and store it in MongoDB. Streamlit was used to read the data back from MongoDB and visualize it.

Getting started

The data was downloaded and transformed into the JSON data frame format using Python functions in transformer.py, and the transformed data is written to an output.txt file.
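
A minimal sketch of that transformation, assuming pandas and hypothetical file names (data.csv for the downloaded CSV, output.txt for the result):

```python
import json

import pandas as pd

# Hypothetical file names; adjust to the actual download and output paths.
# The Kaggle CSV is assumed to use a Latin-1 style encoding.
df = pd.read_csv("data.csv", encoding="ISO-8859-1")

# Write each row as one JSON document per line so the input client
# can send records to the API one at a time.
with open("output.txt", "w") as out:
    for record in df.to_dict(orient="records"):
        out.write(json.dumps(record, default=str) + "\n")
```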

API Ingest

Two API endpoints were created. A root endpoint (hello world) was created first to check that the API actually works, followed by GET and POST endpoints to write the data into the invoice item route.
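
A minimal sketch of those two endpoints, assuming FastAPI and a hypothetical /invoiceitem route (the actual framework, route and field names may differ):

```python
from fastapi import FastAPI, status
from pydantic import BaseModel

app = FastAPI()


class InvoiceItem(BaseModel):
    # Hypothetical fields modelled on the e-commerce dataset columns.
    InvoiceNo: str
    StockCode: str
    Description: str
    Quantity: int
    InvoiceDate: str
    UnitPrice: float
    CustomerID: str
    Country: str


@app.get("/")
async def root():
    # Root endpoint used to check that the API is up.
    return {"message": "hello world"}


@app.post("/invoiceitem", status_code=status.HTTP_201_CREATED)
async def post_invoice_item(item: InvoiceItem):
    # In the full pipeline this is where the item is handed to the Kafka producer.
    return item
```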

Testing

Postman was used to test the API. Sample data was entered into Postman and a request was sent; a 201 response means the API is working.
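
The same check can be scripted; a sketch using the requests library, assuming the API is exposed on localhost port 80 and the hypothetical /invoiceitem route from the sketch above:

```python
import requests

# Hypothetical sample record and endpoint; adjust host/port to the running container.
sample = {
    "InvoiceNo": "536365",
    "StockCode": "85123A",
    "Description": "WHITE HANGING HEART T-LIGHT HOLDER",
    "Quantity": 6,
    "InvoiceDate": "12/1/2010 8:26",
    "UnitPrice": 2.55,
    "CustomerID": "17850",
    "Country": "United Kingdom",
}

response = requests.post("http://localhost:80/invoiceitem", json=sample)
print(response.status_code)  # 201 means the item was accepted
```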

Starting up Kafka

A docker-compose file for Kafka was created and run in the Docker network. Kafka producer, consumer and local ingestion topics were created and enabled. The Kafka producer was used to feed the topics, and the setup was tested with Postman, where the consumer received the data.
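
A minimal sketch of how the API could produce each incoming item to the ingestion topic, assuming the kafka-python client and hypothetical broker/topic names that depend on the docker-compose setup:

```python
import json

from kafka import KafkaProducer

# Hypothetical broker address and topic name; both depend on the docker-compose network.
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)


def send_invoice_item(item: dict) -> None:
    # Buffer the message in Kafka so Spark can consume it later.
    producer.send("ingestion-topic", value=item)
    producer.flush()
```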

Ingesting and deploying the API in a Docker container

A Dockerfile and requirements.txt file were created to build the API image inside the Docker environment. The local consumer topic was started, a request was sent from Postman again, and the invoice data was delivered and processed successfully.

Setting up Apache Spark to connect to Kafka (Apache Spark config)

A docker-compose file that combines Kafka and Spark (with a Jupyter notebook interface) was created to run in a single network. A Jupyter notebook was used to stream data from Kafka into Spark: the Spark session was used to load the data, and Postman was again used to test the API.
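
A minimal sketch of the notebook's streaming read, assuming PySpark Structured Streaming, the spark-sql-kafka connector on the classpath, and the hypothetical broker/topic names used above:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import DoubleType, IntegerType, StringType, StructType

spark = SparkSession.builder.appName("kafka-to-mongo").getOrCreate()

# Hypothetical schema matching the invoice item JSON sent by the API.
schema = (StructType()
          .add("InvoiceNo", StringType())
          .add("StockCode", StringType())
          .add("Description", StringType())
          .add("Quantity", IntegerType())
          .add("InvoiceDate", StringType())
          .add("UnitPrice", DoubleType())
          .add("CustomerID", StringType())
          .add("Country", StringType()))

# Read the raw Kafka messages and parse the JSON payload in the value column.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "kafka:9092")  # hypothetical broker address
       .option("subscribe", "ingestion-topic")
       .option("startingOffsets", "earliest")
       .load())

invoices = (raw.selectExpr("CAST(value AS STRING) AS json")
            .select(from_json(col("json"), schema).alias("data"))
            .select("data.*"))
```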

Setting up MongoDB

A docker-compose file that combines Kafka, Spark and MongoDB was created to run on a single network, so that each message can be sent on to MongoDB. MongoDB was chosen because it is a widely used document store and works naturally with JSON documents.
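
One way to send each message to MongoDB is a foreachBatch sink on the invoice stream from the Spark sketch above, using pymongo; the database, collection and hostname below are assumptions:

```python
from pymongo import MongoClient


def write_to_mongo(batch_df, batch_id):
    # Convert the micro-batch to plain dicts and insert them into MongoDB.
    records = [row.asDict() for row in batch_df.collect()]
    if records:
        client = MongoClient("mongodb://mongo:27017")  # hypothetical container hostname
        client["ecommerce"]["invoices"].insert_many(records)
        client.close()


# Attach the sink to the parsed invoice stream and keep the query running.
query = invoices.writeStream.foreachBatch(write_to_mongo).start()
query.awaitTermination()
```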
