This project builds an end-to-end data streaming pipeline using an API, Apache Kafka, Spark, and MongoDB as a document store, with the results visualised on a Streamlit dashboard. Everything is packaged in a Docker environment, with each component running in its own container.
A dataset from [Kaggle](https://www.kaggle.com/datasets/carrie1/ecommerce-data) was used. The dataset, provided in CSV format, was first transformed into JSON. A Python script (the input client) ingests the data into the API, and Postman was used for testing. Kafka, Spark, and MongoDB are each hosted in a Docker container. The API takes the data and writes it into a Kafka topic, where messages are buffered and then processed with Spark. A Jupyter notebook was created for Spark to listen to Kafka, take data out of the topic, process it, and store it in MongoDB. Streamlit is used to read the data from MongoDB and visualise it.
The data was downloaded and transformed into the JSON data frame format using Python functions in transformer.py, and the transformed data is written to an output.txt file.
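Below is a minimal sketch of that transformation step, assuming pandas is used; the file names (data.csv, output.txt) and the JSON-lines output format are assumptions, and the column layout comes from the Kaggle dataset.

```python
# transformer sketch -- illustrative only; the actual logic lives in transformer.py
import pandas as pd

def csv_to_json_lines(csv_path: str = "data.csv", out_path: str = "output.txt") -> None:
    """Read the e-commerce CSV and write one JSON document per line."""
    # The Kaggle file is commonly read with a Latin-1 style encoding.
    df = pd.read_csv(csv_path, encoding="ISO-8859-1")
    # orient="records" + lines=True gives one standalone JSON object per row.
    df.to_json(out_path, orient="records", lines=True, date_format="iso")

if __name__ == "__main__":
    csv_to_json_lines()
```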
Two API endpoints were created. A root endpoint ("hello world") was created first to check that the API actually works, followed by GET and POST endpoints that put the data into the invoice-item part of the API.
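A minimal sketch of such an API is shown below, assuming FastAPI is the framework; the endpoint path /invoiceitem and the field names (taken from the Kaggle dataset) are assumptions and may differ from the actual implementation.

```python
# api sketch -- not the project's actual code, a hedged illustration only
from fastapi import FastAPI, status
from pydantic import BaseModel

app = FastAPI()

class InvoiceItem(BaseModel):
    # Field names follow the Kaggle e-commerce dataset; adjust to the real schema.
    InvoiceNo: str
    StockCode: str
    Description: str
    Quantity: int
    InvoiceDate: str
    UnitPrice: float
    CustomerID: str
    Country: str

@app.get("/")
def root():
    # Root endpoint used only to verify that the API is reachable.
    return {"message": "hello world"}

@app.post("/invoiceitem", status_code=status.HTTP_201_CREATED)
def post_invoice_item(item: InvoiceItem):
    # In the full pipeline this handler forwards the item to a Kafka topic.
    return item
```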
Postman was used for testing the API: data was entered into Postman and the request was sent. A 201 response means the API is working.
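The same check can also be scripted, for example with the requests library; the host, port, and endpoint path below are assumptions and should match the running API, and the payload is a sample row from the dataset.

```python
# scripted version of the Postman check (illustrative)
import requests

payload = {
    "InvoiceNo": "536365",
    "StockCode": "85123A",
    "Description": "WHITE HANGING HEART T-LIGHT HOLDER",
    "Quantity": 6,
    "InvoiceDate": "2010-12-01 08:26:00",
    "UnitPrice": 2.55,
    "CustomerID": "17850",
    "Country": "United Kingdom",
}

# Host/port and endpoint path are assumptions; match them to the running container.
response = requests.post("http://localhost:80/invoiceitem", json=payload)
print(response.status_code)   # expect 201 when the item is accepted
```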
A docker-compose file for Kafka was created and run in the Docker network. A Kafka producer, a consumer, and a local ingestion topic were created and enabled. The Kafka producer was used to feed the topic and was tested with Postman, where the consumer received the data. (Screenshot: testing Kafka with Postman, Kafka consumer.)
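A hedged sketch of that producer/consumer round trip is given below, assuming the kafka-python package, a broker reachable at localhost:9092, and a topic named "ingestion-topic" (all assumptions).

```python
# kafka round-trip sketch -- illustrative only
import json
from kafka import KafkaProducer, KafkaConsumer

TOPIC = "ingestion-topic"          # assumed topic name

def produce_one(message: dict) -> None:
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send(TOPIC, message)  # the message is buffered in the topic
    producer.flush()

def consume_forever() -> None:
    consumer = KafkaConsumer(
        TOPIC,
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
        auto_offset_reset="earliest",
    )
    for record in consumer:        # each record is one buffered invoice item
        print(record.value)
```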
A Dockerfile and a requirements.txt file were created to build the API image inside the Docker environment. The local consumer topic was started, a request was sent from Postman again, and the invoice data was delivered successfully. (Screenshot: testing api-ingest with Postman.)
A docker-compose file that combines Kafka and Spark (with a Jupyter notebook interface) was created so that they run on a single network. The Jupyter notebook was used to stream data from Kafka into Spark: a Spark session loads the data, and Postman was again used to test the API.
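The notebook code might look roughly like the following, assuming the Spark-Kafka connector is available and the broker is reachable as kafka:9092 inside the compose network; the topic name is again an assumption.

```python
# Spark Structured Streaming sketch -- reading the Kafka topic from the notebook
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-to-spark").getOrCreate()

# Subscribe to the ingestion topic; Kafka delivers key/value as byte columns.
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "ingestion-topic")
    .option("startingOffsets", "earliest")
    .load()
)

# Cast the raw bytes to strings so the JSON payload can be parsed downstream.
messages = raw.select(col("value").cast("string").alias("json_value"))

# Print arriving messages to the notebook console to confirm the stream works.
query = messages.writeStream.format("console").start()
```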
A docker-compose file that combines Kafka, Spark, and MongoDB was created to run on a single network, so that each message can be sent to MongoDB. MongoDB was chosen because it is a widely used document store and works naturally with JSON documents.
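One way to send each micro-batch to MongoDB is with foreachBatch and pymongo, as sketched below; it builds on the `messages` DataFrame from the previous sketch, and the MongoDB hostname, database name, and collection name are assumptions.

```python
# Spark-to-MongoDB sketch -- writes each micro-batch as documents (illustrative)
from pymongo import MongoClient

def write_batch_to_mongo(batch_df, batch_id):
    # Convert the micro-batch rows to plain dicts and insert them as documents.
    docs = [row.asDict() for row in batch_df.collect()]
    if docs:
        client = MongoClient("mongodb://mongo:27017")   # assumed container hostname
        client["ecommerce"]["invoices"].insert_many(docs)
        client.close()

# `messages` is the streaming DataFrame from the previous sketch.
query = (
    messages.writeStream
    .foreachBatch(write_batch_to_mongo)
    .start()
)
query.awaitTermination()
```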