This project builds an end-to-end data streaming pipeline using an API, Apache Kafka, Spark, and MongoDB as a document store, with the results visualised on a Streamlit dashboard. Everything is packaged in a Docker environment, with each component running in its own container.
A dataset from [Kaggle](https://www.kaggle.com/datasets/carrie1/ecommerce-data) was used. The dataset, provided in CSV format, was first transformed into JSON. A Python script (the input client) ingests the data into the API, and Postman was used for testing. Kafka, Spark, and MongoDB are each hosted in a Docker container. The API takes the data and writes it into a Kafka topic, where messages are buffered and then processed with Spark. A Jupyter notebook was created for Spark to listen to Kafka, take data out of the topic, process it, and store it in MongoDB. Streamlit is used to read the data from MongoDB and visualise it.
The data was downloaded and transformed into the JSON data frame format using Python functions in transformer.py, and the transformed data is written to an output.txt file.
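Below is a minimal sketch of that transformation step, assuming pandas is used; the file names (data.csv, output.txt) and the JSON-lines output format are assumptions, and the column layout comes from the Kaggle dataset.

```python
# transformer sketch -- illustrative only; the actual logic lives in transformer.py
import pandas as pd

def csv_to_json_lines(csv_path: str = "data.csv", out_path: str = "output.txt") -> None:
    """Read the e-commerce CSV and write one JSON document per line."""
    # The Kaggle file is commonly read with a Latin-1 style encoding.
    df = pd.read_csv(csv_path, encoding="ISO-8859-1")
    # orient="records" + lines=True gives one standalone JSON object per row.
    df.to_json(out_path, orient="records", lines=True, date_format="iso")

if __name__ == "__main__":
    csv_to_json_lines()
```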
Two API endpoints were created. A root endpoint ("hello world") was created first to check that the API actually works, followed by GET and POST endpoints that put the data into the invoice-item part of the API.
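A minimal sketch of such an API is shown below, assuming FastAPI is the framework; the endpoint path /invoiceitem and the field names (taken from the Kaggle dataset) are assumptions and may differ from the actual implementation.

```python
# api sketch -- not the project's actual code, a hedged illustration only
from fastapi import FastAPI, status
from pydantic import BaseModel

app = FastAPI()

class InvoiceItem(BaseModel):
    # Field names follow the Kaggle e-commerce dataset; adjust to the real schema.
    InvoiceNo: str
    StockCode: str
    Description: str
    Quantity: int
    InvoiceDate: str
    UnitPrice: float
    CustomerID: str
    Country: str

@app.get("/")
def root():
    # Root endpoint used only to verify that the API is reachable.
    return {"message": "hello world"}

@app.post("/invoiceitem", status_code=status.HTTP_201_CREATED)
def post_invoice_item(item: InvoiceItem):
    # In the full pipeline this handler forwards the item to a Kafka topic.
    return item
```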
Postman was used for testing the API: data was entered into Postman and the request was sent. A 201 response means the API is working.
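The same check can also be scripted, for example with the requests library; the host, port, and endpoint path below are assumptions and should match the running API, and the payload is a sample row from the dataset.

```python
# scripted version of the Postman check (illustrative)
import requests

payload = {
    "InvoiceNo": "536365",
    "StockCode": "85123A",
    "Description": "WHITE HANGING HEART T-LIGHT HOLDER",
    "Quantity": 6,
    "InvoiceDate": "2010-12-01 08:26:00",
    "UnitPrice": 2.55,
    "CustomerID": "17850",
    "Country": "United Kingdom",
}

# Host/port and endpoint path are assumptions; match them to the running container.
response = requests.post("http://localhost:80/invoiceitem", json=payload)
print(response.status_code)   # expect 201 when the item is accepted
```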
A docker-compose file for Kafka was created and run in the Docker network. A Kafka producer, a consumer, and a local ingestion topic were created and enabled. The Kafka producer was used to feed the topic and was tested with Postman, where the consumer received the data. (Screenshot: testing Kafka with Postman, Kafka consumer.)
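A hedged sketch of that producer/consumer round trip is given below, assuming the kafka-python package, a broker reachable at localhost:9092, and a topic named "ingestion-topic" (all assumptions).

```python
# kafka round-trip sketch -- illustrative only
import json
from kafka import KafkaProducer, KafkaConsumer

TOPIC = "ingestion-topic"          # assumed topic name

def produce_one(message: dict) -> None:
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send(TOPIC, message)  # the message is buffered in the topic
    producer.flush()

def consume_forever() -> None:
    consumer = KafkaConsumer(
        TOPIC,
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
        auto_offset_reset="earliest",
    )
    for record in consumer:        # each record is one buffered invoice item
        print(record.value)
```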
A Dockerfile and a requirements.txt file were created to build the API image inside the Docker environment. The local consumer topic was started, a request was sent from Postman again, and the invoice data was delivered successfully. (Screenshot: testing api-ingest with Postman.)
A docker-compose file that combines Kafka and Spark (with a Jupyter notebook interface) was created so that they run on a single network. The Jupyter notebook was used to stream data from Kafka into Spark: a Spark session loads the data, and Postman was again used to test the API.
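The notebook code might look roughly like the following, assuming the Spark-Kafka connector is available and the broker is reachable as kafka:9092 inside the compose network; the topic name is again an assumption.

```python
# Spark Structured Streaming sketch -- reading the Kafka topic from the notebook
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-to-spark").getOrCreate()

# Subscribe to the ingestion topic; Kafka delivers key/value as byte columns.
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "ingestion-topic")
    .option("startingOffsets", "earliest")
    .load()
)

# Cast the raw bytes to strings so the JSON payload can be parsed downstream.
messages = raw.select(col("value").cast("string").alias("json_value"))

# Print arriving messages to the notebook console to confirm the stream works.
query = messages.writeStream.format("console").start()
```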
A docker-compose file that combines Kafka, Spark, and MongoDB was created to run on a single network, so that each message can be sent to MongoDB. MongoDB was chosen because it is a widely used document store and works naturally with JSON documents.
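One way to send each micro-batch to MongoDB is with foreachBatch and pymongo, as sketched below; it builds on the `messages` DataFrame from the previous sketch, and the MongoDB hostname, database name, and collection name are assumptions.

```python
# Spark-to-MongoDB sketch -- writes each micro-batch as documents (illustrative)
from pymongo import MongoClient

def write_batch_to_mongo(batch_df, batch_id):
    # Convert the micro-batch rows to plain dicts and insert them as documents.
    docs = [row.asDict() for row in batch_df.collect()]
    if docs:
        client = MongoClient("mongodb://mongo:27017")   # assumed container hostname
        client["ecommerce"]["invoices"].insert_many(docs)
        client.close()

# `messages` is the streaming DataFrame from the previous sketch.
query = (
    messages.writeStream
    .foreachBatch(write_batch_to_mongo)
    .start()
)
query.awaitTermination()
```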