GitHub - terrencestella/TS-data-engineer-assignment

KommatiPara Data Pipeline

The primary goal of this project is to build a data pipeline using PySpark for a small company named KommatiPara which deals with Bitcoin trading. The pipeline processes two separate datasets containing client and financial details to prepare a consolidated dataset for a new marketing initiative. This dataset will focus on clients from the United Kingdom and the Netherlands, providing insights into their financial interactions.

Pipeline Steps

Initializing Spark Session: A Spark session is initialized to enable the processing of data using Spark DataFrame operations.

Loading Data: Two datasets are loaded into DataFrames from specified file paths.

Filtering Data: The client dataset is filtered to retain only the records of clients from the countries supplied by the user.

Dropping Unwanted Columns: Personally identifiable information, excluding email addresses, is removed from the client dataset. The credit card number is removed from the financial dataset.

Joining DataFrames: The client and financial datasets are joined on the id field to form a unified dataset.

Renaming Columns: Column names are modified for better readability and understanding.

Saving the Output: The final dataset is saved to the client_data directory in the root directory of the project.
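The steps above could be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the helper names, the column names (`country`, `first_name`, `last_name`, `cc_n`, `btc_a`), the CSV file format, and the choice of an inner join are all assumptions made for the example. Only the `id` join key and the `client_data` output directory come from the description above.

```python
from typing import TYPE_CHECKING, Iterable, Mapping

if TYPE_CHECKING:  # used for type hints only, so Spark is not imported at module load
    from pyspark.sql import DataFrame, SparkSession

# Hypothetical column names, chosen for illustration only.
CLIENT_PII_COLUMNS = ["first_name", "last_name"]   # email is kept on purpose
FINANCIAL_DROP_COLUMNS = ["cc_n"]                  # assumed credit card number column
RENAMES = {"btc_a": "bitcoin_address"}             # assumed old -> new names


def build_spark_session(app_name: str = "KommatiPara") -> "SparkSession":
    """Create (or reuse) a Spark session."""
    from pyspark.sql import SparkSession  # imported lazily, see note above

    return SparkSession.builder.appName(app_name).getOrCreate()


def filter_by_countries(df: "DataFrame", countries: Iterable[str],
                        country_col: str = "country") -> "DataFrame":
    """Keep only rows whose country is in the user-supplied list."""
    return df.filter(df[country_col].isin(list(countries)))


def drop_columns(df: "DataFrame", columns: Iterable[str]) -> "DataFrame":
    """Remove the given columns from the DataFrame."""
    return df.drop(*columns)


def join_on_id(clients: "DataFrame", financial: "DataFrame",
               key: str = "id") -> "DataFrame":
    """Join both datasets on the shared key (inner join is an assumption)."""
    return clients.join(financial, on=key, how="inner")


def rename_columns(df: "DataFrame", mapping: Mapping[str, str]) -> "DataFrame":
    """Rename columns one by one according to an old -> new mapping."""
    for old, new in mapping.items():
        df = df.withColumnRenamed(old, new)
    return df


def run_pipeline(spark: "SparkSession", clients_path: str, financial_path: str,
                 countries: Iterable[str], output_dir: str = "client_data") -> "DataFrame":
    """Load, filter, clean, join, rename, and save (CSV format is assumed)."""
    clients = spark.read.csv(clients_path, header=True)
    financial = spark.read.csv(financial_path, header=True)

    clients = filter_by_countries(clients, countries)
    clients = drop_columns(clients, CLIENT_PII_COLUMNS)
    financial = drop_columns(financial, FINANCIAL_DROP_COLUMNS)

    result = rename_columns(join_on_id(clients, financial), RENAMES)
    result.write.mode("overwrite").csv(output_dir, header=True)
    return result
```

With PySpark installed, a run might look like `run_pipeline(build_spark_session(), clients_path, financial_path, ["United Kingdom", "Netherlands"])`, where both paths point at the input files. Keeping each step in its own small function makes the transformations easy to test in isolation.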

Logging

Logging has been implemented to track the progress of the data pipeline and to record any errors that occur during data processing.
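One way such progress and error logging could be wired up with Python's standard `logging` module is sketched below. The function names `get_logger` and `log_step`, the log file name, and the message format are hypothetical; the `logs` directory matches the folder listed in the repository, but how the project actually configures its logger is not shown here.

```python
import logging
from contextlib import contextmanager
from pathlib import Path


def get_logger(name: str = "pipeline", log_dir: str = "logs") -> logging.Logger:
    """Return a logger that writes to <log_dir>/<name>.log."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid attaching duplicate handlers on repeated calls
        handler = logging.FileHandler(Path(log_dir) / f"{name}.log")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


@contextmanager
def log_step(logger: logging.Logger, step: str):
    """Log the start and end of a pipeline step, and any exception it raises."""
    logger.info("start: %s", step)
    try:
        yield
    except Exception:
        # logger.exception records the traceback; the error is then re-raised
        logger.exception("failed: %s", step)
        raise
    logger.info("done: %s", step)
```

A step would then be wrapped as `with log_step(logger, "filtering data"): ...`, so every stage leaves a start line and either a done line or a failure with its traceback.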

Testing

Unit tests have been written to ensure the correctness of the data transformations and the overall functionality of the pipeline.
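As an illustration of what a unit test for one transformation might look like, the sketch below checks a country filter against a tiny in-memory DataFrame. The helper `filter_by_countries`, the `country` column name, and the use of `unittest` are assumptions for the example; the repository's own tests and test framework are not shown here.

```python
import importlib.util
import unittest


def filter_by_countries(df, countries, country_col="country"):
    """Transformation under test (hypothetical helper)."""
    return df.filter(df[country_col].isin(list(countries)))


@unittest.skipUnless(importlib.util.find_spec("pyspark"), "pyspark is not installed")
class FilterByCountriesTest(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Imported here so the module can be loaded even without Spark.
        from pyspark.sql import SparkSession

        cls.spark = (
            SparkSession.builder.master("local[1]").appName("tests").getOrCreate()
        )

    @classmethod
    def tearDownClass(cls):
        cls.spark.stop()

    def test_keeps_only_requested_countries(self):
        df = self.spark.createDataFrame(
            [(1, "Netherlands"), (2, "France"), (3, "United Kingdom")],
            ["id", "country"],
        )
        result = filter_by_countries(df, ["United Kingdom", "Netherlands"])
        self.assertEqual(sorted(row["id"] for row in result.collect()), [1, 3])
```

Saved in a test module, this can be run with `python -m unittest`. A single local Spark session is shared across the class because starting a session per test is slow.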

GitHub Actions

GitHub Actions has been utilized to set up a continuous integration pipeline to automate the testing and ensure the codebase remains in a deployable state.

Folders and files (history: 18 commits)

.github/workflows
data
dist
logs
src
tests
.gitattributes
.gitignore
MANIFEST.in
README.md
requirements.txt
setup.py
