ravitejainti/data-processing


Problem Statement

The objective was to:

Generate a CSV file containing the following columns:

first_name
last_name
address
date_of_birth

Process the generated CSV file to anonymize sensitive information. The columns to be anonymized are:

first_name
last_name
address

Ensure the solution works with a large dataset, specifically around 2 GB in size, and demonstrate that it can handle even larger datasets efficiently.

Utilize a distributed computing platform to process large datasets effectively. In this project, Snowflake was chosen for this purpose.

Approach:

Python: for generating synthetic data using the Faker library.
Snowflake: a cloud-based data warehousing platform used for large-scale data processing and anonymization.
SQL: for data manipulation and anonymization within Snowflake.
GitHub: for sharing the project.
Google Drive: for sharing large datasets, as GitHub has file size limitations.

Step 1: Data Generation

Python’s Faker library was used to create names, addresses, and dates of birth. The Python code is available in the repository as samplefakedatagenerator.py.
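For readers who do not open the repository, the generation step can be sketched roughly as follows. This is a minimal illustration and not the contents of samplefakedatagenerator.py: the function names `fake_rows` and `write_csv` and the `seed` parameter are hypothetical, and the third-party Faker package is assumed to be installed.

```python
import csv

# Column order taken from the problem statement.
COLUMNS = ["first_name", "last_name", "address", "date_of_birth"]


def fake_rows(n, seed=None):
    """Yield n synthetic rows (hypothetical helper; needs the Faker package)."""
    # Imported lazily so write_csv can be used without Faker installed.
    from faker import Faker

    fake = Faker()
    if seed is not None:
        fake.seed_instance(seed)  # make the output reproducible
    for _ in range(n):
        yield [
            fake.first_name(),
            fake.last_name(),
            fake.address(),  # may contain newlines; csv quotes them
            fake.date_of_birth().isoformat(),
        ]


def write_csv(path, rows):
    """Stream rows to a CSV file with a header and return the row count."""
    count = 0
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(COLUMNS)
        for row in rows:
            writer.writerow(row)
            count += 1
    return count
```

Calling `write_csv("large_dataset.csv", fake_rows(1_000))` would write a small sample. Because rows are streamed from a generator, memory use stays flat as the row count grows toward the 2 GB target.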

Step 2: Loading Data into Snowflake

Since Snowflake’s web UI has a file upload limit of 250MB, the generated dataset needed to be split into smaller parts before loading.

Splitting the Large CSV File (large_dataset.csv)

To split the large CSV file into manageable parts, the following command was used in the terminal:

Command: split -b 200m large_dataset.csv part_

This produced files named part_aa, part_ab, part_ac, through part_aj.
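One caveat with splitting by byte count is that a part can end in the middle of a row, and generated addresses may contain newlines inside quoted fields, so splitting by line count is not fully safe either. The sketch below is an alternative that splits on CSV row boundaries. It is not what the project used: the function name `split_csv`, the `part_000.csv` naming, and the choice to repeat the header in every part are all assumptions.

```python
import csv
import io
from pathlib import Path


def split_csv(src, out_dir, max_bytes, prefix="part_"):
    """Split src into CSV parts of roughly max_bytes without cutting a row.

    A part is closed once it reaches max_bytes, so it can exceed the
    limit by at most one row. Each part repeats the header row.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    buf = io.StringIO()
    encoder = csv.writer(buf)

    def encode(row):
        # Serialize one row to bytes so its size can be measured exactly.
        buf.seek(0)
        buf.truncate(0)
        encoder.writerow(row)
        return buf.getvalue().encode("utf-8")

    parts, out, size = [], None, 0
    with open(src, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, None)
        if header is None:  # empty input: nothing to split
            return parts
        header_bytes = encode(header)
        try:
            for row in reader:
                if out is None or size >= max_bytes:
                    if out is not None:
                        out.close()
                    path = out_dir / f"{prefix}{len(parts):03d}.csv"
                    out = open(path, "wb")
                    out.write(header_bytes)
                    size = len(header_bytes)
                    parts.append(path)
                data = encode(row)
                out.write(data)
                size += len(data)
        finally:
            if out is not None:
                out.close()
    return parts
```

For example, `split_csv("large_dataset.csv", "parts", 200 * 1024 * 1024)` would produce parts of about 200 MiB each. Whether a header per part is wanted depends on how the files are loaded.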

Step 3: Data Anonymization

A table was created in Snowflake, and the sensitive columns were anonymized using the SHA-256 hashing algorithm. The SQL script is attached as Anonymization.sql. The anonymized data was then exported to a final CSV file named anonymized_data.csv.
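The project performs this step in Snowflake SQL, and Anonymization.sql is not reproduced here. To illustrate the same idea locally, the following Python sketch hashes the three sensitive columns with SHA-256 and leaves date_of_birth untouched. The names `sha256_hex` and `anonymize_csv` are hypothetical, and the hex-digest output format is an assumption about what the SQL script produces.

```python
import csv
import hashlib

# Columns the problem statement marks as sensitive.
SENSITIVE = ("first_name", "last_name", "address")


def sha256_hex(value):
    """Return the SHA-256 digest of a string as lowercase hex."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def anonymize_csv(src, dst, columns=SENSITIVE):
    """Copy src to dst, replacing the given columns with their SHA-256 hex."""
    count = 0
    with open(src, newline="", encoding="utf-8") as fin, \
            open(dst, "w", newline="", encoding="utf-8") as fout:
        reader = csv.DictReader(fin)
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:  # one row at a time, so memory use stays flat
            for col in columns:
                row[col] = sha256_hex(row[col])
            writer.writerow(row)
            count += 1
    return count
```

Plain SHA-256 is deterministic, so equal inputs always map to equal digests. Whether a salt or key is also needed depends on the threat model, which the source does not discuss.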

Google Drive link: https://drive.google.com/drive/folders/1tnv5quKsPEqH7pZsuEiPP-go8kLkg8sO?usp=sharing

About

Data processing: anonymising the data in a CSV file
