Pranshu936/Data_cleaning

🧹 Data Sweeper Pro+

Data Sweeper Pro+ is a data cleaning and transformation app built with Streamlit. It lets users upload datasets, clean them, explore them through interactive profiling reports, and export the cleaned data in multiple formats. The interface is aimed at both technical and non-technical users.


🚀 Features

1. File Upload

  • Supports multiple file uploads in CSV and Excel formats.
  • Handles large datasets efficiently.
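
A minimal sketch of how uploaded files could be read into pandas, dispatching on the file extension. The helper name `load_table` is hypothetical and not taken from `app.py`; the Streamlit wiring is shown only as a comment.

```python
import io
from pathlib import Path

import pandas as pd


def load_table(name: str, data: bytes) -> pd.DataFrame:
    """Read one uploaded file into a DataFrame based on its extension."""
    suffix = Path(name).suffix.lower()
    buffer = io.BytesIO(data)
    if suffix == ".csv":
        return pd.read_csv(buffer)
    if suffix == ".xlsx":
        # Reading .xlsx relies on openpyxl, which is listed in requirements.txt.
        return pd.read_excel(buffer, engine="openpyxl")
    raise ValueError(f"Unsupported file type: {suffix!r}")


# In a Streamlit app the inputs would typically come from something like:
#   files = st.file_uploader("Upload", type=["csv", "xlsx"], accept_multiple_files=True)
#   frames = {f.name: load_table(f.name, f.getvalue()) for f in files}
sample = b"id,score\n1,0.5\n2,0.9\n"
df = load_table("sample.csv", sample)
print(df.shape)
```

Keeping the reader separate from the UI code makes it easy to test without starting the app.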

2. Data Profiling

  • Generate an interactive Profile Report using ydata-profiling to explore:
    • Missing values
    • Duplicate rows
    • Data types
    • Statistical summaries
    • Correlations
  • Fully interactive HTML report embedded in the app.
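
For orientation, the quantities listed above can be approximated with plain pandas. This is a lightweight stand-in, not ydata-profiling itself; the function name `quick_profile` and the sample data are made up for illustration. The comment at the end shows how the full report is usually generated and embedded.

```python
import pandas as pd


def quick_profile(df: pd.DataFrame) -> dict:
    """Collect a few of the statistics a profile report also covers."""
    numeric = df.select_dtypes(include="number")
    return {
        "missing": df.isna().sum().to_dict(),          # missing values per column
        "duplicate_rows": int(df.duplicated().sum()),  # fully duplicated rows
        "dtypes": df.dtypes.astype(str).to_dict(),     # data types
        "summary": numeric.describe(),                 # statistical summaries
        "correlations": numeric.corr(),                # pairwise correlations
    }


sample = pd.DataFrame(
    {
        "a": [1.0, 2.0, 2.0, None],
        "b": [2.0, 4.0, 4.0, 8.0],
        "c": ["x", "y", "y", "z"],
    }
)
report = quick_profile(sample)
print(report["missing"])
print(report["duplicate_rows"])

# The full interactive report would typically be built and embedded like this:
#   from ydata_profiling import ProfileReport
#   import streamlit.components.v1 as components
#   html = ProfileReport(sample, title="Profile Report").to_html()
#   components.html(html, height=800, scrolling=True)
```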

3. Data Cleaning

  • Remove duplicate rows.
  • Handle missing values with strategies like:
    • Drop rows
    • Fill with mean/median
    • KNN imputation
  • Normalize numerical columns.
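
The cleaning steps above can be sketched with pandas and scikit-learn as follows. The function names are hypothetical, and min-max scaling is an assumption: this README does not say which normalization the app applies.

```python
import pandas as pd


def handle_missing(df: pd.DataFrame, strategy: str = "mean", n_neighbors: int = 2) -> pd.DataFrame:
    """Return a copy of df with missing values handled by the chosen strategy."""
    out = df.copy()
    num_cols = out.select_dtypes(include="number").columns
    if strategy == "drop":
        # Drop every row that has at least one missing value.
        return out.dropna().reset_index(drop=True)
    if strategy == "mean":
        out[num_cols] = out[num_cols].fillna(out[num_cols].mean())
    elif strategy == "median":
        out[num_cols] = out[num_cols].fillna(out[num_cols].median())
    elif strategy == "knn":
        # Imported here so the other strategies work without scikit-learn.
        from sklearn.impute import KNNImputer

        imputer = KNNImputer(n_neighbors=n_neighbors)
        out[num_cols] = imputer.fit_transform(out[num_cols])
    else:
        raise ValueError(f"Unknown strategy: {strategy!r}")
    return out


def min_max_normalize(df: pd.DataFrame) -> pd.DataFrame:
    """Scale numeric columns to the [0, 1] range (assumed normalization)."""
    out = df.copy()
    num_cols = out.select_dtypes(include="number").columns
    low = out[num_cols].min()
    span = (out[num_cols].max() - low).replace(0, 1)  # avoid dividing by zero
    out[num_cols] = (out[num_cols] - low) / span
    return out


raw = pd.DataFrame({"x": [1.0, 1.0, None, 5.0], "y": [10.0, 10.0, 20.0, 30.0]})
deduped = raw.drop_duplicates().reset_index(drop=True)
cleaned = handle_missing(deduped, strategy="mean")
normalized = min_max_normalize(cleaned)
print(normalized)
```

Only numeric columns are imputed here; text columns would need a separate strategy such as filling with the most frequent value.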

4. Data Transformations

Column Operations:

  • Select specific columns to keep or reorder them.

Data Type Conversion:

  • Convert columns to desired data types: string, integer, float, or datetime.
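
Column selection and type conversion might look like the sketch below. The helper `convert_column` and the sample columns are hypothetical; coercing unparseable values to missing values is one possible policy, not necessarily the one the app uses.

```python
import pandas as pd


def convert_column(df: pd.DataFrame, column: str, target: str) -> pd.DataFrame:
    """Return a copy of df with one column converted to the target type."""
    out = df.copy()
    if target == "string":
        out[column] = out[column].astype("string")
    elif target == "integer":
        # Nullable Int64 keeps missing values; non-integral numbers would raise.
        out[column] = pd.to_numeric(out[column], errors="coerce").astype("Int64")
    elif target == "float":
        out[column] = pd.to_numeric(out[column], errors="coerce").astype("float64")
    elif target == "datetime":
        out[column] = pd.to_datetime(out[column], errors="coerce")
    else:
        raise ValueError(f"Unknown target type: {target!r}")
    return out


raw = pd.DataFrame(
    {
        "qty": ["1", "2", "x"],
        "price": ["1.5", "2", "3"],
        "when": ["2024-01-05", "2024-02-10", "not a date"],
    }
)
typed = convert_column(raw, "qty", "integer")
typed = convert_column(typed, "price", "float")
typed = convert_column(typed, "when", "datetime")

# Selecting and reordering columns is plain indexing with a list of names.
typed = typed[["when", "qty", "price"]]
print(typed.dtypes)
```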

Feature Engineering:

  • Add new columns based on existing ones (e.g., sum of two columns).
  • Extract date parts (e.g., year from a date column).
  • Apply custom formulas for advanced transformations.
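
The three kinds of feature engineering listed above can be illustrated in a few lines of pandas. The column names are invented, and using `DataFrame.eval` for custom formulas is an assumption about how such formulas could be evaluated, not a description of the app's code.

```python
import pandas as pd


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with three illustrative derived columns."""
    out = df.copy()
    # New column from existing ones: sum of two columns.
    out["price_plus_shipping"] = out["price"] + out["shipping"]
    # Date part extraction: year from a datetime column.
    out["order_year"] = out["order_date"].dt.year
    # Custom formula given as a string expression.
    out = out.eval("revenue = price * qty")
    return out


orders = pd.DataFrame(
    {
        "price": [10.0, 20.0],
        "shipping": [2.5, 0.0],
        "qty": [2, 3],
        "order_date": pd.to_datetime(["2023-12-31", "2024-01-15"]),
    }
)
features = add_features(orders)
print(features[["price_plus_shipping", "order_year", "revenue"]])
```

Formula strings typed by users should be validated before evaluation, since a malformed expression raises an exception.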

5. Visualization

  • Generate interactive charts using Plotly:
    • Histograms
    • Scatter plots
    • Box plots
    • Line charts

6. Export Options

  • Export cleaned data in multiple formats:
    • CSV
    • Excel
    • JSON
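
Export to the three formats can be sketched as a function that serializes a DataFrame to bytes, which is the form a download widget expects. The name `export_bytes` and the record-oriented JSON layout are assumptions for illustration.

```python
import io

import pandas as pd


def export_bytes(df: pd.DataFrame, fmt: str) -> bytes:
    """Serialize df to CSV, Excel, or JSON and return the raw bytes."""
    if fmt == "csv":
        return df.to_csv(index=False).encode("utf-8")
    if fmt == "json":
        # One JSON object per row (assumed layout).
        return df.to_json(orient="records").encode("utf-8")
    if fmt == "excel":
        buffer = io.BytesIO()
        df.to_excel(buffer, index=False, engine="openpyxl")
        return buffer.getvalue()
    raise ValueError(f"Unknown export format: {fmt!r}")


cleaned = pd.DataFrame({"id": [1, 2], "score": [0.5, 0.9]})
payload = export_bytes(cleaned, "json")
print(payload.decode("utf-8"))

# Inside Streamlit the bytes could be offered for download with:
#   st.download_button("Download JSON", data=payload, file_name="cleaned.json")
```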

🛠️ Installation

Prerequisites:

  • Python >= 3.12
  • pip (Python package manager)

Step-by-Step Guide:

  1. Clone the repository:

    git clone https://github.com/Pranshu936/Data_cleaning.git
    cd Data_cleaning
  2. Create a virtual environment (optional but recommended):

    python -m venv myenv
    source myenv/bin/activate    # On Linux/macOS
    myenv\Scripts\activate       # On Windows
  3. Install dependencies:

    pip install -r requirements.txt
  4. Run the app:

    streamlit run app.py
  5. Open the app in your browser at http://localhost:8501.


📂 Directory Structure

data-sweeper-pro/
├── .streamlit/
│   └── config.toml       # Streamlit theme configuration
├── app.py                # Main Streamlit application script
├── requirements.txt      # Python dependencies list
├── large_test_data.csv   # Example large dataset for testing (optional)
└── README.md             # Project documentation (this file)

📊 Example Use Case

  1. Upload a dataset (large_test_data.csv) containing missing values, duplicates, and mixed data types.
  2. Generate a full profile report to explore the dataset.
  3. Clean the data by removing duplicates, handling missing values, and normalizing numerical columns.
  4. Apply transformations like converting column types or creating new features.
  5. Visualize trends and patterns using interactive charts.
  6. Export the cleaned dataset as a CSV, Excel, or JSON file.

🧩 Dependencies

The following Python libraries are used in this project:

streamlit==1.29.0
pandas==2.1.3
numpy==1.26.4
plotly==5.18.0
ydata-profiling==4.12.2
scikit-learn==1.3.2
openpyxl==3.1.2
scipy==1.11.4

Install them using:

pip install -r requirements.txt

🤝 Contributing

Contributions are welcome! Please follow these steps:

  1. Fork this repository.
  2. Create a new branch (git checkout -b feature-name).
  3. Commit your changes (git commit -m "Add feature-name").
  4. Push to your branch (git push origin feature-name).
  5. Open a pull request.
