Data Sweeper Pro+ is an advanced data cleaning and transformation platform built with Streamlit. It allows users to upload datasets, clean them, analyze them with interactive profiling reports, and export the cleaned data in multiple formats. The app is designed for both technical and non-technical users, offering a user-friendly interface with powerful data processing capabilities.
- Supports multiple file uploads in CSV and Excel formats.
- Handles large datasets efficiently.
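As a rough illustration of how several uploaded files might be read, here is a minimal sketch that dispatches on the file extension. The helper name `load_table` and the sample file `sales.csv` are hypothetical and not taken from `app.py`; the app's actual loading code may differ.

```python
import io
from pathlib import Path

import pandas as pd


def load_table(filename: str, data: bytes) -> pd.DataFrame:
    """Read one uploaded file into a DataFrame based on its extension."""
    suffix = Path(filename).suffix.lower()
    buffer = io.BytesIO(data)
    if suffix == ".csv":
        return pd.read_csv(buffer)
    if suffix == ".xlsx":
        # Reading .xlsx relies on openpyxl, which is listed in requirements.txt.
        return pd.read_excel(buffer, engine="openpyxl")
    raise ValueError(f"Unsupported file type: {filename}")


# Several uploads can be handled by looping over (name, bytes) pairs.
# The sample content below is made up for illustration.
uploads = {"sales.csv": b"id,amount\n1,10.5\n2,\n"}
frames = {name: load_table(name, raw) for name, raw in uploads.items()}
print(frames["sales.csv"].shape)
```

Keeping the loader independent of Streamlit makes it easy to test without running the app.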
- Generate an interactive Profile Report using ydata-profiling to explore:
  - Missing values
  - Duplicate rows
  - Data types
  - Statistical summaries
  - Correlations
- Fully interactive HTML report embedded in the app.
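To make the report contents concrete, the sketch below computes the same kinds of quantities with plain pandas. It is only an approximation for illustration: it does not use the ydata-profiling API, and the function name `quick_profile` and the sample data are hypothetical.

```python
import pandas as pd


def quick_profile(df: pd.DataFrame) -> dict:
    """Collect a few of the statistics a profiling report typically shows."""
    numeric = df.select_dtypes(include="number")
    return {
        "missing_per_column": df.isna().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
        "dtypes": df.dtypes.astype(str).to_dict(),
        "summary": numeric.describe(),      # count, mean, std, quartiles
        "correlations": numeric.corr(),     # pairwise Pearson correlation
    }


# Made-up sample data: one missing age and one fully duplicated row.
df = pd.DataFrame({
    "age": [25, 32, 32, None],
    "income": [40000.0, 52000.0, 52000.0, 61000.0],
    "city": ["Oslo", "Lima", "Lima", "Pune"],
})
profile = quick_profile(df)
print(profile["missing_per_column"], profile["duplicate_rows"])
```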
- Remove duplicate rows.
- Handle missing values with strategies like:
  - Drop rows
  - Fill with mean/median
  - KNN imputation
- Normalize numerical columns.
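The cleaning steps above can be sketched with pandas as follows. This is a minimal example under stated assumptions: the function name `clean` is hypothetical, normalization is assumed to mean min-max scaling to [0, 1] (the app may use a different method), and KNN imputation is left out because it would need scikit-learn.

```python
import pandas as pd


def clean(df: pd.DataFrame, strategy: str = "median") -> pd.DataFrame:
    """Drop duplicates, handle missing values, then min-max scale numeric columns."""
    out = df.drop_duplicates().reset_index(drop=True)
    numeric_cols = out.select_dtypes(include="number").columns

    if strategy == "drop":
        out = out.dropna().reset_index(drop=True)
    elif strategy in {"mean", "median"}:
        stats = out[numeric_cols].mean() if strategy == "mean" else out[numeric_cols].median()
        # Each numeric column is filled with its own mean or median.
        out[numeric_cols] = out[numeric_cols].fillna(stats)
    else:
        raise ValueError(f"Unknown strategy: {strategy}")

    # Assumed normalization: min-max scaling. Constant columns are left unchanged.
    for col in numeric_cols:
        lo, hi = out[col].min(), out[col].max()
        if hi > lo:
            out[col] = (out[col] - lo) / (hi - lo)
    return out


# Made-up sample data: the first two rows are duplicates, the third has a gap.
raw = pd.DataFrame({
    "score": [10.0, 10.0, None, 30.0],
    "group": ["a", "a", "b", "c"],
})
cleaned = clean(raw, strategy="median")
print(cleaned)
```

Duplicates are removed before imputation so that repeated rows do not skew the fill value.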
- Select specific columns to keep or reorder them.
- Convert columns to desired data types: string, integer, float, or datetime.
- Add new columns based on existing ones (e.g., sum of two columns).
- Extract date parts (e.g., year from a date column).
- Apply custom formulas for advanced transformations.
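The transformations listed above map onto a handful of pandas operations. The sketch below is illustrative only: the column names and values are made up, and using `DataFrame.eval` for custom formulas is an assumption about how such formulas could be applied, not a description of the app's implementation.

```python
import pandas as pd

# Made-up sample data in which every column starts as text or float.
df = pd.DataFrame({
    "order_id": ["1", "2", "3"],
    "price": ["9.50", "20.00", "5.25"],
    "shipping": [1.0, 2.5, 0.0],
    "ordered_on": ["2023-01-15", "2024-06-30", "2024-12-01"],
})

# Convert columns to the desired data types.
df["order_id"] = df["order_id"].astype(int)
df["price"] = df["price"].astype(float)
df["ordered_on"] = pd.to_datetime(df["ordered_on"])

# Add a new column based on existing ones (sum of two columns).
df["total"] = df["price"] + df["shipping"]

# Extract a date part (year from a date column).
df["order_year"] = df["ordered_on"].dt.year

# Apply a custom formula written as a string expression.
df = df.eval("price_with_tax = price * 1.2")

# Select specific columns and reorder them.
df = df[["order_id", "ordered_on", "order_year", "total", "price_with_tax"]]
print(df)
```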
- Generate interactive charts using Plotly:
  - Histograms
  - Scatter plots
  - Box plots
  - Line charts
- Export cleaned data in multiple formats:
  - CSV
  - Excel
  - JSON
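One way to produce the three export formats is to serialize the DataFrame to bytes that a download button can serve. This is a sketch under assumptions: the helper name `export_bytes` is hypothetical, the sample data is made up, and the JSON layout (`orient="records"`) is a choice made for the example.

```python
import io
import json

import pandas as pd


def export_bytes(df: pd.DataFrame, fmt: str) -> bytes:
    """Serialize a DataFrame to CSV, JSON, or Excel bytes."""
    if fmt == "csv":
        return df.to_csv(index=False).encode("utf-8")
    if fmt == "json":
        return df.to_json(orient="records").encode("utf-8")
    if fmt == "excel":
        buffer = io.BytesIO()
        # Writing .xlsx requires openpyxl, which is listed in requirements.txt.
        df.to_excel(buffer, index=False, engine="openpyxl")
        return buffer.getvalue()
    raise ValueError(f"Unsupported export format: {fmt}")


cleaned = pd.DataFrame({"id": [1, 2], "name": ["Ada", "Lin"]})
csv_lines = export_bytes(cleaned, "csv").decode("utf-8").splitlines()
records = json.loads(export_bytes(cleaned, "json"))
print(csv_lines)
print(records)
```

Returning bytes rather than writing to disk keeps the export step independent of where the file ends up.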
- Python >= 3.12
- pip (Python package manager)
- Clone the repository:
    git clone https://github.com/your-repo/data-sweeper-pro.git
    cd data-sweeper-pro
- Create a virtual environment (optional but recommended):
    python -m venv myenv
    source myenv/bin/activate   # On Linux/macOS
    myenv\Scripts\activate      # On Windows
- Install dependencies:
    pip install -r requirements.txt
- Run the app:
    streamlit run app.py
- Open the app in your browser at http://localhost:8501.
data-sweeper-pro/
├── .streamlit/
│ └── config.toml # Streamlit theme configuration
├── app.py # Main Streamlit application script
├── requirements.txt # Python dependencies list
├── large_test_data.csv # Example large dataset for testing (optional)
└── README.md # Project documentation (this file)
- Upload a dataset (large_test_data.csv) containing missing values, duplicates, and mixed data types.
- Generate a full profile report to explore the dataset.
- Clean the data by removing duplicates, handling missing values, and normalizing numerical columns.
- Apply transformations like converting column types or creating new features.
- Visualize trends and patterns using interactive charts.
- Export the cleaned dataset as a CSV or Excel file.
The following Python libraries are used in this project:
streamlit==1.29.0
pandas==2.1.3
numpy==1.26.4
plotly==5.18.0
ydata-profiling==4.12.2
scikit-learn==1.3.2
openpyxl==3.1.2
scipy==1.11.4
Install them using:
pip install -r requirements.txt
Contributions are welcome! Please follow these steps:
- Fork this repository.
- Create a new branch (git checkout -b feature-name).
- Commit your changes (git commit -m "Add feature-name").
- Push to your branch (git push origin feature-name).
- Open a pull request.