Welcome to the Clean-and-Transform project, where we dive into the realm of data refinement and enhancement using Python's powerful pandas and numpy libraries. In this endeavor, I aim to demonstrate my expertise in the art of data cleaning and transformation.
Data is at the heart of every data-driven project, and the quality of your data directly impacts the success of your analysis, modeling, and decision-making. In many real-world scenarios, data can be messy, inconsistent, or riddled with inaccuracies. This project provides an opportunity to showcase my skills in addressing these challenges, making the data more reliable, and preparing it for further analysis.
For this project, I've chosen a well-documented dataset with a substantial number of data points. Although the dataset was generated by a machine, it is not immune to discrepancies. These discrepancies offer a valuable opportunity to practice data cleaning and transformation, turning raw data into a refined, trustworthy resource.
Throughout this project, I will tackle a variety of data issues, including missing values, inconsistent formats, outliers, and more. I'll employ pandas and numpy to perform these transformations systematically and efficiently. By the end of this project, you can expect to see the data in a much-improved state, ready for further analysis, visualization, or machine learning applications.
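To make these kinds of issues concrete, here is a minimal sketch of how missing values, inconsistent formats, and outliers might be handled with pandas and numpy. The toy data, the column names (`channel`, `subscribers`, `country`), and the `clean` helper are hypothetical and chosen purely for illustration; they are not taken from the notebook or the dataset.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical columns; not the project's actual dataset.
raw = pd.DataFrame({
    "channel": [" Alpha ", "beta", "GAMMA", "delta", "epsilon", "zeta"],
    "subscribers": ["1,200", "950", None, "1,100", "1,050", "250,000"],
    "country": ["US", "us", "India", None, "india", "US"],
})


def clean(df):
    """Return a cleaned copy of df (illustrative helper, not the notebook's code)."""
    out = df.copy()

    # Inconsistent formats: trim whitespace and normalize letter case.
    out["channel"] = out["channel"].str.strip().str.title()
    out["country"] = out["country"].str.strip().str.upper()

    # Numbers stored as text: drop thousands separators, then convert.
    # Values that cannot be parsed become NaN instead of raising an error.
    as_text = out["subscribers"].str.replace(",", "", regex=False)
    out["subscribers"] = pd.to_numeric(as_text, errors="coerce")

    # Missing values: median for the numeric column, a label for the text column.
    out["subscribers"] = out["subscribers"].fillna(out["subscribers"].median())
    out["country"] = out["country"].fillna("UNKNOWN")

    # Outliers: flag values more than 1.5 * IQR outside the quartiles.
    q1, q3 = np.percentile(out["subscribers"], [25, 75])
    iqr = q3 - q1
    out["is_outlier"] = (out["subscribers"] < q1 - 1.5 * iqr) | (
        out["subscribers"] > q3 + 1.5 * iqr
    )
    return out


cleaned = clean(raw)
print(cleaned)
```

Flagging outliers rather than dropping them keeps the decision visible; whether to remove, cap, or keep them depends on the analysis that follows.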
Let's roll up our sleeves and get started!
You can view a rendered version of the notebook here, or a PDF version of the notebook here.
- Python: Ensure that Python is installed on your machine. You can download it from python.org.
- Jupyter Lab: Install Jupyter Lab using the following command in your terminal or command prompt:
pip install jupyterlab
- External Libraries: Use pip install to install the required libraries:
pip install pandas numpy matplotlib
- Download: Download the Jupyter notebook file youtube.ipynb from this repository to your local machine.
- Run Jupyter Lab Server:
- Open a terminal or command prompt.
- Navigate to the directory where you saved the notebook file.
- Run the following command:
jupyter lab
- Access the notebook:
- Open your web browser and go to the URL displayed in the terminal.
- Navigate to the notebook file and click on it to open.
- Interact with the Notebook:
- Execute code cells using the "Run" button or by pressing Shift + Enter.
It is recommended to run all cells, as this ensures that every cell executes properly.
- youtube.ipynb: Jupyter notebook containing all the steps of the data cleaning process.
- youtube.pdf: PDF rendition of the aforementioned Jupyter notebook.
- Global YouTube Statistics.csv: CSV file of the original dataset.
- README.md: Instructions on how to get started, install dependencies, and use the Jupyter notebook.
- license.txt: Text file containing the MIT open-source software license.
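As a quick first look at the CSV file listed above, the following sketch loads it and summarizes missing values per column. The `missing_summary` helper is hypothetical, and the code assumes the CSV sits in the current working directory and that the default `pd.read_csv` settings can decode it; if decoding fails, an explicit `encoding=` argument may be needed.

```python
from pathlib import Path

import pandas as pd

# File name as listed above; adjust the path if the CSV lives elsewhere.
DATASET = Path("Global YouTube Statistics.csv")


def missing_summary(df):
    """Return per-column missing counts and percentages, worst first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    return summary.sort_values("missing", ascending=False)


if DATASET.exists():
    # Assumption: default read_csv settings work for this file.
    df = pd.read_csv(DATASET)
    print(df.shape)
    print(missing_summary(df))
else:
    print(f"{DATASET} not found in {Path.cwd()}")
```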
- Python: Version 3.10.12
- Jupyter Lab: Version 4.0.5
- Libraries:
- NumPy: Version 1.25.2
- pandas: Version 2.0.3
- matplotlib: Version 3.7.2
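To compare a local environment against the versions listed above, a small check along these lines may help. It is only a sketch: the `compare_versions` helper is hypothetical, and it relies on the standard library's `importlib.metadata`, so it reports packages that are not installed instead of failing.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions as listed in this README.
EXPECTED = {
    "numpy": "1.25.2",
    "pandas": "2.0.3",
    "matplotlib": "3.7.2",
    "jupyterlab": "4.0.5",
}


def compare_versions(expected):
    """Map each package to (expected version, installed version or None)."""
    report = {}
    for package, wanted in expected.items():
        try:
            found = version(package)
        except PackageNotFoundError:
            found = None  # package is not installed in this environment
        report[package] = (wanted, found)
    return report


for package, (wanted, found) in compare_versions(EXPECTED).items():
    status = "ok" if found == wanted else "differs"
    print(f"{package}: expected {wanted}, found {found} ({status})")
```

A different version does not necessarily mean the notebook will fail; the listed versions are simply the ones the project was run with.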
This project is licensed under the MIT License - see the LICENSE file for details.