Skip to content

jkuatdsc/dsc-pandas-data-cleaning-lab

 
 

Repository files navigation

Pandas Data Cleaning - Cumulative Lab

Introduction

In this lab, we'll make use of everything we've learned about pandas, data cleaning, and exploratory data analysis. In order to complete this lab, you'll have to import, clean, combine, reshape, and visualize data to answer questions provided, as well as your own questions!

Objectives

You will be able to:

  • Practice opening and inspecting the contents of CSVs using pandas dataframes
  • Practice identifying and handling missing values
  • Practice identifying and handling invalid values
  • Practice cleaning text data by removing whitespace and fixing typos
  • Practice joining multiple dataframes

Your Task: Clean the Superheroes Dataset with Pandas

LEGO superheroes

Photo by Yulia Matvienko on Unsplash

Data Understanding

In this lab, we'll work with a version of the comprehensive Superheroes Dataset, which can be found on Kaggle and was originally scraped from SuperHeroDb. We have modified the structure and contents of the dataset somewhat for the purposes of this lab. Note that this data was collected in June 2017, so it may not reflect the most up-to-date superhero lore.

The data is contained in two separate CSV files:

  1. heroes_information.csv: each record represents a superhero, with attributes of that superhero (e.g. eye color). Height is measured in centimeters, and weight is measured in pounds.
  2. super_hero_powers.csv: each record represents a superpower, then has True/False values representing whether each superhero has that power

Business Understanding

The business questions you have been provided are:

  1. What is the distribution of superheroes by publisher?
  2. What is the relationship between height and number of superpowers? And does this differ based on gender?
  3. What are the 5 most common superpowers in Marvel Comics vs. DC Comics?

This lab also simulates something you are likely to encounter at some point or another in your career in data science: someone has given you access to a dataset, as well as a few questions, and has told you to "find something interesting".

So, in addition to completing the basic data cleaning tasks and the aggregation and reshaping tasks needed to answer the provided questions, you will also need to formulate a question of your own and perform any additional cleaning/aggregation/reshaping that is needed to answer it.

Requirements

1. Load the Data with Pandas

Create a dataframes heroes_df and powers_df that represent the two CSV files. Use pandas methods to inspect the shape and other attributes of these dataframes.

2. Perform Data Cleaning Required to Answer First Question

The first question is: What is the distribution of superheroes by publisher?

In order to answer this question, you will need to:

  • Identify and handle missing values
  • Identify and handle text data requiring cleaning

3. Perform Data Aggregation and Cleaning Required to Answer Second Question

The second question is: What is the relationship between height and number of superpowers? And does this differ based on gender?

In order to answer this question, you will need to:

  • Join the dataframes together
  • Identify and handle invalid values

4. Perform Data Aggregation Required to Answer Third Question

The third question is: What are the 5 most common superpowers in Marvel Comics vs. DC Comics?

This should not require any additional data cleaning or joining of tables, but it will require some additional aggregation.

5. Formulate and Answer Your Own Question

This part is fairly open-ended. Think of a question that can be answered with the available data, and perform any cleaning or aggregation required to answer that question.

1. Load the Data with Pandas

In the cell below, we:

  • Import and alias pandas as pd
  • Import and alias numpy as np
  • Import and alias seaborn as sns
  • Import and alias matplotlib.pyplot as plt
  • Set Matplotlib visualizations to display inline in the notebook
# Run this cell without changes

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

Superheroes

In the cell below, load heroes_information.csv as heroes_df:

# Your code here

heroes_df.head()

It looks like that CSV came with an index column, resulting in an extra column called Unnamed: 0. We don't need that column, so write code to get rid of it below.

There are two ways to do this:

  1. Re-load with read_csv, and specify the parameter index_col=0
  2. Drop the column Unnamed: 0 with axis=1
# Your code here

heroes_df.head()

The following code checks that the dataframe was loaded correctly.

# Run this cell without changes

# There should be 734 rows
assert heroes_df.shape[0] == 734

# There should be 10 columns. If this fails, make sure you got rid of
# the extra index column
assert heroes_df.shape[1] == 10

# These should be the columns
assert list(heroes_df.columns) == ['name', 'Gender', 'Eye color', 'Race',
 'Hair color', 'Height', 'Publisher', 'Skin color', 'Alignment', 'Weight']

Now you want to get familiar with the data. This step includes:

  • Understanding the dimensionality of your dataset
  • Investigating what type of data it contains, and the data types used to store it
  • Discovering how missing values are encoded, and how many there are
  • Getting a feel for what information it does and doesn't contain

In the cell below, inspect the overall shape of the dataframe:

# Your code here

Now let's look at the info printout:

# Run this cell without changes
heroes_df.info()

In the cell below, interpret that information. Do the data types line up with what we expect? Are there any missing values?

# Replace None with appropriate text
"""
None
"""

Superpowers

Now, repeat the same process with super_hero_powers.csv. Name the dataframe powers_df. This time, make sure you use index_col=0 when opening the CSV because the index contains important information.

# Your code here (create more cells as needed)

The following code will check if it was loaded correctly:

# Run this cell without changes

# There should be 167 rows, 667 columns
assert powers_df.shape == (167, 667)

# The first column should be '3-D Man'
assert powers_df.columns[0] == '3-D Man'

# The last column should be 'Zoom'
assert powers_df.columns[-1] == 'Zoom'

# The first index should be 'Agility'
assert powers_df.index[0] == 'Agility'

# The last index should be 'Omniscient'
assert powers_df.index[-1] == 'Omniscient'

2. Perform Data Cleaning Required to Answer First Question

Recall that the first question is: What is the distribution of superheroes by publisher?

To answer this question, we will only need to use heroes_df, which contains the Publisher column.

Identifying and Handling Missing Values

As you likely noted above, the Publisher column is missing some values. Let's take a look at some samples with and without missing publisher values:

# Run this cell without changes
has_publisher_sample = heroes_df[heroes_df["Publisher"].notna()].sample(5, random_state=1)
has_publisher_sample
# Run this cell without changes
missing_publisher_sample = heroes_df[heroes_df["Publisher"].isna()].sample(5, random_state=1)
missing_publisher_sample

What do we want to do about these missing values?

Recall that there are two general strategies for dealing with missing values:

  1. Fill in missing values (either using another value from the column, e.g. the mean or mode, or using some other value like "Unknown")
  2. Drop rows with missing values

Write your answer below, and explain how it relates to the information we have:

# Replace None with appropriate text
"""
None
"""

Now, implement your chosen strategy using code. (You can also check the solution branch for the answer to the question above if you're really not sure.)

# Your code here

Now there should be no missing values in the publisher column:

# Run this cell without changes
assert heroes_df["Publisher"].isna().sum() == 0

Identifying and Handling Text Data Requiring Cleaning

The overall field of natural language processing (NLP) is quite broad, and we're not going to get into any advanced text processing, but it's useful to be able to clean up minor issues in text data.

Let's take a look at the counts of heroes grouped by publisher:

# Run this cell without changes
heroes_df["Publisher"].value_counts()

There are two cases where we appear to have data entry issues, and publishers that should be encoded the same have not been. In other words, there are four categories present that really should be counted as two categories (and you do not need specific comic book knowledge to be able to identify them).

Identify those two cases below:

# Replace None with appropriate text
"""
None
"""

Now, write some code to handle these cases. If you're not sure where to start, look at the pandas documentation for replacing values and stripping off whitespace.

# Your code here

Check your work below:

# Run this cell without changes
heroes_df["Publisher"].value_counts()

Answering the Question

Now we should be able to answer What is the distribution of superheroes by publisher?

If your data cleaning was done correctly, this code should work without any further changes:

# Run this cell without changes

# Set up plots
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(16, 5))

# Create variables for easier reuse
value_counts = heroes_df["Publisher"].value_counts()
top_5_counts = value_counts.iloc[:5]

# Plot data
ax1.bar(value_counts.index, value_counts.values)
ax2.bar(top_5_counts.index, top_5_counts.values)

# Customize appearance
ax1.tick_params(axis="x", labelrotation=90)
ax2.tick_params(axis="x", labelrotation=45)
ax1.set_ylabel("Count of Superheroes")
ax2.set_ylabel("Count of Superheroes")
ax1.set_title("Distribution of Superheroes by Publisher")
ax2.set_title("Top 5 Publishers by Count of Superheroes");

3. Perform Data Aggregation and Cleaning Required to Answer Second Question

Recall that the second question is: What is the relationship between height and number of superpowers? And does this differ based on gender?

Unlike the previous question, we won't be able to answer this with just heroes_df, since information about height is contained in heroes_df, while information about superpowers is contained in powers_df.

Joining the Dataframes Together

First, identify the shared key between heroes_df and powers_df. (Shared key meaning, the values you want to join on.) Let's look at them again:

# Run this cell without changes
heroes_df
# Run this cell without changes
powers_df

In the cell below, identify the shared key, and your strategy for joining the data (e.g. what will one record represent after you join, will you do a left/right/inner/outer join):

# Replace None with appropriate text
"""
None
"""

In the cell below, create a new dataframe called heroes_and_powers_df that contains the joined data. You can look at the above answer in the solution branch if you're not sure where to start.

Hint: Note that the .join method requires that the two dataframes share an index (documentation here) whereas the .merge method can join using any columns (documentation here). It is up to you which one you want to use.

# Your code here (create more cells as needed)

Run the code below to check your work:

# Run this cell without changes

# Confirms you have created a dataframe with the specified name
assert type(heroes_and_powers_df) == pd.DataFrame

# Confirms you have the right number of rows
assert heroes_and_powers_df.shape[0] == 647

# Confirms you have the necessary columns
# (If you modified the value of powers_df along the way, you might need to
# modify this test. We are checking that all of the powers are present as
# columns.)
assert [power in heroes_and_powers_df.columns for power in powers_df.index]
# (If you modified the value of heroes_df along the way, you mgith need to 
# modify this as well. We are checking that all of the attribute columns from
# heroes_df are present as columns in the joined df)
assert [attribute in heroes_and_powers_df.columns for attribute in heroes_df.columns]

Now that we have created a joined dataframe, we can aggregate the number of superpowers by superhero. This code is written for you:

# Run this cell without changes

# Note: we can use sum() with True and False values and they will
# automatically be cast to 1s and 0s
heroes_and_powers_df["Power Count"] = sum([heroes_and_powers_df[power_name] for power_name in powers_df.index])
heroes_and_powers_df

Answering the Question

Now we can plot the height vs. the count of powers:

# Run this cell without changes

fig, ax = plt.subplots(figsize=(16, 8))

ax.scatter(
    x=heroes_and_powers_df["Height"],
    y=heroes_and_powers_df["Power Count"],
    alpha=0.3
)

ax.set_xlabel("Height (cm)")
ax.set_ylabel("Number of Superpowers")
ax.set_title("Height vs. Power Count");

Hmm...what is that stack of values off below zero? What is a "negative" height?

Identifying and Handling Invalid values

One of the trickier tasks in data cleaning is identifying invalid or impossible values. In these cases, you have to apply your domain knowledge rather than any particular computational technique. For example, if you were looking at data containing dates of past home sales, and one of those dates was 100 years in the future, pandas wouldn't flag that as an issue, but you as a data scientist should be able to identify it.

In this case, we are looking at heights, which are 1-dimensional, positive numbers. In theory we could have a very tiny height close to 0 cm because the hero is microscopic, but it does not make sense that we would have a height below zero.

Let's take a look at a sample of those negative heights:

# Run this cell without changes
heroes_and_powers_df[heroes_and_powers_df["Height"] < 0].sample(5, random_state=1)

It looks like not only are those heights negative, those weights are negative also, and all of them are set to exactly -99.0.

It seems like this data source probably filled in -99.0 as the height or weight whenever it was unknown, instead of just leaving it as NaN.

Depending on the purpose of the analysis, maybe this would be a useful piece of information, but for our current question, let's go ahead and drop the records where the height is -99.0. We'll make a new temporary dataframe to make sure we don't accidentally delete anything that will be needed in a future question.

# Run this cell without changes
question_2_df = heroes_and_powers_df[heroes_and_powers_df["Height"] != -99.0].copy()
question_2_df

Answering the Question, Again

Now we can redo that plot without those negative heights:

# Run this cell without changes

fig, ax = plt.subplots(figsize=(16, 8))

ax.scatter(
    x=question_2_df["Height"],
    y=question_2_df["Power Count"],
    alpha=0.3
)

ax.set_xlabel("Height (cm)")
ax.set_ylabel("Number of Superpowers")
ax.set_title("Height vs. Power Count");

Ok, that makes more sense. It looks like there is not much of a relationship between height and number of superpowers.

Now we can go on to answering the second half of question 2: And does this differ based on gender?

To indicate multiple categories within a scatter plot, we can use color to add a third dimension:

# Run this cell without changes

fig, ax = plt.subplots(figsize=(16, 8))

# Select subsets
question_2_male = question_2_df[question_2_df["Gender"] == "Male"]
question_2_female = question_2_df[question_2_df["Gender"] == "Female"]
question_2_other = question_2_df[(question_2_df["Gender"] != "Male") & (question_2_df["Gender"] != "Female")]

# Plot data with different colors
ax.scatter(
    x=question_2_male["Height"],
    y=question_2_male["Power Count"],
    alpha=0.5,
    color="cyan",
    label="Male"
)
ax.scatter(
    x=question_2_female["Height"],
    y=question_2_female["Power Count"],
    alpha=0.5,
    color="gray",
    label="Female"
)
ax.scatter(
    x=question_2_other["Height"],
    y=question_2_other["Power Count"],
    alpha=0.5,
    color="yellow",
    label="Other"
)

# Customize appearance
ax.set_xlabel("Height (cm)")
ax.set_ylabel("Number of Superpowers")
ax.set_title("Height vs. Power Count")
ax.legend();

It appears that there is still no clear relationship between count of powers and height, regardless of gender. We do however note that "Male" is the most common gender, and that male superheroes tend to be taller, on average.

4. Perform Data Aggregation Required to Answer Third Question

Recall that the third question is: What are the 5 most common superpowers in Marvel Comics vs. DC Comics?

We'll need to keep using heroes_and_powers_df since we require information from both heroes_df and powers_df.

Your resulting question_3_df should contain aggregated data, with columns Superpower Name, Marvel Comics (containing the count of occurrences in Marvel Comics), and DC Comics (containing the count of occurrences in DC Comics). Each row should represent a superpower.

In other words, question_3_df should look like this:

question 3 df

Don't worry if the rows or columns are in a different order, all that matters is that you have the right rows and columns with all the data.

Hint: refer to the documentation for .groupby and treat each publisher as a group.

# Your code here (create more cells as needed)

The code below checks that you have the correct dataframe structure:

# Run this cell without changes

# Checking that you made a dataframe called question_3_df
assert type(question_3_df) == pd.DataFrame

# Checking the shape
assert question_3_df.shape == (167, 3)

# Checking the column names
assert sorted(list(question_3_df.columns)) == ['DC Comics', 'Marvel Comics', 'Superpower Name']

Answering the Question

The code below uses the dataframe you created to find and plot the most common superpowers in Marvel Comics and DC Comics.

# Run this cell without changes

marvel_most_common = question_3_df.drop("DC Comics", axis=1)
marvel_most_common = marvel_most_common.sort_values(by="Marvel Comics", ascending=False)[:5]
marvel_most_common
# Run this cell without changes

dc_most_common = question_3_df.drop("Marvel Comics", axis=1)
dc_most_common = dc_most_common.sort_values(by="DC Comics", ascending=False)[:5]
dc_most_common
# Run this cell without changes

fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(15, 5))

ax1.bar(
    x=marvel_most_common["Superpower Name"],
    height=marvel_most_common["Marvel Comics"]
)
ax2.bar(
    x=dc_most_common["Superpower Name"],
    height=dc_most_common["DC Comics"]
)

ax1.set_ylabel("Count of Superheroes")
ax2.set_ylabel("Count of Superheroes")
ax1.set_title("Frequency of Top Superpowers in Marvel Comics")
ax2.set_title("Frequency of Top Superpowers in DC Comics");

It looks like super strength is the most popular power in both Marvel Comics and DC Comics. Overall, the top 5 powers are fairly similar — 4 out of 5 overlap, although Marvel contains agility whereas DC contains flight.

5. Formulate and Answer Your Own Question

For the remainder of this lab, you'll be focusing on coming up with and answering your own question, just like we did above. Your question should not be overly simple, and should require both descriptive statistics and data visualization to answer. In case you're unsure of what questions to ask, some sample questions have been provided below.

Pick one of the following questions to investigate and answer, or come up with one of your own!

  • Which powers have the highest chance of co-occurring in a hero (e.g. super strength and flight)?
  • What is the distribution of skin colors amongst alien heroes?
  • How are eye color and hair color related in this dataset?

Explain your question below:

# Replace None with appropriate text:
"""
None
"""

Some sample cells have been provided to give you room to work. Feel free to create more cells as needed.

Be sure to include thoughtful, well-labeled visualizations to back up your analysis!

(There is no solution branch for this part, and feel free to move on to the next lesson if you have already spent more than 90 minutes.)

Summary

In this lab, you demonstrated your mastery of using pandas to clean and aggregate data in order to answer several business questions. This included identifying and handling missing values, text requiring preprocessing, and invalid values. You also performed aggregation and reshaping tasks such as transposing, joining, and grouping data. Great job, there was a lot here!

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Jupyter Notebook 100.0%