- Checking dtypes of columns:
df.dtypes
# dtypes
df.info()
# dtypes and missing values
- Strings to integers
df['col'] = df['col'].str.strip('$')
# remove special characters, e.g. '$'
df['col'] = df['col'].astype('int')
# convert col to int
assert df['col'].dtype == 'int'
# confirming col is now integer; returns nothing if true, raises AssertionError otherwise
- Numerical variables to categorical, e.g. if a Marriage col has values 0 for unmarried, 1 for married, and 2 for divorced
df['col'] = df['col'].astype('category')
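A minimal sketch of the conversion (the marriage_status column and its 0/1/2 coding are hypothetical):
import pandas as pd
df = pd.DataFrame({'marriage_status': [0, 1, 2, 1]})
df['marriage_status'] = df['marriage_status'].astype('category')
# map the numeric codes to readable labels (hypothetical labels)
df['marriage_status'] = df['marriage_status'].cat.rename_categories({0: 'unmarried', 1: 'married', 2: 'divorced'})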
- Outliers in categorical data can be spotted by plotting a histogram: bars appear outside the expected range of categories
- Options to deal with out-of-range data:
- Drop the data, but only when a small proportion of your dataset is affected by out-of-range values
df = df[df['col'] <= 5]
# Drop values using filtering
df.drop(df[df['col'] > 5].index, inplace=True)
# Drop using .drop()
- Setting custom minimums or maximums for your columns
df.loc[df['col'] > 5, 'col'] = 5
# Set col > 5 to 5
- Treat data as missing, and impute
- Setting a custom value dependent on the business assumptions behind our data
- Converting object to date type
df['col'] = pd.to_datetime(df['col']).dt.date
# converting object to date
import datetime as dt
today_date = dt.date.today()
# today's date
- To find duplicate rows:
cols = ['col1', 'col2', 'col3']
# column names to check for duplication
duplicates = df.duplicated(subset=cols, keep='first')
# keep can be 'first'/'last'/False (False marks all duplicates as True)
- To drop duplicates:
df.drop_duplicates(inplace=True)
# without subset defined, drops complete duplicates; keep='first' is the default behaviour
- Exercise:
- To sort the dataframe:
df.sort_values(by=['col1', 'col2'], ascending=False)[['col1', 'col2']]
# sorts in descending order
- Categorical data represent variables that have a predefined, finite set of categories. Examples range from marriage status and household income categories to loan status, among others
- Since categorical data represent a predefined set of categories, they can't have values that go beyond these predefined categories
- To deal with inconsistent categories, we use two main types of joins:
- Anti joins take in two DataFrames A and B, and return data from one DataFrame that is not contained in the other
- Inner joins return only the data that is contained in both DataFrames (a merge-based sketch of both follows the set-based approach below)
- To deal with inconsistent categories:
inconsistent_categories = set(df1['col']).difference(df_categories['col'])
# get inconsistent categories
inconsistent_rows = df1['col'].isin(inconsistent_categories)
# boolean mask of inconsistent rows
inconsistent_data = df1[inconsistent_rows]
# get inconsistent data
consistent_data = df1[~inconsistent_rows]
# get consistent data
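The same split can also be done with pd.merge directly; a minimal sketch (the blood_type data is hypothetical):
import pandas as pd
df1 = pd.DataFrame({'blood_type': ['A+', 'B-', 'Z+']})
df_categories = pd.DataFrame({'blood_type': ['A+', 'A-', 'B+', 'B-', 'O+', 'O-']})
merged = df1.merge(df_categories, on='blood_type', how='left', indicator=True)
anti_join = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
# anti join: rows in df1 whose category is not in df_categories ('Z+')
inner_join = df1.merge(df_categories, on='blood_type', how='inner')
# inner join: rows whose category appears in both DataFrames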
- A common categorical data problem is having values that slightly differ because of capitalization
- To count unique values:
df['col'].value_counts()
# works with Series only
df.groupby('col').count()
# works with DataFrames
- To fix this issue, you can uppercase or lowercase the column:
df['col'] = df['col'].str.lower()
- Another common problem with categorical values is leading or trailing spaces. Fixed by:
df['col'] = df['col'].str.strip()
- To create named categories out of data, e.g. creating income groups from income data, we use pd.cut():
ranges = [0, 200000, 500000, np.inf]
# define ranges
group_names = ['0-200K', '200K-500K', '500K+']
# define labels
df['income_group'] = pd.cut(df['income'], bins=ranges, labels=group_names)
# create income group column
- To reduce categories to fewer ones, we create a mapping dictionary and replace as:
mapping_dict = {'old_cat1': 'new_cat1', 'old_cat2': 'new_cat1', ...}
df['col'] = df['col'].replace(mapping_dict)
- Converting column with multiple date formats to datetime object:
df['col'] = pd.to_datetime(df['col'], infer_datetime_format=True, errors='coerce')
# errors='coerce' returns NaT for rows whose conversion failed
df['year'] = df['col'].dt.strftime("%Y")
# extract year column (while col is still a datetime)
df['col'] = df['col'].dt.strftime("%d-%m-%Y")
# format datetime col as a string
- A common challenge when merging data from different sources is data integrity
- Cross field validation (CFD) is the use of multiple fields in your dataset to sanity check the integrity of your data
- Example of CFD in flights data:
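A minimal sketch of that check (the economy/business/first/total_passengers column names are hypothetical; class counts should sum to the total):
import pandas as pd
flights = pd.DataFrame({'economy': [100, 120], 'business': [30, 25], 'first': [10, 10], 'total_passengers': [140, 160]})
sum_classes = flights[['economy', 'business', 'first']].sum(axis=1)
inconsistent_flights = flights[sum_classes != flights['total_passengers']]
# rows failing the sanity check (here the second flight: 120 + 25 + 10 != 160)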
- Example of CFD in Age and Birthday data:
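A minimal sketch of that check (the users data is hypothetical; the reported age should match the age computed from the birthday):
import datetime as dt
import pandas as pd
users = pd.DataFrame({'birthday': pd.to_datetime(['1990-05-01', '2000-01-15']), 'age': [35, 20]})
today = dt.date.today()
calc_age = today.year - users['birthday'].dt.year
# simple year difference; ignores whether this year's birthday has passed yet
inconsistent_age = users[calc_age != users['age']]
# rows where reported age disagrees with the computed age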
- Get summary of missingness:
df.isna().sum()
- Missingness types:
- Missing Completely at Random (MCAR) - no systematic relationship between missing data and other values. Stems from data entry errors
- Missing at Random (MAR) - there is a systematic relationship between missing data and other observed values, e.g. CO2 data being missing for low temperatures
- Missing Not at Random (MNAR) - there is a systematic relationship between the missing data and unobserved values e.g. when it's really hot outside, the thermometer might stop working, so we don't have temperature measurements for days with high temperatures
- Dropping missing values:
df = df.dropna(subset = ['col'])
- Replacing with statistical measures:
col_mean = df['col'].mean()
df_imputed = df.fillna({'col': col_mean})
- Record linkage is a powerful technique used to merge multiple datasets together, used when values have typos or different spellings
- Commonly used to link records by calculating the similarity between strings, which can then be used to join two datasets into one clean master dataset
- Minimum edit distance is a systematic way to identify how close two strings are
- For example, consider the following two words: intention and execution
- The minimum edit distance between them is the least possible number of steps that could get us from the word intention to execution, with the available operations being:
- inserting new characters, deleting them, substituting them, and transposing consecutive characters
- The lower the edit distance, the closer two words are
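A minimal dynamic-programming sketch of the classic Levenshtein variant (insertions, deletions, and substitutions only; transpositions are left out):
def levenshtein(a, b):
    # dp[i][j] = minimum number of edits to turn a[:i] into b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i  # delete all of a[:i]
    for j in range(len(b) + 1):
        dp[0][j] = j  # insert all of b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(a)][len(b)]

print(levenshtein('intention', 'execution'))
# 5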
- To perform simple string comparison:
from thefuzz import fuzz
fuzz.WRatio('Reeding', 'Reading')
# compare two words; returns a similarity score from 0 to 100
- To compare a string with an array of strings:
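A sketch using thefuzz's process module (the categories Series here is hypothetical):
from thefuzz import process
import pandas as pd
categories = pd.Series(['Reading', 'Rowing', 'Running'])
# hypothetical Series of category strings
process.extract('Reeding', categories, limit=len(categories))
# returns a list of (match, score, index) tuples, sorted by score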
- To collapse many categories into fewer ones, the .replace() method may not be feasible; instead, we can use string similarity:
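A minimal sketch (the categories/survey data and the score cutoff of 80 are hypothetical):
import pandas as pd
from thefuzz import process
categories = pd.Series(['california', 'new york'])
# the correct category names
survey = pd.DataFrame({'state': ['calefornia', 'california', 'new yerk', 'newyork']})
# messy free-text entries with typos
for state in categories:
    matches = process.extract(state, survey['state'], limit=len(survey))
    # matches is a list of (string, score, index) tuples
    for match in matches:
        if match[1] >= 80:
            # overwrite close matches with the correct category name
            survey.loc[survey['state'] == match[0], 'state'] = state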